CN112801130B - Image clustering quality evaluation method, system, medium, and apparatus

Info

Publication number: CN112801130B
Application number: CN202011488380.XA
Authority: CN
Inventors: 凌英剑; 田国栋
Original assignee: Guangzhou Yuncong Dingwang Technology Co ltd
Current assignee: Guangzhou Yuncong Dingwang Technology Co ltd
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2022-02-08
Anticipated expiration: 2040-12-16
Also published as: CN112801130A

Abstract

The invention belongs to the technical field of image processing, and in particular relates to a method, a system, a medium and a device for evaluating image cluster quality. It aims to solve the technical problem of how to evaluate the quality of clusters in a uniform way so that errors are corrected automatically, reducing the low-quality clusters produced by wrong clustering and thereby improving image processing performance. To this end, for the clusters to be evaluated in incremental clustering, the invention traverses their nodes, searches for the neighbor nodes of each traversed node by vector similarity, calculates the neighbor coverage rate of each node to obtain the cluster neighbor coverage rate, and eliminates the clusters whose coverage rate is lower than a threshold value. New or changed clusters of poor quality are thus eliminated by means of the cluster neighbor coverage rate, and defects such as low image processing efficiency and wasted resources are avoided.

Description

Image clustering quality evaluation method, system, medium, and apparatus

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a method, a system, a medium and a device for evaluating image cluster quality.

Background

In environments and scenes where image processing is required, such as personnel management systems and video monitoring systems, target image samples often have to be retrieved from massive image data samples stored in a database. For example, in face monitoring and recognition, the number of faces acquired by a video monitoring system generally increases exponentially, forming a massive face library. For large-scale image targets such as faces, comparing a query one by one against the feature library of the image data samples has high computational complexity. The unlabeled samples to be identified are therefore clustered to reduce the number of samples that are retrieved and compared and to make the retrieval more targeted, thereby reducing algorithm complexity and shortening retrieval time.

Clustering refers to grouping unlabeled data so that, as far as possible, data of the same category fall into the same class and data of different categories fall into different classes. Incremental clustering refers to clustering data in batches, so that data within the same batch and across different batches can be clustered together. Incremental clustering is widely applied in scenes such as security monitoring and auxiliary labeling.

However, under the influence of environmental factors, the cluster quality obtained by clustering is often poor due to complex data sample distribution, and a large number of data samples of different types are often grouped together by mistake due to factors such as environment/scene (for example, different faces wearing sunglasses or hats are grouped in the same cluster).

Accordingly, there is a need in the art to evaluate the quality of clusters in the existing image processing process and correct errors accordingly, thereby improving the quality of clusters and improving the performance of overall image processing.

Disclosure of Invention

To overcome the above drawbacks, the present invention is proposed to solve, or at least partially solve, the technical problem of how to evaluate the quality of clusters in a uniform way so as to realize automatic error correction, reduce the low-quality clusters produced by wrong clustering, and thereby improve image processing performance. To solve this technical problem, the invention provides an automatic error correction method, system, medium and device based on cluster quality evaluation.

In a first aspect, the present invention provides a method for evaluating quality of image cluster, including: clustering the image data to obtain clusters to be evaluated, and traversing nodes of each cluster in all the clusters to be evaluated to obtain all the nodes of all the clusters to be evaluated; obtaining the neighbor node of each node in all the nodes by a vector similarity retrieval algorithm; aiming at each cluster to be evaluated, obtaining cluster neighbor coverage rate of each cluster to be evaluated according to each node in each cluster to be evaluated and the corresponding neighbor node thereof; and comparing the cluster neighbor coverage rate of each cluster to be evaluated with a threshold value, and determining the quality of each cluster to be evaluated.

The obtaining of the neighbor node of each node in all the nodes through the vector similarity retrieval algorithm specifically includes: taking the feature vector of each node in all the nodes as a query vector, and performing vector similarity retrieval in preset bottom library vectors to obtain k bottom library vectors most similar to the feature vector of each node; taking the nodes corresponding to the k bottom base vectors as neighbor nodes of each node, and establishing a set I by using all the neighbor nodes corresponding to all the nodes; and the k most similar bottom library vectors are the k bottom library vectors with the largest similarity values obtained by vector similarity retrieval.
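As a minimal sketch of this retrieval step, assuming brute-force retrieval with the vector inner product as the similarity (both are among the options the method lists), the k most similar bottom library vectors could be found as follows. The names `top_k_neighbors` and `base_library`, the node IDs and the 2-D vectors are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity score: a larger value means more similar."""
    return sum(x * y for x, y in zip(a, b))


def top_k_neighbors(
    query: Sequence[float],
    base_library: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (node_id, score) pairs with the largest similarity."""
    scored = (
        (node_id, inner_product(query, vec))
        for node_id, vec in base_library.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical bottom library of 2-D feature vectors keyed by node ID.
base_library = {
    "a": (1.0, 0.0),
    "b": (0.9, 0.1),
    "c": (0.0, 1.0),
    "d": (0.7, 0.3),
}
neighbors = top_k_neighbors(base_library["a"], base_library, k=2)
neighbor_ids = [node_id for node_id, _ in neighbors]
```

Because the query vector is itself part of the library in this sketch, node `a` is returned as its own most similar neighbor.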

The method for obtaining cluster neighbor coverage of each cluster to be evaluated according to each node and corresponding neighbor node in each cluster to be evaluated for each cluster to be evaluated specifically comprises the following steps: traversing all the clusters to be evaluated to select a current cluster to be evaluated; traversing all nodes in the cluster to be evaluated currently to select any current node; acquiring a neighbor node corresponding to the current node, and calculating the neighbor coverage rate of the current node; and averaging the calculated neighbor coverage rates of all nodes in the current cluster to be evaluated to obtain the cluster neighbor coverage rate of the current cluster to be evaluated.

The calculating of the neighbor coverage of the current node specifically includes: calculating the intersection of the neighbor node of the current node and all nodes in the current to-be-evaluated cluster where the current node is located, and recording the size of the intersection as n; taking the minimum value of the number of neighbor nodes corresponding to the current node and the number of all nodes in the current cluster to be evaluated where the current node is located, and recording the minimum value as m; calculating the neighbor coverage rate of the current node as n/m; wherein if m is 0, the neighbor coverage of the current node is 0.
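The n/m rule above can be sketched in a few lines. This is an illustrative sketch only: the function name `node_neighbor_coverage` and the node IDs are hypothetical, and nodes are assumed to be identified by hashable IDs.

```python
from typing import Iterable, Set


def node_neighbor_coverage(neighbors: Iterable[str], cluster: Set[str]) -> float:
    """Neighbor coverage rate n/m of one node; 0 when m is 0."""
    neighbor_set = set(neighbors)
    n = len(neighbor_set & cluster)           # size of the intersection
    m = min(len(neighbor_set), len(cluster))  # smaller of the two counts
    return n / m if m else 0.0


# Hypothetical cluster of 3 nodes and a node with k = 4 neighbors.
cluster = {"p", "q", "r"}
coverage = node_neighbor_coverage(["p", "q", "x", "y"], cluster)
empty = node_neighbor_coverage([], cluster)
```

Dividing by the minimum m rather than by k keeps the rate within [0, 1] even when the cluster holds fewer than k nodes.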

The clustering of the image data to obtain a cluster to be evaluated comprises the following steps: performing incremental clustering on newly added image data in the image data to obtain a new cluster; and taking the new cluster and the old cluster which is changed due to the insertion of the newly added image data as all the clusters to be evaluated, and establishing a set C as follows:

$$C = \{c_1, c_2, \ldots, c_i, \ldots\}$$

wherein i is a natural number greater than or equal to 1 that serves as an index, and $c_i$ is the i-th cluster to be evaluated; wherein the set N of all nodes of all the clusters to be evaluated is also established as follows:

$$N = \{n_1, n_2, \ldots, n_i, \ldots\}$$

wherein $n_i$ is the i-th node, which belongs to a cluster c in the set C;

wherein the neighbor node set I of the node set N, obtained by the vector similarity retrieval, and the corresponding similarity value set D are respectively:

$$I = \{I_1, I_2, \ldots, I_i, \ldots\}, \qquad I_i = \{I_{i,1}, I_{i,2}, \ldots, I_{i,k}\}$$

$$D = \{D_1, D_2, \ldots, D_i, \ldots\}, \qquad D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,k}\}$$

wherein $I_{i,j}$ represents the j-th neighbor node of the i-th node in the set N, $j \in [1, k]$, and k is a natural number greater than or equal to 1; and wherein $d_{i,j}$ represents the similarity between the i-th node in the set N and its j-th neighbor node;

wherein the cluster currently to be evaluated is denoted c:

$$c = \{n^c_1, n^c_2, \ldots, n^c_i, \ldots\}$$

wherein $n^c_i$ is the i-th current node selected while traversing the current cluster c to be evaluated;

the neighbor node set of the i-th current node is acquired from the neighbor node set I and denoted $I_i$, the neighbor coverage rate r of the i-th current node is calculated, and the average of all neighbor coverage rates r in the cluster c is taken as the cluster neighbor coverage rate of the cluster c, wherein r is calculated as:

$$r = \frac{|c \cap I_i|}{\min(|c|, |I_i|)}$$

wherein $|c \cap I_i|$ represents the size n of the intersection between the neighbor node set $I_i$ of the i-th current node and all nodes in the cluster c, and $\min(|c|, |I_i|)$ represents the minimum value m of the number of nodes in the neighbor node set $I_i$ and the number of all nodes in the cluster c.
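The averaging of r over a cluster can be sketched as below. This is a sketch under stated assumptions, not the patented implementation: the function name `cluster_neighbor_coverage`, the dictionary layout of the neighbor sets and the node IDs `n1` to `n6` are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Set


def cluster_neighbor_coverage(
    cluster: Set[str],
    neighbor_sets: Dict[str, List[str]],
) -> float:
    """Average of r = |c ∩ I_i| / min(|c|, |I_i|) over all nodes of c."""
    rates = []
    for node in cluster:
        i_set = set(neighbor_sets.get(node, []))
        m = min(len(cluster), len(i_set))
        rates.append(len(cluster & i_set) / m if m else 0.0)
    return mean(rates) if rates else 0.0


# Hypothetical cluster with 3 nodes, each with k = 4 retrieved neighbors.
cluster = {"n4", "n5", "n6"}
neighbor_sets = {
    "n4": ["n4", "n5", "n1", "n2"],  # 2 of 4 neighbors lie in the cluster
    "n5": ["n5", "n4", "n1", "n3"],  # 2 of 4
    "n6": ["n6", "n1", "n2", "n3"],  # 1 of 4
}
rate = cluster_neighbor_coverage(cluster, neighbor_sets)
```

With these assumed neighbor sets the per-node rates are 2/3, 2/3 and 1/3, so the cluster rate is their mean, 5/9.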

Comparing the cluster neighbor coverage rate of each cluster to be evaluated with a threshold value, and determining the quality of each cluster to be evaluated, wherein the method specifically comprises the following steps: if the cluster neighbor coverage rate of each cluster to be evaluated is smaller than the threshold value, the quality of each cluster to be evaluated is unqualified; and adding all the nodes in the clusters to be evaluated with unqualified quality into the image data to be clustered next time or discarding the nodes.
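The threshold comparison and the handling of unqualified clusters might look like the following sketch. The function name `filter_clusters`, the cluster IDs, the coverage values and the threshold of 0.5 are hypothetical. Of the two options in the text, the sketch returns nodes for re-clustering; discarding them would simply ignore the second return value.

```python
from typing import Dict, List, Set, Tuple


def filter_clusters(
    coverage_by_cluster: Dict[str, float],
    members: Dict[str, Set[str]],
    threshold: float,
) -> Tuple[List[str], Set[str]]:
    """Split clusters into kept IDs and nodes returned for re-clustering."""
    kept: List[str] = []
    requeued: Set[str] = set()
    for cluster_id, rate in coverage_by_cluster.items():
        if rate < threshold:
            requeued |= members[cluster_id]  # unqualified: dissolve the cluster
        else:
            kept.append(cluster_id)
    return kept, requeued


# Hypothetical coverage rates and cluster membership.
coverage_by_cluster = {"c1": 0.9, "c2": 0.4}
members = {"c1": {"a", "b"}, "c2": {"x", "y", "z"}}
kept, requeued = filter_clusters(coverage_by_cluster, members, threshold=0.5)
```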

The incremental clustering adopts any one of the K-means, DBSCAN and hierarchical clustering algorithms; and/or the vector similarity retrieval algorithm is any one of brute-force retrieval, hash retrieval, IVFFlat, IVFPQ and HNSW; and/or the vector similarity is calculated by any one of vector inner product calculation, L1 distance calculation and L2 distance calculation.

In a second aspect, the present invention provides an image cluster quality evaluation system, including: the node acquisition unit is used for traversing nodes of each selected cluster to be evaluated so as to obtain all nodes of all the clusters to be evaluated; the retrieval unit is used for obtaining the neighbor nodes of each node in all the nodes through a vector similarity retrieval algorithm; a coverage rate obtaining unit, configured to obtain, for each cluster to be evaluated, a cluster neighbor coverage rate of each cluster to be evaluated according to each node in each cluster to be evaluated and a corresponding neighbor node thereof; and the evaluation unit is used for comparing the cluster neighbor coverage rate of each cluster to be evaluated with a threshold value and determining the quality of each cluster to be evaluated.

Wherein the retrieval unit is specifically configured to: taking the feature vector of each node in all the nodes as a query vector, and performing vector similarity retrieval in preset bottom library vectors to obtain k bottom library vectors most similar to the feature vector of each node; the nodes corresponding to the k bottom base vectors are used as neighbor nodes of each node, and all neighbor nodes corresponding to all the nodes are used for establishing a set I; and the k most similar bottom library vectors are the k bottom library vectors with the largest similarity values obtained by vector similarity retrieval.

Wherein, the coverage rate obtaining unit is specifically configured to: traversing all the clusters to be evaluated to select a current cluster to be evaluated; traversing all nodes in the cluster to be evaluated currently to select any current node; acquiring a neighbor node corresponding to the current node, and calculating the neighbor coverage rate of the current node; and averaging the calculated neighbor coverage rates of all nodes in the current cluster to be evaluated to obtain the cluster neighbor coverage rate of the current cluster to be evaluated.

The step of calculating the neighbor coverage of the current node by the coverage obtaining unit specifically includes: calculating the intersection of the neighbor node of the current node and all nodes in the current to-be-evaluated cluster where the current node is located, and recording the size of the intersection as n; taking the minimum value of the number of neighbor nodes corresponding to the current node and the number of all nodes in the current cluster to be evaluated where the current node is located, and recording the minimum value as m; calculating the neighbor coverage rate of the current node as n/m; wherein if m is 0, the neighbor coverage of the current node is 0.

Wherein, still include: the clustering unit is used for carrying out incremental clustering on the newly added image data in the image data to obtain a new cluster; a selecting unit, configured to take the new cluster and the old cluster that has changed due to insertion of the new image data as all the clusters to be evaluated, and establish a set C as:

$$C = \{c_1, c_2, \ldots, c_i, \ldots\}$$

wherein i is a natural number greater than or equal to 1 that serves as an index, and $c_i$ is the i-th cluster to be evaluated;

the node acquisition unit is further specifically configured to: establish the set N of all nodes of all the clusters to be evaluated as follows:

$$N = \{n_1, n_2, \ldots, n_i, \ldots\}$$

wherein $n_i$ is the i-th node, which belongs to a cluster c in the set C;

the retrieval unit is further specifically configured to: obtain, through the vector similarity retrieval, the neighbor node set I of the node set N and the corresponding similarity value set D, which are respectively:

$$I = \{I_1, I_2, \ldots, I_i, \ldots\}, \qquad I_i = \{I_{i,1}, I_{i,2}, \ldots, I_{i,k}\}$$

$$D = \{D_1, D_2, \ldots, D_i, \ldots\}, \qquad D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,k}\}$$

wherein $I_{i,j}$ represents the j-th neighbor node of the i-th node in the set N, $j \in [1, k]$, and k is a natural number greater than or equal to 1; and wherein $d_{i,j}$ represents the similarity between the i-th node in the set N and its j-th neighbor node;

wherein each cluster to be evaluated is denoted c:

$$c = \{n^c_1, n^c_2, \ldots, n^c_i, \ldots\}$$

wherein $n^c_i$ represents the i-th current node selected while traversing each cluster c to be evaluated;

the coverage rate obtaining unit is further specifically configured to: obtain the neighbor node set of the i-th current node from the neighbor node set I and record it as $I_i$, calculate the neighbor coverage rate r of the i-th current node, and take the average of all neighbor coverage rates r in the cluster c as the cluster neighbor coverage rate of the cluster c, wherein r is calculated by the formula:

$$r = \frac{|c \cap I_i|}{\min(|c|, |I_i|)}$$

Wherein the evaluation unit is specifically configured to: if the cluster neighbor coverage rate of each cluster to be evaluated is smaller than the threshold value, the quality of the current cluster to be evaluated is unqualified; and adding all the nodes in the clusters to be evaluated with unqualified quality into the image data to be clustered next time or discarding the nodes.

In a third aspect, the present invention provides a computer readable storage medium, having a plurality of program codes stored therein, wherein when the plurality of program codes are loaded and executed by a processor, the method for evaluating the quality of image cluster clusters according to any one of the first aspect is implemented.

In a fourth aspect, the present invention provides a processing apparatus comprising a processor and a storage device, the storage device being adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and executed by the processor to perform the image cluster quality assessment method according to any of the preceding first aspects.

One or more technical schemes of the invention at least have one or more of the following beneficial effects:

The input samples are clustered by an incremental clustering algorithm, and the new clusters and the changed old clusters are collected. The neighbor nodes of the nodes in these clusters are retrieved by a vector similarity retrieval algorithm and intersected with the traversed nodes of each cluster, the neighbor coverage rate of each node is calculated, and the average over a cluster's nodes is taken as the average neighbor coverage rate of that cluster. Clusters whose average neighbor coverage rate is lower than a preset threshold are regarded as low-quality and eliminated. The quality of the clusters is thus evaluated automatically and corrected in time, and low-quality clusters generated by wrong clustering are reduced and eliminated. This avoids the poor clustering effect, low quality and confused categories that result when image data samples are disturbed by factors such as the environment, and effectively overcomes the poor clustering and poor robustness of the prior art; the clustering effect is optimized and the robustness of the algorithm is improved. Furthermore, the overall performance of subsequent image processing is improved, and the waste of processing resources caused by low-quality clusters is avoided.

Drawings

Embodiments of the invention are described below with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart illustrating the main steps of an embodiment of an image cluster quality evaluation method according to the present invention;

FIG. 2 is a block diagram of the main modules of an embodiment of the image cluster quality evaluation system according to the present invention;

FIG. 3 and FIG. 4 are schematic diagrams of the main structure of an embodiment in which the scheme of the invention is applied to a terminal device.

Detailed Description

Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.

In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports and memory, may comprise software components such as program code, or may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality and may be implemented in software, hardware, or a combination thereof. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random-access memory, and the like. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include only A, only B, or both A and B. The singular forms "a", "an" and "the" may include the plural forms as well. The above alternative and preferred embodiments may also be combined with one another, so that a new embodiment is formed that suits a more specific application scenario.

In one embodiment of the image cluster quality evaluation method, automatic error correction (eliminating clusters of unqualified quality) is performed by evaluating the quality of the image clusters. Specifically, the feature vectors of the input samples are clustered by an incremental clustering algorithm, all new clusters and changed old clusters are collected, and a k-nearest-neighbor retrieval is performed for the nodes in each cluster using a vector similarity retrieval algorithm. Each node in each cluster is then traversed: for a given node, the intersection of its k neighbor nodes with the nodes of the cluster to which it belongs is calculated and its size recorded as n; the minimum of the number of its k neighbor nodes and the number of nodes in its cluster is recorded as m; and the neighbor coverage rate of the node is calculated as n/m, where the coverage rate is 0 if m is 0. The coverage rate of each cluster, for example the average neighbor coverage rate of all its nodes, is then calculated and compared with a predetermined threshold to eliminate low-quality clusters whose average coverage rate is too low. Unlike other existing evaluation approaches, the embodiment of the invention evaluates cluster quality with statistics based on the k nearest neighbors, works on the nodes of the new clusters and the changed old clusters, and is combined with the clustering algorithm to optimize its clustering result and reduce algorithm complexity. The clusters that remain after elimination meet the quality requirement, which helps keep the clusters used in subsequent image processing of high quality and improves overall processing performance.

The meaning or effect of some terms involved in the present invention is explained first:

k-nearest neighbors: given a feature vector to be retrieved, the k vectors with the highest similarity to it are retrieved from the other base library feature vectors.

Cluster: the set formed by all the nodes belonging to the same category obtained by the clustering algorithm is called a cluster.

Node: an individual item of image data (a sample) in a cluster.

Clustering algorithms, for example: K-means, DBSCAN, hierarchical clustering, etc.

Vector similarity retrieval algorithms, for example: brute-force retrieval, hash retrieval, IVFFlat, IVFPQ, HNSW, etc.

Vector similarity calculation, for example: vector inner product calculation, L1 distance calculation, L2 distance calculation, etc.
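The three similarity calculations named above can be sketched in plain Python. The function names and the sample vectors are hypothetical. Note that the inner product grows with similarity while the L1 and L2 distances shrink, so ranking by "largest similarity" with a distance requires inverting the order (for example by negating the distance), which is an assumption of this sketch rather than something the text specifies.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Vector inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance: smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 (Euclidean) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 2-D feature vectors.
a = (1.0, 2.0)
b = (4.0, 6.0)
ip = inner_product(a, b)
d1 = l1_distance(a, b)
d2 = l2_distance(a, b)
```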

[ example 1 ]

In an embodiment of the method for evaluating the quality of image cluster of the present invention, as shown in fig. 1, the method includes:

Step S110, obtaining the clusters to be evaluated.

In a specific embodiment, incremental clustering is performed on given image data, then all clusters obtained through the incremental clustering are counted, and then a newly added cluster and an old cluster which is changed by inserting new data are selected from all the counted clusters as all the clusters to be evaluated. All the clusters to be evaluated can be used to establish a cluster set C. For example, it can be written as:

$$C = \{c_1, c_2, \ldots, c_i, \ldots\}$$

wherein i is a natural number greater than or equal to 1 that serves as an index, and $c_i$ is the i-th cluster to be evaluated.

Incremental clustering includes but is not limited to various clustering modes such as K-means, DBSCAN, hierarchical clustering and the like.
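Selecting the new clusters and the changed old clusters can be sketched as a comparison of cluster membership before and after a batch is inserted. This sketch assumes that cluster IDs stay stable across batches, which the text does not state. The function name `clusters_to_evaluate` and the sample data are hypothetical.

```python
from typing import Dict, Set


def clusters_to_evaluate(
    old: Dict[str, Set[str]],
    new: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return clusters that are newly added or whose membership changed."""
    return {
        cluster_id: nodes
        for cluster_id, nodes in new.items()
        if cluster_id not in old or old[cluster_id] != nodes
    }


# Hypothetical state before and after one incremental batch.
old = {"c1": {"a", "b"}, "c2": {"c"}}
new = {"c1": {"a", "b"}, "c2": {"c", "d"}, "c3": {"e"}}
C = clusters_to_evaluate(old, new)
```

Here `c1` is unchanged and is skipped, `c2` gained a node, and `c3` is new, so only `c2` and `c3` enter the set C.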

Step S120, traversing each cluster in all the clusters to be evaluated to obtain all nodes of all the clusters to be evaluated.

In an embodiment, the clusters may be traversed in sequence to obtain all nodes in each cluster, so that the nodes of all the clusters to be evaluated are collected one by one and recorded in order. Further, the nodes obtained in this order can be established as the set of all nodes, recorded as N. For example:

$$N = \{n_1, n_2, \ldots, n_i, \ldots\}$$

wherein $n_i$ is the i-th node, which belongs to a cluster c in the set C.
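A minimal sketch of this traversal, assuming each cluster is stored as an ordered list of node IDs (the function name `collect_nodes` and the data are hypothetical), builds the ordered node list N together with a lookup from each node to its cluster:

```python
from typing import Dict, List, Tuple


def collect_nodes(clusters: Dict[str, List[str]]) -> Tuple[List[str], Dict[str, str]]:
    """Traverse clusters in order; return node list N and a node -> cluster map."""
    nodes: List[str] = []
    owner: Dict[str, str] = {}
    for cluster_id, members in clusters.items():
        for node in members:
            nodes.append(node)
            owner[node] = cluster_id
    return nodes, owner


# Hypothetical clusters to be evaluated, each with 3 nodes.
clusters = {"c1": ["a", "b", "c"], "c2": ["d", "e", "f"]}
N, owner = collect_nodes(clusters)
position_of_d = N.index("d") + 1  # 1-based position in N
```

Since the first cluster contributes three nodes, the first node of the second cluster lands at position 4 of N.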

Step S130, obtaining the neighbor nodes of each node in all the nodes through a vector similarity retrieval algorithm.

In one embodiment, the feature vector of each node in all the nodes is used as a query vector, and vector similarity retrieval is performed in preset bottom library vectors to obtain k bottom library vectors most similar to the feature vector of each node; taking the nodes corresponding to the k bottom base vectors as neighbor nodes of each node, and establishing a set I by using all the neighbor nodes corresponding to all the nodes; and the k most similar bottom library vectors are the k bottom library vectors with the largest similarity values obtained by vector similarity retrieval.

As an example, the bottom library vectors may be constructed from all nodes. As another example, the bottom library vectors may comprise the vectors corresponding to the nodes of all changed old clusters and new clusters, or the vectors corresponding to the nodes of all old clusters and new clusters, and so on, and can be set as needed. Further, the bottom library vectors need to contain the query vector itself (that is, the node being queried also appears among its own neighbor nodes).

Further, vector similarity retrieval includes, but is not limited to, brute-force retrieval, hash retrieval, IVFFlat, IVFPQ, HNSW, etc.; vector similarity calculation (calculating a score of how similar two vectors are) includes, but is not limited to, vector inner product calculation, L1 distance calculation, L2 distance calculation, etc.

Furthermore, through the vector similarity retrieval, similarity scores between the feature vectors of each node and the bottom base vectors are obtained, after the scores are ranked, k nodes corresponding to k bottom base vectors with the largest scores, namely the k bottom base vectors ranked at the top, are taken as neighbor nodes of each node.

Then, for each node $n_i$ in the set $N = \{n_1, n_2, \ldots, n_i, \ldots\}$ of all nodes in all clusters to be evaluated, the corresponding k neighbor nodes are used to establish a neighbor node set I corresponding to the set N, recorded as:

$$I = \{I_1, I_2, \ldots, I_i, \ldots\}, \qquad I_i = \{I_{i,1}, I_{i,2}, \ldots, I_{i,k}\}$$

wherein $I_{i,j}$ represents the j-th neighbor node of the i-th node in the set N, $j \in [1, k]$, and k is a natural number greater than or equal to 1. Moreover, each neighbor node $I_{i,j}$ in the set I has a corresponding similarity score, so the set I has a corresponding similarity score set D:

$$D = \{D_1, D_2, \ldots, D_i, \ldots\}, \qquad D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,k}\}$$

wherein $d_{i,j}$ represents the similarity between the i-th node in the set N and its j-th neighbor node.

Step S140, obtaining the cluster neighbor coverage rate of the current cluster to be evaluated according to each node in the current cluster to be evaluated and its corresponding neighbor nodes.

In one embodiment, all of the clusters to be evaluated are traversed to select a current cluster to be evaluated. Through the traversal, each cluster can be evaluated.

In one embodiment, all nodes in the cluster currently to be evaluated are traversed to select any current node. Similarly, by traversal, the neighbor coverage rate can be calculated for every node in the cluster, and the neighbor coverage rates of all nodes are then averaged to obtain the average neighbor coverage rate.

In one embodiment, a neighbor node corresponding to the current node is obtained, and the neighbor coverage rate of the current node is calculated. Specifically, the neighbor nodes corresponding to the current node may be obtained from the set I. For example, suppose the current cluster is $c_2$ and its 1st node is selected by traversal. Assuming $c_1$ has 3 nodes, placed in the set N in order, the 1st node of $c_2$ is the 4th node in the set N, i.e. $n_4$. Correspondingly, the neighbor nodes of this node in the neighbor node set I are $I_{4,1}, I_{4,2}, \ldots, I_{4,k}$; letting k be 4, they are $I_{4,1}, I_{4,2}, I_{4,3}, I_{4,4}$.

Calculating the neighbor coverage of the current node, specifically for example:

calculating the intersection of the neighbor node of the current node and all nodes in the current to-be-evaluated cluster where the current node is located, and recording the size of the intersection as n; taking the minimum value of the number of neighbor nodes corresponding to the current node and the number of all nodes in the current cluster to be evaluated where the current node is located, and recording the minimum value as m; calculating the neighbor coverage rate of the current node as n/m; and if m is 0, the neighbor coverage rate of the current node is 0.

Further, the cluster currently to be evaluated is denoted c:

$$c = \{n^c_1, n^c_2, \ldots, n^c_i, \ldots\}$$

wherein $n^c_i$ is the i-th current node selected while traversing the current cluster c to be evaluated. Further, the neighbor node set of the i-th current node is obtained from the neighbor node set I and recorded as $I_i$, the neighbor coverage rate r of the i-th current node is calculated, and the average of all neighbor coverage rates r in the cluster c is taken as the cluster neighbor coverage rate of the cluster c, wherein r is calculated by the formula:

$$r = \frac{|c \cap I_i|}{\min(|c|, |I_i|)}$$

wherein $|c \cap I_i|$ represents the size n of the intersection between the neighbor node set $I_i$ of the i-th current node and all nodes in the cluster c, and $\min(|c|, |I_i|)$ represents the minimum value m of the number of nodes in the neighbor node set $I_i$ and the number of all nodes in the cluster c.

In the above example, when c is $c_2$ and its 1st node is selected, this node is the 4th node in the set N, and its 4 corresponding neighbor nodes in the set I constitute the neighbor node set $I_i$ of the i-th node in $c_2$; for the 1st node of $c_2$, this set is $I_1$.

Calculating the neighbor coverage rate of the 1st node:

$c_2$ has 3 nodes, and two of the 4 nodes of $I_1$ are in $c_2$, so $|c \cap I_i| = n = 2$; the minimum function then selects the smaller of the node counts of c and $I_i$, i.e. of $c_2$ and $I_1$, giving $\min(|c|, |I_i|) = m = 3$; hence $r = n/m = 2/3$.

Step S150, the cluster neighbor coverage rate of each cluster to be evaluated is compared with a threshold value, and the quality of each cluster to be evaluated is determined.

In one embodiment, if the cluster neighbor coverage of the current cluster to be evaluated is smaller than the threshold, the quality of the current cluster to be evaluated is not qualified; and adding all the nodes in the current cluster to be evaluated with unqualified quality into the image data sample needing clustering next time.

Thus, each cluster in all the clusters to be evaluated is traversed, each cluster is evaluated, the clusters with unqualified quality are eliminated, the clusters are scattered, and all nodes in the clusters are added into the data to be clustered next time.

Further, in the foregoing example, the cluster c2 has 3 nodes. Assume the neighbor coverage rates are 2/3 for the 1st node, 2/3 for the 2nd node and 1/3 for the 3rd node. The neighbor coverage rates of all nodes are summed and averaged: (2/3 + 2/3 + 1/3)/3 ≈ 0.556. Assuming the threshold is 0.5, the cluster c2 is of qualified quality; it is not eliminated and is retained as a new cluster. Assuming the threshold is 0.6, the quality of the cluster c2 is unqualified; it is eliminated, and its 3 nodes are added to the data to be clustered next time.
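The arithmetic of this example can be written out as a worked equation. The symbol $\bar{r}_{c_2}$ for the cluster neighbor coverage rate of c2 is introduced here only for illustration and does not appear in the original text:

$$\bar{r}_{c_2} = \frac{1}{3}\left(\frac{2}{3} + \frac{2}{3} + \frac{1}{3}\right) = \frac{5}{9} \approx 0.556, \qquad 0.5 \le 0.556 < 0.6$$

The cluster therefore passes a threshold of 0.5 but fails a threshold of 0.6.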

It should be noted that each image data sample contains a large number of feature vectors. If the quality of the clusters is disregarded after incremental clustering, subsequent image processing is burdened with a large amount of repetitive, tedious and ineffective work, which degrades image processing performance and wastes computing resources.

It should be noted that although the detailed steps of the method of the present invention have been described in detail, those skilled in the art can combine, separate and change the order of the above steps without departing from the basic principle of the present invention, and the modified technical solution does not change the basic concept of the present invention and thus falls into the protection scope of the present invention.

[ example 2 ]

In an embodiment of the image cluster quality evaluation system of the present invention, as shown in fig. 2, the system includes:

The cluster selecting unit 210 is configured to acquire the clusters to be evaluated.

A clustering unit 2101 is configured to collect all clusters obtained by performing incremental clustering on given image data; a selecting unit 2202 is configured to select, from all the clusters, the newly added clusters and the old clusters changed by the insertion of new data as the clusters to be evaluated, and to establish a set C:

$$C = \{c_1, c_2, \ldots, c_i, \ldots\}$$
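The selection of the set C can be sketched as follows. This is a minimal illustration under the assumption that cluster membership before and after one round of incremental clustering is available as two mappings from a cluster id to a set of node ids; the function name and the data layout are hypothetical.

```python
def select_clusters_to_evaluate(old_clusters, new_clusters):
    """Return the set C: newly added clusters plus old clusters whose members changed.

    Both arguments map a cluster id to the set of node ids in that cluster,
    before and after incremental clustering (assumed layout).
    """
    to_evaluate = {}
    for cluster_id, nodes in new_clusters.items():
        previous = old_clusters.get(cluster_id)
        # A cluster is evaluated if it did not exist before or if its members changed.
        if previous is None or previous != nodes:
            to_evaluate[cluster_id] = nodes
    return to_evaluate


before = {"c1": {1, 2, 3}, "c3": {7, 8}}
after = {"c1": {1, 2, 3}, "c2": {4, 5, 6}, "c3": {7, 8, 9}}
C = select_clusters_to_evaluate(before, after)
```

Here c2 is newly added and c3 gained a node, so both enter C, while the unchanged c1 is skipped.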

The node obtaining unit 220 is configured to perform node traversal on each of the selected clusters to be evaluated, so as to obtain all nodes of all the clusters to be evaluated. These nodes form a set N, whose i-th element is the i-th node belonging to the set C.

The retrieval unit 230 is configured to obtain the neighbor nodes of each of these nodes through a vector similarity retrieval algorithm.

The neighbor nodes of all nodes in all the clusters to be evaluated are collected into a set I. Each element $I_i$ of the set I is the set of neighbor nodes of the i-th node in the set N; its j-th member is the j-th neighbor node of that node, where j ∈ [1, k] and k is a natural number greater than or equal to 1. [The condition stated for each neighbor node in the set I is not recoverable from the source text.]
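A minimal sketch of neighbor retrieval is given below, assuming brute-force retrieval with inner-product similarity, both of which the claims list among the admissible options. The data layout (a mapping from node id to feature vector), the function names, and the exclusion of the query node from its own neighbor list are assumptions of this sketch.

```python
import heapq


def inner_product(u, v):
    """Inner-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def k_nearest_neighbors(query_id, vectors, k):
    """Brute-force retrieval of the ids of the k vectors most similar to the query.

    `vectors` maps node id -> feature vector. Excluding the query node itself
    is an assumption of this sketch.
    """
    query = vectors[query_id]
    candidates = (
        (inner_product(query, vec), node_id)
        for node_id, vec in vectors.items()
        if node_id != query_id
    )
    # nlargest returns the candidates ordered from most to least similar.
    return [node_id for _, node_id in heapq.nlargest(k, candidates)]


def build_neighbor_sets(node_ids, vectors, k):
    """Build the set I: one neighbor list per node in N."""
    return {node_id: k_nearest_neighbors(node_id, vectors, k) for node_id in node_ids}


vectors = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0), "d": (0.8, 0.2)}
I = build_neighbor_sets(["a", "c"], vectors, k=2)
```

For large base libraries an approximate index would normally replace the linear scan; the brute-force form is used here only because it is the simplest to verify.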

A coverage obtaining unit 240, configured to obtain a cluster neighbor coverage of the current cluster to be evaluated according to each node in the current cluster to be evaluated and its corresponding neighbor node.

For example, let k be 4, so that each node has 4 neighbor nodes in the set I.

Further, the current cluster to be evaluated is denoted c, and the i-th current node is the node selected while traversing the current cluster c. The neighbor node set corresponding to the i-th current node in the set N is obtained from the neighbor node set I and denoted $I_i$. The neighbor coverage rate r of the i-th current node is calculated, and the average of all the neighbor coverage rates r in the cluster c is taken as the cluster neighbor coverage rate of the cluster c, where r is calculated by the formula:

$$r = \frac{|c \cap I_i|}{\min(|c|, |I_i|)}$$

where $|c \cap I_i|$ is the size n of the intersection between the neighbor node set $I_i$ of the i-th current node and all nodes in the cluster c, and $\min(|c|, |I_i|)$ is the minimum value m of the number of nodes in the neighbor node set $I_i$ and the number of all nodes in the cluster c.
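The per-node coverage r = n/m can be sketched in a few lines of Python. The function name and the node ids are hypothetical, and returning 0.0 when m is zero is an assumption of this sketch, since the text does not address an empty cluster or an empty neighbor set.

```python
def node_neighbor_coverage(cluster_nodes, neighbor_nodes):
    """Compute r = n / m with n = |c ∩ I_i| and m = min(|c|, |I_i|)."""
    cluster = set(cluster_nodes)
    neighbors = set(neighbor_nodes)
    m = min(len(cluster), len(neighbors))
    if m == 0:
        return 0.0  # assumption: an empty cluster or neighbor set gives zero coverage
    n = len(cluster & neighbors)
    return n / m


# Hypothetical ids: a 3-node cluster and the k = 4 neighbors of one of its nodes.
c2 = ["n4", "n5", "n6"]
neighbors_of_n4 = ["n5", "n6", "n1", "n2"]
r = node_neighbor_coverage(c2, neighbors_of_n4)
```

Two of the four neighbors lie in the cluster and m = min(3, 4) = 3, so r is 2/3. Dividing by the smaller of the two sizes keeps r at most 1 even when the cluster is smaller than k.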

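In the above example, when c is c2 and its 1st node is selected, that node is the 4th node in the set N and corresponds to 4 neighbor nodes in the set I. The neighbor coverage of the 1st node is then calculated with m = min(3, 4) = 3; the value 2/3 assumed for this node in the foregoing example corresponds to n = 2:

$$r = \frac{n}{m} = \frac{2}{3}$$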

An evaluating unit 250, configured to compare the cluster neighbor coverage of the current cluster to be evaluated with a threshold, and determine the quality of the current cluster to be evaluated.
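How the coverage obtaining unit and the evaluating unit might cooperate can be sketched as a single function. This is an illustrative sketch rather than the system's actual implementation: the function name, the dictionary-based data layout, and the node ids are assumptions, and clusters below the threshold are dissolved with their nodes returned for the next round of clustering, as described for the method.

```python
def evaluate_clusters(clusters, neighbor_sets, threshold):
    """Split clusters into qualified ones and nodes to be clustered again.

    `clusters` maps cluster id -> list of node ids; `neighbor_sets` maps
    node id -> list of neighbor node ids (both layouts are assumptions).
    """
    qualified = {}
    requeue = []
    for cluster_id, nodes in clusters.items():
        members = set(nodes)
        rates = []
        for node in nodes:
            neighbors = set(neighbor_sets.get(node, ()))
            m = min(len(members), len(neighbors))
            # Per-node coverage n / m; zero when m is zero (assumption).
            rates.append(len(members & neighbors) / m if m else 0.0)
        coverage = sum(rates) / len(rates) if rates else 0.0
        if coverage < threshold:
            requeue.extend(nodes)  # dissolve the cluster
        else:
            qualified[cluster_id] = coverage
    return qualified, requeue


clusters = {"c1": ["a", "b"], "c2": ["x", "y", "z"]}
neighbor_sets = {
    "a": ["b", "x"],
    "b": ["a", "y"],
    "x": ["y", "a", "b"],
    "y": ["a", "b", "x"],
    "z": ["a", "b", "c"],
}
qualified, requeue = evaluate_clusters(clusters, neighbor_sets, threshold=0.5)
```

In this toy input c1 reaches a coverage of exactly 0.5 and is kept, while c2 averages 2/9 and is dissolved, so its three nodes are returned for re-clustering.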

In the image processing system provided in this embodiment, a center cluster representation and a scene cluster representation are generated for each new cluster as its cluster representation. The center cluster representation of each new cluster is then used as a query vector, and the center cluster representations of all the old clusters are used as base library vectors for vector similarity retrieval, which implements coarse-grained retrieval. During similarity matching, the scene cluster representations are used for matching, which implements fine-grained matching. The comparison of new and old image data samples is thus carried out on the basis of the center cluster representation and the scene cluster representation, which accelerates the comparison and further improves the efficiency of processing massive, complex image data samples with limited computing resources.

It should be noted that the image processing system provided in the foregoing embodiment is illustrated only by way of the above division into functional modules (such as the generating module, the retrieving module, the matching module, and the combining and updating module). In practical applications, the functions may be assigned to different functional modules as needed; that is, the functional modules in the embodiment of the present invention may be further decomposed or combined. For example, the functional modules in the foregoing embodiment may be combined into one functional module, or further split into a plurality of sub-modules, so as to complete all or part of the functions described above. The names of the functional modules in the embodiments of the present invention serve only to distinguish them and are not to be construed as an improper limitation of the present invention.

[ example 3 ]

The embodiment further explains the implementation of the present invention mainly by applying to a scenario of a terminal device. The hardware structure of the terminal device is shown in fig. 3. The terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.

Optionally, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.

Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like. In this embodiment, the processor of the terminal device includes a function for executing each module of the speech recognition apparatus in each device, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.

Fig. 4 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. Fig. 4 is a specific embodiment of fig. 3 in an implementation process. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.

The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment. The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

Optionally, the second processor 1201 is provided in a processing component 1200. The terminal device may further include: a communication component 1203, a power component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.

The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.

The power component 1204 provides power to the various components of the terminal device. The power component 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.

The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further comprises a speaker for outputting voice signals.

The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.

The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, the relative positioning of its components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.

The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log on to a GPRS network and establish communication with a server via the Internet.

As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 4 can be implemented as the input device in the embodiment of fig. 3.

[ example 4 ]

It will be appreciated by those skilled in the art that the present embodiment provides a computer-readable storage medium having stored thereon a plurality of program codes, the program codes being adapted to be loaded and executed by a processor to perform any of the aforementioned image cluster quality evaluation methods. The storage medium includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

[ example 5 ]

In a processing apparatus provided in this embodiment, the processing apparatus includes a processor and a storage device, and the storage device is adapted to store a plurality of program codes, and is characterized in that the program codes are adapted to be loaded and executed by the processor to perform the image clustering quality evaluation method according to any of the foregoing embodiments.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims of the present invention, any of the claimed embodiments may be used in any combination.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. An image cluster quality evaluation method is characterized by comprising the following steps:

clustering the image data to obtain clusters to be evaluated, and traversing nodes of each cluster in all the clusters to be evaluated to obtain all the nodes of all the clusters to be evaluated;

obtaining the neighbor node of each node in all the nodes by a vector similarity retrieval algorithm;

aiming at each cluster to be evaluated, obtaining cluster neighbor coverage rate of each cluster to be evaluated according to each node in each cluster to be evaluated and the corresponding neighbor node thereof;

comparing the cluster neighbor coverage rate of each cluster to be evaluated with a threshold value, and determining the quality of each cluster to be evaluated;

clustering image data to obtain a cluster to be evaluated, wherein the clustering comprises the following steps:

performing incremental clustering on newly added image data in the image data to obtain a new cluster;

and taking the new cluster and the old cluster which is changed due to the insertion of the newly added image data as all the clusters to be evaluated, and establishing a set C as follows:

C＝{c₁，c₂，…c_i，…}

wherein the i-th node among all the nodes is the i-th node belonging to the set C;

wherein each cluster to be evaluated is denoted c, and the i-th current node is the node selected while traversing each cluster c to be evaluated; [the formulas accompanying these definitions are not recoverable from the source text;]

wherein the neighbor node set of the i-th current node is acquired from the neighbor node set I and denoted $I_i$, the neighbor coverage rate r of the i-th current node is calculated, and the average value of all the neighbor coverage rates r in the cluster c is taken as the cluster neighbor coverage rate of the cluster c, wherein r is calculated by the formula:

$$r = \frac{|c \cap I_i|}{\min(|c|, |I_i|)}$$

2. The method according to claim 1, wherein the obtaining the neighbor node of each node of all the nodes by the vector similarity search algorithm specifically comprises:

taking the feature vector of each node in all the nodes as a query vector, and performing vector similarity retrieval in preset bottom library vectors to obtain k bottom library vectors most similar to the feature vector of each node;

taking the nodes corresponding to the k bottom base vectors as neighbor nodes of each node, and establishing a set I by using all the neighbor nodes corresponding to all the nodes;

and the k most similar bottom library vectors are the k bottom library vectors with the largest similarity values obtained by vector similarity retrieval.

3. The method according to claim 1, wherein obtaining, for each cluster to be evaluated, a cluster neighbor coverage of each cluster to be evaluated according to each node and its corresponding neighbor node in each cluster to be evaluated specifically comprises:

traversing all the clusters to be evaluated to select a current cluster to be evaluated;

traversing all nodes in the cluster to be evaluated currently to select any current node;

acquiring a neighbor node corresponding to the current node, and calculating the neighbor coverage rate of the current node;

and averaging the calculated neighbor coverage rates of all nodes in the current cluster to be evaluated to obtain the cluster neighbor coverage rate of the current cluster to be evaluated.

4. The method according to claim 3, wherein calculating the neighbor coverage of the current node specifically comprises:

calculating the intersection of the neighbor node of the current node and all nodes in the current to-be-evaluated cluster where the current node is located, and recording the size of the intersection as n;

taking the minimum value of the number of neighbor nodes corresponding to the current node and the number of all nodes in the current cluster to be evaluated where the current node is located, and recording the minimum value as m;

and calculating the neighbor coverage rate of the current node as n/m.

5. The method of claim 1, wherein comparing the cluster neighbor coverage of each cluster to be evaluated to a threshold to determine the quality of each cluster to be evaluated comprises:

if the cluster neighbor coverage rate of each cluster to be evaluated is smaller than the threshold value, the quality of each cluster to be evaluated is unqualified;

and adding all the nodes in the clusters to be evaluated with unqualified quality into the image data to be clustered next time or discarding the nodes.

6. The method according to any one of claims 1 to 4,

the algorithm adopted by the incremental clustering is K-means, DBSCAN, or a hierarchical clustering algorithm; and/or

the vector similarity retrieval algorithm is any one of brute-force retrieval, hash retrieval, IVFFlat, IVFPQ, and HNSW; and/or

the vector similarity is calculated by adopting any one of vector inner product calculation, L1 distance calculation and L2 distance calculation.

7. An image cluster quality evaluation system, comprising:

the node acquisition unit is used for traversing nodes of each selected cluster to be evaluated so as to obtain all nodes of all the clusters to be evaluated;

the retrieval unit is used for obtaining the neighbor nodes of each node in all the nodes through a vector similarity retrieval algorithm;

a coverage rate obtaining unit, configured to obtain, for each cluster to be evaluated, a cluster neighbor coverage rate of each cluster to be evaluated according to each node in each cluster to be evaluated and a corresponding neighbor node thereof;

the evaluation unit is used for comparing the cluster neighbor coverage rate of each cluster to be evaluated with a threshold value and determining the quality of each cluster to be evaluated;

the clustering unit is used for carrying out incremental clustering on the newly added image data in the image data to obtain a new cluster; a selecting unit, configured to take the new cluster and the old cluster that has changed due to insertion of the new image data as all the clusters to be evaluated, and establish a set C as:

C＝{c₁，c₂，...c_i，...}

wherein the i-th node among all the nodes is the i-th node belonging to the set C;

wherein each cluster to be evaluated is denoted c. [The formulas accompanying these definitions are not recoverable from the source text.]

8. The system of claim 7, wherein the retrieval unit is specifically configured to:

the nodes corresponding to the k bottom base vectors are used as neighbor nodes of each node, and all neighbor nodes corresponding to all the nodes are used for establishing a set I;

9. The system of claim 7, wherein the coverage rate obtaining unit is specifically configured to:

10. The system according to claim 9, wherein the coverage obtaining unit "calculating the neighbor coverage of the current node" is specifically:

and calculating the neighbor coverage rate of the current node as n/m.

11. The system of claim 7, wherein the evaluation unit is specifically configured to:

12. The system of claim 10,

further comprising: performing incremental clustering on the image data to obtain the clusters to be evaluated, wherein the incremental clustering adopts K-means, DBSCAN, or a hierarchical clustering algorithm; and/or

13. A computer-readable storage medium having a plurality of program codes stored therein, wherein when the plurality of program codes are loaded and executed by a processor, the method for evaluating the quality of image clusters according to any one of claims 1 to 6 is implemented.

14. A processing apparatus comprising a processor and a storage device adapted to store a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by the processor to perform the image cluster quality assessment method according to any one of claims 1 to 6.