CN116704225A

CN116704225A - File cleaning method and related device, electronic equipment and storage medium

Info

Publication number: CN116704225A
Application number: CN202310528714.9A
Authority: CN
Inventors: 柯辛玥; 陈立力; 周明伟
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2023-05-10
Filing date: 2023-05-10
Publication date: 2023-09-05

Abstract

The application discloses a file cleaning method and a related device, electronic equipment and a storage medium. The file cleaning method comprises: screening a plurality of files to be processed to obtain suspected impure files, wherein the object pictures in a suspected impure file are suspected to belong to different objects; clustering the object pictures in the suspected impure file to obtain a plurality of first-type clusters; screening the object pictures in the first-type clusters based on the similarity between local features of the same part in different object pictures in the first-type clusters to obtain second-type clusters; and obtaining a purified file based on the second-type clusters. This scheme can improve the accuracy of file cleaning.

Description

File cleaning method and related device, electronic equipment and storage medium

Technical Field

The present application relates to the field of image clustering technologies, and in particular, to a method and an apparatus for cleaning files, an electronic device, and a storage medium.

Background

With the development of science and technology, image clustering is widely applied. For example, during identity authentication, a related information base must be searched using an object picture whose identity has not been confirmed; this requires building a "one object, one file" information base in which images of the same object belong to the same file.

At present, images of different objects are generally compared, and images of the same object are then assigned to the same file. However, comparing object pictures has its limitations, and cleaning errors often occur when new files are generated by clustering. In view of this, how to improve the accuracy of file cleaning is an urgent issue.

Disclosure of Invention

The application mainly solves the technical problem of providing a file cleaning method, a related device, electronic equipment and a storage medium, which can improve the accuracy of file cleaning.

In order to solve the above technical problem, a first aspect of the present application provides a file cleaning method, including: screening a plurality of files to be processed to obtain suspected impure files, wherein the object pictures in a suspected impure file are suspected to belong to different objects; clustering the object pictures in the suspected impure file to obtain a plurality of first-type clusters; screening the object pictures in the first-type clusters based on the similarity between local features of the same part in different object pictures in the first-type clusters to obtain second-type clusters; and obtaining a purified file based on the second-type clusters.

In order to solve the above technical problem, a second aspect of the present application provides a file cleaning device, which includes a first screening module, a picture clustering module, a second screening module and a file purification module. The first screening module is used for screening a plurality of files to be processed to obtain suspected impure files, wherein the object pictures in a suspected impure file are suspected to belong to different objects. The picture clustering module is used for clustering the object pictures in the suspected impure file to obtain a plurality of first-type clusters. The second screening module is used for screening the object pictures in the first-type clusters based on the similarity between local features of the same part in different object pictures in the first-type clusters to obtain second-type clusters. The file purification module is used for obtaining a purified file based on the second-type clusters.

In order to solve the above-mentioned problems, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the file cleaning method in the first aspect.

In order to solve the above-mentioned technical problem, a fourth aspect of the present application provides a computer readable storage medium storing program instructions executable by a processor for implementing the file cleaning method in the above-mentioned first aspect.

According to the above scheme, suspected impure files are screened out of a plurality of files to be processed, where the object pictures in a suspected impure file are suspected to belong to different objects; the object pictures in the suspected impure file are clustered to obtain a plurality of first-type clusters; the object pictures in the first-type clusters are screened based on the similarity between local features of the same part in different object pictures in the first-type clusters to obtain second-type clusters; and a purified file is obtained based on the second-type clusters. On the one hand, because the suspected impure files are first screened out of the files to be processed and only their object pictures are clustered into first-type clusters, the accuracy of that clustering is improved, which in turn improves the efficiency of file cleaning. On the other hand, because the object pictures in the first-type clusters are screened again using the similarity between local features of the same part, the accuracy of file cleaning is further improved. In addition, because the purified file is selected based on the second-type clusters, the applicability of the file cleaning method is improved. Therefore, the accuracy of file cleaning can be improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a flow chart of an embodiment of a method for cleaning files according to the present application;

FIG. 2 is a flowchart of an embodiment of step S12 in FIG. 1;

FIG. 3 is a flowchart illustrating the step S12 of FIG. 1 according to another embodiment;

FIG. 4 is a flowchart of a further embodiment of step S12 in FIG. 1;

FIG. 5 is a flowchart illustrating an embodiment of step S13 in FIG. 1;

FIG. 6 is a flowchart illustrating the step S13 in FIG. 1 according to another embodiment;

FIG. 7 is a schematic diagram of a frame of an embodiment of a file cleaning apparatus according to the present application;

FIG. 8 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;

FIG. 9 is a schematic diagram of a frame of an embodiment of a computer readable storage medium of the present application.

Detailed Description

The following describes embodiments of the present application in detail with reference to the drawings.

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.

The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B and C. "Several" means at least one. The terms "first", "second" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Referring to FIG. 1, FIG. 1 is a flowchart illustrating an embodiment of a file cleaning method according to the present application.

Specifically, the method may include the steps of:

Step S11: screening a plurality of files to be processed to obtain suspected impure files.

In the embodiment of the present disclosure, the object pictures in a suspected impure file are suspected to belong to different objects. Of course, a suspected impure file may also include object pictures whose object cannot be identified, pictures that do not belong to the object, and the like; the object pictures in the suspected impure file may be determined according to the actual situation, which is not specifically limited herein.

In one implementation scenario, the object picture may include, but is not limited to, a portrait picture, an animal image, a vehicle image, etc., and of course, the object picture may be determined according to the actual situation, which is not limited herein.

In one implementation scenario, the plurality of files to be processed may contain no suspected impure files; in that case, screening the files to be processed directly yields purified files, that is, the purified files are the files to be processed themselves. The files to be processed may also include some suspected impure files; after those suspected impure files are cleaned, the cleaned purified files together with the purified files already contained in the files to be processed are obtained. The files to be processed may also consist only of suspected impure files; after the suspected impure files are cleaned, the cleaned purified files are obtained.

In one implementation scenario, in order to screen out suspected impure files, feature extraction may be performed on the object pictures in a file to be processed to obtain picture features, and the picture similarity between the picture features of different object pictures is calculated. The mean of the picture similarities of the file to be processed is then calculated and taken as a third outlier, which is compared with a preset threshold. A file to be processed whose third outlier is not smaller than the preset threshold is taken as a suspected impure file, and a file to be processed whose third outlier is smaller than the preset threshold is taken as a purified file.

In another implementation scenario, different from the foregoing, in order to further improve the accuracy of screening out suspected impure files, first outliers of several outlier factors of the file to be processed may be obtained first. An outlier factor characterizes a dimension along which the file to be processed may be abnormal. The outlier factors may include the similarity between different object pictures in the file to be processed, the similarity between the target part pictures of different object pictures in the file to be processed, the change in the number of object pictures in the file to be processed, the update time of the object pictures in the file to be processed, and so on. A corresponding first outlier is obtained from each outlier factor, and the first outlier characterizes the degree of abnormality of the file to be processed in the dimension of that outlier factor.

Illustratively, object features and target part features of the object pictures in the file to be processed are extracted first. The object features are obtained by extracting features directly from the object pictures; the target part pictures are obtained by cropping the object pictures, and the target part features are obtained by extracting features from the target part pictures. The similarity between the object features of different object pictures and the similarity between different target part features are then calculated. Further, the mean similarity between the object features and the mean similarity between the target part features in the file to be processed are calculated, and each of the two means is taken as a first outlier. Of course, the minimum similarity between the object features and the minimum similarity between the target part features in the file to be processed may also be selected as first outliers. In addition, a first outlier may be determined according to whether the number of object pictures in the file to be processed changes abruptly: for example, when the number changes abruptly, the corresponding first outlier is smaller, and when the number changes only slightly, the corresponding first outlier is larger. Alternatively, a first outlier may be determined according to the update time of the object pictures in the file to be processed: for example, when the update interval of the object pictures is longer, the corresponding first outlier is smaller, and when the update interval is shorter, the corresponding first outlier is larger.

The first outliers are then fused to obtain a second outlier. Specifically, the second outlier may be obtained by weighting a plurality of first outliers, or by directly summing a plurality of first outliers; the manner of obtaining the second outlier may be determined according to the actual situation and is not specifically limited herein. It will be appreciated that when there is only one first outlier, it may be taken directly as the second outlier. After the second outlier is obtained, whether the file to be processed is a suspected impure file is determined based on whether the second outlier satisfies a third condition. The third condition may be that the second outlier is not greater than a preset threshold, which may be 0.5, 0.6, 0.7, and the like; when the second outlier satisfies the third condition, the corresponding file to be processed is regarded as a suspected impure file. The third condition may be set according to the outlier factors and the manner of determining the second outlier, and is not specifically limited herein.

In this manner, determining first outliers for several outlier factors of the file to be processed improves the diversity of the first outliers; fusing the first outliers into a second outlier improves the accuracy of the second outlier; and determining whether the file to be processed is a suspected impure file based on whether the second outlier satisfies the third condition improves the accuracy of screening the files to be processed, and therefore the accuracy of obtaining suspected impure files and of file cleaning.
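As an illustration only, the screening described above can be sketched in Python. The sketch assumes that features are plain lists of floats, that similarity is cosine similarity, that the first outliers are the two similarity means, and that fusion is a weighted sum. The function names, the equal default weights and the 0.6 threshold are hypothetical choices, not values fixed by the application.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(features):
    """One first outlier: mean similarity over all pairs of pictures."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 1.0  # assumption: a single picture cannot disagree with itself
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def fuse_outliers(first_outliers, weights=None):
    """Second outlier: weighted sum of first outliers (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(first_outliers)] * len(first_outliers)
    return sum(w * x for w, x in zip(weights, first_outliers))


def is_suspected_impure(object_features, part_features, threshold=0.6, weights=None):
    """Third condition: second outlier not greater than the preset threshold."""
    first_outliers = [
        mean_pairwise_similarity(object_features),  # whole-picture dimension
        mean_pairwise_similarity(part_features),    # target-part dimension
    ]
    return fuse_outliers(first_outliers, weights) <= threshold
```

Other first outliers, such as one derived from abrupt changes in the number of pictures, could be appended to `first_outliers` in the same way, provided that a smaller value again means a more abnormal file.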

In one implementation scenario, the target part is at least one part whose feature richness ranks within a preset top order among the parts of the object, and the target part may be determined according to the object category. For example, for most living things the face is usually rich in features, so the target part may be set as the face; for a non-living object such as a vehicle, the head and tail of the vehicle are generally highly recognizable, so the target part may be set as the head and tail of the vehicle. It should be understood that the above settings are only possible settings in practical applications and do not limit how the target part is set; the specific setting of the target part may be determined according to the actual situation, which is not limited herein.

Step S12: clustering the object pictures in the suspected impure file to obtain a plurality of first-type clusters.

In one implementation scenario, in order to obtain the first-type clusters, feature extraction may be performed on the object pictures in the suspected impure file to obtain first features, and the object pictures in the suspected impure file may then be clustered based on the similarity between the first features of different object pictures to obtain a plurality of first-type clusters. The clustering manner may be, but is not limited to, the K-means method, hierarchical clustering, density-based clustering, and so on. Different from the foregoing, after the first features are extracted, the object pictures in the suspected impure file may be clustered based on the first features to obtain a plurality of third-type clusters; the object pictures in the third-type clusters are then screened based on a detection result of whether they contain the target part, to obtain fourth-type clusters; on this basis, different fourth-type clusters satisfying a first condition are merged to obtain a plurality of first-type clusters. In this manner, clustering the object pictures in the suspected impure file based on the first features yields the third-type clusters, and screening them based on whether their object pictures contain the target part improves the accuracy of the resulting fourth-type clusters. Merging different fourth-type clusters that satisfy the first condition then improves the clustering effect of the first-type clusters, so that the differences between the object pictures within a first-type cluster are as small as possible, which further improves the accuracy of file cleaning.

Referring to FIG. 2, FIG. 2 is a flowchart illustrating an embodiment of step S12 in FIG. 1. Specifically, the method may include the steps of:

Step S21: taking each object picture in the suspected impure file as a fifth-type cluster.

It can be appreciated that, in order to enhance clustering of the object pictures in the suspected impure file, each object picture in the suspected impure file may be used as a fifth cluster, i.e. the number of the fifth clusters is the number of the object pictures in the suspected impure file.

Step S22: based on the first feature, a first distance between different fifth class clusters is calculated.

In one implementation scenario, the first distances between different fifth-type clusters may be obtained by calculating the cosine distances between the first features, or by calculating the Euclidean distances between the first features. The manner of calculating the first distances may be determined according to the actual situation, and is not specifically limited herein.

Step S23: in response to the first distance satisfying the second condition, merging the corresponding fifth-type clusters to obtain a first merged cluster.

In one implementation scenario, the second condition may be that the first distance is not greater than a preset threshold, which may be set to 0.3, 0.4, and so on. The preset threshold may be determined according to practical situations, and is not specifically limited herein.

In one implementation scenario, when the first distance does not satisfy the second condition, the corresponding fifth-type clusters are suspected not to belong to the same object; calculation of the first distance between other fifth-type clusters, and judgment of whether it satisfies the second condition, then continue. When the first distance satisfies the second condition, the corresponding fifth-type clusters are merged to obtain a first merged cluster, and the first distance continues to be calculated and judged between different clusters, where the different clusters include both fifth-type clusters and first merged clusters. Whenever the first distance between any two clusters satisfies the second condition, the two clusters are merged to obtain a new first merged cluster, until no first distance between any two different clusters satisfies the second condition.

Step S24: taking the first merged clusters and the fifth-type clusters in the suspected impure file that were not merged as the third-type clusters.

It can be understood that taking the first merged clusters and the unmerged fifth-type clusters in the suspected impure file as the third-type clusters means that fifth-type clusters are merged whenever the first distance between them satisfies the second condition, thereby obtaining the third-type clusters. In this manner, the first distances between different fifth-type clusters are calculated and the corresponding clusters are merged when the first distance satisfies the second condition, which improves the efficiency of merging the fifth-type clusters; and taking the first merged clusters together with the unmerged fifth-type clusters in the suspected impure file as the third-type clusters improves the accuracy of the third-type clusters.
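Steps S21 to S24 can be sketched as follows. This is a minimal sketch under stated assumptions: the first distance is the cosine distance, the distance between two clusters is measured between the means of their first features, and the closest qualifying pair is merged first. None of these choices is fixed by the application, and the names `merge_until_stable` and `centroid` are hypothetical.

```python
from math import sqrt


def cosine_distance(u, v):
    """First distance between two first features (cosine distance assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def centroid(cluster, features):
    """Mean first feature of a cluster (assumed cluster representative)."""
    dim = len(features[cluster[0]])
    return [sum(features[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def merge_until_stable(features, threshold=0.3):
    """Return third-type clusters as lists of picture indices."""
    # S21: every object picture starts as its own fifth-type cluster.
    clusters = [[i] for i in range(len(features))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # S22: first distance between two clusters.
                d = cosine_distance(centroid(clusters[a], features),
                                    centroid(clusters[b], features))
                # Second condition: distance not greater than the threshold.
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            # S24: merged clusters plus unmerged singletons.
            return clusters
        _, a, b = best
        # S23: merge the qualifying pair into a first merged cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
```

For example, `merge_until_stable([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])` groups the first two pictures and leaves the third as an unmerged cluster.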

Referring to FIG. 3, FIG. 3 is a flowchart illustrating another embodiment of step S12 in FIG. 1. Specifically, the method may include the steps of:

Step S31: judging whether the object pictures in the third-type cluster contain the target part; if not, executing step S32; otherwise, executing step S33.

In one implementation scenario, all of the object pictures in a third-type cluster may contain the target part, only some of them may contain it, or none of them may contain it. It can be understood that an object picture is strongly associated with its target part: when an object picture can be associated with a target part, the object picture contains the target part, and when it cannot, the object picture does not contain the target part. Alternatively, whether an object picture contains the target part may be detected by a network model, which may be, but is not limited to, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or the like.

Step S32: eliminating the third-type cluster.

In one implementation scenario, the third-type cluster does not contain the target part, that is, none of the object pictures in the third-type cluster contains the target part, and the third-type cluster is therefore eliminated. It can be understood that when none of the object pictures in a third-type cluster contains the target part, the object to which the pictures belong is uncertain; third-type clusters that do not contain the target part are therefore removed in order to improve the accuracy of file cleaning.

Step S33: the third class of clusters is regarded as a fourth class of clusters.

In one implementation scenario, the third-type cluster contains the target part, that is, all or some of the object pictures in the third-type cluster contain the target part, and the third-type cluster is then taken directly as a fourth-type cluster. It can be understood that when the third-type cluster contains the target part, the object to which the third-type cluster is suspected to belong can be determined, and the corresponding object pictures can be further confirmed, which improves both the accuracy and the efficiency of file cleaning. In this manner, whether the object pictures contain the target part is judged, and third-type clusters whose object pictures do not contain the target part are removed, which improves the efficiency of file cleaning.
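Steps S31 to S33 amount to a filter over the third-type clusters, which might be sketched as follows. The detector `contains_target_part` is a hypothetical callable standing in for whatever association lookup or network model is actually used.

```python
def filter_third_clusters(third_clusters, contains_target_part):
    """Keep clusters in which at least one object picture contains the target part.

    third_clusters: list of clusters, each a list of picture identifiers.
    contains_target_part: hypothetical detector, picture identifier -> bool.
    """
    fourth_clusters = []
    for cluster in third_clusters:
        # S31: does any picture in this cluster contain the target part?
        if any(contains_target_part(picture) for picture in cluster):
            fourth_clusters.append(cluster)  # S33: keep as a fourth-type cluster
        # S32: otherwise the cluster is eliminated (simply not kept)
    return fourth_clusters
```

A cluster is kept even when only some of its pictures contain the target part, which matches the description of step S33.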

Referring to FIG. 4, FIG. 4 is a flowchart illustrating a further embodiment of step S12 in FIG. 1. Specifically, the method may include the steps of:

Step S41: performing feature extraction on the target part pictures to obtain second features.

In the implementation scenario of the present disclosure, the target part picture is obtained by cropping the target part from an object picture in the suspected impure file, and feature extraction is performed on the cropped target part picture to obtain the second feature. In addition, a fourth-type cluster may include object pictures that have no target part picture; for such an object picture, the second feature of any object picture in the same cluster that does have a target part picture may be selected as its second feature.

Step S42: based on the second feature, a second distance between different fourth class clusters is calculated.

In one implementation scenario, the second distance between different fourth-type clusters, calculated from the second features, may be the minimum distance, the maximum distance, the average distance, etc. between the two clusters. Illustratively, the second distance is the maximum distance between the two clusters, i.e., the distance between the two second features, one from each cluster, that are farthest apart. It can be understood that when the second distance is the maximum distance between clusters, clusters in a data set containing noise can be well separated.

Step S43: in response to the second distance satisfying the first condition, merging the corresponding fourth-type clusters to obtain a second merged cluster.

In one implementation scenario, the first condition may be that the second distance is not greater than a preset threshold, which may be 0.3, 0.4, etc.; the first condition may also be that the second distance is not greater than a preset distance, which may be the average distance between two clusters that belong to the same object. The first condition may be determined according to the actual situation, and is not specifically limited herein.

In one implementation scenario, when the second distance does not satisfy the first condition, the corresponding fourth-type clusters are suspected not to belong to the same object; calculation of the second distance between other fourth-type clusters, and judgment of whether it satisfies the first condition, then continue. When the second distance satisfies the first condition, the corresponding fourth-type clusters are merged to obtain a second merged cluster, and the second distance continues to be calculated and judged between different clusters, where the different clusters include both fourth-type clusters and second merged clusters. Whenever the second distance between any two clusters satisfies the first condition, the two clusters are merged to obtain a new second merged cluster, until no second distance between any two different clusters satisfies the first condition.

Step S44: taking the second merged clusters and the fourth-type clusters that were not merged as the first-type clusters.

It can be understood that taking the second merged clusters and the unmerged fourth-type clusters as the first-type clusters means that fourth-type clusters are merged whenever the second distance between them satisfies the first condition, thereby obtaining the first-type clusters. In this manner, the second distances between different fourth-type clusters are calculated based on the second features, and the corresponding fourth-type clusters are merged into a second merged cluster when the second distance satisfies the first condition, which facilitates further judgment of whether the object pictures in the merged cluster belong to the same object. Taking the second merged clusters and the unmerged fourth-type clusters as the first-type clusters then helps improve the accuracy of merging clusters, and further improves the accuracy and stability of file cleaning.
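Steps S41 to S44 can be sketched in the same style. The sketch assumes that the second distance is the maximum (complete-linkage) Euclidean distance between second features, and that a picture without a target part picture borrows the second feature of the first picture in its cluster that has one. The function names, the dictionary layout and the 0.4 threshold are hypothetical.

```python
from math import dist


def fill_missing(cluster, second_features):
    """S41: pictures without a target part borrow a second feature from a cluster mate.

    Every fourth-type cluster is expected to hold at least one picture with a
    target part, so a donor feature always exists.
    """
    donor = next(second_features[p] for p in cluster if second_features[p] is not None)
    return [second_features[p] if second_features[p] is not None else donor
            for p in cluster]


def complete_linkage(feats_a, feats_b):
    """S42: second distance as the distance between the two farthest features."""
    return max(dist(u, v) for u in feats_a for v in feats_b)


def merge_fourth_clusters(fourth_clusters, second_features, threshold=0.4):
    """Return first-type clusters as lists of picture identifiers."""
    clusters = [list(c) for c in fourth_clusters]
    feats = [fill_missing(c, second_features) for c in clusters]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # S43: first condition, second distance not greater than threshold.
                if complete_linkage(feats[a], feats[b]) <= threshold:
                    clusters[a] += clusters[b]
                    feats[a] += feats[b]
                    del clusters[b], feats[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the clusters changed
    # S44: second merged clusters plus unmerged fourth-type clusters.
    return clusters
```

Using the maximum distance makes merging conservative: two clusters are merged only if every pair of second features across them is within the threshold.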

Step S13: screening the object pictures in the first-type cluster based on the similarity between local features of the same part in different object pictures in the first-type cluster to obtain a second-type cluster.

In one implementation scenario, each object picture in the first-type cluster may be segmented into a plurality of local pictures, and feature extraction is then performed on each local picture to obtain local features. The segmentation manner may be determined separately for different first-type clusters, or the same segmentation manner may be used for all first-type clusters. Specifically, one segmentation manner is to divide a human body picture equally into several parts from top to bottom to obtain the local pictures; alternatively, the segmentation may be determined by a model, that is, the upper body, the lower body, the shoes, the left hand and so on are located in the human body picture, and the local pictures are obtained accordingly.
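The equal top-to-bottom division can be illustrated with a short sketch. It assumes that a picture is represented as a list of pixel rows and that any leftover rows go to the last strip; the function name `split_into_strips` is hypothetical.

```python
def split_into_strips(rows, num_parts):
    """Divide a picture (a list of pixel rows) into num_parts horizontal strips."""
    height = len(rows)
    # Integer boundaries; rows that do not divide evenly end up in later strips.
    bounds = [i * height // num_parts for i in range(num_parts + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
```

Each strip would then be passed to a feature extractor to obtain the local feature of the corresponding part.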

In one implementation scenario, the mean of the similarities between the local features of two different object pictures may be calculated and compared with a preset threshold. When the mean is greater than the preset threshold, the corresponding first-type cluster is taken as a second-type cluster; when the mean is not greater than the preset threshold, the corresponding first-type cluster is eliminated. Different from the foregoing, the number of contradiction points between any two object pictures may be calculated based on the similarity between local features of the same part in different object pictures in the first-type cluster, where the number of contradiction points characterizes the number of parts in the two object pictures that are suspected to belong to different objects; the object pictures in the first-type cluster are then screened based on the number of contradiction points to obtain a second-type cluster. In this manner, calculating the number of contradiction points from the similarity between local features of the same part, and screening the object pictures in the first-type cluster based on that number, improves the accuracy of the screening result and therefore the accuracy of file cleaning.

Referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of step S13 in fig. 1. Specifically, the method may include the steps of:

Step S51: obtaining the similarity between the local features of the same part in the two object pictures, and taking the similarity as the local similarity.

It can be understood that the local features may be obtained by the local feature extraction method disclosed above, which is not repeated herein. In addition, the similarity between local features may be determined from the cosine similarity between the features or from the Euclidean distance between the features; the measure may be chosen according to the actual situation and is not specifically limited herein.
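The two measures mentioned above can be sketched as follows, with local features modelled as plain lists of floats. The function names are illustrative. The mapping from Euclidean distance to a similarity score, `1 / (1 + d)`, is an assumption added here so that both measures can be compared against the same kind of threshold; the application does not specify one.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two feature vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(distance: float) -> float:
    """Assumed mapping from a distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)
```

Note that the two measures run in opposite directions: a larger cosine similarity means more alike, whereas a larger Euclidean distance means less alike, so a distance has to be converted or the comparison inverted before it is checked against the first threshold.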

Step S52: judging whether the local similarity is smaller than a first threshold value or not; if yes, go to step S53; otherwise, step S54 is performed.

In one implementation scenario, the first threshold may be set to 0.8, 0.9, and so on. The first threshold may be determined according to practical situations, and is not specifically limited herein.

Step S53: determining that the part corresponding to the local feature is a contradiction point.

In one implementation scenario, when the local similarity is smaller than the first threshold, this indicates that the same part has a low similarity across the two object pictures, that is, the part corresponding to the local feature is a contradiction point.

Step S54: determining that the part corresponding to the local feature is not a contradiction point.

In one implementation scenario, when the local similarity is not smaller than the first threshold, this indicates that the same part has a high similarity across the two object pictures, that is, the part corresponding to the local feature is not a contradiction point.

It can be understood that the judgment of whether the similarity between local features is smaller than the first threshold is repeated for the different object pictures in the first-class cluster until it has been determined, for the part corresponding to every local feature, whether that part is a contradiction point; the number of contradiction points between two object pictures is then obtained as the sum of the contradiction points between them. In this manner, determining whether the part corresponding to each local feature is a contradiction point helps improve the accuracy of identifying contradiction points, and obtaining the number of contradiction points between two object pictures from the sum of the contradiction points between them helps improve the accuracy of that number, which in turn improves the accuracy of file cleaning.
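Steps S51 to S54 can be sketched for one pair of object pictures as follows. The sketch assumes the local similarities have already been computed and are keyed by part name; the part names, the function name, and the default threshold of 0.8 (one of the example values given above) are illustrative.

```python
from typing import Dict, List, Tuple


def find_contradiction_points(
    local_similarities: Dict[str, float],
    first_threshold: float = 0.8,
) -> Tuple[int, List[str]]:
    """Return the number of contradiction points and the parts that are contradiction points.

    local_similarities maps a part name to the similarity between the local
    features of that part in the two object pictures being compared.
    """
    contradicting_parts = [
        part
        for part, similarity in local_similarities.items()
        if similarity < first_threshold  # step S52: strictly smaller than the threshold
    ]
    return len(contradicting_parts), contradicting_parts


# Hypothetical local similarities for one pair of object pictures.
similarities = {"upper_body": 0.92, "lower_body": 0.55, "shoes": 0.40, "left_hand": 0.85}
count, parts = find_contradiction_points(similarities, first_threshold=0.8)
```

In this hypothetical pair, the lower body and the shoes fall below the threshold, so the pair has two contradiction points.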

Referring to fig. 6, fig. 6 is a flowchart illustrating another embodiment of step S13 in fig. 1. Specifically, the method may include the steps of:

Step S61: acquiring a first picture in the first-class cluster that does not contain the target part, and acquiring a second picture in the first-class cluster that contains the target part.

It can be understood that the first picture that does not contain the target part and the second picture that contains the target part may be determined by the method, disclosed in the foregoing embodiments, for determining whether an object picture contains the target part, which is not repeated herein.

Step S62: obtaining, based on the number of contradiction points between the first picture and each second picture, first weights indicating that the first picture and each second picture belong to different objects.

In one implementation scenario, the number of contradiction points between the first picture and each second picture may be used as a first weight corresponding to the first picture and each second picture; the ratio of the number of contradiction points between the first picture and each second picture to the number of local images in the object picture can also be used as the first weight corresponding to the first picture and each second picture. The first weight may be determined according to practical situations, and is not specifically limited herein.

Step S63: fusing the first weights to obtain a second weight.

In one implementation scenario, the sum of the first weights of the first picture and the second picture in the first cluster belonging to different objects may be used as the second weight, or the average value of the first weights of the first picture and the second picture in the first cluster belonging to different objects may be used as the second weight. The second weight may be determined according to practical situations, and is not specifically limited herein.

Step S64: judging whether the second weight is smaller than a second threshold value or not; if not, executing step S65; otherwise, step S66 is performed.

In one implementation, the second threshold may be determined based on a manner of calculation of the first weight and the second weight. For example, the first weight is a ratio of the number of contradiction points between the first picture and each second picture to the number of local images in the object picture, and the second weight is a mean value of the first weights of the first picture and each second picture belonging to different objects in the first cluster, and then the second threshold may be 0.6, 0.7, and so on. The second threshold may be determined according to practical situations, and is not specifically limited herein.

Step S65: eliminating the corresponding first picture.

In one implementation scenario, when the second weight is not smaller than the second threshold, this indicates that the first picture and the second pictures are suspected not to belong to the same object, and the corresponding first picture is eliminated, which reduces noise in the file cleaning process and further improves the accuracy of file cleaning.

Step S66: assigning the corresponding first picture to the second-class cluster.

In the embodiment of the disclosure, the second-class cluster includes at least the second pictures in the first-class cluster that contain the target part. It can be understood that, with the second pictures already included in the second-class cluster, it is further judged whether the first picture and the second pictures belong to the same object: when the second weight is smaller than the second threshold, the first picture and the second pictures are considered to belong to the same object, and the first picture is directly assigned to the second-class cluster. In this manner, the corresponding first picture is assigned to the second-class cluster when the second weight is smaller than the second threshold and is eliminated when the second weight is not smaller than the second threshold, which reduces noise in the file cleaning process and further improves the accuracy of file cleaning.
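Steps S62 to S66 can be sketched as follows, using one of the combinations described above: the first weight is the ratio of contradiction points to the number of local pictures, and the second weight is the mean of the first weights. The function names, the picture identifiers, and the input layout (a mapping from each first picture to its contradiction counts against every second picture) are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def first_weight(num_contradictions: int, num_local_parts: int) -> float:
    """Step S62: ratio of contradiction points to the number of local pictures."""
    return num_contradictions / num_local_parts


def second_weight(first_weights: List[float]) -> float:
    """Step S63: fuse the first weights by taking their mean."""
    if not first_weights:
        raise ValueError("at least one second picture is required")
    return sum(first_weights) / len(first_weights)


def screen_first_pictures(
    contradictions: Dict[str, List[int]],
    num_local_parts: int,
    second_threshold: float = 0.6,
) -> Tuple[List[str], List[str]]:
    """Steps S64 to S66: split the first pictures into kept and eliminated ones."""
    kept, eliminated = [], []
    for picture_id, counts in contradictions.items():
        weights = [first_weight(c, num_local_parts) for c in counts]
        if second_weight(weights) < second_threshold:
            kept.append(picture_id)        # S66: assign to the second-class cluster
        else:
            eliminated.append(picture_id)  # S65: eliminate the first picture
    return kept, eliminated


# Hypothetical counts of contradiction points against three second pictures,
# with each object picture divided into 4 local pictures.
kept, eliminated = screen_first_pictures(
    {"first_a": [1, 0, 1], "first_b": [3, 4, 3]}, num_local_parts=4
)
```

With these values the second weight of `first_a` is about 0.17 and that of `first_b` is about 0.83, so only `first_a` joins the second pictures in the second-class cluster.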

Step S14: based on the second class of clusters, a purification profile is obtained.

In one implementation scenario, at least one second-class cluster is obtained by cleaning the suspected impure file, and the second-class cluster containing the largest number of object pictures may be selected as the purified file. In another implementation scenario, different from the foregoing, after the suspected impure file is cleaned to obtain the second-class clusters, each second-class cluster is taken as a separate purified file. In this manner, obtaining the purified file by selection among the second-class clusters helps improve the efficiency of file cleaning; in addition, being able to choose how the purified files are obtained further improves the applicability of file cleaning.
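The two options for step S14 can be sketched as follows, with each second-class cluster modelled as a list of picture identifiers. The function name and the flag `keep_largest_only` are illustrative.

```python
from typing import List


def purified_archives(
    second_clusters: List[List[str]],
    keep_largest_only: bool,
) -> List[List[str]]:
    """Derive purified files from the second-class clusters.

    keep_largest_only=True  -> only the cluster with the most object pictures.
    keep_largest_only=False -> every second-class cluster becomes its own file.
    """
    if not second_clusters:
        return []
    if keep_largest_only:
        # max() returns the first cluster among those of maximal size.
        return [list(max(second_clusters, key=len))]
    return [list(cluster) for cluster in second_clusters]


# Hypothetical second-class clusters.
clusters = [["p1", "p2"], ["p3", "p4", "p5"], ["p6"]]
largest = purified_archives(clusters, keep_largest_only=True)
separate = purified_archives(clusters, keep_largest_only=False)
```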

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

Referring to fig. 7, fig. 7 is a schematic diagram of a frame of an embodiment of a file cleaning apparatus according to the present application. Archive cleaning device 70 includes a first screening module 71, a picture clustering module 72, a second screening module 73, and an archive purification module 74. The first screening module 71 is configured to screen a suspected impurity file from a plurality of files to be processed; the object pictures in the suspected impure files are suspected to belong to different objects; the picture clustering module 72 is configured to cluster the object pictures in the suspected impurity file to obtain a plurality of first clusters; the second screening module 73 is configured to screen the object pictures in the first class cluster based on the similarity between the local features of the same part in different object pictures in the first class cluster, so as to obtain a second class cluster; archive purification module 74 is configured to obtain a purified archive based on the second class of clusters.

According to the above scheme, first, the suspected impure files are obtained by screening the plurality of files to be processed, and the object pictures in the suspected impure files are clustered to obtain a plurality of first-class clusters, which helps improve the accuracy of clustering the object pictures into the first-class clusters and thus the efficiency of file cleaning; second, the object pictures in the first-class cluster are screened based on the similarity between local features of the same part in different object pictures in the first-class cluster to obtain the second-class cluster, so that the object pictures in the first-class cluster are screened again by means of the local features, which further improves the accuracy of file cleaning; third, the purified file is obtained by selection based on the second-class cluster, which helps improve the applicability of the file cleaning method. Therefore, the accuracy of file cleaning can be improved.

In some disclosed embodiments, the second screening module 73 includes a calculation sub-module and a screening sub-module. The computing submodule is used for computing the number of contradiction points between any two object pictures based on the similarity between local features of the same parts in different object pictures in the first cluster, and the number of contradiction points represents the number of parts suspected to belong to different objects in the corresponding two object pictures; the screening submodule is used for screening the object pictures in the first class cluster based on the number of the contradiction points to obtain a second class cluster.

Therefore, the number of contradiction points between any two object pictures is calculated based on the similarity between local features of the same parts in different object pictures in the first type cluster, and then the object pictures in the first type cluster are screened based on the number of contradiction points to obtain the second type cluster, so that the accuracy of screening results of the object pictures in the first type cluster is improved, and the accuracy of file cleaning is improved.

In some disclosed embodiments, the computing submodule includes a judgment unit, a determination unit, and a computing unit. The judging unit is used for judging whether the similarity between the local features of the same part in any two object pictures is smaller than a first threshold value; the determining unit is used for determining that the part corresponding to the local features is a contradiction point position in response to the similarity between the local features being smaller than a first threshold value; the computing unit is used for obtaining the number of the contradiction points between the two corresponding object pictures based on the sum of the contradiction points between any two object pictures.

Therefore, by determining whether the part corresponding to the local feature is a contradiction point, the accuracy of determining the contradiction point is improved, and the number of the contradiction points between the two corresponding object pictures is obtained based on the sum of the contradiction points between any two object pictures, so that the accuracy of determining the number of the contradiction points between any two object pictures is improved, and the accuracy of file cleaning is improved.

In some disclosed embodiments, the screening sub-module includes an acquisition unit, a determination unit, and a screening unit. The acquisition unit is used for acquiring a first picture which does not contain the target part in the first cluster and acquiring a second picture which contains the target part in the first cluster; the determining unit is used for obtaining first weights of different objects of the first picture and each second picture based on the number of contradiction points between the first picture and each second picture; the screening unit is used for screening the object pictures in the first class cluster based on the first weight to obtain a second class cluster.

In some disclosed embodiments, the screening unit includes a fusion subunit, a judgment subunit, a first response subunit, and a second response subunit. The fusion subunit is used for fusing the first weights to obtain second weights; the judging subunit is used for judging whether the second weight is smaller than a second threshold value; the first response subunit is configured to attribute the corresponding first picture to a second class cluster in response to the second weight being less than a second threshold; the second type cluster at least comprises a second picture containing a target part in the first type cluster; and the second response subunit is used for rejecting the corresponding first picture in response to the second weight being not smaller than a second threshold.

Therefore, by judging whether the second weight is smaller than the second threshold, when the second weight is smaller than the second threshold, the corresponding first picture is classified into the second class cluster, and when the second weight is not smaller than the second threshold, the corresponding first picture is removed, so that noise in the file cleaning process is reduced, and the file cleaning accuracy is improved.

In some disclosed embodiments, the picture clustering module 72 includes an extraction sub-module, a clustering sub-module, a screening sub-module, and a merging sub-module. The extraction submodule is used for extracting features of the object pictures in the suspected impurity files to obtain first features; the clustering submodule is used for clustering object pictures in the suspected impure files based on the first characteristics to obtain a plurality of third class clusters; the screening submodule is used for screening the object pictures in the third class cluster based on the detection result of whether the object pictures in the third class cluster contain the target part or not to obtain a fourth class cluster; the merging sub-module is used for merging different fourth-class clusters meeting the first condition to obtain a plurality of first-class clusters.

Therefore, clustering object pictures in suspected impure files based on the first characteristics to obtain a plurality of third class clusters, screening the object pictures in the third class clusters based on the detection result of whether the object pictures in the third class clusters contain target parts or not to obtain fourth class clusters, improving the screening accuracy of the fourth class clusters, combining different fourth class clusters meeting the first conditions to obtain a plurality of first class clusters, improving the clustering effect of the first class clusters, enabling the difference between the object pictures in the first class clusters to be as small as possible, and further improving the file cleaning accuracy.

In some disclosed embodiments, the clustering submodule includes a first determination unit, a distance calculation unit, a cluster-like merging unit, and a second determination unit. The first determining unit is used for taking each object picture in the suspected impure file as a fifth class cluster; the distance calculation unit is used for calculating first distances among different fifth-class clusters based on the first features; the class cluster merging unit is used for merging the corresponding fifth class clusters in response to the first distance meeting the second condition to obtain a first merged class cluster; the second determining unit is configured to use the first merged cluster and a fifth cluster that is not merged in the suspected impurity file as a third cluster.

Therefore, the first distances among different fifth class clusters are obtained through calculation, when the first distances meet the second condition, the corresponding class clusters are combined, so that the efficiency of combining the fifth class clusters is improved, and the first combined class cluster and the fifth class clusters which are not combined in the suspected impure file are used as the third class clusters, so that the accuracy of the third class clusters is improved.
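The merging of fifth-class clusters can be sketched as follows. The second condition is not spelled out in this passage, so the sketch assumes it to be "the first distance is smaller than a fixed threshold" and uses the Euclidean distance between first features; both are assumptions, as are the function and parameter names. Each object picture starts as its own fifth-class cluster, pairs meeting the condition are merged with a union-find structure, and the merged clusters together with the remaining singletons form the third-class clusters.

```python
import math
from typing import Dict, List, Sequence


def cluster_by_first_distance(
    features: Sequence[Sequence[float]],
    max_distance: float,
) -> List[List[int]]:
    """Group picture indices whose first features are closer than max_distance."""
    n = len(features)
    parent = list(range(n))  # every picture starts as its own fifth-class cluster

    def find(i: int) -> int:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Assumed second condition: first distance below the threshold.
            if math.dist(features[i], features[j]) < max_distance:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Hypothetical 2-D first features for five object pictures.
first_features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]]
third_clusters = cluster_by_first_distance(first_features, max_distance=0.5)
```

In this example pictures 0 and 1 merge, pictures 2 and 3 merge, and picture 4 remains an unmerged fifth-class cluster.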

In some disclosed embodiments, the screening submodule includes a first response unit and a second response unit. The first response unit is used for taking the third-class cluster as a fourth-class cluster in response to the object pictures in the third-class cluster containing the target part; the second response unit is used for eliminating the third-class cluster in response to the object pictures in the third-class cluster not containing the target part.

Therefore, by judging whether the object pictures contain the target part and eliminating the third-class clusters whose object pictures do not contain the target part, the efficiency of file cleaning is improved.

In some disclosed embodiments, the merging submodule includes an extraction unit, a calculation unit, a merging unit, and a determination unit. The extraction unit is used for extracting features from a target part picture to obtain a second feature, the target part picture being obtained by cropping an object picture in the suspected impure file; the calculation unit is used for calculating a second distance between different fourth-class clusters based on the second features; the merging unit is used for merging the corresponding fourth-class clusters to obtain a second merged cluster in response to the second distance meeting the first condition; the determination unit is used for taking the second merged cluster and the fourth-class clusters that are not merged as the first-class clusters.

Therefore, the second distance between different fourth-class clusters is calculated based on the second features, and when the second distance meets the first condition, the corresponding fourth-class clusters are merged to obtain a second merged cluster, which helps improve the likelihood that the object pictures in the merged cluster belong to the same object; the second merged cluster and the fourth-class clusters that are not merged are taken as the first-class clusters, which helps improve the accuracy of cluster merging and further improves the accuracy and stability of file cleaning.

In some disclosed embodiments, the first screening module 71 includes an acquisition sub-module, a fusion sub-module, and a determination sub-module. The acquisition sub-module is used for acquiring a first abnormal value of an abnormal factor in the file to be processed, wherein the abnormal factor represents an abnormal dimension of the file to be processed; the first outlier characterizes the degree of abnormality of the file to be processed in the dimension of the corresponding outlier; the fusion submodule is used for fusing the first abnormal value to obtain a second abnormal value; the determining submodule is used for determining whether the file to be processed is a suspected impure file or not based on whether the second abnormal value meets a third condition.

Therefore, the first abnormal values of the abnormal factors in the file to be processed are obtained and fused to obtain the second abnormal value. Determining first abnormal values for the abnormal factors in the file to be processed improves the diversity of the first abnormal values, and determining the second abnormal value by fusing the first abnormal values improves the accuracy of the second abnormal value. Determining whether the file to be processed is a suspected impure file based on whether the second abnormal value meets the third condition improves the accuracy of screening the files to be processed and of the resulting suspected impure files, which further improves the accuracy of file cleaning.
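The screening performed by the first screening module can be sketched as follows. This passage does not specify the fusion rule or the third condition, so the sketch assumes a weighted mean as the fusion and "second abnormal value not smaller than a threshold" as the third condition; the factor names, weights, and threshold are placeholders rather than values from the application.

```python
from typing import Dict, Optional


def fuse_abnormal_values(
    first_values: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Fuse per-factor first abnormal values into a second abnormal value (weighted mean)."""
    if not first_values:
        raise ValueError("at least one abnormal factor is required")
    if weights is None:
        weights = {name: 1.0 for name in first_values}  # equal weights by default
    total_weight = sum(weights[name] for name in first_values)
    return sum(first_values[name] * weights[name] for name in first_values) / total_weight


def is_suspected_impure(
    first_values: Dict[str, float],
    threshold: float = 0.5,
    weights: Optional[Dict[str, float]] = None,
) -> bool:
    """Assumed third condition: the second abnormal value reaches the threshold."""
    return fuse_abnormal_values(first_values, weights) >= threshold


# Hypothetical first abnormal values for two files to be processed.
file_one = {"factor_a": 0.9, "factor_b": 0.7, "factor_c": 0.2}
file_two = {"factor_a": 0.1, "factor_b": 0.2}
flags = [is_suspected_impure(file_one), is_suspected_impure(file_two)]
```

Here the first file fuses to about 0.6 and is flagged as a suspected impure file, while the second fuses to about 0.15 and is not.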

In some disclosed embodiments, archive purification module 74 includes a selection sub-module and a determination sub-module. The selecting submodule is used for selecting one of the second class clusters with the largest number of object pictures as a purification file; the determining submodule is used for respectively taking each second class cluster as a purification file.

Therefore, the second cluster is selected to obtain the purified file, so that the cleaning efficiency of the purified file is improved; in addition, the files are selectively purified, and the applicability of the cleaning files is further improved.

Referring to fig. 8, fig. 8 is a schematic framework diagram of an embodiment of an electronic device according to the application. The electronic device 80 comprises a memory 81 and a processor 82 coupled to each other, the memory 81 having program instructions stored therein, and the processor 82 being adapted to execute the program instructions to implement the steps of any of the above-described embodiments of the file cleaning method. In particular, the electronic device 80 may include, but is not limited to, desktop computers, notebook computers, servers, cell phones, tablet computers, and the like, which is not limited herein.

Specifically, the processor 82 is configured to control itself and the memory 81 to implement the steps of any of the above-described embodiments of the file cleaning method. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be jointly implemented by integrated circuit chips.

In the above scheme, the electronic device 80 may be configured to implement the steps in any of the above embodiments of the file cleaning method. First, the suspected impure files are obtained by screening the plurality of files to be processed, and the object pictures in the suspected impure files are clustered to obtain a plurality of first-class clusters, which helps improve the accuracy of clustering the object pictures into the first-class clusters and thus the efficiency of file cleaning; second, the object pictures in the first-class cluster are screened based on the similarity between local features of the same part in different object pictures in the first-class cluster to obtain the second-class cluster, so that the object pictures in the first-class cluster are screened again by means of the local features, which further improves the accuracy of file cleaning; third, the purified file is obtained by selection based on the second-class cluster, which helps improve the applicability of the file cleaning method. Therefore, the accuracy of file cleaning can be improved.

Referring to fig. 9, fig. 9 is a schematic diagram of a frame of an embodiment of a computer readable storage medium according to the present application. The computer readable storage medium 90 stores program instructions 91 that can be executed by a processor, the program instructions 91 being used to implement the steps of any of the above-described embodiments of the method for cleaning a file.

In the above scheme, the computer-readable storage medium 90 may be used to implement the steps in any of the above embodiments of the file cleaning method. First, the suspected impure files are obtained by screening the plurality of files to be processed, and the object pictures in the suspected impure files are clustered to obtain a plurality of first-class clusters, which helps improve the accuracy of clustering the object pictures into the first-class clusters and thus the efficiency of file cleaning; second, the object pictures in the first-class cluster are screened based on the similarity between local features of the same part in different object pictures in the first-class cluster to obtain the second-class cluster, so that the object pictures in the first-class cluster are screened again by means of the local features, which further improves the accuracy of file cleaning; third, the purified file is obtained by selection based on the second-class cluster, which helps improve the applicability of the file cleaning method. Therefore, the accuracy of file cleaning can be improved.

In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.

The foregoing description of the various embodiments focuses on the differences between them; for parts that are the same or similar, the embodiments may be referred to one another, and such parts are not repeated herein for the sake of brevity.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

If the technical scheme of the application involves personal information, a product applying the technical scheme clearly informs the individual of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical scheme of the application involves sensitive personal information, a product applying the technical scheme obtains the individual's separate consent before processing the sensitive personal information, and also meets the requirement of 'explicit consent'. For example, a clear and conspicuous sign is set at a personal information acquisition device such as a camera to inform individuals that they are entering the personal information acquisition range and that personal information will be acquired, and an individual who voluntarily enters the acquisition range is regarded as consenting to the acquisition of his or her personal information; or, on a device that processes personal information, personal authorization is obtained by means of a pop-up message or by asking the individual to upload his or her personal information, provided that the personal information processing rules are communicated by means of obvious signs or information. The personal information processing rules may include information such as the personal information processor, the purpose of the personal information processing, the processing mode, and the types of personal information to be processed.

Claims

1. A method for cleaning a file, comprising:

screening a plurality of files to be processed to obtain suspected impure files; wherein the object pictures in the suspected impure files are suspected to belong to different objects;

clustering the object pictures in the suspected impure files to obtain a plurality of first clusters;

screening the object pictures in the first type of clusters based on the similarity between local features of the same parts in different object pictures in the first type of clusters to obtain a second type of clusters;

and obtaining a purified archive based on the second class cluster.

2. The method of claim 1, wherein the screening the object pictures in the first type of clusters based on the similarity between local features of the same parts in different object pictures in the first type of clusters to obtain a second type of clusters includes:

calculating the number of contradiction points between any two object pictures based on the similarity between local features of the same parts in different object pictures in the first class cluster; the number of the contradiction points represents the number of parts suspected to belong to different objects in the corresponding two object pictures;

and screening the object pictures in the first class cluster based on the number of the contradiction points to obtain the second class cluster.

3. The method of claim 2, wherein the calculating the number of contradiction points between any two object pictures based on the similarity between local features of the same part in different object pictures in the first-type cluster comprises:

determining whether the similarity between the local features of the same part in any two object pictures is smaller than a first threshold;

in response to the similarity between the local features being smaller than the first threshold, determining that the part corresponding to the local features is a contradiction point; and

obtaining the number of contradiction points between the two corresponding object pictures based on the sum of the contradiction points between the two object pictures.
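
By way of illustration only, the counting of contradiction points in claim 3 can be sketched as follows. The claims do not fix the similarity measure, the part names, or the threshold value; the cosine similarity, the dictionary layout keyed by part name, and the default `first_threshold` of 0.5 are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_contradiction_points(features_a, features_b, first_threshold=0.5):
    """Count the parts whose local features disagree between two object pictures.

    features_a / features_b map a part name to its local feature vector
    (hypothetical layout). Only parts present in both pictures are compared.
    """
    shared_parts = features_a.keys() & features_b.keys()
    return sum(
        1
        for part in shared_parts
        # A part is a contradiction point when its similarity is below the first threshold.
        if cosine_similarity(features_a[part], features_b[part]) < first_threshold
    )


picture_a = {"face": [1.0, 0.0], "upper_body": [1.0, 0.0], "lower_body": [0.0, 1.0]}
picture_b = {"face": [1.0, 0.0], "upper_body": [0.0, 1.0], "shoes": [1.0, 1.0]}
print(count_contradiction_points(picture_a, picture_b))
```

In this example only the two parts visible in both pictures are compared, so a part that is missing from one picture never counts as a contradiction point.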

4. The method of claim 2, wherein the screening the object pictures in the first-type cluster based on the number of contradiction points to obtain the second-type cluster comprises:

acquiring a first picture in the first-type cluster that does not contain a target part, and acquiring second pictures in the first-type cluster that contain the target part;

obtaining, based on the number of contradiction points between the first picture and each second picture, a first weight indicating that the first picture and that second picture belong to different objects; and

screening the object pictures in the first-type cluster based on the first weights to obtain the second-type cluster.

5. The method of claim 4, wherein the screening the object pictures in the first-type cluster based on the first weights to obtain the second-type cluster comprises:

fusing the first weights to obtain a second weight;

determining whether the second weight is smaller than a second threshold;

in response to the second weight being smaller than the second threshold, assigning the corresponding first picture to the second-type cluster, wherein the second-type cluster comprises at least the second pictures in the first-type cluster that contain the target part; and

in response to the second weight being not smaller than the second threshold, rejecting the corresponding first picture.
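
As a non-limiting sketch, the fusion and thresholding of claim 5 might look like the following. The claims do not state how the first weights are fused; taking their arithmetic mean, the picture identifiers, and the default `second_threshold` of 0.5 are assumptions of this example, and each first picture is assumed to have at least one first weight.

```python
from statistics import mean


def screen_first_pictures(first_weights_by_picture, second_threshold=0.5):
    """Split first pictures into kept and rejected lists.

    first_weights_by_picture maps a first-picture id (hypothetical) to the list of
    first weights obtained against each second picture. The weights are fused by
    their mean (an assumed fusion rule) to give the second weight.
    """
    kept, rejected = [], []
    for picture_id, first_weights in first_weights_by_picture.items():
        second_weight = mean(first_weights)
        if second_weight < second_threshold:
            kept.append(picture_id)      # assigned to the second-type cluster
        else:
            rejected.append(picture_id)  # not smaller than the threshold: rejected
    return kept, rejected


weights = {"p1": [0.0, 0.25], "p2": [0.75, 1.0], "p3": [0.5, 0.5]}
kept, rejected = screen_first_pictures(weights)
print(kept, rejected)
```

A second weight exactly equal to the second threshold leads to rejection, matching the "not smaller than" branch of the claim.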

6. The method of claim 1, wherein the clustering the object pictures in the suspected impure files to obtain a plurality of first-type clusters comprises:

performing feature extraction on the object pictures in the suspected impure files to obtain first features;

clustering the object pictures in the suspected impure files based on the first features to obtain a plurality of third-type clusters;

screening the object pictures in the third-type clusters based on a detection result of whether the object pictures in the third-type clusters contain a target part, to obtain fourth-type clusters; and

merging the fourth-type clusters that satisfy a first condition to obtain the plurality of first-type clusters.

7. The method of claim 6, wherein the clustering the object pictures in the suspected impure files based on the first features to obtain a plurality of third-type clusters comprises:

taking each object picture in the suspected impure file as a fifth-type cluster;

calculating a first distance between different fifth-type clusters based on the first features;

merging the fifth-type clusters whose first distance satisfies a second condition to obtain a first merged cluster; and

taking the first merged cluster and the unmerged fifth-type clusters in the suspected impure file as the third-type clusters.
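
For illustration, the bottom-up merging of claim 7 can be sketched as below. The distance measure and the second condition are not specified in the claims; this sketch assumes Euclidean distance and treats "first distance smaller than `max_distance`" as the second condition, with merging tracked by a simple union-find structure.

```python
import math
from itertools import combinations


def cluster_pictures(first_features, max_distance=0.5):
    """Group picture indices whose first features are close to each other.

    Each picture starts as its own cluster; any two clusters containing pictures
    closer than max_distance (assumed second condition) are merged.
    Returns a sorted list of clusters, each a list of picture indices.
    """
    n = len(first_features)
    parent = list(range(n))  # every picture starts as its own cluster

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if math.dist(first_features[i], first_features[j]) < max_distance:
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]]
print(cluster_pictures(features))
```

Pictures that are never merged remain as single-picture clusters in the result, which corresponds to the unmerged clusters retained by the last step of the claim.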

8. The method of claim 6, wherein the screening the object pictures in the third-type clusters based on a detection result of whether the object pictures in the third-type clusters contain a target part to obtain fourth-type clusters comprises:

in response to the object pictures in the third-type cluster containing the target part, taking the third-type cluster as a fourth-type cluster; and

in response to the object pictures in the third-type cluster not containing the target part, eliminating the third-type cluster.

9. The method of claim 6, wherein the merging the fourth-type clusters that satisfy a first condition to obtain the plurality of first-type clusters comprises:

performing feature extraction on target part pictures to obtain second features, wherein each target part picture is obtained by cropping the target part from an object picture in the suspected impure file;

calculating a second distance between different fourth-type clusters based on the second features;

merging the fourth-type clusters whose second distance satisfies the first condition to obtain a second merged cluster; and

taking the second merged cluster and the unmerged fourth-type clusters as the first-type clusters.

10. The method of claim 1, wherein the screening a plurality of files to be processed to obtain suspected impure files comprises:

acquiring a first abnormal value of each anomaly factor of a file to be processed, wherein each anomaly factor represents an abnormality dimension of the file to be processed, and the first abnormal value represents the degree of abnormality of the file to be processed in the dimension corresponding to that anomaly factor;

fusing the first abnormal values to obtain a second abnormal value; and

determining whether the file to be processed is a suspected impure file based on whether the second abnormal value satisfies a third condition.
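
As one possible reading, the screening of claim 10 can be sketched as follows. The claims do not name the anomaly factors, the fusion rule, or the third condition; the factor names, the weighted average, and the rule "second abnormal value not less than `threshold`" are all assumptions made for this example.

```python
def second_abnormal_value(first_abnormal_values, factor_weights):
    """Fuse per-factor abnormal values by a weighted average (assumed fusion rule).

    first_abnormal_values maps an anomaly factor name (hypothetical) to its first
    abnormal value; factor_weights maps the same names to non-negative weights.
    """
    total_weight = sum(factor_weights[f] for f in first_abnormal_values)
    weighted_sum = sum(v * factor_weights[f] for f, v in first_abnormal_values.items())
    return weighted_sum / total_weight


def is_suspected_impure(first_abnormal_values, factor_weights, threshold=0.5):
    """Assumed third condition: the fused value reaches or exceeds the threshold."""
    return second_abnormal_value(first_abnormal_values, factor_weights) >= threshold


# Hypothetical anomaly factors and weights, for illustration only.
weights = {"capture_location_spread": 2.0, "picture_count": 1.0}
file_a = {"capture_location_spread": 0.9, "picture_count": 0.3}
file_b = {"capture_location_spread": 0.1, "picture_count": 0.2}
print(is_suspected_impure(file_a, weights), is_suspected_impure(file_b, weights))
```

Only files flagged by this step would go on to the clustering of claim 1, so the threshold trades off the amount of later computation against the risk of missing an impure file.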

11. The method of claim 1, wherein the obtaining a purified file based on the second-type clusters comprises:

selecting the second-type cluster containing the largest number of object pictures as the purified file; or

taking each second-type cluster as a purified file.
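
The two alternatives of claim 11 can be sketched briefly as follows. The function name, the `keep_largest_only` flag, and the representation of a cluster as a list of picture identifiers are hypothetical choices for this example.

```python
def purified_files(second_type_clusters, keep_largest_only=True):
    """Return the purified files derived from the second-type clusters.

    With keep_largest_only=True, only the cluster holding the most object pictures
    is kept; otherwise every cluster becomes its own purified file.
    """
    if not second_type_clusters:
        return []
    if keep_largest_only:
        # max() returns the first cluster of maximal size when sizes tie.
        return [list(max(second_type_clusters, key=len))]
    return [list(cluster) for cluster in second_type_clusters]


clusters = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
print(purified_files(clusters))
print(purified_files(clusters, keep_largest_only=False))
```

The first alternative yields a single purified file and discards the smaller clusters, while the second keeps every cluster as a separate file.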

12. A file cleaning device, comprising:

a first screening module, configured to screen a plurality of files to be processed to obtain suspected impure files, wherein object pictures in the suspected impure files are suspected to belong to different objects;

a picture clustering module, configured to cluster the object pictures in the suspected impure files to obtain a plurality of first-type clusters;

a second screening module, configured to screen the object pictures in the first-type clusters based on the similarity between local features of the same part in different object pictures in the first-type clusters, to obtain second-type clusters; and

a file purification module, configured to obtain a purified file based on the second-type clusters.

13. An electronic device, comprising a memory and a processor coupled to each other, the memory storing program instructions, and the processor being configured to execute the program instructions to implement the file cleaning method of any one of claims 1 to 11.

14. A computer-readable storage medium, storing program instructions executable by a processor, the program instructions being used to implement the file cleaning method of any one of claims 1 to 11.