WO2015184992A1

WO2015184992A1 - Method for recognizing duplicate image, and image search and deduplication method and device thereof

Info

Publication number: WO2015184992A1
Application number: PCT/CN2015/080713
Authority: WO
Inventors: 朱茂清; 韩玉刚
Original assignee: 北京奇虎科技有限公司; 奇智软件（北京）有限公司
Priority date: 2014-06-05
Filing date: 2015-06-03
Publication date: 2015-12-10
Also published as: CN103984776A; CN103984776B

Abstract

The present invention provides a method for recognizing a duplicate image, and an image search and deduplication method and a device thereof. The method comprises: segmenting a Phash value of an image to be recognized to obtain Phash subvalues, comparing each Phash subvalue with Phash subvalues of each image saved in a chaining database, and determining whether the image to be recognized is a duplicate of other images when the Phash subvalues of other images in the chaining database are hit. In an embodiment of the present invention, a Phash value of an image is segmented to obtain multiple Phash subvalues, and the Phash value of the image is compared with hit Phash values of other images when a certain Phash subvalue hits Phash subvalues of other images in a chaining database. Therefore, the accuracy of duplicate image recognition is ensured, and the efficiency of duplicate image recognition is effectively improved.

Description

Method for identifying repeated pictures, image search deduplication method and device thereof

Technical field

The present invention relates to the field of picture recognition technologies, and in particular, to a method for identifying a repeated picture, a method for de-duplicating a picture search, and a device thereof.

Background technique

After a search is performed based on the user's input, it is generally necessary, in order to improve the user experience and the accuracy of the search results, to deduplicate the pictures found, that is, to identify identical pictures in the search results.

In the prior art, identical pictures in the search results are identified by a simple judgment of whether the content of the pictures is the same or whether the link addresses of the pictures are the same. However, the content of copies of the same picture is sometimes not exactly the same, and the link addresses of the same picture may also differ, so the above method cannot achieve a good recognition effect.

To achieve a better recognition effect, a series of feature quantization processes can be performed on the pictures before deduplication. Although this method can achieve an ideal recognition effect, it takes a long time and cannot meet the real-time requirements of picture search.

In addition, identical pictures can also be identified by comparing the Phash values of the pictures. However, this method needs to compare the Phash values of every pair of pictures, which is also very time consuming for massive search results, so the real-time nature of picture search cannot be guaranteed.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method for identifying a repeated picture, a method for picture search deduplication, and a device thereof, which overcome the above problems or at least partially solve them.

According to an aspect of the present invention, an embodiment of the present invention provides a method for identifying a repeated picture, the method comprising:

Determining a Phash value of the to-be-identified picture, and segmenting the Phash value to obtain each Phash score after the segmentation;

Determining whether each Phash score of the segmented picture to be identified hits a Phash score of another segmented picture in the zipper database;

When a Phash score of the picture to be identified hits the Phash score of another picture in the zipper database, determining whether the picture to be identified is a duplicate of the other pictures in the zipper database;

Otherwise, saving the information of the picture to be identified in the zipper database.

In accordance with another aspect of the present invention, an embodiment of the present invention provides a method for de-duplication of a picture search, the method comprising:

Receiving a query word input by the user, and searching for a picture resource that matches the query word input by the user;

Removing duplicate pictures from the picture resources;

Returning the picture resource results, with the duplicate pictures removed, to the user;

Wherein the duplicate pictures in the picture resources are removed by using the above method for identifying a repeated picture.

According to still another aspect of the present invention, an embodiment of the present invention provides an apparatus for identifying a repeated picture, the apparatus comprising:

a segmentation module, configured to determine a Phash value of the to-be-identified picture, segment the Phash value, and obtain each Phash score after the segmentation;

a zipper database for storing Phash values of pictures and each Phash score;

a judging module, configured to determine whether each Phash score of the segmented picture to be identified hits a Phash score of another segmented picture in the zipper database;

a comparison identification module, configured to: when the judging module determines that a Phash score of the picture to be identified hits the Phash score of another picture in the zipper database, determine whether the picture to be identified is a duplicate of the other pictures in the zipper database; and when the judging module determines that every Phash score of the picture to be identified misses the Phash scores of the other pictures in the zipper database, save the information of the picture to be identified into the zipper database.

According to still another aspect of the present invention, an embodiment of the present invention provides an apparatus for image search deduplication, the apparatus comprising:

a receiving and search module, configured to receive a query word input by the user, and search for picture resources that match the query word input by the user;

a de-duplication module, configured to remove duplicate pictures from the picture resources;

a providing module, configured to return the picture resource results, with the duplicate pictures removed, to the user.

According to still another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of identifying a duplicate picture according to any of the above, and/or the method of picture search deduplication described above.

According to still another aspect of the present invention, a computer readable medium is provided, wherein the computer program described above is stored.

An embodiment of the present invention provides a method for identifying a repeated picture, a method for picture search deduplication, and a device thereof. In this method, the Phash value of a picture to be identified is segmented to obtain each Phash score, and each Phash score of the picture to be identified is compared with the Phash scores of each picture saved in the zipper database. When a Phash score of the picture to be identified hits the Phash score of another picture in the zipper database, it is determined whether the picture to be identified is a duplicate of that picture. In the embodiment of the present invention, the Phash value of the picture is segmented to obtain multiple Phash scores, and only when a Phash score hits the Phash score of another picture in the zipper database is the Phash value of the picture compared with the Phash value of the hit picture. This ensures the accuracy of repeated picture recognition and also effectively improves its efficiency.

The above description is only an overview of the technical solutions of the present invention. So that the above and other objects, features and advantages of the present invention can be more clearly understood, specific embodiments of the invention are set forth below.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art from the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a schematic diagram of a process for identifying a duplicate picture according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 2 of the present invention;

FIG. 4 is a schematic diagram of a process of de-duplication of image search according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for identifying a duplicate picture according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for image search deduplication according to an embodiment of the present invention;

FIG. 7 is a schematic block diagram of a computing device for performing a method of identifying duplicate pictures in accordance with the present invention, and/or a method of picture search deduplication;

FIG. 8 schematically shows a storage unit for holding or carrying program code implementing a method of identifying a duplicate picture according to the present invention, and/or a method of picture search deduplication.

Detailed description

The invention is further described below in conjunction with the drawings and specific embodiments.

In order to ensure the accuracy of the same picture recognition and improve the recognition efficiency of the same picture, the embodiment of the present invention provides a method for identifying a repeated picture, a method for de-duplication of a picture search, and a device thereof.

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be more fully understood and so that its scope can be fully conveyed to those skilled in the art.

The embodiments of the present invention will be described below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a process for identifying a duplicate picture according to an embodiment of the present invention, where the process includes the following steps:

S101: Determine a Phash value of the to-be-identified picture, and segment the Phash value to obtain each Phash score after the segmentation.

In the process of identifying identical pictures, a preset method is used to segment the Phash value of every picture, so that the Phash value of each picture is segmented in the same manner, which facilitates the subsequent comparison of Phash scores.
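As a minimal sketch of step S101, the following Python splits a 64-bit Phash value into equal, contiguous segments. The function name `split_phash`, the choice of four 16-bit segments, and the most-significant-first layout are assumptions made for illustration; the text only requires that every picture be segmented in the same way.

```python
def split_phash(phash: int, n_segments: int = 4, total_bits: int = 64) -> list[int]:
    """Split a total_bits-wide hash into n_segments equal contiguous parts.

    The most significant segment comes first (an assumed layout).
    """
    if total_bits % n_segments != 0:
        raise ValueError("total_bits must be divisible by n_segments")
    seg_bits = total_bits // n_segments
    mask = (1 << seg_bits) - 1
    return [
        (phash >> (total_bits - seg_bits * (i + 1))) & mask
        for i in range(n_segments)
    ]


scores = split_phash(0x0123456789ABCDEF)
print([hex(s) for s in scores])  # ['0x123', '0x4567', '0x89ab', '0xcdef']
```

Because the same function is applied to every picture, the scores of two pictures can be compared directly.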

S102: Determine whether each Phash score of the segmented picture to be identified hits a Phash score of another segmented picture in the zipper database. When the determination result is yes, proceed to step S103; otherwise, proceed to step S104.

A Phash score of the segmented picture to be identified hitting a Phash score of another segmented picture in the zipper database includes:

The Phash score of the segmented picture to be identified is the same as a Phash score of another segmented picture in the zipper database.

After the Phash value of the picture to be identified is segmented, multiple Phash scores are obtained, for example P1, P2, P3, ..., Pn, and each picture in the zipper database likewise has n corresponding Phash scores. When the judgment is made, the Phash score P1 of the picture to be identified is compared with every Phash score of every picture in the zipper database, to determine whether P1 hits a Phash score of another picture in the zipper database, that is, whether P1 is the same as some Phash score of some picture in the zipper database. For example, when the Phash score P1 of the picture to be identified is the same as a Phash score of the second picture in the zipper database, it is determined that P1 hits that Phash score of the second picture.

The other Phash scores of the picture to be identified are processed in the same way, to determine whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database.
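The hit test of step S102 can be sketched as a dictionary lookup. The layout below, a mapping from a score value to the IDs of stored pictures that have that score, is an assumption; `zipper_index`, `add_picture` and `find_hits` are hypothetical names, not taken from the text.

```python
from collections import defaultdict

# Assumed layout: score value -> IDs of stored pictures having that score.
zipper_index = defaultdict(list)


def add_picture(pic_id: str, scores: list[int]) -> None:
    """Register every distinct score of a stored picture as a key."""
    for s in set(scores):
        zipper_index[s].append(pic_id)


def find_hits(scores: list[int]) -> list[str]:
    """Return IDs of stored pictures sharing at least one score, in first-hit order."""
    hits = []
    for s in scores:
        for pic_id in zipper_index.get(s, []):
            if pic_id not in hits:
                hits.append(pic_id)
    return hits


add_picture("pic2", [0x0123, 0x4567, 0x89AB, 0xCDEF])
add_picture("pic3", [0x1111, 0x0123, 0x2222, 0x3333])
print(find_hits([0x0123, 0xAAAA, 0xBBBB, 0xCCCC]))  # ['pic2', 'pic3']
print(find_hits([0x9999, 0xAAAA, 0xBBBB, 0xCCCC]))  # []
```

Keying by score value alone means a score can hit a score at a different position in another picture; whether the position should also be part of the key is left open here.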

S103: Determine whether the picture to be identified is a duplicate of other pictures in the zipper database.

Specifically, determining whether the picture to be identified is a duplicate of other pictures in the zipper database includes:

For each other picture in the zipper database whose Phash score is hit, determining, according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture, whether the picture to be identified is a duplicate of that picture.

When comparing the Hamming distance between the Phash value of the picture to be identified and the Phash value of a hit picture, a comparison threshold may be set. When the Hamming distance between the two Phash values is less than the set comparison threshold, the picture to be identified and the hit picture are determined to be duplicate pictures; otherwise, they are determined not to be duplicates.
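The comparison in step S103 can be illustrated as follows. The Hamming distance between two integer hash values is the number of set bits in their XOR. The threshold value of 5 is an assumption chosen only for the example; the text does not specify one.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def is_duplicate(phash_a: int, phash_b: int, threshold: int = 5) -> bool:
    """Duplicate when the distance is strictly less than the threshold (assumed value)."""
    return hamming_distance(phash_a, phash_b) < threshold


print(hamming_distance(0b1011, 0b0010))  # 2
print(is_duplicate(0xFF00, 0xFF01))      # True
print(is_duplicate(0x00, 0xFF))          # False
```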

S104: Save the information of the to-be-identified picture into the zipper database.

When every Phash score of the picture to be identified misses the Phash scores of every picture in the zipper database, it is determined that the zipper database contains no duplicate of the picture to be identified. In this case, or when comparison of the Hamming distances determines that the picture to be identified is not a duplicate of any hit picture, the information of the picture to be identified is added to the zipper database to facilitate subsequent recognition of repeated pictures. The pictures stored in the zipper database are therefore all distinct from one another.

In the embodiment of the present invention, the Phash value of the picture is segmented to obtain multiple Phash scores, and only when a Phash score hits the Phash score of another picture in the zipper database is the Phash value of the picture compared with the Phash value of the hit picture. This ensures the accuracy of repeated picture recognition and also effectively improves its efficiency.

In the embodiment of the present invention, the zipper database is used to store information of a picture, including the Phash value of the picture and the Phash scores of the picture. The zipper database may also store identification information of the picture, such as the sequence number of the picture in the search results, or the serial number of the picture within the whole identification process.

When repeated picture recognition begins for the results of a picture search, the zipper database is empty. During the subsequent recognition process, the information of each picture that is found not to duplicate the pictures already saved is itself saved in the zipper database, so every picture stored in the zipper database can be considered different. Specifically, the zipper database may use each Phash score of a picture as one of several key values of that picture, and save the Phash value of the picture as the zipper data of the picture. In addition, to distinguish each picture and reduce the amount of data stored, the zipper database may also save the ID of the picture in the search results, or assign each picture an ID in the zipper database according to the order in which the pictures are saved to it.

After multiple pictures are found based on the user's input, the Phash value of each picture is determined, and the Phash value is segmented according to a preset method to obtain each Phash score. When segmenting the Phash value, any approach may be used as long as every picture adopts the same segmentation mode: the Phash value may be segmented directly, segmented using the idea of sequential shifting, or sampled at intervals to determine each Phash score. The number of bits in each Phash score can also be chosen arbitrarily, as long as the same method is used to determine the Phash scores of different pictures.

Specifically, the Phash value of a picture is a 64-bit value. In the embodiment of the present invention, 21 bits of the Phash value can be selected. To reduce the number of comparisons, the 21 bits are divided into three unit segments of 8 bits, 7 bits and 6 bits respectively. Using the idea of sequential shifting, 512 items of corresponding mask data are finally generated, 506 of which remain after deduplication; that is, each data item is assigned to at most 506 zippers, and the maximum number of zippers generated is 2^8 * 2^7 * 2^6 = 2^21, about 2 million (200w).

The following is a specific example to illustrate the process of segmenting the Phash value.

Considering the timeliness of news data, the estimated maximum data volume is about 2 million (200w) items, since data expires, and 21 bits of the Phash value are used. To reduce the number of comparisons, the 21 bits are divided into 3 units of 8 bits, 7 bits and 6 bits. Using the idea of sequential shifting, 512 items of corresponding mask data are generated, 506 of which remain after deduplication; that is, each data item is assigned to at most 506 zippers, and the maximum number of zippers is 2^8 * 2^7 * 2^6 = 2^21, about 2 million (200w). Each zipper is located in the zipper database. In the embodiment of the invention, each Phash score can serve as the key value of a zipper.
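The bit arithmetic in this example can be checked with a short sketch. It splits a 21-bit value into 8-bit, 7-bit and 6-bit fields and computes the upper bound on the number of zippers. The field order (high bits first) and the name `split_21bit` are assumptions; the sequential-shift generation of the 512 masks is not reproduced here, since the text does not describe it in enough detail.

```python
def split_21bit(value: int) -> tuple[int, int, int]:
    """Split a 21-bit value into 8-, 7- and 6-bit fields, high bits first (assumed order)."""
    if not 0 <= value < (1 << 21):
        raise ValueError("value must fit in 21 bits")
    high = (value >> 13) & 0xFF  # top 8 bits
    mid = (value >> 6) & 0x7F    # next 7 bits
    low = value & 0x3F           # last 6 bits
    return high, mid, low


# Upper bound on distinct (8-bit, 7-bit, 6-bit) combinations.
max_zippers = 2**8 * 2**7 * 2**6
print(max_zippers)                  # 2097152
print(split_21bit((1 << 21) - 1))   # (255, 127, 63)
```

The result 2,097,152 is 2^21, which matches the "about 2 million" figure in the text.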

After each Phash score of the picture to be identified is obtained, each of them is compared with the Phash scores of every picture saved in the zipper database, to determine whether it hits the Phash score of another picture in the zipper database. The zipper database stores information about multiple pictures, including the Phash scores and the Phash value of each picture. Specifically, for each Phash score of the picture to be identified, it is determined whether any Phash score of a picture saved in the zipper database is the same as that Phash score; when they are the same, the Phash score of the picture to be identified is considered to hit the Phash score of that other picture in the zipper database.

Each picture saved in the zipper database has multiple Phash scores, and the Phash scores of multiple pictures are saved in the zipper database. When determining, for a picture to be identified, whether each of its Phash scores hits a Phash score of another picture in the zipper database, one Phash score of the picture to be identified may hit Phash scores of several pictures in the zipper database; for example, the first Phash score of the picture to be identified hits the first Phash score of the second picture and also hits the second Phash score of the third picture. Several Phash scores of the picture to be identified may also each have hits in the zipper database; for example, the first Phash score of the picture to be identified hits Phash scores of four pictures in the zipper database, and its second Phash score hits Phash scores of three pictures in the zipper database.

After determining, for each Phash score of the picture to be identified, the Phash scores that it hits in the zipper database, the Hamming distance between the Phash value of the picture to be identified and the Phash value of each picture whose Phash score was hit is compared, to determine whether the picture to be identified is a duplicate of that picture.

FIG. 2 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 1 of the present invention, where the process includes the following steps:

S201: Extract a picture to be identified in the search result, and determine a Phash value of the picture to be identified.

S202: Segment the Phash value according to a preset method to obtain each Phash score after the segmentation.

S203: Extract each Phash score of the to-be-identified picture, compare each Phash score with each Phash score of each picture in the zipper database, and determine whether any Phash score hits the Phash score of another picture in the zipper database. When the determination result is yes, proceed to step S204; otherwise, proceed to step S207.

S204: Compare whether the Phash value of the to-be-identified picture and the Hamming distance of the Phash value of the hit picture are smaller than a set comparison threshold. If the determination result is yes, proceed to step S205; otherwise, proceed to step S206.

A picture in the zipper database whose Phash score is hit by a Phash score of the to-be-identified picture is referred to as a hit picture.

S205: Determine that the to-be-identified picture and the hit picture are duplicate pictures.

S206: Determine whether the current hit picture is the last of the hit pictures. When the determination result is yes, proceed to step S207; otherwise, perform step S204 for the next hit picture.

S207: Add the ID of the to-be-identified picture in the search result, and each Phash score and the Phash value of the to-be-identified picture, to the zipper database.
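Steps S201 to S207 can be sketched end to end as below. This is an illustration under stated assumptions, not the patented implementation: the Phash value is taken as already computed, it is split into four 16-bit scores, the comparison threshold is 5, and `index` (score to picture IDs) and `store` (picture ID to Phash value) are hypothetical in-memory stand-ins for the zipper database.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def split16(phash: int) -> list[int]:
    # Assumed segmentation: four contiguous 16-bit scores.
    return [(phash >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def recognize(pic_id, phash, index, store, threshold=5):
    """Return the ID of the first stored duplicate, or None after saving the picture."""
    scores = split16(phash)                          # S202
    hits = []                                        # S203: collect hit pictures
    for s in scores:
        for other in index.get(s, []):
            if other not in hits:
                hits.append(other)
    for other in hits:                               # S204 / S206: check each hit in turn
        if hamming(phash, store[other]) < threshold:
            return other                             # S205: duplicate found
    store[pic_id] = phash                            # S207: save the new picture
    for s in set(scores):
        index.setdefault(s, []).append(pic_id)
    return None


index, store = {}, {}
print(recognize("a", 0x0123456789ABCDEF, index, store))  # None
print(recognize("b", 0x0123456789ABCDEE, index, store))  # a
print(recognize("c", 0xFFFF000000000000, index, store))  # None
```

Picture "b" differs from "a" in one bit, so three of its four scores hit and the full comparison confirms the duplicate; picture "c" shares no score with a stored picture, so no Hamming distance is computed for it at all.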

Alternatively, in order to improve the recognition efficiency of repeated pictures, the Phash scores of other pictures in the zipper database that are hit by each Phash score of the picture to be identified are determined, and the other pictures corresponding to the hit Phash scores are used as hit pictures. In this case, determining whether the to-be-identified picture is a duplicate of the hit pictures in the zipper database includes:

Determining a Hamming distance of the to-be-identified picture and each of the other pictures, and extracting a minimum value of the Hamming distance;

Determining whether the minimum value is less than a set comparison threshold;

When the minimum value is less than the set comparison threshold, it is determined that the to-be-identified picture is a duplicate of another picture in the zipper database; otherwise, it is determined that the to-be-identified picture is not a duplicate of any other picture in the zipper database.

In a specific implementation, the Hamming distance between the to-be-identified picture and each hit picture may be determined in turn, the minimum Hamming distance selected, and that minimum compared with the set comparison threshold. When the minimum Hamming distance is smaller than the set comparison threshold, the to-be-identified picture is determined to be a duplicate of the picture corresponding to the minimum value.

In the embodiment of the present invention, in order to improve the accuracy of repeated picture recognition, determining whether the picture to be identified is a duplicate of other pictures in the zipper database includes:

For the first picture among the other pictures, determining the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the first picture, and determining whether the Hamming distance is less than a set first threshold;

When the Hamming distance is less than the set first threshold, determining that the to-be-identified picture is a duplicate of the first picture;

When the Hamming distance is not less than the set first threshold, determining whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold;

When the Hamming distance is less than the set second threshold, determining the Hamming distance between the to-be-identified picture and each of the remaining other pictures, extracting the minimum value of these Hamming distances, and determining whether the minimum value is less than the set first threshold; when the minimum value is less than the set first threshold, determining that the to-be-identified picture is a duplicate of another picture in the zipper database; otherwise, determining that the to-be-identified picture is not a duplicate of the other pictures.

The first threshold may be used as a trusted threshold: when the Hamming distance between two pictures is less than the first threshold, the two pictures may be considered duplicates. The second threshold may be regarded as a moderately trusted threshold: when the Hamming distance between two pictures is not less than the first threshold but less than the second threshold, the to-be-identified picture is added to the suspicious picture queue, and the Hamming distances between the to-be-identified picture and the other hit pictures continue to be determined, in order to decide whether the picture to be identified is a duplicate of a picture in the zipper database.

Further, in order to improve the recognition efficiency of repeated pictures, the Hamming distance between the picture to be identified and each hit picture may be determined separately, the minimum Hamming distance extracted, and that minimum compared with the set first threshold and second threshold to determine whether the picture to be identified is a duplicate of a picture in the zipper database.
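One possible reading of the two-threshold procedure is sketched below. The threshold values 3 and 8 and the function name are assumptions. The final branch, which treats a first-hit distance at or beyond the second threshold as "not a duplicate", is also an assumption: the text does not state explicitly what happens to the remaining hit pictures in that case.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_duplicate_two_thresholds(phash: int, hit_hashes: list[int],
                                first: int = 3, second: int = 8) -> bool:
    """Two-threshold check against the Phash values of the hit pictures."""
    if not first < second:
        raise ValueError("the first threshold must be less than the second")
    if not hit_hashes:
        return False
    d0 = hamming(phash, hit_hashes[0])
    if d0 < first:        # trusted range: duplicate of the first hit picture
        return True
    if d0 < second:       # suspicious range: examine the remaining hit pictures
        rest = [hamming(phash, h) for h in hit_hashes[1:]]
        return bool(rest) and min(rest) < first
    return False          # assumed outcome at or beyond the second threshold


print(is_duplicate_two_thresholds(0b0, [0b1]))               # True
print(is_duplicate_two_thresholds(0b0, [0b11111, 0b1]))      # True
print(is_duplicate_two_thresholds(0b0, [0b11111, 0b1111]))   # False
```

In the second call the first hit is only suspicious (distance 5), but a remaining hit lies within the trusted range, so the picture is still reported as a duplicate.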

FIG. 3 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 2 of the present invention, where the process includes the following steps:

S301: Determine a Phash value of the to-be-identified picture, and segment the Phash value according to a preset method to obtain each Phash score after the segmentation.

S302: Extract each Phash score of the to-be-identified picture, compare each Phash score with the Phash scores of each picture in the zipper database, and determine whether any Phash score hits the Phash score of another picture in the zipper database. When the determination result is yes, proceed to step S303; otherwise, proceed to step S306.

S303: Determine a Hamming distance between the to-be-identified picture and each hit picture, and extract a minimum value of the Hamming distance.

S304: Determine whether the minimum value of the Hamming distance is less than a set comparison threshold. When the determination result is YES, proceed to step S305; otherwise, proceed to step S306.

S305: Determine that the to-be-identified picture has a duplicate picture in the zipper database.

S306: Determine that the picture to be identified is not a duplicate of any picture in the zipper database, and save the information of the picture to be identified in the zipper database.
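Steps S303 to S305 differ from Embodiment 1 in that only the nearest hit picture is compared with the threshold. A minimal sketch, assuming the hit pictures have already been collected into a dictionary from picture ID to Phash value and that the threshold is 5 (both assumptions):

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def nearest_duplicate(phash: int, hit_hashes: dict, threshold: int = 5):
    """Return (picture ID, distance) of the nearest hit if it is a duplicate, else None."""
    if not hit_hashes:
        return None                                   # no hits: go straight to S306
    nearest = min(hit_hashes, key=lambda pid: hamming(phash, hit_hashes[pid]))  # S303
    distance = hamming(phash, hit_hashes[nearest])
    return (nearest, distance) if distance < threshold else None               # S304


print(nearest_duplicate(0b1111, {"x": 0b0000, "y": 0b1110}))  # ('y', 1)
print(nearest_duplicate(0xFF, {"x": 0x00}))                   # None
```

A `None` result corresponds to step S306, where the caller would save the picture to the zipper database.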

In the embodiment of the present invention, after the Hamming distance between the to-be-identified picture and other pictures is determined, comparing the Hamming distance with the two thresholds effectively ensures the accuracy of repeated picture recognition without affecting its recall.

In order to effectively improve the efficiency of repeated picture recognition so that repeated pictures can be found quickly, in the embodiment of the present invention, when it is determined that the picture to be identified is not a duplicate of any picture in the zipper database and its information is saved to the zipper database, the Phash scores and the Phash value of the to-be-identified picture are stored at the head of the zipper database, wherein the zipper database saves the pictures from front to back according to the time when each picture was generated.

For example, the pictures to be identified may be news pictures and trending pictures. Pictures based on the same event tend to appear at around the same time. Therefore, based on the proximity of the events and of the times at which the pictures appear, in the embodiment of the present invention, when it is determined that the picture to be identified has no duplicate in the zipper database, the information of the picture to be identified is added to the head of the zipper database, that is, the front of the zipper database. When performing repeated picture recognition, it may then first be determined whether the pictures at the head of the zipper database are duplicates of the picture to be identified, thereby improving the efficiency of repeated picture recognition.
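The head-insertion idea can be illustrated with a double-ended queue. For brevity the sketch models a single zipper as one `collections.deque` of (picture ID, Phash value) pairs, which is a simplification of the keyed structure described earlier; the names and the threshold of 5 are assumptions.

```python
from collections import deque

# One zipper, newest entry at the head (left end).
zipper = deque()


def save_at_head(pic_id: str, phash: int) -> None:
    """Insert a newly saved picture at the front of the zipper."""
    zipper.appendleft((pic_id, phash))


def first_duplicate(phash: int, threshold: int = 5):
    """Scan from the head so the most recently saved pictures are checked first."""
    for pic_id, stored in zipper:
        if bin(phash ^ stored).count("1") < threshold:
            return pic_id
    return None


save_at_head("old", 0x0000)
save_at_head("new", 0x0001)
print(first_duplicate(0x0001))  # new
```

Both stored pictures are within the threshold of the query, but the scan stops at the most recently saved one, which is the intended effect when pictures of the same event arrive close together in time.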

The process of recognizing the repeated pictures of the embodiment of the present invention is described below through a specific embodiment.

The value of the Phash value of the to-be-identified picture is M. After the Phash value is segmented according to a preset method, each Phash value P1, P2, ..., Pn after the segmentation of the picture to be identified is obtained. The first threshold set is a, the second threshold is b, and a<b.

Comparing the Phash score P1 of the to-be-identified picture with each Phash score of other pictures in the zipper database, and determining whether the Phash score P1 of the to-be-identified picture hits the Phash score of other pictures in the zipper database, for example, the The Phash score P1 of the recognition picture hits a certain Phash score of picture 1 in the zipper database, and hits a certain Phash score of picture 2. Comparing the Phash score P2 of the picture to be identified with each Phash score of other pictures in the zipper database, and determining whether the Phash score P2 of the picture to be identified hits the Phash score of other pictures in the zipper database, for example, the to-be-identified The Phash score of the picture P2 hits another Phash score of picture 1 in the zipper database and hits a Phash score of picture 3.

Each Phash score of the image to be identified is sequentially compared with each Phash score of other images in the zipper database to determine whether to hit the Phash score of other images in the zipper database, thereby determining the hit of the to-be-identified image. Each picture in the zipper database. If each Phash score of the to-be-identified picture misses each Phash score of other pictures in the zipper database, it is determined that the picture to be recognized and each picture in the zipper database are not duplicated, and the picture to be recognized is not to be recognized. The identification information of the Phash score and the Phash value of the image are saved in the zipper database, and the identification information of the image to be identified may be the sequence number of the image in the search result, or the entire repeated image recognition process of the image The serial number and so on.

After each picture that the to-be-identified picture hits in the zipper database has been determined, the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the hit picture is calculated for each hit picture, and it is determined whether the Hamming distance is less than the set first threshold. When the Hamming distance is less than the set first threshold a, that is, the Hamming distance is in the interval [0, a), it is determined that the to-be-identified picture is a duplicate of that picture in the zipper database.

When the Hamming distance is not less than the set first threshold a but less than the set second threshold b, that is, the Hamming distance is in the interval [a, b), where a is smaller than b, the to-be-identified picture is added to a candidate queue. The Hamming distances between the to-be-identified picture and the other hit pictures are then calculated, and the minimum value of these Hamming distances is identified. When the minimum value is less than the set first threshold a, it is determined that the to-be-identified picture is a duplicate of a picture in the zipper database; otherwise, it is determined that the to-be-identified picture does not duplicate any picture in the zipper database, and the identification information, each Phash score, and the Phash value of the to-be-identified picture are saved in the zipper database.

When the Hamming distance is not less than the set second threshold b, that is, the Hamming distance is in the interval [b, ∞), it is determined that the to-be-identified picture does not duplicate the picture in the zipper database, and the identification information, each Phash score, and the Phash value of the to-be-identified picture are saved in the zipper database.
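The three intervals [0, a), [a, b) and [b, ∞) can be illustrated with the following sketch. It is one possible reading of the two-threshold procedure, not a definitive implementation: the function names `hamming` and `is_duplicate` are hypothetical, the values a=3 and b=8 are placeholders chosen only for the example, and Phash values are assumed to be Python integers.

```python
def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two Phash values differ."""
    return bin(x ^ y).count("1")


def is_duplicate(phash, hit_phashes, a=3, b=8):
    """Apply the two-threshold test to the Phash values of the hit pictures.

    a and b are placeholder thresholds with a < b (assumed values).
    """
    for i, other in enumerate(hit_phashes):
        d = hamming(phash, other)
        if d < a:
            # Interval [0, a): duplicate of this hit picture.
            return True
        if d < b:
            # Interval [a, b): candidate. Compare with the remaining hit
            # pictures and decide on the minimum distance found there.
            rest = hit_phashes[i + 1:]
            smallest = min((hamming(phash, h) for h in rest), default=d)
            return smallest < a
        # Interval [b, inf): not a duplicate of this hit picture; try the next.
    return False


print(is_duplicate(0x0, [0x1F, 0x1]))   # first hit is a candidate, second is within a
print(is_duplicate(0x0, [0xFFFF]))      # distance 16 is outside both thresholds
```

In this sketch the second threshold only affects how early the loop can stop; a caller that gets `False` would then save the picture to the zipper database as described above.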

FIG. 4 is a schematic diagram of a process for de-duplication of image search according to an embodiment of the present invention, where the process includes:

S401: Receive a query word input by the user, and search for a picture resource that matches the query word input by the user.

S402: Determine a Phash value of each picture in the picture resource, and segment the Phash value according to a preset method to obtain each Phash score after the segmentation.

S403: Extract each Phash score of the to-be-identified picture, compare each Phash score with the Phash scores of each picture in the zipper database, and determine whether any Phash score hits a Phash score of another picture in the zipper database. When the result of the determination is YES, proceed to step S404; otherwise, proceed to step S409.

S404: Determine a Hamming distance between the to-be-identified picture and each hit picture, and extract a minimum value of the Hamming distance.

S405: Determine whether the minimum value of the Hamming distance is less than a set comparison threshold. When the determination result is YES, proceed to step S406; otherwise, proceed to step S409.

S406: Determine that the to-be-identified picture has a duplicate picture in the zipper database.

S407: Determine whether the to-be-identified picture is the last picture in the picture resource. If the determination result is YES, proceed to step S408; otherwise, use the next picture as the to-be-identified picture, and proceed to step S403.

S408: Return the picture in the zipper database after the duplicate picture is removed to the user.

S409: Determine that the to-be-identified picture does not duplicate any picture in the zipper database, and add the to-be-identified picture to the front of the zipper database. Then proceed to step S407.
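Steps S402 to S409 can be sketched as a single loop over the search results. The sketch below is a simplified illustration under several assumptions: the search of step S401 is replaced by a ready-made list of (picture id, Phash value) pairs, the Phash value is taken to be 64 bits split into four 16-bit Phash scores, the comparison threshold of 5 is a placeholder, and all function names are hypothetical.

```python
def hamming(x, y):
    """Number of differing bits between two Phash values."""
    return bin(x ^ y).count("1")


def split_scores(phash, bits=64, seg=16):
    """Split a Phash value into equal-width Phash scores (assumed layout)."""
    mask = (1 << seg) - 1
    return [(phash >> shift) & mask for shift in range(bits - seg, -1, -seg)]


def deduplicate(results, threshold=5):
    """results: (picture_id, phash) pairs in search-result order."""
    zipper = []        # ids of kept pictures, newest first
    phash_of = {}      # picture id -> Phash value
    ids_by_score = {}  # Phash score -> ids of kept pictures containing it

    for pic_id, phash in results:                  # S407: next picture
        scores = split_scores(phash)               # S402
        hits = set()
        for score in scores:                       # S403
            hits |= ids_by_score.get(score, set())
        if hits:
            nearest = min(hamming(phash, phash_of[h]) for h in hits)  # S404
            if nearest < threshold:                # S405
                continue                           # S406: duplicate, dropped
        zipper.insert(0, pic_id)                   # S409: add at the front
        phash_of[pic_id] = phash
        for score in scores:
            ids_by_score.setdefault(score, set()).add(pic_id)
    return zipper                                  # S408


results = [
    ("a", 0x1111222233334444),
    ("b", 0x1111222233334445),  # differs from "a" in one bit
    ("c", 0xAAAABBBBCCCCDDDD),
    ("d", 0x11110000FFFF0000),  # shares one score with "a" but is far away
]
print(deduplicate(results))
```

Because each non-duplicate picture is inserted at the front, the returned list is ordered from the most recently added picture to the earliest.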

In the embodiment of the present invention, the Phash value of the picture is segmented to obtain multiple Phash scores. Only when a Phash score hits a Phash score of another picture in the zipper database is the Phash value of the picture compared with the Phash values of the hit pictures. Therefore, the accuracy of duplicate picture recognition is guaranteed, and the efficiency of duplicate picture recognition is also effectively improved.

FIG. 5 is a schematic structural diagram of an apparatus for identifying a duplicate picture according to an embodiment of the present disclosure, where the apparatus includes:

The segmentation module 51 is configured to determine a Phash value of the to-be-identified picture, and segment the Phash value to obtain each Phash score after the segmentation;

a zipper database 52, configured to store a Phash value of the picture and each Phash score;

The determining module 53 is configured to determine whether each Phash score of the to-be-identified picture after segmentation hits a Phash score of another picture after segmentation in the zipper database;

The comparison identification module 54 is configured to: when the determining module determines that a Phash score of the to-be-identified picture hits a Phash score of another picture in the zipper database, determine whether the to-be-identified picture is a duplicate of other pictures in the zipper database; and when the determining module determines that every Phash score of the to-be-identified picture misses the Phash scores of the other pictures in the zipper database, save the information of the to-be-identified picture into the zipper database.

The comparison identification module 54 is specifically configured to determine, for each other picture in the zipper database whose Phash score is hit, whether the to-be-identified picture is a duplicate of other pictures in the zipper database according to the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of that other picture.

The determining module 53 is specifically configured to determine whether the Phash score of the to-be-identified picture is the same as the Phash score of other pictures in the zipper database.

The comparison identification module 54 is specifically configured to determine the Hamming distance between the to-be-identified picture and each of the other pictures, and extract the minimum value of the Hamming distances; determine whether the minimum value is less than a set comparison threshold; when the minimum value is less than the set comparison threshold, determine that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, determine that the to-be-identified picture does not duplicate other pictures in the zipper database.

The comparison identification module 54 is configured to: for a first picture among the other pictures, determine the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the first picture, and determine whether the Hamming distance is less than a set first threshold; when the Hamming distance is less than the set first threshold, determine that the to-be-identified picture is a duplicate of the first picture; when the Hamming distance is not less than the set first threshold, determine whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold; and when the Hamming distance is less than the set second threshold, determine the Hamming distance between the to-be-identified picture and each of the remaining other pictures, extract the minimum value of the Hamming distances, and determine whether the minimum value is less than the set first threshold; when the minimum value is less than the set first threshold, determine that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, determine that the to-be-identified picture does not duplicate the other pictures.

The comparison identification module 54 is configured to save the Phash score of the to-be-identified picture and the Phash value of the to-be-identified picture in a header of the zipper database, where the zipper database saves the information of each picture from front to back according to the time at which the picture was generated.

The segmentation module 51 is specifically configured to divide the Phash value into a plurality of unit segments, each of which adopts a different number of bits; and adopts a sequential shift method to obtain each Phash score.
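One way to read the segmentation performed by the segmentation module is sketched below. The source does not give the segment widths or the exact meaning of the sequential shift method, so everything specific here is an assumption: the function name `segment_phash` is hypothetical, the Phash value is treated as an unsigned integer, the default layout is four 16-bit unit segments of a 64-bit value, and "sequential shift" is interpreted as shifting the value right step by step and masking off one unit segment at a time.

```python
def segment_phash(phash, widths=(16, 16, 16, 16)):
    """Split a Phash value into Phash scores, most significant segment first.

    widths lists the number of bits of each unit segment (assumed values);
    the widths are allowed to differ from one segment to the next.
    """
    total_bits = sum(widths)
    if phash < 0 or phash >> total_bits:
        raise ValueError("Phash value does not fit in the declared widths")

    scores = []
    remaining = total_bits
    for width in widths:
        remaining -= width                 # shift amount for this segment
        mask = (1 << width) - 1            # keeps only `width` low bits
        scores.append((phash >> remaining) & mask)
    return scores


print([hex(s) for s in segment_phash(0x1111222233334444)])
print([hex(s) for s in segment_phash(0xABCDEF, widths=(4, 8, 12))])
```

The second call shows segments of unequal width; whether the embodiment uses equal or unequal widths is not stated in this passage.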

FIG. 6 is a schematic structural diagram of an apparatus for image search deduplication according to an embodiment of the present disclosure, where the apparatus includes:

The receiving search module 61 is configured to receive a query word input by the user, and search for a picture resource that matches the query word input by the user;

The de-duplication module 62 is configured to remove duplicate pictures in the picture resource;

a providing module 63, configured to return a picture resource result after removing the duplicate picture to the user;

The deduplication module 62 removes duplicate pictures in the picture resource by using the above device for identifying duplicate pictures.

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.

Similarly, in the above description of the exemplary embodiments of the invention, the various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than those expressly recited in each claim. Rather, as the following claims reflect, inventive aspects reside in less than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, the abstract and the drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.

In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments but not other features, combinations of features of different embodiments are intended to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying duplicate pictures and/or the picture search deduplication device in accordance with embodiments of the present invention. The invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 7 illustrates a computing device that may implement the method of identifying duplicate pictures and/or the method of picture search deduplication in accordance with the present invention. The computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720. Memory 720 can be an electronic memory such as a flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM. Memory 720 has a storage space 730 for program code 731 for performing any of the method steps described above. For example, the storage space 730 for program code may include individual program codes 731 for implementing the various steps of the above methods, respectively. These program codes can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 720 in the computing device of FIG. 7. The program code can, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 731', i.e., code readable by a processor such as the processor 710, that when run by a computing device causes the computing device to perform the steps of the methods described above.

Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, it is noted that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.

It is to be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may devise alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps that are not recited in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words can be interpreted as names.

In addition, it should be noted that the language used in the specification has been selected principally for readability and instructional purposes, and not to delineate or limit the inventive subject matter. Therefore, many modifications and changes will be apparent to those skilled in the art without departing from the scope of the invention. The disclosure of the present invention is intended to be illustrative, and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

A method of identifying duplicate pictures, the method comprising:

Determining a Phash value of the to-be-identified picture, and segmenting the Phash value to obtain each Phash score after the segmentation;

Determining whether each Phash score of the to-be-identified picture after segmentation hits a Phash score of another picture after segmentation in the zipper database;

When the Phash score of the picture to be identified hits the Phash score of other pictures in the zipper database, it is determined whether the picture to be identified and other pictures in the zipper database are duplicated;

Otherwise, the information of the picture to be identified is saved in the zipper database.
The method of claim 1, wherein the determining whether the picture to be recognized and other pictures in the zipper database are duplicated comprises:

For each other picture in the zipper database whose Phash score is hit, determining whether the picture to be identified and other pictures in the zipper database are duplicated according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture.
The method according to any one of claims 1 to 2, wherein the determining whether the picture to be recognized and other pictures in the zipper database are duplicated comprises:

Determining a Hamming distance of the to-be-identified picture and each of the other pictures, and extracting a minimum value of the Hamming distance;

Determining whether the minimum value is less than a set comparison threshold;

When the minimum value is less than the set comparison threshold, it is determined that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, it is determined that the to-be-identified picture does not duplicate other pictures in the zipper database.
The method according to any one of claims 1 to 3, wherein the determining whether the picture to be recognized and other pictures in the zipper database are duplicated comprises:

For a first picture among the other pictures, determining a Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the first picture, and determining whether the Hamming distance is less than a set first threshold;

When the Hamming distance is less than the set first threshold, determining that the to-be-identified picture is repeated with the first picture;

When the Hamming distance is not less than the set first threshold, determining whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold;

When the Hamming distance is less than the set second threshold, determining the Hamming distance between the to-be-identified picture and each of the remaining other pictures, extracting a minimum value of the Hamming distances, and determining whether the minimum value is less than the set first threshold; when the minimum value is less than the set first threshold, determining that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, determining that the to-be-identified picture does not duplicate the other pictures.
The method according to any one of claims 1 to 4, wherein the saving the information of the picture to be recognized into the zipper database comprises:

The Phash score of the to-be-identified picture and the Phash value of the to-be-identified picture are saved in a header of the zipper database, wherein the zipper database stores information of each picture from front to back according to the time when the picture is generated.
The method according to any one of claims 1 to 5, wherein the picture to be identified comprises a news picture and a hot picture.
The method of any one of claims 1 to 6, wherein the segmenting the Phash value comprises:

Dividing the Phash value into a plurality of unit segments, each unit segment adopting a different number of bits;

Using the sequential shift method, each Phash score is obtained.
A method for image search deduplication, the method comprising:

Receiving a query word input by the user, and searching for a picture resource that matches the query word input by the user;

Removing duplicate pictures from the picture resource;

Returning the result of the picture resource after the duplicate picture is removed to the user;

Wherein the duplicate pictures in the picture resource are identified by the method for identifying a duplicate picture according to any one of claims 1 to 7.
A device for identifying duplicate pictures, the device comprising:

a segmentation module, configured to determine a Phash value of the to-be-identified picture, segment the Phash value, and obtain each Phash score after the segmentation;

a zipper database for storing Phash values of pictures and each Phash score;

a judging module, configured to determine whether each Phash score of the to-be-identified picture after segmentation hits a Phash score of another picture after segmentation in the zipper database;

a comparison identification module, configured to: when the judging module determines that a Phash score of the to-be-identified picture hits a Phash score of another picture in the zipper database, determine whether the to-be-identified picture and other pictures in the zipper database are duplicated; and when the judging module determines that every Phash score of the to-be-identified picture misses the Phash scores of the other pictures in the zipper database, save the information of the to-be-identified picture into the zipper database.
The apparatus according to claim 9, wherein the comparison identification module is specifically configured to: for each other picture in the zipper database whose Phash score is hit, determine whether the picture to be identified and the other pictures in the zipper database are duplicated according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture.
The device according to claim 10, wherein the comparison identification module is specifically configured to: determine a Hamming distance between the to-be-identified picture and each of the other pictures, and extract a minimum value of the Hamming distances; determine whether the minimum value is less than a set comparison threshold; when the minimum value is less than the set comparison threshold, determine that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, determine that the to-be-identified picture does not duplicate other pictures in the zipper database.
The device of claim 10, wherein the comparison identification module is configured to: for a first picture among the other pictures, determine a Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the first picture, and determine whether the Hamming distance is less than a set first threshold; when the Hamming distance is less than the set first threshold, determine that the to-be-identified picture is a duplicate of the first picture; when the Hamming distance is not less than the set first threshold, determine whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold; and when the Hamming distance is less than the set second threshold, determine a Hamming distance between the to-be-identified picture and each of the remaining other pictures, extract a minimum value of the Hamming distances, and determine whether the minimum value is less than the set first threshold; when the minimum value is less than the set first threshold, determine that the to-be-identified picture is a duplicate of other pictures in the zipper database; otherwise, determine that the to-be-identified picture does not duplicate the other pictures.
The device according to any one of claims 9 to 12, wherein the comparison identification module is configured to save the Phash score of the to-be-identified picture and the Phash value of the to-be-identified picture in a header of the zipper database, wherein the zipper database saves the information of each picture from front to back according to the time at which the picture was generated.
The apparatus according to any one of claims 9 to 12, wherein the segmentation module is specifically configured to divide the Phash value into a plurality of unit segments, each of which adopts a different number of bits, and to obtain each Phash score by using a sequential shift method.
A device for image search deduplication, the device comprising:

Receiving a search module, configured to receive a query word input by the user, and search for a picture resource that matches the query word input by the user;

a de-duplication module for removing duplicate pictures in a picture resource;

Providing a module, configured to return a picture resource result after removing the duplicate picture to the user;

The deduplication module removes duplicate pictures in the picture resource by using the apparatus for identifying duplicate pictures according to any one of claims 9 to 14.
A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of identifying a duplicate picture according to any one of claims 1-7, and/or the method for de-duplication of picture search according to claim 8.
A computer readable medium storing the computer program of claim 16.