WO2015184992A1 - Method for recognizing duplicate image, and image search and deduplication method and device thereof

Info

Publication number
WO2015184992A1
WO2015184992A1 PCT/CN2015/080713 CN2015080713W WO2015184992A1 WO 2015184992 A1 WO2015184992 A1 WO 2015184992A1 CN 2015080713 W CN2015080713 W CN 2015080713W WO 2015184992 A1 WO2015184992 A1 WO 2015184992A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
phash
database
zipper
identified
Application number
PCT/CN2015/080713
Other languages
French (fr)
Chinese (zh)
Inventor
朱茂清
韩玉刚
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015184992A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to the field of picture recognition technologies, and in particular, to a method for identifying a repeated picture, a method for de-duplicating a picture search, and a device thereof.
  • A series of feature quantization processes can be performed on the picture, with feature quantization carried out before deduplication.
  • Although this method can achieve a good recognition result, it takes a long time and cannot meet the real-time requirements of picture search and serving.
  • The present invention has been made in order to provide a method for identifying a repeated picture, a method for picture search deduplication, and an apparatus thereof, which overcome the above problems or at least partially solve them.
  • an embodiment of the present invention provides a method for identifying a repeated picture, the method comprising:
  • an embodiment of the present invention provides a method for de-duplication of a picture search, the method comprising:
  • wherein the repeated pictures in the picture resource are removed by using the above method for identifying a repeated picture.
  • an embodiment of the present invention provides an apparatus for identifying a repeated picture, the apparatus comprising:
  • a segmentation module configured to determine a Phash value of the to-be-identified picture, segment the Phash value, and obtain each Phash score after the segmentation;
  • a zipper database for storing Phash values of pictures and each Phash score
  • a judging module configured to determine whether each Phash score after the segmentation of the image to be identified hits a Phash score of other image segments in the zipper database
  • a comparison identification module configured to determine, when the judgment module determines that the Phash score of the to-be-identified picture hits a Phash score of other pictures in the zipper database, whether the picture to be recognized and other pictures in the zipper database are duplicated; When it is determined that each Phash score of the to-be-identified picture misses the Phash score of other pictures in the zipper database, the information of the picture to be recognized is saved into the zipper database.
  • an embodiment of the present invention provides an apparatus for image search deduplication, the apparatus comprising:
  • a receiving search module configured to receive a query word input by the user, and search for a picture resource that matches the query word input by the user;
  • a de-duplication module for removing duplicate pictures in a picture resource
  • a computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of identifying a duplicate picture according to any of the above, and/or the method of picture search deduplication described above.
  • a computer readable medium wherein the computer program described above is stored.
  • An embodiment of the present invention provides a method for identifying a repeated picture, a method for de-duplicating a picture, and a device thereof.
  • The Phash value of a picture to be identified is segmented to obtain each Phash score, and each Phash score of the picture to be identified is compared with the Phash scores of each picture saved in the zipper database.
  • When a Phash score of the picture to be identified hits the Phash score of another picture in the zipper database, it is determined whether the picture to be identified duplicates that picture.
  • Because the Phash value of a picture is segmented into multiple Phash scores, and the full Phash values are compared only when a Phash score hits the Phash score of another picture, the accuracy of repeated picture recognition is ensured and the recognition efficiency is effectively improved.
  • FIG. 1 is a schematic diagram of a process for identifying a duplicate picture according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 1 of the present invention
  • FIG. 3 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 2 of the present invention.
  • FIG. 4 is a schematic diagram of a process of de-duplication of image search according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus for identifying a duplicate picture according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for image search deduplication according to an embodiment of the present invention.
  • FIG. 7 is a schematic block diagram of a computing device for performing a method of identifying duplicate pictures in accordance with the present invention, and/or a method of picture search deduplication.
  • FIG. 8 schematically shows a storage unit for holding or carrying program code implementing a method of identifying duplicate pictures according to the present invention, and/or a method of picture search deduplication.
  • the embodiment of the present invention provides a method for identifying a repeated picture, a method for de-duplication of a picture search, and a device thereof.
  • FIG. 1 is a schematic diagram of a process for identifying a duplicate picture according to an embodiment of the present invention, where the process includes the following steps:
  • S101 Determine a Phash value of the to-be-identified picture, and segment the Phash value to obtain each Phash score after the segmentation.
  • A preset method is used to segment the Phash value of every picture, so that the Phash value of each picture is segmented in the same manner, which makes the subsequent comparison of Phash scores convenient.
  • S102 Determine whether each Phash score of the segmented picture to be identified hits a Phash score of another segmented picture in the zipper database. When the determination result is yes, proceed to step S103; otherwise, proceed to step S104.
  • Determining whether a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database includes:
  • determining whether the Phash score of the picture to be identified is the same as a Phash score of another picture in the zipper database.
  • For example, the Phash score P1 of the picture to be identified is compared with each Phash score of each picture in the zipper database to determine whether P1 hits a Phash score of another picture in the zipper database, that is, whether P1 is the same as some Phash score of some picture in the zipper database.
  • When P1 is the same as a Phash score of the second picture in the zipper database, it is determined that P1 hits the second picture in the zipper database.
  • The other Phash scores of the picture to be identified are processed in the same way, to determine whether each Phash score of the picture to be identified hits the Phash score of other pictures in the zipper database.
  • S103 Determine whether the picture to be identified duplicates other pictures in the zipper database.
  • Determining whether the picture to be identified and other pictures in the zipper database are duplicated includes:
  • A comparison threshold may be set. When the Hamming distance between the Phash value of the picture to be identified and the Phash value of the hit picture is less than the set comparison threshold, the picture to be identified and the hit picture are determined to be repeated pictures; otherwise, they are determined not to be repeated.
  • S104 Save the information of the to-be-identified picture into the zipper database.
  • When every Phash score of the picture to be identified misses the Phash scores of every picture in the zipper database, it is determined that the zipper database contains no picture that duplicates the picture to be identified. In that case, or when comparison of the Hamming distance shows that the picture to be identified and the hit picture are not repeated, the information of the picture to be identified is added to the zipper database to facilitate subsequent recognition of repeated pictures, so that the pictures stored in the zipper database do not repeat one another.
  • Because the Phash value of the picture is segmented to obtain multiple Phash scores, and the Phash value of the picture is compared with the Phash values of other pictures only when a Phash score hits the Phash score of other pictures in the zipper database, the accuracy of repeated picture recognition is ensured and the recognition efficiency of repeated pictures is effectively improved.
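The flow of steps S101 to S104 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the split into four 16-bit segments, the threshold of 5, and the names `split_scores` and `identify` are all hypothetical, and a plain dictionary stands in for the zipper database.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def split_scores(phash: int) -> list:
    # Assumption: a 64-bit hash cut into four 16-bit segments.
    # The text only requires that every picture is segmented the same way.
    return [(phash >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def identify(phash: int, zipper: dict, threshold: int = 5) -> bool:
    """Return True if phash duplicates a stored picture; otherwise store it."""
    scores = split_scores(phash)                              # S101
    hits = {p for s in scores for p in zipper.get(s, [])}     # S102
    if any(hamming(phash, other) < threshold for other in hits):
        return True                                           # S103: duplicate
    for s in set(scores):                                     # S104: save
        zipper.setdefault(s, []).append(phash)
    return False
```

The full Hamming comparison only runs against pictures that share at least one segment value, which is where the efficiency gain described above comes from.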
  • The zipper database is used to store information of a picture, including the Phash value of the picture and each Phash score of the picture. The zipper database may also store identification information of the picture, such as the sequence number of the picture in the search result, or the serial number of the picture throughout the identification process.
  • Initially the zipper database is empty. In the subsequent recognition process, according to the recognition result, the information of each picture that does not repeat a picture already saved is stored in the zipper database, so it can be considered that each picture stored in the zipper database is different.
  • the zipper database may use each Phash score of each picture as a plurality of key values of the picture, and save the Phash value of the picture as the zipper data of the picture.
  • The ID of the picture in the search result may also be saved, or the ID of each picture in the zipper database may be determined according to the order in which the pictures are saved to the zipper database.
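One possible in-memory layout for such a zipper database is sketched below. The class and method names (`ZipperDatabase`, `Entry`, `save`, `hits`) are hypothetical; the sketch assumes each Phash score is a key whose zipper (a list) holds the full Phash value together with an ID assigned in save order.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    picture_id: int  # ID assigned in the order pictures are saved
    phash: int       # full Phash value, stored as the zipper data


class ZipperDatabase:
    def __init__(self) -> None:
        self._zippers = defaultdict(list)  # Phash score -> list of Entry
        self._next_id = 0

    def save(self, phash: int, scores: list) -> int:
        """Store one picture under every one of its Phash scores."""
        entry = Entry(self._next_id, phash)
        self._next_id += 1
        for score in set(scores):
            self._zippers[score].append(entry)
        return entry.picture_id

    def hits(self, scores: list) -> list:
        """Return each stored picture that shares at least one score."""
        found = {}
        for score in scores:
            for entry in self._zippers.get(score, []):
                found[entry.picture_id] = entry
        return list(found.values())
```

Keying on the score alone, rather than on its position in the hash, follows the text's example in which the first score of one picture can hit the second score of another.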
  • The Phash value is determined for each picture and segmented according to a preset method to obtain each Phash score after segmentation.
  • The number of bits in each Phash score after segmentation can also be chosen arbitrarily, as long as the same method is used to determine the Phash scores of different pictures.
  • The Phash value of the picture is a 64-bit value.
  • 21 bits of the Phash value can be selected.
  • The 21 bits are divided into three unit segments of 8 bits, 7 bits, and 6 bits respectively.
  • For example, the estimated maximum amount is 2 million (data will expire), and the Phash value used is 21 bits.
  • The 21 bits are divided into 3 units of 8 bits, 7 bits, and 6 bits, obtained by sequential shifting.
  • Each zipper may use one Phash score as its key value.
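The 8/7/6-bit split can be illustrated with a short sketch. The text does not say which 21 of the 64 bits are selected, so taking the lowest 21 bits is an assumption made here, as is reading "sequentially shifted" as shifting right past each field in turn. The function name `segment_phash` is hypothetical.

```python
def segment_phash(phash: int, widths=(8, 7, 6)) -> list:
    """Split the selected bits of phash into consecutive fields of the given widths."""
    total = sum(widths)                      # 21 bits for the 8/7/6 example
    selected = phash & ((1 << total) - 1)    # assumption: use the lowest 21 bits
    scores = []
    shift = total
    for width in widths:
        shift -= width                       # move to the next field
        scores.append((selected >> shift) & ((1 << width) - 1))
    return scores
```

For instance, `segment_phash(0b10101010_1100110_010101)` yields the three field values 170, 102, and 21.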
  • the zipper database stores information about multiple images, including Phash scores and Phash values for each image.
  • Each Phash score of the picture to be identified is checked for a hit against the Phash scores of other pictures in the zipper database. Specifically, for each Phash score of the picture to be identified, it is determined whether a Phash score of each picture saved in the zipper database is the same as that Phash score; when they are the same, the Phash score of the picture to be identified is considered to hit the Phash score of the other picture in the zipper database.
  • There are multiple Phash scores for each picture saved in the zipper database, and the Phash scores of multiple pictures are saved in the zipper database. When determining whether each Phash score of a picture to be identified hits a Phash score of other pictures in the zipper database, a Phash score of the picture to be identified may hit Phash scores of several pictures in the zipper database; for example, the first Phash score of the picture to be identified hits the first Phash score of the second picture and also hits the second Phash score of the third picture.
  • Several Phash scores of the picture to be identified may also each have hits in the zipper database; for example, the first Phash score of the picture to be identified hits the Phash scores of four pictures in the zipper database, and the second Phash score hits the Phash scores of three pictures in the zipper database.
  • FIG. 2 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 1 of the present invention, where the process includes the following steps:
  • S201 Extract a picture to be identified in the search result, and determine a Phash value of the picture to be identified.
  • S202 Segment the Phash value according to a preset method to obtain each Phash score after the segmentation.
  • S204 Determine whether the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the hit picture is smaller than a set comparison threshold. If the determination result is yes, proceed to step S205; otherwise, proceed to step S206.
  • a picture in which the Phash score in the zipper database and the Phash score of the to-be-identified picture are hit is referred to as a hit picture.
  • S205 Determine that the to-be-identified picture and the hit picture are duplicate pictures.
  • S206 Determine whether the current hit picture is the last of the hit pictures. If the determination result is yes, proceed to step S207; otherwise, perform step S204 for the next hit picture.
  • S207 Add the ID of the to-be-identified picture in the search result, and each Phash score and the Phash value of the to-be-identified picture, to the zipper database.
  • The Phash scores of other pictures in the zipper database that are hit by each Phash score of the to-be-identified picture are determined, and the other pictures corresponding to the hit Phash scores are used as hit pictures.
  • The Hamming distance between the to-be-identified picture and each hit picture is determined, and the minimum value is extracted. When the minimum value is less than the set comparison threshold, it is determined that the to-be-identified picture duplicates another picture in the zipper database; otherwise, it is determined that the to-be-identified picture does not duplicate other pictures in the zipper database.
  • That is, the Hamming distance between the to-be-identified picture and each hit picture may be determined in turn, the minimum Hamming distance selected, and that minimum compared with the set comparison threshold; when the minimum Hamming distance is smaller than the set comparison threshold, it is determined that the to-be-identified picture and the picture corresponding to the minimum value are duplicate pictures.
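The minimum-distance variant described above can be sketched as follows. The function name `nearest_duplicate` and the representation of hit pictures as a plain list of Phash values are assumptions made for illustration.

```python
from typing import Optional


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def nearest_duplicate(phash: int, hit_phashes: list, threshold: int) -> Optional[int]:
    """Return the closest hit Phash value if it is within the threshold, else None."""
    if not hit_phashes:
        return None  # nothing was hit, so there is nothing to compare
    best = min(hit_phashes, key=lambda other: hamming(phash, other))
    return best if hamming(phash, best) < threshold else None
```

Returning the matching Phash value, rather than only a yes/no answer, identifies which stored picture the new picture duplicates.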
  • Determining whether the picture to be identified and other pictures in the zipper database are repeated includes:
  • The first threshold may be used as a trusted threshold: when the Hamming distance between two pictures is less than the first threshold, the two pictures may be considered duplicate pictures.
  • The second threshold may be regarded as a moderately trusted threshold: when the Hamming distance between two pictures is not less than the first threshold but less than the second threshold, the to-be-identified picture is added to the suspicious picture queue, and the Hamming distances between the to-be-identified picture and the other hit pictures continue to be determined, to determine whether the picture to be identified duplicates a picture in the zipper database.
  • The Hamming distance between the picture to be identified and each hit picture may be determined separately, the minimum Hamming distance extracted, and the minimum value compared with the set first threshold and second threshold to determine whether the picture to be identified duplicates a picture in the zipper database.
  • FIG. 3 is a schematic diagram of a process for identifying a duplicate picture according to Embodiment 2 of the present invention, where the process includes the following steps:
  • S301 Determine a Phash value of the to-be-identified picture, and segment the Phash value according to a preset method to obtain each Phash score after the segmentation.
  • S302 Extract each Phash score of the to-be-identified picture, compare each Phash score with the Phash score of each picture in the zipper database, and determine whether each Phash score hits a Phash score of other pictures in the zipper database. When the result of the determination is yes, proceed to step S303; otherwise, proceed to step S306.
  • S303 Determine a Hamming distance between the to-be-identified picture and each hit picture, and extract a minimum value of the Hamming distance.
  • a picture in which the Phash score in the zipper database and the Phash score of the to-be-identified picture are hit is referred to as a hit picture.
  • S304 Determine whether the minimum value of the Hamming distance is less than a set comparison threshold. When the determination result is yes, proceed to step S305; otherwise, proceed to step S306.
  • S305 Determine that the to-be-identified picture has a duplicate picture in the zipper database.
  • The accuracy of repeated picture recognition can be effectively ensured, and the recall of repeated picture recognition is not affected.
  • the repeated picture can be quickly found.
  • Saving the information of the picture to be identified into the zipper database includes storing the Phash scores and the Phash value of the to-be-identified picture at the head of the zipper database, wherein the zipper database saves each picture from front to back according to the time when the picture is generated.
  • The picture to be identified may be a news picture or a hot picture. Because pictures based on the same event appear at around the same time, and based on this proximity of the event and of the pictures that appear, the information of the picture to be identified is added to the head of the zipper database, that is, the front of the zipper database. When performing repeated picture recognition, it may therefore first be determined whether the pictures at the head of the zipper database duplicate the picture to be identified, thereby improving the efficiency of repeated picture recognition.
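A minimal sketch of head insertion, assuming a single zipper is modelled as a `collections.deque` of Phash values and that the scan stops at the first duplicate found. The function names are hypothetical.

```python
from collections import deque


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def save_at_head(phash: int, zipper: deque) -> None:
    """Newest picture goes to the front, so recent pictures are compared first."""
    zipper.appendleft(phash)


def find_duplicate(phash: int, zipper: deque, threshold: int):
    """Scan from the head; return the position of the first duplicate, or None."""
    for position, stored in enumerate(zipper):
        if hamming(phash, stored) < threshold:
            return position
    return None
```

If duplicates of a news picture tend to arrive close together in time, a scan that starts at the head usually ends after only a few comparisons.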
  • The Phash value of the to-be-identified picture is M. After the Phash value is segmented according to a preset method, the Phash scores P1, P2, ..., Pn of the picture to be identified are obtained.
  • The set first threshold is a and the second threshold is b, where a < b.
  • Each Phash score of the picture to be identified is compared in turn with each Phash score of the other pictures in the zipper database to determine whether it hits a Phash score of another picture, thereby determining each picture in the zipper database that is hit by the to-be-identified picture.
  • If every Phash score of the to-be-identified picture misses every Phash score of the other pictures in the zipper database, it is determined that the picture to be identified does not duplicate any picture in the zipper database, and the identification information, each Phash score, and the Phash value of the picture to be identified are saved in the zipper database. The identification information of the picture to be identified may be the sequence number of the picture in the search result, or the serial number of the picture throughout the repeated picture recognition process.
  • After determining each picture that the picture to be identified hits in the zipper database, for each hit picture the Hamming distance between the Phash value of the picture to be identified and the Phash value of the hit picture is computed, and it is determined whether the Hamming distance is less than the set first threshold.
  • When the Hamming distance is less than the set first threshold a, that is, in the interval [0, a), it is determined that the picture to be identified duplicates the picture in the zipper database.
  • When the Hamming distance is not less than the set first threshold a but less than the set second threshold b, that is, in the interval [a, b), where a is smaller than b, the picture to be identified is added to the suspicious queue. The Hamming distances between the to-be-identified picture and the other hit pictures are then compared, and the minimum Hamming distance is identified.
  • When the minimum value is less than the set first threshold a, it is determined that the picture to be identified duplicates a picture in the zipper database; otherwise, the picture to be identified does not duplicate the pictures in the zipper database, and the identification information, each Phash score, and the Phash value of the picture to be identified are saved in the zipper database.
  • When the Hamming distance is not less than the set second threshold b, that is, in the interval [b, ∞), it is determined that the picture to be identified does not duplicate the picture in the zipper database, and the identification information, each Phash score, and the Phash value of the picture to be identified are saved in the zipper database.
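The three intervals [0, a), [a, b) and [b, ∞) can be expressed as a small classification step. This sketch assumes the hit pictures are given as a list of Phash values and reports the suspicious state as a flag rather than maintaining a real queue; the function names and return shape are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def classify(distance: int, a: int, b: int) -> str:
    """Map a Hamming distance onto the intervals [0, a), [a, b), [b, inf)."""
    if distance < a:
        return "duplicate"
    if distance < b:
        return "suspicious"
    return "distinct"


def check_against_hits(phash: int, hit_phashes: list, a: int, b: int) -> tuple:
    """Return (is_duplicate, was_suspicious) for the picture to be identified."""
    was_suspicious = False
    for other in hit_phashes:
        label = classify(hamming(phash, other), a, b)
        if label == "duplicate":
            return True, was_suspicious
        if label == "suspicious":
            was_suspicious = True  # would be placed in the suspicious queue
    return False, was_suspicious
```

A picture that is never closer than a to any hit picture ends up as not duplicated, whether or not it passed through the suspicious state, and would then be saved to the zipper database.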
  • FIG. 4 is a schematic diagram of a process for de-duplication of image search according to an embodiment of the present invention, where the process includes:
  • S401 Receive a query word input by the user, and search for a picture resource that matches the query word input by the user.
  • S402 Determine a Phash value of each picture in the picture resource, and segment the Phash value according to a preset method to obtain each Phash score after the segmentation.
  • S403 Extract each Phash score of the to-be-identified picture, compare each Phash score with the Phash score of each picture in the zipper database, and determine whether each Phash score hits a Phash score of other pictures in the zipper database. When the result of the determination is yes, proceed to step S404; otherwise, proceed to step S409.
  • S404 Determine a Hamming distance between the to-be-identified picture and each hit picture, and extract a minimum value of the Hamming distance.
  • S405 Determine whether the minimum value of the Hamming distance is less than a set comparison threshold. When the determination result is yes, proceed to step S406; otherwise, proceed to step S409.
  • S406 Determine that the to-be-identified picture has a duplicate picture in the zipper database.
  • S407 Determine whether the to-be-identified picture is the last picture in the picture resource. If the determination result is yes, proceed to step S408; otherwise, use the next picture as the to-be-identified picture, and proceed to step S403.
  • S409 Determine that the picture to be identified does not duplicate the pictures in the zipper database, and add the picture to be identified to the front of the zipper database. Then perform step S407.
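The loop over search results in FIG. 4 might look like the following sketch. It assumes the results arrive as `(picture_id, phash)` pairs in rank order, uses four 16-bit segments and a threshold of 3 purely for illustration, and starts from an empty zipper for each query; none of these choices comes from the text.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def dedupe_results(results: list, threshold: int = 3) -> list:
    """Return the picture IDs that remain after duplicates are dropped."""
    zipper = {}   # Phash score -> list of full Phash values, newest first
    kept = []
    for picture_id, phash in results:
        # Assumption: four 16-bit segments of a 64-bit hash.
        scores = [(phash >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
        hits = {p for s in scores for p in zipper.get(s, [])}
        if hits and min(hamming(phash, p) for p in hits) < threshold:
            continue                                   # duplicate: drop it
        for s in set(scores):
            zipper.setdefault(s, []).insert(0, phash)  # add at the front
        kept.append(picture_id)
    return kept
```

The first occurrence of each picture keeps its rank and later near-identical pictures are skipped, so the order of the remaining results is preserved.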
  • Because the Phash value of the picture is segmented to obtain multiple Phash scores, and the Phash value of the picture is compared with the Phash values of other pictures only when a Phash score hits the Phash score of other pictures in the zipper database, the accuracy of repeated picture recognition is ensured and the recognition efficiency of repeated pictures is effectively improved.
  • FIG. 5 is a schematic structural diagram of an apparatus for identifying a duplicate picture according to an embodiment of the present disclosure, where the apparatus includes:
  • the segmentation module 51 is configured to determine a Phash value of the to-be-identified picture, and segment the Phash value to obtain each Phash score after the segmentation;
  • a zipper database 52 configured to store a Phash value of the picture and each Phash score
  • the determining module 53 is configured to determine whether each Phash score after the segmentation of the image to be identified hits a Phash score of other image segments in the zipper database;
  • the comparison identification module 54 is configured to determine, when the judgment module determines that the Phash score of the to-be-identified picture hits the Phash score of other pictures in the zipper database, whether the picture to be recognized and other pictures in the zipper database are duplicated; When the module determines that each Phash score of the to-be-identified picture misses the Phash score of other pictures in the zipper database, the information of the picture to be recognized is saved into the zipper database.
  • The comparison identification module 54 is specifically configured to determine, for each other picture in the zipper database that is hit by a Phash score, whether the to-be-identified picture duplicates other pictures in the zipper database according to the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of that other picture.
  • the determining module 53 is specifically configured to determine whether the Phash score of the to-be-identified picture is the same as the Phash score of other pictures in the zipper database.
  • the comparison identification module 54 is specifically configured to determine a Hamming distance between the to-be-identified picture and each of the other pictures, and extract a minimum value of the Hamming distance; determine whether the minimum value is less than a set comparison threshold. When the minimum value is less than the set comparison threshold, determining that the to-be-identified picture is repeated with other pictures in the zipper database, otherwise, determining that the to-be-identified picture does not overlap with other pictures in the zipper database.
  • The comparison identification module 54 is configured to: determine, for each of the other pictures, the Hamming distance between the Phash value of the to-be-identified picture and the Phash value of the first picture; determine whether the Hamming distance is less than a set first threshold; when the Hamming distance is less than the set first threshold, determine that the to-be-identified picture duplicates the first picture; when the Hamming distance is not less than the set first threshold, determine whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold; when the Hamming distance is less than the set second threshold, determine the Hamming distance between the to-be-identified picture and each of the remaining other pictures, extract the minimum value of the Hamming distance, and determine whether the minimum value is less than the set first threshold; when the minimum value is less than the set first threshold, determine that the to-be-identified picture duplicates other pictures in the zipper database; otherwise, determine that the to-be-identified picture does not duplicate other pictures in the zipper database.
  • The comparison identification module 54 is configured to save the Phash scores and the Phash value of the to-be-identified picture at the head of the zipper database, where the zipper database saves the information of each picture from front to back according to the time at which the picture is generated.
  • the segmentation module 51 is specifically configured to divide the Phash value into a plurality of unit segments, each of which adopts a different number of bits; and adopts a sequential shift method to obtain each Phash score.
  • FIG. 6 is a schematic structural diagram of an apparatus for image search deduplication according to an embodiment of the present disclosure, where the apparatus includes:
  • the receiving search module 61 is configured to receive a query word input by the user, and search for a picture resource that matches the query word input by the user;
  • the de-duplication module 62 is configured to remove duplicate pictures in the picture resource
  • a providing module 63 configured to return a picture resource result after removing the duplicate picture to the user
  • the deduplication module 62 removes duplicate pictures in the picture resource by using the above device for identifying duplicate pictures.
  • modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • the modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components.
  • all features disclosed in this specification (including the accompanying claims, the abstract and the drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.
  • Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying duplicate pictures and/or the picture search deduplication device in accordance with embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 7 illustrates a computing device that may implement a method of identifying duplicate pictures, and/or a method of picture search deduplication in accordance with the present invention.
  • the computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 720.
  • Memory 720 can be an electronic memory such as a flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • Memory 720 has a memory space 730 for program code 731 for performing any of the method steps described above.
  • storage space 730 for program code may include various program code 731 for implementing various steps in the above methods, respectively.
  • This program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 720 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 731', i.e., code readable by a processor such as 710, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Abstract

The present invention provides a method for recognizing a duplicate image, and an image search and deduplication method and a device thereof. The method comprises: segmenting a Phash value of an image to be recognized to obtain Phash subvalues, comparing each Phash subvalue with Phash subvalues of each image saved in a chaining database, and determining whether the image to be recognized is a duplicate of other images when the Phash subvalues of other images in the chaining database are hit. In an embodiment of the present invention, a Phash value of an image is segmented to obtain multiple Phash subvalues, and the Phash value of the image is compared with hit Phash values of other images when a certain Phash subvalue hits Phash subvalues of other images in a chaining database. Therefore, the accuracy of duplicate image recognition is ensured, and the efficiency of duplicate image recognition is effectively improved.

Description

Method for identifying repeated pictures, picture search deduplication method and device thereof
Technical field
The present invention relates to the field of picture recognition technologies, and in particular to a method for identifying repeated pictures, a picture search deduplication method, and devices thereof.
Background art
After pictures are searched for based on user input, the retrieved pictures generally need to be deduplicated, that is, identical pictures in the search results need to be identified, in order to improve the user experience and the accuracy of the search results.
When identifying identical pictures in search results, the prior art makes a simple judgment based on whether the content of the pictures is the same or whether their link addresses are the same. However, identical pictures sometimes do not have exactly the same content, and their link addresses may also differ, so these methods cannot achieve a good recognition effect.
To achieve a better recognition effect, a series of feature quantization steps can be applied to the pictures before deduplication. Although this method can achieve a fairly good recognition effect, it is time-consuming and cannot meet the real-time requirements of picture search and delivery.
Alternatively, identical pictures can be identified by comparing the Phash values of the pictures. However, this method requires comparing the Phash values of every pair of pictures, which is also very time-consuming for massive search results and cannot guarantee real-time picture search and delivery.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying repeated pictures, a picture search deduplication method, and devices thereof that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an embodiment of the present invention provides a method for identifying repeated pictures, the method comprising:
determining a Phash value of a picture to be identified, and segmenting the Phash value to obtain each Phash score after segmentation;
determining whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database;
when a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database, determining whether the picture to be identified is a duplicate of that other picture in the zipper database;
otherwise, saving the information of the picture to be identified to the zipper database.
According to another aspect of the present invention, an embodiment of the present invention provides a picture search deduplication method, the method comprising:
receiving a query word input by a user, and searching for picture resources that match the query word;
removing duplicate pictures from the picture resources;
returning the picture resource results, with the duplicate pictures removed, to the user;
wherein the duplicate pictures in the picture resources may be removed by using the above method for identifying repeated pictures.
According to yet another aspect of the present invention, an embodiment of the present invention provides a device for identifying repeated pictures, the device comprising:
a segmentation module, configured to determine a Phash value of a picture to be identified and segment the Phash value to obtain each Phash score after segmentation;
a zipper database, configured to store the Phash value and each Phash score of each picture;
a judging module, configured to determine whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database;
a comparison and identification module, configured to: when the judging module determines that a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database, determine whether the picture to be identified is a duplicate of that other picture; and when the judging module determines that no Phash score of the picture to be identified hits a Phash score of any other picture in the zipper database, save the information of the picture to be identified to the zipper database.
According to still another aspect of the present invention, an embodiment of the present invention provides a device for picture search deduplication, the device comprising:
a receiving and searching module, configured to receive a query word input by a user and search for picture resources that match the query word;
a deduplication module, configured to remove duplicate pictures from the picture resources;
a providing module, configured to return the picture resource results, with the duplicate pictures removed, to the user.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for identifying repeated pictures according to any of the above, and/or the above picture search deduplication method.
According to a further aspect of the present invention, a computer readable medium is provided, in which the above computer program is stored.
Embodiments of the present invention provide a method for identifying repeated pictures, a picture search deduplication method, and devices thereof. In the method, the Phash value of a picture to be identified is segmented to obtain each Phash score, and each Phash score of the picture to be identified is compared with the Phash scores of each picture saved in the zipper database. When a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database, it is determined whether the picture to be identified is a duplicate of that picture. Because the Phash value of a picture is segmented into multiple Phash scores, and the Phash value of the picture is compared with the Phash value of another picture only when one of its Phash scores hits a Phash score of that picture in the zipper database, the accuracy of repeated picture recognition is ensured while the recognition efficiency is effectively improved.
The above description is only an overview of the technical solutions of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the present invention are more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic diagram of a process for identifying repeated pictures according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a process for identifying repeated pictures according to Embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a process for identifying repeated pictures according to Embodiment 2 of the present invention;
FIG. 4 is a schematic diagram of a picture search deduplication process according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device for identifying repeated pictures according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a device for picture search deduplication according to an embodiment of the present invention;
FIG. 7 schematically shows a block diagram of a computing device for performing the method for identifying repeated pictures and/or the picture search deduplication method according to the present invention; and
FIG. 8 schematically shows a storage unit for holding or carrying program code that implements the method for identifying repeated pictures and/or the picture search deduplication method according to the present invention.
Detailed description
The present invention is further described below with reference to the drawings and specific embodiments.
To ensure the accuracy of identical-picture recognition and improve its efficiency, embodiments of the present invention provide a method for identifying repeated pictures, a picture search deduplication method, and devices thereof.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
The embodiments of the present invention are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a process for identifying repeated pictures according to an embodiment of the present invention. The process includes the following steps:
S101: Determine the Phash value of the picture to be identified, and segment the Phash value to obtain each Phash score after segmentation.
During identical-picture recognition, the Phash value of every picture is segmented using the same preset method. This ensures that the Phash values of all pictures are segmented in the same way, which facilitates the subsequent comparison of Phash scores.
S102: Determine whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database. If so, go to step S103; otherwise, go to step S104.
A Phash score of the picture to be identified hitting a Phash score of another picture in the zipper database means that:
the Phash score of the picture to be identified is the same as a Phash score of another picture in the zipper database.
After the Phash value of the picture to be identified is segmented, multiple Phash scores are obtained, for example P1, P2, P3, ..., Pn; correspondingly, each picture in the zipper database also has n Phash scores. When making the determination, the Phash score P1 of the picture to be identified is compared with each Phash score of each picture in the zipper database to determine whether P1 hits a Phash score of another picture in the zipper database, that is, whether P1 is the same as some Phash score of some picture in the zipper database. When P1 is the same as a Phash score of a second picture in the zipper database, it is determined that P1 hits that Phash score of the second picture.
The same process is applied to the other Phash scores of the picture to be identified, determining one by one whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database.
S103: Determine whether the picture to be identified is a duplicate of another picture in the zipper database.
Specifically, determining whether the picture to be identified is a duplicate of another picture in the zipper database includes:
for each other picture in the zipper database whose Phash score is hit, determining, according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture, whether the picture to be identified is a duplicate of it.
When comparing the Hamming distance between the Phash value of the picture to be identified and the Phash value of a hit picture, a comparison threshold may be set. When the Hamming distance between the two Phash values is less than the set comparison threshold, the picture to be identified and the hit picture are determined to be duplicates; otherwise, they are determined not to be duplicates.
S104: Save the information of the picture to be identified to the zipper database.
When no Phash score of the picture to be identified hits a Phash score of any picture in the zipper database, it is determined that the zipper database contains no picture that duplicates the picture to be identified. In that case, or when the Hamming distance comparison shows that the picture to be identified does not duplicate the hit pictures, the information of the picture to be identified is added to the zipper database to facilitate subsequent duplicate identification. No two pictures saved in the zipper database are duplicates of each other.
Because the Phash value of a picture is segmented into multiple Phash scores in the embodiment of the present invention, and the Phash value of the picture is compared with the Phash value of another picture only when one of its Phash scores hits a Phash score of that picture in the zipper database, the accuracy of repeated picture recognition is ensured while the recognition efficiency is effectively improved.
In the embodiment of the present invention, the zipper database is used to store picture information, including the Phash value of each picture and its Phash scores. The zipper database may also store identification information for the picture, such as its sequence number in the search results or its sequence number within the overall identification process.
When duplicate identification begins for the results of a given picture search, the zipper database is empty. During the subsequent identification process, based on the identification results, the information of pictures that do not duplicate any picture already saved in the zipper database is saved to the zipper database, so every picture saved in the zipper database can be considered different. Specifically, the zipper database may use each Phash score of a picture as one of multiple key values for that picture, and save the Phash value of the picture as the zipper data under those keys. In addition, to distinguish each picture and reduce the amount of data stored, the zipper database may also save the ID of the picture in the search results, or assign each picture an ID according to the order in which the pictures are saved to the zipper database.
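The key/value layout described above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class name `ZipperDatabase` and its methods `add` and `hits` are hypothetical, IDs are assigned in insertion order (one of the two options the text mentions), and a Phash score is used as a key regardless of which segment position it came from, since the text compares each score against every score of every stored picture.

```python
from collections import defaultdict


class ZipperDatabase:
    """Hypothetical in-memory zipper database: Phash score -> chain of pictures."""

    def __init__(self):
        # Each key is one Phash score; each value is the chain ("zipper")
        # of (picture_id, phash_value) entries that share that score.
        self.chains = defaultdict(list)
        self.next_id = 0  # IDs are assigned in insertion order

    def add(self, phash_value, scores):
        """Save a picture under every one of its Phash scores; return its ID."""
        picture_id = self.next_id
        self.next_id += 1
        for score in set(scores):  # avoid storing the same entry twice per key
            self.chains[score].append((picture_id, phash_value))
        return picture_id

    def hits(self, scores):
        """Return {picture_id: phash_value} for stored pictures sharing any score."""
        found = {}
        for score in scores:
            for picture_id, phash_value in self.chains.get(score, []):
                found[picture_id] = phash_value
        return found
```

With this layout, a lookup touches only the chains keyed by the scores of the picture to be identified, rather than every stored picture.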
After multiple pictures are retrieved based on the user's input, the Phash value of each picture is determined and segmented according to the preset method to obtain each Phash score. When segmenting the Phash value, it is sufficient that every picture uses the same segmentation method: the Phash value may be split directly, segmented using sequential shifting, or the Phash scores may be determined by interval extraction. The number of bits in each Phash score can also be chosen arbitrarily, as long as the same choice is applied to every picture.
Specifically, the Phash value of a picture is a 64-bit value. In the embodiment of the present invention, 21 bits of the Phash value may be selected. To reduce the number of comparisons, the 21 bits are divided into three unit segments of 8 bits, 7 bits and 6 bits. Using sequential shifting, 512 corresponding mask data items are eventually generated, 506 after deduplication; that is, each data item is assigned to at most 506 zippers, and the maximum number of zippers is 2^8*2^7*2^6 = 2^21, approximately 2 million.
The process of segmenting the Phash value is illustrated below with a specific example.
Considering the timeliness of news data, the maximum data volume is estimated at 2 million (data expires over time). A 21-bit representation of the Phash value is selected. To reduce the number of comparisons, the 21 bits are divided into three unit segments of 8 bits, 7 bits and 6 bits. Using sequential shifting, 512 corresponding mask data items are eventually generated, 506 after deduplication; that is, each data item is assigned to at most 506 zippers, and the maximum number of zippers is 2^8*2^7*2^6 = 2^21, approximately 2 million. Each zipper is located in the zipper database. In the embodiment of the present invention, each zipper may serve as the key value of a Phash score.
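As a rough illustration of the 8/7/6-bit split and the resulting key space, the sketch below divides 21 bits directly into three unit segments. It rests on assumptions: the text does not say which 21 of the 64 bits are selected, so the low 21 bits are used here, and the function name `split_phash` is hypothetical. The sequential-shift step that yields the 512 (506 after deduplication) mask data items is not specified in enough detail to reproduce, so it is deliberately left out.

```python
SEGMENT_BITS = (8, 7, 6)  # unit segment widths from the example above


def split_phash(phash_value, segment_bits=SEGMENT_BITS):
    """Split the low sum(segment_bits) bits into segments, most significant first."""
    total = sum(segment_bits)
    value = phash_value & ((1 << total) - 1)  # assumption: keep the low 21 bits
    scores = []
    remaining = total
    for width in segment_bits:
        remaining -= width
        scores.append((value >> remaining) & ((1 << width) - 1))
    return scores


# Upper bound on the number of distinct zippers: 2^8 * 2^7 * 2^6 = 2^21.
max_chains = 1
for width in SEGMENT_BITS:
    max_chains *= 2 ** width

print(split_phash(0b11111111_0000000_111111))  # [255, 0, 63]
print(max_chains)                              # 2097152
```

Whatever segmentation is chosen, the same function has to be applied to every picture so that equal segments produce equal keys.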
After each Phash score of the picture to be identified is obtained, each Phash score is compared with the Phash scores of every picture saved in the zipper database to determine whether it hits a Phash score of another picture. The zipper database stores information for multiple distinct pictures, including the Phash scores and Phash value of each. For each Phash score of the picture to be identified, each Phash score of each picture saved in the zipper database is checked against it; when the two are the same, the Phash score of the picture to be identified is considered to hit that Phash score of the other picture in the zipper database.
Each picture saved in the zipper database has multiple Phash scores, and the zipper database holds the Phash scores of multiple pictures. When determining, for each picture to be identified, whether each of its Phash scores hits a Phash score of another picture in the zipper database, a single Phash score of the picture to be identified may well hit Phash scores of several pictures; for example, the first Phash score of the picture to be identified may hit the first Phash score of a second picture and also the second Phash score of a third picture. It is also possible that several Phash scores of the picture to be identified each have hits in the zipper database; for example, the first Phash score may hit Phash scores of four pictures in the zipper database, the second Phash score may hit Phash scores of three pictures, and so on.
After the Phash scores hit in the zipper database by each Phash score of the picture to be identified are determined, the Hamming distance between the Phash value of the picture to be identified and the Phash value of each picture with a hit Phash score is compared to determine whether the picture to be identified duplicates that picture.
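The Hamming distance comparison can be sketched in a few lines, treating each Phash value as an integer. The function names are hypothetical, and the threshold is left as a parameter because the text does not state a concrete value.

```python
def hamming_distance(phash_a, phash_b):
    """Number of bit positions in which two Phash values differ."""
    return bin(phash_a ^ phash_b).count("1")


def is_duplicate(phash_a, phash_b, threshold):
    """Two pictures count as duplicates when the distance is below the threshold."""
    return hamming_distance(phash_a, phash_b) < threshold


print(hamming_distance(0b1011, 0b0010))  # 2
```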
FIG. 2 is a schematic diagram of a process for identifying repeated pictures according to Embodiment 1 of the present invention. The process includes the following steps:
S201: Extract a picture to be identified from the search results, and determine the Phash value of the picture to be identified.
S202: Segment the Phash value according to the preset method to obtain each Phash score after segmentation.
S203: Take each Phash score of the picture to be identified and compare it with each Phash score of each picture in the zipper database to determine whether it hits a Phash score of another picture in the zipper database. If so, go to step S204; otherwise, go to step S207.
S204: Determine whether the Hamming distance between the Phash value of the picture to be identified and the Phash value of the hit picture is less than the set comparison threshold. If so, go to step S205; otherwise, go to step S206.
A picture in the zipper database whose Phash score is hit by a Phash score of the picture to be identified is called a hit picture.
S205: Determine that the picture to be identified and the hit picture are duplicate pictures.
S206: Determine whether the current hit picture is the last of the hit pictures. If so, go to step S207; otherwise, perform step S204 for the next hit picture.
S207: Add the ID of the picture to be identified in the search results, each of its Phash scores, and its Phash value to the zipper database.
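Steps S201 to S207 can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: Phash values are assumed to be precomputed integers, the 8/7/6-bit direct split of the low 21 bits stands in for the preset segmentation method, the default threshold of 3 is an arbitrary placeholder, and the names `split_phash` and `deduplicate` are hypothetical.

```python
from collections import defaultdict


def split_phash(value, widths=(8, 7, 6)):
    """Split the low sum(widths) bits of value into Phash scores (assumed scheme)."""
    remaining = sum(widths)
    scores = []
    for width in widths:
        remaining -= width
        scores.append((value >> remaining) & ((1 << width) - 1))
    return scores


def deduplicate(pictures, threshold=3):
    """pictures: iterable of (picture_id, phash_value).

    Returns (kept_ids, duplicates), where duplicates maps a removed
    picture ID to the ID of the stored picture it duplicates.
    """
    zipper = defaultdict(list)  # Phash score -> [(picture_id, phash_value), ...]
    kept, duplicates = [], {}
    for picture_id, phash in pictures:            # S201 (Phash precomputed)
        scores = split_phash(phash)               # S202
        hits = {}                                 # S203: collect hit pictures
        for score in scores:
            for hit_id, hit_phash in zipper.get(score, []):
                hits[hit_id] = hit_phash
        match = None
        for hit_id, hit_phash in hits.items():    # S204, repeated via S206
            if bin(phash ^ hit_phash).count("1") < threshold:
                match = hit_id                    # S205
                break
        if match is not None:
            duplicates[picture_id] = match
        else:                                     # S207
            for score in set(scores):
                zipper[score].append((picture_id, phash))
            kept.append(picture_id)
    return kept, duplicates
```

Only pictures that share at least one Phash score with the candidate are ever compared by Hamming distance, which is where the efficiency gain over pairwise comparison comes from.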
Alternatively, to improve the efficiency of duplicate-picture identification, in this embodiment of the present invention, once the Phash scores of other pictures in the zipper database that are hit by each Phash score of the picture to be identified have been determined, the other pictures corresponding to the hit Phash scores are taken as hit pictures. Determining whether the picture to be identified duplicates a hit picture in the zipper database then includes:
determining the Hamming distance between the picture to be identified and each of the other pictures, and extracting the minimum of these Hamming distances;
determining whether the minimum is less than the set comparison threshold;
when the minimum is less than the set comparison threshold, determining that the picture to be identified duplicates another picture in the zipper database; otherwise, determining that the picture to be identified does not duplicate any other picture in the zipper database.
In a specific implementation, the Hamming distance between the picture to be identified and each hit picture may be determined in turn and the minimum Hamming distance selected. It is then determined whether this minimum is less than the set comparison threshold; if it is, the picture to be identified and the picture corresponding to the minimum distance are determined to be duplicate pictures.
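The minimum-distance variant described above might look like the following sketch. The dictionary of hit pictures and the function names are hypothetical and chosen only for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def nearest_duplicate(phash: int, hits: dict[str, int], threshold: int):
    """hits maps hit-picture ID -> Phash value.

    Returns (ID, distance) of the closest hit picture if its distance is
    below the comparison threshold, otherwise None.
    """
    if not hits:
        return None
    best_id = min(hits, key=lambda pic_id: hamming(phash, hits[pic_id]))
    best_distance = hamming(phash, hits[best_id])
    return (best_id, best_distance) if best_distance < threshold else None
```

Unlike the first-match loop of steps S204 to S206, this version always reports the closest hit picture rather than the first one that falls under the threshold.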
In this embodiment of the present invention, to improve the accuracy of duplicate-picture identification, determining whether the picture to be identified duplicates other pictures in the zipper database includes:
for a first picture among the other pictures, determining the Hamming distance between the Phash value of the picture to be identified and the Phash value of the first picture, and determining whether the Hamming distance is less than a set first threshold;
when the Hamming distance is less than the set first threshold, determining that the picture to be identified duplicates the first picture;
when the Hamming distance is not less than the set first threshold, determining whether the Hamming distance is less than a set second threshold, where the first threshold is less than the second threshold;
when the Hamming distance is less than the set second threshold, determining the Hamming distance between the picture to be identified and each of the remaining other pictures, extracting the minimum of these Hamming distances, and determining whether the minimum is less than the set first threshold; when the minimum is less than the set first threshold, determining that the picture to be identified duplicates another picture in the zipper database; otherwise, determining that the picture to be identified does not duplicate the other pictures.
The first threshold may serve as a trusted threshold: when the Hamming distance between two pictures is less than the first threshold, the two pictures can be regarded as duplicates. The second threshold may be regarded as a moderately trusted threshold: when the Hamming distance between two pictures is not less than the first threshold but less than the second threshold, the picture to be identified is added to a suspicious-picture queue, and its Hamming distances to the other hit pictures are then evaluated to determine whether it duplicates a picture in the zipper database.
Further, to improve the efficiency of duplicate-picture identification, the Hamming distance between the picture to be identified and each hit picture may instead be determined first, the minimum Hamming distance extracted, and that minimum compared with the set first and second thresholds to determine whether the picture to be identified duplicates a picture in the zipper database.
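A rough sketch of the two-threshold check follows, reading the steps above literally. The function name and the list representation of hit pictures are assumptions, and the case where the first distance is at or above the second threshold is treated as "not a duplicate", as in the worked example later in the description.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def two_threshold_check(phash: int, hit_hashes: list[int], first: int, second: int) -> bool:
    """Return True if the picture is judged a duplicate (requires first < second)."""
    if not hit_hashes:
        return False
    head, rest = hit_hashes[0], hit_hashes[1:]
    d = hamming(phash, head)
    if d < first:
        return True  # trusted range [0, first): duplicate of the first hit picture
    if d < second:
        # suspicious range [first, second): consult the remaining hit pictures
        return bool(rest) and min(hamming(phash, h) for h in rest) < first
    return False  # range [second, inf): treated as not a duplicate
```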
FIG. 3 is a schematic diagram of a process for identifying duplicate pictures according to Embodiment 2 of the present invention. The process includes the following steps:
S301: Determine the Phash value of the picture to be identified and segment the Phash value according to a preset method to obtain each Phash score.
S302: Extract each Phash score of the picture to be identified and compare it with the Phash scores of each picture in the zipper database to determine whether it hits a Phash score of another picture in the zipper database. If there is a hit, proceed to step S303; otherwise, proceed to step S306.
S303: Determine the Hamming distance between the picture to be identified and each hit picture, and extract the minimum Hamming distance.
A picture in the zipper database whose Phash score is hit by a Phash score of the picture to be identified is referred to as a hit picture.
S304: Determine whether the minimum Hamming distance is less than the set comparison threshold. If so, proceed to step S305; otherwise, proceed to step S306.
S305: The picture to be identified has a duplicate in the zipper database.
S306: The picture to be identified does not duplicate any picture in the zipper database; save the information of the picture to be identified to the zipper database.
In this embodiment of the present invention, after the Hamming distance between the picture to be identified and the other pictures is determined, comparing that Hamming distance with two thresholds effectively ensures the accuracy of duplicate-picture identification without impairing the recall of the duplicate-picture identification device.
To improve the efficiency of duplicate-picture identification so that duplicates can be found quickly, in this embodiment of the present invention, when the picture to be identified is determined not to duplicate any picture in the zipper database and its information is saved to the zipper database, the Phash scores and the Phash value of the picture to be identified are saved at the head of the zipper database, where the zipper database stores pictures from front to back according to the time at which each picture was generated.
For example, the pictures to be identified may be news pictures and trending pictures. Pictures relating to the same event tend to appear at about the same time. Based on this proximity of events and of the pictures that accompany them, in this embodiment of the present invention, when it is determined that the picture to be identified has no duplicate in the zipper database, its information is added to the head, that is, the front, of the zipper database. During duplicate-picture identification, the pictures at the head of the zipper database can therefore be checked first, which improves identification efficiency.
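The head-insertion idea can be illustrated with a double-ended queue. This is only a sketch: the description does not say how the zipper database is stored, so the use of `collections.deque`, the `(ID, Phash)` tuples and the function names are assumptions.

```python
from collections import deque


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def remember(db: deque, pic_id: str, phash: int) -> None:
    """Store a non-duplicate picture at the head of the database."""
    db.appendleft((pic_id, phash))


def first_duplicate(db: deque, phash: int, threshold: int):
    """Scan from the head, so the most recently added pictures are checked first."""
    for pic_id, stored in db:
        if hamming(phash, stored) < threshold:
            return pic_id
    return None
```

Because pictures of the same event tend to arrive close together, a scan that starts at the head is likely to terminate early for such pictures.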
The duplicate-picture identification process of this embodiment of the present invention is described below through a specific example.
The Phash value of the picture to be identified is M. After the Phash value is segmented according to the preset method, the Phash scores P1, P2, ..., Pn of the picture to be identified are obtained. The set first threshold is a and the set second threshold is b, where a < b.
The Phash score P1 of the picture to be identified is compared with each Phash score of the other pictures in the zipper database to determine whether P1 hits a Phash score of another picture in the zipper database. For example, P1 hits one Phash score of picture 1 in the zipper database and one Phash score of picture 2. The Phash score P2 of the picture to be identified is then compared with each Phash score of the other pictures in the zipper database to determine whether P2 hits a Phash score of another picture. For example, P2 hits another Phash score of picture 1 and one Phash score of picture 3.
Each Phash score of the picture to be identified is compared in turn with each Phash score of the other pictures in the zipper database to determine whether it hits, and in this way every picture in the zipper database that is hit by the picture to be identified is determined. If none of the Phash scores of the picture to be identified hits any Phash score of the other pictures in the zipper database, the picture to be identified is determined not to duplicate any picture in the zipper database, and its identification information, each of its Phash scores and its Phash value are saved to the zipper database. The identification information of the picture to be identified may be the rank number of the picture in the search results, the sequence number of the picture within the overall duplicate-picture identification process, or the like.
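The hit-collection step in this example can be sketched with an inverted index from segment value to picture IDs. The concrete segment values and picture IDs below are made up to mirror the P1/P2 example, and the index structure itself is an assumption; the description only requires that matching segment values be found.

```python
from collections import defaultdict


def build_index(db_segments: dict[int, list[int]]) -> dict[int, set[int]]:
    """Map each stored segment value to the IDs of the pictures containing it."""
    index: dict[int, set[int]] = defaultdict(set)
    for pic_id, segments in db_segments.items():
        for value in segments:
            index[value].add(pic_id)
    return index


def hit_pictures(query_segments: list[int], index: dict[int, set[int]]) -> set[int]:
    """Union of all pictures hit by any segment value of the query picture."""
    hits: set[int] = set()
    for value in query_segments:
        hits |= index.get(value, set())
    return hits


# Hypothetical data: picture 1 shares both segments, pictures 2 and 3 one each.
stored = {1: [0xA1, 0xB2], 2: [0xA1, 0xC3], 3: [0xB2, 0xD4]}
index = build_index(stored)
candidates = hit_pictures([0xA1, 0xB2], index)  # P1 = 0xA1, P2 = 0xB2
```

With this data, P1 hits pictures 1 and 2 and P2 hits pictures 1 and 3, so the full Hamming comparison only needs to run against those three pictures.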
After every picture in the zipper database that is hit by the picture to be identified has been determined, for each hit picture the Hamming distance between the Phash value of the picture to be identified and the Phash value of the hit picture is computed and compared with the set first threshold. When the Hamming distance is less than the set first threshold a, that is, when it lies in the interval [0, a), the picture to be identified is determined to duplicate a picture in the zipper database.
When the Hamming distance is not less than the set first threshold a but less than the set second threshold b, that is, when it lies in the interval [a, b), where a is less than b, the picture to be identified is added to the suspicious queue. The Hamming distances between the picture to be identified and the other hit pictures are then computed and the minimum of these distances is identified. When the minimum is less than the set first threshold a, the picture to be identified is determined to duplicate a picture in the zipper database; otherwise, the picture to be identified is determined not to duplicate any picture in the zipper database, and its identification information, each of its Phash scores and its Phash value are saved to the zipper database.
When the Hamming distance is not less than the set second threshold b, that is, when it lies in the interval [b, ∞), the picture to be identified is determined not to duplicate any picture in the zipper database, and its identification information, each of its Phash scores and its Phash value are saved to the zipper database.
FIG. 4 is a schematic diagram of a picture search deduplication process according to an embodiment of the present invention. The process includes:
S401: Receive a query term entered by the user and search for picture resources that match the query term.
S402: Determine the Phash value of each picture in the picture resources and segment the Phash value according to a preset method to obtain each Phash score.
S403: Extract each Phash score of the picture to be identified and compare it with the Phash scores of each picture in the zipper database to determine whether it hits a Phash score of another picture in the zipper database. If there is a hit, proceed to step S404; otherwise, proceed to step S409.
S404: Determine the Hamming distance between the picture to be identified and each hit picture, and extract the minimum Hamming distance.
S405: Determine whether the minimum Hamming distance is less than the set comparison threshold. If so, proceed to step S406; otherwise, proceed to step S409.
S406: The picture to be identified has a duplicate in the zipper database.
S407: Determine whether the picture to be identified is the last picture in the picture resource information. If so, proceed to step S408; otherwise, take the next picture as the picture to be identified and proceed to step S403.
S408: Return the pictures in the zipper database, from which duplicates have been excluded, to the user.
S409: The picture to be identified does not duplicate any picture in the zipper database; add the picture to be identified to the front of the zipper database. Then proceed to step S407.
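Steps S402 to S409 amount to a single pass over the search results, which might be sketched as follows. The equal 16-bit segmentation, the default threshold, the function names and the choice to return IDs in the original ranking order are assumptions for illustration; the search step S401 is outside the sketch.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def split_segments(phash: int, n_segments: int = 4, seg_bits: int = 16) -> set[int]:
    """Assumed segmentation: equal 16-bit slices of a 64-bit Phash value."""
    mask = (1 << seg_bits) - 1
    return {(phash >> (i * seg_bits)) & mask for i in range(n_segments)}


def dedupe(results: list[tuple[str, int]], threshold: int = 5) -> list[str]:
    """results: (picture ID, Phash value) pairs in ranking order."""
    db: list[tuple[str, int, set[int]]] = []  # index 0 is the head
    for pic_id, phash in results:
        segments = split_segments(phash)                       # S402
        hits = [h for _, h, s in db if segments & s]           # S403
        if hits and min(hamming(phash, h) for h in hits) < threshold:
            continue                                           # S404-S406: duplicate, skip
        db.insert(0, (pic_id, phash, segments))                # S409: add at the front
    return [pic_id for pic_id, _, _ in reversed(db)]           # S408, in ranking order
```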
In this embodiment of the present invention, the Phash value of a picture is segmented into multiple Phash scores, and the Phash value of the picture is compared with the Phash values of other pictures only when one of its Phash scores hits a Phash score of those pictures in the zipper database. This preserves the accuracy of duplicate-picture identification while effectively improving its efficiency.
FIG. 5 is a schematic structural diagram of a device for identifying duplicate pictures according to an embodiment of the present invention. The device includes:
a segmentation module 51, configured to determine the Phash value of the picture to be identified and segment the Phash value to obtain each Phash score;
a zipper database 52, configured to store the Phash value and each Phash score of pictures;
a judgment module 53, configured to determine whether each Phash score of the picture to be identified hits a Phash score of another picture in the zipper database;
a comparison and identification module 54, configured to determine, when the judgment module determines that a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database, whether the picture to be identified duplicates other pictures in the zipper database; and, when the judgment module determines that none of the Phash scores of the picture to be identified hits a Phash score of another picture in the zipper database, to save the information of the picture to be identified to the zipper database.
The comparison and identification module 54 is specifically configured to determine, for each other picture in the zipper database whose Phash score is hit, whether the picture to be identified duplicates other pictures in the zipper database according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of each other picture.
The judgment module 53 is specifically configured to determine whether a Phash score of the picture to be identified is identical to a Phash score of another picture in the zipper database.
The comparison and identification module 54 is specifically configured to determine the Hamming distance between the picture to be identified and each of the other pictures and extract the minimum of these Hamming distances; determine whether the minimum is less than the set comparison threshold; and, when the minimum is less than the set comparison threshold, determine that the picture to be identified duplicates another picture in the zipper database, and otherwise determine that the picture to be identified does not duplicate any other picture in the zipper database.
The comparison and identification module 54 is specifically configured to: for a first picture among the other pictures, determine the Hamming distance between the Phash value of the picture to be identified and the Phash value of the first picture; determine whether the Hamming distance is less than a set first threshold; when the Hamming distance is less than the set first threshold, determine that the picture to be identified duplicates the first picture; when the Hamming distance is not less than the set first threshold, determine whether the Hamming distance is less than a set second threshold, where the first threshold is less than the second threshold; and, when the Hamming distance is less than the set second threshold, determine the Hamming distance between the picture to be identified and each of the remaining other pictures, extract the minimum of these Hamming distances, and determine whether the minimum is less than the set first threshold, determining that the picture to be identified duplicates another picture in the zipper database when the minimum is less than the set first threshold, and otherwise determining that the picture to be identified does not duplicate the other pictures.
The comparison and identification module 54 is specifically configured to save the Phash scores and the Phash value of the picture to be identified at the head of the zipper database, where the zipper database stores the information of each picture from front to back according to the time at which the picture was generated.
The segmentation module 51 is specifically configured to divide the Phash value into multiple unit segments, each unit segment using a different number of bits, and to obtain each Phash score by a sequential shift method.
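One plausible reading of "unit segments with different numbers of bits, obtained by sequential shifting" is shown below. The widths (8, 12, 20 and 24 bits, summing to 64), the low-to-high extraction order and the function name are purely illustrative assumptions; this passage does not fix them.

```python
def split_by_widths(phash: int, widths: tuple[int, ...] = (8, 12, 20, 24)) -> list[int]:
    """Peel segments off the low end of the hash, shifting right after each one."""
    segments = []
    remaining = phash
    for width in widths:
        segments.append(remaining & ((1 << width) - 1))  # take the lowest `width` bits
        remaining >>= width                               # shift them out
    return segments


def join_segments(segments: list[int], widths: tuple[int, ...] = (8, 12, 20, 24)) -> int:
    """Inverse of split_by_widths, useful as a sanity check."""
    value, offset = 0, 0
    for segment, width in zip(segments, widths):
        value |= segment << offset
        offset += width
    return value
```

For instance, splitting `0x0123456789ABCDEF` with these widths gives the segments `0xEF`, `0xBCD`, `0x6789A` and `0x12345`, and joining them restores the original value.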
FIG. 6 is a schematic structural diagram of a picture search deduplication device according to an embodiment of the present invention. The device includes:
a receiving and search module 61, configured to receive a query term entered by the user and search for picture resources that match the query term;
a deduplication module 62, configured to remove duplicate pictures from the picture resources;
a providing module 63, configured to return the picture resource results, with duplicate pictures removed, to the user;
where the deduplication module 62 removes duplicate pictures from the picture resources by using the above device for identifying duplicate pictures.
Embodiments of the present invention provide a method for identifying duplicate pictures, a picture search deduplication method, and devices thereof. In the method, the Phash value of the picture to be identified is segmented to obtain each Phash score; each Phash score of the picture to be identified is compared with the Phash scores of each picture stored in the zipper database; and, when a Phash score of the picture to be identified hits a Phash score of another picture in the zipper database, it is determined whether the picture to be identified duplicates the other pictures. Because the Phash value of a picture is segmented into multiple Phash scores, and the Phash value of the picture is compared with the Phash values of other pictures only when one of its Phash scores hits a Phash score of those pictures in the zipper database, the accuracy of duplicate-picture identification is preserved while its efficiency is effectively improved.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline this disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into a single module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying duplicate pictures and/or the picture search deduplication device according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 7 shows a computing device that can implement the method for identifying duplicate pictures and/or the picture search deduplication method according to the present invention. The computing device conventionally includes a processor 710 and a computer program product or computer-readable medium in the form of a memory 720. The memory 720 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 720 has a storage space 730 for program code 731 for performing any of the method steps described above. For example, the storage space 730 for program code may include individual program codes 731, each implementing one of the various steps of the above methods. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 8. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 720 in the computing device of FIG. 7. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 731', that is, code that can be read by a processor such as 710, which, when run by a computing device, causes the computing device to perform each step of the methods described above.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or restrict the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (17)

  1. A method for identifying duplicate pictures, the method comprising:
    determining a Phash value of a picture to be identified, and segmenting the Phash value to obtain each Phash sub-value after segmentation;
    determining whether each Phash sub-value of the picture to be identified hits a Phash sub-value of another picture in a zipper database;
    when a Phash sub-value of the picture to be identified hits a Phash sub-value of another picture in the zipper database, determining whether the picture to be identified duplicates the other pictures in the zipper database;
    otherwise, saving information of the picture to be identified into the zipper database.
  2. The method according to claim 1, wherein determining whether the picture to be identified duplicates the other pictures in the zipper database comprises:
    for each other picture in the zipper database whose Phash sub-value is hit, determining, according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture, whether the picture to be identified duplicates the other pictures in the zipper database.
  3. The method according to any one of claims 1 to 2, wherein determining whether the picture to be identified duplicates the other pictures in the zipper database comprises:
    determining the Hamming distance between the picture to be identified and each of the other pictures, and extracting the minimum of the Hamming distances;
    determining whether the minimum is less than a set comparison threshold;
    when the minimum is less than the set comparison threshold, determining that the picture to be identified duplicates another picture in the zipper database; otherwise, determining that the picture to be identified does not duplicate the other pictures in the zipper database.
  4. The method according to any one of claims 1 to 3, wherein determining whether the picture to be identified duplicates the other pictures in the zipper database comprises:
    for a first picture among the other pictures, determining the Hamming distance between the Phash value of the picture to be identified and the Phash value of the first picture, and determining whether the Hamming distance is less than a set first threshold;
    when the Hamming distance is less than the set first threshold, determining that the picture to be identified duplicates the first picture;
    when the Hamming distance is not less than the set first threshold, determining whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold;
    when the Hamming distance is less than the set second threshold, determining the Hamming distance between the picture to be identified and each of the remaining other pictures, extracting the minimum of the Hamming distances, and determining whether the minimum is less than the set first threshold; when the minimum is less than the set first threshold, determining that the picture to be identified duplicates another picture in the zipper database; otherwise, determining that the picture to be identified does not duplicate the other pictures.
  5. The method according to any one of claims 1 to 4, wherein saving the information of the picture to be identified into the zipper database comprises:
    saving the Phash sub-values and the Phash value of the picture to be identified at the head of the zipper database, wherein the zipper database stores the information of the pictures from front to back according to the time at which each picture was generated.
  6. The method according to any one of claims 1 to 5, wherein the pictures to be identified comprise news pictures and hot-topic pictures.
  7. The method according to any one of claims 1 to 6, wherein segmenting the Phash value comprises:
    dividing the Phash value into a plurality of unit segments, each unit segment using a different number of bits;
    obtaining each Phash sub-value by a sequential shift method.
  8. A method for picture search deduplication, the method comprising:
    receiving a query term entered by a user, and searching for picture resources that match the query term;
    removing duplicate pictures from the picture resources;
    returning the picture resource results, with the duplicate pictures removed, to the user;
    wherein the duplicate pictures in the picture resources are removed by using the method for identifying duplicate pictures according to any one of claims 1 to 7.
  9. A device for identifying duplicate pictures, the device comprising:
    a segmentation module, configured to determine a Phash value of a picture to be identified and segment the Phash value to obtain each Phash sub-value after segmentation;
    a zipper database, configured to store the Phash value and each Phash sub-value of pictures;
    a judging module, configured to determine whether each Phash sub-value of the picture to be identified hits a Phash sub-value of another picture in the zipper database;
    a comparison and identification module, configured to: when the judging module determines that a Phash sub-value of the picture to be identified hits a Phash sub-value of another picture in the zipper database, determine whether the picture to be identified duplicates the other pictures in the zipper database; and when the judging module determines that no Phash sub-value of the picture to be identified hits a Phash sub-value of another picture in the zipper database, save information of the picture to be identified into the zipper database.
  10. The device according to claim 9, wherein the comparison and identification module is specifically configured to: for each other picture in the zipper database whose Phash sub-value is hit, determine, according to the Hamming distance between the Phash value of the picture to be identified and the Phash value of that other picture, whether the picture to be identified duplicates the other pictures in the zipper database.
  11. The device according to claim 10, wherein the comparison and identification module is specifically configured to: determine the Hamming distance between the picture to be identified and each of the other pictures, and extract the minimum of the Hamming distances; determine whether the minimum is less than a set comparison threshold; when the minimum is less than the set comparison threshold, determine that the picture to be identified duplicates another picture in the zipper database; otherwise, determine that the picture to be identified does not duplicate the other pictures in the zipper database.
  12. The device according to claim 10, wherein the comparison and identification module is specifically configured to: for a first picture among the other pictures, determine the Hamming distance between the Phash value of the picture to be identified and the Phash value of the first picture; determine whether the Hamming distance is less than a set first threshold; when the Hamming distance is less than the set first threshold, determine that the picture to be identified duplicates the first picture; when the Hamming distance is not less than the set first threshold, determine whether the Hamming distance is less than a set second threshold, wherein the first threshold is less than the second threshold; when the Hamming distance is less than the set second threshold, determine the Hamming distance between the picture to be identified and each of the remaining other pictures, extract the minimum of the Hamming distances, and determine whether the minimum is less than the set first threshold; when the minimum is less than the set first threshold, determine that the picture to be identified duplicates another picture in the zipper database; otherwise, determine that the picture to be identified does not duplicate the other pictures.
  13. The device according to any one of claims 9 to 12, wherein the comparison and identification module is specifically configured to save the Phash sub-values and the Phash value of the picture to be identified at the head of the zipper database, wherein the zipper database stores the information of the pictures from front to back according to the time at which each picture was generated.
  14. The device according to any one of claims 9 to 12, wherein the segmentation module is specifically configured to divide the Phash value into a plurality of unit segments, each unit segment using a different number of bits, and to obtain each Phash sub-value by a sequential shift method.
  15. A device for picture search deduplication, the device comprising:
    a receiving and searching module, configured to receive a query term entered by a user and search for picture resources that match the query term;
    a deduplication module, configured to remove duplicate pictures from the picture resources;
    a providing module, configured to return the picture resource results, with the duplicate pictures removed, to the user;
    wherein the deduplication module removes the duplicate pictures from the picture resources by using the device for identifying duplicate pictures according to any one of claims 9 to 14.
  16. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for identifying duplicate pictures according to any one of claims 1 to 7 and/or the method for picture search deduplication according to claim 8.
  17. A computer-readable medium storing the computer program according to claim 16.
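
The flow recited in claims 1, 2, 3 and 5 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The 64-bit hash width, the four equal 16-bit segments, the default threshold of 4, and the names ZipperIndex, split_phash, hamming and check_and_add are all assumptions made for illustration; in particular, claim 7 recites unit segments with different bit counts, which this sketch does not reproduce. The sketch also stores a picture whose segments hit but whose full hash is not within the threshold, a case the claims leave open.

```python
from collections import defaultdict

HASH_BITS = 64                      # assumed Phash width
SEGMENTS = 4                        # assumed: four equal segments
SEG_BITS = HASH_BITS // SEGMENTS    # 16 bits per segment
SEG_MASK = (1 << SEG_BITS) - 1


def split_phash(phash):
    """Return (segment index, segment value) pairs obtained by shifting and masking."""
    return [(i, (phash >> (i * SEG_BITS)) & SEG_MASK) for i in range(SEGMENTS)]


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class ZipperIndex:
    """Hypothetical in-memory stand-in for the zipper database."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        # (segment index, segment value) -> full hashes, newest first
        self.chains = defaultdict(list)

    def check_and_add(self, phash):
        """Return True if phash duplicates a stored picture; otherwise store it and return False."""
        keys = split_phash(phash)
        # Collect every stored hash that shares at least one segment (a "hit").
        candidates = set()
        for key in keys:
            candidates.update(self.chains.get(key, ()))
        if candidates:
            # Compare full hashes and keep the smallest Hamming distance.
            nearest = min(hamming(phash, other) for other in candidates)
            if nearest < self.threshold:
                return True
        # No hit, or no candidate close enough: save at the head of each chain.
        for key in keys:
            self.chains[key].insert(0, phash)
        return False
```

With four segments, two hashes that differ in fewer than four bits must agree on at least one whole segment, so under these assumptions the segment lookup cannot miss a pair that the threshold of 4 would accept. A caller would compute the Phash of each incoming picture with whatever perceptual-hash routine is in use and pass the integer to check_and_add.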
PCT/CN2015/080713 2014-06-05 2015-06-03 Method for recognizing duplicate image, and image search and deduplication method and device thereof WO2015184992A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410247778.2 2014-06-05
CN201410247778.2A CN103984776B (en) 2014-06-05 2014-06-05 Repeated image identification method and image search duplicate removal method and device

Publications (1)

Publication Number Publication Date
WO2015184992A1 (en) 2015-12-10

Family

ID=51276748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/080713 WO2015184992A1 (en) 2014-06-05 2015-06-03 Method for recognizing duplicate image, and image search and deduplication method and device thereof

Country Status (2)

Country Link
CN (1) CN103984776B (en)
WO (1) WO2015184992A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552864A (en) * 2020-03-20 2020-08-18 上海恒生聚源数据服务有限公司 Method, system, storage medium and electronic equipment for removing duplicate information
US11055344B2 (en) 2018-03-21 2021-07-06 Walmart Apollo, Llc Product image evaluation system and method

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984776B (en) * 2014-06-05 2017-05-03 北京奇虎科技有限公司 Repeated image identification method and image search duplicate removal method and device
CN104461297A (en) * 2014-12-05 2015-03-25 上海斐讯数据通信技术有限公司 Mobile terminal with screen and screen image capturing method thereof
CN104881470B (en) * 2015-05-28 2018-05-08 暨南大学 A kind of data de-duplication method towards mass picture data
CN106560840B (en) * 2015-09-30 2019-08-13 腾讯科技(深圳)有限公司 A kind of image information identifying processing method and device
CN105678334A (en) * 2016-01-05 2016-06-15 广州市久邦数码科技有限公司 Method of recognizing duplicate photographs and realization system thereof
CN105930499B (en) * 2016-05-09 2019-11-22 深圳市数极科技有限公司 A kind of image searching method and system
CN106327426A (en) * 2016-08-19 2017-01-11 携程计算机技术(上海)有限公司 Image replication removing method and image replication removing system
CN106682130B (en) * 2016-12-14 2022-11-15 北京五八信息技术有限公司 Similar picture detection method and device
CN107169057B (en) * 2017-04-27 2022-04-05 腾讯科技(深圳)有限公司 Method and device for detecting repeated pictures
CN107729935B (en) * 2017-10-12 2019-11-12 杭州贝购科技有限公司 The recognition methods of similar pictures and device, server, storage medium
CN109033261B (en) * 2018-07-06 2021-06-22 北京旷视科技有限公司 Image processing method, image processing apparatus, image processing device, and storage medium
CN109189963B (en) * 2018-08-31 2021-07-06 北京诸葛找房信息技术有限公司 House resource duplication eliminating method based on house resource information similarity and picture recognition
CN109040784A (en) * 2018-09-14 2018-12-18 北京蓝拓扑科技股份有限公司 Commercial detection method and device
CN110321447A (en) * 2019-07-08 2019-10-11 北京字节跳动网络技术有限公司 Determination method, apparatus, electronic equipment and the storage medium of multiimage

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103353990A (en) * 2013-06-19 2013-10-16 海南大学 Intelligent-texture anti-counterfeiting method based on perceptual hashing
CN103678702A (en) * 2013-12-30 2014-03-26 优视科技有限公司 Video duplicate removal method and device
CN103984776A (en) * 2014-06-05 2014-08-13 北京奇虎科技有限公司 Repeated image identification method and image search duplicate removal method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350826B (en) * 2008-08-22 2012-09-05 中兴通讯股份有限公司 Method for monitoring communication system picture or multimedia video picture
CN101887457B (en) * 2010-07-02 2012-10-03 杭州电子科技大学 Content-based copy image detection method
CN102622366B (en) * 2011-01-28 2014-07-30 阿里巴巴集团控股有限公司 Similar picture identification method and similar picture identification device
CN102567473A (en) * 2011-12-14 2012-07-11 鸿富锦精密工业(深圳)有限公司 Network information retrieval system and retrieval method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055344B2 (en) 2018-03-21 2021-07-06 Walmart Apollo, Llc Product image evaluation system and method
CN111552864A (en) * 2020-03-20 2020-08-18 上海恒生聚源数据服务有限公司 Method, system, storage medium and electronic equipment for removing duplicate information
CN111552864B (en) * 2020-03-20 2023-09-12 上海恒生聚源数据服务有限公司 Information deduplication method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN103984776A (en) 2014-08-13
CN103984776B (en) 2017-05-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15803342

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15803342

Country of ref document: EP

Kind code of ref document: A1