WO2013071953A1 - Fast database matching - Google Patents

Fast database matching

Info

Publication number
WO2013071953A1
Authority
WO
WIPO (PCT)
Prior art keywords
record
key
list
sample
mask
Prior art date
Application number
PCT/EP2011/070075
Other languages
English (en)
Inventor
Donald Martin Monro
Original Assignee
Donald Martin Monro
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donald Martin Monro
Priority to EP11784481.1A (EP2780830A1)
Priority to PCT/EP2011/070075 (WO2013071953A1)
Priority to BR112014011646A (BR112014011646A2)
Publication of WO2013071953A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/268 Lexical context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/197 Matching; Classification

Definitions

  • the invention relates to the field of database systems.
  • it relates to a method and system for improving the speed with which a candidate or sample record may reliably be matched against a record previously enrolled within the database.
  • biometrics in which the requirement is to determine whether or not the individual who has provided a particular biometric sample is already in the database.
  • a further exemplary field is that of digital rights management, where the need is to check whether a particular piece of music, video, image or text matches a corresponding record within a database of copyright works.
  • Databases of the type described can be extremely large, and it may be impractical to attempt a full match analysis between the sample record and every one of the records within the database.
  • a variety of pre-screening processes are in use, but many of these have very restricted fields of application since they often rely upon specific peculiarities of the matching algorithm or of the data that are to be matched.
  • a method of identifying possible matches between a sample record and a plurality of stored records comprising: (a) Defining a plurality of reference positions within a data record, and;
  • the required number for matching may be determined according to any convenient algorithm, such as a threshold dependent upon the application.
  • the threshold may conveniently be a simple numerical count, or could alternatively be some more complex metric depending not only upon the number of matching key values, but also upon the number of times that those key values match the sample record and/or match the corresponding stored record.
  • the numerical count may be modified or scaled according to the particular masks associated with enrolled and sample records.
  • any or all of the reference positions, bit patterns and means of forming a key value or list of key values may be hand-crafted (user generated) or alternatively could be generated automatically from the stored records.
  • the list of key values could be selective (for example some of the words to be found within the text of a book), or could be comprehensive (all occurring words are automatically added to the list).
  • the key values may all be of the same type or class, but that is not essential and it is contemplated that a single list may contain features of a variety of types (for example individual words, phrases, font size and font information, layout information and so on). Instead of being a fragment of the stored record, the key values might alternatively be derived in some other way, for example by hashing of the record or applying some other type of operation to it or to a part of it.
  • the enrolment and/or sample masks may be hand-crafted (user generated) or alternatively could be generated automatically from the stored records.
  • sample record and the list of possible matching records may then be passed to a more sophisticated or exhaustive matching algorithm to determine which of the possible matches are true matches.
  • Such a method provides very fast candidate-matching at the expense of additional effort and memory utilization when registering a new record within the database.
  • the trade-off is well worthwhile in a system where a record is enrolled only once and subsequently searched against many sample records. This is true of many, if not most, applications. It can be of great advantage to devote more processing cost to enrolling than to searching and, as is not generally appreciated, to trade larger memory for faster matching.
  • a system for identifying possible matches between a sample record and a plurality of stored records comprising:
  • separate processors may be used for matching key values against sample records, and for identifying stored records as possible matches. These processors may be on separate computers, and may be remote from each other.
  • the main data list including the full collection of stored records may be held separately from the lists of record identifiers. That allows a local processor, for example a processor embedded within a photocopying machine, to carry out the initial analysis using key values extracted from a sample record such as a photocopied page of text. Once a list of possible matches has been identified, that list can then be passed to a remote server, where a more detailed analysis can be carried out by comparing the sample with the full text of each of the possible matches.
  • This approach has the further advantage that the designer of the system does not need to distribute to a large number of users full copies of the entire corpus of copyright works. Instead, each user simply receives a list of key values, which is enough for the initial analysis to be carried out locally. Where one or more possible matches are found, the system may then automatically report to a central location where further analysis can be carried out against the full documents.
  • Figure 1 shows records of text in a database of books with associated data according to an exemplary embodiment of the invention.
  • Figure 2 illustrates the formation of key values from key patterns in a record of text according to an exemplary embodiment of the invention.
  • Figure 3 illustrates a mask that may be associated with reference positions in a record of data.
  • Figure 4 shows how lists of Record Identifiers are associated with Key Values at Reference Positions through a Key Mask according to an embodiment of the invention.
  • Figure 5 describes the process of enrolment of record identifiers from a data record into a database when the particular key value has not previously occurred at a particular reference position.
  • Figure 6 describes the process of enrolment of record identifiers from a data record into a database when the particular key value has previously occurred at a particular reference position.
  • Figure 7 illustrates the indexed matching of a sample record.
  • Figure 8 is a histogram exemplifying the matching for an example embodiment of the invention.
  • Figure 9 is another exemplary histogram.
  • Figure 10 shows some exemplary hardware.
  • a database contains details of a large number of published books which are currently in copyright.
  • a website has been found onto which has been posted lengthy extracts from a variety of books.
  • the task is to determine which, if any, of those extracts have been taken from books which are recorded within the database.
  • the database structure of this exemplary embodiment is shown schematically in Figure 1. Details of the individual books within the database are held within a case list or table 10, each row 11 of which represents an individual book. Columns 12, 13, 14 respectively hold a unique record identifier for each book, the book title, and the author. Of course, in a practical embodiment, many more details about each individual book would probably be held.
  • the text of each book is held in the data records of column 15 in some suitable encoded form. It may be convenient to subdivide the text of a book into pages, for example. More generally, column 15 may be considered to hold some generalised record of data.
  • the representation of data in a database may take many different forms and is not limited to the details in Figure 1, which are specific to the exemplary embodiment; a database may contain many different types of information encoded in many different forms. What is important to the present invention is that records of data, such as represented here by the text of a book 15, should be associated with a unique record identifier 12. It will be understood, of course, that the table 10 could be replaced by multiple linked tables in some embodiments.
  • the text of the book Peter Rabbit 16 will be used to describe the enrolment of a data record and also the matching of a sample record against enrolled data records.
  • Figure 2 illustrates the formation of key values from key patterns in a record of text 200 which is taken from the book whose unique record identifier is 237, namely the book 'Peter Rabbit'.
  • a record could be a record for enrolment or a sample record for matching.
  • key values may be created from patterns of data at pre-defined or calculated reference positions in a record of data. In the embodiment being described, however, for simplicity the key values are individual words from reference positions in the text in different books. In the book Peter Rabbit, for example, some of these may be "mother" 201, "accident" 202, and "parsley" 203.
  • the reference positions are on different successive pages; however, the invention is not so limited.
  • the specified data positions are in different places on each page of text. It is not necessary that all pages are chosen, nor that the pages are in any particular order; however, the reference positions chosen should be the same whenever key values are formed for a particular data record.
  • These reference positions need not be fixed, however, provided there is some means of determining them, for example by processing of the data itself, such that the reference positions are always the same for the same record of data.
  • individual words of text will be used in the illustration; however, the invention is not so limited.
  • the data patterns used to construct a key value could be of any form, could be different for different reference positions, and can be constructed by any method, provided that they always form the same key value for a particular reference position in a particular record of data.
  • a pattern of data positions 201, 202, 203, 204, 205 is created.
  • the positions might be the same on different pages, for example.
  • a pattern of data is created.
  • this might be individual words or even characters from a pattern related to the chosen page. For simplicity in the example individual words are chosen. It may, however, give better results in a practical embodiment for the pattern to be a selection of individual characters from particular parts of the page which are therefore unlikely to form a recognizable word.
  • the chosen pattern of data is then combined into a key value, which is the means of referencing the database.
  • the key value might be a string of characters made up from the pattern of characters selected for the particular page, and the process of combining them may, for example, be a simple rearrangement, or indeed a mathematical operation which turns the characters into an item of binary data.
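The two ways of forming a key value described above (a word taken from a fixed position, or a pattern of characters turned into binary data) can be sketched as follows. This is only an illustrative sketch: the function names, the choice of SHA-256, the 16-bit reduction and the page text are assumptions made here, not details taken from the patent.

```python
import hashlib


def word_key(page_text: str, word_index: int) -> str:
    """Key value = the word at a fixed index on the page, normalised."""
    words = page_text.split()
    return words[word_index].strip(".,;:!?'\"").lower()


def hashed_key(page_text: str, char_positions: list[int], bits: int = 16) -> int:
    """Key value = hash of characters picked at fixed positions, cut to `bits` bits."""
    # Positions beyond the end of the page are skipped (an assumption of this sketch).
    pattern = "".join(page_text[i] for i in char_positions if i < len(page_text))
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


# Made-up page text, used only to exercise the functions.
page = "Your father had an accident there; he was put in a pie."
accident = word_key(page, 4)
numeric = hashed_key(page, [0, 5, 12, 20])
```

Either function returns the same key value every time it is applied to the same record at the same reference position, which is the only property the description requires.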
  • a mask which is in one to one correspondence with the data positions, which selects which positions in the data are to be entered in the database.
  • a mask is binary although it could be more general providing that it can be used to select or exclude a particular key value for processing.
  • a mask may select only certain pages of a book.
  • books may have different numbers of pages, for example, with pages that do not exist masked out.
  • Other pages from within the book may be masked out, for example if they contain little or no text, or are missing or incomplete or in some other way unreliable.
  • this mask may be used, and the examples given are not intended to be limiting.
  • Figure 3 illustrates a mask that may be associated with reference positions in a record of data, and is used to select or exclude key values: in the illustrative embodiment using key values from the book 'Peter Rabbit'.
  • positions are shown as a list 301, although no such physical list of positions is actually necessary provided that the formation and use of key values is always associated implicitly or explicitly with a reference position.
  • the key values associated with the positions are also shown as a list, although no such physical list of key values is necessary provided every key value is associated implicitly or explicitly with a reference position.
  • the mask is shown as a list 302, although no such physical list is necessary provided every key value used in an embodiment has a mask associated with it and is associated implicitly or explicitly with a reference position. Such a mask could be one which selects all positions or no positions either implicitly or explicitly (but those are but special cases of the present invention).
  • Figure 3 shows how the mask is associated with the reference positions in the book 'Peter Rabbit' to select only pages on which words of the story actually occur: pages 8, 9, 14, 15, 20, 21, 26, 27, 32, 33 for example have text, others may be illustrations or may be blank.
  • Such a mask may be used with the present invention either when enrolling or matching records of data. For enrolment, it is clearly only useful to select reference positions where there is useful data. Similarly, for matching it is only useful to select reference positions where there is useful data. These selected positions may, in general, be different for different instances of the same data for enrolment or matching. Therefore in some embodiments of the present invention, the mask may be enrolled with the data.
  • Enrolled and sample records may have different masks, and therefore not all enrolled values may be used for matching.
  • the use of the mask prevents spurious key values from participating in matching and may therefore enhance the accuracy of matching, for example by preventing false matches.
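A binary mask of this kind can be sketched as a mapping from reference position to a selected/excluded flag. The names below, and the choice of page 10 as an excluded illustration page, are assumptions made for illustration only.

```python
def select_keys(key_values: dict[int, str], mask: dict[int, bool]) -> dict[int, str]:
    """Keep only the key values whose reference position the mask selects."""
    return {pos: key for pos, key in key_values.items() if mask.get(pos, False)}


def usable_positions(enrol_mask: dict[int, bool], sample_mask: dict[int, bool]) -> set[int]:
    """Positions selected by both masks; only these can contribute to a match."""
    return {pos for pos, on in enrol_mask.items() if on and sample_mask.get(pos, False)}


# Hypothetical data: page 10 is an illustration, and page 27 is missing from the sample.
keys = {8: "mother", 9: "accident", 10: "", 27: "parsley"}
enrol_mask = {8: True, 9: True, 10: False, 27: True}
sample_mask = {8: True, 9: True, 10: False, 27: False}

enrolled = select_keys(keys, enrol_mask)
shared = usable_positions(enrol_mask, sample_mask)
```

Here the empty key value at position 10 never reaches the database, and only positions 8 and 9 are available when this sample is compared against this enrolled record.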
  • the specified positions may be a subset of portions of the iris known to be reliable, for example avoiding ill-defined boundaries or positions where reflections are known to occur.
  • the reflection of the nose from the surface of the cornea is a feature known to degrade recognition of irides.
  • Such a mask may be fixed for a set or subset of irides enrolled in a database.
  • the mask may be used to exclude portions of a particular iris that are not useful, for example an eyelid which conceals the iris texture. The eyelid position will in practice be slightly different in every data record of the same eye, so this mask may be different for an enrolled iris and a sample iris from the same eye.
  • Intelligent use of masks may allow the method to be used in applications where data positions may be variable rather than fixed, for example in fingerprints where the relationship between features is more important than their exact position.
  • Groups of features can be enrolled as keys in the database at several positions, and then on matching keys from a sample record can be matched against sets of positions.
  • Figure 4 shows how lists of Record Identifiers are associated with Key Values at Reference Positions through a Key Mask according to an embodiment of the invention.
  • Figure 4 illustrates the masked index 400 with multiple reference positions 410.
  • a separate record identifier list 420 is created for each of the chosen key values 430 at each of the reference positions 410.
  • a list of key values 440 is presented and selected by a mask 450 either for enrolment or for matching as appropriate.
  • each row 430 holds a variety of different key values which may be found as in Figure 1 within the records of column 15 within the case list 10.
  • reference positions are shown as a list 410, although no such physical list of positions is actually necessary provided that the formation and use of key values 440 is always associated implicitly or explicitly with a reference position.
  • the key values associated with the positions are shown as a list 430, although no such physical list of key values is necessary provided that every key value used is associated implicitly or explicitly with a reference position.
  • the mask is shown as a list 450, although no such physical list is necessary provided every key value has a mask associated with it and is associated implicitly or explicitly with a reference position.
  • Such a mask could be one which selects all positions or no positions either implicitly or explicitly (but those are but special cases of the present invention).
  • each row in each record identifier list 420 or table simply contains the reference number of a single book which includes the relevant key at the relevant position, as will be further described below.
  • when a new book is to be registered or enrolled within the database, its details are added to the case list 10 and a check is carried out to see which of the key values 430 are contained at the particular reference positions 410 within that new book.
  • the book's record identifier is then added, as appropriate, to the individual record identifier occurrence lists 420. If desired, one or more new key values may be added to the key value lists 430, in which case additional record identifier lists 420 are automatically created.
  • Figure 5 illustrates enrolment of a book in the database when a key value is not previously known at a particular reference position.
  • for enrolment, a record identifier 505, reference position 506, key value 507 and key mask 508 are provided in any way which may be convenient to an embodiment.
  • the key value 'mother' 507 is formed as at 201 in Figure 2.
  • reference position 8 506 is selected for enrolment by the key mask 508 for 'Peter Rabbit' at 304 in Figure 3, the key value list for reference position 8 at
  • Figure 6 illustrates enrolment of a book in the database when a key value is previously known at a particular reference position.
  • a record identifier 605, reference position 606, key value 607 and key mask 608 are provided in any way which may be convenient to an embodiment.
  • the key value 'accident' is formed as at 202 in Figure 2.
  • reference position 9 is selected for enrolment by the key mask for 'Peter Rabbit' at 305 in Figure 3
  • the key value list for reference position 9 at 601 in Figure 6 is examined and it is seen that a book with the key value 'accident' has previously been enrolled for this position.
  • since the process of enrolment involves checking the key value list 601 for previous enrolment of a key value at a reference position, it will be clear to one skilled in the art that there may be a speed advantage if the key value list 601 is ordered, although the present invention is not limited to key value lists 601 which are ordered.
  • the record identifier of the new data record is added to the record identifier list for 'accident' 603 at 604, in this case the identifier '237' of 'Peter Rabbit'.
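The two enrolment cases of Figures 5 and 6 can be sketched with a nested dictionary standing in for the masked index. The dictionary layout, the function name and the record identifier 101 are assumptions made for illustration; the patent does not prescribe a storage structure.

```python
# index[reference position][key value] -> list of record identifiers
Index = dict[int, dict[str, list[int]]]


def enrol(index: Index, record_id: int,
          key_values: dict[int, str], mask: dict[int, bool]) -> None:
    """Add record_id to the identifier list of every mask-selected key value."""
    for pos, key in key_values.items():
        if not mask.get(pos, False):
            continue  # position excluded by the enrolment mask
        lists_at_pos = index.setdefault(pos, {})
        if key not in lists_at_pos:
            # Figure 5 case: key value not yet known at this position,
            # so a new record identifier list is created.
            lists_at_pos[key] = []
        # Figure 6 case (and the tail of the Figure 5 case): append the identifier.
        lists_at_pos[key].append(record_id)


index: Index = {}
# Hypothetical earlier book, record 101, with 'accident' at position 9.
enrol(index, 101, {9: "accident"}, {9: True})
# 'Peter Rabbit', record 237: 'mother' is new at position 8, 'accident' is known at 9.
enrol(index, 237, {8: "mother", 9: "accident"}, {8: True, 9: True})
```

After both calls, the list for 'accident' at position 9 holds both identifiers, while the list for 'mother' at position 8 was created for record 237 alone.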
  • a sample mask 450 may be associated with the test sample, to exclude or include particular portions of the sample data. For example, only selected pages may be available. By this means great flexibility in the selection of key values and the reference positions in which they match may be used.
  • Two kinds of matching tasks are common in the fields of use, namely 1:1 matching in which one is required to verify whether a sample record is a match with a particular chosen data record, and 1:N matching in which a sample record is to be matched against a database of N enrolled records with no prior knowledge of the expected answer.
  • the present invention can be used for both purposes, although the illustrative embodiment is concerned with the 1:N case when a sample of text is compared against an entire database of enrolled books to seek a match.
  • a match will occur if a sufficient number of key values at selected reference positions return the same record identifier. It may be an exact match in either 1:1 or 1:N matching if a particular key value occurs at all selected reference positions.
  • Figure 7 illustrates the indexed matching of a sample record which is the text of case 237, the book 'Peter Rabbit' in a database index such as 400 after a significant number of books have been enrolled.
  • a number of key values k_8 701, k_9 702, k_27 703 are provided together with mask values m_8 704, m_9 705, m_27 706, all taken from a sample record.
  • the sample record may be an exact match to a book in the database, as for example 'Peter Rabbit' at these positions.
  • the sample record may be a partial match, as for example 'Peter Rabbit' with pages missing or key values in error.
  • the sample record may be from a different book which happens to have the same key values at some positions.
  • the enrolled key values are held in lists 707, 708, 709 for each reference position, which are not ordered. In other embodiments this list may be ordered or may not physically exist. To match, the key values from a reference position of the sample are used to look up in the database the record identifier list 710, 711, 712 for the particular key value at the particular reference position. In some embodiments the appropriate record identifier list may be selected by some automatic method, in which case the key value lists may not physically exist. However, the record identifier lists are physically created and maintained.
  • the key value k_8 'mother' is presented at 701 and the key mask m_8 704 indicates this to be a selected position.
  • a record identifier list 710 is selected which contains all the record identifiers of all data records which contain the selected key value at the selected reference position 701. All the record identifiers in the selected record identifier list 710 are passed to a means of counting the occurrences or 'hits' on particular data records 713. In the case of 1:1 matching this may consist simply of counting the hits at a particular sample record identifier that has been presented for verification.
  • counting of hits at 713 may be by a more general method, including but not restricted to the formation of a histogram or bin-count 714 for at least some of the record identifiers in the database.
  • Such a histogram, which counts the hits 715 for a selection of record identifiers 716, could be created and initialised in advance, for example, or built on the fly as a sample match proceeds.
  • the processing can continue to extract and count record identifiers from selected record identifier lists in which a key value is enrolled at selected reference positions.
  • the occurrences of record identifiers are counted at reference positions which match a key value extracted from the reference positions and selected by a mask.
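The indexed matching of Figure 7 amounts to looking up one record identifier list per selected reference position and counting how often each identifier appears. The sketch below uses Python's `collections.Counter` as the histogram; the index contents and the identifiers 112 and 301 are hypothetical.

```python
from collections import Counter


def count_hits(index: dict[int, dict[str, list[int]]],
               sample_keys: dict[int, str],
               sample_mask: dict[int, bool]) -> Counter:
    """Histogram of record identifiers hit by the mask-selected sample key values."""
    hits: Counter = Counter()
    for pos, key in sample_keys.items():
        if not sample_mask.get(pos, False):
            continue  # position excluded by the sample mask
        # An unknown position or key value simply contributes no hits.
        hits.update(index.get(pos, {}).get(key, []))
    return hits


# Hypothetical index after several books have been enrolled.
index = {
    8: {"mother": [112, 237]},
    9: {"accident": [112, 237, 301]},
    27: {"parsley": [237]},
}
sample = {8: "mother", 9: "accident", 27: "parsley"}

full = count_hits(index, sample, {8: True, 9: True, 27: True})
partial = count_hits(index, sample, {8: True, 9: True, 27: False})
```

With all three positions selected, record 237 collects three hits; masking out position 27 leaves it tied with record 112 on two hits, which is why the later thresholding and exhaustive matching stages are still needed.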
  • Figure 8 illustrates an example in which the sample text has matched against the key values "mother" and "accident". The count is shown schematically as a histogram, although such a histogram would not necessarily be plotted in a working embodiment.
  • there are two books in the database that have two matched key values, namely "The Lion, the Witch and the Wardrobe" and "Peter Rabbit". "The Witches" has one match and "Peter Pan" has none.
  • a threshold is applied to the count, and any book which scores at least the threshold value is considered to be a candidate match.
  • if the threshold is taken as one, all of the books except Peter Pan are candidate matches, and if the threshold is taken as two then the candidates are The Lion, the Witch and the Wardrobe and Peter Rabbit.
  • Figure 9 represents another text sample in which matches have been found against the key values "mother", "accident" and "parsley". If a threshold of three is chosen, a match has been found in Peter Rabbit.
  • the value of the threshold may be selected by the user by trial and error, according to the particular application and the extent to which the pre-selection process needs to remove a large number of cases from consideration in order to speed up the overall matching process.
  • while a simple count and a fixed threshold is a convenient way of dividing possible matches from non-matches, other algorithms could equally well be used.
  • One possible approach, for example, would be to select as a possible match all of those cases having a record identifier count which is more than a fixed percentage higher than the average (e.g. mean, median or mode) characteristic count taken across all cases.
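Both selection rules can be sketched using the counts described for Figure 8. The function names are assumptions, and the relative rule shown uses the mean, which is only one of the averages the text mentions.

```python
from statistics import mean


def candidates_fixed(hits: dict[str, int], threshold: int) -> list[str]:
    """Records whose hit count is at least a fixed threshold."""
    return sorted(rec for rec, n in hits.items() if n >= threshold)


def candidates_relative(hits: dict[str, int], percent_above: float) -> list[str]:
    """Records whose hit count exceeds the mean count by more than percent_above %."""
    if not hits:
        return []
    cutoff = mean(hits.values()) * (1 + percent_above / 100)
    return sorted(rec for rec, n in hits.items() if n > cutoff)


# Counts as described for Figure 8.
hits = {
    "The Lion, the Witch and the Wardrobe": 2,
    "Peter Rabbit": 2,
    "The Witches": 1,
    "Peter Pan": 0,
}
at_least_one = candidates_fixed(hits, 1)
at_least_two = candidates_fixed(hits, 2)
above_mean = candidates_relative(hits, 50)
```

For these counts the mean is 1.25, so a margin of 50% gives a cutoff of 1.875 and selects the same two books as a fixed threshold of two.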
  • when a sample record is presented to a database for matching, different data records may have been enrolled with different numbers of reference positions selected.
  • the data records may have been of different lengths, for example in the case of books the number of pages may vary widely, so that it is possible that a short book such as 'Peter Rabbit' which has only 17 pages with text may be matched against a much more substantial volume such as 'The Lion, the Witch and the Wardrobe' with over 200 pages. Because of the difference in size, in general a longer text may have more hits than a shorter one.
  • the present invention can provide a means of correcting for differences in the number of reference positions selected using the key value mask. On enrolment a key value mask provided for enrolment may be saved for data records.
  • a practical example of scaling the hits in matching masked data records may be in the field of biometrics, for example in matching data records which are templates coded from images of human irides.
  • An enrolled template may, for example, be accompanied by a mask of length s_t which indicates that some regions of the iris are not to be processed, for example eyelids, eyelashes and unwanted reflections particularly but not exclusively from sources of illumination. Only key values selected by the mask presented at enrolment may be used to select record identifier lists 503 where record identifiers are entered 504.
  • the number of positions where the record identifier is entered in a record identifier list will usually be less than the total number of reference positions used, s_t, because of the masking.
  • the mask presented at enrolment may be saved in the database and associated with the record identifier in some way. Later, on presenting a sample for 1:1 or 1:N matching, only those reference positions selected by the sample mask are used for retrieval of the record identifiers indicated by the key value.
  • the sample mask will, in general, be different from the mask saved at enrolment. Therefore the number of reference positions from which the matching identifier may produce hits is reduced still further. Thus the number of hits will always be no greater than the number of positions selected by both masks, which we call the intersection, and is often considerably less.
  • One method of scaling may therefore be to scale the number of hits according to
  • a matching iris which has its number of hits arising from matching here called the Raw Hits
  • the Raw Hits will in general have its score increased if the combined effect of the sample and enrolment mask reduces the number of available reference positions. This may make the Scaled Hits a more reliable indication of the quality of a match, and may lead to a smaller number of false matches in practice.
  • if very few reference positions are available because of very heavy masking leading to a small intersection, it may be better to reject a data record rather than risk a false match which could be the result of a large scaling factor.
  • This factor could of course be infinite, although one skilled in the art would be expected to avoid this occurring.
  • a large scaling factor may be a very rare event, but should be borne in mind, for example in some biometric systems where a false match may be considered far more serious than a false rejection.
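The scaling formula itself is not reproduced in this text. The sketch below therefore assumes one form that is consistent with the description: the Raw Hits are multiplied by the total number of reference positions divided by the size of the mask intersection, so the score rises as the intersection shrinks and the factor would be infinite for an empty intersection. The function name and the `min_intersection` rejection parameter are likewise assumptions.

```python
from typing import Optional


def scaled_hits(raw_hits: int, enrol_mask: list[int], sample_mask: list[int],
                min_intersection: int = 1) -> Optional[float]:
    """Assumed scaling: raw_hits * (total positions / positions selected by both masks).

    Returns None when the intersection is too small to trust the scaled score.
    """
    if len(enrol_mask) != len(sample_mask):
        raise ValueError("masks must cover the same reference positions")
    total = len(enrol_mask)
    intersection = sum(1 for e, s in zip(enrol_mask, sample_mask) if e and s)
    # Never divide by zero, whatever rejection limit the caller asks for.
    if intersection < max(min_intersection, 1):
        return None
    return raw_hits * total / intersection


# Four reference positions, two of them selected by both masks.
score = scaled_hits(2, [1, 1, 1, 0], [1, 1, 0, 0])
rejected = scaled_hits(2, [1, 1, 1, 0], [0, 0, 0, 1])
```

Raising `min_intersection` makes the pre-screen reject heavily masked records outright, which may suit biometric systems where a false match is considered far more serious than a false rejection.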
  • Depending upon the size of the sample to be evaluated, it may not be necessary to use the sample in its entirety. For example, if the sample consists of several chapters of a book, it may be enough to carry out the pre-selection based on just one page of text.
  • a key value might be a fragment or pattern of data of a stored record, or it might alternatively be derived in some other way from the stored record, for example by applying some operation such as a hash function.
  • the latter approach may be advantageous in some applications since it can avoid the need to carry out a search when matching the sample record. Instead, the sample record is simply processed (e.g. by hashing) to extract one or more key values from it, these then directly being used as indexes to a list of key values with pointers to all the lists of stored records which contain those particular keys.
  • where the number of possible key values is finite and is known in advance, it may be desirable in some applications for all possible key values within a defined range to be pre-registered. Such an arrangement obviates the need, on matching, to search the key value lists 430. Instead the sample record is simply processed to extract its key values, and subject to selection by the mask, the corresponding rows in the lists 430 or 707 for example are used as pointers to the record identifier lists applicable to those particular key values.
  • explicit or physical key value lists may not be necessary, for example in a biometric database where key values may be 16-bit numbers, in which case there are 2^16 (65536) possible key values and hence 2^16 (65536) possible record identifier lists for each reference position.
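With 16-bit key values, the key value list can be replaced by a table with one row per possible key, so that the key value is itself the row number. The following sketch shows one reference position; the names and the example key value 0x3A7F are assumptions made for illustration.

```python
KEY_BITS = 16
NUM_KEYS = 1 << KEY_BITS  # 65536 possible key values


def new_position_table() -> list[list[int]]:
    """One (initially empty) record identifier list per possible key value."""
    return [[] for _ in range(NUM_KEYS)]


def enrol_direct(table: list[list[int]], key: int, record_id: int) -> None:
    """Enrol without any search: the key value is used directly as the row number."""
    if not 0 <= key < NUM_KEYS:
        raise ValueError("key value out of range")
    table[key].append(record_id)


def lookup_direct(table: list[list[int]], key: int) -> list[int]:
    """Return the record identifier list for a key value, again without searching."""
    if not 0 <= key < NUM_KEYS:
        raise ValueError("key value out of range")
    return table[key]


table = new_position_table()
enrol_direct(table, 0x3A7F, 237)
enrol_direct(table, 0x3A7F, 512)
found = lookup_direct(table, 0x3A7F)
```

This is the memory-for-speed trade described earlier: every reference position carries 65536 lists, most of which may be empty, in exchange for constant-time enrolment and lookup.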
  • lists of key values may be stored in the database, and a search may be required to determine if the key value exists and where its associated record identifier list is to be found.
  • Where a list of key values is ordered, there may be strategies for locating the key values and associated lists quickly, for example by a binary search of the values and an associated list of pointers to the lists of record identifiers.
  • the present invention is not limited by any particular method of associating a key value with a record identifier list and those skilled in the art may identify many such methods.
  • the key values do not represent every possible word and every possible position, but are stored in lists of key values 430, 701, 702, 703. It is accordingly necessary to examine the lists when carrying out a match. This might be done by a straightforward search of key values at a reference position to determine if a sample key value exists. A similar search is carried out at enrolment as described above. However, when matching, nothing will be added to the database; instead, information will be extracted from the record identifier lists to determine to what extent the sample record matches a data record already enrolled in the database.
  • the key value might be a numeric code of a particular length (e.g. 16 bits, allowing 65536 possible characteristic values to occur).
  • In a database there might be millions or billions of records, so that each possible key value may occur many times.
  • having a plurality of reference positions and using a mask may enhance the performance of the database.
  • It may even be possible to dispense with the key value lists 430 entirely. If the list is ordered and contains all possible characteristic values within a defined characteristic space (for example the numbers 1 to 65536), maintaining the list as a separate entity is unnecessary since all of its values can be inferred. In such a case, a key value n which has been extracted from a sample can be used as an index to go straight to row n of a look-up table and thus directly to the corresponding record identifier list.
  • a more detailed match may then be carried out against each of the possibilities, using any convenient matching algorithm.
  • the sample text may be compared word for word against the full text of each of the possible matches.
  • the database itself may be held on the same computer or at the same location where the preliminary and/or the final matching takes place.
  • the process may be distributed, with the preliminary matching being carried out according to a characteristic list held at a local computer, and the preliminary matches being passed on to a remote computer for the detailed matching to take place.
  • the primary case list 10 (which includes the full data representing all the cases) may be held at a central location, with a local machine needing to hold just the key value lists 430 (if any) and the individual record identifier lists 420.
  • the process of the present invention may further be speeded up by using multiple computers or processors operating in parallel.
  • a user computer 1010 forwards a matching task to a controller 1020 which splits it up and distributes it between a plurality of computers or processors 1030.
  • Each processor 1030 may be instructed to handle a particular characteristic or group of keys; alternatively, the controller 1020 may split up the work in some other way.
  • the processors 1030 pass their results on to a consolidator 1040, which finalises the selection of possible matches (for example using the procedure illustrated in Figure 7).
  • the list of possibilities is then forwarded as required, either to a computer or processor 1050 which carries out the detailed matching or as shown by reference numeral 1060 back to the user 1010 for further analysis.
  • one embodiment may be in hardware, such as implemented to operate on a device or combination of devices, for example, whereas another embodiment may be in software.
  • an embodiment may be implemented in firmware, or as any combination of hardware, software, and/or firmware, for example.
  • one embodiment may comprise one or more articles, such as a storage medium or storage media.
  • Such storage media, for example one or more CD-ROMs and/or disks, may have stored thereon instructions that, when executed by a system such as a computer system, computing platform, or other system, may result in an embodiment of a method in accordance with claimed subject matter being executed, such as one of the embodiments previously described.
  • a computing platform may include one or more processing units or processors, one or more input/output devices, such as a display, a keyboard and/or a mouse, and/or one or more memories, such as static random access memory, dynamic random access memory, flash memory, and/or a hard drive.
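The two look-up strategies described in the list above (direct indexing when every possible key value is pre-registered, and binary search of an ordered key value list otherwise) can be sketched as follows. This is only an illustrative sketch for a single reference position, not the implementation disclosed in the application: the function names, the 0-based key range, and the in-memory Python lists are all assumptions made for the example.

```python
from bisect import bisect_left

KEY_BITS = 16               # assumed key width, following the 16-bit example
NUM_KEYS = 1 << KEY_BITS    # 65536 possible key values (0..65535 here)


def build_direct_table(records):
    """records maps a record identifier to its key value.

    Returns a table with one record identifier list per possible key
    value, so no separate key value list has to be stored or searched.
    """
    table = [[] for _ in range(NUM_KEYS)]
    for record_id, key in records.items():
        table[key].append(record_id)
    return table


def lookup_direct(table, key):
    # Key value n is used as an index straight to row n of the table.
    return table[key]


def build_sorted_lists(records):
    """Sparse alternative: an ordered key value list plus a parallel
    list of record identifier lists (hypothetical layout)."""
    by_key = {}
    for record_id, key in records.items():
        by_key.setdefault(key, []).append(record_id)
    keys = sorted(by_key)
    return keys, [by_key[k] for k in keys]


def lookup_sorted(keys, id_lists, key):
    # Binary search of the ordered key values; an absent key yields [].
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return id_lists[i]
    return []


records = {"r1": 5, "r2": 65535, "r3": 5}
table = build_direct_table(records)
keys, id_lists = build_sorted_lists(records)
```

The direct table spends memory on every possible key value in exchange for a constant-time look-up, whereas the ordered list stores only the key values that actually occur and pays for a logarithmic search on each match.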

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method of improving the speed with which a sample data record can be matched against records in a database. The method comprises defining a list of possible key values (430), testing these key values against the sample and, for each record in the database, counting the number of key values that match both the record and the sample at reference positions selected by a mask. A list of possible matches is then chosen on the basis of this count, for more detailed matching or analysis. The method provides very fast matching in exchange for additional effort when a new record is enrolled in the database.
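The counting step summarised in the abstract can be illustrated with a minimal sketch. It assumes that each record has already been reduced to one key value per reference position, that the mask is a list of booleans of the same length as the sample, and that the record identifier lists are held in a dictionary keyed by (position, key value). The names `enrol`, `preselect` and `min_count` are hypothetical and do not come from the application.

```python
from collections import Counter


def enrol(index, record_id, keys):
    """Add a record to the index: one entry per reference position.

    index maps (position, key value) to a list of record identifiers.
    This is the extra work done once at enrolment time.
    """
    for pos, key in enumerate(keys):
        index.setdefault((pos, key), []).append(record_id)


def preselect(index, sample_keys, mask, min_count):
    """Count, for each enrolled record, the key values it shares with
    the sample at the reference positions selected by the mask, and
    return the records reaching min_count, best candidates first."""
    counts = Counter()
    for pos, key in enumerate(sample_keys):
        if not mask[pos]:
            continue  # reference position deselected by the mask
        for record_id in index.get((pos, key), ()):
            counts[record_id] += 1
    return [(rid, c) for rid, c in counts.most_common() if c >= min_count]


index = {}
enrol(index, "a", [1, 2, 3, 4])
enrol(index, "b", [1, 9, 3, 7])
enrol(index, "c", [8, 8, 8, 8])

sample = [1, 2, 3, 7]
mask = [True, True, True, False]   # ignore the last reference position
candidates = preselect(index, sample, mask, min_count=2)
```

Only the records returned by `preselect` would then be passed to a slower, more detailed matcher, which is where the speed gain is expected to come from.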
PCT/EP2011/070075 2011-11-14 2011-11-14 Mise en correspondance rapide de base de données WO2013071953A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP11784481.1A EP2780830A1 (fr) 2011-11-14 2011-11-14 Mise en correspondance rapide de base de données
PCT/EP2011/070075 WO2013071953A1 (fr) 2011-11-14 2011-11-14 Mise en correspondance rapide de base de données
BR112014011646A BR112014011646A2 (pt) 2011-11-14 2011-11-14 método de identificação de partidas entre um registro de dados de amostra e uma pluralidade de registros de dados de inscritos; e sistema de identificação de possíveis correspondências

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2011/070075 WO2013071953A1 (fr) 2011-11-14 2011-11-14 Mise en correspondance rapide de base de données

Publications (1)

Publication Number Publication Date
WO2013071953A1 true WO2013071953A1 (fr) 2013-05-23

Family

ID=44992913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/070075 WO2013071953A1 (fr) 2011-11-14 2011-11-14 Mise en correspondance rapide de base de données

Country Status (3)

Country Link
EP (1) EP2780830A1 (fr)
BR (1) BR112014011646A2 (fr)
WO (1) WO2013071953A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9846739B2 (en) 2006-10-23 2017-12-19 Fotonation Limited Fast database matching
CN111368527A (zh) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 一种键值匹配方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778179B (zh) * 2014-01-14 2019-05-28 阿里巴巴集团控股有限公司 一种数据迁移测试方法和系统

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020059197A1 (en) * 1997-10-31 2002-05-16 Hunter Van A. Longest best match search
US20070276853A1 (en) * 2005-01-26 2007-11-29 Honeywell International Inc. Indexing and database search system
US20080097992A1 (en) * 2006-10-23 2008-04-24 Donald Martin Monro Fast database matching
WO2008050107A1 (fr) * 2006-10-23 2008-05-02 Donald Martin Monro Mise en correspondance de bases de données approximative
EP1956517A1 (fr) * 2007-02-07 2008-08-13 WinBooks s.a. Procédé informatique pour traiter des opérations comptables et produit logiciel pour mettre en oeuvre un tel procédé
GB2473313A (en) * 2009-06-15 2011-03-09 Honeywell Int Inc Adaptive iris matching
US20110249872A1 (en) * 2010-04-09 2011-10-13 Donald Martin Monro Image template masking

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BENTLEY J L: "Multidimensional binary search trees used for associative searching", COMMUNICATIONS OF THE ASSOCIATION FOR COMPUTING MACHINERY, ACM, NEW YORK, NY, US, vol. 18, no. 9, 1 September 1975 (1975-09-01), pages 509 - 517, XP008087673, ISSN: 0001-0782, DOI: 10.1145/361002.361007 *
HOLLINGSWORTH K P ET AL: "The Best Bits in an Iris Code", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 31, no. 6, 1 June 2009 (2009-06-01), pages 964 - 973, XP011266637, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2008.185 *
TSAPATSOULIS N ET AL: "Facial image indexing in multimedia databases", PATTERN ANALYSIS AND APPLICATIONS, SPRINGER, NEW YORK, NY, US, vol. 4, no. 2-3, 1 January 2001 (2001-01-01), pages 93 - 107, XP002230774, ISSN: 1433-7541, DOI: 10.1007/PL00014577 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9846739B2 (en) 2006-10-23 2017-12-19 Fotonation Limited Fast database matching
CN111368527A (zh) * 2020-02-28 2020-07-03 上海汇航捷讯网络科技有限公司 一种键值匹配方法
CN111368527B (zh) * 2020-02-28 2023-06-20 上海汇航捷讯网络科技有限公司 一种键值匹配方法

Also Published As

Publication number Publication date
BR112014011646A2 (pt) 2017-05-02
EP2780830A1 (fr) 2014-09-24

Similar Documents

Publication Publication Date Title
KR101153033B1 (ko) 사본 탐지 및 삭제 방법
KR101231560B1 (ko) 데이터 클러스터와 유의어의 탐색과 수정에 대한 방법 및 시스템
Wang et al. Efficient approximate entity extraction with edit distance constraints
KR101201037B1 (ko) 키워드와 웹 사이트 콘텐츠 사이의 관련성 검증
US20130110839A1 (en) Constructing an analysis of a document
CN110929125B (zh) 搜索召回方法、装置、设备及其存储介质
GB2513472A (en) Resolving similar entities from a database
Sood et al. Probabilistic near-duplicate detection using simhash
JP5605583B2 (ja) 検索方法、類似度計算方法、類似度計算及び同一文書照合システムと、そのプログラム
CN107844533A (zh) 一种智能问答系统及分析方法
US20100306214A1 (en) Identifying modifiers in web queries over structured data
CN110134777B (zh) 问题去重方法、装置、电子设备和计算机可读存储介质
EP2095277A1 (fr) Mise en correspondance de bases de données approximative
GB2493587A (en) Entity resolution system identifying non-distinct names in a set of names
CN110162752B (zh) 文章判重处理方法、装置及电子设备
Liu et al. An image-based near-duplicate video retrieval and localization using improved edit distance
Sarwar et al. An effective and scalable framework for authorship attribution query processing
Michelson et al. Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web
EP2084623A1 (fr) Concordance rapide de base de données
EP2780830A1 (fr) Mise en correspondance rapide de base de données
US9846739B2 (en) Fast database matching
Moravec et al. A comparison of extended fingerprint hashing and locality sensitive hashing for binary audio fingerprints
CN115269765A (zh) 账号识别方法、装置、电子设备和存储介质
Cha An effective and efficient indexing scheme for audio fingerprinting
US10552459B2 (en) Classifying a document using patterns

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11784481

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2011784481

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2011784481

Country of ref document: EP

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112014011646

Country of ref document: BR

ENP Entry into the national phase

Ref document number: 112014011646

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20140514