CN107784110B - Index establishing method and device - Google Patents

Info

Publication number
CN107784110B
Authority
CN
China
Prior art keywords
index
hash value
target text
hash
bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711069369.8A
Other languages
Chinese (zh)
Other versions
CN107784110A (en
Inventor
谢永恒
张侠
火一莽
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201711069369.8A priority Critical patent/CN107784110B/en
Publication of CN107784110A publication Critical patent/CN107784110A/en
Application granted granted Critical
Publication of CN107784110B publication Critical patent/CN107784110B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses an index establishing method and device. The method comprises the following steps: extracting feature words of a target text; sorting the feature words to obtain a feature character string; applying a MinHash algorithm to the feature character string to obtain a hash value corresponding to the target text; searching a mapping cache pool for an index mapping bucket matching the hash value and, if one exists, establishing an index between the hash value and the target text in that index mapping bucket; and, if no index mapping bucket matching the hash value exists in the mapping cache pool, establishing an index mapping bucket matching the hash value and establishing an index between the hash value and the target text. The index establishing method provided by the embodiment of the invention reduces the index storage amount, classifies similar texts by establishing their indexes in the same index mapping bucket, and improves the retrieval speed of similar texts.

Description

Index establishing method and device
Technical Field
The embodiment of the invention relates to the field of information indexing and query, in particular to an index establishing method and device.
Background
In recent years, with the rapid development and popularization of internet technology, we often need to quickly and accurately find the data we want in massive data sets; this process is called similarity search.
With the rapid increase of network data, the search speed has become a bottleneck of similarity search, and therefore, how to design a fast and effective index structure becomes an urgent requirement of similarity search in the big data era. One of the commonly used index techniques is a tree-based index, typically a KD tree. The index of the tree structure adopts the structural design of subspace division, object data are divided into a plurality of subspaces, each subspace contains similar data, and when searching is carried out, only a certain subspace range is searched, so that the retrieval speed is effectively improved in a low-dimensional feature space. However, when the feature dimension of the search object increases, the efficiency of tree structure indexing is greatly reduced, and the efficiency is hardly improved compared with the time complexity of linear search. Another indexing technique is based on the conventional hash function indexing, such as md5, which is based on the principle that the original content is randomly mapped as uniformly as possible to a signature, so that even if the original content only differs by one byte, the generated signatures are likely to be very different. If the two signatures are equal, the original content is equal with a certain probability, and if not, no information is provided except for indicating that the original content is not equal. Therefore, the traditional hash function-based indexing technology cannot determine the similarity between original contents by comparing the similarity of signatures, and has certain limitations.
Disclosure of Invention
The embodiment of the invention provides an index establishing method and device, which effectively reduce the storage capacity of index data and further improve the retrieval speed.
In a first aspect, an embodiment of the present invention provides an index establishing method, where the method includes:
extracting feature words of a target text;
sequencing the feature words to obtain feature character strings;
applying a MinHash algorithm to the characteristic character string to obtain a Hash value corresponding to the target text;
searching whether an index mapping bucket matched with the hash value exists in a mapping cache pool, and if so, establishing an index between the hash value and the target text in the index mapping bucket;
if the index mapping bucket matched with the hash value does not exist in the mapping cache pool, establishing the index mapping bucket matched with the hash value, and establishing an index between the hash value and the target text.
Further, establishing an index between the hash value and the target text in the index mapping bucket comprises:
if the index mapping bucket does not have the index hash value which is the same as the hash value, storing the hash value into the index mapping bucket, and establishing an index between the hash value and the target text;
and if the index mapping bucket has the index hash value which is the same as the hash value, the index mapping bucket does not save the hash value again, and the index between the index hash value and the target text is directly established.
Further, the method further comprises:
and if the index mapping bucket matched with the hash value exists in the mapping cache pool, recommending the text data corresponding to the hash value as the text data similar to the target text.
Further, the method further comprises:
randomly determining N hash functions;
respectively carrying out Hash operation on the characteristic character strings of the target text based on the N Hash functions to obtain N Hash values;
counting the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool;
sequencing the number of the similar hash values, and determining a recommended text data set similar to the target text according to a sequencing result;
calculating the similarity between the target text and each recommended text data in the recommended text data set;
recommending the recommended text data with the similarity meeting the set threshold; wherein N is a positive integer.
Further, the extracting the feature words of the target text comprises:
performing word segmentation on the target text;
and determining the characteristic words of the target text according to the part of speech and the occurrence frequency of each participle.
Further, the segmenting the target text comprises:
and selecting corresponding word segmentation units according to word frequency and part of speech based on a large-granularity or small-granularity mode to segment the target text by taking words as units, and labeling the part of speech of each word.
Further, before segmenting the target text, the method further comprises:
and filtering unrecognizable characters in the target text.
In a second aspect, an embodiment of the present invention provides an index creating apparatus, where the apparatus includes:
the characteristic word extraction module is used for extracting the characteristic words of the target text;
the sorting module is used for sorting the feature words to obtain feature character strings;
the first operation module is used for applying a MinHash algorithm to the characteristic character string to obtain a Hash value corresponding to the target text;
a first establishing module, configured to find whether an index mapping bucket matching the hash value exists in a mapping cache pool, and if so, establish an index between the hash value and the target text in the index mapping bucket;
and the second establishing module is used for establishing an index mapping bucket matched with the hash value and establishing an index between the hash value and the target text if the index mapping bucket matched with the hash value does not exist in the mapping cache pool.
Further, the apparatus further comprises:
a storage unit, configured to store the hash value in the index mapping bucket if an index hash value that is the same as the hash value does not exist in the index mapping bucket, and establish an index between the hash value and the target text;
and the establishing unit is used for directly establishing the index between the index hash value and the target text without storing the hash value again if the index hash value identical to the hash value already exists in the index mapping bucket.
Further, the apparatus further comprises:
the recommending module is used for recommending the text data corresponding to the hash value as the text data similar to the target text if the index mapping bucket matched with the hash value exists in the mapping cache pool; or for randomly determining N hash functions; respectively carrying out Hash operation on the characteristic character strings of the target text based on the N Hash functions to obtain N Hash values; counting the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool; sequencing the number of the similar hash values, and determining a recommended text data set similar to the target text according to a sequencing result; calculating the similarity between the target text and each recommended text data in the recommended text data set; recommending the recommended text data with the similarity meeting the set threshold; wherein N is a positive integer.
According to the index establishing method provided by the embodiment of the invention, a MinHash algorithm is applied to a characteristic word character string of a target text to obtain a Hash value corresponding to the target text, whether an index mapping bucket matched with the Hash value exists in a mapping cache pool is further searched, if the index mapping bucket exists, an index between the Hash value and the target text is established in the index mapping bucket, if the index mapping bucket matched with the Hash value does not exist in the mapping cache pool, the index mapping bucket matched with the Hash value is established, and an index between the Hash value and the target text is established, so that indexes of similar text data are stored in the same index mapping bucket in a Hash value mode, storage indexes of similar data are reduced, and the retrieval speed is improved.
Drawings
Fig. 1 is a schematic flow chart of an index establishing method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of an index establishing method according to a second embodiment of the present invention;
fig. 3 is a schematic flow chart of neighbor hash classification, index establishment and similar text query according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of an index creating apparatus according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the various steps may be rearranged. The process may be terminated when its steps are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a schematic flow chart of an index creating method according to an embodiment of the present invention, where the index creating method according to the embodiment is applicable to creating indexes for large batches of text data, and the index creating method may be executed by an index creating device. Referring to fig. 1, the method specifically includes the following steps:
Step 110: extracting the feature words of the target text.
Specifically, the feature words of the target text can be extracted based on Chinese word segmentation. During segmentation, the text is segmented and the segments are ranked by word frequency; text semantic analysis, part-of-speech adjustment and the like are then used to find the segments that accurately reflect the meaning of the text, and these segments are used as the feature words. The feature words are then sorted according to a preset strategy to obtain a feature character string, and a MinHash algorithm is applied to the feature character string to obtain a hash value corresponding to the target text. The hash value serves as the identifier of the target text, so the target text can be indexed through its hash value, which reduces the index storage amount compared with directly storing the feature text as the index. In addition, when the index is used to search for similar texts, the corresponding hash values can be compared directly, which greatly reduces the complexity of identifying similarity between texts and improves search efficiency. Finally, because the feature words are obtained by intelligent word segmentation rather than simple keyword extraction, higher recognition accuracy is ensured.
Illustratively, extracting the feature words of the target text comprises:
performing word segmentation on the target text; specifically, the word segmentation of the target text comprises the following steps: and selecting corresponding word segmentation units according to word frequency and part of speech based on a large-granularity or small-granularity mode to segment the target text by taking words as units, and labeling the part of speech of each word.
And determining the characteristic words of the target text according to the part of speech and the occurrence frequency of each participle.
Specifically, the target text is divided into words, and either a large-granularity mode or a small-granularity mode can be used. For example, for the target text "world cup football match", segmentation in the large-granularity mode gives "world cup / football match", while segmentation in the small-granularity mode gives "world cup / football / match"; the two modes differ in the length of the resulting segments. In actual segmentation, intelligent word frequency statistics can be combined: a specific word segmentation unit is selected according to word frequency and part of speech, and the part of speech of each segment is labeled. For example, for the target text "2014 world cup football match held in Brazil", the segmentation result may be: 2014 (numeral), world cup (noun), football match (noun), Brazil (place noun), held (verb). Several of the segments are then selected as the feature words of the target text. If, for example, only three feature words are selected, a place noun, a noun and a verb generally reflect the meaning of the original text best. Since each segment appears only once in this example, word frequency can be ignored, and "Brazil", "world cup" and "held" can be selected as the feature words.
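The selection rule in the example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the part-of-speech tag names, the `POS_ORDER` list and the function `select_feature_words` are hypothetical, and the input is assumed to be already segmented and tagged by some external word segmenter.

```python
from collections import Counter

# Hypothetical tag names; a real tagger would use its own tag set.
# One feature word is taken per part of speech, in this order.
POS_ORDER = ["place_noun", "noun", "verb"]

def select_feature_words(tagged_tokens):
    """Pick one feature word per preferred part of speech.

    tagged_tokens is a list of (word, part_of_speech) pairs.
    Within one part of speech the most frequent word wins.
    """
    freq = Counter(word for word, _ in tagged_tokens)
    features = []
    for pos in POS_ORDER:
        words = [w for w, p in tagged_tokens if p == pos]
        if words:
            # max() returns the first maximal item, so ties go to the
            # word that appears earliest in the text.
            features.append(max(words, key=lambda w: freq[w]))
    return features

# Assumed segmentation of "2014 world cup football match held in Brazil".
tagged = [
    ("2014", "numeral"),
    ("world cup", "noun"),
    ("football match", "noun"),
    ("Brazil", "place_noun"),
    ("held", "verb"),
]
features = select_feature_words(tagged)
print(features)
```

With every word occurring once, frequency plays no role and the part-of-speech order alone decides the result, which mirrors the reasoning in the example.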
Further, before segmenting the target text, the method further comprises the following steps:
and filtering unrecognizable characters in the target text.
Specifically, this step mainly performs noise-filtering preprocessing on the target text. Unrecognizable characters may be characters outside the specified encoding format or meaningless characters such as tabs and spaces.
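A minimal sketch of such a noise filter is given below. The exact set of characters to drop is an assumption made for illustration: whitespace, control characters, unassigned or private-use code points, and the Unicode replacement character that typically marks a decoding failure. Removing all spaces is only reasonable for text, such as Chinese, that does not rely on spaces to separate words.

```python
import unicodedata

def filter_unrecognizable(text):
    """Remove whitespace and characters that carry no usable content."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Cc = control, Cn = unassigned, Co = private use;
        # U+FFFD is the replacement character left by failed decoding.
        if ch.isspace() or category in ("Cc", "Cn", "Co") or ch == "\ufffd":
            continue
        kept.append(ch)
    return "".join(kept)

cleaned = filter_unrecognizable("2014\t世界杯 \ufffd足球赛\n")
print(cleaned)
```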
Step 120: sorting the feature words to obtain a feature character string.
The selected feature words are sorted according to a preset sorting strategy to obtain the sorted feature character string; for example, the strategy may be to sort by the first letter of each feature word.
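The point of the sorting step is that the same set of feature words always yields the same feature character string, regardless of the order in which the words were extracted. A minimal sketch, assuming a case-insensitive alphabetical order and a `|` separator (both are illustrative choices, not taken from the source):

```python
def build_feature_string(feature_words, separator="|"):
    """Join the distinct feature words in a fixed, order-independent way."""
    # Sort case-insensitively; the word itself breaks ties so that the
    # result is fully deterministic.
    ordered = sorted(set(feature_words), key=lambda w: (w.casefold(), w))
    return separator.join(ordered)

first = build_feature_string(["world cup", "Brazil", "held"])
second = build_feature_string(["held", "world cup", "Brazil"])
print(first)
```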
Step 130: applying a MinHash algorithm to the feature character string to obtain a hash value corresponding to the target text.
The first step in applying the MinHash algorithm to the feature character string is hash function generation: a hash function h(x) = {h1(x), h2(x), …, hm(x)} is designed, the number of bits m of h(x) is set (for example 32 bits, i.e., m = 32), and each bit of h(x) is initialized to 0. Here h(x) denotes a hash function that maps x to an integer; h(x) is assumed to be a good hash function with good uniformity that maps different elements to different integers. The second step is to use the generated hash function h(x) to compute a hash code (for example 32 bits) for each feature word. For each bit of each feature word's hash code, if the bit is 1, the weight of the feature word (usually its frequency of occurrence) is added to the value at the corresponding position of the MinHash; otherwise, the weight is subtracted. Each position of the resulting (32-bit) MinHash is then set to either 1 or 0 according to its accumulated value. Finally, the hash value corresponding to the target text is obtained from the resulting hash code.
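The weighted bit accumulation described above can be sketched as follows. Several details are assumptions made for illustration only: the per-word hash is taken from the first 32 bits of an MD5 digest, and a position of the result is set to 1 when its accumulated value is positive, which is the usual rule in SimHash-style fingerprints (the procedure as described closely resembles that technique, although the source calls it MinHash). The names `word_hash` and `text_fingerprint` are hypothetical.

```python
import hashlib

BITS = 32  # m = 32 in the example above

def word_hash(word):
    """32-bit hash code of one feature word (MD5 is an assumed choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def text_fingerprint(weighted_words):
    """Combine per-word hash codes into one 32-bit value.

    weighted_words maps each feature word to a positive weight,
    for example its frequency of occurrence.
    """
    votes = [0] * BITS  # every position starts at 0
    for word, weight in weighted_words.items():
        code = word_hash(word)
        for i in range(BITS):
            if (code >> i) & 1:
                votes[i] += weight  # bit is 1: add the weight
            else:
                votes[i] -= weight  # bit is 0: subtract the weight
    fingerprint = 0
    for i, total in enumerate(votes):
        if total > 0:  # assumed threshold; not stated in the source
            fingerprint |= 1 << i
    return fingerprint

value = text_fingerprint({"Brazil": 1, "world cup": 1, "held": 1})
print(f"{value:08x}")
```

Because every word votes on every bit, texts that share most of their feature words tend to agree on most bits, which is what allows similar texts to end up with identical or close hash values.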
Step 140, searching whether an index mapping bucket matched with the hash value exists in a mapping cache pool, and if so, establishing an index between the hash value and the target text in the index mapping bucket.
Each index mapping bucket in the mapping cache pool is a set of hash values within a preset range. The hash values within the preset range are close to one another, and the target texts corresponding to these hash values are regarded as similar texts; that is, each index mapping bucket is a set of indexes of similar texts. Indexes of similar texts are built in the same index mapping bucket. Owing to the inherent properties of the MinHash algorithm, several similar texts may produce the same hash value and may therefore correspond to the same index hash value. This classifies similar texts, reduces the index storage amount, and improves the retrieval speed of similar texts.
Further, establishing an index between the hash value and the target text in the index mapping bucket comprises:
if the index mapping bucket does not have the index hash value which is the same as the hash value, storing the hash value into the index mapping bucket, and establishing an index between the hash value and the target text;
and if the index mapping bucket has the index hash value which is the same as the hash value, the index mapping bucket does not save the hash value again, and the index between the index hash value and the target text is directly established.
Because similar texts may produce the same hash value after the MinHash algorithm is applied, if an index hash value identical to the current hash value already exists in the index mapping bucket, the current hash value is not stored again; instead, an index relationship is established directly between the existing index hash value and the current target text. A separate index therefore does not need to be established for every target text, which reduces the number of stored indexes for similar texts, reduces the backlog of stored indexes, and further improves the retrieval speed of similar texts.
Step 150: if no index mapping bucket matching the hash value exists in the mapping cache pool, establishing an index mapping bucket matching the hash value, and establishing an index between the hash value and the target text.
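Steps 140 and 150 can be sketched with a small in-memory structure. How a bucket "matches" a hash value is not specified in detail, so this sketch assumes that a bucket covers all 32-bit hash values sharing the same high-order bits; the class `MappingCachePool`, the parameter `bucket_shift` and the text identifiers are hypothetical.

```python
class MappingCachePool:
    """Buckets of nearby hash values, each hash value indexing its texts."""

    def __init__(self, bucket_shift=16):
        # Assumption: hash values with equal high-order bits share a bucket.
        self.bucket_shift = bucket_shift
        # bucket key -> {index hash value -> list of text identifiers}
        self.buckets = {}

    def bucket_key(self, hash_value):
        return hash_value >> self.bucket_shift

    def add(self, hash_value, text_id):
        key = self.bucket_key(hash_value)
        bucket = self.buckets.get(key)
        if bucket is None:
            # Step 150: no matching bucket exists, so create one.
            bucket = {}
            self.buckets[key] = bucket
        # Step 140: store the hash value only if it is not yet present,
        # then link the target text to the (new or existing) hash value.
        bucket.setdefault(hash_value, []).append(text_id)

    def similar_texts(self, hash_value):
        """Texts already indexed under exactly this hash value."""
        bucket = self.buckets.get(self.bucket_key(hash_value), {})
        return list(bucket.get(hash_value, []))

pool = MappingCachePool()
pool.add(0x12345678, "text-a")
pool.add(0x12345678, "text-b")  # same hash value: not stored again
pool.add(0x12340001, "text-c")  # same bucket, different hash value
pool.add(0xFFFF0000, "text-d")  # new bucket
print(pool.similar_texts(0x12345678))
```

Storing each distinct hash value once and attaching several texts to it is what keeps the number of stored index entries below the number of indexed texts.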
In the index establishing method provided by this embodiment, the MinHash algorithm is applied to the feature character string of the target text to obtain the corresponding hash value, and since the hash values corresponding to a plurality of similar texts may be the same, the plurality of similar texts may correspond to the same index hash value, so that the index storage amount is reduced, and by establishing the indexes of the similar texts in the same index mapping bucket, the classification of the similar texts is realized, and the retrieval speed of the similar texts is improved.
Example two
Fig. 2 is a flowchart illustrating a flow of an index establishing method according to a second embodiment of the present invention, where on the basis of the technical solution of the first embodiment, recommendation operations for similar texts are added in the embodiment, and similar text recommendation is performed based on an index established by the method disclosed in the first embodiment, so that higher similar text recommendation efficiency and accuracy can be achieved. Referring specifically to fig. 2, the method includes:
Step 210: extracting the feature words of the target text.
Step 220: sorting the feature words to obtain a feature character string.
Step 230: applying a MinHash algorithm to the feature character string to obtain a hash value corresponding to the target text.
Step 240, if an index mapping bucket matched with the hash value exists in the mapping cache pool, establishing an index between the hash value and the target text in the index mapping bucket, and recommending text data corresponding to the hash value as text data similar to the target text.
Specifically, the operation of step 240 may be to recommend the text data corresponding to the index hash value in the matched index mapping bucket, which is the same as the hash value, as the text data similar to the target text. Obviously, at this time, the index hash value identical to the hash value exists in the matched index mapping bucket, and at this time, the hash value does not need to be repeatedly stored, so that the index storage amount is reduced, and similar text recommendation can be directly performed. Or if the index mapping bucket does not have the index hash value identical to the hash value, the following operations may be performed to implement recommendation of similar text.
Randomly determining N hash functions; the N hash functions can be directly selected from the hash function library in sequence;
respectively carrying out Hash operation on the characteristic character strings of the target text based on the N Hash functions to obtain N Hash values;
counting the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool;
sequencing the number of the similar hash values, and determining a recommended text data set similar to the target text according to a sequencing result;
calculating the similarity between the target text and each recommended text data in the recommended text data set;
recommending the recommended text data with the similarity meeting the set threshold; wherein N is a positive integer.
For ease of understanding, the above operation is illustrated with an example. Let N = 10, with the corresponding 10 hash values denoted N1, N2, …, N10. Suppose counting shows that N1, N2, N3 and N4 are located in index mapping bucket No. 1 in the mapping cache pool, N5, N6 and N7 in index mapping bucket No. 2, N8 and N9 in index mapping bucket No. 3, and N10 in index mapping bucket No. 4, so that the sorted counts are 4, 3, 2, 1. The recommended text data set similar to the target text, determined according to the sorting result, is then the text data corresponding to the hash values N1, N2, N3 and N4 in index mapping bucket No. 1.
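The counting and sorting in this example can be reproduced with a few lines. The sketch assumes the bucket of each of the N hash values has already been looked up; the dictionary `hash_to_bucket` and the function `rank_buckets` are hypothetical names.

```python
from collections import Counter

def rank_buckets(hash_to_bucket):
    """Return (bucket, count) pairs ordered by descending count."""
    counts = Counter(hash_to_bucket.values())
    return counts.most_common()

# The assignment from the example: N1..N4 in bucket 1, N5..N7 in bucket 2,
# N8 and N9 in bucket 3, N10 in bucket 4.
hash_to_bucket = {
    "N1": 1, "N2": 1, "N3": 1, "N4": 1,
    "N5": 2, "N6": 2, "N7": 2,
    "N8": 3, "N9": 3,
    "N10": 4,
}
ranking = rank_buckets(hash_to_bucket)
best_bucket = ranking[0][0]
recommended = [h for h, b in hash_to_bucket.items() if b == best_bucket]
print(ranking)
print(recommended)
```

The texts indexed under the hash values in `recommended` form the candidate set whose similarity to the target text is then computed and compared against the threshold.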
Specifically, the similarity between the target text and each piece of recommended text data in the recommended text data set can be obtained by calculating the similarity between a set a and a set B, where the set a is a set composed of feature word elements of the target text, and the set B is a set composed of feature word elements of each piece of recommended text data. The similarity between the set A and the set B can be obtained by a hash function:
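Let h(x) be a hash function that maps x to an integer, and let $h_{\min}(S)$ denote the element of a set S with the smallest hash value. Assume that h(x) is a good hash function with good uniformity that maps different elements to different integers. Then, for sets A and B, $h_{\min}(A) = h_{\min}(B)$ holds exactly when the element with the smallest hash value in $A \cup B$ is also in $A \cap B$. Consequently, $P(h_{\min}(A) = h_{\min}(B)) = |A \cap B| / |A \cup B|$; that is, the probability that the two minimum-hash elements coincide equals the Jaccard similarity of A and B.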
First, use multiple hash functions
To estimate the probability that sets A and B have the same minimum hash value, K hash functions can be selected and applied to A and B separately, giving K minimum values for each set: $\mathrm{Min}(A)_K = \{a_1, a_2, \ldots, a_K\}$ and $\mathrm{Min}(B)_K = \{b_1, b_2, \ldots, b_K\}$. The similarity of A and B is then $|\mathrm{Min}(A)_K \cap \mathrm{Min}(B)_K| / |\mathrm{Min}(A)_K \cup \mathrm{Min}(B)_K|$, that is, the ratio of the number of elements common to $\mathrm{Min}(A)_K$ and $\mathrm{Min}(B)_K$ to the total number of distinct elements in the two.
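A minimal sketch of the multiple-hash-function variant follows. The K hash functions are simulated by salting SHA-1 with a seed, which is an assumption made for illustration; `seeded_hash`, `min_values` and `similarity_k_functions` are hypothetical names, and both input sets are assumed to be non-empty.

```python
import hashlib

def seeded_hash(seed, element):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def min_values(elements, k):
    """The minimum hash value of the set under each of k hash functions."""
    return {min(seeded_hash(seed, e) for e in elements) for seed in range(k)}

def similarity_k_functions(a, b, k=64):
    """Ratio of shared minima to all distinct minima, as described above."""
    min_a = min_values(a, k)
    min_b = min_values(b, k)
    return len(min_a & min_b) / len(min_a | min_b)

a = {"Brazil", "world cup", "held"}
b = {"Brazil", "world cup", "final"}
print(similarity_k_functions(a, b))
```

A widely used alternative estimator counts the hash functions under which the two minima agree and divides by K; the sketch keeps the intersection-over-union form given in the text.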
Second, use a single hash function
$h_{\min}(S)$ is the element of the set S with the smallest hash value, and $h_{\min}^{K}(S)$ is defined as the K elements of S with the smallest hash values. Thus, each set needs to be hashed only once, after which its K smallest elements are taken. The similarity of the two sets A and B is calculated as $|h_{\min}^{K}(A) \cap h_{\min}^{K}(B)| / |h_{\min}^{K}(A) \cup h_{\min}^{K}(B)|$, the ratio of the size of the intersection to the size of the union of the K smallest elements of A and the K smallest elements of B.
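The single-hash-function variant can be sketched in the same way. The hash function (the first 8 bytes of SHA-1) and the names `h`, `smallest_k` and `similarity_single_hash` are illustrative assumptions; both input sets are assumed to be non-empty.

```python
import hashlib

def h(element):
    """A single hash function mapping an element to an integer."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def smallest_k(elements, k):
    """The k elements of the set with the smallest hash values."""
    return set(sorted(set(elements), key=h)[:k])

def similarity_single_hash(a, b, k):
    """Intersection over union of the two k-smallest element sets."""
    min_a = smallest_k(a, k)
    min_b = smallest_k(b, k)
    return len(min_a & min_b) / len(min_a | min_b)

a = {"Brazil", "world cup", "held"}
b = {"Brazil", "world cup", "final"}
print(similarity_single_hash(a, b, k=10))
```

When K is at least as large as both sets, every element is kept and the result is the exact Jaccard similarity (2 shared words out of 4 distinct words in this example); smaller values of K trade accuracy for less work per comparison.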
Step 250, if no index mapping bucket matched with the hash value exists in the mapping cache pool, establishing an index mapping bucket matched with the hash value, and establishing an index between the hash value and the target text.
According to the index establishing method provided by the embodiment, similar text recommendation is performed on the basis of the index established by the method disclosed by the embodiment, and high similar text recommendation efficiency and accuracy can be realized.
Further, on the basis of the above embodiment, refer to the flow diagram of neighbor hash classification, index establishment and similar text query shown in fig. 3. As shown in fig. 3, word segmentation is first performed on the text data to obtain a feature word character string. The feature word character string is then hashed to obtain the corresponding hash value, which is the text feature of the text data. A corresponding index is established in the mapping cache pool according to the text feature, and similar text recommendation is performed. Establishing indexes from the hash values of the feature words of the text data greatly reduces the index storage amount and converts similar data query into a nearest-neighbor search over features. Because the hash values corresponding to the feature words of similar texts may be the same, the delay caused by index backlog when indexing large batches of text data is avoided. Corresponding word sets can also be added conveniently for different service content applications, and index classification can be refined according to the corresponding hash results. By converting similar text search into similar code search, query time is greatly reduced and query efficiency is improved, and similar content can be recommended according to the degree of code similarity between different texts.
EXAMPLE III
Fig. 4 is a schematic structural diagram of an index creating apparatus according to a third embodiment of the present invention, and referring to fig. 4, the apparatus includes: the system comprises a feature word extraction module 410, a sorting module 420, a first operation module 430, a first establishing module 440 and a second establishing module 450;
the feature word extracting module 410 is configured to extract feature words of the target text; the sorting module 420 is configured to sort the feature words to obtain feature character strings; a first operation module 430, configured to apply a MinHash algorithm to the feature string to obtain a hash value corresponding to the target text; a first establishing module 440, configured to find whether an index mapping bucket matching the hash value exists in a mapping cache pool, and if so, establish an index between the hash value and the target text in the index mapping bucket; a second establishing module 450, configured to establish an index mapping bucket matching the hash value if no index mapping bucket matching the hash value exists in the mapping cache pool, and establish an index between the hash value and the target text.
Further, the first establishing module 440 includes:
a storage unit, configured to store the hash value in the index mapping bucket if an index hash value that is the same as the hash value does not exist in the index mapping bucket, and establish an index between the hash value and the target text;
and the establishing unit is used for directly establishing the index between the index hash value and the target text without storing the hash value again if the index hash value identical to the hash value already exists in the index mapping bucket.
Further, the apparatus further comprises:
the recommending module is used for recommending the text data corresponding to the hash value as the text data similar to the target text if the index mapping bucket matched with the hash value exists in the mapping cache pool; or for randomly determining N hash functions; respectively carrying out Hash operation on the characteristic character strings of the target text based on the N Hash functions to obtain N Hash values; counting the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool; sequencing the number of the similar hash values, and determining a recommended text data set similar to the target text according to a sequencing result; calculating the similarity between the target text and each recommended text data in the recommended text data set; recommending the recommended text data with the similarity meeting the set threshold.
In the index establishing device provided by this embodiment, the hash value is obtained by applying the MinHash algorithm to the feature string of the target text. Because several similar texts may yield the same hash value, they may share a single index hash value, which reduces the amount of index data stored. Establishing the indexes of similar texts in the same index mapping bucket also groups similar texts together and speeds up their retrieval.
The index establishing device provided by the embodiment of the present invention can execute the index establishing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method.
Those skilled in the art will understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; its scope is determined by the appended claims.

Claims (7)

1. An index building method, comprising:
extracting feature words of a target text;
sorting the feature words to obtain a feature string;
applying a MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
searching whether an index mapping bucket matched with the hash value exists in a mapping cache pool, and if so, establishing an index between the hash value and the target text in the index mapping bucket;
if the mapping cache pool does not have an index mapping bucket matched with the hash value, establishing an index mapping bucket matched with the hash value, and establishing an index between the hash value and the target text;
the method further comprises the following steps:
if the index mapping bucket matched with the hash value exists in the mapping cache pool, recommending the text data corresponding to the hash value as the text data similar to the target text;
the method further comprises the following steps:
randomly determining N hash functions;
performing a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values;
counting the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool;
sorting by the number of similar hash values, and determining a recommended text data set similar to the target text according to the sorting result;
calculating the similarity between the feature word element set of the target text and the feature word element set of each recommended text data to obtain the similarity between the target text and each recommended text data in the recommended text data set;
recommending the recommended text data whose similarity meets a set threshold;
wherein N is a positive integer.
2. The method of claim 1, wherein establishing an index between the hash value and the target text in the index mapping bucket comprises:
if the index mapping bucket does not have the index hash value which is the same as the hash value, storing the hash value into the index mapping bucket, and establishing an index between the hash value and the target text;
and if the index mapping bucket has the index hash value which is the same as the hash value, the index mapping bucket does not save the hash value again, and the index between the index hash value and the target text is directly established.
3. The method according to claim 1 or 2, wherein the extracting the feature words of the target text comprises:
performing word segmentation on the target text;
and determining the feature words of the target text according to the part of speech and the occurrence frequency of each segmented word.
4. The method of claim 3, wherein performing word segmentation on the target text comprises:
selecting, in a coarse-grained or fine-grained mode, corresponding word segmentation units according to word frequency and part of speech, segmenting the target text into words, and labeling the part of speech of each word.
5. The method of claim 3, further comprising, prior to performing word segmentation on the target text:
and filtering unrecognizable characters in the target text.
6. An index building apparatus, comprising:
a feature word extracting module, configured to extract feature words of a target text;
a sorting module, configured to sort the feature words to obtain a feature string;
a first operation module, configured to apply a MinHash algorithm to the feature string to obtain a hash value corresponding to the target text;
a first establishing module, configured to find whether an index mapping bucket matching the hash value exists in a mapping cache pool, and if so, establish an index between the hash value and the target text in the index mapping bucket;
a second establishing module, configured to establish an index mapping bucket matching the hash value if no index mapping bucket matching the hash value exists in the mapping cache pool, and establish an index between the hash value and the target text;
the device further comprises:
a recommending module, configured to recommend the text data corresponding to the hash value as text data similar to the target text if an index mapping bucket matching the hash value exists in the mapping cache pool; or configured to: randomly determine N hash functions; perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; count the number of similar hash values of the N hash values in the same index mapping bucket in the mapping cache pool; sort by the number of similar hash values, and determine a recommended text data set similar to the target text according to the sorting result; calculate the similarity between the feature word element set of the target text and the feature word element set of each recommended text data to obtain the similarity between the target text and each recommended text data in the recommended text data set; and recommend the recommended text data whose similarity meets a set threshold; wherein N is a positive integer.
7. The apparatus of claim 6, wherein the first establishing module comprises:
a storage unit, configured to store the hash value in the index mapping bucket if an index hash value that is the same as the hash value does not exist in the index mapping bucket, and establish an index between the hash value and the target text;
and an establishing unit, configured to establish the index between the existing index hash value and the target text directly, without storing the hash value again, if an index hash value identical to the hash value already exists in the index mapping bucket.
CN201711069369.8A 2017-11-03 2017-11-03 Index establishing method and device Active CN107784110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711069369.8A CN107784110B (en) 2017-11-03 2017-11-03 Index establishing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711069369.8A CN107784110B (en) 2017-11-03 2017-11-03 Index establishing method and device

Publications (2)

Publication Number Publication Date
CN107784110A CN107784110A (en) 2018-03-09
CN107784110B true CN107784110B (en) 2020-07-03

Family

ID=61431627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711069369.8A Active CN107784110B (en) 2017-11-03 2017-11-03 Index establishing method and device

Country Status (1)

Country Link
CN (1) CN107784110B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670148A (en) * 2018-09-26 2019-04-23 平安科技(深圳)有限公司 Collection householder method, device, equipment and storage medium based on speech recognition
CN109710656A (en) * 2018-11-12 2019-05-03 清华大学 Approximate enquiring method and device
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
JP7193000B2 (en) * 2019-08-30 2022-12-20 富士通株式会社 Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device
CN111078821B (en) * 2019-11-27 2023-12-08 泰康保险集团股份有限公司 Dictionary setting method, dictionary setting device, medium and electronic equipment
CN111597309A (en) * 2020-05-25 2020-08-28 深圳市小满科技有限公司 Similar enterprise recommendation method and device, electronic equipment and medium
CN111858607A (en) * 2020-07-24 2020-10-30 北京金山云网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
CN113992625B (en) * 2021-10-15 2024-05-28 杭州安恒信息技术股份有限公司 Domain name source station detection method, system, computer and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063446A (en) * 2009-11-13 2011-05-18 中国移动通信集团四川有限公司 Method for creating inverted index and inverted indexing device
CN102193995A (en) * 2011-04-26 2011-09-21 深圳市迅雷网络技术有限公司 Method and device for establishing multimedia data index and retrieval
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6862602B2 (en) * 1997-03-07 2005-03-01 Apple Computer, Inc. System and method for rapidly identifying the existence and location of an item in a file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
minhash算法;mengfanrong;《https://www.cnblogs.com/mengfanrong/p/5058919.html》;20151219;第1-3页 *

Also Published As

Publication number Publication date
CN107784110A (en) 2018-03-09

Similar Documents

Publication Publication Date Title
CN107784110B (en) Index establishing method and device
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US9442929B2 (en) Determining documents that match a query
CN106570141B (en) Approximate repeated image detection method
CN107918604B (en) Chinese word segmentation method and device
CN106033416A (en) A string processing method and device
US10649997B2 (en) Method, system and computer program product for performing numeric searches related to biometric information, for finding a matching biometric identifier in a biometric database
US20140032207A1 (en) Information Classification Based on Product Recognition
CN104462085A (en) Method and device for correcting search keywords
CN111291177A (en) Information processing method and device and computer storage medium
JP7149976B2 (en) Error correction method and apparatus, computer readable medium
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111801665A (en) Hierarchical Locality Sensitive Hash (LSH) partition indexing for big data applications
KR20220134695A (en) System for author identification using artificial intelligence learning model and a method thereof
CN109885641B (en) Method and system for searching Chinese full text in database
CN113449082A (en) New word discovery method, system, electronic device and medium
US11574004B2 (en) Visual image search using text-based search engines
US8484221B2 (en) Adaptive routing of documents to searchable indexes
Matsui et al. Reconfigurable Inverted Index
CN112925912A (en) Text processing method, and synonymous text recall method and device
Yu et al. Scalable forest hashing for fast similarity search
CN112528021B (en) Model training method, model training device and intelligent equipment
CN109992716B (en) Indonesia similar news recommendation method based on ITQ algorithm
CN110968691B (en) Judicial hotspot determination method and device
CN111159996A (en) Short text set similarity comparison method and system based on improved text fingerprint algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
    Denomination of invention: An index establishment method and device
    Effective date of registration: 20220105
    Granted publication date: 20200703
    Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
    Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
    Registration number: Y2022990000005
PC01 Cancellation of the registration of the contract for pledge of patent right
    Date of cancellation: 20220712
    Granted publication date: 20200703
    Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
    Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
    Registration number: Y2022990000005
PE01 Entry into force of the registration of the contract for pledge of patent right
    Denomination of invention: A kind of index establishment method and apparatus
    Effective date of registration: 20220907
    Granted publication date: 20200703
    Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
    Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
    Registration number: Y2022110000206
PC01 Cancellation of the registration of the contract for pledge of patent right
    Granted publication date: 20200703
    Pledgee: China Construction Bank Corp Beijing Zhongguancun branch
    Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING
    Registration number: Y2022110000206