CN107784110A - Index establishing method and device - Google Patents
- Publication number
- CN107784110A (application CN201711069369.8A)
- Authority
- CN
- China
- Prior art keywords
- index
- cryptographic hash
- target text
- hash
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (10)
- 1. An index establishing method, characterized by comprising: extracting feature words of a target text; sorting the feature words to obtain a feature string; applying the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text; searching a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, establishing an index between the hash value and the target text in that index mapping bucket; and, if no index mapping bucket matching the hash value exists in the mapping cache pool, creating an index mapping bucket that matches the hash value and establishing the index between the hash value and the target text.
- 2. The method according to claim 1, characterized in that establishing the index between the hash value and the target text in the index mapping bucket comprises: if no index hash value identical to the hash value exists in the index mapping bucket, storing the hash value in the index mapping bucket and establishing the index between the hash value and the target text; and, if an index hash value identical to the hash value already exists in the index mapping bucket, not storing the hash value again and directly establishing the index between the index hash value and the target text.
- 3. The method according to claim 1 or 2, characterized by further comprising: if an index mapping bucket matching the hash value exists in the mapping cache pool, recommending the text data corresponding to the hash value as text data similar to the target text.
- 4. The method according to claim 1 or 2, characterized by further comprising: randomly determining N hash functions; performing a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; counting, among the N hash values, the number of close hash values that fall into the same index mapping bucket in the mapping cache pool; sorting the counts of close hash values and determining, from the sorted result, a recommended text data set similar to the target text; calculating the similarity between the target text and each recommended text data item in the recommended text data set; and recommending the recommended text data whose similarity meets a set threshold; wherein N is a positive integer.
- 5. The method according to claim 1 or 2, characterized in that extracting the feature words of the target text comprises: segmenting the target text into words; and determining the feature words of the target text according to the part of speech and the frequency of occurrence of each segmented word.
- 6. The method according to claim 5, characterized in that segmenting the target text comprises: based on a coarse-granularity or fine-granularity mode, selecting the corresponding segmentation unit with reference to word frequency and part of speech, dividing the target text into words, and tagging the part of speech of each word.
- 7. The method according to claim 5, characterized by further comprising, before segmenting the target text: filtering out characters in the target text that cannot be recognized.
- 8. An index establishing device, characterized by comprising: a feature word extraction module, configured to extract feature words of a target text; a sorting module, configured to sort the feature words to obtain a feature string; a first computing module, configured to apply the MinHash algorithm to the feature string to obtain a hash value corresponding to the target text; a first establishing module, configured to search a mapping cache pool for an index mapping bucket that matches the hash value and, if one exists, to establish an index between the hash value and the target text in that index mapping bucket; and a second establishing module, configured to, if no index mapping bucket matching the hash value exists in the mapping cache pool, create an index mapping bucket that matches the hash value and establish the index between the hash value and the target text.
- 9. The device according to claim 8, characterized in that the first establishing module comprises: a storage unit, configured to, if no index hash value identical to the hash value exists in the index mapping bucket, store the hash value in the index mapping bucket and establish the index between the hash value and the target text; and an establishing unit, configured to, if an index hash value identical to the hash value already exists in the index mapping bucket, not store the hash value again and directly establish the index between the index hash value and the target text.
- 10. The device according to claim 8, characterized by further comprising: a recommending module, configured to, if an index mapping bucket matching the hash value exists in the mapping cache pool, recommend the text data corresponding to the hash value as text data similar to the target text; or configured to randomly determine N hash functions; perform a hash operation on the feature string of the target text with each of the N hash functions to obtain N hash values; count, among the N hash values, the number of close hash values that fall into the same index mapping bucket in the mapping cache pool; sort the counts of close hash values and determine, from the sorted result, a recommended text data set similar to the target text; calculate the similarity between the target text and each recommended text data item in the recommended text data set; and recommend the recommended text data whose similarity meets a set threshold; wherein N is a positive integer.
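The flow described in claims 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the names `feature_string`, `minhash_value`, and `MappingCachePool` are hypothetical, the MinHash step uses a single seeded BLAKE2b hash over the words of the feature string, and a bucket is taken to "match" a hash value when it equals the value modulo a fixed bucket count. The claims do not specify any of these choices.

```python
import hashlib


def feature_string(feature_words):
    """Sort the de-duplicated feature words and join them into one string."""
    return " ".join(sorted(set(feature_words)))


def minhash_value(feature_str, seed=0):
    """Single-function MinHash: the smallest seeded 64-bit hash over the words."""
    def h(word):
        digest = hashlib.blake2b(f"{seed}:{word}".encode("utf-8"), digest_size=8)
        return int.from_bytes(digest.digest(), "big")
    return min(h(word) for word in feature_str.split(" "))


class MappingCachePool:
    """Hypothetical pool: bucket key -> {hash value -> list of text ids}."""

    def __init__(self, num_buckets=1024):
        # Assumption: a bucket matches a hash value via hash_value % num_buckets.
        self.num_buckets = num_buckets
        self.buckets = {}

    def add(self, text_id, feature_words):
        """Index one text; return its hash value and already-indexed similar texts."""
        value = minhash_value(feature_string(feature_words))
        key = value % self.num_buckets
        bucket = self.buckets.get(key)
        similar = []
        if bucket is None:
            # No matching bucket in the pool: create it (second branch of claim 1).
            bucket = self.buckets[key] = {}
        else:
            # Matching bucket exists: texts under the same hash value are
            # candidates for recommendation (claim 3).
            similar = list(bucket.get(value, []))
        # The hash value is stored only once per bucket; later texts with the
        # same value are attached to the existing entry (claim 2).
        bucket.setdefault(value, []).append(text_id)
        return value, similar


pool = MappingCachePool()
h1, similar1 = pool.add("doc1", ["index", "hash", "text"])
h2, similar2 = pool.add("doc2", ["text", "hash", "index"])
print(h1 == h2, similar1, similar2)
```

Because both texts yield the same sorted feature string, they receive the same hash value, so the second call finds the first text in the existing bucket and returns it as a similar candidate.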
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711069369.8A CN107784110B (en) | 2017-11-03 | 2017-11-03 | Index establishing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711069369.8A CN107784110B (en) | 2017-11-03 | 2017-11-03 | Index establishing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107784110A (en) | 2018-03-09 |
CN107784110B (en) | 2020-07-03 |
Family
ID=61431627
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711069369.8A Active CN107784110B (en) | 2017-11-03 | 2017-11-03 | Index establishing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107784110B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670148A (en) * | 2018-09-26 | 2019-04-23 | 平安科技(深圳)有限公司 | Collection householder method, device, equipment and storage medium based on speech recognition |
CN109710656A (en) * | 2018-11-12 | 2019-05-03 | 清华大学 | Approximate enquiring method and device |
CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Adjudicate document information retrieval method, device, computer equipment and storage medium |
CN111078821A (en) * | 2019-11-27 | 2020-04-28 | 泰康保险集团股份有限公司 | Dictionary setting method, device, medium and electronic equipment |
CN111597309A (en) * | 2020-05-25 | 2020-08-28 | 深圳市小满科技有限公司 | Similar enterprise recommendation method and device, electronic equipment and medium |
CN111858607A (en) * | 2020-07-24 | 2020-10-30 | 北京金山云网络技术有限公司 | Data processing method and device, electronic equipment and computer readable medium |
WO2021038887A1 (en) * | 2019-08-30 | 2021-03-04 | 富士通株式会社 | Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device |
CN113992625A (en) * | 2021-10-15 | 2022-01-28 | 杭州安恒信息技术股份有限公司 | Domain name source station detection method, system, computer and readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020073068A1 (en) * | 1997-03-07 | 2002-06-13 | Guha Ramanathan V. | System and method for rapidly identifying the existence and location of an item in a file |
CN102063446A (en) * | 2009-11-13 | 2011-05-18 | 中国移动通信集团四川有限公司 | Method for creating inverted index and inverted indexing device |
CN102193995A (en) * | 2011-04-26 | 2011-09-21 | 深圳市迅雷网络技术有限公司 | Method and device for establishing multimedia data index and retrieval |
CN103257957A (en) * | 2012-02-15 | 2013-08-21 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
CN106156154A (en) * | 2015-04-14 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The search method of Similar Text and device thereof |
2017
- 2017-11-03 CN CN201711069369.8A patent/CN107784110B/en active Active
Non-Patent Citations (1)
Title |
---|
MENGFANRONG: "minhash算法" (MinHash algorithm), 《HTTPS://WWW.CNBLOGS.COM/MENGFANRONG/P/5058919.HTML》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670148A (en) * | 2018-09-26 | 2019-04-23 | 平安科技(深圳)有限公司 | Collection householder method, device, equipment and storage medium based on speech recognition |
CN109710656A (en) * | 2018-11-12 | 2019-05-03 | 清华大学 | Approximate enquiring method and device |
CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Adjudicate document information retrieval method, device, computer equipment and storage medium |
WO2021038887A1 (en) * | 2019-08-30 | 2021-03-04 | 富士通株式会社 | Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device |
JPWO2021038887A1 (en) * | 2019-08-30 | 2021-03-04 | ||
JP7193000B2 (en) | 2019-08-30 | 2022-12-20 | 富士通株式会社 | Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device |
CN111078821A (en) * | 2019-11-27 | 2020-04-28 | 泰康保险集团股份有限公司 | Dictionary setting method, device, medium and electronic equipment |
CN111078821B (en) * | 2019-11-27 | 2023-12-08 | 泰康保险集团股份有限公司 | Dictionary setting method, dictionary setting device, medium and electronic equipment |
CN111597309A (en) * | 2020-05-25 | 2020-08-28 | 深圳市小满科技有限公司 | Similar enterprise recommendation method and device, electronic equipment and medium |
CN111858607A (en) * | 2020-07-24 | 2020-10-30 | 北京金山云网络技术有限公司 | Data processing method and device, electronic equipment and computer readable medium |
CN113992625A (en) * | 2021-10-15 | 2022-01-28 | 杭州安恒信息技术股份有限公司 | Domain name source station detection method, system, computer and readable storage medium |
CN113992625B (en) * | 2021-10-15 | 2024-05-28 | 杭州安恒信息技术股份有限公司 | Domain name source station detection method, system, computer and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107784110B (en) | 2020-07-03 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107784110A (en) | A kind of index establishing method and device | |
CN109947904B (en) | Preference space Skyline query processing method based on Spark environment | |
Norouzi et al. | Fast exact search in hamming space with multi-index hashing | |
CN102129451B (en) | Method for clustering data in image retrieval system | |
CN111104511B (en) | Method, device and storage medium for extracting hot topics | |
CN105426426B (en) | A kind of KNN file classification methods based on improved K-Medoids | |
US20100313258A1 (en) | Identifying synonyms of entities using a document collection | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
Sood et al. | Probabilistic near-duplicate detection using simhash | |
CN103365992B (en) | Method for realizing dictionary search of Trie tree based on one-dimensional linear space | |
CN108875065B (en) | Indonesia news webpage recommendation method based on content | |
CN104298715B (en) | A kind of more indexed results ordering by merging methods based on TF IDF | |
CN107291895B (en) | Quick hierarchical document query method | |
US20150199567A1 (en) | Document classification assisting apparatus, method and program | |
CN111797239A (en) | Application program classification method and device and terminal equipment | |
Wick et al. | A unified approach for schema matching, coreference and canonicalization | |
CN109325146A (en) | A kind of video recommendation method, device, storage medium and server | |
KR102371437B1 (en) | Method and apparatus for recommending entity, electronic device and computer readable medium | |
CN110929498A (en) | Short text similarity calculation method and device and readable storage medium | |
Adamu et al. | A survey on big data indexing strategies | |
CN111325033B (en) | Entity identification method, entity identification device, electronic equipment and computer readable storage medium | |
CN114297415A (en) | Multi-source heterogeneous data storage method and retrieval method for full media data space | |
KR20180129001A (en) | Method and System for Entity summarization based on multilingual projected entity space | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
CN108984711A (en) | A kind of personalized APP recommended method based on layering insertion |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: An index establishment method and device; Effective date of registration: 20220105; Granted publication date: 20200703; Pledgee: China Construction Bank Corp Beijing Zhongguancun branch; Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING; Registration number: Y2022990000005 |
PC01 | Cancellation of the registration of the contract for pledge of patent right | Date of cancellation: 20220712; Granted publication date: 20200703; Pledgee: China Construction Bank Corp Beijing Zhongguancun branch; Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING; Registration number: Y2022990000005 |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A kind of index establishment method and apparatus; Effective date of registration: 20220907; Granted publication date: 20200703; Pledgee: China Construction Bank Corp Beijing Zhongguancun branch; Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING; Registration number: Y2022110000206 |
PC01 | Cancellation of the registration of the contract for pledge of patent right | Granted publication date: 20200703; Pledgee: China Construction Bank Corp Beijing Zhongguancun branch; Pledgor: RUN TECHNOLOGIES Co.,Ltd. BEIJING; Registration number: Y2022110000206 |