WO2022239174A1 - Similarity degree derivation system and similarity degree derivation method - Google Patents

Similarity degree derivation system and similarity degree derivation method

Info

Publication number
WO2022239174A1
Authority
WO
WIPO (PCT)
Prior art keywords
hash
hash value
sets
hash function
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/018169
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
善之 大野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to PCT/JP2021/018169 (WO2022239174A1)
Priority to US18/288,586 (US12413413B2)
Priority to JP2023520672A (JP7464193B2)
Publication of WO2022239174A1
Current legal status: Ceased

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures

Definitions

  • the minimum hash value may be called MinHash.
  • the minimum of the hash values obtained for the elements of a set with the same hash function is referred to as the minimum hash value.
  • the hash value calculation unit 2, minimum hash value specification unit 3, and similarity derivation unit 4 are realized, for example, by a CPU (Central Processing Unit) of a computer that operates according to a similarity derivation program.
  • the CPU reads the similarity derivation program from a program recording medium such as a program storage device of the computer and, according to the similarity derivation program, operates as the hash value calculation unit 2, the minimum hash value specification unit 3, and the similarity derivation unit 4.
  • FIG. 2 is a flowchart showing an example of the progress of processing in the first embodiment.
  • the description is omitted.
  • Steps S2 and S3 are the same as steps S2 and S3 in the first embodiment.
  • the universal set generation unit 52 extracts only one element from any plurality of elements, among the individual elements of each set, whose first hash values match and whose elements themselves match. It also extracts each element that does not belong to such a plurality, and generates one universal set containing every extracted element. The second hash value calculation unit 54 then calculates the hash values corresponding to each hash function h2, . . . , hm other than the predetermined hash function h1 for each element belonging to the universal set. Therefore, for a plurality of elements whose first hash values match and whose elements themselves match, the second hash value calculation unit 54 calculates the hash values corresponding to the hash functions h2, . . . , hm only once.
  • the set selection unit 61 sequentially selects one set from a plurality of sets.
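  • when a matching element has already been selected, the second hash value calculation unit 65 determines that the hash value of the selected element corresponding to each hash function h2, . . . , hm is identical to the hash value corresponding to each hash function h2, . . . , hm of the matching element.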
  • in step S77, the element selection unit 62 determines whether or not there is an unselected element in the selected set. In this example, it is determined that there is an unselected element in the selected set A (Yes in step S77). In this case, the processing from step S72 onward is repeated.
  • the first hash value calculation unit 63 applies the numerical hash function h0 to the element selected in step S72 and converts the element to a numeric element.
  • the determination unit 64 may determine whether or not a matching element has already been obtained, treating an element that matches the element converted by the numerical hash function h0 as a matching element. If no matching element has been obtained, in step S76 the second hash value calculation unit 65 applies the plurality of hash functions h1, h2, . . . , hm to the element and calculates each hash value corresponding to the plurality of hash functions h1, h2, . . . , hm.
  • the similarity derivation method, wherein the hash value calculation process includes: a set selection process for sequentially selecting one set from the plurality of sets; an element selection process for sequentially selecting one element from the selected set; a first hash value calculation process for calculating a first hash value of the selected element by applying the predetermined hash function to the selected element; a determination process for determining whether or not a matching element, which is an element whose first hash value matches that of the selected element and whose element itself matches, has already been selected; and a second hash value calculation process for determining, when the matching element has already been selected, that the hash value of the selected element corresponding to each hash function other than the predetermined hash function is the same as the corresponding hash value of the matching element, and for calculating, when the matching element has not been selected, the hash value of the selected element corresponding to each hash function other than the predetermined hash function.
  • when obtaining a plurality of hash values by applying a plurality of hash functions to the individual elements of each set contained in the plurality of sets, duplicate calculation of each hash function other than a predetermined hash function among the plurality of hash functions is eliminated for a plurality of elements that have the same hash value under the predetermined hash function and are themselves the same element, and the plurality of hash values are obtained for each element of each set.
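
The set selection, element selection, first hash value calculation, determination, and second hash value calculation steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the application: the function names, the use of salted BLAKE2b digests as stand-ins for h1, . . . , hm, and the dictionary used to remember already-processed elements are all hypothetical choices made for the example. The similarity estimate at the end is the standard MinHash estimator (the fraction of hash functions whose minimum hash values agree); the record above does not spell out how the similarity derivation unit 4 computes its result. Every input set is assumed to be non-empty.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Set, Tuple

HashFn = Callable[[str], int]


def make_hash_functions(m: int) -> List[HashFn]:
    """Build m hash functions h1..hm (assumption: salted 64-bit BLAKE2b)."""
    def make(index: int) -> HashFn:
        salt = index.to_bytes(8, "little")  # a distinct salt per function
        def h(element: str) -> int:
            digest = hashlib.blake2b(
                element.encode("utf-8"), digest_size=8, salt=salt
            ).digest()
            return int.from_bytes(digest, "big")
        return h
    return [make(i) for i in range(1, m + 1)]


def minhash_signatures(
    sets: Sequence[Set[str]], hash_functions: Sequence[HashFn]
) -> Tuple[List[List[int]], int]:
    """Return one signature per set and the number of h2..hm evaluations made."""
    h1, others = hash_functions[0], hash_functions[1:]
    # first hash value -> list of (element, [h2(element), ..., hm(element)])
    seen: Dict[int, List[Tuple[str, List[int]]]] = {}
    evaluations = 0
    signatures: List[List[int]] = []
    for current_set in sets:                 # set selection
        minima: List[int] = []
        for element in current_set:          # element selection
            first = h1(element)              # first hash value (always computed)
            bucket = seen.setdefault(first, [])
            # Matching element: same first hash value AND same element itself.
            cached = next((v for e, v in bucket if e == element), None)
            if cached is None:
                # No matching element yet: compute h2..hm once and remember them.
                cached = [h(element) for h in others]
                evaluations += len(others)
                bucket.append((element, cached))
            values = [first] + cached
            if minima:
                minima = [min(a, b) for a, b in zip(minima, values)]
            else:
                minima = values
        signatures.append(minima)
    return signatures, evaluations


def estimate_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of hash functions whose minimum hash values agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

For example, calling `minhash_signatures([{"a", "b", "c"}, {"b", "c", "d"}], make_hash_functions(4))` evaluates h2, . . . , h4 once for each of the four distinct elements rather than once for each of the six element occurrences. Comparing the elements themselves in addition to their first hash values keeps a collision under h1 from reusing the wrong cached values.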

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2021/018169 2021-05-13 2021-05-13 Similarity degree derivation system and similarity degree derivation method Ceased WO2022239174A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/JP2021/018169 WO2022239174A1 (ja) 2021-05-13 2021-05-13 Similarity degree derivation system and similarity degree derivation method
US18/288,586 US12413413B2 (en) 2021-05-13 2021-05-13 Similarity degree derivation system and similarity degree derivation method
JP2023520672A JP7464193B2 (ja) 2021-05-13 2021-05-13 Similarity degree derivation system and similarity degree derivation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018169 WO2022239174A1 (ja) 2021-05-13 2021-05-13 Similarity degree derivation system and similarity degree derivation method

Publications (1)

Publication Number Publication Date
WO2022239174A1 (ja) 2022-11-17

Family

ID=84028068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018169 Ceased WO2022239174A1 (ja) Similarity degree derivation system and similarity degree derivation method 2021-05-13 2021-05-13

Country Status (3)

Country Link
US (1) US12413413B2
JP (1) JP7464193B2
WO (1) WO2022239174A1

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170078286A1 (en) * 2015-09-16 2017-03-16 RiskIQ, Inc. Using hash signatures of dom objects to identify website similarity
US20170161375A1 (en) * 2015-12-07 2017-06-08 Adlib Publishing Systems Inc. Clustering documents based on textual content
US20170322930A1 (en) * 2016-05-07 2017-11-09 Jacob Michael Drew Document based query and information retrieval systems and methods
US20180095941A1 (en) * 2016-09-30 2018-04-05 Quantum Metric, LLC Techniques for view capture and storage for mobile applications
US20180181609A1 (en) * 2016-12-28 2018-06-28 Google Inc. System for De-Duplicating Job Postings
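WO2021038887A1 (ja) * 2019-08-30 2021-03-04 Fujitsu Ltd Similar document retrieval method, similar document retrieval program, similar document retrieval device, index information creation method, index information creation program, and index information creation device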

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579661B2 (en) * 2013-05-20 2020-03-03 Southern Methodist University System and method for machine learning and classifying data
JP7032650B2 (ja) 2018-06-28 2022-03-09 Fujitsu Ltd Similar text retrieval method, similar text retrieval device, and similar text retrieval program
US11120052B1 (en) * 2018-06-28 2021-09-14 Amazon Technologies, Inc. Dynamic distributed data clustering using multi-level hash trees

Also Published As

Publication number Publication date
JP7464193B2 (ja) 2024-04-09
US20240214211A1 (en) 2024-06-27
US12413413B2 (en) 2025-09-09
JPWO2022239174A1 2022-11-17

Similar Documents

Publication Publication Date Title
WO2020086115A1 (en) Multi-task training architecture and strategy for attention- based speech recognition system
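CN114023342A (zh) Voice conversion method and apparatus, storage medium, and electronic device
WO2017124930A1 (zh) Feature data processing method and device
JP6331756B2 (ja) Test case generation program, test case generation method, and test case generation device
CN119493559A (zh) Code generation method and apparatus, electronic device, and storage medium
AU2024200306A1 (en) Automated indexing and extraction of information in digital documents
JPWO2020152804A1 (ja) Information provision system, method, and program
CN117762806A (zh) Code detection processing method and apparatus
JP6778811B2 (ja) Speech recognition method and apparatus
WO2018180971A1 (ja) Information processing system, feature description method, and feature description program
JP7464193B2 (ja) Similarity degree derivation system and similarity degree derivation method
CN114884772B (zh) Bare-metal VXLAN deployment method, system, and electronic device
JP7041603B2 (ja) Computer system and method for generating business flow patterns
CN106648891A (zh) Task execution method and apparatus based on the MapReduce model
WO2022070422A1 (ja) Computer system and character recognition method
WO2022049681A1 (ja) Correlation index construction device, correlation table search device, method, and program
US12056147B2 (en) Analysis device, analysis method, and analysis program
CN117709302A (zh) Document conversion method and apparatus
WO2019171537A1 (ja) Meaning estimation system, method, and program
CN114880242A (zh) Test case extraction method, apparatus, device, and medium
JP7184176B2 (ja) Allocation device, method, and program
JP2016184273A (ja) Arithmetic control device, arithmetic control method, and arithmetic control program
CN107194014B (zh) Data source invocation method and apparatus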
US20250217226A1 (en) Root Cause Locating Method and Apparatus, and Storage Medium
US20250298887A1 (en) Program identification method and program identification device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941908

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023520672

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 18288586

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941908

Country of ref document: EP

Kind code of ref document: A1

WWG Wipo information: grant in national office

Ref document number: 18288586

Country of ref document: US