JP5819629B2 - Measuring document similarity by inferring evolution of documents through reuse of passage sequences - Google Patents
- Publication number
- JP5819629B2 (application number JP2011099059A)
- Authority
- JP
- Japan
- Prior art keywords
- document
- passage
- hmm
- state
- new
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Concept | Type | Sections | Count |
---|---|---|---|
transition | Effects | claims, description | 37 |
method | Methods | claims, description | 13 |
action | Effects | claims, description | 4 |
mechanism | Effects | description | 20 |
algorithm | Methods | description | 15 |
visual | Effects | description | 8 |
process | Effects | description | 7 |
diagram | Methods | description | 6 |
construction | Methods | description | 4 |
extraction | Methods | description | 3 |
smoothing | Methods | description | 3 |
communication | Methods | description | 2 |
computer program | Methods | description | 2 |
decrease | Effects | description | 2 |
extract | Substances | description | 2 |
function | Effects | description | 2 |
recombination | Effects | description | 2 |
recombination | Methods | description | 2 |
Averaging | Methods | description | 1 |
calculation method | Methods | description | 1 |
decreasing | Effects | description | 1 |
dependent | Effects | description | 1 |
exclusion | Effects | description | 1 |
machine learning | Methods | description | 1 |
material | Substances | description | 1 |
pathway | Effects | description | 1 |
rearrangement | Effects | description | 1 |
statistical model | Methods | description | 1 |
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3326—Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
- G06F16/3328—Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages using graphical result space presentation or visualisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/84—Mapping; Conversion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Algebra (AREA)
- Health & Medical Sciences (AREA)
- Computational Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Pure & Applied Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
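Note that this can be set to [bj(k)]. That is, the emission probability of the new state r is set to the remaining probability, namely the probability that the observation is not generated by any state of the HMM (other than state r). Note also that, as with the transition probabilities, setting the emission probabilities must satisfy the constraint that the emission probabilities summed over all observed states equal 1.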
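The residual rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's formula: the surviving text is truncated, so the sketch assumes that for each observation k the emission probabilities b_j(k) are summed over the existing states j and that the new state r receives whatever remains up to 1. The function name `add_residual_state` and the dictionary layout are hypothetical.

```python
from typing import Dict

# Hypothetical layout: state name -> observation -> emission probability b_j(k).
Emissions = Dict[str, Dict[str, float]]


def add_residual_state(emissions: Emissions, new_state: str = "r") -> Emissions:
    """Return a copy of `emissions` extended with `new_state`.

    Assumption (not confirmed by the source): for each observation k, the new
    state receives the probability mass not covered by the other states,
    b_r(k) = 1 - sum_j b_j(k), so that the sum over all states equals 1.
    """
    others = {s: row for s, row in emissions.items() if s != new_state}
    observations = sorted({k for row in others.values() for k in row})
    updated = {s: dict(row) for s, row in others.items()}
    updated[new_state] = {}
    for k in observations:
        covered = sum(row.get(k, 0.0) for row in others.values())
        if covered > 1.0 + 1e-9:
            raise ValueError(f"emissions for {k!r} already exceed 1: {covered}")
        # Clip tiny negative values caused by floating-point rounding.
        updated[new_state][k] = max(0.0, 1.0 - covered)
    return updated


if __name__ == "__main__":
    model = {"s1": {"a": 0.6, "b": 0.1}, "s2": {"a": 0.3}}
    extended = add_residual_state(model)
    for k in ("a", "b"):
        total = sum(row.get(k, 0.0) for row in extended.values())
        print(k, round(extended["r"][k], 6), round(total, 6))
```

The input dictionary is left unmodified, and the check at the end of the script confirms that the sum-to-1 constraint mentioned in the text holds for every observation after the new state is added.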
Claims (3)
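- Claim 1. A method comprising:
selecting a collection of documents that includes a first set of passages;
constructing a passage sequence model based on the first set of passages;
receiving a document that includes a second set of passages;
determining, based on the constructed passage sequence model, a sequence of operations associated with the new document in relation to the collection of documents; and
estimating, based on the determined sequence of operations, a similarity between the new document and at least one document in the collection.
- Claim 2. The method of claim 1, wherein the passage sequence model is a hidden Markov model (HMM), the method further comprising generating fingerprints of the first set of passages,
wherein at least one fingerprint corresponds to one state of the HMM.
- Claim 3. The method of claim 2, further comprising determining transition probabilities between the states of the HMM.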
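To make claims 2 and 3 concrete, the following Python sketch maps passages to fingerprints, treats each fingerprint as one HMM state, and estimates transition probabilities from the order in which passages appear in the collection. Everything specific here is an assumption for illustration: the claims do not say how fingerprints are computed (a truncated SHA-1 of whitespace-normalized, lowercased text is used as a stand-in) or how transition probabilities are determined (simple relative frequencies of consecutive passages are used). The names `fingerprint` and `transition_probabilities` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def fingerprint(passage: str) -> str:
    """Stand-in fingerprint: truncated SHA-1 of the normalized passage text."""
    normalized = " ".join(passage.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def transition_probabilities(
    collection: List[List[str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | state) from consecutive passages.

    Each document is a list of passages; each distinct fingerprint is one
    state. Probabilities are relative frequencies, so every row sums to 1.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for passages in collection:
        states = [fingerprint(p) for p in passages]
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    probs: Dict[str, Dict[str, float]] = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


if __name__ == "__main__":
    docs = [
        ["Intro", "Method", "Results"],
        ["Intro", "Results"],
    ]
    for src, row in transition_probabilities(docs).items():
        print(src, row)
```

States that only ever appear as the last passage of a document have no outgoing row in this sketch; a fuller model would need an explicit end state or smoothing, which is outside what the claims specify.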
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/774,426 | 2010-05-05 | ||
US12/774,426 US8086548B2 (en) | 2010-05-05 | 2010-05-05 | Measuring document similarity by inferring evolution of documents through reuse of passage sequences |
Publications (2)
Publication Number | Publication Date |
---|---|
JP2011238221A (ja) | 2011-11-24 |
JP5819629B2 (ja) | 2015-11-24 |
Family
ID=44262593
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
JP2011099059A | Active | JP5819629B2 (ja) | 2010-05-05 | 2011-04-27 | Measuring document similarity by inferring evolution of documents through reuse of passage sequences |
Country Status (4)
Country | Link |
---|---|
US (1) | US8086548B2 (ja) |
EP (1) | EP2385471A1 (ja) |
JP (1) | JP5819629B2 (ja) |
KR (1) | KR101711839B1 (ja) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8527436B2 (en) * | 2010-08-30 | 2013-09-03 | Stratify, Inc. | Automated parsing of e-mail messages |
US9262390B2 (en) * | 2010-09-02 | 2016-02-16 | Lexis Nexis, A Division Of Reed Elsevier Inc. | Methods and systems for annotating electronic documents |
US9449024B2 (en) * | 2010-11-19 | 2016-09-20 | Microsoft Technology Licensing, Llc | File kinship for multimedia data tracking |
US9256697B2 (en) * | 2012-05-11 | 2016-02-09 | Microsoft Technology Licensing, Llc | Bidirectional mapping between applications and network content |
KR101429621B1 (ko) * | 2012-10-04 | 2014-08-13 | 한양대학교 에리카산학협력단 | Duplicate news combining system and duplicate news combining method |
CN103530421B (zh) * | 2012-11-02 | 2017-01-04 | 中国人民解放军国防科学技术大学 | Microblog-based event similarity measurement method and system |
US9965521B1 (en) * | 2014-02-05 | 2018-05-08 | Google Llc | Determining a transition probability from one or more past activity indications to one or more subsequent activity indications |
US20160110315A1 (en) * | 2014-10-20 | 2016-04-21 | Xerox Corporation | Methods and systems for digitizing a document |
EP3215944B1 (en) | 2014-11-03 | 2021-07-07 | Vectra AI, Inc. | A system for implementing threat detection using daily network traffic community outliers |
WO2016073383A1 (en) | 2014-11-03 | 2016-05-12 | Vectra Networks, Inc. | A system for implementing threat detection using threat and risk assessment of asset-actor interactions |
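JP6972788B2 (ja) * | 2017-08-31 | 2021-11-24 | 富士通株式会社 | Identification program, identification method, and information processing apparatus |
CN114365161A (zh) * | 2019-09-18 | 2022-04-15 | 三菱电机株式会社 | Work element analysis device and work element analysis method |
CN113268959B (zh) * | 2021-05-25 | 2024-05-03 | 北京北大方正电子有限公司 | Document processing method and apparatus, and electronic device |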
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000003362A (ja) * | 1998-06-16 | 2000-01-07 | Dainippon Printing Co Ltd | Document analysis system and recording medium |
US6363381B1 (en) * | 1998-11-03 | 2002-03-26 | Ricoh Co., Ltd. | Compressed document matching |
US6774917B1 (en) * | 1999-03-11 | 2004-08-10 | Fuji Xerox Co., Ltd. | Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video |
US6990628B1 (en) * | 1999-06-14 | 2006-01-24 | Yahoo! Inc. | Method and apparatus for measuring similarity among electronic documents |
US6542635B1 (en) * | 1999-09-08 | 2003-04-01 | Lucent Technologies Inc. | Method for document comparison and classification using document image layout |
US6772120B1 (en) * | 2000-11-21 | 2004-08-03 | Hewlett-Packard Development Company, L.P. | Computer method and apparatus for segmenting text streams |
EP2067102A2 (en) * | 2006-09-15 | 2009-06-10 | Exbiblio B.V. | Capture and display of annotations in paper and electronic documents |
JP2009181170A (ja) * | 2008-01-29 | 2009-08-13 | Fujitsu Ltd | Work procedure manual creation support system |
- 2010-05-05: US application US12/774,426, patent US8086548B2 (en); not active (Expired - Fee Related)
- 2011-04-21: EP application EP11163473A, publication EP2385471A1 (en); not active (Withdrawn)
- 2011-04-27: JP application JP2011099059A, patent JP5819629B2 (ja); active
- 2011-05-02: KR application KR1020110041460A, patent KR101711839B1 (ko); active (IP Right Grant)
Also Published As
Publication number | Publication date |
---|---|
KR20110122789A (ko) | 2011-11-11 |
US20110276523A1 (en) | 2011-11-10 |
EP2385471A1 (en) | 2011-11-09 |
JP2011238221A (ja) | 2011-11-24 |
KR101711839B1 (ko) | 2017-03-13 |
US8086548B2 (en) | 2011-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5819629B2 (ja) | Measuring document similarity by inferring evolution of documents through reuse of passage sequences | |
JP6972265B2 (ja) | Pointer sentinel mixture architecture | |
US10146765B2 (en) | System and method for inputting text into electronic devices | |
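CN106484777B (zh) | Multimedia data processing method and device | |
CN113656807B (zh) | Vulnerability management method, apparatus, device, and storage medium | |
CN106445915B (zh) | New word discovery method and device | |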
US10289262B2 (en) | Method and system for determining user interface usage | |
CN104539514A (zh) | Message filtering method and device | |
KR101852527B1 (ko) | Machine learning-based dynamic simulation parameter calibration method | |
US20160092597A1 (en) | Method, controller, program and data storage system for performing reconciliation processing | |
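CN108509793A (zh) | Method and device for detecting abnormal user behavior based on user behavior log data | |
CN113268403B (zh) | Time series analysis and prediction method, apparatus, device, and storage medium | |
CN104573031B (zh) | Method for detecting burst events on microblogs | |
WO2014020834A1 (ja) | Word latent topic estimation device and word latent topic estimation method | |
JP5591772B2 (ja) | Context dependency estimation device, utterance clustering device, method, and program | |
CN100541491C (zh) | Document information processing apparatus, document information processing method, and computer-readable medium | |
CN103744830A (zh) | Method for identifying identity information in Excel documents based on semantic analysis | |
CN112328779B (zh) | Training sample construction method, apparatus, terminal device, and storage medium | |
CN113297854A (zh) | Method, apparatus, device, and storage medium for mapping text to knowledge graph entities | |
CN110378486B (zh) | Network embedding method, apparatus, electronic device, and storage medium | |
CN113935387A (zh) | Method and apparatus for determining text similarity, and computer-readable storage medium | |
JP5824429B2 (ja) | Spam account score calculation device, spam account score calculation method, and program | |
JP2007011571A (ja) | Information processing apparatus and program | |
CN111897618A (zh) | Method and apparatus for determining a UI, and storage medium | |
CN115934809B (zh) | Data processing method, apparatus, and electronic device |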
Legal Events
Code | Title | Description |
---|---|---|
RD04 | Notification of resignation of power of attorney | Free format text: JAPANESE INTERMEDIATE CODE: A7424; Effective date: 20130516 |
A621 | Written request for application examination | Free format text: JAPANESE INTERMEDIATE CODE: A621; Effective date: 20140421 |
A977 | Report on retrieval | Free format text: JAPANESE INTERMEDIATE CODE: A971007; Effective date: 20150210 |
A131 | Notification of reasons for refusal | Free format text: JAPANESE INTERMEDIATE CODE: A131; Effective date: 20150224 |
A521 | Request for written amendment filed | Free format text: JAPANESE INTERMEDIATE CODE: A523; Effective date: 20150522 |
TRDD | Decision of grant or rejection written | |
A01 | Written decision to grant a patent or to grant a registration (utility model) | Free format text: JAPANESE INTERMEDIATE CODE: A01; Effective date: 20150908 |
A61 | First payment of annual fees (during grant procedure) | Free format text: JAPANESE INTERMEDIATE CODE: A61; Effective date: 20151001 |
R150 | Certificate of patent or registration of utility model | Ref document number: 5819629; Country of ref document: JP; Free format text: JAPANESE INTERMEDIATE CODE: R150 |
R250 | Receipt of annual fees | Free format text: JAPANESE INTERMEDIATE CODE: R250 |
R250 | Receipt of annual fees | Free format text: JAPANESE INTERMEDIATE CODE: R250 |
R250 | Receipt of annual fees | Free format text: JAPANESE INTERMEDIATE CODE: R250 |
R250 | Receipt of annual fees | Free format text: JAPANESE INTERMEDIATE CODE: R250 |
R250 | Receipt of annual fees | Free format text: JAPANESE INTERMEDIATE CODE: R250 |