JP2024041369A

JP2024041369A - Similarity determination device, similarity determination system, and similarity determination method

Info

Publication number: JP2024041369A
Application number: JP2022146140A
Authority: JP
Inventors: アショクマールチェッティマニ; Chettymani Ashokkumar; 裕紀山▲崎▼; Yuki Yamazaki; 修吾三上; Shugo Mikami; 桃伽粕谷; Momoka Kasuya
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-09-14
Filing date: 2022-09-14
Publication date: 2024-03-27

Abstract

To provide similarity determination means capable of comparing sentence similarity at high speed and with high accuracy. SOLUTION: Provided is a similarity determination device, including: a keyword management database that stores a set of first domain keywords corresponding to a first domain; a domain graph database that stores a first domain graph generated based on the set of first domain keywords for a first sentence corresponding to the first domain; a graph creation part that generates a second domain graph for a second sentence corresponding to the first domain on the basis of the set of first domain keywords; and a similarity determination part that generates a comparison result indicating similarity between the first domain graph and the second domain graph by comparing the first domain graph with the second domain graph. SELECTED DRAWING: Figure 2

Description

The present invention relates to a similarity determination device, a similarity determination system, and a similarity determination method.

In recent years, cyber attacks have continued to evolve year after year with the discovery of new vulnerabilities and the emergence of new attack methods. Under these circumstances, whether an information system is equipped to defend against cyber attacks, and how far the damage would spread if it were attacked, have become important concerns for society and for organizations.

Therefore, in order to understand the risks to their business, organizations such as companies are required to obtain various reports on newly discovered vulnerabilities and cyber attacks (hereinafter referred to as cyber events) and to take appropriate measures based on these reports.

In order to take appropriate measures against cyber events that may pose a risk, it is desirable to determine, for each reported cyber event, whether that cyber event has already been dealt with or is currently being dealt with. If a reported cyber event is similar (or identical) to a cyber event for which countermeasures have already been developed, there is no need to develop countermeasures again, which saves the time and resources required for countermeasure development.

One conceivable way to determine whether a cyber event has already been dealt with or is currently being dealt with is to determine whether another report exists that is highly similar to the report describing the vulnerability or cyber attack.

Several proposals have been made for comparing sentences and determining their degree of similarity.
For example, Rettinger et al. (Non-Patent Document 1) describes the following technology: "Assessing the relatedness of documents is at the core of many applications, such as document retrieval and recommendation. Most similarity approaches operate on word-distribution-based document representations. These approaches are fast to compute, but they run into problems when documents differ in language, vocabulary, or type, and they ignore the rich relational knowledge available in knowledge graphs. Graph-based document models, on the other hand, can leverage valuable knowledge about the relationships between entities, but because graph operations are resource-intensive, similarity assessment tends to become infeasible in many applications. This paper presents an efficient semantic similarity approach that exploits explicit hierarchical and transversal relationships. The experiments show that (i) the similarity measure of this approach correlates significantly better with human perception of document similarity than comparable measures, (ii) this also holds for short documents with few annotations, and (iii) document similarity can be computed efficiently compared with other graph-traversal-based approaches."

Paul C., Rettinger A., Mogadala A., Knoblock C.A., Szekely P. (2016) Efficient Graph-Based Document Similarity. In: Sack H., Blomqvist E., d'Aquin M., Ghidini C., Ponzetto S., Lange C. (eds) The Semantic Web. Latest Advances and New Domains. ESWC 2016. Lecture Notes in Computer Science, vol 9678. Springer, Cham. https://doi.org/10.1007/978-3-319-34129-3_21

Non-Patent Document 1 above describes a means for determining the similarity of sentences by exploiting hierarchical and transversal relationships in the sentences and using graph traversal.

However, with the means described in Non-Patent Document 1, the entire sentence is represented as a graph. The longer the sentence, the larger the graph becomes, and the more computing resources and time are required to store and process it.
Therefore, when the means described in Non-Patent Document 1 is applied to, for example, reports describing cyber events, the size of the graph may delay the output of the comparison result and, in turn, delay the formulation of countermeasures against the cyber event.

Accordingly, an object of the present disclosure is to provide a similarity determination means capable of high-speed and highly accurate sentence similarity comparison by limiting the size of the graph representation corresponding to a sentence while preserving the relationship information between elements in the sentence.

In order to solve the above problem, a representative similarity determination device of the present invention includes: a keyword management database that stores a set of first domain keywords corresponding to a first domain; a domain graph database that stores a first domain graph generated, on the basis of the set of first domain keywords, for a first sentence corresponding to the first domain; a graph generation unit that generates, on the basis of the set of first domain keywords, a second domain graph for a second sentence corresponding to the first domain; and a similarity determination unit that compares the first domain graph with the second domain graph to generate a comparison result indicating the degree of similarity between the first domain graph and the second domain graph.

According to the present disclosure, it is possible to provide a similarity determination means capable of high-speed and highly accurate sentence similarity comparison by limiting the size of the graph representation corresponding to a sentence while preserving the relationship information between elements in the sentence.
Problems, configurations, and effects other than those described above will become clear from the description of the embodiments for carrying out the invention below.

FIG. 1 is a diagram illustrating a computer system for implementing embodiments of the present disclosure.
FIG. 2 is a diagram illustrating an example of the configuration of a similarity determination system according to an embodiment of the present disclosure.
FIG. 3 is a diagram illustrating the flow of the preliminary graph generation process and the flow of the graph comparison process in the similarity determination process according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating details of the flow of the preliminary graph generation process according to an embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an example of a domain graph and a normalized domain graph according to an embodiment of the present disclosure.
FIG. 6 is a diagram illustrating details of the flow of the graph comparison process according to an embodiment of the present disclosure.
FIG. 7 is a diagram illustrating the flow of an update process for updating domain keywords and domain graphs according to an embodiment of the present disclosure.
FIG. 8 is a diagram illustrating an example of a keyword management screen according to an embodiment of the present disclosure.
FIG. 9 is a diagram illustrating an example of a sentence management screen according to an embodiment of the present disclosure.
FIG. 10 is a diagram illustrating an example of the flow of a multi-domain management process according to an embodiment of the present disclosure.
FIG. 11 is a diagram illustrating a specific example of similarity determination for two domain graphs according to an embodiment of the present disclosure.

Embodiments of the present invention will be described below with reference to the drawings. Note that the present invention is not limited by these embodiments. In the drawings, identical parts are denoted by identical reference numerals.
In addition, terms such as "first," "second," and "third" may be used in the present disclosure to describe various elements or components, but it will be understood that these elements or components should not be limited by these terms. These terms are used only to distinguish one element or component from another. Accordingly, a first element or component discussed below could also be called a second element or component without departing from the teachings of the inventive concept.

First, with reference to FIG. 1, a computer system 100 for implementing an embodiment of the present disclosure will be described. The mechanisms and apparatus of the various embodiments disclosed herein may be applied to any suitable computing system. The main components of the computer system 100 include one or more processors 102, a memory 104, a terminal interface 112, a storage interface 113, an I/O (input/output) device interface 114, and a network interface 115. These components may be interconnected via a memory bus 106, an I/O bus 108, a bus interface unit 109, and an I/O bus interface unit 110.

The computer system 100 may include one or more general-purpose programmable central processing units (CPUs) 102A and 102B, collectively referred to as the processor 102. In some embodiments, the computer system 100 may include multiple processors, and in other embodiments, the computer system 100 may be a single-CPU system. Each processor 102 executes instructions stored in the memory 104 and may include an on-board cache.

In some embodiments, the memory 104 may include a random access semiconductor memory, a storage device, or a storage medium (either volatile or non-volatile) for storing data and programs. The memory 104 may store all or part of the programs, modules, and data structures that implement the functions described herein. For example, the memory 104 may store a similarity determination application 150. In some embodiments, the similarity determination application 150 may include instructions or descriptions for executing, on the processor 102, the functions described below.

In some embodiments, the similarity determination application 150 may be implemented in hardware via semiconductor devices, chips, logic gates, circuits, circuit cards, and/or other physical hardware devices, instead of or in addition to a processor-based system. In some embodiments, the similarity determination application 150 may include data other than instructions or descriptions. In some embodiments, a camera, sensor, or other data input device (not shown) may be provided so as to communicate directly with the bus interface unit 109, the processor 102, or other hardware of the computer system 100.

The computer system 100 may include a bus interface unit 109 that handles communication among the processor 102, the memory 104, a display system 124, and the I/O bus interface unit 110. The I/O bus interface unit 110 may be coupled to the I/O bus 108 for transferring data to and from various I/O units. The I/O bus interface unit 110 may communicate, via the I/O bus 108, with a plurality of I/O interface units 112, 113, 114, and 115, which are also known as I/O processors (IOPs) or I/O adapters (IOAs).

The display system 124 may include a display controller, a display memory, or both. The display controller can provide video data, audio data, or both to a display device 126. The computer system 100 may also include devices, such as one or more sensors, configured to collect data and provide that data to the processor 102.

For example, the computer system 100 may include a biometric sensor that collects heart rate data, stress level data, and the like, an environmental sensor that collects humidity data, temperature data, pressure data, and the like, and a motion sensor that collects acceleration data, movement data, and the like. Other types of sensors can also be used. The display system 124 may be connected to a display device 126 such as a standalone display screen, a television, a tablet, or a handheld device.

The I/O interface units have the function of communicating with various storage or I/O devices. For example, the terminal interface unit 112 allows attachment of user I/O devices 116, which include user output devices such as a video display device, speaker, or television, and user input devices such as a keyboard, mouse, keypad, touchpad, trackball, button, light pen, or other pointing device. Using a user interface, a user may operate the user input devices to enter input data and instructions into the user I/O devices 116 and the computer system 100, and may receive output data from the computer system 100. The user interface may, for example, be displayed on a display device, played back through a speaker, or printed by a printer via the user I/O devices 116.

The storage interface 113 allows attachment of one or more disk drives or direct access storage devices 117 (typically magnetic disk drive storage devices, although they may be an array of disk drives or other storage devices configured to appear as a single disk drive). In some embodiments, the storage device 117 may be implemented as any secondary storage device. The contents of the memory 104 may be stored in the storage device 117 and read from the storage device 117 as needed. The I/O device interface 114 may provide an interface to other I/O devices such as printers and fax machines. The network interface 115 may provide a communication path so that the computer system 100 and other devices can communicate with each other. This communication path may be, for example, a network 130.

In some embodiments, the computer system 100 may be a multi-user mainframe computer system, a single-user system, or a device such as a server computer that has no direct user interface and receives requests from other computer systems (clients). In other embodiments, the computer system 100 may be a desktop computer, portable computer, laptop, tablet computer, pocket computer, telephone, smartphone, or any other suitable electronic device.

Next, with reference to FIG. 2, a similarity determination system according to an embodiment of the present disclosure will be described.

FIG. 2 is a diagram illustrating an example of the configuration of a similarity determination system 200 according to an embodiment of the present disclosure. The similarity determination system 200 is a system for generating a comparison result indicating the degree of similarity between a plurality of sentences and providing it to a user. As shown in FIG. 2, the similarity determination system 200 consists of a similarity determination device 210, a communication network 250, and a user terminal 260. The similarity determination device 210 and the user terminal 260 may be connected to each other via the communication network 250.

The similarity determination device 210 is a device for determining the similarity of sentences and, as shown in FIG. 2, mainly includes a memory 220, a storage unit 230, a processor 244, and an input/output unit 246.
In some embodiments, the similarity determination device 210 may be implemented by the computer system 100 shown in FIG. 1.

The memory 220 may be a memory for storing the similarity determination application 150, which implements the functions of the similarity determination means according to an embodiment of the present disclosure. As shown in FIG. 2, the similarity determination application 150 may include processing instructions for implementing the functions of software modules such as a graph generation unit 221, a similarity determination unit 222, and an update unit 223.

The graph generation unit 221 is a functional unit for generating a domain graph corresponding to a target sentence. A "domain graph" in the present disclosure may be a directed acyclic graph (DAG) that represents the unique domain keywords in the target sentence as nodes and the relationships between domain keywords as edges. A "domain" in the present disclosure means a specific field or topic, and may include any topic such as "famous physicists," "automobiles," or "DoS attacks." By comparing the domain graphs generated for different sentences, the degree of similarity between the sentences can be determined.
The details of the function of the graph generation unit 221 will be described later, so their description is omitted here.
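As a rough illustration of the domain graph described above, the following Python sketch builds a graph whose nodes are the unique domain keywords found in a sentence. The function name `build_domain_graph`, the sample keywords, and in particular the edge rule (one edge from each keyword to the next keyword in order of first appearance) are assumptions made only for illustration; the edge construction actually used by the graph generation unit 221 is described later in the specification and may differ. Ordering nodes by first appearance is simply one way to keep the sketch acyclic.

```python
import re


def build_domain_graph(text, domain_keywords):
    """Return (nodes, edges) of a keyword-only directed acyclic graph.

    Hypothetical helper: words that are not domain keywords never become
    nodes, so the graph stays small even for long texts.
    """
    lowered = text.lower()
    # Record where each domain keyword first occurs in the text, if at all.
    first_pos = {}
    for kw in sorted(domain_keywords):
        match = re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
        if match:
            first_pos[kw] = match.start()
    # Each keyword appears once as a node, ordered by first appearance.
    nodes = sorted(first_pos, key=first_pos.get)
    # Assumed edge rule: link each keyword to the next one to appear.
    edges = set(zip(nodes, nodes[1:]))
    return nodes, edges


if __name__ == "__main__":
    keywords = {"DoS attack", "firewall", "patch", "ransomware"}
    report = "A DoS attack hit the firewall before the patch. The firewall failed."
    print(build_domain_graph(report, keywords))
```

Because only keywords become nodes, the size of the graph is bounded by the size of the keyword set rather than by the length of the sentence.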

The similarity determination unit 222 is a functional unit that uses the domain graphs generated by the graph generation unit 221 to determine the degree of similarity between a plurality of sentences and outputs a comparison result indicating the determined degree of similarity.
The details of the function of the similarity determination unit 222 will be described later, so their description is omitted here.

The update unit 223 is a functional unit that adds new domain keywords to, and deletes existing domain keywords from, the domain keywords stored in the keyword management DB 231 described later, and that updates the domain graphs stored in the domain graph DB 232 based on those changes to the domain keywords.
The details of the function of the update unit 223 will be described later, so their description is omitted here.
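The update flow can be pictured roughly as follows. This is only a sketch under stated assumptions: the function `apply_keyword_update`, the `build` callback, the stand-in builder `keyword_nodes`, and the idea that the original sentences are retained so that graphs can be regenerated are all hypothetical and are not taken from the specification, which describes the update processing in detail later.

```python
def apply_keyword_update(keywords, sentences, build, add=(), remove=()):
    """Return the updated keyword set and the regenerated graphs.

    keywords:  current domain keywords (any iterable of strings)
    sentences: mapping of sentence id -> source text (assumed to be retained)
    build:     callable(text, keywords) -> graph, supplied by the caller
    """
    updated = (set(keywords) | set(add)) - set(remove)
    # Assumption: every stored graph is rebuilt from its source text.
    graphs = {sid: build(text, updated) for sid, text in sentences.items()}
    return updated, graphs


def keyword_nodes(text, keywords):
    """Stand-in graph builder: returns only the keywords present in the text."""
    return {kw for kw in keywords if kw.lower() in text.lower()}


if __name__ == "__main__":
    new_keywords, new_graphs = apply_keyword_update(
        {"firewall", "patch"},
        {"report-1": "The firewall patch blocks the DoS attack."},
        keyword_nodes,
        add={"DoS attack"},
        remove={"patch"},
    )
    print(sorted(new_keywords), new_graphs)
```

Rebuilding every graph is the simplest policy; a real implementation might instead touch only the graphs that contain an added or deleted keyword.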

The storage unit 230 is a storage area that holds databases (hereinafter, "DBs") for storing various information according to an embodiment of the present disclosure and, as shown in FIG. 2, may include a keyword management DB 231 and a domain graph DB 232.

The keyword management DB 231 is a database for storing sets of domain keywords according to an embodiment of the present disclosure. A "domain keyword" in the present disclosure is a word that is of particularly high importance in a specific domain. These domain keywords may be selected by a user who requests a comparison result indicating the degree of similarity between sentences (for example, a user of the user terminal 260 described later). As an example, in the domain "particle physics," "Higgs boson" may be selected as a domain keyword. In some embodiments, the keyword management DB 231 may store sets of domain keywords corresponding to various different domains. As described later, these domain keywords are used when generating a domain graph according to an embodiment of the present disclosure.
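To make the per-domain organization concrete, the sketch below models the keyword management DB 231 as an in-memory mapping from a domain name to a set of keywords. The class name `KeywordStore` and its methods are hypothetical; the specification does not prescribe a schema or a storage technology, so this is an illustrative stand-in rather than the actual database.

```python
from collections import defaultdict


class KeywordStore:
    """Hypothetical in-memory stand-in for a per-domain keyword database."""

    def __init__(self):
        # One keyword set per domain; a set keeps keywords unique.
        self._by_domain = defaultdict(set)

    def add(self, domain, keyword):
        """Register a keyword that a user selected for a domain."""
        self._by_domain[domain].add(keyword)

    def keywords(self, domain):
        """Return the keyword set for a domain (empty if unknown)."""
        return frozenset(self._by_domain.get(domain, ()))


if __name__ == "__main__":
    store = KeywordStore()
    store.add("particle physics", "Higgs boson")
    store.add("DoS attacks", "botnet")
    print(store.keywords("particle physics"))
```

Keeping a separate set per domain means a sentence is only ever matched against the keywords of its own domain.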

The domain graph DB 232 is a database for storing the domain graphs generated by the graph generation unit 221. As described later, a domain graph stored in the domain graph DB 232 (for example, a first domain graph) is used as the comparison target for a newly generated domain graph (for example, a second domain graph).

The processor 244 is a processing unit that executes the processing instructions defining the functions of each functional unit of the similarity determination application 150 stored in the memory 220.

The input/output unit 246 is a functional unit that receives information input to the similarity determination device 210 and outputs information, such as comparison results, generated by the similarity determination device 210. In some embodiments, the input/output unit 246 may include, for example, a keyboard, a mouse, and a display that shows a GUI (Graphical User Interface).

The communication network 250 may include, for example, a local area network (LAN), a wide area network (WAN), a satellite network, a cable network, a WiFi network, or any combination thereof.

The user terminal 260 is a terminal device that can be used by a user of the similarity determination device 210. With the user terminal 260, the user can, for example, use the GUI provided by the input/output unit 246 to request a comparison result indicating the degree of similarity between sentences, select domain keywords, and check comparison results. The user terminal 260 may be, for example, a smartphone, smartwatch, tablet, or personal computer, and is not particularly limited.
Note that FIG. 2 shows, for convenience of explanation, a configuration including one user terminal 260 as an example, but the number of user terminals 260 is not particularly limited.

The similarity determination device 210 described above makes it possible to provide a similarity determination means capable of high-speed and highly accurate sentence similarity comparison by limiting the size of the graph representation corresponding to a sentence while preserving the relationship information between elements in the sentence.

Next, with reference to FIG. 3, the overall flow of the similarity determination process according to an embodiment of the present disclosure will be described.

As described above, one aspect of the embodiments of the present disclosure relates to a process for determining the degree of similarity between a plurality of sentences. This similarity determination process mainly includes a preliminary graph generation process 400 and a graph comparison process 600. FIG. 3 is a diagram illustrating the flow of the preliminary graph generation process 400 and the flow of the graph comparison process 600 in the similarity determination process according to an embodiment of the present disclosure.

The preliminary graph generation process 400 is a process for generating in advance, from an existing sentence (a first sentence), the domain graph that serves as the reference when sentences are compared (for example, a normalized first domain graph) and storing it in the domain graph DB 232. More specifically, as shown in FIG. 3, in the preliminary graph generation process 400, the graph generation unit 221 uses a first sentence 412 corresponding to a first domain and the first domain keywords that are stored in the keyword management DB 231 and correspond to that first domain to generate a normalized first domain graph 415, which represents the first sentence 412 in graph form, and stores it in the domain graph DB 232.

The graph comparison process 600 is a process for generating a new domain graph (for example, a normalized second domain graph) corresponding to a target sentence (a second sentence) and comparing it with an existing domain graph (for example, the normalized first domain graph generated by the preliminary graph generation process 400 shown in FIG. 4), thereby generating a comparison result indicating the degree of similarity between the normalized first domain graph and the normalized second domain graph.

More specifically, as shown in FIG. 3, in the graph comparison process 600 the graph generation unit 221 uses a second sentence 422 corresponding to the first domain, together with the first domain keywords that are stored in the keyword management DB 231 and correspond to that first domain, to generate a normalized second domain graph 425 that represents the second sentence 422 in graph form. The similarity determination unit 222 then compares the normalized first domain graph 415 stored in the domain graph DB 232 with the newly generated normalized second domain graph 425, thereby generating a comparison result that indicates the similarity between the two normalized domain graphs.

The above has outlined the general flow of the preliminary graph generation process 400 and the graph comparison process 600 included in the similarity determination process according to the embodiment of the present disclosure. The details of these processes are described later and are therefore omitted here.

Next, the details of the preliminary graph generation process according to the embodiment of the present disclosure will be described with reference to FIG. 4.

FIG. 4 shows the detailed flow of the preliminary graph generation process 400 according to the embodiment of the present disclosure. As described above, the preliminary graph generation process 400 generates a normalized first domain graph in advance and stores it in the domain graph DB 232. It is performed mainly by the graph generation unit 221 of the similarity determination device 210.

First, in step S410, the graph generation unit 221 extracts from the keyword management DB 231 the set of domain keywords corresponding to the domain of the first sentence 412 and identifies, within the extracted set, the subset of domain keywords contained in the first sentence 412.

As an example, if the first sentence 412 relates to the domain "physics", the graph generation unit 221 extracts the set of domain keywords corresponding to "physics" from the keyword management DB 231. If this set contains "atom", "gravity", "neutron", "charge", "electron", and "uncertainty principle", the graph generation unit 221 searches the first sentence 412 for these domain keywords and identifies those that occur. For example, if only "atom", "neutron", and "electron" appear in the first sentence 412, then "atom", "neutron", and "electron" are identified as the subset of domain keywords.
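The keyword lookup of step S410 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `find_keyword_subset`, the sample sentence, and the use of case-insensitive substring matching are all assumptions made for the example.

```python
def find_keyword_subset(sentence: str, domain_keywords: list[str]) -> list[str]:
    """Return the domain keywords that occur in the sentence, in keyword-set order."""
    text = sentence.lower()
    # Assumption: plain substring matching; a real system would likely tokenize first.
    return [kw for kw in domain_keywords if kw.lower() in text]


# Hypothetical keyword set for the "physics" domain and a made-up first sentence.
physics_keywords = ["atom", "gravity", "neutron", "charge", "electron", "uncertainty principle"]
first_sentence = "An atom contains at least one electron, and its nucleus may hold a neutron."

subset = find_keyword_subset(first_sentence, physics_keywords)
print(subset)  # ['atom', 'neutron', 'electron']
```

Substring matching would also match a keyword inside a longer word, so a tokenizer or morphological analyzer would be the more robust choice for real sentences.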

Next, in step S420, the graph generation unit 221 generates nodes based on the subset of domain keywords identified in step S410. A node here is a data structure for representing a domain keyword of the subset in the domain graph described later. In one embodiment, the graph generation unit 221 may create one node for each unique domain keyword in the subset identified in the first sentence 412. As an example, if "atom", "neutron", and "electron" are identified as the subset of domain keywords, the graph generation unit 221 may generate a set of nodes containing one node for each of "atom", "neutron", and "electron".

Next, in step S430, the graph generation unit 221 calculates a node score for each node generated in step S420. A "node score" in the present disclosure is a quantitative measure of the relative importance, within the sentence, of the domain keyword corresponding to a given node; the higher the importance, the higher the node score. In one embodiment, the graph generation unit 221 may calculate the node score of a node based on the number of times the domain keyword corresponding to that node appears in the first sentence 412 (the number of occurrences). In this case, the graph generation unit 221 may calculate the node score $N_i$ ($N_1, N_2, \ldots, N_T$) of each node using Equation 1.

[Equation 1]

Here, $\mathrm{Count}(N_i)$ is the number of occurrences, in a given sentence (for example, the first sentence 412), of the domain keyword corresponding to node $N_i$, and $T$ is the total number of nodes in the graph.

Although the node score has been described above as calculated from the number of occurrences of a domain keyword in the sentence, the present disclosure is not limited to this. For example, the node score may instead be calculated based on a semantic cosine value of a given domain keyword.
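As an illustration of an occurrence-based node score, the sketch below counts how often each keyword appears and divides by the total number of keyword occurrences. Equation 1 itself is not reproduced in this text, so this share-of-occurrences formula is an assumption chosen for the example, as are the function name `node_scores` and the sample sentence.

```python
import re


def node_scores(sentence: str, keyword_subset: list[str]) -> dict[str, float]:
    """Score each keyword by its share of all keyword occurrences (assumed formula)."""
    text = sentence.lower()
    # Count(N_i): non-overlapping occurrences of each keyword in the sentence.
    counts = {kw: len(re.findall(re.escape(kw.lower()), text)) for kw in keyword_subset}
    total = sum(counts.values())
    return {kw: (count / total if total else 0.0) for kw, count in counts.items()}


# Made-up sentence: "atom" occurs 3 times, "electron" twice, "neutron" once.
sentence = (
    "The atom holds an electron. A second electron orbits the same atom; "
    "the neutron sits inside the atom."
)
scores = node_scores(sentence, ["atom", "neutron", "electron"])
print(scores["atom"])  # 0.5
```

Whatever the exact form of Equation 1, the property stated in the description holds here: a keyword that occurs more often receives a higher node score.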

Next, in step S440, the graph generation unit 221 generates edges that indicate the relationships among the set of nodes generated in step S420. An edge here is a data structure for representing a relationship between nodes in the domain graph. In one embodiment, the graph generation unit 221 may create in advance a knowledge graph corresponding to the first sentence and generate the edges indicating the relationships among the set of nodes based on that knowledge graph.

In general, a knowledge graph is a data structure that systematically links various pieces of knowledge and represents them as a graph. The means for generating the knowledge graph is not particularly limited, and any existing means, such as natural language processing or a neural network, may be used.

Next, in step S450, the graph generation unit 221 calculates an edge weight for each edge generated in step S440. An "edge weight" in the present disclosure is a quantitative measure of the degree of association between the domain keywords corresponding to the two nodes that an edge connects in the domain graph; the higher the degree of association, the higher the edge weight.

In one embodiment, the graph generation unit 221 may calculate the edge weights based on a knowledge graph created in advance from the first sentence. In this case, the graph generation unit 221 may calculate each edge weight based on the shortest distance (number of connections) in the knowledge graph between the domain keywords corresponding to the two nodes. In one embodiment, the graph generation unit 221 may calculate the edge weight $E_{ij}$ of the edge between node $i$ and node $j$ using Equation 2:

$$E_{ij} = \frac{1}{L_{ij}}$$

Here, $L_{ij}$ is the number of connections in the knowledge graph between the domain keyword corresponding to node $i$ and the domain keyword corresponding to node $j$. In the knowledge graph, the fewer the connections between two domain keywords, the higher their degree of association; for convenience of explanation, the reciprocal of the number of connections (1 / number of connections) is therefore used as the edge weight $E_{ij}$. Accordingly, the higher the degree of association between two domain keywords, the higher the edge weight $E_{ij}$.
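The reciprocal-distance edge weight can be sketched as below. The knowledge graph is assumed, for illustration only, to be an undirected adjacency dictionary; the names `connection_count` and `edge_weight` and the sample graph are hypothetical, and breadth-first search is simply one common way to obtain the shortest number of connections.

```python
from collections import deque


def connection_count(knowledge_graph: dict, start: str, goal: str):
    """Return the fewest hops between two entries, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in knowledge_graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def edge_weight(knowledge_graph: dict, keyword_i: str, keyword_j: str) -> float:
    """E_ij = 1 / L_ij; 0.0 when the keywords are identical or unconnected."""
    hops = connection_count(knowledge_graph, keyword_i, keyword_j)
    return 1.0 / hops if hops else 0.0


# Made-up knowledge graph: "nucleus" is an intermediate entry, not a domain keyword.
kg = {
    "atom": ["nucleus", "electron"],
    "nucleus": ["atom", "neutron"],
    "neutron": ["nucleus"],
    "electron": ["atom"],
}
print(edge_weight(kg, "atom", "electron"))  # 1.0
print(edge_weight(kg, "atom", "neutron"))   # 0.5
```

Returning 0.0 for unconnected keywords is a design choice of this sketch; the description does not state how such pairs are handled.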

Next, in step S460, the graph generation unit 221 generates the first domain graph using the set of nodes generated in step S420, the node scores calculated in step S430, the set of edges generated in step S440, and the edge weights calculated in step S450. This first domain graph is a directed acyclic graph (DAG) in which the domain keywords of the subset identified in the first sentence 412 are the nodes and the relationships between domain keywords are the edges. Each node in the first domain graph is associated with a node score indicating the importance of the corresponding domain keyword, and each edge is associated with an edge weight indicating the degree of association between the nodes that the edge connects.
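One possible in-memory shape for such a domain graph is sketched below. The class name `DomainGraph`, its fields, and the sample values are hypothetical; the sketch stores directed edges as (source, target) pairs and does not itself check that the graph stays acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class DomainGraph:
    """Nodes keyed by domain keyword, directed edges keyed by (source, target)."""

    node_scores: dict[str, float] = field(default_factory=dict)
    edge_weights: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_node(self, keyword: str, score: float) -> None:
        self.node_scores[keyword] = score

    def add_edge(self, source: str, target: str, weight: float) -> None:
        # Edges may only connect keywords that already exist as nodes.
        if source not in self.node_scores or target not in self.node_scores:
            raise KeyError("both endpoints must be added as nodes first")
        self.edge_weights[(source, target)] = weight


# Made-up scores and weights for the "physics" example.
graph = DomainGraph()
for keyword, score in {"atom": 0.5, "electron": 0.3, "neutron": 0.2}.items():
    graph.add_node(keyword, score)
graph.add_edge("atom", "electron", 1.0)
graph.add_edge("atom", "neutron", 0.5)
print(len(graph.node_scores), len(graph.edge_weights))  # 3 2
```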

Next, in step S470, the graph generation unit 221 normalizes the first domain graph 415 generated in step S460 to generate the normalized first domain graph. More specifically, the graph generation unit 221 may generate the normalized first domain graph by normalizing the node score of each node and the edge weight of each edge in the first domain graph. Here, the graph generation unit 221 may calculate the normalized node score $NS_i$ based on Equation 3 and the normalized edge weight $NE_{ij}$ based on Equation 4.

[Equation 3]

[Equation 4]
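The normalization step can be illustrated with a simple sum-to-one scaling. Equations 3 and 4 are not reproduced in this text, so dividing each value by the total is only an assumed normalization; the function name `normalize` and the raw values below are likewise hypothetical.

```python
def normalize(values: dict) -> dict:
    """Scale the values so that they sum to 1 (assumed normalization)."""
    total = sum(values.values())
    if total == 0:
        return {key: 0.0 for key in values}
    return {key: value / total for key, value in values.items()}


# Made-up raw node scores and edge weights of one domain graph.
raw_node_scores = {"atom": 3.0, "electron": 2.0, "neutron": 1.0}
raw_edge_weights = {("atom", "electron"): 1.0, ("atom", "neutron"): 0.5}

normalized_nodes = normalize(raw_node_scores)
normalized_edges = normalize(raw_edge_weights)
print(normalized_nodes["atom"])  # 0.5
```

Under this assumption, graphs built from long and short sentences end up on the same scale, which is the stated purpose of normalizing before comparison.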

Next, in step S480, the graph generation unit 221 stores the normalized first domain graph generated in step S470 in the domain graph DB 232, in association with the first sentence 412 and the subset of domain keywords identified in step S410.

By performing the preliminary graph generation process 400 described above on various sentences, domain graphs corresponding to many different sentences can be prepared in advance and stored in the domain graph DB 232. As described later, the normalized first domain graph generated by the preliminary graph generation process 400 is used in the graph comparison process 600 for comparison with the normalized second domain graph generated from the second sentence. Accordingly, performing the preliminary graph generation process 400 on various sentences and preparing many domain graphs for comparison enables similarity determination over a wider range and makes it easier to identify a first sentence that is highly similar to the second sentence.

Next, a domain graph and a normalized domain graph according to the embodiment of the present disclosure will be described with reference to FIG. 5.

FIG. 5 shows an example of a domain graph 510 and a normalized domain graph 520 according to the embodiment of the present disclosure.

As described above, the domain graph in the present disclosure may be a directed acyclic graph (DAG) in which the unique domain keywords in the target sentence are the nodes and the relationships between domain keywords are the edges.

As an example, as shown in FIG. 5, the domain graph 510 may include many nodes, such as node "i", node "i+1", and node "i+x". Each node in the domain graph is associated with a node score indicating the relative importance of that node, and each edge is associated with an edge weight indicating the degree of association between the nodes it connects. For example, node "i" is associated with the node score $N_i$, and the edge connecting node "i" and node "i+1" is associated with the edge weight $E_{i,i+1}$.

Normalizing each node score and edge weight in the domain graph 510 yields the normalized domain graph 520. As shown in FIG. 5, each node of the normalized domain graph 520 is associated with a normalized node score, and each edge is associated with a normalized edge weight. For example, node "i" is associated with the normalized node score $NS_i$, and the edge connecting node "i" and node "i+1" is associated with the normalized edge weight $NE_{i,i+1}$.

Representing a sentence as a domain graph in this way enables more accurate similarity determination.

In addition, as described above, each node in the domain graph is associated with a node score indicating the importance within the sentence of the corresponding domain keyword, and each edge is associated with an edge weight indicating the degree of association between the domain keywords corresponding to the nodes it connects. This keeps the domain graph small while preserving the semantic information about the keywords that matter most in the sentence, without the whole sentence having to be represented as a graph.

Furthermore, generating a normalized domain graph, in which each node score and edge weight has been normalized, prevents the accuracy of similarity determination from degrading because the domain graphs being compared differ in size.

Next, the details of the graph comparison process according to the embodiment of the present disclosure will be described with reference to FIG. 6.

FIG. 6 shows the detailed flow of the graph comparison process 600 according to the embodiment of the present disclosure. As described above, the graph comparison process 600 generates a new domain graph (for example, a normalized second domain graph) corresponding to a target sentence (the second sentence) and compares it against an existing domain graph (for example, the normalized first domain graph generated by the preliminary graph generation process 400 shown in FIG. 4), thereby generating a comparison result 685 that indicates the similarity between the normalized first domain graph and the normalized second domain graph. The graph comparison process 600 is performed mainly by the graph generation unit 221 and the similarity determination unit 222.

In the graph comparison process 600 described below, the processing that generates the normalized second domain graph corresponding to the second sentence (steps S610 to S670) is substantially the same as the preliminary graph generation process 400 shown in FIG. 4, so explanations that would repeat the earlier description are omitted.

First, in step S605, the input/output unit 246 receives from the user terminal 260 an input that includes the second sentence 422 and information indicating the domain name of the second sentence 422. The second sentence 422 may be the same as the first sentence 412 shown in FIG. 4 or may be a different sentence. In one embodiment, the second sentence 422 may be a sentence whose similarity to existing sentences is to be determined. The input/output unit 246 forwards the received second sentence 422 to the graph generation unit 221 and forwards the information indicating the domain name of the second sentence 422 to the keyword management DB 231 and the domain graph DB 232.

Next, in step S610, the graph generation unit 221 extracts from the keyword management DB 231 the set of domain keywords corresponding to the domain of the second sentence 422 and identifies, within the extracted set, the subset of domain keywords contained in the second sentence 422 (a subset of the first domain keywords).

Next, in step S620, the graph generation unit 221 generates nodes (a first node set) based on the subset of domain keywords identified in step S610.

Next, in step S630, the graph generation unit 221 calculates a node score for each node generated in step S620. Equation 1 described above may be used to calculate the node scores.

Next, in step S640, the graph generation unit 221 generates edges that indicate the relationships among the set of nodes generated in step S620.

Next, in step S650, the graph generation unit 221 calculates an edge weight for each edge generated in step S640. A knowledge graph generated from the second sentence 422 and Equation 2 described above may be used to calculate the edge weights.

Next, in step S660, the graph generation unit 221 generates the second domain graph using the set of nodes generated in step S620, the node scores calculated in step S630, the set of edges generated in step S640, and the edge weights calculated in step S650.

Next, in step S670, the graph generation unit 221 normalizes the second domain graph generated in step S660 to generate the normalized second domain graph. The graph generation unit 221 may use Equations 3 and 4 described above for this normalization.

Next, in step S680, the similarity determination unit 222 retrieves from the domain graph DB 232 the normalized first domain graph generated by the preliminary graph generation process 400 shown in FIG. 4, and then determines the similarity between the normalized first domain graph and the normalized second domain graph generated in step S670.

Here, the similarity determination unit 222 first calculates, based on the normalized first domain graph and the normalized second domain graph, a common node score (CNS) that indicates the importance of the nodes common to the two graphs, using Equation 5.

[Equation 5]

Here, $f_1$ is the normalized first domain graph, $f_2$ is the normalized second domain graph, and $NS_i$ is the normalized node score.

Next, based on the normalized first domain graph and the normalized second domain graph, the similarity determination unit 222 calculates a relationship score for common nodes (Rs), which indicates the degree of association of the edges between the nodes common to the two graphs, using Equation 6.

[Equation 6]

Next, the similarity determination unit 222 uses the calculated CNS and Rs to calculate the similarity between the normalized first domain graph and the normalized second domain graph using Equation 7. This similarity may be expressed as a percentage. A similarity of "100%" therefore means that the normalized first domain graph and the normalized second domain graph (and hence the first sentence and the second sentence from which the graphs were derived) are identical.

[Equation 7]
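To make the roles of CNS and Rs concrete, the sketch below combines an overlap score over common nodes with an overlap score over common edges. Equations 5 to 7 are not reproduced in this text, so every formula here is an assumption: the use of `min` as the overlap measure, the equal averaging of CNS and Rs, and the requirement that both graphs be normalized so that their scores sum to 1. The function names and sample graphs are hypothetical.

```python
def common_node_score(nodes_1: dict, nodes_2: dict) -> float:
    """Assumed CNS: overlap of normalized scores over nodes present in both graphs."""
    return sum(min(nodes_1[k], nodes_2[k]) for k in nodes_1.keys() & nodes_2.keys())


def relationship_score(edges_1: dict, edges_2: dict) -> float:
    """Assumed Rs: overlap of normalized weights over edges present in both graphs."""
    return sum(min(edges_1[e], edges_2[e]) for e in edges_1.keys() & edges_2.keys())


def similarity_percent(nodes_1: dict, edges_1: dict, nodes_2: dict, edges_2: dict) -> float:
    """Assumed combination: average CNS and Rs and express the result as a percentage."""
    cns = common_node_score(nodes_1, nodes_2)
    rs = relationship_score(edges_1, edges_2)
    return 100.0 * (cns + rs) / 2.0


# Two made-up normalized domain graphs (scores and weights each sum to 1).
nodes_a = {"atom": 0.5, "electron": 0.25, "neutron": 0.25}
edges_a = {("atom", "electron"): 0.5, ("atom", "neutron"): 0.5}
nodes_b = {"atom": 0.5, "electron": 0.5}
edges_b = {("atom", "electron"): 1.0}

print(similarity_percent(nodes_a, edges_a, nodes_a, edges_a))  # 100.0
print(similarity_percent(nodes_a, edges_a, nodes_b, edges_b))  # 62.5
```

With these assumptions, two identical normalized graphs score 100%, matching the property stated in the description, and graphs with no common nodes score 0%.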

Next, the similarity determination unit 222 generates a comparison result 685 indicating the calculated similarity and transmits it to the user terminal 260.

The graph comparison process 600 described above keeps the graph representation of a sentence small while preserving the relationship information between elements of the sentence, and thus provides a similarity determination means capable of fast and accurate sentence similarity comparison.

Next, a process for updating domain keywords and domain graphs according to the embodiment of the present disclosure will be described with reference to FIG. 7.

As described above, the keyword management DB 231 according to the embodiment of the present disclosure can store sets of domain keywords corresponding to various different domains. Because similarity determination according to the embodiment of the present disclosure uses domain graphs generated from these domain keywords, it may be desirable, in order to keep the determination accurate, to update the set of domain keywords for a given domain by adding new domain keywords, deleting existing ones, and so on.

When a given set of domain keywords stored in the keyword management DB 231 is updated, it is also desirable to update the domain graphs that were generated from that set and are stored in the domain graph DB 232.

Accordingly, one aspect of the present disclosure relates to updating the sets of domain keywords stored in the keyword management DB 231 and the domain graphs stored in the domain graph DB 232 based on an update request entered by the user via the user terminal 260. FIG. 7 shows the flow of an update process 700 for updating domain keywords and domain graphs according to the embodiment of the present disclosure.

First, the input/output unit 246 receives from the user terminal 260 an update request that asks for an update to a given set of domain keywords (for example, the first set of domain keywords) stored in the keyword management DB 231. The update request may be a user input that asks for a new domain keyword to be added to, an existing domain keyword to be deleted from, or an existing domain keyword to be changed in the given set. In one embodiment, the update request may be entered through a user interface presented to the user by the input/output unit 246.

Next, the update unit 223 updates the specified set of domain keywords in the keyword management DB 231 based on the update request obtained from the user terminal 260. As an example, if the update request specifies that the domain keyword "photon" be added to the set of domain keywords for the domain "physics", the update unit 223 may add the domain keyword "photon" to the set of domain keywords for the domain "physics" stored in the keyword management DB 231.

The update unit 223 then instructs the graph generation unit 221 to update, based on the updated set of domain keywords, those domain graphs stored in the domain graph DB 232 that correspond to the domain whose set of domain keywords was updated. Based on the updated set of domain keywords, the graph generation unit 221 may add new nodes and edges to a domain graph, delete existing nodes and edges, and recalculate the node score of each node and the edge weight of each edge.
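The two stages of the update process 700, changing the keyword set and then regenerating the affected graphs, can be sketched as follows. The in-memory dictionaries standing in for the keyword management DB and the domain graph DB, the function names, and the trivial `build_graph` stand-in (which returns only the matched keywords) are all assumptions made for illustration.

```python
def apply_keyword_update(keyword_db, domain, action, keyword, replacement=None):
    """Apply an add / delete / change request to one domain's keyword set."""
    keywords = keyword_db.setdefault(domain, [])
    if action == "add":
        if keyword not in keywords:
            keywords.append(keyword)
    elif action == "delete":
        if keyword in keywords:
            keywords.remove(keyword)
    elif action == "change":
        keywords[keywords.index(keyword)] = replacement
    else:
        raise ValueError(f"unknown action: {action}")
    return keywords


def regenerate_graphs(graph_db, sentences, domain, keywords, build_graph):
    """Rebuild every stored graph of the domain from the updated keyword set."""
    for sentence_id, sentence in sentences.get(domain, {}).items():
        graph_db[(domain, sentence_id)] = build_graph(sentence, keywords)


def build_graph(sentence, keywords):
    # Stand-in for the full graph generation: keeps only the matched keywords.
    return [kw for kw in keywords if kw in sentence]


keyword_db = {"physics": ["atom", "electron"]}
sentences = {"physics": {"s1": "a photon is emitted when an electron drops to a lower level"}}
graph_db = {}

updated = apply_keyword_update(keyword_db, "physics", "add", "photon")
regenerate_graphs(graph_db, sentences, "physics", updated, build_graph)
print(graph_db[("physics", "s1")])  # ['electron', 'photon']
```

Rebuilding each graph from scratch is the simplest option; the description also allows incremental changes such as adding or deleting individual nodes and edges.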

As described above, the update process 700 according to the embodiment of the present disclosure updates a set of domain keywords stored in the keyword management DB 231 together with the domain graphs corresponding to that set. This makes it possible to reflect changes in domain keywords as a domain advances and to configure domain keywords that suit the user's purposes and values, which improves the accuracy of similarity determination.

Next, a keyword management screen according to the embodiment of the present disclosure will be described with reference to FIG. 8.

FIG. 8 shows an example of a keyword management screen 800 according to the embodiment of the present disclosure. Through the keyword management screen 800, the user can register and update the domain keywords stored in the keyword management DB 231 of the similarity determination device 210. In one embodiment, the keyword management screen 800 may be a user interface screen that is generated by, for example, the input/output unit 246 and presented on the user terminal 260.

As shown in FIG. 8, the keyword management screen 800 may include a new domain registration window 810 and a domain keyword update window 820.

In the new domain registration window 810, the user can register a new domain in the keyword management DB 231. For example, by entering the domain name "architecture" in the input area 811 of the new domain registration window 810 and pressing the confirm button 812, the user can register the new domain "architecture" in the keyword management DB 231.

ドメインキーワード更新ウィンドウ８２０では、ユーザは、登録済みのドメイン（例えば新ドメイン登録ウィンドウ８１０で登録したドメイン）について、新たなドメインキーワードの追加、既存のドメインキーワードの削除、又は既存のドメインキーワードの変更等を行うことができる。例えば、ユーザは、ドメイン選択ウィンドウ８２１で特定のドメイン名を選択した後、当該ドメインについて追加、削除、又は変更したドメインキーワードをドメインキーワード入力ウィンドウ８２２において入力することができる。その後、確定ボタン８２３を押すことで、ユーザは、ドメインキーワード入力ウィンドウ８２２に入力したドメインキーワードに対して行いたいアクション（追加、削除、又は変更）を選択することができる。更に、ユーザは、ドメイングラフ更新８２４を押すことで、更新要求を入力し、図７に示す更新処理を実施することができる。 In the domain keyword update window 820, the user can add a new domain keyword, delete an existing domain keyword, or change an existing domain keyword for a registered domain (for example, a domain registered in the new domain registration window 810). For example, after selecting a particular domain name in the domain selection window 821, the user can enter the domain keyword to be added, deleted, or changed for that domain in the domain keyword input window 822. Then, by pressing the confirm button 823, the user can select the action (add, delete, or change) to be performed on the domain keyword entered in the domain keyword input window 822. Furthermore, by pressing the domain graph update button 824, the user can input an update request and have the update process shown in FIG. 7 carried out.

以上説明したキーワード管理画面８００では、ユーザは、キーワード管理部ＤＢ２３１に格納されるドメインキーワードを容易に管理することができる。 On the keyword management screen 800 described above, the user can easily manage domain keywords stored in the keyword management unit DB 231.
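The registration and update operations offered by the keyword management screen 800 can be sketched as follows. This is a minimal illustration only: the class `KeywordStore`, its method names, and the in-memory dictionary are assumptions introduced here and stand in for the keyword management DB 231, whose actual implementation is not described in the disclosure.

```python
class KeywordStore:
    """Hypothetical in-memory stand-in for the keyword management DB 231."""

    def __init__(self):
        # Maps a domain name to its set of domain keywords.
        self._domains = {}

    def register_domain(self, name):
        # Corresponds to the new domain registration window 810.
        self._domains.setdefault(name, set())

    def apply(self, domain, action, keyword, new_keyword=None):
        # Corresponds to the domain keyword update window 820.
        keywords = self._domains[domain]  # KeyError if the domain is unknown
        if action == "add":
            keywords.add(keyword)
        elif action == "delete":
            keywords.discard(keyword)
        elif action == "change":
            if new_keyword is None:
                raise ValueError("'change' requires new_keyword")
            keywords.discard(keyword)
            keywords.add(new_keyword)
        else:
            raise ValueError(f"unknown action: {action}")

    def keywords(self, domain):
        # Return a read-only copy of the domain's keyword set.
        return frozenset(self._domains[domain])
```

After any such change, the domain graphs built from the affected keyword set would need to be regenerated, which is what the update request of the update process 700 triggers.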

次に、図９を参照して、本開示の実施形態に係る文章管理画面について説明する。 Next, with reference to FIG. 9, a text management screen according to an embodiment of the present disclosure will be described.

図９は、本開示の実施形態に係る文章管理画面９００の一例を示す図である。この文章管理画面９００によれば、ユーザは、特定の文章と類似している他の文章を検索したり、２つの文章を比較（類似度判定）したりすることができる。ある実施形態では、この文章管理画面９００は、例えば入出力部２４６によって生成され、ユーザ端末２６０に提示されるユーザインターフェースの画面であってもよい。 FIG. 9 is a diagram illustrating an example of a text management screen 900 according to an embodiment of the present disclosure. According to this text management screen 900, the user can search for other texts that are similar to a specific text, or compare two texts (determine the degree of similarity). In one embodiment, the text management screen 900 may be a user interface screen generated by the input/output unit 246 and presented to the user terminal 260, for example.

図９に示すように、文章管理画面９００は、類似文章検索ウィンドウ９１０と、文章比較ウィンドウ９２０とを含んでもよい。 As shown in FIG. 9, the text management screen 900 may include a similar text search window 910 and a text comparison window 920.

類似文章検索ウィンドウ９１０では、ユーザは、特定の文章と類似している他の文章を検索することができる。例えば、ユーザは、「物理学」とのドメイン名を類似文章検索ウィンドウ９１０のドメイン選択ウィンドウ９１１において入力し、「物理学の法則」との文章をファイル入力ウィンドウ９１２において入力し、確定ボタン９１３を押すことで、本開示の実施形態に係るグラフ比較処理６００を実施し、「物理学の法則」との類似度が高い文章をドメイングラフＤＢ２３２の中から検索することができる。 Similar text search window 910 allows the user to search for other texts that are similar to a specific text. For example, the user enters the domain name "physics" in the domain selection window 911 of the similar text search window 910, enters the text "laws of physics" in the file input window 912, and presses the confirm button 913. By pressing the button, the graph comparison process 600 according to the embodiment of the present disclosure is executed, and a sentence having a high degree of similarity to "laws of physics" can be searched from the domain graph DB 232.

文章比較ウィンドウ９２０では、ユーザは、２つの特定の文章を比較することができる。例えば、ユーザは、「建築」とのドメイン名を文章比較ウィンドウ９２０のドメイン選択ウィンドウ９２１において入力し、「古代ローマの建築」との文章を第１のファイル入力ウィンドウ９２２において入力し、「古代ギリシアの建築」との文章を第２のファイル入力ウィンドウ９２３において入力し、比較ボタン９２４を押すことで、本開示の実施形態に係るグラフ比較処理６００を実施し、「古代ローマの建築」と「古代ギリシアの建築」との類似度を判定することができる。 The text comparison window 920 allows the user to compare two particular texts. For example, the user enters the domain name "architecture" in the domain selection window 921 of the text comparison window 920, enters the text "ancient Roman architecture" in the first file input window 922, enters the text "ancient Greek architecture" in the second file input window 923, and presses the comparison button 924. This executes the graph comparison process 600 according to the embodiment of the present disclosure and determines the degree of similarity between "ancient Roman architecture" and "ancient Greek architecture".

以上説明した文章管理画面９００では、ユーザは、所定の文章間の類似度を容易に判定することができる。 On the text management screen 900 described above, the user can easily determine the degree of similarity between predetermined texts.
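The similar text search offered by the window 910 amounts to ranking stored domain graphs by their similarity to the graph of the query text. The sketch below is illustrative only: `search_similar`, the `similarity` callback, and the `top_k` parameter are hypothetical names, and the similarity function itself (the graph comparison process 600) is passed in rather than implemented here.

```python
def search_similar(query_graph, stored_graphs, similarity, top_k=3):
    """Rank stored graphs by similarity to the query graph.

    stored_graphs: dict mapping a text identifier to its domain graph.
    similarity:    callable returning a score for two graphs (assumed to
                   behave like the graph comparison process 600).
    """
    scored = [
        (name, similarity(query_graph, graph))
        for name, graph in stored_graphs.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Comparing two specific texts, as in the window 920, is the special case of calling the same similarity function once on the two corresponding graphs.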

次に、図１０を参照して、本開示の実施形態に係る多ドメイン管理処理について説明する。 Next, with reference to FIG. 10, multi-domain management processing according to an embodiment of the present disclosure will be described.

一般に、複数の異なるドメインに関連する文章が存在する。従って、本開示の一態様は、複数の異なるドメインに関連する文章についても高精度の類似度判定結果を提供するためには、所定の文章との対応性が高いドメインを判定し、判定したドメインのドメインキーワードを用いてドメイングラフを生成するための多ドメイン管理処理に関する。図１０は、本開示の実施形態に係る多ドメイン管理処理１０００の流れの一例を示す図である。この多ドメイン管理処理１０００は、主に類似度判定装置２１０のグラフ生成部２２１によって実施される。 In general, some sentences relate to a plurality of different domains. Accordingly, in order to provide highly accurate similarity determination results even for such sentences, one aspect of the present disclosure relates to multi-domain management processing that determines a domain having high correspondence with a given sentence and generates a domain graph using the domain keywords of the determined domain. FIG. 10 is a diagram illustrating an example of the flow of multi-domain management processing 1000 according to the embodiment of the present disclosure. This multi-domain management processing 1000 is mainly performed by the graph generation unit 221 of the similarity determination device 210.

まず、ステップＳ１０１０では、グラフ生成部２２１は、類似度判定が希望される第２の文章の入力を受け付けた後、この第２の文章との対応性が高いドメインを判定する。
より具体的には、グラフ生成部２２１は、第２の文章の、キーワード管理ＤＢ２３１に格納されている各ドメインキーワードのセットに対する関連度を示すドメインスコアを計算する。ある実施形態では、グラフ生成部２２１は、各ドメインキーワードのセットの内、第２の文章に含まれるドメインキーワードの割合に基づいてドメインスコアを計算してもよい。 First, in step S1010, the graph generation unit 221 receives an input of a second sentence for which similarity determination is desired, and then determines a domain that has high correspondence with the second sentence.
More specifically, the graph generation unit 221 calculates a domain score indicating the degree of association of the second sentence with each set of domain keywords stored in the keyword management DB 231. In an embodiment, the graph generation unit 221 may calculate the domain score based on the proportion of domain keywords included in the second sentence among each set of domain keywords.

一例として、キーワード管理ＤＢ２３１は、「物理学者」との第１のドメインに対応する第１のドメインキーワードのセットと、「化学者」との第２のドメインに対応する第２のドメインキーワードのセットとを格納しているとする。また、グラフ生成部２２１は、物理学者、化学者、生物学者等、様々な科学分野の化学者に関する「有名な科学者」との第２の文章の入力を受け付けるとする。
この場合、グラフ生成部２２１は、「有名な科学者」との第２の文章の、「物理学者」との第１のドメインに対する関連度を示す第１のドメインスコアと、「化学者」との第２のドメインに対する関連度を示す第２のドメインスコアとを計算する。 As an example, suppose that the keyword management DB 231 stores a first set of domain keywords corresponding to a first domain, "physicist", and a second set of domain keywords corresponding to a second domain, "chemist". Suppose also that the graph generation unit 221 receives, as the second sentence, a text "famous scientist" concerning scientists in various scientific fields, such as physicists, chemists, and biologists.
In this case, the graph generation unit 221 calculates a first domain score indicating the degree of association of the second sentence "famous scientist" with the first domain "physicist", and a second domain score indicating its degree of association with the second domain "chemist".

例えば、「物理学者」との第１のドメインに対応する第１のドメインキーワードのセットの内の７０％のドメインキーワードが「有名な科学者」との第２の文章に含まれている場合、グラフ生成部２２１は、「有名な科学者」との第２の文章の、「物理学者」との第１のドメインに対する第１のドメインスコアを「７０％」としてもよい。
また、「化学者」との第２のドメインに対応する第２のドメインキーワードのセットの内の３０％のドメインキーワードが「有名な科学者」との第２の文章に含まれている場合、グラフ生成部２２１は、「有名な科学者」との第２の文章の、「化学者」との第２のドメインに対する第２のドメインスコアを「３０％」としてもよい。
このように、グラフ生成部２２１は、第２の文章に含まれている各ドメインキーワードのセットの割合に基づいて、第２の文章の、キーワード管理ＤＢ２３１に格納されている各ドメインキーワードのセットに対する関連度を示すドメインスコアを計算することができる。 For example, if 70% of the domain keywords in the set of first domain keywords corresponding to the first domain with "physicist" are included in the second sentence with "famous scientist", The graph generation unit 221 may set the first domain score of the second sentence "famous scientist" to "70%" for the first domain "physicist".
Furthermore, if 30% of the domain keywords of the set of second domain keywords corresponding to the second domain with "chemist" are included in the second sentence with "famous scientist", The graph generation unit 221 may set the second domain score of the second sentence "famous scientist" to "30%" for the second domain "chemist".
In this way, based on the proportion of each set of domain keywords that is included in the second sentence, the graph generation unit 221 can calculate a domain score indicating the degree of association of the second sentence with each set of domain keywords stored in the keyword management DB 231.
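The domain score calculation of step S1010 can be illustrated with a short sketch. It assumes single-word keywords and simple case-insensitive word matching; the function name `domain_score` and the tokenization are assumptions made here, not details given in the disclosure.

```python
import re


def domain_score(sentence, domain_keywords):
    """Fraction of the domain's keywords that occur in the sentence.

    Assumption: keywords are single words and matching is
    case-insensitive on whole words.
    """
    if not domain_keywords:
        return 0.0
    words = set(re.findall(r"\w+", sentence.lower()))
    hits = sum(1 for keyword in domain_keywords if keyword.lower() in words)
    return hits / len(domain_keywords)
```

For instance, if seven of ten keywords registered for a domain appear in the second sentence, the function returns 0.7, which corresponds to the "70%" domain score in the example above.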

次に、ステップＳ１０２０では、グラフ生成部２２１は、ステップＳ１０１０で計算したドメインスコアに基づいて、適切なドメインキーワードのセットをキーワード管理ＤＢ２３１から取得する。ある実施形態では、グラフ生成部２２１は、各ドメインについて計算したドメインスコアの関係と、予め定まったドメインスコア閾値とに基づいて適切なドメインキーワードのセットを取得してもよい。
より具体的には、グラフ生成部２２１は、第２の文章の第１のドメインに対する関連度を示す第１のドメインスコアが、第２の文章の第２のドメインに対する関連度を示す第２のドメインスコアを超え、且つ、所定のドメインスコア閾値（例えば、５０％）を満たす場合、第１のドメインに対応する第１のドメインキーワードのセットをキーワード管理ＤＢ２３１から取得してもよい。
一方、グラフ生成部２２１は、第２の文章の第２のドメインに対する関連度を示す第２のドメインスコアが、第２の文章の第１のドメインに対する関連度を示す第１のドメインスコアを超え、且つ、所定のドメインスコア閾値（例えば、５０％）を満たす場合、第２のドメインに対応する第２のドメインキーワードのセットをキーワード管理ＤＢ２３１から取得してもよい。 Next, in step S1020, the graph generation unit 221 obtains an appropriate set of domain keywords from the keyword management DB 231 based on the domain score calculated in step S1010. In one embodiment, the graph generation unit 221 may obtain an appropriate set of domain keywords based on the domain score relationship calculated for each domain and a predetermined domain score threshold.
More specifically, if the first domain score, which indicates the degree of association of the second sentence with the first domain, exceeds the second domain score, which indicates the degree of association of the second sentence with the second domain, and also satisfies a predetermined domain score threshold (for example, 50%), the graph generation unit 221 may acquire the first set of domain keywords corresponding to the first domain from the keyword management DB 231.
On the other hand, if the second domain score exceeds the first domain score and also satisfies the predetermined domain score threshold (for example, 50%), the graph generation unit 221 may acquire the second set of domain keywords corresponding to the second domain from the keyword management DB 231.
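The selection rule of step S1020 can be sketched as below. Two points are assumptions made for this illustration because the disclosure does not specify them: "satisfies the threshold" is read as greater than or equal to the threshold, and when no domain strictly exceeds all others, or the best score is below the threshold, the hypothetical function `select_domain` returns `None`.

```python
def select_domain(scores, threshold=0.5):
    """Pick the domain whose score exceeds all others and meets the threshold.

    scores: dict mapping a domain name to its domain score (0.0 to 1.0).
    Returns the selected domain name, or None if no domain qualifies
    (the fallback behavior is an assumption of this sketch).
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    best_score = scores[best]
    # The best score must strictly exceed every other domain's score.
    if any(score >= best_score for name, score in scores.items() if name != best):
        return None
    return best if best_score >= threshold else None
```

With the scores from the example above (70% for "physicist", 30% for "chemist") and a 50% threshold, the first domain is selected and its keyword set would be used to generate the second domain graph.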

次に、ステップＳ１０３０では、グラフ生成部２２１は、ステップＳ１０２０で取得したドメインキーワードを用いて、第２の文章に対応する正規化済みの第２のドメイングラフを生成する。
なお、正規化済みの第２のドメイングラフを生成する処理の詳細については、図６を参照して説明したため、ここではその説明を省略する。 Next, in step S1030, the graph generation unit 221 uses the domain keyword acquired in step S1020 to generate a normalized second domain graph corresponding to the second sentence.
Note that the details of the process of generating the normalized second domain graph have been described with reference to FIG. 6, so the description thereof will be omitted here.

以上説明した多ドメイン管理処理１０００によれば、所定の文章が複数のドメインに関連する場合であっても、当該文章に対応するドメイングラフを生成するための適切なドメインキーワードを判定することができる。また、これにより、類似度判定の精度を向上させることができる。 According to the multi-domain management process 1000 described above, even when a given text is related to multiple domains, it is possible to determine an appropriate domain keyword for generating a domain graph corresponding to the text. . Moreover, this makes it possible to improve the accuracy of similarity determination.

次に、図１１を参照して、本開示の実施形態に係る類似度判定の具体例について説明する。 Next, a specific example of similarity determination according to the embodiment of the present disclosure will be described with reference to FIG. 11.

図１１は、本開示の実施形態に係る２つのドメイングラフに対する類似度判定の具体例を示す図である。図１１には、ドメイングラフＡと、ドメイングラフＢとの２つのドメイングラフについて、ノードスコア、正規化済みのノードスコア、エッジ重み及び正規化済みのエッジ重み等のパラメータが示されている。 FIG. 11 is a diagram illustrating a specific example of similarity determination for two domain graphs according to an embodiment of the present disclosure. FIG. 11 shows parameters such as node scores, normalized node scores, edge weights, and normalized edge weights for two domain graphs, domain graph A and domain graph B.

より具体的には、ドメイングラフＡにおけるノードは、f_A(Ｎ)＝｛３、２、１、４｝とのノードスコアに対応付けられている。各ノードスコアを、当該ドメイングラフのノードスコアの和（３＋２＋１＋４＝１０）で割り算することで、正規化したノードスコアf_A(ＮＳ)＝｛０．３、０．２、０．１、０．４｝を得ることができる。
同様に、ドメイングラフＢにおけるノードは、f_Ｂ(Ｎ)＝｛３、２、１、４｝とのノードスコアに対応付けられている。各ノードスコアを、当該ドメイングラフのノードスコアの和（３＋２＋１＋４＝１０）で割り算することで、正規化したノードスコアf_Ｂ(ＮＳ)＝｛０．３、０．２、０．１、０．４｝を得ることができる。 More specifically, the nodes in domain graph A are associated with node scores f_A(N) = {3, 2, 1, 4}. Dividing each node score by the sum of the node scores of the domain graph (3 + 2 + 1 + 4 = 10) gives the normalized node scores f_A(NS) = {0.3, 0.2, 0.1, 0.4}.
Similarly, the nodes in domain graph B are associated with node scores f_B(N) = {3, 2, 1, 4}. Dividing each node score by the sum of the node scores of the domain graph (3 + 2 + 1 + 4 = 10) gives the normalized node scores f_B(NS) = {0.3, 0.2, 0.1, 0.4}.
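The node score normalization described here is a division of each score by the sum of all scores in the same graph. A minimal sketch follows; the function name `normalize_scores` is introduced only for illustration.

```python
def normalize_scores(scores):
    """Divide each node score by the sum of all node scores in the graph."""
    total = sum(scores)
    if total == 0:
        raise ValueError("cannot normalize: node scores sum to zero")
    return [score / total for score in scores]


# Node scores of domain graph A (and B) from FIG. 11.
normalized = normalize_scores([3, 2, 1, 4])
```

Because the normalized scores always sum to 1, graphs built from texts of very different lengths can be compared on the same scale.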

また、図１１は、ドメイングラフＡにおける各ノード間の接続数f_A(Ｌ)を表１１０５に示し、ドメイングラフＢにおける各ノード間の接続数f_Ｂ(Ｌ)を表１１１０に示す。これらの接続数の逆を取ることで、ドメイングラフＡ及びドメイングラフＢのエッジ重みＥ_ijを計算することができる。
また、上述した数式３を用いることで、ドメイングラフＡの正規化したエッジ重みＮＥ_ijを計算することができる。ドメイングラフＡの正規化したエッジ重みは、表１１１５に示され、ドメイングラフＢの正規化したエッジ重みは、表１１２０に示される。 Further, FIG. 11 shows the number of connections f_A(L) between the nodes in domain graph A in table 1105, and the number of connections f_B(L) between the nodes in domain graph B in table 1110. Taking the reciprocal of these connection counts gives the edge weights E_ij of domain graph A and domain graph B.
Further, by using Equation 3 described above, the normalized edge weights NE_ij of domain graph A can be calculated. The normalized edge weights of domain graph A are shown in table 1115, and the normalized edge weights of domain graph B are shown in table 1120.

以上説明したドメイングラフＡの正規化したノードスコアf_A(ＮＳ)と、ドメイングラフＢの正規化したノードスコアf_Ｂ(ＮＳ)を数式５に代入することで、ドメイングラフＡとドメイングラフＢとのＣＮＳを計算することができる。今回の場合、ドメイングラフＡの正規化したノードスコアf_A(ＮＳ)と、ドメイングラフＢの正規化したノードスコアf_Ｂ(ＮＳ)とが同一であるため、数式５は、（１－０＝１）となる。 By substituting the normalized node scores f_A(NS) of domain graph A and the normalized node scores f_B(NS) of domain graph B described above into Equation 5, the CNS of domain graph A and domain graph B can be calculated. In this case, since f_A(NS) and f_B(NS) are identical, Equation 5 evaluates to 1 - 0 = 1.

また、以上説明したドメイングラフＡ及びドメイングラフＢの正規化エッジ重みを数式６に代入することで、ドメイングラフＡとドメイングラフＢとのＲＳを計算することができる。今回の場合、数式６は、（１－０．２１６３＝０．７８３７）となる。 Further, by substituting the normalized edge weights of domain graph A and domain graph B described above into Equation 6, the RS of domain graph A and domain graph B can be calculated. In this case, Equation 6 becomes (1-0.2163=0.7837).

そして、計算したＣＮＳ及びＲＳを数式７に代入することで、ドメイングラフＡとドメイングラフＢとの類似度を計算することができる。今回の場合、数式７による計算の結果、ドメイングラフＡとドメイングラフＢとの類似度が「８９．１８５％」となる。 Then, by substituting the calculated CNS and RS into Equation 7, the degree of similarity between domain graph A and domain graph B can be calculated. In this case, as a result of calculation using Equation 7, the degree of similarity between domain graph A and domain graph B is "89.185%."
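The final figure can be checked arithmetically. Equations 5 to 7 are not reproduced in this passage, so the sketch below assumes that Equation 7 combines CNS and RS as an equally weighted average; under that assumption the values stated above (CNS = 1, RS = 0.7837) reproduce the reported similarity of 89.185%. The function name `combine_similarity` and the `weight` parameter are hypothetical.

```python
def combine_similarity(cns, rs, weight=0.5):
    """Weighted mean of the node-score similarity (CNS) and the
    edge-weight similarity (RS). Equal weighting is an assumption."""
    return weight * cns + (1.0 - weight) * rs


cns = 1 - 0          # Equation 5 result stated in the description
rs = 1 - 0.2163      # Equation 6 result stated in the description
similarity = combine_similarity(cns, rs)  # approximately 0.89185
```

If Equation 7 uses a different weighting, only the `weight` argument of this sketch would change.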

このように、文章をドメイングラフとして表現し、これらのドメイングラフを比較することで、２つの文章間の類似度を高精度且つ高速に判定することが可能となる。 In this way, by representing sentences as domain graphs and comparing these domain graphs, it becomes possible to determine the degree of similarity between two sentences with high precision and at high speed.

以上、本開示の実施形態に係る類似度判定手段について説明した。
上述したように、本開示の一態様は、文章をドメイングラフとして表現することに関する。これにより、文章のテキストをそのまま比較した場合に比べて、文章の単語間の意味的関係を表現することができるため、類似度判定の精度を向上させることができる。 The similarity determination means according to the embodiment of the present disclosure has been described above.
As mentioned above, one aspect of the present disclosure relates to representing sentences as domain graphs. This makes it possible to express the semantic relationships between the words of the sentences compared to the case where the texts of the sentences are compared as they are, thereby improving the accuracy of similarity determination.

しかし、上述したように、文章全体をドメイングラフとして表現する従来の手段では、文章に含まれる全ての単語がノードとして表現されるため、文章が長い場合、ノードやエッジの数が膨大となり、グラフに対する比較などの処理が遅なるという課題がある。
そこで、本開示の一態様は、文章全体ではなく、文章における特定のキーワードのみをノードとして表現するドメイングラフを生成することに関する。これにより、章全体をドメイングラフとして表現する従来の手段などに比べて、グラフの規模を抑えることができる。また、グラフの規模を抑えることで、グラフに対する比較などの処理の所要時間を短縮させ、高速な類似度判定が可能となる。 However, as mentioned above, in conventional means that represent an entire sentence as a graph, every word in the sentence is represented as a node. When the sentence is long, the number of nodes and edges therefore becomes enormous, and processing on the graph, such as comparison, becomes slow.
Therefore, one aspect of the present disclosure relates to generating a domain graph that represents only specific keywords in a sentence as nodes, rather than the entire sentence. This keeps the graph smaller than with conventional means that represent the entire sentence as a graph. A smaller graph in turn shortens the time required for processing such as graph comparison, enabling high-speed similarity determination.

ただし、文章全体ではなく、文章における特定のキーワードのみに基づいてドメイングラフを生成した場合、文章全体をドメイングラフとして表現した場合に比べて、情報のロスが発生する可能性がある。
そこで、本開示の一態様では、ドメイングラフにおける各ノードは、当該ノードに対応するドメインキーワードの重要性を示すノードスコアに対応付けられ、各エッジは、当該エッジが接続するノード間の関連度を示すエッジ重みに対応付けられる。
これにより、文章全体をドメイングラフとして表現しなくても、文章において特に重要なキーワードに関する意味的情報を維持し、情報ロスを抑えることができる。 However, if a domain graph is generated based only on specific keywords in the text rather than the entire text, information loss may occur compared to when the entire text is represented as a domain graph.
Therefore, in one aspect of the present disclosure, each node in the domain graph is associated with a node score indicating the importance of the domain keyword corresponding to that node, and each edge is associated with an edge weight indicating the degree of association between the nodes that the edge connects.
As a result, even if the entire text is not expressed as a domain graph, it is possible to maintain semantic information regarding particularly important keywords in the text and suppress information loss.
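A domain graph of this kind can be held in a small data structure in which nodes carry scores and edges carry weights. The sketch below is one possible representation; the class name `DomainGraph`, its fields, and the use of undirected edges keyed by unordered keyword pairs are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DomainGraph:
    """Keyword-only graph: nodes carry scores, edges carry weights."""
    node_scores: dict = field(default_factory=dict)   # keyword -> node score
    edge_weights: dict = field(default_factory=dict)  # frozenset pair -> weight

    def add_node(self, keyword, score):
        self.node_scores[keyword] = score

    def add_edge(self, a, b, weight):
        # Edges may only connect keywords that are already nodes.
        if a not in self.node_scores or b not in self.node_scores:
            raise KeyError("both endpoints must be existing nodes")
        self.edge_weights[frozenset((a, b))] = weight

    def weight(self, a, b):
        # Undirected lookup; unconnected pairs have weight 0.0.
        return self.edge_weights.get(frozenset((a, b)), 0.0)
```

Only the domain keywords found in the text become nodes, so the size of the graph is bounded by the size of the keyword set rather than by the length of the text.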

更に、本開示の一態様では、ドメイングラフにおける各ノードスコア及びエッジ重みを正規化した正規化済みのドメイングラフを生成することに関する。このように、ドメイングラフを正規化することで、ドメイングラフを比較する際、ドメイングラフの大きさの相違に起因する類似度判定の低下を防ぐことができる。 Further, one aspect of the present disclosure relates to generating a normalized domain graph in which each node score and edge weight in the domain graph are normalized. By normalizing domain graphs in this way, when comparing domain graphs, it is possible to prevent a decrease in similarity determination due to a difference in the size of domain graphs.

以上説明した本開示の実施形態に係る類似度判定手段を、例えばサイバーイベントに関するレポートに適用した場合、特定のレポートとの類似度が高い既存のレポートを特定することができる。新たなレポートとの類似度が所定の類似度基準を満たす既存のレポートが存在する場合、当該レポートで報告されるサイバーイベントが既に対応済み又は対応中であると見なし、対策を再度策定するために必要な時間やリソースを節約することができる。一方、新たなレポートとの類似度が所定の類似度基準を満たす既存のレポートが存在しない場合、当該レポートで報告されるサイバーイベントが未対応であり、新たな対策を策定する必要がると判定することができる。
このように、レポートで報告されるサイバーイベントが既に対応済み又は対応中か、未対応かを高精度で判定することができる。また、本開示の実施形態に係る類似度判定の際に用いられるドメイングラフは、予め定めた特定のキーワードのみに基づいたドメイングラフであるため、レポートの類似度を示す比較結果を高速に生成することができる。このため、サイバーイベントに対する迅速な対策策定が可能となる。 When the similarity determination means according to the embodiment of the present disclosure described above is applied to, for example, reports on cyber events, an existing report that has a high degree of similarity to a specific report can be identified. If there is an existing report whose similarity to a new report satisfies a predetermined similarity criterion, the cyber event reported in the new report can be regarded as already handled or being handled, which saves the time and resources that would otherwise be needed to formulate countermeasures again. On the other hand, if there is no existing report whose similarity to the new report satisfies the predetermined similarity criterion, it can be determined that the cyber event reported in the new report has not been handled and that new countermeasures need to be formulated.
In this way, it is possible to determine with high accuracy whether the cyber event reported in a report has already been handled, is being handled, or has not been handled. Furthermore, since the domain graph used for similarity determination according to the embodiment of the present disclosure is based only on specific predetermined keywords, a comparison result indicating the similarity of reports can be generated at high speed. This makes it possible to formulate countermeasures against cyber events quickly.
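The report triage described above can be sketched as a threshold test over the similarities between a new report and the existing reports. The function name `triage_report`, the status labels, and the criterion value of 0.8 are hypothetical; the disclosure only refers to a "predetermined similarity criterion".

```python
def triage_report(similarities, criterion=0.8):
    """Classify a new report from its similarity to existing reports.

    similarities: dict mapping an existing report ID to its similarity
                  (0.0 to 1.0) with the new report.
    Returns (status, matching_ids), most similar match first.
    """
    matches = sorted(
        ((report_id, score) for report_id, score in similarities.items()
         if score >= criterion),
        key=lambda item: item[1],
        reverse=True,
    )
    status = "handled_or_in_progress" if matches else "unhandled"
    return status, [report_id for report_id, _ in matches]
```

A result of "unhandled" would indicate that new countermeasures need to be formulated, while any match points to existing countermeasures that can be reused.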

このように、本開示によれば、文章における要素間の関係情報を維持しつつ、文章に対応するグラフ表現の規模を抑えることで、高速且つ高精度な文章類似度比較が可能な類似度判定手段を提供することができる。 As described above, according to the present disclosure, similarity determination enables high-speed and highly accurate text similarity comparison by suppressing the scale of the graph representation corresponding to the text while maintaining relationship information between elements in the text. means can be provided.

以上では、本開示の実施形態に係る類似度判定手段を装置、システム及び方法で実装する場合を一例として説明したが、本開示はこれに限定されず、例えば、コンピュータプログラムとして実装されてもよい。このコンピュータプログラムは、外部装置の記憶媒体からネットワーク経由、及び/又は、可搬型記憶媒体経由で、本開示の実施形態に係る類似度判定手段を実装するコンピュータシステムに導入されてもよい。 Above, the case where the similarity determination means according to the embodiment of the present disclosure is implemented as an apparatus, system, and method has been described as an example, but the present disclosure is not limited to this, and may be implemented as a computer program, for example. . This computer program may be introduced into a computer system implementing the similarity determination means according to the embodiment of the present disclosure from a storage medium of an external device via a network and/or a portable storage medium.

例えば、本開示の実施形態に係る一態様は、類似度判定コンピュータプログラムであって、処理命令を格納するメモリと、プロセッサとを含むコンピュータシステムにおいて、前記メモリに格納されている前記処理命令は、第１のドメインに対応する第１のドメインキーワードのセットを取得する工程と、前記第１のドメインに対応する第１の文章について、前記第１のドメインキーワードのセットに基づいて生成される正規化済みの第１のドメイングラフを生成するする工程と、前記第１のドメインキーワードのセットに基づいて、前記第１のドメインに対応する第２の文章に含まれる第１のドメインキーワードのサブセットを特定する工程と、特定した前記第１のドメインキーワードのサブセットの前記第２の文章における出現回数を判定する工程と、前記第１のドメインキーワードのサブセットに基づいて、少なくとも第１のノードと第２のノードとを含む第１のノードセットを生成する工程と、判定した前記出現回数に基づいて、前記第１のノードセットに含まれる各ノードの前記第２の文章における重要性を示すノードスコアを計算する工程と、前記第２の文章に基づいて、前記第１のノードセットに含まれる各ノードの意味的関係を示す知識グラフを生成する工程と、前記第１のノードと前記第２のノードの、前記知識グラフにおける最短距離を判定する工程と、判定した前記最短距離に基づいて、前記第１のノードと前記第２のノードとの関連度を示すエッジ重みを計算する工程と、前記第１のノードと前記第２のノードとの関係を示し、前記エッジ重みに対応付けられたエッジを生成する工程と、前記第１のノードセットと、前記ノードスコアと、前記エッジ重みと、前記エッジとに基づいて、前記第２の文章に対応する第２のドメイングラフを生成する工程と、前記ノードスコアと前記エッジ重みを正規化した正規化済みの第２のドメイングラフを生成する工程と、前記正規化済みの第１のドメイングラフと前記正規化済みの第２のドメイングラフとを比較することで、前記正規化済みの第１のドメイングラフと前記正規化済みの第２のドメイングラフとの類似度を示す比較結果を生成し、出力する工程とを前記プロセッサに実行させることを特徴とする類似度判定コンピュータプログラムである。 For example, one aspect according to an embodiment of the present disclosure is a similarity determination computer program for a computer system including a processor and a memory that stores processing instructions, wherein the processing instructions stored in the memory cause the processor to execute: a step of obtaining a first set of domain keywords corresponding to a first domain; a step of generating, for a first sentence corresponding to the first domain, a normalized first domain graph based on the first set of domain keywords; a step of identifying, based on the first set of domain keywords, a subset of the first domain keywords included in a second sentence corresponding to the first domain; a step of determining the number of occurrences of the identified subset of the first domain keywords in the second sentence; a step of generating, based on the subset of the first domain keywords, a first node set including at least a first node and a second node; a step of calculating, based on the determined number of occurrences, a node score indicating the importance in the second sentence of each node included in the first node set; a step of generating, based on the second sentence, a knowledge graph indicating the semantic relationships among the nodes included in the first node set; a step of determining the shortest distance between the first node and the second node in the knowledge graph; a step of calculating, based on the determined shortest distance, an edge weight indicating the degree of association between the first node and the second node; a step of generating an edge that indicates the relationship between the first node and the second node and is associated with the edge weight; a step of generating a second domain graph corresponding to the second sentence based on the first node set, the node score, the edge weight, and the edge; a step of generating a normalized second domain graph in which the node score and the edge weight are normalized; and a step of generating and outputting a comparison result indicating the degree of similarity between the normalized first domain graph and the normalized second domain graph by comparing the two graphs.
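Several of the steps listed above (identifying the keyword subset, counting occurrences, finding the shortest distance in the knowledge graph, and deriving an edge weight) can be sketched as follows. This is an illustration under stated assumptions: the knowledge graph is assumed to be given as an adjacency dictionary, the distance is measured in hops with breadth-first search, and the edge weight is taken as the reciprocal of that distance, in line with the description of FIG. 11. The function names are hypothetical.

```python
import re
from collections import Counter, deque


def keyword_counts(sentence, domain_keywords):
    """Occurrence count of each domain keyword that appears in the sentence."""
    counts = Counter(re.findall(r"\w+", sentence.lower()))
    return {kw: counts[kw.lower()] for kw in domain_keywords
            if counts[kw.lower()] > 0}


def shortest_distance(adjacency, start, goal):
    """Hop count between two nodes via breadth-first search.

    adjacency: dict mapping a node to an iterable of neighbor nodes.
    Returns None if the goal cannot be reached.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def edge_weight(distance):
    """Reciprocal of the shortest distance; 0.0 if unreachable or zero."""
    return 1.0 / distance if distance else 0.0
```

Keywords that are close together in the knowledge graph thus receive a larger edge weight than keywords that are only distantly related.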

以上、本発明の実施の形態について説明したが、本発明は、上述した実施の形態に限定されるものではなく、本発明の要旨を逸脱しない範囲において種々の変更が可能である。 Although the embodiments of the present invention have been described above, the present invention is not limited to the embodiments described above, and various changes can be made without departing from the gist of the present invention.

１５０類似度判定アプリケーション
２００類似度判定システム
２１０類似度判定装置
２２０メモリ
２２１グラフ生成部
２２２類似度判定部
２２３更新部
２３０記憶部
２３１キーワード管理ＤＢ
２３２ドメイングラフＤＢ
２４４プロセッサ
２４６入出力部
２５０通信ネットワーク
２６０ユーザ端末
150 Similarity determination application
200 Similarity determination system
210 Similarity determination device
220 Memory
221 Graph generation unit
222 Similarity determination unit
223 Update unit
230 Storage unit
231 Keyword management DB
232 Domain graph DB
244 Processor
246 Input/output unit
250 Communication network
260 User terminal

Claims

A similarity determination device comprising a processor and a memory,
wherein the memory includes processing instructions for causing the processor to function as:
a keyword management database that stores a first set of domain keywords corresponding to a first domain;
a domain graph database that stores a first domain graph generated, for a first sentence corresponding to the first domain, based on the first set of domain keywords;
a graph generation unit that generates a second domain graph for a second sentence corresponding to the first domain based on the first set of domain keywords; and
a similarity determination unit that generates a comparison result indicating the degree of similarity between the first domain graph and the second domain graph by comparing the first domain graph with the second domain graph.

The similarity determination device according to claim 1, wherein the graph generation unit:
identifies a subset of the first domain keywords included in the second sentence based on the first set of domain keywords;
determines the number of occurrences of the identified subset of the first domain keywords in the second sentence;
generates a first node set including at least a first node and a second node based on the subset of the first domain keywords; and
calculates, based on the determined number of occurrences, a node score indicating the importance in the second sentence of each node included in the first node set.

The similarity determination device according to claim 2, wherein the graph generation unit:
generates, based on the second sentence, a knowledge graph indicating the semantic relationships among the nodes included in the first node set;
determines the shortest distance between the first node and the second node in the knowledge graph;
calculates, based on the determined shortest distance, an edge weight indicating the degree of association between the first node and the second node; and
generates an edge that indicates the relationship between the first node and the second node and is associated with the edge weight.

The similarity determination device according to claim 3, wherein
the second domain graph corresponding to the second sentence is generated based on the first node set, the node score, the edge weight, and the edge, and
a normalized second domain graph in which the node score and the edge weight are normalized is generated.

The similarity determination device according to claim 1, wherein
the keyword management database stores, in addition to the first set of domain keywords corresponding to the first domain, a second set of domain keywords corresponding to a second domain, and
the graph generation unit:
calculates a first domain score indicating the degree of relevance of the second sentence to the first domain;
calculates a second domain score indicating the degree of relevance of the second sentence to the second domain;
generates the second domain graph based on the first set of domain keywords if the first domain score exceeds the second domain score and satisfies a predetermined domain score threshold; and
generates the second domain graph based on the second set of domain keywords if the second domain score exceeds the first domain score and satisfies the predetermined domain score threshold.

The similarity determination device according to claim 1, further comprising an update unit that, when an update request requesting addition of a domain keyword to, or deletion of a domain keyword from, the first set of domain keywords is received from a user, updates the second domain graph based on the addition or deletion of the domain keyword indicated in the update request.

A similarity determination system in which a user terminal and a similarity determination device are connected via a communication network, wherein
the similarity determination device comprises a processor and a memory, and
the memory includes processing instructions for causing the processor to function as:
a keyword management database that stores a first set of domain keywords corresponding to a first domain;
a domain graph database that stores a first domain graph generated, for a first sentence corresponding to the first domain, based on the first set of domain keywords;
an input/output unit that acquires a second sentence corresponding to the first domain from the user terminal;
a graph generation unit that generates a second domain graph for the second sentence based on the first set of domain keywords; and
a similarity determination unit that generates a comparison result indicating the degree of similarity between the first domain graph and the second domain graph by comparing the first domain graph with the second domain graph, and transmits the comparison result to the user terminal.

A similarity determination method comprising:
a step of obtaining a first set of domain keywords corresponding to a first domain;
a step of generating, for a first sentence corresponding to the first domain, a normalized first domain graph based on the first set of domain keywords;
a step of identifying, based on the first set of domain keywords, a subset of the first domain keywords included in a second sentence corresponding to the first domain;
a step of determining the number of occurrences of the identified subset of the first domain keywords in the second sentence;
a step of generating a first node set including at least a first node and a second node based on the subset of the first domain keywords;
a step of calculating, based on the determined number of occurrences, a node score indicating the importance in the second sentence of each node included in the first node set;
a step of generating, based on the second sentence, a knowledge graph indicating the semantic relationships among the nodes included in the first node set;
a step of determining the shortest distance between the first node and the second node in the knowledge graph;
a step of calculating, based on the determined shortest distance, an edge weight indicating the degree of association between the first node and the second node;
a step of generating an edge that indicates the relationship between the first node and the second node and is associated with the edge weight;
a step of generating a second domain graph corresponding to the second sentence based on the first node set, the node score, the edge weight, and the edge;
a step of generating a normalized second domain graph in which the node score and the edge weight are normalized; and
a step of generating and outputting a comparison result indicating the degree of similarity between the normalized first domain graph and the normalized second domain graph by comparing the normalized first domain graph with the normalized second domain graph.