JP7301938B2

JP7301938B2 - Document creation system, document creation method and document creation program

Info

Publication number: JP7301938B2
Application number: JP2021197574A
Authority: JP
Inventors: 正樹高橋; 裕也根本
Original assignee: Mizuho Research and Technologies Ltd
Current assignee: Mizuho Research and Technologies Ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2023-07-03
Anticipated expiration: 2041-12-06
Also published as: JP2023083722A

Description

The present disclosure relates to a document creation system, a document creation method, and a document creation program for creating a document in which predetermined words are masked.

Words contained in a document are sometimes masked, for example to protect personal information. Techniques for checking and correcting proper-noun portions detected in a document have therefore been studied (see, for example, Patent Document 1). In the document processing method described in that patent document, masking target portions are detected in an input document based on a word dictionary that stores character strings to be masked, or parts of them, and the detected masking target portions are stored in a masking result list. When the user corrects any of the displayed masking target portions, the masking target portion stored in the masking result list is rewritten with the portion as corrected by the user.

Techniques for protecting privacy information have also been studied (see, for example, Patent Document 2). In the technique described in that patent document, the presentation information, that is, the candidate words or the like in content acquired from a client that may correspond to privacy information, is determined from words or word combinations to be kept confidential and words or word combinations to be disclosed, both of which are set in advance by the user.

Patent Document 1: Japanese Patent Application Laid-Open No. 2004-227141
Patent Document 2: Japanese Patent Application Laid-Open No. 2012-159939

However, simply masking with a list of mask-target words compiled in advance cannot reliably prevent leakage of personal information or specific information. For example, the same word may be used as a common noun in one place and as a proper noun in another. In that case, the word alone does not reveal whether it is personal information. Conversely, when masking is performed using a list of words compiled in advance as unproblematic, few words may remain displayable, depending on how many words are listed.

A document creation system that solves the above problem includes a control unit connected to a user terminal. The control unit performs morphological analysis of a sentence acquired from the user terminal to identify its constituent words; when a constituent word is of a first type of part of speech, matches it against a first whitelist created by a first method; when a constituent word is of a second type of part of speech, matches it against a second whitelist created by a second method; masks the constituent words that are not included in the first whitelist and the second whitelist; and outputs the result to the user terminal.

According to the present disclosure, it is possible to create an accurate document in which predetermined words are masked.

FIG. 1 is an explanatory diagram of the document creation system of the present disclosure.
FIG. 2 is an explanatory diagram of the hardware configuration of the present disclosure.
FIG. 3 is an explanatory diagram of a processing procedure of the present disclosure.
FIG. 4 is an explanatory diagram of the relationship between the number of words registered in the whitelist and the expression rate of the present disclosure.
FIG. 5 is an explanatory diagram of a processing procedure of the present disclosure.
FIG. 6 is an explanatory diagram of a processing procedure of the present disclosure.

An embodiment of a document creation system, a document creation method, and a document creation program will be described with reference to FIGS. 1 to 6. This embodiment assumes that predetermined words contained in a sentence are masked.
As shown in FIG. 1, the document creation system of this embodiment uses a user terminal 10, a support server 20, and a dictionary server 30, which are interconnected via a network.

(Hardware configuration example)
FIG. 2 shows a hardware configuration example of an information processing device H10 that functions as the user terminal 10, the support server 20, the dictionary server 30, and the like.

The information processing device H10 has a communication device H11, an input device H12, a display device H13, a storage device H14, and a processor H15. This hardware configuration is an example, and other hardware may be included.

The communication device H11 is an interface that establishes a communication path with other devices and transmits and receives data, such as a network interface or a wireless interface.

The input device H12 is a device that receives input from a user or the like, such as a mouse or a keyboard. The display device H13 is a display, a touch panel, or the like that displays various information.

The storage device H14 stores data and programs for executing the functions of the user terminal 10, the support server 20, and the dictionary server 30. Examples of the storage device H14 include ROM, RAM, and a hard disk.

The processor H15 uses the programs and data stored in the storage device H14 to control each process in the user terminal 10 and the support server 20 (for example, the processing in the control unit 21 described later). Examples of the processor H15 include a CPU and an MPU. The processor H15 loads a program stored in the ROM or the like into the RAM and executes the processes corresponding to the various kinds of processing. For example, when an application program of the user terminal 10 or the support server 20 is started, the processor H15 runs the processes that execute the processing described later.

The processor H15 is not limited to one that performs all of its processing in software. For example, the processor H15 may include a dedicated hardware circuit (for example, an application-specific integrated circuit: ASIC) that performs at least part of its processing in hardware. That is, the processor H15 can be configured as circuitry including:

(1) one or more processors that operate according to a computer program (software),
(2) one or more dedicated hardware circuits that execute at least part of the various kinds of processing, or
(3) a combination thereof.
A processor includes a CPU and memory such as RAM and ROM, and the memory stores program code or instructions configured to cause the CPU to execute processing. Memory, that is, computer-readable media, includes any available media that can be accessed by a general-purpose or special-purpose computer.

(Functions of the user terminal 10, the support server 20, and the dictionary server 30)
The functions of the user terminal 10, the support server 20, and the dictionary server 30 will be described with reference to FIG. 1.
The user terminal 10 is a computer terminal used by a user of this system.

The support server 20 is a computer system that masks sentences. The support server 20 includes a control unit 21, a teacher information storage unit 22, and a dictionary storage unit 23.
The control unit 21 performs the processing described later (processing including an acquisition stage, a list creation stage, a mask processing stage, and the like). By executing a document creation program for this purpose, the control unit 21 functions as an acquisition unit 210, a list creation unit 211, a mask processing unit 212, and the like.

The acquisition unit 210 acquires, from the user terminal 10, teacher information and disclosure candidate sentences to be masked.
The list creation unit 211 generates a whitelist of nouns (first whitelist) used to determine whether masking is necessary. The list creation unit 211 holds data on a reference value for the expression rate, which is used to decide which nouns to include in the first whitelist. Here, the expression rate is the ratio of the number of unmasked characters to the total number of characters in a sentence.
The mask processing unit 212 creates a disclosed sentence by masking, as necessary, a candidate sentence to be disclosed.
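The expression rate defined above can be illustrated with a minimal Python sketch. The function name `expression_rate`, the sample words, and the representation of a sentence as a list of words with parallel masked flags are assumptions made for illustration; they do not appear in the specification.

```python
def expression_rate(words, masked_flags):
    """Return the share of characters left unmasked (0.0 for empty input)."""
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return 0.0
    # Count only the characters of words whose flag says "not masked".
    visible_chars = sum(len(w) for w, m in zip(words, masked_flags) if not m)
    return visible_chars / total_chars


# Hypothetical example: only the first word is masked.
words = ["Tanaka", "visited", "the", "bank"]
flags = [True, False, False, False]
rate = expression_rate(words, flags)  # 14 visible characters out of 20
```

A higher value means that more of the sentence remains readable after masking.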

The teacher information storage unit 22 records teacher information used in the noun dictionary creation process. The teacher information contains data on disclosure candidate sentences, that is, sentences that are candidates for disclosure. No words are masked in these disclosure candidate sentences.

The dictionary storage unit 23 records the first whitelist, which lists words that may be used in disclosed sentences. The first whitelist covers nouns, the first type of part of speech, and is recorded when the noun dictionary creation process, the first method, is executed.

The dictionary server 30 is a computer system that holds a dictionary of the parts of speech of words. As the dictionary server 30, for example, the UniDic dictionary provided by the National Institute for Japanese Language and Linguistics can be used. The UniDic dictionary of the dictionary server 30 handles words in short units. Here, a short unit is a linguistic unit (unit word) that focuses on the morphological aspect of language and is defined from minimal units with an emphasis on uniformity. For words of the second type of part of speech, namely parts of speech other than nouns, the dictionary server 30 holds a second whitelist of words unrelated to personal information or specific information, obtained by the second method, which uses a general dictionary. In this embodiment, the second type of part of speech includes at least one of particles, verbs, auxiliary verbs, adverbs, and adjectives.

(Noun dictionary creation process)
Next, the noun dictionary creation process will be described with reference to FIG. 3.
First, the control unit 21 of the support server 20 executes teacher information acquisition processing (step S11). Specifically, the acquisition unit 210 of the control unit 21 acquires a teacher sentence from the user terminal 10 and records it in the teacher information storage unit 22. Next, the acquisition unit 210 divides the teacher sentence into parts of speech by morphological analysis and extracts all the nouns (part-of-speech group) contained in the teacher sentence. The acquisition unit 210 also calculates the total number of characters in the teacher sentence.

Next, the control unit 21 of the support server 20 executes processing for calculating the number of occurrences of nouns (step S12). Specifically, the list creation unit 211 of the control unit 21 calculates the total number of nouns extracted from the teacher sentence. The list creation unit 211 then calculates, for each noun extracted from the teacher sentence, the number of times that noun occurs.

Next, the control unit 21 of the support server 20 executes proper noun exclusion processing (step S13). Specifically, the list creation unit 211 of the control unit 21 acquires, from the dictionary server 30, the noun type (common noun, proper noun, numeral, formal noun, or pronoun) of each noun extracted from the teacher sentence. The list creation unit 211 then excludes proper nouns from the extracted nouns and identifies the remaining nouns as usable word candidates (noun candidates).

Next, the control unit 21 of the support server 20 executes processing for identifying nouns in descending order of number of occurrences (step S14). Specifically, among the nouns remaining after proper nouns are excluded, the list creation unit 211 of the control unit 21 identifies nouns that occur many times, that is, nouns with a high frequency of occurrence.

Next, the control unit 21 of the support server 20 executes registration processing for the noun dictionary (step S15). Specifically, the list creation unit 211 of the control unit 21 registers the identified noun in the first whitelist of the dictionary storage unit 23.

Next, the control unit 21 of the support server 20 executes processing for calculating the expression rate (step S16). Specifically, for each noun registered in the first whitelist at this point, the list creation unit 211 of the control unit 21 calculates the number of occurring characters by multiplying the noun's character count by its number of occurrences. The list creation unit 211 then calculates the expression rate by dividing the sum of these occurring-character counts by the total number of characters.

Here, as shown in FIG. 4, as more nouns are registered in the first whitelist, fewer words are masked, and the expression rate rises. However, as the number of registered nouns grows, the expression rate increases more slowly.

Next, the control unit 21 of the support server 20 determines whether the expression rate is higher than the reference value (step S17). Specifically, the list creation unit 211 of the control unit 21 compares the calculated expression rate with the reference value. Setting the reference value in the region where the expression rate increases only slowly keeps the number of nouns registered in the first whitelist from growing, which makes the registered nouns easier to maintain.

If the expression rate is determined to be equal to or lower than the reference value ("NO" in step S17), the control unit 21 of the support server 20 repeats the processing from the identification of nouns in descending order of number of occurrences (step S14) onward.
On the other hand, if the expression rate is determined to be higher than the reference value ("YES" in step S17), the control unit 21 of the support server 20 ends the noun dictionary creation process.
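Steps S11 to S17 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the teacher sentence is assumed to be already split into `(surface, part_of_speech)` pairs, the tags `"noun"` and `"proper_noun"` stand in for the noun types returned by the dictionary server, and the names `build_noun_whitelist`, `reference_rate`, and `teacher_tokens` are hypothetical.

```python
from collections import Counter


def build_noun_whitelist(tokens, reference_rate):
    """Register frequent non-proper nouns until the rate exceeds reference_rate."""
    # S11: total number of characters in the teacher sentence.
    total_chars = sum(len(surface) for surface, _ in tokens)
    # S12: number of occurrences of each noun.
    noun_counts = Counter(
        surface for surface, pos in tokens if pos in ("noun", "proper_noun")
    )
    # S13: exclude proper nouns from the candidates.
    proper = {surface for surface, pos in tokens if pos == "proper_noun"}
    candidates = [(n, c) for n, c in noun_counts.most_common() if n not in proper]

    whitelist, covered_chars = [], 0
    for noun, count in candidates:            # S14: most frequent first
        whitelist.append(noun)                # S15: register the noun
        covered_chars += len(noun) * count    # S16: characters x occurrences
        if total_chars and covered_chars / total_chars > reference_rate:
            break                             # S17: rate exceeds the reference value
    return whitelist


# Hypothetical, pre-tokenized teacher sentence.
teacher_tokens = [
    ("bank", "noun"), ("opens", "verb"), ("bank", "noun"),
    ("Tanaka", "proper_noun"), ("fee", "noun"), ("is", "verb"), ("low", "adjective"),
]
```

Raising `reference_rate` lets less frequent nouns into the list, which mirrors the trade-off between the expression rate and the size of the whitelist described above.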

(Masking process)
Next, the masking process will be described with reference to FIGS. 5 and 6. This process is executed when a new disclosure candidate sentence is acquired from the user terminal 10.

First, as shown in FIG. 5, the control unit 21 of the support server 20 executes word division processing (step S21). Specifically, the acquisition unit 210 of the control unit 21 divides the disclosure candidate sentence into its constituent words (short units) by morphological analysis.

Next, the control unit 21 of the support server 20 takes each word of the sentence in turn as the processing target word and repeats the following processing.
First, the control unit 21 of the support server 20 determines whether the word is a noun (step S22). Specifically, the mask processing unit 212 of the control unit 21 acquires the part of speech of each word from the dictionary server 30. The mask processing unit 212 then determines whether the part of speech of the processing target word is a noun.

If the processing target word is determined to be a noun ("YES" in step S22), the control unit 21 of the support server 20 executes masking processing (step S23). Specifically, the mask processing unit 212 of the control unit 21 masks the processing target word in the disclosure candidate sentence.

If the processing target word is a particle, verb, auxiliary verb, adverb, adjective, or the like and is determined not to be a noun ("NO" in step S22), the control unit 21 of the support server 20 executes matching against the whitelist created from the general dictionary (step S24). Specifically, the mask processing unit 212 of the control unit 21 matches the processing target word against the second whitelist recorded in the dictionary server 30.

Next, the control unit 21 of the support server 20 determines whether the word is a masking target (step S25). Specifically, the mask processing unit 212 of the control unit 21 determines that the processing target word is a masking target if it is not included in the second whitelist.

If the word is determined to be a masking target ("YES" in step S25), the control unit 21 of the support server 20 executes masking processing (step S23).
If the word is determined not to be a masking target ("NO" in step S25), the control unit 21 of the support server 20 ends the processing for this processing target word.

Next, the control unit 21 of the support server 20 executes matching against the whitelist created by the noun dictionary creation process (step S26). Specifically, the mask processing unit 212 of the control unit 21 matches the processing target word against the first whitelist recorded in the dictionary storage unit 23.

Next, the control unit 21 of the support server 20 determines whether the word is an unmasking target (step S27). Specifically, the mask processing unit 212 of the control unit 21 determines that the processing target word is an unmasking target if it is included in the first whitelist.

If the word is determined to be an unmasking target ("YES" in step S27), the control unit 21 of the support server 20 executes unmasking processing (step S28). Specifically, the mask processing unit 212 of the control unit 21 treats the word as a whitelist word and removes the mask applied to the processing target word in the disclosure candidate sentence.

On the other hand, if the word is determined not to be an unmasking target ("NO" in step S27), the control unit 21 of the support server 20 skips the unmasking processing (step S28). In this case, the mask on the processing target word is maintained.
The above processing is repeated for all the words in the sentence.
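The per-word flow of steps S22 to S28 can be sketched as below. This is a simplified illustration, assuming the sentence is already divided into `(surface, part_of_speech)` pairs (step S21) and that both whitelists are plain Python sets; the names `mask_words`, `render`, and the sample data are hypothetical and not taken from the specification.

```python
def mask_words(tokens, first_whitelist, second_whitelist):
    """Return one masked flag per word, following steps S22 to S28."""
    flags = []
    for surface, pos in tokens:
        if pos == "noun":
            masked = True                              # S22 YES -> S23: mask
        else:
            masked = surface not in second_whitelist   # S24/S25, then S23 if needed
        if masked and surface in first_whitelist:      # S26/S27: first whitelist
            masked = False                             # S28: remove the mask
        flags.append(masked)
    return flags


def render(tokens, flags, mask_char="*"):
    """Replace every character of a masked word with mask_char."""
    return " ".join(
        mask_char * len(surface) if masked else surface
        for (surface, _), masked in zip(tokens, flags)
    )


# Hypothetical disclosure candidate sentence and whitelists.
tokens = [("Tanaka", "noun"), ("visited", "verb"), ("the", "particle"), ("bank", "noun")]
flags = mask_words(tokens, first_whitelist={"bank"}, second_whitelist={"visited", "the"})
text = render(tokens, flags)
```

Every noun starts out masked and is revealed only when the first whitelist contains it, so a noun missing from the whitelist stays hidden by default.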

Next, as shown in FIG. 6, the control unit 21 of the support server 20 executes word reconstruction processing (step S31). Specifically, the mask processing unit 212 of the control unit 21 generates combined words (reconstructed words) formed by sequentially shifting over the short-unit words.

Next, the control unit 21 of the support server 20 executes part-of-speech identification processing (step S32). Specifically, the mask processing unit 212 of the control unit 21 acquires the part of speech of each combined word from the dictionary server 30.

Next, the control unit 21 of the support server 20 determines whether the combined word is a blacklist target (step S33). Specifically, if the part of speech acquired from the dictionary server 30 for the combined word is a proper noun, the mask processing unit 212 of the control unit 21 determines that the combined word is included in the blacklist of masking targets.

If the combined word is determined to be a masking target ("YES" in step S33), the control unit 21 of the support server 20 executes masking processing (step S34), as in step S23.

On the other hand, if the combined word is determined not to be a masking target ("NO" in step S33), the control unit 21 of the support server 20 skips the masking processing (step S34).

Next, the control unit 21 of the support server 20 determines whether the processing is finished (step S35). Specifically, the mask processing unit 212 of the control unit 21 determines whether the processing has been completed for all consecutive whitelist words in the disclosure candidate sentence.

If the processing is determined not to be finished ("NO" in step S35), the control unit 21 of the support server 20 repeats the processing from the word reconstruction processing (step S31) onward.
On the other hand, if the processing is determined to be finished ("YES" in step S35), the control unit 21 of the support server 20 executes processing for outputting the disclosed sentence (step S36). Specifically, the mask processing unit 212 of the control unit 21 outputs, to the user terminal 10, the disclosed sentence obtained by applying the masking process to the disclosure candidate sentence.
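Steps S31 to S34 can be sketched as a sliding window over consecutive unmasked words. The specification does not state how many short units are combined, so the window limit `max_len`, the function name `mask_combined_proper_nouns`, and the `is_proper_noun` callback (a stand-in for the part-of-speech lookup on the dictionary server) are all assumptions for illustration.

```python
def mask_combined_proper_nouns(surfaces, masked, is_proper_noun, max_len=2):
    """Mask runs of unmasked words whose concatenation is a proper noun."""
    result = list(masked)
    for start in range(len(surfaces)):
        for size in range(2, max_len + 1):
            end = start + size
            if end > len(surfaces):
                break
            # Only consecutive words that passed the whitelists are combined.
            if any(masked[start:end]):
                break
            combined = "".join(surfaces[start:end])   # S31: combined word
            if is_proper_noun(combined):              # S32/S33: proper noun?
                for i in range(start, end):
                    result[i] = True                  # S34: mask the run
    return result


# Hypothetical example: two common nouns that together form a proper noun.
surfaces = ["青", "山", "に", "行く"]
known_proper_nouns = {"青山"}
flags = mask_combined_proper_nouns(
    surfaces, [False, False, False, False], known_proper_nouns.__contains__
)
```

In this example each of the first two words is harmless on its own, but their concatenation is found in the (hypothetical) proper-noun set, so both are masked.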

According to this embodiment, the following effects can be obtained.
(1) In this embodiment, the control unit 21 of the support server 20 executes the teacher information acquisition processing (step S11) and the proper noun exclusion processing (step S13). This makes it possible to extract nouns other than proper nouns.

(2) In this embodiment, the control unit 21 of the support server 20 executes the processing for identifying nouns in descending order of number of occurrences (step S14), the registration processing for the noun dictionary (step S15), and the processing for calculating the expression rate (step S16). This makes it possible to create a whitelist that secures a predetermined expression rate.

(3) In this embodiment, if the word is determined to be a noun ("YES" in step S22), the control unit 21 of the support server 20 executes the masking processing (step S23). This makes it possible to use a sentence in which everything is masked as the initial value.

(4) In this embodiment, if the word is determined not to be a noun ("NO" in step S22), the control unit 21 of the support server 20 executes matching against the second whitelist created from the general dictionary (step S24). If the word is determined to be a masking target ("YES" in step S25), the control unit 21 of the support server 20 executes the masking processing (step S23). As a result, specific information can also be excluded for words other than nouns.

(5) In this embodiment, the control unit 21 of the support server 20 executes matching against the whitelist created by the noun dictionary creation process (step S26). If the word is determined to be an unmasking target ("YES" in step S27), the control unit 21 of the support server 20 executes the unmasking processing (step S28). This allows masked words to be reconstructed using the whitelist. Furthermore, when the word is determined to be a masking target ("YES" in step S25), the control unit 21 of the support server 20 also executes matching against the whitelist created by the noun dictionary creation process (step S26). As a result, even if the part of speech acquired from the dictionary server 30 is not accurate, it can be corrected using the two whitelists.

(6) In this embodiment, the control unit 21 of the support server 20 executes the word reconstruction processing (step S31) and the part-of-speech identification processing (step S32). If the combined word is determined to be a masking target ("YES" in step S33), the control unit 21 of the support server 20 executes the masking processing (step S34). As a result, even when consecutive common nouns form a proper noun, it can be excluded from the disclosed sentence.

This embodiment can be modified as follows. This embodiment and the following modifications can be combined with one another as long as no technical contradiction arises.
- In the above embodiment, the user terminal 10, the support server 20, and the dictionary server 30 are used, but the hardware configuration is not limited to this. For example, the UniDic dictionary may be held within the support server 20.

- In the above embodiment, the ratio of the number of unmasked characters to the total number of characters in a sentence is used as the expression rate. The expression rate is not limited to a character count, as long as it indicates the proportion of a sentence that can be expressed with whitelist words. For example, the ratio of the number of whitelist words to the total number of words in a sentence may be used.
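The word-based variant mentioned in this modification can be sketched in a few lines of Python. The function name `word_expression_rate` and the use of a list of masked flags (one per word, where an unmasked word is treated as a whitelist word) are assumptions made for illustration.

```python
def word_expression_rate(masked_flags):
    """Return the share of words left unmasked (0.0 for an empty sentence)."""
    if not masked_flags:
        return 0.0
    unmasked_words = sum(1 for masked in masked_flags if not masked)
    return unmasked_words / len(masked_flags)


# Hypothetical sentence of four words in which only the first is masked.
rate = word_expression_rate([True, False, False, False])
```

Compared with the character-based definition, this variant gives every word the same weight regardless of its length.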

- In the above embodiment, the UniDic dictionary provided by the National Institute for Japanese Language and Linguistics is used, for example, as the dictionary server 30, but the dictionary is not limited to this as long as the part of speech can be identified.
- In the above embodiment, even when a word is determined to be a masking target ("YES" in step S25), the control unit 21 of the support server 20 executes the masking processing (step S23) and the matching against the whitelist created from the noun dictionary (step S26). Alternatively, when a word is determined to be a masking target ("YES" in step S25), only the masking processing (step S23) may be performed, without matching against the whitelist created from the noun dictionary.
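
The two definitions of the expression rate mentioned in the modifications above (character-based and word-based) can be compared with the following Python sketch. It assumes that a sentence is already split into words and that any word absent from the whitelist is masked; the function names and sample data are hypothetical.

```python
from typing import Sequence, Set


def char_expression_rate(words: Sequence[str], whitelist: Set[str]) -> float:
    """Unmasked characters divided by all characters in the sentence."""
    total = sum(len(w) for w in words)
    if total == 0:
        return 0.0
    unmasked = sum(len(w) for w in words if w in whitelist)
    return unmasked / total


def word_expression_rate(words: Sequence[str], whitelist: Set[str]) -> float:
    """Whitelisted words divided by all words in the sentence."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in whitelist) / len(words)


# Hypothetical sentence of four words; "開設" is not whitelisted.
words = ["口座", "を", "開設", "する"]
whitelist = {"口座", "を", "する"}
print(char_expression_rate(words, whitelist))  # 5 of 7 characters unmasked
print(word_expression_rate(words, whitelist))  # 3 of 4 words whitelisted
```

The two measures can differ for the same sentence, because a single long masked word lowers the character-based rate more than the word-based rate.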

10: user terminal; 20: support server; 30: dictionary server; 21: control unit; 210: acquisition unit; 211: list creation unit; 212: mask processing unit; 22: teacher information storage unit; 23: dictionary storage unit.

Claims

1. A document creation system comprising a control unit connected to a user terminal, wherein
the control unit:
performs morphological analysis on a sentence acquired from the user terminal to identify constituent words;
when a constituent word is of a first type of part of speech, matches the constituent word against a first whitelist created by a first method;
when a constituent word is of a second type of part of speech, matches the constituent word against a second whitelist created by a second method; and
masks the constituent words that are not included in the first whitelist and the second whitelist, and outputs the result to the user terminal,
the first type of part of speech is a noun, and
in the first method:
all constituent words of a teacher sentence are extracted;
usable noun candidates are identified among the extracted constituent words;
the appearance frequency of each of the noun candidates is calculated;
the noun candidates are identified sequentially in order of appearance frequency, and an expression rate is calculated based on the ratio of all the identified noun candidates in the teacher sentence; and
the first whitelist is created so as to include the noun candidates when the expression rate is equal to or higher than a reference value.

2. The document creation system according to claim 1, wherein
the second type of part of speech is a part-of-speech group including any part of speech other than a noun, and
the second whitelist is created using the part-of-speech group.

3. The document creation system according to claim 1, wherein
the control unit combines consecutive constituent words to generate a reconstructed word, and
performs the masking when the reconstructed word is included in a blacklist.

4. A method for creating masked text using a document creation system having a control unit connected to a user terminal, wherein
the control unit:
performs morphological analysis on a sentence acquired from the user terminal to identify constituent words;
when a constituent word is of a first type of part of speech, matches the constituent word against a first whitelist created by a first method;
when a constituent word is of a second type of part of speech, matches the constituent word against a second whitelist created by a second method; and
masks the constituent words that are not included in the first whitelist and the second whitelist, and outputs the result to the user terminal,
the first type of part of speech is a noun, and
in the first method:
all constituent words of a teacher sentence are extracted;
usable noun candidates are identified among the extracted constituent words;
the appearance frequency of each of the noun candidates is calculated;
the noun candidates are identified sequentially in order of appearance frequency, and an expression rate is calculated based on the ratio of all the identified noun candidates in the teacher sentence; and
the first whitelist is created so as to include the noun candidates for which the expression rate is equal to or higher than a reference value.

5. A program for creating masked text using a document creation system having a control unit connected to a user terminal, the program causing the control unit to function as means for:
performing morphological analysis on a sentence acquired from the user terminal to identify constituent words;
matching a constituent word against a first whitelist created by a first method when the constituent word is of a first type of part of speech;
matching a constituent word against a second whitelist created by a second method when the constituent word is of a second type of part of speech; and
masking the constituent words that are not included in the first whitelist and the second whitelist, and outputting the result to the user terminal,
wherein the first type of part of speech is a noun, and
in the first method:
all constituent words of a teacher sentence are extracted;
usable noun candidates are identified among the extracted constituent words;
the appearance frequency of each of the noun candidates is calculated;
the noun candidates are identified sequentially in order of appearance frequency, and an expression rate is calculated based on the ratio of all the identified noun candidates in the teacher sentence; and
the first whitelist is created so as to include the noun candidates when the expression rate is equal to or higher than a reference value.