JP7507733B2

JP7507733B2 - Information processing device, information processing method, and information processing program

Info

Publication number: JP7507733B2
Application number: JP2021134681A
Authority: JP
Inventors: 颯太山城
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-08-20
Filing date: 2021-08-20
Publication date: 2024-06-28
Anticipated expiration: 2041-08-20
Also published as: JP2023028783A

Description

The present invention relates to an information processing device, an information processing method, and an information processing program.

Various techniques have been proposed for generating data used to train models by machine learning or the like. For example, a technique has been proposed that generates training data including character string images and correct answer labels, based on a database in which multiple words to be written in the handwritten character areas of forms are registered and a dataset of handwritten character images (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent No. 6590355

However, the above conventional technique leaves room for improvement. For example, the conventional technique targets handwritten character images, that is, images, so it is difficult to use it to generate data for training models that target character information. It is therefore desirable to efficiently generate character information that can be used for model training.

The present application has been made in view of the above, and aims to provide an information processing device, an information processing method, and an information processing program that efficiently generate character information that can be used for model training.

The information processing device according to the present application comprises: an acquisition unit that acquires a learning dataset including first character information, which is labeled for use in training a model that extracts, from character information, an extraction target character string (a character string corresponding to a predetermined type), and second character information, which is character information without the label; and a generation unit that selects, from the learning dataset, the first character information similar to the second character information as similar character information, and changes a first character string, which is the extraction target character string in the similar character information, to a second character string, which is the extraction target character string in the second character information, thereby generating changed character information that includes the second character string and can be used for training the model.

According to one aspect of the embodiment, character information that can be used for model training can be generated efficiently.

FIG. 1 is a diagram illustrating an example of processing by an information processing system according to an embodiment.
FIG. 2 is a diagram illustrating an example of the generation process.
FIG. 3 is a diagram illustrating an example of the vector conversion process.
FIG. 4 is a diagram illustrating an example of the configuration of an information processing device according to the embodiment.
FIG. 5 is a diagram illustrating an example of a learning data storage unit according to the embodiment.
FIG. 6 is a diagram illustrating an example of a model information storage unit according to the embodiment.
FIG. 7 is a flowchart illustrating an example of processing by the information processing device according to the embodiment.
FIG. 8 is a diagram illustrating an example of a hardware configuration.

Below, modes for carrying out the information processing device, information processing method, and information processing program according to the present application (hereinafter referred to as "embodiments") will be described in detail with reference to the drawings. Note that the information processing device, information processing method, and information processing program according to the present application are not limited by these embodiments. In the following embodiments, identical parts are denoted by the same reference numerals, and duplicate descriptions are omitted.

(Embodiment)
[1. Information Processing]
An example of information processing according to the embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of processing by an information processing system according to the embodiment. FIG. 1 illustrates, as an example, a case where a model M1 that extracts named entities is generated using learning data generated by an information processing device 100.

First, the configuration of the information processing system 1 will be described. As shown in FIG. 1, the information processing system 1 includes a terminal device 10 and an information processing device 100. The terminal device 10 and the information processing device 100 are communicably connected, by wire or wirelessly, via a predetermined communication network (not shown). Note that the information processing system 1 shown in FIG. 1 may include multiple terminal devices 10 and multiple information processing devices 100.

The information processing device 100 is a computer that replaces a character string in labeled character information (hereinafter also referred to as "first character information") with a character string in unlabeled character information (hereinafter also referred to as "second character information") to generate character information that can be used for model training. The information processing device 100 converts a character string that corresponds to a predetermined type in the first character information (hereinafter also referred to as "first character string") into a character string that is estimated to correspond to the predetermined type in the second character information (hereinafter also referred to as "second character string"). The example of FIG. 1 describes a case where the predetermined type is a named entity.

In this way, the information processing device 100 generates character information (hereinafter also referred to as "changed character information") that includes the second character string and can be used for model training. In FIG. 1, the information processing device 100 uses learning data that includes first character information, such as data manually labeled by a predetermined editor ED1 (manually labeled training data), to generate changed character information to be used as new learning data.

The terminal device 10 is a device (computer) used to assign labels (correct answer information) to data (character information) and thereby manually generate training data (learning data). The terminal device 10 is used by a predetermined editor ED1 who assigns labels to data. The terminal device 10 is realized by, for example, a smartphone, a tablet terminal, a notebook PC (Personal Computer), a desktop PC, a mobile phone, or a PDA (Personal Digital Assistant). FIG. 1 shows the case where the terminal device 10 is a desktop PC.

For example, the predetermined editor ED1 operates the terminal device 10 to assign a label to each piece of character information. For example, in response to an operation by the predetermined editor ED1, the terminal device 10 assigns, as correct answer information, information (a label) indicating the type (attribute) to which each character string in the character information corresponds, as in the first character information LD1, which is the labeled data shown in FIG. 2.

For example, the terminal device 10 adds to the character information a label indicating the position (range) of a character string that corresponds to a predetermined type. For example, it adds to the character information a label indicating a character string that corresponds to a proper noun (named entity) such as an organization name. The terminal device 10 generates learning data by assigning labels to character information in response to operations by the predetermined editor ED1. The terminal device 10 transmits the manually generated training data (learning data) to the information processing device 100.
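As an illustration of a label that marks the position (range) of a character string, the following minimal Python sketch stores a text together with the start and end offsets of the labeled string. The class name `LabeledText`, the helper `label_span`, and the label value `"ORG"` are hypothetical and do not appear in the specification, which does not define the data format used by the terminal device 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledText:
    """Hypothetical record: a text plus the range of one labeled string."""
    text: str
    start: int   # offset of the first character of the labeled string
    end: int     # offset one past the last character of the labeled string
    label: str   # type of the labeled string, e.g. an assumed "ORG" tag

    @property
    def entity(self) -> str:
        """Return the labeled character string itself."""
        return self.text[self.start:self.end]


def label_span(text: str, target: str, label: str) -> LabeledText:
    """Label the first occurrence of `target` inside `text`."""
    start = text.find(target)
    if start < 0:
        raise ValueError("target string does not occur in text")
    return LabeledText(text, start, start + len(target), label)
```

Storing offsets rather than only the string keeps the label unambiguous when the same string occurs more than once in the text.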

The information processing system 1 may also include a user terminal, which is a device used by a user. The user terminal is realized by, for example, a smartphone, a tablet terminal, a notebook PC, a desktop PC, a mobile phone, or a PDA, and provides various services to the user. The user terminal executes processing and displays information in response to user operations.

The information processing system 1 may also include a service providing device that provides users with a service related to an Internet encyclopedia. The Internet encyclopedia referred to here may be, for example, Wikipedia. The service providing device provides information on commentary content that explains a predetermined subject. In response to a request from the information processing device 100, the service providing device transmits information on the commentary content that explains the predetermined subject to the information processing device 100. In response to a request from the user terminal, the service providing device transmits information on the commentary content that explains the predetermined subject to the user terminal.

An example of information processing will be described below with reference to FIG. 1. First, the predetermined editor ED1 assigns a label to each piece of character information (step S11). For example, the predetermined editor ED1 checks the content of unlabeled character information and generates manually labeled training data by attaching, to each character string in the character information that corresponds to a named entity, a label indicating that the character string is a named entity. For example, the predetermined editor ED1 may operate the terminal device 10 to assign, as correct answer information, information (labels) indicating the type (attribute) to which each character string in the character information corresponds, as in the first character information LD1, which is the labeled data shown in FIG. 2. Details of FIG. 2 will be described later.

After finishing labeling each piece of character information, the predetermined editor ED1 operates the terminal device 10 to transmit the manually labeled training data to the information processing device 100 (step S12). In response to the operation by the predetermined editor ED1, the terminal device 10 transmits the training data labeled by the predetermined editor ED1 to the information processing device 100.

As a result, the information processing device 100 acquires the manually labeled training data. The information processing device 100 then adds the acquired manually labeled training data to the learning dataset DS1 as first character information. Specifically, the information processing device 100 registers the manually labeled training data received from the terminal device 10 in the learning data storage unit 121 (see FIG. 5) as data to be used for learning.

Then, the information processing device 100 performs a process of generating new character information using the group of first character information in the learning dataset DS1. In FIG. 1, the information processing device 100 performs the process of generating new character information on second character information UD1 extracted from content. Here, the second character information UD1 is character information to which no label is attached. For example, like the second character information UD1 shown in FIG. 2, it is character information included in content that describes a predetermined subject ("Ｘ曜日の〇〇" in the example of FIG. 2). Note that the content from which the second character information UD1 is extracted may be any of various kinds of content, such as content provided in an Internet encyclopedia. The second character information UD1 includes "Ｘ曜日の〇〇" as the second character string, which is a named entity. Although it is written abstractly here, "Ｘ曜日の〇〇" is assumed to be an actually existing proper noun (proper name) and a new word that denotes a newly appeared artist (organization name).

The information processing device 100 selects, from the learning dataset DS1, first character information similar to the second character information UD1 (step S13). For example, the information processing device 100 may vectorize each piece of character information and select the similar character information based on the similarity of the vectors. In this case, the information processing device 100 selects the similar character information based on the similarity between each first vector, obtained by vectorizing a piece of first character information in the learning dataset DS1, and the second vector, obtained by vectorizing the second character information UD1.

For example, the information processing device 100 selects, as the similar character information, the first character information corresponding to the first vector that has the maximum cosine similarity with the second vector. Note that the similarity between vectors is not limited to cosine similarity; any index may be used, such as Euclidean distance or Mahalanobis distance. For example, the information processing device 100 converts each piece of character information into a vector using a model M2 that converts character information into a vector. For example, the information processing device 100 converts each piece of character information into a vector using a model M2 trained with any of various techniques related to word2vec (also called "w2v").
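The selection by maximum cosine similarity can be sketched as follows. This is a minimal illustration using plain Python lists as vectors; the function names `cosine_similarity` and `select_most_similar` are hypothetical, and the choice to treat a zero vector as having similarity 0.0 is an assumption not stated in the specification.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def select_most_similar(first_vectors, second_vector):
    """Return (index, similarity) of the first vector closest to second_vector."""
    if not first_vectors:
        raise ValueError("no first vectors to compare against")
    similarities = [cosine_similarity(v, second_vector) for v in first_vectors]
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    return best_index, similarities[best_index]
```

Swapping in another index, such as Euclidean distance, would only require replacing `cosine_similarity` and selecting the minimum instead of the maximum.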

For example, the information processing device 100 inputs each character string corresponding to a noun in the second character information UD1 into the model M2, and uses the average of the vectors output by the model M2 as the vector of the second character information UD1 (second vector). Likewise, the information processing device 100 inputs each character string corresponding to a noun in the first character information LD1 into the model M2, and uses the average of the vectors output by the model M2 as the vector of the first character information LD1 (first vector). Note that the above is merely an example; the information processing device 100 may instead use a vector obtained by converting the entire second character information UD1 as the second vector, and a vector obtained by converting the entire first character information LD1 as the first vector.
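The averaging of per-noun vectors can be sketched as follows. In this illustration a small dictionary stands in for the model M2, and the nouns are assumed to have been extracted already (the specification does not say how nouns are identified). The function name `mean_vector` and the decision to skip words missing from the table are assumptions.

```python
def mean_vector(nouns, embeddings):
    """Average the vectors of the given nouns.

    `embeddings` is a stand-in for model M2: a dict mapping a word to a
    list of floats. Words without an entry are skipped (an assumption).
    """
    vectors = [embeddings[w] for w in nouns if w in embeddings]
    if not vectors:
        raise ValueError("none of the nouns has a vector")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
```

For instance, with a toy table in which "band" maps to `[1.0, 0.0]` and "album" maps to `[0.0, 1.0]`, the text vector for the nouns "band" and "album" is `[0.5, 0.5]`.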

In FIG. 1, the information processing device 100 selects, from the learning dataset DS1, the first character information LD1, which has the highest similarity to the second character information UD1, as the similar character information. The first character information LD1 includes "Artist A" as the first character string, which is a named entity.

Although it is written abstractly here, "Artist A" is assumed to be an actually existing proper noun (proper name). In addition, if there is no first character information similar to the second character information UD1, the information processing device 100 may exclude the second character information UD1 from processing. For example, if there is no first character information whose similarity to the second character information UD1 is equal to or greater than a predetermined value, the information processing device 100 may exclude the second character information UD1 from processing.
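The exclusion rule can be sketched as a threshold check on precomputed similarity scores. The function name `select_with_threshold` is hypothetical, and the threshold itself is the unspecified "predetermined value", passed in here as a parameter.

```python
def select_with_threshold(similarities, threshold):
    """Return the index of the most similar candidate, or None.

    None means that no candidate reaches `threshold`, in which case the
    second character information would be excluded from processing.
    """
    if not similarities:
        return None
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    if similarities[best_index] < threshold:
        return None
    return best_index
```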

Then, the information processing device 100 converts the first character string in the similar character information into the second character string, thereby generating changed character information that includes the second character string and can be used for model training (step S14). In FIG. 1, the information processing device 100 converts the named entity "Artist A" in the first character information LD1 into the named entity "Ｘ曜日の〇〇" in the second character information UD1. As a result, the information processing device 100 generates changed character information CD1, in which the first character string "Artist A" in the first character information LD1 has been converted into the second character string "Ｘ曜日の〇〇". In other words, the information processing device 100 generates changed character information CD1 that includes the second character string "Ｘ曜日の〇〇" and can be used for training the model M1.
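The conversion in step S14 can be sketched as a span replacement that also recomputes the label range, so the changed text stays labeled. This is a minimal illustration that assumes the label is stored as start and end offsets, which the specification does not prescribe; the function name `replace_entity` is hypothetical.

```python
def replace_entity(text, start, end, new_entity):
    """Replace text[start:end] with new_entity.

    Returns the changed text and the (start, end) range of the new
    string, so the label can be carried over to the changed text.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("label range lies outside the text")
    changed = text[:start] + new_entity + text[end:]
    return changed, start, start + len(new_entity)
```

Because the surrounding context is reused unchanged, the new string inherits a context in which a named entity of the same type is known to occur.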

The information processing device 100 adds the generated changed character information, as first character information, to the data used for learning (step S15). In FIG. 1, the information processing device 100 adds the changed character information CD1, in which the first character string "Artist A" in the first character information LD1 has been converted into the second character string "Ｘ曜日の〇〇", to the learning dataset DS1. For example, the information processing device 100 associates the changed character information CD1 with a label indicating that "Ｘ曜日の〇〇" in the changed character information CD1 is a named entity, and stores it in the learning data storage unit 121 as first character information.

Then, the information processing device 100 trains the model M1 using the learning dataset DS1 to which the changed character information CD1 has been added (step S16). The information processing device 100 uses the learning dataset DS1 to learn (update) parameters of the model M1 such as weights. Any method can be adopted for the training process of the model M1.

For example, the information processing device 100 trains the model M1 so that, when the character information input to the model M1 includes a named entity, the model M1 outputs information indicating that named entity. For example, the information processing device 100 trains the model M1 so that, when the changed character information CD1 is input, the model M1 outputs the character string "Ｘ曜日の〇〇". For example, the model M1 may be a network such as a recurrent neural network (RNN) or an LSTM (Long Short-Term Memory units), which is an extension of the RNN. Note that the above is merely an example; the model M1 is not limited to a recurrent neural network, and any network configuration may be adopted as long as it can extract character strings of a predetermined type from character information.

As described above, the information processing device 100 can efficiently generate character information usable for model training by converting the first character string in already labeled data (first character information) into another character string (second character string) to generate new learning data. In the example of FIG. 1, the information processing device 100 can add character information including the new word "Ｘ曜日の〇〇" to the learning data, and can therefore train a model that is likely to extract a named entity properly even when the named entity is a new word. If the learning data contains enough examples of such new words, the machine learning model becomes more likely to handle them. The information processing system 1 therefore generates learning data that includes new-word named entities and uses it to augment the existing learning data, which makes it possible to train a model that is likely to extract even new words properly.

[1-1. Processing example]
Here, a processing example related to the content described with FIG. 1 will be described with reference to FIGS. 2 and 3. FIG. 2 is a diagram showing an example of the generation process. FIG. 3 is a diagram showing an example of the vector conversion process.

As shown in FIG. 2, the second character information UD1 is character information extracted from content CT of an Internet encyclopedia such as Wikipedia. For newly created named entities such as new words, the information processing system 1 often does not have sufficient information on hand. For this reason, the information processing device 100 collects, for example, entries that were recently added to the Internet encyclopedia (for example, within the last month) and that have a large number of links. In this way, the information processing device 100 acquires, for example, commentary content in the Internet encyclopedia.
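The collection of recently added, heavily linked entries can be sketched as a simple filter. The entry fields `added` and `links`, the function name `collect_recent_popular`, and the default values of 30 days and 100 links are all assumptions made for illustration; the specification gives only "within one month" as an example and does not quantify "a large number of links".

```python
from datetime import date, timedelta


def collect_recent_popular(entries, today, max_age_days=30, min_links=100):
    """Keep entries added within max_age_days that have at least min_links links.

    Each entry is assumed to be a dict with an `added` date and a `links`
    count. The result is sorted by link count, highest first.
    """
    cutoff = today - timedelta(days=max_age_days)
    selected = [
        e for e in entries
        if e["added"] >= cutoff and e["links"] >= min_links
    ]
    return sorted(selected, key=lambda e: e["links"], reverse=True)
```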

Then, based on the tags attached to the commentary content, the information processing device 100 estimates which of the character strings contained in the commentary content are named entities. For example, the information processing device 100 may analyze the HTML (Hyper Text Markup Language) of the commentary content to estimate various pieces of information. For example, the information processing device 100 analyzes the HTML of the commentary content and estimates that the heading portion of the commentary content is the subject that the commentary content explains.

In addition, for example, when the tag attached to the commentary content is an organization name or the like, such as a music unit, the information processing device 100 estimates that the subject indicated by the heading portion of the commentary content is a named entity. For example, when the category indicated by the tag attached to the commentary content is an organization name or the like, such as a music unit, the information processing device 100 may estimate that the subject indicated by the heading portion of the commentary content is a named entity.
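The heading and tag analysis can be sketched with Python's standard `html.parser` module. The markup assumed here (the heading in an `<h1>` element and each category tag in an `<a class="category">` element) is invented for illustration and is not the actual markup of Wikipedia or of any other service; the set `ENTITY_CATEGORIES` and the function name `estimate_named_entity` are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of categories whose headings are treated as named entities.
ENTITY_CATEGORIES = {"music unit"}


class HeadingAndCategoryParser(HTMLParser):
    """Collect the <h1> text and the text of <a class="category"> elements."""

    def __init__(self):
        super().__init__()
        self.heading = ""
        self.categories = []
        self._in_heading = False
        self._in_category = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a" and dict(attrs).get("class") == "category":
            self._in_category = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False
        elif tag == "a":
            self._in_category = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data
        elif self._in_category:
            self.categories.append(data.strip())


def estimate_named_entity(html):
    """Return the heading if a category marks it as a named entity, else None."""
    parser = HeadingAndCategoryParser()
    parser.feed(html)
    parser.close()
    if any(c in ENTITY_CATEGORIES for c in parser.categories):
        return parser.heading.strip()
    return None
```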

The information processing device 100 may then treat the character string indicated by the heading portion of the commentary content as the second character string, and extract it together with the sentence that follows the heading portion in the commentary content as the second character information. Note that the above is merely an example, and the second character information may include various kinds of information. The information processing device 100 may also acquire, from the service providing device, second character information that has been extracted from the analyzed content.

For example, the content CT of the Internet encyclopedia is numerous, with for example 50,000 entries, and is cheap to collect. In contrast, the learning dataset DS1, which contains manually labeled training data, is costly to produce and small, with for example 3,000 items. The information processing device 100 therefore uses the content CT and the manually labeled training data MD1 to automatically generate new learning data.

For example, the information processing device 100 selects, from the commentary content included in the content CT, content whose explained subject is a new word (also referred to as "new word content"). For example, among the subjects explained by the pieces of commentary content, the information processing device 100 estimates that a subject that is not included as a named entity in the first character information in the learning dataset DS1 is a new word. The information processing device 100 then selects the content whose explained subject is estimated to be a new word as new word content. The information processing device 100 generates second character information from the new word content.
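The new-word estimation can be sketched as a membership test against the named entities already present in the learning dataset. This minimal illustration assumes each labeled item is a `(text, start, end)` tuple and each piece of content is a dict with a `heading` key; both layouts and the function names are hypothetical.

```python
def collect_known_entities(dataset):
    """Gather every labeled string in the dataset into a set."""
    return {text[start:end] for text, start, end in dataset}


def select_new_word_contents(contents, dataset):
    """Keep the contents whose heading is not a known named entity."""
    known = collect_known_entities(dataset)
    return [c for c in contents if c["heading"] not in known]
```

An exact string comparison is the simplest reading of "not included as a named entity"; spelling variants of a known entity would be treated as new words under this sketch.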

For example, the information processing device 100 generates the second character information by extracting the character string indicated by the heading portion of the commentary content (the second character string) and the sentence that follows the heading portion in the commentary content. From commentary content whose heading is "Ｘ曜日の〇〇", the information processing device 100 generates the second character information UD1 by extracting the second character string "Ｘ曜日の〇〇" indicated by the heading portion and the sentence that follows the heading portion.

そして、情報処理装置１００は、学習用データセットＤＳ１のうち、第２文字情報ＵＤ１に類似する第１文字情報を選択する。例えば、情報処理装置１００は、学習用データセットＤＳ１中の各第１文字情報と第２文字情報ＵＤ１との各々をベクトル化して、ベクトルの類似度を基に、類似文字情報を選択する。情報処理装置１００は、学習用データセットＤＳ１中の各第１文字情報と第２文字情報ＵＤ１との各々をベクトル化する。情報処理装置１００は、文字情報をベクトルに変換するモデルＭ２を用いて、各文字情報をベクトルに変換する。 Then, the information processing device 100 selects, from the learning dataset DS1, first character information that is similar to the second character information UD1. For example, the information processing device 100 vectorizes each piece of first character information and second character information UD1 in the learning dataset DS1, and selects similar character information based on the similarity of the vectors. The information processing device 100 vectorizes each piece of first character information and second character information UD1 in the learning dataset DS1. The information processing device 100 converts each piece of character information into a vector using a model M2 that converts character information into a vector.

図３では、情報処理装置１００は、第２文字情報ＵＤ１をモデルＭ２に入力することより、モデルＭ２に第２文字情報ＵＤ１をベクトル化したベクトルＶＣ１を出力させることにより、第２文字情報ＵＤ１をベクトルに変換する。また、情報処理装置１００は、第１文字情報ＬＤ１をモデルＭ２に入力することより、モデルＭ２に第１文字情報ＬＤ１をベクトル化したベクトルＶＣ２を出力させることにより、第１文字情報ＬＤ１をベクトルに変換する。なお、図３では、第１文字情報ＬＤ１のみを図示するが、情報処理装置１００は、学習用データセットＤＳ１中の各第１文字情報をベクトル化するものとする。 In FIG. 3, the information processing device 100 converts the second character information UD1 into a vector by inputting the second character information UD1 into a model M2 and having the model M2 output a vector VC1 obtained by vectorizing the second character information UD1. The information processing device 100 also converts the first character information LD1 into a vector by inputting the first character information LD1 into the model M2 and having the model M2 output a vector VC2 obtained by vectorizing the first character information LD1. Note that while FIG. 3 illustrates only the first character information LD1, the information processing device 100 vectorizes each piece of first character information in the training dataset DS1.

例えば、情報処理装置１００は、第２文字情報ＵＤ１のベクトルＶＣ１とのコサイン類似度が最大であるベクトルに対応する第１文字情報を類似文字情報として選択する。図２では、情報処理装置１００は、固有表現の第１文字列として「アーティストＡ」が含まれる第１文字情報ＬＤ１を類似文字情報として選択する。このように、情報処理装置１００は、学習データ中の文（ベクトル）と最もよく似た説明文（ベクトル）を持つエントリーを対象として、処理を実行する。これにより、情報処理装置１００は、元文と関連のないエントリーが選ばれる可能性を抑制することができる。これにより、情報処理装置１００は、学習データ中の文と似た説明文を持つエントリーを対象として、処理を実行する。 For example, the information processing device 100 selects, as similar character information, the first character information corresponding to the vector having the maximum cosine similarity with the vector VC1 of the second character information UD1. In FIG. 2, the information processing device 100 selects, as similar character information, the first character information LD1 containing "artist A" as the first character string of the named entity. In this way, the information processing device 100 executes processing on entries having a description (vector) most similar to a sentence (vector) in the learning data. This allows the information processing device 100 to reduce the possibility of selecting an entry unrelated to the original sentence. This allows the information processing device 100 to execute processing on entries having a description similar to a sentence in the learning data.
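The selection by maximum cosine similarity can be sketched as follows. The vectors are assumed to have already been produced by some text-to-vector model such as model M2; the function names `cosine` and `most_similar` are hypothetical, and plain Python lists stand in for whatever vector type an actual system would use.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_vec, candidate_vecs):
    """Return (index, score) of the candidate closest to the query vector."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Here `query_vec` plays the role of vector VC1 and `candidate_vecs` the vectors of every piece of first character information in the dataset; the returned index identifies the similar character information.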

そして、情報処理装置１００は、第１文字情報ＬＤ１中の固有表現「アーティストＡ」を、第２文字情報ＵＤ１中の固有表現「Ｘ曜日の〇〇」に変換する。これにより、情報処理装置１００は、第１文字情報ＬＤ１中の第１文字列である「アーティストＡ」が第２文字列である「Ｘ曜日の〇〇」に変換された変更文字情報ＣＤ１を生成する。図２の例では、情報処理装置１００は、固有表現の一例である組織名のラベル部分に「Ｘ曜日の〇〇」が配置された変更文字情報ＣＤ１を生成する。これにより、情報処理装置１００は、自動的（人工的）に作成された新しい学習用データを用いてモデルを学習することができる。 Then, the information processing device 100 converts the named entity "artist A" in the first character information LD1 into the named entity "XX on the Xth day of the week" in the second character information UD1. As a result, the information processing device 100 generates changed character information CD1 in which "artist A", the first character string in the first character information LD1, is converted into the second character string "XX on the Xth day of the week". In the example of FIG. 2, the information processing device 100 generates changed character information CD1 in which "XX on the Xth day of the week" is placed in the label portion of the organization name, which is an example of a named entity. As a result, the information processing device 100 can train a model using new learning data that has been automatically (artificially) created.
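The replacement step can be illustrated with a character-offset span whose label is carried over to the new string. This is a minimal sketch that handles one entity per text; the `Span` type, the half-open offset convention, and the label name "ORG" used in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "ORG" for an organization name


def swap_entity(text: str, span: Span, new_entity: str):
    """Replace the labeled span with new_entity and keep its type label."""
    new_text = text[:span.start] + new_entity + text[span.end:]
    new_span = Span(span.start, span.start + len(new_entity), span.label)
    return new_text, new_span
```

For example, swapping the span covering "Artist A" in "Artist A released a new album." for "Band B" yields a new sentence in which "Band B" carries the same organization-name label, which is the changed character information used as extra training data.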

なお、上記の処理は一例に過ぎず、情報処理装置１００は、変更文字情報を生成可能であれば、どのような処理を行ってもよい。例えば、情報処理装置１００は、学習用データセットＤＳ１から一の第１文字情報を選択し、コンテンツＣＴの中から、選択した第１文字情報（選択第１文字情報）に類似する第２文字情報を選択してもよい。この場合、情報処理装置１００は、選択第１文字情報に類似する第２文字情報がない場合、選択第１文字情報を処理対象から除外してもよい。例えば、情報処理装置１００は、選択第１文字情報との類似度が所定値以上の第２文字情報がない場合、選択第１文字情報を処理対象から除外してもよい。 Note that the above process is merely an example, and the information processing device 100 may perform any process as long as it is capable of generating changed character information. For example, the information processing device 100 may select one piece of first character information from the learning dataset DS1, and select second character information similar to the selected first character information (selected first character information) from the content CT. In this case, if there is no second character information similar to the selected first character information, the information processing device 100 may exclude the selected first character information from the processing target. For example, if there is no second character information whose similarity to the selected first character information is equal to or greater than a predetermined value, the information processing device 100 may exclude the selected first character information from the processing target.
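The reverse direction with a similarity threshold can be sketched like this. The default threshold of 0.8 and the function names are hypothetical; the text only says "a predetermined value".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_second_text(first_vec, second_vecs, threshold=0.8):
    """Return the index of the best match, or None if none reaches the threshold.

    A None result means the selected first character information is
    excluded from processing.
    """
    best_idx, best_score = None, threshold
    for i, vec in enumerate(second_vecs):
        score = cosine(first_vec, vec)
        if score >= best_score:
            best_idx, best_score = i, score
    return best_idx
```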

〔１－２．その他例〕 [1-2. Other examples]
上述した処理は一例に過ぎず、情報処理システム１は、様々な処理を行ってもよい。例えば、上述した処理では、固有表現を抽出するモデルを学習する場合を示したが、学習されるモデルは、固有表現を抽出するモデルに限られない。例えば、モデルは、入力された記事等のテキストについて、キーワード抽出して、主題や人工物名等を見つけて、ウィキなどのリンクをはるモデルであってもよい。また、例えば、モデルは、人名、クレジットカードの番号等の個人情報のマスキングするために用いるモデルであってもよい。また、キーワード関連の処理を行うためのモデルであれば、任意のモデルが採用可能である。 The above-described process is merely an example, and the information processing system 1 may perform various processes. For example, in the above-described process, a case where a model that extracts named entities is learned is shown, but the model to be learned is not limited to a model that extracts named entities. For example, the model may be a model that extracts keywords from text such as an input article, finds themes and names of artifacts, and provides links to wikis, etc. Also, for example, the model may be a model used to mask personal information such as people's names and credit card numbers. Any model can be adopted as long as it is a model for performing keyword-related processing.

例えば、情報処理システム１は、置換する文字列と類似する文字列を見つけて、置き換えることで学習データを拡張してもよい。例えば、情報処理システム１は、インターネット百科事典でのリンクの類似性が高いコンテンツやインターネット百科事典の記事内容が近いコンテンツを対象としてもよい。例えば、情報処理システム１は、ｗ２ｖやｓ２ｖ等のベクトル化に関する任意の技術を用いて、名詞だけベクトル化して、平均化してもよい。また、情報処理システム１は、要約を作ってベクトル化してもよい。 For example, the information processing system 1 may expand the learning data by finding a string similar to the string to be replaced and replacing it. For example, the information processing system 1 may target content with high link similarity in an Internet encyclopedia or content with similar article content in an Internet encyclopedia. For example, the information processing system 1 may vectorize only nouns using any vectorization technology such as w2v or s2v, and average them. The information processing system 1 may also create a summary and vectorize it.
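The "vectorize only the nouns and average them" idea can be sketched as below. The part-of-speech tags and the word-embedding table are assumed to come from external tools (a morphological analyzer and a word2vec-style model); the toy dictionary and the tag name "NOUN" are placeholders for illustration.

```python
def mean_noun_vector(tagged_tokens, embeddings):
    """Average the embeddings of the nouns in a tokenized sentence.

    tagged_tokens: list of (token, part_of_speech) pairs.
    embeddings: dict mapping token -> list of floats.
    Returns None when no noun has an embedding.
    """
    vectors = [
        embeddings[token]
        for token, pos in tagged_tokens
        if pos == "NOUN" and token in embeddings
    ]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
```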

例えば、情報処理システム１は、学習データ内の各単語（組織名、人工物等）を、インターネット百科事典の同一ページ内から抽出して組み替えることで、新しい変更文字情報を生成してもよい。また、文字情報は、説明文章が含まれていればよく、訓練データの元と、置き換える元のデータとは違うものであってもよい。また、情報処理システム１は、適用したいカテゴリごとにモデルを作ってもよい。例えば、情報処理システム１は、日本の音楽ユニット等、インターネット百科事典のカテゴリごとに学習用データを生成し、カテゴリごとの学習用データを用いて、カテゴリごとのモデルを生成してもよい。 For example, the information processing system 1 may generate new changed character information by extracting and rearranging each word (organization name, artifact, etc.) in the training data from the same page of the Internet encyclopedia. Furthermore, the character information only needs to contain explanatory text, and the source of the training data may differ from the source of the replacement data. Furthermore, the information processing system 1 may create a separate model for each category to which the model is to be applied. For example, the information processing system 1 may generate learning data for each category of the Internet encyclopedia, such as Japanese music groups, and generate a model for each category using the learning data for that category.

〔２．情報処理装置の構成〕 [2. Configuration of information processing device]
次に、図４を用いて、実施形態に係る情報処理装置１００の構成について説明する。図４は、実施形態に係る情報処理装置１００の構成例を示す図である。図４に示すように、情報処理装置１００は、通信部１１０と、記憶部１２０と、制御部１３０とを有する。なお、情報処理装置１００は、情報処理装置１００の管理者等から各種操作を受け付ける入力部（例えば、キーボードやマウス等）や、各種情報を表示するための表示部（例えば、液晶ディスプレイ等）を有してもよい。 Next, the configuration of the information processing device 100 according to the embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the configuration of the information processing device 100 according to the embodiment. As shown in FIG. 4, the information processing device 100 has a communication unit 110, a storage unit 120, and a control unit 130. Note that the information processing device 100 may have an input unit (e.g., a keyboard, a mouse, etc.) that accepts various operations from an administrator of the information processing device 100, and a display unit (e.g., a liquid crystal display, etc.) that displays various information.

（通信部１１０） (Communication unit 110)
通信部１１０は、例えば、ＮＩＣ（Network Interface Card）等によって実現される。そして、通信部１１０は、所定の通信網（ネットワーク）と有線または無線で接続され、端末装置１０との間で情報の送受信を行う。 The communication unit 110 is realized by, for example, a network interface card (NIC) or the like. The communication unit 110 is connected to a predetermined communication network by wire or wirelessly, and transmits and receives information to and from the terminal device 10.

（記憶部１２０） (Storage unit 120)
記憶部１２０は、例えば、ＲＡＭ（Random Access Memory）、フラッシュメモリ（Flash Memory）等の半導体メモリ素子、または、ハードディスク、光ディスク等の記憶装置によって実現される。実施形態に係る記憶部１２０は、図４に示すように、学習用データ記憶部１２１と、モデル情報記憶部１２２と、コンテンツ情報記憶部１２３とを有する。 The storage unit 120 is realized by, for example, a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disk. As shown in FIG. 4, the storage unit 120 according to the embodiment has a learning data storage unit 121, a model information storage unit 122, and a content information storage unit 123.

（学習用データ記憶部１２１） (Learning data storage unit 121)
実施形態に係る学習用データ記憶部１２１は、学習に用いるデータに関する各種情報を記憶する。学習用データ記憶部１２１は、学習に用いる学習データ（データセット）を記憶する。図５は、実施形態に係る学習用データ記憶部の一例を示す図である。例えば、学習用データ記憶部１２１は、学習に用いる学習データや精度評価（測定）に用いる評価用データ等の種々のデータに関する各種情報を記憶する。図５に、実施形態に係る学習用データ記憶部１２１の一例を示す。図５の例では、学習用データ記憶部１２１は、「データセットＩＤ」、「データＩＤ」、「データ」、「ラベル」、「日時」といった項目が含まれる。 The learning data storage unit 121 according to the embodiment stores various information related to data used for learning. The learning data storage unit 121 stores learning data (dataset) used for learning. FIG. 5 is a diagram illustrating an example of the learning data storage unit according to the embodiment. For example, the learning data storage unit 121 stores various information related to various data such as learning data used for learning and evaluation data used for accuracy evaluation (measurement). FIG. 5 illustrates an example of the learning data storage unit 121 according to the embodiment. In the example of FIG. 5, the learning data storage unit 121 includes items such as "dataset ID", "data ID", "data", "label", and "date and time".

「データセットＩＤ」は、データセットを識別するための識別情報を示す。「データＩＤ」は、データを識別するための識別情報を示す。また、「データ」は、データＩＤにより識別されるデータに対応するデータを示す。 "Dataset ID" indicates identification information for identifying a dataset. "Data ID" indicates identification information for identifying a piece of data. "Data" indicates the data itself that is identified by the data ID.

「ラベル」は、対応するデータに付されるラベル（正解ラベル）を示す。例えば、「ラベル」は、対応するデータ（文字情報）中の各文字列がどの種別に該当するかを示す情報（正解情報）であってもよい。例えば、「ラベル」は、文字情報のうち所定の種別に該当する文字列が含まれる位置（範囲）を示す正解情報である。例えば、「ラベル」は、文字情報のうち組織名等の固有名詞（固有表現）に該当する文字列を示す正解情報であってもよい。また、「ラベル」は、人名や地名などといった固有名詞、日付表現、時間表現等の固有表現に該当する文字列を示す正解情報であってもよい。 A "label" refers to a label (correct label) that is attached to the corresponding data. For example, a "label" may be information (correct answer information) that indicates which type each character string in the corresponding data (character information) corresponds to. For example, a "label" is correct answer information that indicates the position (range) in the character information where a character string that corresponds to a specific type is included. For example, a "label" may be correct answer information that indicates a character string in the character information that corresponds to a proper noun (named entity) such as an organization name. In addition, a "label" may be correct answer information that indicates a character string that corresponds to a named entity such as a proper noun such as a person's name or place name, a date expression, or a time expression.
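One common way to store a label that marks "the position (range) of a character string of a given type" is a per-character BIO tag sequence. The sketch below is an assumption for illustration; the text does not say which label encoding the embodiment uses, and the type name "ORG" is hypothetical.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into per-character BIO tags.

    start is inclusive and end is exclusive. Characters outside every
    span are tagged "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tags
```

Character-level tags avoid the need for word segmentation, which is convenient for Japanese text, although token-level tagging is equally possible.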

また、「日時」は、対応するデータに関する時間（日時）を示す。なお、図５の例では、「ＤＡ１」等で図示するが、「日時」には、「２０２１年８月１２日１７時４８分３７秒」等の具体的な日時であってもよいし、「バージョンＸＸのモデル学習から使用開始」等、そのデータがどのモデルの学習から使用が開始されたかを示す情報が記憶されてもよい。 In addition, "date and time" indicates the time (date and time) related to the corresponding data. Note that in the example of FIG. 5, "DA1" and the like are illustrated, but "date and time" may be a specific date and time such as "August 12, 2021, 17:48:37", or information indicating which model learning the data started to be used from, such as "Use started from model learning of version XX", may be stored.

図５の例では、データセットＩＤ「ＤＳ１」により識別されるデータセット（データセットＤＳ１）には、データＩＤ「ＤＩＤ１」、「ＤＩＤ２」、「ＤＩＤ３」等により識別される複数のデータが含まれることを示す。例えば、データＩＤ「ＤＩＤ１」、「ＤＩＤ２」、「ＤＩＤ３」等により識別される各データ（学習用データ）は、モデルの学習に用いられる文字情報（文字データ）等である。 The example in FIG. 5 shows that the dataset (dataset DS1) identified by the dataset ID "DS1" includes multiple data identified by data IDs "DID1", "DID2", "DID3", etc. For example, each data (learning data) identified by the data IDs "DID1", "DID2", "DID3", etc. is character information (character data) used for model training, etc.

例えば、データＩＤ「ＤＩＤ１」により識別されるデータＤＴ１は、ラベルＬＢ１が付されたラベル有りデータであり、日時ＤＡ１でのモデルの学習から使用が開始されたことを示す。また、例えば、データＩＤ「ＤＩＤ４」により識別されるデータＤＴ４は、ラベル無しデータとして取集され、予測ラベルであるラベルＬＢ４が付されたデータであり、日時ＤＡ４でのモデルの学習から使用が開始されたことを示す。 For example, data DT1 identified by data ID "DID1" is labeled data to which label LB1 is attached, and its use began with the model training at date and time DA1. Likewise, data DT4 identified by data ID "DID4" was collected as unlabeled data and then given label LB4, which is a predicted label, and its use began with the model training at date and time DA4.

なお、学習用データ記憶部１２１は、上記に限らず、目的に応じて種々の情報を記憶してもよい。例えば、学習用データ記憶部１２１は、各データが学習用データであるか、評価用データであるか等を特定可能に記憶してもよい。例えば、学習用データ記憶部１２１は、学習用データと評価用データとを区別可能に記憶する。学習用データ記憶部１２１は、各データが学習用データや評価用データであるかを識別する情報を記憶してもよい。情報処理装置１００は、学習用データとして用いられる各データと正解情報とに基づいて、モデルを学習する。情報処理装置１００は、評価用データとして用いられる各データと正解情報とに基づいて、モデルの精度を算出する。情報処理装置１００は、評価用データを入力した場合にモデルが出力する出力結果と、正解情報とを比較した結果を収集することにより、モデルの精度を算出する。 The learning data storage unit 121 may store various information according to the purpose, without being limited to the above. For example, the learning data storage unit 121 may store data in a manner that allows each piece of data to be identified as learning data or evaluation data. For example, the learning data storage unit 121 stores learning data and evaluation data in a manner that allows them to be distinguished. The learning data storage unit 121 may store information that identifies whether each piece of data is learning data or evaluation data. The information processing device 100 learns a model based on each piece of data used as learning data and the correct answer information. The information processing device 100 calculates the accuracy of the model based on each piece of data used as evaluation data and the correct answer information. The information processing device 100 calculates the accuracy of the model by collecting a result of comparing the output result output by the model when the evaluation data is input with the correct answer information.
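The accuracy calculation described above can be sketched as an exact-match comparison over the evaluation data. The exact-match criterion and the callable `model` interface are assumptions; the text only states that model outputs are compared with the correct answer information.

```python
def accuracy(model, eval_items):
    """Fraction of evaluation items whose model output equals the gold answer.

    model: any callable mapping input text to a predicted output.
    eval_items: list of (text, gold_output) pairs.
    """
    if not eval_items:
        return 0.0
    correct = sum(1 for text, gold in eval_items if model(text) == gold)
    return correct / len(eval_items)
```

For named entity extraction, span-level precision, recall, and F1 are often reported instead of exact-match accuracy; the structure of the comparison is the same.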

（モデル情報記憶部１２２） (Model information storage unit 122)
実施形態に係るモデル情報記憶部１２２は、モデルに関する情報を記憶する。例えば、モデル情報記憶部１２２は、学習処理により学習（生成）された学習済みモデル（モデル）の情報（モデルデータ）を記憶する。図６は、実施形態に係るモデル情報記憶部の一例を示す図である。図６に示した例では、モデル情報記憶部１２２は、「モデルＩＤ」、「用途」、「モデルデータ」といった項目が含まれる。 The model information storage unit 122 according to the embodiment stores information about a model. For example, the model information storage unit 122 stores information (model data) of a trained model (model) trained (generated) by a learning process. FIG. 6 is a diagram illustrating an example of a model information storage unit according to the embodiment. In the example shown in FIG. 6, the model information storage unit 122 includes items such as "model ID", "purpose", and "model data".

「モデルＩＤ」は、モデルを識別するための識別情報を示す。「用途」は、対応するモデルの用途を示す。「モデルデータ」は、モデルのデータを示す。図６等では「モデルデータ」に「ＭＤＴ１」といった概念的な情報が格納される例を示したが、実際には、モデルの構成（ネットワーク構成）の情報やパラメータに関する情報等、そのモデルを構成する種々の情報が含まれる。例えば、「モデルデータ」には、ネットワークの各層におけるノードと、各ノードが採用する関数と、ノードの接続関係と、ノード間の接続に対して設定される接続係数とを含む情報が含まれる。 "Model ID" indicates identification information for identifying a model. "Purpose" indicates the purpose of the corresponding model. "Model data" indicates the data of the model. FIG. 6 and the like show an example in which conceptual information such as "MDT1" is stored in "model data", but in practice it contains the various information that makes up the model, such as information on the model configuration (network configuration) and information on parameters. For example, "model data" includes information on the nodes in each layer of the network, the function employed by each node, the connection relationships between the nodes, and the connection coefficients set for the connections between the nodes.
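To make the contents of "model data" concrete, the sketch below stores each layer's connection coefficients and activation function and uses them for a forward pass. The `Layer` type, the fully connected layout, and the set of activation names are assumptions for illustration, not the format used by the embodiment.

```python
import math
from dataclasses import dataclass


@dataclass
class Layer:
    # weights[j][i] is the connection coefficient from input i to node j.
    weights: list[list[float]]
    activation: str  # name of the function each node in this layer applies


ACTIVATIONS = {
    "identity": lambda x: x,
    "relu": lambda x: max(0.0, x),
    "tanh": math.tanh,
}


def forward(layers, inputs):
    """Propagate inputs through the layers described by the model data."""
    values = inputs
    for layer in layers:
        func = ACTIVATIONS[layer.activation]
        values = [
            func(sum(w * x for w, x in zip(row, values)))
            for row in layer.weights
        ]
    return values
```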

図６に示す例では、モデルＩＤ「Ｍ１」により識別されるモデル（モデルＭ１）は、用途が「固有表現抽出」であることを示す。すなわち、モデルＭ１は、入力された文字情報中で固有表現に該当する文字列を示す情報（文字列等）を出力するモデルであることを示す。また、モデルＭ１のモデルデータは、モデルデータＭＤＴ１であることを示す。 The example shown in FIG. 6 indicates that the purpose of the model identified by the model ID "M1" (model M1) is "named entity extraction." In other words, model M1 is a model that outputs information (such as a character string) indicating the character strings in the input character information that correspond to named entities. The example also indicates that the model data of model M1 is model data MDT1.

また、モデルＩＤ「Ｍ２」により識別されるモデル（モデルＭ２）は、用途が「ベクトル変換」であることを示す。すなわち、モデルＭ２は、入力された情報（例えば文字情報）をベクトル変換したベクトルを出力するモデルであることを示す。モデルＭ２のモデルデータは、モデルデータＭＤＴ２であることを示す。 The example also indicates that the purpose of the model identified by the model ID "M2" (model M2) is "vector conversion." In other words, model M2 is a model that outputs a vector obtained by converting input information (e.g., character information) into a vector. The model data of model M2 is model data MDT2.

なお、モデル情報記憶部１２２は、上記に限らず、目的に応じて種々の情報を記憶してもよい。 The model information storage unit 122 may store various types of information depending on the purpose, not limited to the above.

（コンテンツ情報記憶部１２３） (Content information storage unit 123)
実施形態に係るコンテンツ情報記憶部１２３は、コンテンツに関する各種情報を記憶する。例えば、コンテンツ情報記憶部１２３は、第２文字情報の抽出対象となるコンテンツに関する各種情報を記憶する。例えば、コンテンツ情報記憶部１２３は、インターネット上で提供される所定のコンテンツの情報を記憶する。例えば、コンテンツ情報記憶部１２３は、所定の対象を解説する解説コンテンツの情報を記憶する。例えば、コンテンツ情報記憶部１２３は、インターネット百科事典内のコンテンツの情報を記憶する。例えば、コンテンツ情報記憶部１２３は、インターネット百科事典に関するサービスをユーザに提供するサービス提供装置から受信したコンテンツの情報を記憶する。 The content information storage unit 123 according to the embodiment stores various information related to the content. For example, the content information storage unit 123 stores various information related to the content from which the second character information is extracted. For example, the content information storage unit 123 stores information on a specific content provided on the Internet. For example, the content information storage unit 123 stores information on an explanation content that explains a specific subject. For example, the content information storage unit 123 stores information on a content in an Internet encyclopedia. For example, the content information storage unit 123 stores information on a content received from a service providing device that provides a user with a service related to the Internet encyclopedia.

コンテンツ情報記憶部１２３は、所定のコンテンツから抽出された第２文字情報を記憶する。コンテンツ情報記憶部１２３は、インターネット上で提供される所定のコンテンツから抽出された第２文字情報を記憶する。コンテンツ情報記憶部１２３は、所定の対象を解説する解説コンテンツから抽出された第２文字情報を記憶する。コンテンツ情報記憶部１２３は、第１文字列が示す対象とは異なる対象を解説する解説コンテンツから抽出された第２文字情報を記憶する。コンテンツ情報記憶部１２３は、インターネット百科事典内のコンテンツから抽出された第２文字情報を記憶する。 The content information storage unit 123 stores second character information extracted from specified content. The content information storage unit 123 stores second character information extracted from specified content provided on the Internet. The content information storage unit 123 stores second character information extracted from commentary content that explains a specified subject. The content information storage unit 123 stores second character information extracted from commentary content that explains a subject different from the subject indicated by the first character string. The content information storage unit 123 stores second character information extracted from content in an Internet encyclopedia.

なお、上記は一例に過ぎず、コンテンツ情報記憶部１２３は、様々なコンテンツ等の情報を記憶してもよい。 Note that the above is merely an example, and the content information storage unit 123 may store information on various contents, etc.

（制御部１３０） (Control unit 130)
図４の説明に戻って、制御部１３０は、コントローラ（controller）であり、例えば、ＣＰＵ（Central Processing Unit）やＭＰＵ（Micro Processing Unit）等によって、情報処理装置１００内部の記憶装置に記憶されている各種プログラム（情報処理プログラムの一例に相当）がＲＡＭを作業領域として実行されることにより実現される。また、制御部１３０は、コントローラであり、例えば、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）等の集積回路により実現される。 Returning to the explanation of FIG. 4, the control unit 130 is a controller, and is realized, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like, executing various programs (corresponding to an example of an information processing program) stored in a storage device inside the information processing device 100 using a RAM as a working area. The control unit 130 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

図４に示すように、制御部１３０は、取得部１３１と、生成部１３２と、学習部１３３と、処理部１３４と、提供部１３５とを有し、以下に説明する情報処理の機能や作用を実現または実行する。なお、制御部１３０の内部構成は、図４に示した構成に限られず、後述する情報処理を行う構成であれば他の構成であってもよい。また、制御部１３０が有する各処理部の接続関係は、図４に示した接続関係に限られず、他の接続関係であってもよい。 As shown in FIG. 4, the control unit 130 has an acquisition unit 131, a generation unit 132, a learning unit 133, a processing unit 134, and a provision unit 135, and realizes or executes the functions and actions of the information processing described below. Note that the internal configuration of the control unit 130 is not limited to the configuration shown in FIG. 4, and may be other configurations as long as they perform the information processing described below. Also, the connection relationship between each processing unit of the control unit 130 is not limited to the connection relationship shown in FIG. 4, and may be other connection relationships.

（取得部１３１） (Acquisition unit 131)
取得部１３１は、通信部１１０を介して、外部の情報処理装置から各種情報を受信する。取得部１３１は、端末装置１０から各種情報を受信する。取得部１３１は、端末装置１０から受信したラベル付きの訓練データ（文字情報）を学習用データとして記憶部１２０へ格納する。取得部１３１は、端末装置１０から受信したラベル付き文字情報を、モデルの学習に用いるデータ（学習データ）として学習用データ記憶部１２１に登録する。また、取得部１３１は、インターネット百科事典に関するサービスをユーザに提供するサービス提供装置からコンテンツの情報を受信する。取得部１３１は、サービス提供装置から第２文字情報を受信してもよい。 The acquisition unit 131 receives various information from an external information processing device via the communication unit 110. The acquisition unit 131 receives various information from the terminal device 10. The acquisition unit 131 stores the labeled training data (character information) received from the terminal device 10 in the storage unit 120 as learning data. The acquisition unit 131 registers the labeled character information received from the terminal device 10 in the learning data storage unit 121 as data used for learning a model (learning data). The acquisition unit 131 also receives content information from a service providing device that provides a service related to an Internet encyclopedia to a user. The acquisition unit 131 may receive second character information from the service providing device.

取得部１３１は、記憶部１２０から各種の情報を取得する。取得部１３１は、学習用データ記憶部１２１から学習に用いるデータを取得する。取得部１３１は、モデル情報記憶部１２２からモデルの情報を取得する。 The acquisition unit 131 acquires various types of information from the storage unit 120. The acquisition unit 131 acquires data to be used for learning from the learning data storage unit 121. The acquisition unit 131 acquires model information from the model information storage unit 122.

取得部１３１は、所定のコンテンツから抽出された第２文字情報を取得する。取得部１３１は、インターネット上で提供される所定のコンテンツから抽出された第２文字情報を取得する。取得部１３１は、所定の対象を解説する解説コンテンツから抽出された第２文字情報を取得する。取得部１３１は、第１文字列が示す対象とは異なる対象を解説する解説コンテンツから抽出された第２文字情報を取得する。取得部１３１は、インターネット百科事典内のコンテンツから抽出された第２文字情報を取得する。 The acquisition unit 131 acquires second character information extracted from specified content. The acquisition unit 131 acquires second character information extracted from specified content provided on the Internet. The acquisition unit 131 acquires second character information extracted from commentary content that explains a specified object. The acquisition unit 131 acquires second character information extracted from commentary content that explains an object different from the object indicated by the first character string. The acquisition unit 131 acquires second character information extracted from content in an Internet encyclopedia.

（生成部１３２） (Generation unit 132)
生成部１３２は、各種情報を生成する。例えば、生成部１３２は、記憶部１２０に記憶された情報（データ）から各種情報（データ）を生成する。生成部１３２は、生成した情報を記憶部１２０に登録する。例えば、生成部１３２は、学習用データ記憶部１２１や、モデル情報記憶部１２２や、コンテンツ情報記憶部１２３等に記憶された情報（データ）から各種情報を生成する。 The generation unit 132 generates various information. For example, the generation unit 132 generates various information (data) from information (data) stored in the storage unit 120. The generation unit 132 registers the generated information in the storage unit 120. For example, the generation unit 132 generates various information from information (data) stored in the learning data storage unit 121, the model information storage unit 122, the content information storage unit 123, etc.

生成部１３２は、各種情報を選択する。生成部１３２は、学習用データセットから、所定の条件を満たす文字情報を選択する。生成部１３２は、学習用データセットから、第２文字情報との類似度に基づいて類似文字情報を選択する。生成部１３２は、学習用データセットのうち、第２文字情報との類似度が最大である第１文字情報を類似文字情報として選択する。生成部１３２は、学習用データセット中の各第１文字情報がベクトル化された第１ベクトルの各々と、第２文字情報がベクトル化された第２ベクトルとの類似度に基づいて、類似文字情報を選択する。 The generation unit 132 selects various information. The generation unit 132 selects character information that satisfies a predetermined condition from the training dataset. The generation unit 132 selects similar character information from the training dataset based on the similarity with the second character information. The generation unit 132 selects, from the training dataset, first character information that has the greatest similarity with the second character information as similar character information. The generation unit 132 selects similar character information based on the similarity between each of first vectors obtained by vectorizing each piece of first character information in the training dataset and a second vector obtained by vectorizing the second character information.

生成部１３２は、各種情報を推定する。生成部１３２は、文字情報に含まれる文字列のうち新語を推定する。例えば、生成部１３２は、各解説コンテンツが説明する対象のうち、学習用データセットＤＳ１中の第１文字情報に固有表現として含まれない対象を新語であると推定する。そして、生成部１３２は、説明する対象が新語であると推定したコンテンツを新語コンテンツとして選択する。生成部１３２は、新語コンテンツから第２文字情報を生成する。 The generation unit 132 estimates various information. The generation unit 132 estimates new words from character strings included in the character information. For example, the generation unit 132 estimates that, among the objects explained by each explanatory content, an object that is not included as a named entity in the first character information in the learning dataset DS1 is a new word. The generation unit 132 then selects the content in which the object to be explained is estimated to be a new word as new word content. The generation unit 132 generates second character information from the new word content.

生成部１３２は、類似文字情報中の第１文字列を、第２文字情報中の第２文字列に変更することにより、変更文字情報を生成する。生成部１３２は、第１文字列が所定の種別に該当することを示す種別ラベルを第２文字列の種別ラベルとする変更文字情報を生成する。生成部１３２は、類似文字情報中の固有表現である第１文字列を、第２文字情報中の固有表現である第２文字列に変更することにより、変更文字情報を生成する。 The generating unit 132 generates changed character information by changing a first character string in the similar character information to a second character string in the second character information. The generating unit 132 generates changed character information in which a type label indicating that the first character string corresponds to a predetermined type is set as a type label of the second character string. The generating unit 132 generates changed character information by changing a first character string that is a named entity in the similar character information to a second character string that is a named entity in the second character information.

（学習部１３３） (Learning unit 133)
学習部１３３は、モデルを学習する。学習部１３３は、外部の情報処理装置からの情報や記憶部１２０に記憶された情報に基づいて、各種情報を学習する。学習部１３３は、学習用データ記憶部１２１に記憶された情報に基づいて、各種情報を学習する。学習部１３３は、学習により生成したモデルをモデル情報記憶部１２２に格納する。 The learning unit 133 learns a model. The learning unit 133 learns various pieces of information based on information from an external information processing device and information stored in the storage unit 120. The learning unit 133 learns various pieces of information based on information stored in the learning data storage unit 121. The learning unit 133 stores the model generated by learning in the model information storage unit 122.

学習部１３３は、生成部１３２が生成した文字情報を含む学習用データを用いてモデルを学習する。学習部１３３は、生成部１３２により生成された変更文字情報を用いた機械学習の処理により、モデルを学習する。学習部１３３は、文字情報の入力に応じて、当該文字情報に抽出対象文字列が含まれる場合、抽出対象文字列を出力するモデルを学習する。学習部１３３は、変更文字情報から第２文字列が抽出されるようにモデルを学習する。 The learning unit 133 learns a model using learning data including character information generated by the generation unit 132. The learning unit 133 learns a model by a machine learning process using the changed character information generated by the generation unit 132. In response to input of character information, the learning unit 133 learns a model that outputs a character string to be extracted when the character information includes the character string to be extracted. The learning unit 133 learns a model so that a second character string is extracted from the changed character information.

学習部１３３は、学習処理を行う。学習部１３３は、各種学習を行う。学習部１３３は、取得部１３１により取得された情報に基づいて、各種情報を学習する。学習部１３３は、モデルを学習（生成）する。学習部１３３は、モデル等の各種情報を学習する。学習部１３３は、学習によりモデルを生成する。学習部１３３は、種々の機械学習に関する技術を用いて、モデルを学習する。例えば、学習部１３３は、モデル（ネットワーク）のパラメータを学習する。学習部１３３は、種々の機械学習に関する技術を用いて、モデルを学習する。 The learning unit 133 performs a learning process. The learning unit 133 performs various learning. The learning unit 133 learns various information based on the information acquired by the acquisition unit 131. The learning unit 133 learns (generates) a model. The learning unit 133 learns various information such as a model. The learning unit 133 generates a model through learning. The learning unit 133 learns the model using various machine learning techniques. For example, the learning unit 133 learns the parameters of the model (network). The learning unit 133 learns the model using various machine learning techniques.

学習部１３３は、学習用データ記憶部１２１に記憶された学習用データ（教師データ）に基づいて、学習処理を行う。学習部１３３は、モデル（ネットワーク）のパラメータを学習する。学習部１３３は、接続されたノード間の接続係数（重み）等のパラメータを学習する。学習部１３３は、種々の機械学習に関する技術を用いて、モデルを学習する。学習部１３３は、モデルに入力するデータと、そのデータが入力された場合の出力を示す正解データとを用いて行う学習処理、すなわち教師有り学習の手法によりモデルのパラメータを学習する。なお、上記は一例であり、学習部１３３は、モデルのパラメータを学習可能であれば、どのような学習処理により、モデルのパラメータを学習してもよい。 The learning unit 133 performs a learning process based on the learning data (teacher data) stored in the learning data storage unit 121. The learning unit 133 learns the parameters of the model (network). The learning unit 133 learns parameters such as the connection coefficients (weights) between connected nodes. The learning unit 133 learns the model using various machine learning techniques. The learning unit 133 learns the model parameters by a learning process performed using data to be input to the model and correct answer data indicating the output when that data is input, that is, a supervised learning method. Note that the above is just one example, and the learning unit 133 may learn the model parameters by any learning process as long as it is possible to learn the model parameters.
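Supervised learning of connection weights from pairs of input data and correct answers can be illustrated with the smallest possible case, a single-node perceptron. This is a deliberately simplified stand-in chosen for illustration, not the network or update rule of the embodiment; the function names and the toy AND dataset are assumptions.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted sum exceeds zero, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Learn weights from (input_vector, correct_label) pairs.

    Each mistake moves the weights toward the correct answer, which is
    the basic idea of supervised parameter learning.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias
```

Training on the four input pairs of the logical AND function, for instance, produces weights that classify all four pairs correctly; multi-layer networks replace this update rule with gradient-based optimization but keep the same input-and-correct-answer structure.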

学習部１３３は、モデルＭ１を生成する。学習部１３３は、ネットワークのパラメータを学習する。例えば、学習部１３３は、モデルＭ１のネットワークのパラメータを学習する。学習部１３３は、学習用データ記憶部１２１に記憶された学習用データを用いて、学習処理を行うことにより、モデルＭ１を生成する。例えば、学習部１３３は、固有表現抽出に用いられるモデルを生成する。学習部１３３は、モデルＭ１のネットワークのパラメータを学習することにより、モデルＭ１を生成する。 The learning unit 133 generates a model M1. The learning unit 133 learns the parameters of the network. For example, the learning unit 133 learns the parameters of the network of the model M1. The learning unit 133 performs a learning process using the learning data stored in the learning data storage unit 121 to generate the model M1. For example, the learning unit 133 generates a model used for named entity extraction. The learning unit 133 generates the model M1 by learning the parameters of the network of the model M1.

学習部１３３による学習の手法は特に限定されないが、例えば、ラベルとデータ（文字情報）とを紐づけた学習用データを用意し、その学習用データを多層ニューラルネットワークに基づいた計算モデルに入力して学習してもよい。学習部１３３は、再帰型ニューラルネットワーク（ＲＮＮ）やＲＮＮを拡張したＬＳＴＭに基づく手法を用いてもよい。 The method of learning by the learning unit 133 is not particularly limited, but for example, learning data that links labels with data (character information) may be prepared, and the learning data may be input to a computational model based on a multi-layer neural network for learning. The learning unit 133 may use a method based on a recurrent neural network (RNN) or an LSTM that is an extension of an RNN.

例えば、学習部１３３は、Ｓｅｑ２Ｓｅｑ（Sequence to Sequence Model）であるモデルＭ１を学習してもよい。例えば、Ｓｅｑ２Ｓｅｑは、ＲＮＮの一種であるＬＳＴＭを構成要素とするEncoder-Decoderモデルである。例えば、モデルＭ１は、図２の第２文字情報ＵＤ１に対応する文字情報が入力された場合、「Ｘ曜日の〇〇」という文字列を出力する。このように、Ｓｅｑ２ＳｅｑであるモデルＭ１は、第２文字情報ＵＤ１に対応する文字情報が入力されるEncoder側でベクトル化を行い、Decoder側で「Ｘ曜日の〇〇」を出力するようにＲＮＮの学習を行う。 For example, the learning unit 133 may train the model M1 as a Seq2Seq (Sequence to Sequence) model. Seq2Seq is, for example, an encoder-decoder model whose components are LSTMs, a type of RNN. For example, when character information corresponding to the second character information UD1 in FIG. 2 is input, the model M1 outputs the character string 「Ｘ曜日の〇〇」 (a placeholder pattern meaning roughly "〇〇 of weekday X"). In this way, the model M1, which is a Seq2Seq model, vectorizes the character information corresponding to the second character information UD1 on the encoder side, where that information is input, and the RNN is trained so that the decoder side outputs 「Ｘ曜日の〇〇」.

（処理部１３４） (Processing Unit 134)
処理部１３４は、各種の処理を実行する。処理部１３４は、学習部１３３により学習されたモデルＭ１を用いた処理を実行する。処理部１３４は、文字情報をモデルＭ１に入力し、モデルＭ１が出力した文字列を固有表現の文字列とする。例えば、処理部１３４は、文字情報をモデルＭ１に入力し、モデルＭ１が出力した文字列を新語であるとする。 The processing unit 134 executes various processes. The processing unit 134 executes processes using the model M1 learned by the learning unit 133. The processing unit 134 inputs character information to the model M1, and treats the character string output by the model M1 as a character string of a named entity. For example, the processing unit 134 inputs character information to the model M1, and treats the character string output by the model M1 as a new word.

処理部１３４は、文字情報をモデルＭ１に入力し、モデルＭ１が出力した文字列を固有表現の文字列を示す情報を外部装置へ提供部１３５に送信させる。 The processing unit 134 inputs the character information into the model M1 and causes the providing unit 135 to transmit, to an external device, information indicating the named-entity character string output by the model M1.

（提供部１３５） (Providing Unit 135)
提供部１３５は、通信部１１０を介して、外部装置へ情報を送信する。提供部１３５は、ユーザが利用する端末装置１０へ情報提供サービスを提供する。例えば、提供部１３５は、学習部１３３により学習されたモデルＭ１を端末装置１０へ送信する。提供部１３５は、処理部１３４による処理結果を示す情報を端末装置１０へ送信する。 The providing unit 135 transmits information to an external device via the communication unit 110. The providing unit 135 provides an information provision service to the terminal device 10 used by the user. For example, the providing unit 135 transmits the model M1 learned by the learning unit 133 to the terminal device 10. The providing unit 135 transmits information indicating a processing result by the processing unit 134 to the terminal device 10.

提供部１３５は、処理部１３４による処理結果を示す情報を提供する。提供部１３５は、新語を示す情報を端末装置１０に送信する。提供部１３５は、固有表現を示す情報を端末装置１０に送信する。 The providing unit 135 provides information indicating the processing result by the processing unit 134. The providing unit 135 transmits information indicating the new word to the terminal device 10. The providing unit 135 transmits information indicating the named entity to the terminal device 10.

〔３．処理フロー〕 3. Processing Flow
次に、図７を用いて、実施形態に係る情報処理システム１による情報処理の手順について説明する。図７は、実施形態に係る情報処理装置による処理の一例を示すフローチャートである。 Next, a procedure of information processing by the information processing system 1 according to the embodiment will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of processing by the information processing device according to the embodiment.

図７に示すように、情報処理装置１００は、所定の種別に該当する文字列である抽出対象文字列を文字情報から抽出するモデルの学習に用いるためのラベルが付された第１文字情報を含む学習用データセットを取得する（ステップＳ１０１）。 As shown in FIG. 7, the information processing device 100 acquires a learning dataset including first character information that is labeled for use in training a model that extracts, from character information, an extraction target character string, which is a character string corresponding to a predetermined type (step S101).

情報処理装置１００は、ラベルが付されていない文字情報である第２文字情報を取得する（ステップＳ１０２）。情報処理装置１００は、学習用データセットのうち、第２文字情報と類似する第１文字情報を類似文字情報として選択する（ステップＳ１０３）。 The information processing device 100 acquires second character information, which is unlabeled character information (step S102). The information processing device 100 selects first character information that is similar to the second character information from the learning dataset as similar character information (step S103).

情報処理装置１００は、類似文字情報中の抽出対象文字列である第１文字列を、第２文字情報中の抽出対象文字列である第２文字列に変更することにより、第２文字列を含み、モデルの学習に利用可能な変更文字情報を生成する（ステップＳ１０４）。 The information processing device 100 changes the first character string, which is a character string to be extracted in the similar character information, to the second character string, which is a character string to be extracted in the second character information, to generate changed character information that includes the second character string and can be used for learning the model (step S104).
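
Steps S101 to S104 can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: the `LabeledText` class, the function name `generate_changed_text`, the toy sentences, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, and the second character string is assumed to have been identified in the second character information beforehand.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class LabeledText:
    text: str    # character information
    target: str  # extraction target character string inside `text`
    label: str   # type label of the target, e.g. "PRODUCT" (hypothetical)


def generate_changed_text(dataset, second_text, second_target):
    """Steps S103-S104: select the most similar labeled text, then swap the target."""
    # S103: select the first character information most similar to the second.
    # SequenceMatcher is only a stand-in similarity measure (an assumption).
    similar = max(
        dataset,
        key=lambda item: SequenceMatcher(None, item.text, second_text).ratio(),
    )
    # S104: change the first character string to the second character string.
    changed = similar.text.replace(similar.target, second_target)
    return LabeledText(text=changed, target=second_target, label=similar.label)


# S101: labeled learning dataset (toy data invented for illustration).
dataset = [
    LabeledText("Alpha Phone is a smartphone released in 2020.", "Alpha Phone", "PRODUCT"),
    LabeledText("Mount Example is a mountain in the north.", "Mount Example", "LOCATION"),
]
# S102: unlabeled second character information and its extraction target.
second_text = "Beta Phone is a smartphone sold by a hypothetical maker."
second_target = "Beta Phone"

result = generate_changed_text(dataset, second_text, second_target)
print(result)
```

The returned item keeps the sentence frame and the label of the selected first character information and only swaps in the second character string, which is why no manual labeling is needed for the new example.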

〔４．効果〕 4. Effects
上述してきたように、実施形態に係る情報処理装置１００は、取得部１３１と、生成部１３２とを有する。取得部１３１は、所定の種別に該当する文字列である抽出対象文字列を文字情報から抽出するモデルの学習に用いるためのラベルが付された第１文字情報を含む学習用データセットと、ラベルが付されていない文字情報である第２文字情報とを取得する。生成部１３２は、学習用データセットのうち、第２文字情報と類似する第１文字情報を類似文字情報として選択し、類似文字情報中の抽出対象文字列である第１文字列を、第２文字情報中の抽出対象文字列である第２文字列に変更することにより、第２文字列を含み、モデルの学習に利用可能な文字情報である変更文字情報を生成する。 As described above, the information processing device 100 according to the embodiment includes the acquisition unit 131 and the generation unit 132. The acquisition unit 131 acquires a learning dataset including first character information labeled for use in learning a model that extracts an extraction target character string, which is a character string corresponding to a predetermined type, from character information, and second character information that is unlabeled character information. The generation unit 132 selects, from the learning dataset, first character information similar to the second character information as similar character information, and changes the first character string, which is an extraction target character string in the similar character information, to the second character string, which is an extraction target character string in the second character information, thereby generating changed character information that includes the second character string and is character information that can be used for learning the model.

このように、実施形態に係る情報処理装置１００は、既存の学習用データに含まれる文字情報の文字列を変換することで新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by generating new training data by converting character strings of character information contained in existing training data.

また、実施形態に係る情報処理装置１００において、取得部１３１は、所定のコンテンツから抽出された第２文字情報を取得する。 In addition, in the information processing device 100 according to the embodiment, the acquisition unit 131 acquires second character information extracted from predetermined content.

このように、実施形態に係る情報処理装置１００は、所定のコンテンツから抽出された第２文字情報の第２文字列に第１文字情報の第１文字列を変換して新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by converting the first character string of the first character information into the second character string of the second character information extracted from predetermined content to generate new training data.

また、実施形態に係る情報処理装置１００において、取得部１３１は、インターネット上で提供される所定のコンテンツから抽出された第２文字情報を取得する。 In addition, in the information processing device 100 according to the embodiment, the acquisition unit 131 acquires second character information extracted from specific content provided on the Internet.

このように、実施形態に係る情報処理装置１００は、インターネット上で提供される所定のコンテンツから抽出された第２文字情報の第２文字列に第１文字情報の第１文字列を変換して新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by converting the first character string of the first character information into the second character string of the second character information extracted from specific content provided on the Internet to generate new training data.

また、実施形態に係る情報処理装置１００において、取得部１３１は、所定の対象を解説する解説コンテンツから抽出された第２文字情報を取得する。 In addition, in the information processing device 100 according to the embodiment, the acquisition unit 131 acquires second character information extracted from commentary content that explains a specific subject.

このように、実施形態に係る情報処理装置１００は、所定の対象を解説する解説コンテンツから抽出された第２文字情報の第２文字列に第１文字情報の第１文字列を変換して新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by generating new learning data by converting the first character string of the first character information into the second character string of the second character information extracted from commentary content that explains a specific target.

また、実施形態に係る情報処理装置１００において、取得部１３１は、第１文字列が示す対象とは異なる対象を解説する解説コンテンツから抽出された第２文字情報を取得する。 In addition, in the information processing device 100 according to the embodiment, the acquisition unit 131 acquires second character information extracted from commentary content that explains an object different from the object indicated by the first character string.

このように、実施形態に係る情報処理装置１００は、第１文字列が示す対象とは異なる対象を解説する解説コンテンツから抽出された第２文字情報の第２文字列に第１文字情報の第１文字列を変換して新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by converting the first character string of the first character information into the second character string of the second character information extracted from commentary content that explains an object different from the object indicated by the first character string to generate new training data.

また、実施形態に係る情報処理装置１００において、取得部１３１は、インターネット百科事典内のコンテンツから抽出された第２文字情報を取得する。 In addition, in the information processing device 100 according to the embodiment, the acquisition unit 131 acquires second character information extracted from content in the Internet encyclopedia.

このように、実施形態に係る情報処理装置１００は、インターネット百科事典内のコンテンツから抽出された第２文字情報の第２文字列に第１文字情報の第１文字列を変換して新たな学習用データを生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by converting the first character string of the first character information into the second character string of the second character information extracted from the content in the Internet encyclopedia to generate new training data.

また、実施形態に係る情報処理装置１００において、生成部１３２は、学習用データセットから、第２文字情報との類似度に基づいて類似文字情報を選択し、類似文字情報中の第１文字列を、第２文字情報中の第２文字列に変更することにより、変更文字情報を生成する。 In addition, in the information processing device 100 according to the embodiment, the generation unit 132 selects similar character information from the learning dataset based on the similarity with the second character information, and generates changed character information by changing the first character string in the similar character information to the second character string in the second character information.

このように、実施形態に係る情報処理装置１００は、学習用データセットから、第２文字情報との類似度に基づいて類似文字情報を選択して、選択した類似文字情報を用いて変更文字情報を生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by selecting similar character information from the training dataset based on the similarity with the second character information and generating changed character information using the selected similar character information.

また、実施形態に係る情報処理装置１００において、生成部１３２は、学習用データセットのうち、第２文字情報との類似度が最大である第１文字情報を類似文字情報として選択する。 In addition, in the information processing device 100 according to the embodiment, the generation unit 132 selects, from the learning dataset, the first character information that has the highest similarity to the second character information as the similar character information.

このように、実施形態に係る情報処理装置１００は、学習用データセットのうち第２文字情報との類似度が最大である第１文字情報を用いて変更文字情報を生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by generating changed character information using the first character information in the training dataset that has the highest similarity to the second character information.

また、実施形態に係る情報処理装置１００において、生成部１３２は、学習用データセット中の各第１文字情報がベクトル化された第１ベクトルの各々と、第２文字情報がベクトル化された第２ベクトルとの類似度に基づいて、類似文字情報を選択する。 In addition, in the information processing device 100 according to the embodiment, the generation unit 132 selects similar character information based on the similarity between each of the first vectors obtained by vectorizing each of the first character information in the learning dataset and the second vector obtained by vectorizing the second character information.

このように、実施形態に係る情報処理装置１００は、学習用データセットのうちベクトル化した状態で第２文字情報と類似する第１文字情報を用いて変更文字情報を生成することにより、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used for model training by generating changed character information using first character information that is similar to second character information in a vectorized state from the training dataset.
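
The vector-based selection described above, combined with picking the entry of maximum similarity, might look like the following sketch. The embodiment does not fix a vectorization method or a similarity measure in this passage, so the character bigram counts and the cosine similarity used here are assumptions chosen to keep the example self-contained, as are the function names and the toy sentences.

```python
import math
from collections import Counter


def vectorize(text, n=2):
    """Turn text into a sparse vector of character n-gram counts (assumed scheme)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_similar(first_texts, second_text):
    """Return the index of the most similar first text, plus all scores."""
    second_vec = vectorize(second_text)  # second vector
    scores = [cosine(vectorize(t), second_vec) for t in first_texts]  # vs. each first vector
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


first_texts = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
]
best_index, scores = select_similar(first_texts, "the dog sat on the mat")
print(best_index, scores)
```

In practice the vectors could come from any sentence embedding; only the pattern of comparing one second vector against every first vector and taking the maximum is taken from the text. The sketch assumes a non-empty dataset.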

また、実施形態に係る情報処理装置１００において、生成部１３２は、第１文字列が所定の種別に該当することを示す種別ラベルを第２文字列の種別ラベルとする変更文字情報を生成する。 In addition, in the information processing device 100 according to the embodiment, the generation unit 132 generates changed character information in which a type label indicating that the first character string corresponds to a predetermined type is set as a type label of the second character string.

このように、実施形態に係る情報処理装置１００は、学習用データセットから、第１文字列のラベルを第２文字列のラベルとして用いることで、自動的に第２文字列にラベルが付与されるため、モデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment uses the label that the first character string has in the learning dataset as the label of the second character string. The second character string is therefore labeled automatically, and character information that can be used for model training can be generated efficiently.
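
The label carry-over can be illustrated at the token level as below. The text does not specify a label format, so the BIO tagging scheme (`B-`/`I-`/`O`), the function name `transfer_label`, and the sample tokens are assumptions for this sketch. It also assumes the similar character information contains a single labeled span; every labeled span it finds is replaced.

```python
def transfer_label(tokens, tags, new_entity_tokens):
    """Replace the tagged span with new tokens, reusing the span's type label."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            entity_type = tags[i][2:]
            # Skip the whole original span (one B- tag followed by I- tags).
            i += 1
            while i < len(tokens) and tags[i] == f"I-{entity_type}":
                i += 1
            # Emit the second character string with the same type label.
            for k, token in enumerate(new_entity_tokens):
                out_tokens.append(token)
                out_tags.append(("B-" if k == 0 else "I-") + entity_type)
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags


tokens = ["Alpha", "Phone", "is", "a", "smartphone"]
tags = ["B-PRODUCT", "I-PRODUCT", "O", "O", "O"]
new_tokens, new_tags = transfer_label(tokens, tags, ["Beta", "Phone", "Pro"])
print(list(zip(new_tokens, new_tags)))
```

Because the tags are regenerated from the type of the original span, the new string is labeled correctly even when its token count differs from that of the first character string.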

また、実施形態に係る情報処理装置１００において、取得部１３１は、固有表現に該当する抽出対象文字列を文字情報から抽出するモデルの学習に用いられる学習用データセットを取得する。生成部１３２は、類似文字情報中の固有表現である第１文字列を、第２文字情報中の固有表現である第２文字列に変更することにより、変更文字情報を生成する。 In the information processing device 100 according to the embodiment, the acquisition unit 131 acquires a learning dataset used to train a model that extracts extraction target character strings corresponding to named entities from character information. The generation unit 132 generates changed character information by changing a first character string that is a named entity in the similar character information to a second character string that is a named entity in the second character information.

このように、実施形態に係る情報処理装置１００は、文字情報中の固有表現を他の固有表現に変換することで、新たな学習用データを生成することにより、固有表現を抽出するモデルの学習に利用可能な文字情報を効率的に生成することができる。 In this way, the information processing device 100 according to the embodiment can efficiently generate character information that can be used to train a model that extracts named entities by converting named entities in character information into other named entities and generating new learning data.

また、実施形態に係る情報処理装置１００は、学習部１３３を有する。学習部１３３は、生成部１３２により生成された変更文字情報を用いた機械学習の処理により、モデルを学習する。 The information processing device 100 according to the embodiment also includes a learning unit 133. The learning unit 133 learns a model through machine learning processing using the changed character information generated by the generation unit 132.

これにより、実施形態に係る情報処理装置１００は、変更文字情報を用いた機械学習の処理により、モデルを学習することにより、生成した情報を用いて適切にモデルを学習することができる。 As a result, by learning the model through machine learning processing that uses the changed character information, the information processing device 100 according to the embodiment can appropriately learn the model using the generated information.

また、実施形態に係る情報処理装置１００において、学習部１３３は、文字情報の入力に応じて、当該文字情報に抽出対象文字列が含まれる場合、抽出対象文字列を出力するモデルを学習する。 In addition, in the information processing device 100 according to the embodiment, the learning unit 133 learns a model that, given character information as input, outputs the extraction target character string when the input character information contains one.

これにより、実施形態に係る情報処理装置１００は、生成した情報を用いて抽出対象文字列を出力するモデルを学習することができる。 As a result, the information processing device 100 according to the embodiment can learn a model that uses the generated information to output the extraction target string.

また、実施形態に係る情報処理装置１００において、学習部１３３は、変更文字情報から第２文字列が抽出されるようにモデルを学習する。 In addition, in the information processing device 100 according to the embodiment, the learning unit 133 learns a model so that the second character string is extracted from the changed character information.
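
One way to picture the training data implied here is as input/output pairs in which the changed character information is the model input and the second character string is the expected output. The sketch below only prepares such pairs and does not train a model. The function name `build_training_pairs`, the dictionary keys, and the sample sentences are hypothetical.

```python
def build_training_pairs(changed_items):
    """Pair each changed text (model input) with its second string (expected output)."""
    pairs = []
    for text, second_string in changed_items:
        # Keep only examples whose expected output really occurs in the input,
        # since the model is trained to extract it from that input.
        if second_string and second_string in text:
            pairs.append({"input": text, "output": second_string})
    return pairs


changed_items = [
    ("Beta Phone is a smartphone released in 2020.", "Beta Phone"),
    ("Gamma Lake is a mountain in the north.", "Gamma Lake"),
    ("This sentence lost its entity.", "Delta Tower"),  # dropped by the check
]
pairs = build_training_pairs(changed_items)
print(len(pairs))
```

Pairs of this shape could then be fed to a sequence-to-sequence learner such as the Seq2Seq model mentioned for the learning unit 133. The containment check is an added safeguard, not something the text requires.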

〔５．ハードウェア構成〕 5. Hardware Configuration
また、上述した実施形態に係る情報処理装置１００や端末装置１０は、例えば図８に示すような構成のコンピュータ１０００によって実現される。以下、情報処理装置１００を例に挙げて説明する。図８は、ハードウェア構成の一例を示す図である。コンピュータ１０００は、出力装置１０１０、入力装置１０２０と接続され、演算装置１０３０、一次記憶装置１０４０、二次記憶装置１０５０、出力Ｉ／Ｆ（Interface）１０６０、入力Ｉ／Ｆ１０７０、ネットワークＩ／Ｆ１０８０がバス１０９０により接続された形態を有する。 The information processing device 100 and the terminal device 10 according to the above-described embodiment are realized by a computer 1000 having a configuration as shown in FIG. 8, for example. The information processing device 100 will be described below as an example. FIG. 8 is a diagram showing an example of a hardware configuration. The computer 1000 is connected to an output device 1010 and an input device 1020, and has a configuration in which an arithmetic device 1030, a primary storage device 1040, a secondary storage device 1050, an output I/F (Interface) 1060, an input I/F 1070, and a network I/F 1080 are connected by a bus 1090.

演算装置１０３０は、一次記憶装置１０４０や二次記憶装置１０５０に格納されたプログラムや入力装置１０２０から読み出したプログラム等に基づいて動作し、各種の処理を実行する。演算装置１０３０は、例えばＣＰＵ（Central Processing Unit）、ＭＰＵ（Micro Processing Unit）、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）等により実現される。 The arithmetic device 1030 operates based on programs stored in the primary storage device 1040 and the secondary storage device 1050, programs read from the input device 1020, and the like, and executes various processes. The arithmetic device 1030 is realized, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or the like.

一次記憶装置１０４０は、ＲＡＭ（Random Access Memory）等、演算装置１０３０が各種の演算に用いるデータを一次的に記憶するメモリ装置である。また、二次記憶装置１０５０は、演算装置１０３０が各種の演算に用いるデータや、各種のデータベースが登録される記憶装置であり、ＲＯＭ（Read Only Memory）、ＨＤＤ（Hard Disk Drive）、ＳＳＤ（Solid State Drive）、フラッシュメモリ等により実現される。二次記憶装置１０５０は、内蔵ストレージであってもよいし、外付けストレージであってもよい。また、二次記憶装置１０５０は、ＵＳＢメモリやＳＤ（Secure Digital）メモリカード等の取り外し可能な記憶媒体であってもよい。また、二次記憶装置１０５０は、クラウドストレージ（オンラインストレージ）やＮＡＳ（Network Attached Storage）、ファイルサーバ等であってもよい。 The primary storage device 1040 is a memory device, such as a RAM (Random Access Memory), that holds, as primary storage, the data used by the arithmetic device 1030 for various calculations. The secondary storage device 1050 is a storage device in which data used by the arithmetic device 1030 for various calculations and various databases are registered, and is realized by a ROM (Read Only Memory), an HDD (Hard Disk Drive), an SSD (Solid State Drive), a flash memory, or the like. The secondary storage device 1050 may be internal storage or external storage. The secondary storage device 1050 may be a removable storage medium such as a USB memory or an SD (Secure Digital) memory card. The secondary storage device 1050 may also be cloud storage (online storage), a NAS (Network Attached Storage), a file server, or the like.

出力Ｉ／Ｆ１０６０は、ディスプレイ、プロジェクタ、及びプリンタ等といった各種の情報を出力する出力装置１０１０に対し、出力対象となる情報を送信するためのインターフェイスであり、例えば、ＵＳＢ（Universal Serial Bus）やＤＶＩ（Digital Visual Interface）、ＨＤＭＩ（登録商標）（High Definition Multimedia Interface）といった規格のコネクタにより実現される。また、入力Ｉ／Ｆ１０７０は、マウス、キーボード、キーパッド、ボタン、及びスキャナ等といった各種の入力装置１０２０から情報を受信するためのインターフェイスであり、例えば、ＵＳＢ等により実現される。 The output I/F 1060 is an interface for transmitting information to be output to an output device 1010 that outputs various types of information, such as a display, a projector, or a printer, and is realized, for example, by a connector conforming to a standard such as USB (Universal Serial Bus), DVI (Digital Visual Interface), or HDMI (registered trademark) (High Definition Multimedia Interface). The input I/F 1070 is an interface for receiving information from various input devices 1020, such as a mouse, a keyboard, a keypad, buttons, or a scanner, and is realized, for example, by USB or the like.

また、出力Ｉ／Ｆ１０６０及び入力Ｉ／Ｆ１０７０はそれぞれ出力装置１０１０及び入力装置１０２０と無線で接続してもよい。すなわち、出力装置１０１０及び入力装置１０２０は、ワイヤレス機器であってもよい。 In addition, the output I/F 1060 and the input I/F 1070 may be wirelessly connected to the output device 1010 and the input device 1020, respectively. That is, the output device 1010 and the input device 1020 may be wireless devices.

また、出力装置１０１０及び入力装置１０２０は、タッチパネルのように一体化していてもよい。この場合、出力Ｉ／Ｆ１０６０及び入力Ｉ／Ｆ１０７０も、入出力Ｉ／Ｆとして一体化していてもよい。 The output device 1010 and the input device 1020 may be integrated together, such as a touch panel. In this case, the output I/F 1060 and the input I/F 1070 may also be integrated together as an input/output I/F.

なお、入力装置１０２０は、例えば、ＣＤ（Compact Disc）、ＤＶＤ（Digital Versatile Disc）、ＰＤ（Phase change rewritable Disk）等の光学記録媒体、ＭＯ（Magneto-Optical disk）等の光磁気記録媒体、テープ媒体、磁気記録媒体、又は半導体メモリ等から情報を読み出す装置であってもよい。 The input device 1020 may be a device that reads information from, for example, an optical recording medium such as a CD (Compact Disc), a DVD (Digital Versatile Disc), or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

ネットワークＩ／Ｆ１０８０は、ネットワークＮを介して他の機器からデータを受信して演算装置１０３０へ送り、また、ネットワークＮを介して演算装置１０３０が生成したデータを他の機器へ送信する。 The network I/F 1080 receives data from other devices via the network N and sends it to the arithmetic device 1030, and also transmits data generated by the arithmetic device 1030 to other devices via the network N.

演算装置１０３０は、出力Ｉ／Ｆ１０６０や入力Ｉ／Ｆ１０７０を介して、出力装置１０１０や入力装置１０２０の制御を行う。例えば、演算装置１０３０は、入力装置１０２０や二次記憶装置１０５０からプログラムを一次記憶装置１０４０上にロードし、ロードしたプログラムを実行する。 The arithmetic device 1030 controls the output device 1010 and the input device 1020 via the output I/F 1060 and the input I/F 1070. For example, the arithmetic device 1030 loads a program from the input device 1020 or the secondary storage device 1050 onto the primary storage device 1040 and executes the loaded program.

例えば、コンピュータ１０００が情報処理装置１００として機能する場合、コンピュータ１０００の演算装置１０３０は、一次記憶装置１０４０上にロードされたプログラムを実行することにより、制御部１３０の機能を実現する。また、コンピュータ１０００の演算装置１０３０は、ネットワークＩ／Ｆ１０８０を介して他の機器から取得したプログラムを一次記憶装置１０４０上にロードし、ロードしたプログラムを実行してもよい。また、コンピュータ１０００の演算装置１０３０は、ネットワークＩ／Ｆ１０８０を介して他の機器と連携し、プログラムの機能やデータ等を他の機器の他のプログラムから呼び出して利用してもよい。 For example, when the computer 1000 functions as the information processing device 100, the arithmetic device 1030 of the computer 1000 executes a program loaded onto the primary storage device 1040 to realize the functions of the control unit 130. The arithmetic device 1030 of the computer 1000 may also load a program acquired from another device via the network I/F 1080 onto the primary storage device 1040 and execute the loaded program. The arithmetic device 1030 of the computer 1000 may also cooperate with other devices via the network I/F 1080 and use the functions and data of a program by calling them from other programs of the other devices.

〔６．その他〕 6. Others
以上、本願の実施形態を説明したが、これら実施形態の内容により本発明が限定されるものではない。また、前述した構成要素には、当業者が容易に想定できるもの、実質的に同一のもの、いわゆる均等の範囲のものが含まれる。さらに、前述した構成要素は適宜組み合わせることが可能である。さらに、前述した実施形態の要旨を逸脱しない範囲で構成要素の種々の省略、置換又は変更を行うことができる。 Although the embodiments of the present application have been described above, the present invention is not limited to the contents of these embodiments. The above-described components include those that a person skilled in the art can easily imagine, those that are substantially the same, and those that are within the so-called equivalent range. Furthermore, the above-described components can be appropriately combined. Furthermore, various omissions, substitutions, or modifications of the components can be made without departing from the spirit of the above-described embodiments.

また、上記実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部又は一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。例えば、各図に示した各種情報は、図示した情報に限られない。 Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information including the processing procedures, specific names, various data, and parameters shown in the above documents and drawings can be changed as desired unless otherwise specified. For example, the various information shown in each drawing is not limited to the information shown in the drawings.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的又は物理的に分散・統合して構成することができる。 In addition, each component of each device shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or part of them can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.

例えば、上述した情報処理装置１００は、複数のサーバコンピュータで実現してもよく、また、機能によっては外部のプラットホーム等をＡＰＩ（Application Programming Interface）やネットワークコンピューティング等で呼び出して実現するなど、構成は柔軟に変更できる。 For example, the information processing device 100 described above may be realized by multiple server computers, and depending on the functions, the configuration can be flexibly changed, such as by calling an external platform using an API (Application Programming Interface) or network computing.

また、上述してきた実施形態及び変形例は、処理内容を矛盾させない範囲で適宜組み合わせることが可能である。 The above-described embodiments and variations can be combined as appropriate to the extent that they do not cause inconsistencies in the processing content.

また、上述してきた「部（section、module、unit）」は、「手段」や「回路」などに読み替えることができる。例えば、取得部は、取得手段や取得回路に読み替えることができる。 The above-mentioned "section, module, unit" can be read as "means" or "circuit." For example, an acquisition unit can be read as an acquisition means or an acquisition circuit.

REFERENCE SIGNS LIST
１情報処理システム 1 Information processing system
１００情報処理装置 100 Information processing device
１２０記憶部 120 Storage unit
１２１学習用データ記憶部 121 Learning data storage unit
１２２モデル情報記憶部 122 Model information storage unit
１２３コンテンツ情報記憶部 123 Content information storage unit
１３０制御部 130 Control unit
１３１取得部 131 Acquisition unit
１３２生成部 132 Generation unit
１３３学習部 133 Learning unit
１３４処理部 134 Processing unit
１３５提供部 135 Providing unit
１０端末装置 10 Terminal device

Claims

an acquisition unit that acquires a learning dataset including first character information labeled with a label for use in learning a model that extracts an extraction target string, which is a string corresponding to a predetermined type, from character information, and second character information that is character information without the label and, when the predetermined type is a named entity, is character information included in content that describes an explanation of a predetermined target corresponding to the named entity ;
a generation unit that selects, from the learning dataset, the first character information similar to the second character information as similar character information , identifies the named entity of the predetermined target that is the extraction target character string in the second character information as the second character string, and generates changed character information that includes the second character string and is character information that can be used for learning the model by changing the first character string that is the extraction target character string in the similar character information to the second character string;
An information processing device comprising:

The acquisition unit is
The information processing apparatus according to claim 1 , further comprising: acquiring the second character information extracted from a predetermined content.

The acquisition unit is
The information processing apparatus according to claim 2 , further comprising: acquiring the second character information extracted from the predetermined content provided on the Internet.

The acquisition unit is
The information processing apparatus according to claim 2 or 3, characterized in that the second character information is acquired by extracting the second character information from an explanation content that explains a predetermined subject.

The acquisition unit is
The information processing apparatus according to claim 4 , further comprising: acquiring the second character information extracted from the commentary content that explains an object different from the object indicated by the first character string.

The acquisition unit is
6. The information processing apparatus according to claim 2, further comprising: acquiring the second character information extracted from a content in an Internet encyclopedia.

The generation unit is
The information processing device according to any one of claims 1 to 6, characterized in that the similar character information is selected from the learning dataset based on a similarity to the second character information, and the changed character information is generated by changing the first character string in the similar character information to the second character string in the second character information.

The generation unit is
The information processing apparatus according to claim 7 , further comprising: selecting, from the learning data set, the first character information having a maximum similarity to the second character information as the similar character information.

The generation unit is
9. The information processing apparatus according to claim 7, further comprising: selecting the similar character information based on a similarity between each of first vectors obtained by vectorizing each of the first character information in the learning dataset and a second vector obtained by vectorizing the second character information.

The information processing device according to any one of claims 1 to 9, wherein the generation unit generates the changed character information in which the type label indicating that the first character string corresponds to the predetermined type is used as the type label of the second character string.

The information processing device according to any one of claims 1 to 10, wherein the acquisition unit acquires the learning dataset used for learning the model that extracts, from character information, the extraction target character string corresponding to a named entity, and the generation unit generates the changed character information by changing the first character string, which is a named entity in the similar character information, to the second character string, which is a named entity in the second character information.

The information processing device according to any one of claims 1 to 11, further comprising a learning unit that trains the model by a machine learning process using the changed character information generated by the generation unit.

The information processing device according to claim 12, wherein the learning unit trains the model to output the extraction target character string when the extraction target character string is included in input character information.

The information processing device according to claim 13, wherein the learning unit trains the model so that the second character string is extracted from the changed character information.

A computer-implemented information processing method, comprising:
an acquisition step of acquiring a learning dataset and second character information, the learning dataset including first character information labeled with a label used for learning a model that extracts, from character information, an extraction target character string that is a character string corresponding to a predetermined type, and the second character information being unlabeled character information that, when the predetermined type is a named entity, is included in content explaining a predetermined object corresponding to the named entity; and
a generation step of selecting, from the learning dataset, the first character information similar to the second character information as similar character information, identifying, as a second character string, the named entity of the predetermined object that is the extraction target character string in the second character information, and changing a first character string that is the extraction target character string in the similar character information to the second character string, thereby generating changed character information that includes the second character string and is usable for learning the model.

An information processing program causing a computer to execute:
an acquisition step of acquiring a learning dataset and second character information, the learning dataset including first character information labeled with a label used for learning a model that extracts, from character information, an extraction target character string that is a character string corresponding to a predetermined type, and the second character information being unlabeled character information that, when the predetermined type is a named entity, is included in content explaining a predetermined object corresponding to the named entity; and
a generation step of selecting, from the learning dataset, the first character information similar to the second character information as similar character information, identifying, as a second character string, the named entity of the predetermined object that is the extraction target character string in the second character information, and changing a first character string that is the extraction target character string in the similar character information to the second character string, thereby generating changed character information that includes the second character string and is usable for learning the model.
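
For illustration only, the generation step recited in the claims above (selecting the most similar labeled text, substituting the extraction target character string, and carrying over the type label) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not specify a vectorization or similarity measure, so whitespace token counts and cosine similarity are used here as stand-ins, and the names `LabeledText`, `vectorize`, `cosine`, and `generate_changed_text`, as well as the sample sentences and the `FACILITY` label, are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledText:
    text: str        # character information
    entity: str      # extraction target character string in the text
    type_label: str  # label indicating the predetermined type


def vectorize(text: str) -> Counter:
    # Assumption: lowercase whitespace-token counts stand in for the vectorization.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_changed_text(dataset, second_text, second_entity):
    # Select the labeled text with the maximum similarity to the unlabeled text.
    second_vec = vectorize(second_text)
    similar = max(dataset, key=lambda item: cosine(vectorize(item.text), second_vec))
    # Change the first character string to the second character string.
    changed = similar.text.replace(similar.entity, second_entity)
    # Reuse the type label of the first character string for the second one.
    return LabeledText(changed, second_entity, similar.type_label)


# Hypothetical learning dataset (first character information with labels).
dataset = [
    LabeledText("Tokyo Tower is a tower located in Tokyo", "Tokyo Tower", "FACILITY"),
    LabeledText("Natsume Soseki was a novelist born in Tokyo", "Natsume Soseki", "PERSON"),
]

# Hypothetical unlabeled text and the named entity it explains.
changed = generate_changed_text(
    dataset, "Osaka Castle is a castle located in Osaka", "Osaka Castle"
)
print(changed.text, changed.type_label)
```

In this sketch only the extraction target character string is replaced, so the surrounding context of the selected labeled text is kept as is, and the resulting text inherits a label without manual annotation.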