JP2006092198A - Data processor and program

Info

Publication number: JP2006092198A
Application number: JP2004275975A
Authority: JP
Inventors: Hiroshi Masuichi; 博増市; Tsuguaki Ryu; 紹明劉; Hiroki Yoshimura; 宏樹吉村; Michihiro Tamune; 道弘田宗; Masatoshi Tagawa; 昌俊田川; Kiyoshi Tashiro; 潔田代; Atsushi Ito; 篤伊藤; Kyosuke Ishikawa; 恭輔石川; Naoko Sato; 直子佐藤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-09-22
Filing date: 2004-09-22
Publication date: 2006-04-06

Abstract

PROBLEM TO BE SOLVED: To generate, from machine learning data for classifying sentences written about one field, machine learning data for classifying sentences written about another field.
SOLUTION: A data processor comprises: an input means that receives machine learning data including one or more words; a specifying means that specifies, for each of the one or more words included in the machine learning data input to the input means, its appearance frequency in a plurality of sentences written about a predetermined field; a determination means that determines whether the appearance frequency specified by the specifying means exceeds a predetermined threshold; and an output means that deletes, from the machine learning data input to the input means, each word whose appearance frequency is determined by the determination means to be no more than the predetermined threshold, and outputs the machine learning data from which those words have been deleted.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a technique for causing a computer device to classify new data based on a plurality of already classified data items.

In recent years, machine learning, a technique for classifying data according to its features, has attracted attention. In machine learning, data representing the correlation between the features of each of a large number of pre-classified sample data items and their classification results (hereinafter, "machine learning data") is stored in a computer device, and when new, unclassified sample data is input, the computer device is made to specify its classification based on its features and the machine learning data. For example, Non-Patent Document 1 discloses the support vector machine algorithm (hereinafter, "SVM algorithm") as one example of an algorithm used for such machine learning. An application example of machine learning is the technique disclosed in Patent Document 1. As an application of machine learning to natural language processing, Patent Document 1 discloses a technique in which a document processing device stores machine learning data representing the correlation between the features of a sentence (for example, the nouns included in the sentence) and the classification result for that sentence, and is made to specify the classification of a newly input sentence based on its features and the machine learning data.
Non-Patent Document 1: Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47 (2002)
Patent Document 1: Japanese Patent Application Laid-Open No. 07-021169

It is generally known that the accuracy with which machine learning specifies the classification of new data increases with the quality of the machine learning data stored in the computer device performing the machine learning, and also with its quantity. Here, machine learning data is of good quality when the correlation it represents (that is, the correlation between the classification result for a sample data item and the features of that sample data item) is high. Therefore, in order to use the technique disclosed in Patent Document 1 to classify sentences written about a certain field appropriately by machine learning, a large amount of sample data belonging to that field (that is, sentences written about that field) must be prepared, and the sample data must then be classified appropriately to generate machine learning data. In other words, to classify with high accuracy, machine learning data suited to each field must be created for every field to which the data to be classified belongs. However, machine learning data is generally created by hand, and the classification work in particular requires an enormous number of man-hours.

The present invention has been made in view of the above problem, and its object is to provide a technique that makes it possible to generate, from machine learning data for classifying sentences written about one field, machine learning data for classifying sentences written about another field.

In order to solve the above problem, the present invention provides a data processing device comprising: input means to which machine learning data including one or more words is input; specifying means for specifying, for each of the one or more words included in the machine learning data input to the input means, its appearance frequency in a plurality of sentences written about a predetermined field; determination means for determining whether the appearance frequency specified by the specifying means exceeds a predetermined threshold; and output means for deleting, from the machine learning data input to the input means, each word whose appearance frequency specified by the specifying means is determined by the determination means to be equal to or less than the predetermined threshold, and outputting the result. With such a data processing device, when machine learning data including one or more words is input, the device outputs machine learning data from which those words whose appearance frequency in the plurality of sentences written about the predetermined field is equal to or less than the predetermined threshold have been deleted. Therefore, if the threshold is set to an appropriate value, the data processing device outputs machine learning data that includes only words appearing frequently in sentences written about the predetermined field.
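The three means described above can be illustrated with a minimal Python sketch. The function name `filter_features`, the dictionary `company_b_frequency`, the sample counts, and the word "ProjectX" are hypothetical names and values chosen only for illustration; the source does not prescribe any particular implementation.

```python
from typing import Dict, List


def filter_features(features: List[str],
                    frequency: Dict[str, int],
                    threshold: int) -> List[str]:
    """Keep only words whose frequency in the target field exceeds the threshold."""
    kept = []
    for word in features:
        # Specifying step: look up the appearance frequency of the word
        # (assumption: a word with no entry is treated as frequency 0).
        count = frequency.get(word, 0)
        # Determination step: does the frequency exceed the threshold?
        if count > threshold:
            kept.append(word)
        # Words at or below the threshold are deleted (not appended).
    # Output step: the machine learning data with low-frequency words removed.
    return kept


# Hypothetical frequencies counted in sentences of the target field.
company_b_frequency = {"Tokyo": 120, "Nagoya": 45, "Shinkansen": 30,
                       "fare": 80, "billing": 60}
record = ["Tokyo", "Nagoya", "Shinkansen", "fare", "billing", "ProjectX"]
filtered = filter_features(record, company_b_frequency, threshold=10)
```

Note that the comparison is strict: a word whose frequency equals the threshold is deleted, matching the "equal to or less than" wording above.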

In a more preferred aspect, the data processing device comprises storage means storing, for each word extracted from the plurality of sentences written about the predetermined field, data representing its appearance frequency in those sentences, and the specifying means specifies, for each of the one or more words included in the machine learning data input to the input means, its appearance frequency in the plurality of sentences written about the predetermined field by referring to the contents of the storage means. In this aspect, the appearance frequency in the plurality of sentences is specified for each of the one or more words included in the input machine learning data by referring to the contents of the storage means.

In another preferred aspect, the data processing device comprises storage means storing sentence data representing each of the plurality of sentences written about the predetermined field, and the specifying means specifies, for each of the one or more words included in the machine learning data input to the input means, its appearance frequency in the plurality of sentences written about the predetermined field by referring to the contents of the storage means. In this aspect as well, the appearance frequency in the plurality of sentences is specified for each of the one or more words included in the input machine learning data by referring to the contents of the storage means.

In a more preferred aspect, the data processing device comprises generating means for extracting the words included in the sentences represented by the sentence data stored in the storage means and generating a set of words for each item of sentence data. For a word whose appearance frequency specified by the specifying means is determined by the determination means to exceed the predetermined threshold, the output means deletes that word from the machine learning data before output if the set consisting of that word and the words included together with it in the machine learning data input to the input means is not similar to any of the sets generated by the generating means. In this aspect, even a word that appears frequently in sentences written about the predetermined field can be deleted when it is presumed to be used with a different meaning, on the grounds that the words used together with it are not similar.
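The co-occurrence check in this aspect can be sketched as follows. The source does not say how similarity between word sets is measured, so this sketch assumes Jaccard similarity with a hypothetical cut-off `min_similarity`, and further assumes that only target-field sentence sets containing the word in question are compared. All function names and sample words are illustrative.

```python
from typing import Iterable, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity (an assumed measure; the source names none)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_words_used_differently(features: List[str],
                                frequent_words: Set[str],
                                sentence_word_sets: Iterable[Set[str]],
                                min_similarity: float = 0.3) -> List[str]:
    """Delete frequent words whose surrounding words match no target-field sentence."""
    # The word together with the words accompanying it in the record.
    record_set = set(features)
    sets = [set(s) for s in sentence_word_sets]
    kept = []
    for word in features:
        if word in frequent_words:
            # Assumption: compare only with sentence sets containing the word.
            candidates = [s for s in sets if word in s]
            if not any(jaccard(record_set, s) >= min_similarity
                       for s in candidates):
                continue  # similar to none of the sets: delete the word
        kept.append(word)
    return kept


# Hypothetical word sets generated from target-field sentences.
target_sets = [{"bank", "loan", "interest"}, {"bank", "deposit", "account"}]
different_sense = drop_words_used_differently(
    ["bank", "river", "fishing"], {"bank"}, target_sets)
same_sense = drop_words_used_differently(
    ["bank", "loan", "account"], {"bank"}, target_sets)
```

In the first call "bank" co-occurs with "river" and "fishing", which resemble neither target sentence, so it is dropped; in the second call the surrounding words overlap with the target sentences and "bank" is kept.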

In a more preferred aspect, the machine learning data input to the input means has been created for a field different from the predetermined field, and the data processing device comprises storage means storing a plurality of technical terms of that different field. For a word whose appearance frequency specified by the specifying means is determined by the determination means to exceed the predetermined threshold, the output means deletes that word from the machine learning data input to the input means before output if it matches a technical term stored in the storage means. In this aspect, words representing technical terms of the field to which the input machine learning data belongs are deleted from that data before it is output.

In another preferred aspect, the data processing device comprises storage means storing a plurality of general terms, that is, terms commonly used regardless of field. Even for a word whose appearance frequency specified by the specifying means is determined to be equal to or less than the predetermined threshold, the output means outputs the machine learning data input to the input means without deleting that word if it matches a general term stored in the storage means. In this aspect, among the one or more words included in the input machine learning data, words that are general terms are not deleted when the machine learning data is output.
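The technical-term and general-term aspects both act as overrides on the basic threshold decision. The source presents them as separate aspects; combining them in a single function is an assumption made here for brevity, and `should_delete` and the sample term lists are hypothetical.

```python
from typing import Set


def should_delete(word: str, count: int, threshold: int,
                  source_field_terms: Set[str],
                  general_terms: Set[str]) -> bool:
    """Return True if the word should be removed from the machine learning data."""
    if count > threshold:
        # Frequent in the target field, but still deleted if it is a
        # technical term of the field the data was originally made for.
        return word in source_field_terms
    # Infrequent in the target field: deleted unless it is a general term.
    return word not in general_terms


# Hypothetical term lists.
source_terms = {"ledger-x"}
general = {"fee"}
decisions = {
    "invoice": should_delete("invoice", 50, 10, source_terms, general),
    "ledger-x": should_delete("ledger-x", 50, 10, source_terms, general),
    "fee": should_delete("fee", 0, 10, source_terms, general),
    "codename": should_delete("codename", 0, 10, source_terms, general),
}
```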

In order to solve the above problem, the present invention also provides a program that causes a computer device to execute: a first process of specifying, when machine learning data including one or more words is input, the appearance frequency of each of the one or more words included in that machine learning data in a plurality of sentences written about a predetermined field; a second process of determining whether the appearance frequency specified by the first process exceeds a predetermined threshold; and a third process of deleting, from the input machine learning data, each word whose appearance frequency specified in the first process is determined in the second process to be equal to or less than the predetermined threshold, and outputting the result. In another aspect of the present invention, the program may be provided written on a computer-readable recording medium. By operating a computer device according to such a program, when machine learning data is input to the computer device, machine learning data is output from which those words whose appearance frequency in the plurality of sentences written about the predetermined field is equal to or less than the threshold have been deleted.

According to the present invention, machine learning data for another field can be generated from machine learning data created for a certain field. In addition, because the manual classification work that has been indispensable when creating machine learning data is no longer necessary, the man-hours required to create machine learning data can be reduced.

The best mode for carrying out the present invention is described below with reference to the drawings.
(A: Configuration)
(A-1: System configuration)
FIG. 1 is a block diagram illustrating a configuration example of a computer system 10 according to an embodiment of the present invention. As shown in FIG. 1, the computer system 10 includes a storage device 100 and a storage device 200, each of which is, for example, a hard disk, and a data processing device 300 connected to these two storage devices.

The storage device 100 in FIG. 1 stores machine learning data used when sentences whose content concerns a specific field are classified according to the SVM algorithm described above. In the present embodiment, the storage device 100 stores machine learning data for classifying, among the sentences in documents used at a specific company (hereinafter, "company A"), those sentences that contain a character string representing some monetary amount according to the account item of that amount (for example, "transportation expenses", "accommodation expenses", or "article purchase expenses").

FIG. 2 is a diagram illustrating an example of the machine learning data stored in the storage device 100. As shown in FIG. 2, the machine learning data includes classification data, which is a character string indicating the account item, and one or more items of feature data, each representing a word (a noun, symbol, or abbreviation; hereinafter also called a "feature") that is frequently used together with monetary expressions belonging to that account item. For example, FIG. 2(a) shows machine learning data generated from the sentence "I will charge 7980 yen as the express charge for the Shinkansen from Tokyo to Nagoya". As shown in FIG. 2(a), this machine learning data includes classification data indicating that the classification of the monetary expression in the sentence (that is, "7980 yen") is "transportation expenses", and includes the nouns in the sentence ("Tokyo", "Nagoya", "Shinkansen", "Express", "Fare", and "Billing") as feature data representing the features of the sentence (or of the monetary expression it contains).
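One item of machine learning data as described for FIG. 2(a) might be represented as follows. This is only an illustrative sketch: the class name `TrainingRecord` and its field names are hypothetical, and the source does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingRecord:
    """One item of machine learning data (hypothetical representation)."""
    classification: str   # classification data: the account item
    features: List[str]   # feature data: words used with the monetary expression


# The example of FIG. 2(a), using the feature words quoted in the text.
fig_2a = TrainingRecord(
    classification="transportation expenses",
    features=["Tokyo", "Nagoya", "Shinkansen", "Express", "Fare", "Billing"],
)
```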

The machine learning data stored in the storage device 100 can be divided into two groups. The first group consists of machine learning data that includes only feature data representing general words; examples are the machine learning data shown in FIG. 2(a), FIG. 2(b), and FIG. 2(c). In contrast, machine learning data belonging to the second group includes feature data representing words used only within company A; examples are the machine learning data shown in FIG. 2(d) and FIG. 2(e). For example, among the feature data included in the machine learning data of the second group, "○○" is the code name of a product being developed by company A, and "△△" is the system name of a computer system used at company A. In addition, "××" in FIG. 2 is the name of a project being carried out by company A.

Thus, some of the machine learning data stored in the storage device 100 includes feature data representing words used only within company A (that is, "○○", "△△", and "××" in FIG. 2). Consequently, if sentences containing a character string representing some monetary amount in documents used at a company B, which is different from company A, are classified using the machine learning data stored in the storage device 100 as it is, they may be classified incorrectly. In the computer system 10 shown in FIG. 1, therefore, the data processing device 300 converts the machine learning data stored in the storage device 100 into machine learning data usable at company B and outputs it to the storage device 200. The following description focuses on the data processing device 300.

(A-2: Configuration of the data processing device 300)
First, the configuration of the data processing device 300 is described with reference to FIG. 3.
FIG. 3 is a block diagram illustrating an example of the hardware configuration of the data processing device 300. As shown in FIG. 3, the data processing device 300 includes a control unit 310, an interface unit 320, a storage unit 330, and a bus 340 that mediates data exchange between these components.

The control unit 310 is, for example, a CPU (Central Processing Unit), and controls each unit of the data processing device 300 by executing various software stored in the storage unit 330 described later. The interface unit 320 is connected to the storage device 100 and the storage device 200, and mediates data exchange between these storage devices and the control unit 310. More specifically, in response to instructions from the control unit 310, the interface unit 320 can read data from the storage device 200 and deliver it to the control unit 310, and can output data delivered from the control unit 310 to the storage device 200 for storage.

As shown in FIG. 3, the storage unit 330 includes a volatile storage unit 330a and a nonvolatile storage unit 330b. The volatile storage unit 330a is, for example, a RAM (Random Access Memory), and is used as a work area by the control unit 310 operating according to the various software described later. The nonvolatile storage unit 330b is, for example, a hard disk, and stores various data and software.

One example of the data stored in the nonvolatile storage unit 330b is the appearance frequency dictionary shown in FIG. 4. As shown in FIG. 4, for each word, the appearance frequency dictionary registers a character string representing the word and a value representing the appearance frequency of that word in a plurality of sentences written about a predetermined field. The appearance frequency dictionary is generated as follows. First, for each of the plurality of sentences written about the predetermined field, the words making up the sentence are extracted by morphological analysis or the like. Supposing Nw words are extracted as a result, the appearance frequency Wn (n = 1, 2, ..., Nw) in the plurality of sentences is tallied for each word n (n = 1, 2, ..., Nw), and the appearance frequency dictionary is generated by writing the character string representing each word, associated with its appearance frequency value, to the nonvolatile storage unit 330b. In the present embodiment, an appearance frequency dictionary generated from a plurality of sentences used at company B is stored in the nonvolatile storage unit 330b. As described in detail later, this appearance frequency dictionary is used to generate machine learning data for company B by deleting, from the machine learning data stored in the storage device 100, feature data representing words that appear infrequently at company B.
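The dictionary generation procedure above can be sketched in Python. Whitespace splitting stands in for the morphological analysis mentioned in the text, which is an assumption made only to keep the sketch self-contained; counting every occurrence (rather than the number of sentences containing the word) is likewise an assumption, since the source does not say which is meant. The function name and sample sentences are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def build_frequency_dictionary(
        sentences: Iterable[str],
        tokenize: Callable[[str], List[str]] = str.split) -> Dict[str, int]:
    """Tally the appearance frequency Wn of each word n over all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Extract the words making up the sentence (placeholder tokenizer).
        counts.update(tokenize(sentence))
    # Associate each word string with its appearance frequency value.
    return dict(counts)


# Hypothetical target-field sentences, already segmented by spaces.
sample_sentences = [
    "Shinkansen fare Tokyo",
    "taxi fare Tokyo",
    "hotel Nagoya",
]
frequency_dictionary = build_frequency_dictionary(sample_sentences)
```

For Japanese text, a real morphological analyzer would be passed in through the `tokenize` parameter instead of `str.split`.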

Examples of the software stored in the nonvolatile storage unit 330b include OS software for causing the control unit 310 to implement an operating system (hereinafter, "OS"), and data conversion software. Here, the data conversion software is software for causing the control unit 310 to execute a process of converting already created machine learning data into machine learning data for another field by referring to the appearance frequency dictionary, and outputting the result. The functions given to the control unit 310 by executing this software are described below.

When the power (not shown) of the data processing device 300 is turned on, the control unit 310 first reads the OS software from the nonvolatile storage unit 330b and executes it. The control unit 310, operating according to the OS software and thereby implementing the OS, is given a function for controlling each unit of the data processing device 300, a function for reading other software from the nonvolatile storage unit 330b and executing it, and so on. In the present embodiment, once execution of the OS software is complete and the OS is implemented, the control unit 310 immediately reads the data conversion software from the nonvolatile storage unit 330b and executes it. The control unit 310 operating according to the data conversion software is given the following three functions.

The first is a specifying function that specifies, for each item of feature data included in the machine learning data input via the interface unit 320, its appearance frequency in a plurality of sentences written about a predetermined field. In the present embodiment, the control unit 310 refers to the appearance frequency dictionary to specify, for the word represented by each item of feature data included in the machine learning data input via the interface unit 320, its appearance frequency in the sentences used within company B. The second is a determination function that determines whether the appearance frequency specified by the specifying function exceeds a predetermined threshold. The third is an output function that deletes, from the machine learning data, each word whose appearance frequency specified by the specifying function is determined by the determination function to be equal to or less than the predetermined threshold, and outputs the result to the storage device 200 via the interface unit 320.

As described above, the hardware configuration of the data processing device 300 according to the present embodiment is the same as that of a general computer device, and the functions specific to the data processing device according to the present invention are realized by operating the control unit 310 according to the various software stored in the nonvolatile storage unit 330b. Although the present embodiment has described the case in which these functions are realized by software modules, the data processing device according to the present invention may of course be configured from hardware modules that perform these functions. Specifically, input means to which existing machine learning data is input, specifying means performing the specifying function, determination means performing the determination function, and output means performing the output function may each be realized as a hardware module, and these hardware modules may be combined to configure the data processing device according to the present invention.

(B: Operation)
Next, among the operations performed by the data processing device 300, those that most clearly show the features of the data processing device according to the present invention are described with reference to the drawings. The operation example described below assumes that the machine learning data shown in FIG. 2 (that is, machine learning data created for company A) is stored in the storage device 100, and that an appearance frequency dictionary generated from a plurality of documents used at company B is stored in the nonvolatile storage unit 330b of the data processing device 300. It also assumes that all the words represented by the feature data included in the machine learning data of the first group in FIG. 2 (for example, "Tokyo" and "Shinkansen") are registered in this appearance frequency dictionary, whereas some of the words represented by the feature data included in the machine learning data of the second group in FIG. 2, such as "○○", "△△", and "××", are not registered. This is because "○○", "△△", and "××" are words used only within company A.

The control unit 310, operating in accordance with the data conversion software, first instructs the interface unit 320 to read the machine learning data stored in the storage device 100 one item at a time. When machine learning data is input via the interface unit 320, the control unit 310 performs processing according to the flowchart shown in FIG. 5.

FIG. 5 is a flowchart showing the flow of the data conversion processing performed by the control unit 310 when machine learning data is input via the interface unit 320. As shown in FIG. 5, the control unit 310 first writes the machine learning data input via the interface unit 320 to the volatile storage unit 330a (step SA1). The control unit 310 then refers to the appearance frequency dictionary to specify the appearance frequency of the feature (that is, the word) represented by each item of feature data included in the machine learning data stored in the volatile storage unit 330a (step SA2). Specifically, when the same word as the word represented by the feature data is registered in the appearance frequency dictionary, the control unit 310 takes the value registered in the dictionary in association with that word as the value representing the appearance frequency of the feature. Conversely, when the word represented by the feature data is not registered in the appearance frequency dictionary, the control unit 310 takes a value indicating that the word is not used at all at company B (for example, “0”) as the value representing the appearance frequency of the word. For example, the words “○○”, “△△”, and “××”, which are used only within company A, have no corresponding entry in the appearance frequency dictionary, so “0” is specified as the value representing their appearance frequency.
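Step SA2 can be illustrated with a minimal Python sketch. It assumes the appearance frequency dictionary is held as a mapping from word to count; the function name `lookup_frequency` and the frequency values are hypothetical and are not taken from the specification.

```python
# Illustrative sketch of step SA2 (names and values are assumptions).

def lookup_frequency(word, frequency_dict):
    """Return the registered appearance frequency, or 0 if the word is absent."""
    return frequency_dict.get(word, 0)


# Hypothetical appearance frequency dictionary built from company B's documents.
company_b_dict = {"Tokyo": 120, "Shinkansen": 45}

# Feature words of one item of machine learning data created for company A.
features = ["Tokyo", "Shinkansen", "○○"]

frequencies = {word: lookup_frequency(word, company_b_dict) for word in features}
print(frequencies)
```

A word that is missing from the dictionary is treated exactly like a word that was registered with frequency 0, which is what lets the later threshold test handle both cases uniformly.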

Next, the control unit 310 determines whether the appearance frequency value specified in step SA2 exceeds a predetermined threshold for every item of feature data included in the machine learning data stored in the volatile storage unit 330a (step SA3). If the determination result of step SA3 is “No”, the control unit 310 executes step SA4 described below; conversely, if the determination result is “Yes”, it executes step SA5 described below. In the following, the determination result of step SA3 is assumed to be “Yes” for the machine learning data belonging to the first group in FIG. 2. In contrast, for feature data representing words such as “○○”, “△△”, and “××”, “0” is specified in step SA2 as the value representing the appearance frequency, so the determination result of step SA3 is “No” for machine learning data that includes such feature data (that is, the machine learning data belonging to the second group in FIG. 2).

In step SA4, which is executed when the determination result of step SA3 is “No”, the control unit 310 deletes from the machine learning data stored in the volatile storage unit 330a every item of feature data whose appearance frequency value specified in step SA2 is equal to or lower than the threshold. For example, for the machine learning data belonging to the second group in FIG. 2, the feature data representing words such as “○○”, “△△”, and “××” is deleted. In step SA5, which is executed when the determination result of step SA3 is “Yes” or following step SA4, the control unit 310 outputs the machine learning data stored in the volatile storage unit 330a via the interface unit 320 and stores it in the storage device 200. If, at step SA5, the machine learning data stored in the volatile storage unit 330a no longer contains any feature data, that machine learning data may of course be withheld from output to the storage device 200.
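Steps SA3 to SA5 for a single item of machine learning data might be sketched as follows. This is only a sketch under stated assumptions: the `Record` type, the function `convert_record`, the `drop_empty` flag, and the sample label and values are all hypothetical, and an item is modelled simply as a classification result plus a list of feature words.

```python
# Illustrative sketch of steps SA3-SA5 (all names are assumptions).
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    label: str            # classification result
    features: List[str]   # words represented by the feature data


def convert_record(record: Record, frequency_dict: Dict[str, int],
                   threshold: int, drop_empty: bool = True) -> Optional[Record]:
    # SA2: unregistered words are given frequency 0.
    freqs = {w: frequency_dict.get(w, 0) for w in record.features}
    # SA3: "Yes" only if every feature word exceeds the threshold.
    if all(f > threshold for f in freqs.values()):
        kept = list(record.features)
    else:
        # SA4: delete feature words at or below the threshold.
        kept = [w for w in record.features if freqs[w] > threshold]
    # Optional behaviour mentioned for SA5: skip items left with no features.
    if drop_empty and not kept:
        return None
    # SA5: the item that would be written to the output storage.
    return Record(record.label, kept)


company_b_dict = {"Tokyo": 120, "Shinkansen": 45}
converted = convert_record(Record("business trip", ["Tokyo", "○○"]),
                           company_b_dict, threshold=10)
print(converted)
```

Returning `None` for an emptied item corresponds to the optional behaviour of not writing it to the storage device 200; passing `drop_empty=False` keeps such items with an empty feature list instead.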

Thereafter, the control unit 310 repeats steps SA1 to SA5 for all of the machine learning data stored in the storage device 100. As a result, the machine learning data shown in FIG. 6 is stored in the storage device 200. As a comparison with the machine learning data shown in FIG. 2 makes clear, the machine learning data shown in FIG. 6 contains no feature data representing words used only within company A; it contains only feature data representing general words. In this way, the data processing device according to this embodiment makes it possible to generate machine learning data for another company from machine learning data created for one particular company.

(C: Modifications)
Although one embodiment of the present invention has been described above, the embodiment may of course be modified as follows.
(C-1: Modification 1)
The embodiment described above realizes the storage devices 100 and 200 and the data processing device 300 as separate pieces of hardware, but they may of course be realized on the same hardware. Specifically, the nonvolatile storage unit 330b of the data processing device 300 may store the already created machine learning data in addition to the appearance frequency dictionary, and newly generated machine learning data may also be output to the nonvolatile storage unit 330b.

(C-2: Modification 2)
The embodiment described above generates machine learning data for a company B, different from a company A, from machine learning data created for company A. However, field 1 and field 2 may of course be any fields, provided that the existing machine learning data includes a classification result for each of a plurality of sentences described in a certain field (hereinafter, field 1) together with feature data representing the characteristics of each sentence, and that new machine learning data is generated by updating that machine learning data on the basis of the appearance frequency of each word within a plurality of sentences described in a field different from field 1 (hereinafter, field 2).

(C-3: Modification 3)
In the embodiment described above, the data processing device stores an appearance frequency dictionary that lets it specify, for each of the one or more words included in the existing machine learning data, the appearance frequency in a plurality of sentences described in a predetermined field. However, the data used to generate the appearance frequency dictionary (that is, sentence data representing each of the plurality of sentences) may instead be stored in the nonvolatile storage unit 330b, and the appearance frequency within the sentences represented by that sentence data may be calculated each time it is needed.
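As a rough Python illustration of Modification 3, the frequency can be computed from the stored sentence data on demand rather than read from a prebuilt dictionary. The sketch assumes that each sentence is already split into words (for Japanese text this would require a morphological analyzer, which is outside the sketch) and that "appearance frequency" means the raw number of occurrences; both are assumptions, as are the function names.

```python
# Illustrative sketch of Modification 3 (names and the counting rule are assumptions).
from collections import Counter
from typing import Dict, List


def frequency_in_sentences(word: str, tokenized_sentences: List[List[str]]) -> int:
    """Count occurrences of `word` across all sentences, computed on demand."""
    return sum(sentence.count(word) for sentence in tokenized_sentences)


def build_frequency_dict(tokenized_sentences: List[List[str]]) -> Dict[str, int]:
    """Precompute the same counts once, as in the main embodiment."""
    counts: Counter = Counter()
    for sentence in tokenized_sentences:
        counts.update(sentence)
    return dict(counts)


# Hypothetical sentence data for the target field, already split into words.
sentences = [["Tokyo", "Shinkansen", "ticket"], ["Tokyo", "office"]]

print(frequency_in_sentences("Tokyo", sentences))
print(frequency_in_sentences("○○", sentences))
```

The on-demand variant trades repeated scanning of the sentence data for not having to keep a dictionary in sync with it; the two functions return the same count for any word.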

(C-4: Modification 4)
In the embodiment described above, for each of the one or more words included in the existing machine learning data, the appearance frequency in a plurality of sentences described in a predetermined field is specified, and words whose appearance frequency is equal to or lower than a predetermined threshold are deleted from the machine learning data. However, when the existing machine learning data contains a word that is a technical term of the field to which that machine learning data belongs, the word may be deleted regardless of its appearance frequency. This can be realized as follows. Data representing the technical terms of the field to which the existing machine learning data belongs (for example, character string data corresponding to each technical term; hereinafter, technical term data) is stored in advance in the nonvolatile storage unit 330b of the data processing device. A word that matches a character string represented by the technical term data stored in the nonvolatile storage unit 330b is then deleted from the machine learning data even if its appearance frequency in the plurality of sentences described in the predetermined field exceeds the predetermined threshold. This has the following effect. When such a technical term is frequently used with a different meaning in the predetermined field (that is, when its appearance frequency in the predetermined field is high), retaining it in the machine learning data solely on the basis of its appearance frequency in the predetermined field could produce incorrect machine learning data. With this modification, such technical terms are reliably deleted, so incorrect machine learning data is not generated from them.

Conversely, when a word included in the existing machine learning data is a general term, it may be retained even if its appearance frequency is equal to or lower than the threshold. This can be realized by storing data representing such general terms (for example, character string data corresponding to each general term; hereinafter, general term data) in the nonvolatile storage unit 330b in advance, and by not deleting from the machine learning data any word that matches a character string represented by the general term data, even if its appearance frequency in the plurality of sentences described in the predetermined field is equal to or lower than the predetermined threshold.
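The two overrides of Modification 4 can be sketched in Python as a single keep-or-delete decision. The function name `keep_word`, the sample term lists, and the frequency values are hypothetical. The specification does not say what happens when a word appears in both lists; the sketch assumes the technical term rule takes precedence.

```python
# Illustrative sketch of Modification 4 (names, values, and precedence are assumptions).
from typing import Set


def keep_word(word: str, frequency: int, threshold: int,
              technical_terms: Set[str], general_terms: Set[str]) -> bool:
    """Decide whether a feature word stays in the machine learning data."""
    if word in technical_terms:
        return False              # deleted even if frequency exceeds the threshold
    if word in general_terms:
        return True               # kept even if frequency is at or below the threshold
    return frequency > threshold  # default rule of the main embodiment


technical = {"kernel"}     # hypothetical technical term of the source field
general = {"meeting"}      # hypothetical general term

print(keep_word("kernel", 500, 10, technical, general))   # frequent but technical
print(keep_word("meeting", 3, 10, technical, general))    # rare but general
print(keep_word("Tokyo", 120, 10, technical, general))    # ordinary word
```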

(C-5: Modification 5)
In the embodiment described above, for each of the one or more words included in the existing machine learning data, the appearance frequency in a plurality of sentences described in a predetermined field is specified, and words whose appearance frequency is equal to or lower than a predetermined threshold are deleted from the machine learning data. However, even when the appearance frequency exceeds the threshold, it is undesirable to leave a word in the machine learning data as it is if the word is used with a different meaning in the predetermined field. Therefore, even a word with a high appearance frequency may be deleted from the machine learning data if its meaning in the predetermined field is estimated to be different. This can be realized as follows. In place of the appearance frequency dictionary, the nonvolatile storage unit of the data processing device described in the above embodiment stores sentence data representing each of the plurality of sentences used to generate that dictionary, and the data processing device is provided with generation means that extracts the words included in the sentence represented by each item of sentence data and generates a set of words for each item of sentence data. Then, even for a word whose appearance frequency exceeds the predetermined threshold, the word is deleted if the set consisting of that word and the words included together with it in the machine learning data is not similar to any of the sets generated by the generation means. Whether two such sets of words are similar can be determined using a well-known technique such as the vector space method.
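One way to picture the similarity check of Modification 5 is cosine similarity between binary bag-of-words vectors, a common form of the vector space method. The following Python sketch is an assumption-laden illustration: the specification does not fix the similarity measure or a cutoff, so `cosine_similarity`, `is_similar_to_any`, and the `min_similarity` value of 0.5 are all hypothetical.

```python
# Illustrative sketch of Modification 5 (measure, cutoff, and names are assumptions).
import math
from typing import Iterable, List, Set


def cosine_similarity(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two word sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def is_similar_to_any(record_words: Iterable[str],
                      sentence_word_sets: List[Set[str]],
                      min_similarity: float = 0.5) -> bool:
    """True if the record's word set resembles at least one sentence's word set."""
    record_set = set(record_words)
    return any(cosine_similarity(record_set, s) >= min_similarity
               for s in sentence_word_sets)


# Hypothetical word sets generated from the target field's sentence data.
sentence_sets = [{"Tokyo", "Shinkansen", "ticket"}, {"meeting", "room"}]

print(is_similar_to_any(["Tokyo", "Shinkansen"], sentence_sets))
print(is_similar_to_any(["Tokyo", "recipe", "oven", "flour"], sentence_sets))
```

In the second call "Tokyo" may well be frequent in the target field, yet the surrounding words match no sentence there, so under this modification the word would be deleted despite exceeding the frequency threshold.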

(C-6: Modification 6)
The embodiment described above generates, from existing machine learning data, machine learning data belonging to a field different from the field to which the existing data belongs. The machine learning data generated in this way may of course be stored in the data processing device and used to have that data processing device perform machine learning. Furthermore, among the data newly classified by that machine learning, the classification result of the data whose classification confidence (maximum likelihood) is highest may be associated with the feature data representing the characteristics of that data, and the data processing device may generate and store this as new machine learning data.
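The selection step of Modification 6 might look like the following Python sketch. The classifier is abstracted as a callable returning a label and a confidence score; that interface, the function `most_confident_record`, and the toy classifier are hypothetical stand-ins and do not represent the SVM-based learning mentioned elsewhere in the specification.

```python
# Illustrative sketch of Modification 6 (interface and names are assumptions).
from typing import Callable, Dict, List, Optional, Tuple

Classifier = Callable[[List[str]], Tuple[str, float]]


def most_confident_record(unlabeled: List[List[str]],
                          classify: Classifier) -> Optional[Dict[str, object]]:
    """Classify each item and return the most confident one as new training data."""
    best: Optional[Tuple[float, str, List[str]]] = None
    for features in unlabeled:
        label, confidence = classify(features)
        if best is None or confidence > best[0]:
            best = (confidence, label, features)
    if best is None:
        return None  # nothing to classify
    return {"label": best[1], "features": list(best[2])}


def toy_classify(features: List[str]) -> Tuple[str, float]:
    # Stand-in for a trained classifier; scores are invented for illustration.
    if "Shinkansen" in features:
        return ("travel", 0.9)
    return ("other", 0.4)


new_record = most_confident_record([["meeting"], ["Shinkansen", "ticket"]], toy_classify)
print(new_record)
```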

(C-7: Modification 7)
In each of the embodiments described above, the software that causes the control unit to realize the functions specific to the data processing device according to the present invention is stored in the nonvolatile storage unit in advance. However, the software may of course be recorded on a computer-readable recording medium such as a CD-ROM (Compact Disc Read-Only Memory) or a DVD (Digital Versatile Disc) and installed on a general computer device from that medium. This makes it possible to have a general computer device function as the data processing device according to the present invention.

FIG. 1 is a diagram showing an example of the overall configuration of a computer system having the data processing device 300 according to one embodiment of the present invention.
FIG. 2 is a diagram showing an example of the machine learning data stored in the storage device 100 included in the computer system 10.
FIG. 3 is a diagram showing an example of the hardware configuration of the data processing device 300.
FIG. 4 is a diagram showing the data format of the appearance frequency dictionary stored in the nonvolatile storage unit 330b of the data processing device 300.
FIG. 5 is a flowchart showing the flow of the data conversion processing that the control unit 310 of the data processing device 300 performs in accordance with the data conversion software.
FIG. 6 is a diagram showing an example of the machine learning data written to the storage device 200 by the data processing device 300.

Explanation of symbols

10: computer system; 100, 200: storage device; 300: data processing device; 310: control unit; 320: interface unit; 330: storage unit; 330a: volatile storage unit; 330b: nonvolatile storage unit; 340: bus.

Claims

A data processing apparatus comprising:
input means for inputting machine learning data including one or more words;
specifying means for specifying, for each of the one or more words included in the machine learning data input to the input means, an appearance frequency in a plurality of sentences described in a predetermined field;
determining means for determining whether or not the appearance frequency specified by the specifying means exceeds a predetermined threshold; and
output means for deleting, from the machine learning data input to the input means, each word whose appearance frequency specified by the specifying means has been determined by the determining means to be equal to or lower than the predetermined threshold, and for outputting the resulting machine learning data.

The data processing apparatus according to claim 1, further comprising storage means for storing, for each word extracted from the plurality of sentences described in the predetermined field, data representing the appearance frequency of that word in the plurality of sentences,
wherein the specifying means specifies, for each of the one or more words included in the machine learning data input to the input means, the appearance frequency in the plurality of sentences described in the predetermined field by referring to the contents stored in the storage means.

The data processing apparatus according to claim 1, further comprising storage means for storing sentence data representing each of the plurality of sentences described in the predetermined field,
wherein the specifying means specifies, for each of the one or more words included in the machine learning data input to the input means, the appearance frequency in the plurality of sentences described in the predetermined field by referring to the contents stored in the storage means.

The data processing apparatus further comprising generation means for extracting the words included in the sentence represented by each item of sentence data stored in the storage means and generating a set of words for each item of sentence data,
wherein, for a word whose appearance frequency specified by the specifying means has been determined by the determining means to exceed the predetermined threshold, the output means deletes the word from the machine learning data before output if the set consisting of that word and the words included together with it in the machine learning data input to the input means is not similar to any of the sets generated by the generation means.

The data processing apparatus according to claim 1, wherein the machine learning data input to the input means has been created for a field different from the predetermined field,
the apparatus further comprises storage means for storing a plurality of technical terms of that different field, and
the output means, when a word whose appearance frequency specified by the specifying means has been determined by the determining means to exceed the predetermined threshold matches a technical term stored in the storage means, deletes the word from the machine learning data input to the input means before output.

The data processing apparatus according to claim 1, further comprising storage means for storing a plurality of general terms that are commonly used in any field,
wherein the output means, even for a word whose appearance frequency specified by the specifying means has been determined to be equal to or lower than the predetermined threshold, outputs the machine learning data input to the input means without deleting the word if the word matches a general term stored in the storage means.

A program causing a computer device to execute:
a first process of specifying, when machine learning data including one or more words is input, for each of the one or more words included in the machine learning data, an appearance frequency in a plurality of sentences described in a predetermined field;
a second process of determining whether or not the appearance frequency specified by the first process exceeds a predetermined threshold; and
a process of deleting, from the input machine learning data, each word whose appearance frequency specified by the first process has been determined by the second process to be equal to or lower than the predetermined threshold, and outputting the resulting machine learning data.