WO2015040860A1 - Classification dictionary generation device, classification dictionary generation method, and recording medium - Google Patents
Classification dictionary generation device, classification dictionary generation method, and recording medium
- Publication number
- WO2015040860A1 (PCT/JP2014/004776; JP2014004776W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- classification dictionary
- lower limit
- value
- classification
- information
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- The present invention relates to a classification dictionary generation device, a classification dictionary generation method, and a recording medium for generating a dictionary used to appropriately classify documents.
- An example of a system that automatically generates a classification dictionary by computer is described in Non-Patent Document 1.
- The system described in Non-Patent Document 1 learns an identification function (classification dictionary) for dividing an unclassified document into a target category and other categories, using a document set to which classification categories have been assigned in advance. Specifically, the system extracts words belonging to specific parts of speech from the documents included in that document set, associates each extracted word with a dimension of a vector, and creates, for each document, a vector in which the value of the corresponding dimension is 1 if the word appears in the document and 0 if it does not.
- The system then uses the set of vectors created from the documents and, with a support vector machine, learns an identification function that divides the documents of the target category, as a positive example set, from the documents of the other categories, as a negative example set.
- A support vector machine is a learning method that obtains an optimal separating hyperplane by maximizing the margin when separating given data into a positive example set and a negative example set in the vector space.
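- A minimal sketch of this kind of learning, assuming scikit-learn is available (the example documents, labels, and variable names below are illustrative assumptions and do not appear in the publication):

```python
# Illustrative sketch: learning a linear identification function (weight vector)
# with a support vector machine from 0/1 word-occurrence vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

documents = [
    "please confirm the schedule",  # assumed target-category document
    "thank you for your help",      # assumed non-target-category document
]
labels = [1, -1]  # +1 = positive example (target category), -1 = negative example

# dimension value is 1 if the word appears in the document, 0 otherwise
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

svm = LinearSVC()
svm.fit(X, labels)

# learned weight vector: one weight (positive or negative) per word/dimension
weights = dict(zip(vectorizer.get_feature_names_out(), svm.coef_[0]))
```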
- Patent Document 1 discloses, as an example of an identification function, a weight vector composed of weights assigned to each word of a specific part of speech (that is, to each dimension of the vector).
- Each weight takes a positive or negative value.
- The system described in Patent Document 1 extracts words from a target document and calculates, as the score of a category, the sum of the weights that the classification dictionary of that category assigns to the extracted words. The system then classifies the document into the category if the score is equal to or greater than a threshold value. That is, the appearance of a word with a positive weight raises the score of the target category, and conversely the appearance of a word with a negative weight lowers it.
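- A minimal sketch of this scoring and thresholding step, assuming the classification dictionary is held as a plain word-to-weight mapping (the function names and the threshold default are assumptions):

```python
# Illustrative sketch: score a document by summing the weights that the
# classification dictionary assigns to the words extracted from it.
def category_score(words, classification_dictionary):
    return sum(classification_dictionary.get(word, 0.0) for word in words)

def belongs_to_category(words, classification_dictionary, threshold=0.0):
    # classify into the category if the score reaches the threshold
    return category_score(words, classification_dictionary) >= threshold
```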
- In the systems of Patent Document 1 and Non-Patent Document 1, when a document containing information on a certain category (the target category) is classified into the target category, the score, which is the sum of the weights of the words appearing in the document, tends to be low if the document also contains much information (many words) unrelated to the target category. This is because, in that case, many of the appearing words have negative weights. Therefore, when the information of the target category is outweighed by other information, the systems of Patent Document 1 and Non-Patent Document 1 have the problem of generating a classification dictionary that calculates a low score for how strongly the document belongs to the category.
- In other words, the systems of Patent Document 1 and Non-Patent Document 1 cannot learn an identification function that predicts such a document to be a positive example. Furthermore, in the system of Non-Patent Document 1, the score of the identification function (classification dictionary) tends to be low in the above case, so such documents cannot be detected.
- The object of the present invention is to solve the above problem by providing a classification dictionary generation device, a classification dictionary generation method, and a recording medium that create a classification dictionary which calculates a higher score for the target category, compared with documents that do not contain target category information, even when the information corresponding to the target category is less than the information of non-target categories.
- A classification dictionary generation apparatus according to the present invention includes a lower limit storage unit that stores lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and a control unit that generates the classification dictionary based on learning data whose category is known.
- Based on the lower limit information, a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value is generated.
- A classification dictionary generation method according to the present invention stores lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and generates, based on learning data whose category is known and the stored lower limit information, a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value.
- A computer-readable recording medium according to the present invention records a program that causes a computer to execute a process of storing lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and a process of generating the classification dictionary based on learning data whose category is known and the stored lower limit information, such that all of the dimension values are equal to or higher than the lower limit value.
- According to the present invention, the score of the target category is calculated to be higher than for documents that do not contain target category information, even when the information corresponding to the target category is less than other information.
- The classification dictionary generation device according to the first embodiment of the present invention calculates an identification function from learning data whose category is known, corrects the calculated identification function using a lower limit value, and thereby creates a classification dictionary for classifying documents into categories.
- A first embodiment of the present invention will be described with reference to FIG. 1.
- The reference numerals attached to FIG. 1 are assigned to each element for convenience as an aid to understanding, and are not intended to be limiting in any way.
- FIG. 1 is a diagram showing an example of a classification dictionary generation device 10 according to the first embodiment of the present invention.
- the classification dictionary generation device 10 according to the first exemplary embodiment of the present invention includes a control unit 11, a lower limit storage unit 15, a learning data storage unit 16, and a classification dictionary storage unit 17.
- the control unit 11 includes an identification function calculation unit 12, a classification dictionary generation unit 13, and an interface unit 14.
- The interface unit 14 reads the learning data stored in the learning data storage unit 16 and outputs it to the identification function calculation unit 12. The interface unit 14 also writes the generated classification dictionary into the classification dictionary storage unit 17.
- The identification function calculation unit 12 calculates the identification function using the learning data.
- the learning data is, for example, a set of documents to which category information is assigned.
- the identification function is a function that divides each document into a target category and other categories by using a document set to which a classification category is assigned in advance.
- An example of the identification function is a weight vector.
- the classification dictionary generation unit 13 generates a classification dictionary related to the target category. For example, the classification dictionary generation unit 13 generates a classification dictionary using an identification function based on the lower limit information.
- the lower limit storage unit 15 stores lower limit information including a lower limit value. Details of the lower limit information will be described later with reference to FIG.
- the learning data storage unit 16 stores learning data.
- the classification dictionary storage unit 17 stores the classification dictionary generated by the classification dictionary generation unit 13.
- FIG. 5 is a diagram illustrating a configuration example of learning data stored in the learning data storage unit 16.
- The learning data is data in which "DID", the ID of a learning data document, "Learning data document", the body of the learning data document, and "Category", the category information of the learning data document, are associated with one another.
- For example, the learning data storage unit 16 stores the DID "2", the learning data document "Tanaka of XX. Thank you for your help. ...", and the category "no request" in association with one another. The meaning of "request" shown in FIG. 5 will be described later.
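- Purely as an illustration of this layout (the record with DID 1 below is an invented example and is not taken from FIG. 5), the learning data could be represented as follows:

```python
# Illustrative sketch of the learning data of FIG. 5: DID, document body, category.
learning_data = [
    {"DID": 1, "document": "Please confirm and reply by tomorrow.", "category": "requested"},      # hypothetical
    {"DID": 2, "document": "Tanaka of XX. Thank you for your help. ...", "category": "no request"},
]
```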
- a computer that implements the classification dictionary generation apparatus 10 according to the first embodiment of the present invention will be described with reference to FIG.
- FIG. 2 is a typical hardware configuration diagram of the classification dictionary generation apparatus 10 according to the first embodiment of this invention.
- the classification dictionary generation device 10 includes, for example, a CPU (Central Processing Unit) 1, a RAM (Random Access Memory) 2, a storage device 3, a communication interface 4, an input device 5, an output device 6, and the like.
- The identification function calculation unit 12 and the classification dictionary generation unit 13 are realized by the CPU 1 executing a program loaded into a main memory such as the RAM 2.
- The interface unit 14 is realized, for example, by the CPU 1 executing an application program using functions provided by the OS (Operating System) running on the CPU 1.
- the storage device 3 is, for example, a hard disk or a flash memory.
- the storage device 3 functions as a lower limit storage unit 15, a learning data storage unit 16, and a classification dictionary storage unit 17.
- the storage device 3 stores the above application program.
- the communication interface 4 is connected to the CPU 1 and connected to a network or an external storage medium. External data may be taken into the CPU 1 via the communication interface 4.
- the input device 5 is, for example, a keyboard, a mouse, or a touch panel.
- the output device 6 is a display, for example.
- the hardware configuration illustrated in FIG. 2 is merely an example, and each unit of the classification dictionary generation apparatus 10 illustrated in FIG. 1 may be configured with independent logic circuits.
- In the present embodiment, in order to consider classification for detecting a document that contains a request, that is, a request that the other party do something such as reply by e-mail or answer a question, the target category is set to "requested" and the non-target category to "not requested".
- The classification dictionary generation device 10 is not limited to the above classification; for example, in order to consider classification for detecting whether a certain document is a sports newspaper, the target category may be set to "sports newspaper" and the non-target category to "other than sports newspaper".
- In other words, the classification dictionary generation apparatus 10 of the present invention generates a classification dictionary based on a category targeted for classification (the target category) and the other, non-target categories.
- FIG. 3 is a flowchart showing the operation of the classification dictionary generation apparatus 10 according to the first embodiment of the present invention.
- S101 to S104 indicate processing steps of the operation example.
- The interface unit 14 reads the learning data stored in the learning data storage unit 16 and outputs it to the identification function calculation unit 12 (S101).
- The identification function calculation unit 12 calculates the identification function based on the learning data read by the interface unit 14 (S102). The detailed operation of the identification function calculation unit 12 will be described later with reference to the flowchart of FIG. 4.
- The classification dictionary generation unit 13 converts, in the calculated identification function (weight vector), every value that is lower than the lower limit value determined from the lower limit information stored in the lower limit storage unit 15 into that lower limit value, and outputs the result as a classification dictionary (S103).
- the interface unit 14 writes the classification dictionary generated by the classification dictionary generation unit 13 in the classification dictionary storage unit 17 (S104).
- FIG. 4 is a flowchart showing the operation of the identification function calculation unit 12 according to the first embodiment of the present invention.
- S201 to S202 indicate processing steps of the operation example.
- The identification function calculation unit 12 extracts, from each document in the learning data read by the interface unit 14, features reflecting the contents of the document, in this example all nouns, verbs, and auxiliary verbs in the document, and generates a feature vector (S201).
- FIG. 6 is a diagram illustrating a configuration example of the feature vectors calculated by the identification function calculation unit 12 from the learning data illustrated in FIG. 5.
- In the example shown in FIG. 6, each feature vector consists of the noun, verb, and auxiliary verb words that the identification function calculation unit 12 extracted by performing morphological analysis on the learning data, with a dimension value of "1" assigned to each of those words.
- That is, the features extracted when calculating the feature vectors are nouns, verbs, and auxiliary verbs. The identification function calculation unit 12 performs morphological analysis on the learning data, sets the dimension values of the feature words (nouns, verbs, and auxiliary verbs) to "1", and sets the dimension values of other words, such as particles, adjectives, and adverbs, to "0".
- Dimensions whose value is "0", that is, dimensions corresponding to words other than nouns, verbs, and auxiliary verbs, are not shown in FIG. 6, although feature vectors containing such "0" dimensions do exist.
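- A minimal sketch of this part of step S201, assuming a hypothetical `morphological_analysis` helper that returns (word, part-of-speech) pairs (for example, a wrapper around a morphological analyzer such as MeCab):

```python
# Illustrative sketch: build a 0/1 feature vector from the words of the
# selected parts of speech (nouns, verbs, auxiliary verbs).
FEATURE_POS = {"noun", "verb", "auxiliary verb"}

def build_feature_vector(document, morphological_analysis):
    vector = {}
    for word, pos in morphological_analysis(document):
        # feature words get dimension value 1; other words (particles,
        # adjectives, adverbs, ...) get dimension value 0
        vector[word] = 1 if pos in FEATURE_POS else 0
    return vector
```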
- In general, the identification function calculation unit 12 extracts features reflecting the contents of each document (hereinafter simply "features") from the learning data input by the interface unit 14, that is, from each document to which category information is assigned, and calculates (generates) a feature vector.
- A feature may be, in addition to words satisfying a predetermined condition appearing in a document, such as the nouns, verbs, and auxiliary verbs shown in FIG. 6, a phrase composed of a plurality of words, a clause, a partial character string, or a dependency relationship between two or more words or clauses, but is not limited to these.
- The identification function calculation unit 12 then calculates the identification function by machine learning from the generated feature vectors and the category information (information on whether or not each document belongs to the target category), treating the documents of the target category as positive examples and the documents of the non-target categories as negative examples (S202).
- the calculation method described in Non-Patent Document 1 may be used.
- the identification function is calculated by setting the positive example value to +1 and the negative example value to -1.
- As the machine learning, an arbitrary method that learns a weight for each dimension of a vector, taking a set of category-labeled vectors as input, can be used.
- In this example, the identification function calculation unit 12 calculates the identification function using a support vector machine as the machine learning method.
- Since the method by which the identification function calculation unit 12 calculates the identification function is known, details of the operation are omitted. An example of the identification function calculated by the identification function calculation unit 12 is shown in FIG. 8.
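- Continuing the sketch above, step S202 could be approximated as follows, assuming scikit-learn's DictVectorizer and LinearSVC are used (the function name and the target-category label are assumptions):

```python
# Illustrative sketch of step S202: learn the identification function
# (one weight per word, as in FIG. 8) from dict-style feature vectors.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def learn_identification_function(feature_vectors, categories, target="requested"):
    labels = [1 if c == target else -1 for c in categories]  # +1 positive, -1 negative
    dv = DictVectorizer()
    X = dv.fit_transform(feature_vectors)
    svm = LinearSVC().fit(X, labels)
    return dict(zip(dv.get_feature_names_out(), svm.coef_[0]))
```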
- FIG. 7 is a diagram illustrating a configuration example of the lower limit information stored in the lower limit storage unit 15.
- the lower limit information is data in which the ID of the lower limit information, a method (pattern) for determining the lower limit value, and the lower limit value are associated with each other.
- For example, the pattern for determining the lower limit value whose lower limit information ID is "(a)" is "set the lower limit value of the identification function (the learned weight vector) to a specific value", and the lower limit value determined by this pattern is "-1.0".
- FIG. 8 is a diagram showing the data of the identification function calculated by the identification function calculation unit 12 and the data of the classification dictionary generated by the classification dictionary generation unit 13 based on that identification function.
- When the lower limit information ID is "(a)", the data of the identification function is "confirmation 2.0, please 1.5, Tanaka -0.5, Yamada -2.0, wish -3.0, ...", and the data of the classification dictionary is "confirmation 2.0, please 1.5, Tanaka -0.5, Yamada -1.0, wish -1.0, ...".
- That is, the classification dictionary generation unit 13 generates a classification dictionary in which, among the dimensions of the identification function calculated by the identification function calculation unit 12, the dimension values corresponding to the non-target category (negative weights in this example) are equal to or higher than the lower limit value determined by the lower limit information stored in the lower limit storage unit 15.
- Here, a dimension of the identification function means a dimension of the weight vector.
- The classification dictionary generation unit 13 generates the classification dictionary using, for example, the pattern whose lower limit information ID is (a), that is, the pattern "set the lower limit value of the identification function to a specific value".
- In this pattern, a lower limit value is set in advance, and every value of the identification function (weight of the weight vector) obtained by the identification function calculation unit 12 through machine learning that is lower than the lower limit value is converted into that lower limit value. In this example, the lower limit value is -1.0, as shown in FIG. 7.
- When the lower limit information ID is (a), since the identification function is "confirmation" 2.0, "please" 1.5, "Tanaka" -0.5, "Yamada" -2.0, "wish" -3.0, ..., the classification dictionary generation unit 13 converts every value lower than the lower limit value of -1.0 into -1.0. Specifically, as shown in FIG. 8, the classification dictionary generation unit 13 converts, for example, the identification function value "Yamada -2.0" into "Yamada -1.0". As a result, when the lower limit information ID is (a), the classification dictionary generation unit 13 generates the classification dictionary "confirmation 2.0, please 1.5, Tanaka -0.5, Yamada -1.0, wish -1.0, ...".
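- A minimal sketch of pattern (a), assuming the identification function is held as a word-to-weight mapping:

```python
# Illustrative sketch of pattern (a): every weight below the fixed lower limit
# is converted into the lower limit value itself.
def apply_fixed_lower_limit(identification_function, lower_limit=-1.0):
    return {word: max(weight, lower_limit)
            for word, weight in identification_function.items()}

# e.g. {"confirmation": 2.0, "Yamada": -2.0, "wish": -3.0}
#   -> {"confirmation": 2.0, "Yamada": -1.0, "wish": -1.0}
```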
- The classification dictionary generation unit 13 may instead use the pattern whose lower limit information ID is (b), that is, "set the lower limit value to 30% of the minimum value of the identification function".
- In this pattern, a predetermined ratio greater than 0 and less than 1 is applied to the minimum of the values of the identification function obtained by machine learning (hereinafter referred to as the minimum value): the lower limit value is determined by multiplying the minimum value by the ratio, and every value of the identification function below that lower limit value is converted into the lower limit value. In this example, 30% of the minimum value of the identification function is used as the lower limit value.
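- A corresponding sketch of pattern (b), under the same assumptions:

```python
# Illustrative sketch of pattern (b): the lower limit is a predetermined ratio
# (greater than 0 and less than 1, here 30%) of the minimum weight; weights
# below that lower limit are converted into it.
def apply_ratio_lower_limit(identification_function, ratio=0.3):
    lower_limit = min(identification_function.values()) * ratio
    return {word: max(weight, lower_limit)
            for word, weight in identification_function.items()}
```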
- The patterns for determining the lower limit value in the lower limit information are not limited to those shown in FIG. 7. Specifically, the lower limit value for ID (a) of the lower limit information may be -0.9, and the method of determining the lower limit value for ID (b) of the lower limit information may be "33% of the minimum value of the identification function".
- The classification dictionary generation unit 13 may generate the classification dictionary by automatically selecting one of the patterns for determining the lower limit value (lower limit information IDs (a) to (c)) shown in FIG. 7, or may generate the classification dictionary using a pattern predetermined by the user.
- the learning data storage unit 16 stores learning data.
- The interface unit 14 reads the learning data stored in the learning data storage unit 16 and outputs it to the identification function calculation unit 12.
- The identification function calculation unit 12 calculates the identification function based on the learning data read by the interface unit 14.
- the classification dictionary generation unit 13 generates a classification dictionary based on the identification function calculated by the identification function calculation unit 12 and the lower limit information stored by the lower limit value storage unit 15.
- the interface unit 14 writes the classification dictionary generated by the classification dictionary generation unit 13 in the classification dictionary storage unit 17.
- the classification dictionary storage unit 17 stores the output classification dictionary.
- FIG. 9 is a diagram illustrating a configuration example of the classification dictionary generation apparatus 10 ′ according to the second embodiment of the present invention. Note that in the second embodiment of the present invention, the description of the same configuration as that of the first embodiment of the present invention is omitted.
- In the second embodiment, the classification dictionary generation unit 13' included in the control unit 11' generates a classification dictionary based on the lower limit information shown in FIG. 7.
- The classification dictionary generation unit 13' uses the pattern whose lower limit information ID is (c) shown in FIG. 7, that is, a method that imposes the lower limit on the weights as a constrained optimization problem during machine learning.
- In Equation (1), i represents the i-th document, y_i is a variable that takes the value 1 for the target category and -1 for the non-target category, x_i is the feature vector of the i-th document, and w · x_i denotes the inner product of w and x_i.
- By formulating learning as a constrained optimization problem in which a lower limit is set for each dimension of the weight vector, a lower limit can be introduced into, for example, logistic regression.
- Here w_j represents the value of the j-th dimension of the weight vector w, and α represents the lower limit value. The constraint is given by Equation (2):
- ∀j α < w_j (α < 0)   (2)
- To minimize Equation (1) under the constraint of Equation (2), an optimization algorithm that can handle box constraint optimization, such as L-BFGS-B, can be used.
- When α in Equation (2) is set to -1.0 (the lower limit value), as in the lower limit information ID (c) shown in FIG. 7, the classification dictionary generation unit 13' generates the classification dictionary shown in (c) of FIG. 8, that is, "confirmation 1.5, please 1.25, Tanaka -0.2, Yamada -1.0, wish -1.0, ...". In other words, the classification dictionary generation unit 13' calculates the weight vector by optimization as a constrained optimization problem in which the lower limit value of each dimension of the weight vector is a constraint, and generates the classification dictionary from the calculated weight vector.
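- A minimal sketch of pattern (c), assuming the unconstrained objective corresponds to a standard logistic loss over y_i(w · x_i) (the publication's Equation (1) is not reproduced here, so this objective is an assumption) and using SciPy's L-BFGS-B for the box-constrained minimization:

```python
# Illustrative sketch of pattern (c): learn the weight vector as a
# box-constrained optimization problem, with each dimension bounded below.
import numpy as np
from scipy.optimize import minimize

def learn_with_lower_limit(X, y, alpha=-1.0):
    """X: (n_docs, n_dims) dense feature matrix, y: +1/-1 labels, alpha: lower limit."""
    n_dims = X.shape[1]

    def loss(w):
        margins = y * (X @ w)
        return np.sum(np.log1p(np.exp(-margins)))  # assumed logistic loss

    def grad(w):
        margins = y * (X @ w)
        return X.T @ (-y / (1.0 + np.exp(margins)))

    # Equation (2): every dimension of w is constrained to be at least alpha.
    bounds = [(alpha, None)] * n_dims
    result = minimize(loss, np.zeros(n_dims), jac=grad,
                      method="L-BFGS-B", bounds=bounds)
    return result.x  # weight vector used to build the classification dictionary
```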
- As described above, instead of generating a classification dictionary by post-processing the learned identification function (weight vector), as the classification dictionary generation device 10 of the first embodiment does with the classification dictionary generation unit 13, the classification dictionary generation device 10' generates an optimal classification dictionary at the time of learning. Thereby, even when the information corresponding to the target category is less than the information of non-target categories, the classification dictionary generation device 10' can create a classification dictionary that calculates a higher score for the target category than for documents that do not contain target category information.
- FIG. 10 is a diagram illustrating a configuration example of the classification dictionary generation device 100 according to the third embodiment of the present invention. Note that in the third embodiment of the present invention, the description of the same configuration as each of the above embodiments is omitted.
- The classification dictionary generation apparatus 100 includes a lower limit storage unit 15 that stores lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and a control unit 110 that generates the classification dictionary based on learning data whose category is known.
- control unit 110 generates a classification dictionary in which all the dimension values are equal to or higher than the lower limit value based on the lower limit information stored in the lower limit value storage unit 15.
- The classification dictionary generation apparatus 100 having the above configuration stores lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and generates the classification dictionary based on learning data whose category is known. At this time, the classification dictionary generation apparatus 100 generates a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value, based on the stored lower limit information. As a result, the classification dictionary generation apparatus 100 can create a classification dictionary that calculates a higher score for the target category, compared with documents that do not contain target category information, even when the information corresponding to the target category is less than the non-target category information.
- The control unit 110 of the classification dictionary generation apparatus 100 may be a computer whose CPU (Central Processing Unit) (for example, the CPU 1 in FIG. 2) or MPU (Micro-Processing Unit) executes software that realizes the functions of the above-described embodiments.
- The control unit 110 of the classification dictionary generation device 100 stores the above-described program in, for example, the storage device 3 in FIG. 2.
- the storage device 3 includes, for example, a computer-readable storage device such as a hard disk device, and various storage media such as a CD-R (Compact Disc Recordable).
- the computer may acquire software (program) for realizing the functions of the above-described embodiments via a network.
- The above-described program of the classification dictionary generation apparatus 100 causes a computer to execute at least (1) a process of storing lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document, and (2) a process of generating the classification dictionary based on learning data whose category is known.
- The process of generating the classification dictionary is a process that generates a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value, based on the stored lower limit information.
- The computer of the classification dictionary generation apparatus 100 reads out and executes the program code of the acquired software (program). Thus, the classification dictionary generation device 100 can execute the same processing as that of the classification dictionary generation devices in each of the above-described embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The classification dictionary creation device according to the first embodiment of the present invention calculates an identification function from learning data whose category is known, corrects the calculated identification function using a lower limit value, and creates a classification dictionary for classifying documents into categories.
<Second embodiment>
A second embodiment of the present invention will be described. FIG. 9 is a diagram showing a configuration example of a classification dictionary generation device 10' according to the second embodiment of the present invention. In the second embodiment of the present invention, descriptions of configurations that are the same as in the first embodiment of the present invention are omitted.
∀j α < w_j (α < 0)   (2)
To minimize Equation (1) under the constraint of Equation (2), an optimization algorithm that can handle box constraint optimization, such as L-BFGS-B, can be used. When α in Equation (2) is set to -1.0 (the lower limit value), as in the lower limit information ID (c) shown in FIG. 7, the classification dictionary generation unit 13' generates the classification dictionary shown in (c) of FIG. 8, that is, "confirmation 1.5, please 1.25, Tanaka -0.2, Yamada -1.0, wish -1.0, ...". In other words, the classification dictionary generation unit 13' calculates the weight vector by optimization as a constrained optimization problem in which the lower limit value of each dimension of the weight vector is a constraint, and generates the classification dictionary from the calculated weight vector.
<Third embodiment>
A third embodiment of the present invention will be described. FIG. 10 is a diagram showing a configuration example of a classification dictionary generation device 100 according to the third embodiment of the present invention. In the third embodiment of the present invention, descriptions of configurations that are the same as in the above embodiments are omitted.
2 RAM
3 Storage device
4 Communication interface
5 Input device
6 Output device
10 Classification dictionary generation device
10' Classification dictionary generation device
11 Control unit
11' Control unit
12 Identification function calculation unit
13 Classification dictionary generation unit
13' Classification dictionary generation unit
14 Interface unit
15 Lower limit storage unit
16 Learning data storage unit
17 Classification dictionary storage unit
100 Classification dictionary generation device
110 Control unit
Claims (10)
- 1. A classification dictionary generation device comprising: lower limit storage means for storing lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document; and control means for generating the classification dictionary based on learning data whose category is known, wherein the control means generates, based on the lower limit information stored in the lower limit storage means, a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value.
- 2. The classification dictionary generation device according to claim 1, wherein the learning data includes a set of documents to which category information is assigned, and the control means extracts, for each document in the set of documents, features reflecting the contents of the document to calculate a feature vector, and generates a classification dictionary in which, among the dimension values of the classification dictionary, the dimension values corresponding to non-target categories are equal to or higher than the lower limit value.
- 3. The classification dictionary generation device according to claim 1 or 2, further comprising identification function calculation means for calculating an identification function from the learning data, wherein the control means generates the classification dictionary based on the identification function calculated by the identification function calculation means and the lower limit information stored in the lower limit storage means.
- 4. The classification dictionary generation device according to claim 3, wherein the lower limit storage means stores lower limit information that sets, among the dimension values of the identification function, the dimension values smaller than the predetermined lower limit value to the lower limit value.
- 5. The classification dictionary generation device according to claim 3, wherein the lower limit storage means stores lower limit information that determines the lower limit value as the product of the minimum of the dimension values of the identification function and a predetermined ratio greater than 0 and less than 1, and sets that lower limit value as the value of the identification function.
- 6. The classification dictionary generation device according to any one of claims 1 to 5, further comprising learning data storage means for storing the learning data and classification dictionary storage means for storing the classification dictionary, wherein the control means writes the classification dictionary into the classification dictionary storage means.
- 7. The classification dictionary generation device according to claim 1 or 2, wherein the control means calculates a weight vector by optimization as a constrained optimization problem in which the lower limit value of each dimension value of the weight vector is a constraint, and generates the classification dictionary from the calculated weight vector.
- 8. The classification dictionary generation device according to claim 3, wherein the identification function calculation means calculates the identification function using, as the features, at least one of a word appearing in a document, a phrase composed of a plurality of words, a clause, a partial character string, and a dependency relationship between two or more words or clauses.
- 9. A classification dictionary generation method comprising: storing lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document; and generating, based on learning data whose category is known and the stored lower limit information, a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value.
- 10. A computer-readable recording medium recording a program that causes a computer to execute: a process of storing lower limit information for determining a lower limit value of the dimension values of a classification dictionary for classifying the category of a document; and a process of generating the classification dictionary based on learning data whose category is known, wherein the process of generating the classification dictionary generates, based on the stored lower limit information, a classification dictionary in which all of the dimension values are equal to or higher than the lower limit value.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015537559A JP6436086B2 (ja) | 2013-09-18 | 2014-09-17 | 分類辞書生成装置、分類辞書生成方法及びプログラム |
US14/915,797 US20160224654A1 (en) | 2013-09-18 | 2014-09-17 | Classification dictionary generation apparatus, classification dictionary generation method, and recording medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2013-192674 | 2013-09-18 | ||
JP2013192674 | 2013-09-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015040860A1 true WO2015040860A1 (ja) | 2015-03-26 |
Family
ID=52688524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/004776 WO2015040860A1 (ja) | 2013-09-18 | 2014-09-17 | 分類辞書生成装置、分類辞書生成方法及び記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20160224654A1 (ja) |
JP (1) | JP6436086B2 (ja) |
WO (1) | WO2015040860A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717040A (zh) * | 2019-09-18 | 2020-01-21 | 平安科技(深圳)有限公司 | 词典扩充方法及装置、电子设备、存储介质 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200082282A1 (en) * | 2018-09-10 | 2020-03-12 | Purdue Research Foundation | Methods for inducing a covert misclassification |
US20230196034A1 (en) * | 2021-12-21 | 2023-06-22 | International Business Machines Corporation | Automatically integrating user translation feedback |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010271800A (ja) * | 2009-05-19 | 2010-12-02 | Nippon Telegr & Teleph Corp <Ntt> | 回答文書分類装置、回答文書分類方法及びプログラム |
JP2013061718A (ja) * | 2011-09-12 | 2013-04-04 | Nippon Telegr & Teleph Corp <Ntt> | サポートベクタ選択装置、方法、及びプログラム |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6192360B1 (en) * | 1998-06-23 | 2001-02-20 | Microsoft Corporation | Methods and apparatus for classifying text and for building a text classifier |
US8176004B2 (en) * | 2005-10-24 | 2012-05-08 | Capsilon Corporation | Systems and methods for intelligent paperless document management |
US9275129B2 (en) * | 2006-01-23 | 2016-03-01 | Symantec Corporation | Methods and systems to efficiently find similar and near-duplicate emails and files |
JP5316158B2 (ja) * | 2008-05-28 | 2013-10-16 | 株式会社リコー | 情報処理装置、全文検索方法、全文検索プログラム、及び記録媒体 |
US20140105447A1 (en) * | 2012-10-15 | 2014-04-17 | Juked, Inc. | Efficient data fingerprinting |
- 2014
  - 2014-09-17 WO PCT/JP2014/004776 patent/WO2015040860A1/ja active Application Filing
  - 2014-09-17 US US14/915,797 patent/US20160224654A1/en not_active Abandoned
  - 2014-09-17 JP JP2015537559A patent/JP6436086B2/ja active Active
Also Published As
Publication number | Publication date |
---|---|
JPWO2015040860A1 (ja) | 2017-03-02 |
JP6436086B2 (ja) | 2018-12-12 |
US20160224654A1 (en) | 2016-08-04 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14846621; Country of ref document: EP; Kind code of ref document: A1
 | ENP | Entry into the national phase | Ref document number: 2015537559; Country of ref document: JP; Kind code of ref document: A
 | WWE | Wipo information: entry into national phase | Ref document number: 14915797; Country of ref document: US
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 14846621; Country of ref document: EP; Kind code of ref document: A1