WO2017149911A1 - Document classification device, document classification method, and document classification program - Google Patents
Document classification device, document classification method, and document classification program
- Publication number
- WO2017149911A1 (PCT application PCT/JP2016/088160)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- path
- document
- node
- correct
- classification model
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
Definitions
- One aspect of the present invention relates to an apparatus, a method, and a program for classifying a document using a tree structure.
- Patent Document 1 below describes automatic classification generation in which information categories are organized as a binary tree whose nodes hold information used for search in the hierarchical classification of documents.
- The document classification device includes a generation unit that executes first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node for that document; and an update unit that executes second machine learning applying a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updates the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- The document classification method is executed by a document classification device including a processor, and comprises: a generation step of executing first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node for that document; and an update step of executing second machine learning that applies a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- The document classification program causes a computer to execute: a generation step of executing first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node; and an update step of executing second machine learning that applies a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- According to these aspects, a classification model is first generated by machine learning using target documents to which correct answers have been assigned (so-called supervised learning). Then, in machine learning that applies a target document to the classification model without giving the correct answer, if the obtained path differs from the correct path, the process does not simply proceed to a lower node; instead, a correction path to a node of another subtree is generated based on the correct path. This correction path makes it possible to steer the classification back toward the correct answer even after it has proceeded in the wrong direction. Using a classification model refined in this way increases the accuracy of document classification with a tree structure.
- the accuracy of document classification using a tree structure can be improved.
- the document classification device 10 is a computer system that classifies a plurality of electronic documents by associating document categories with individual electronic documents.
- An electronic document is data recorded on an arbitrary recording medium such as a database or a memory and readable by a computer, and includes text or character strings.
- an electronic document is also simply referred to as a “document”.
- A document category is a classification indicating the nature of a document. In this specification, a document category is also simply referred to as a “category”. “Associate” means linking one object to another so that the latter can be reached from the former through the association.
- a plurality of categories are organized in a tree structure.
- the tree structure is a data structure that represents a hierarchical relationship of a plurality of elements by a hierarchical structure in which one element has a plurality of child elements and one child element has a plurality of grandchild elements.
- Each element in the tree structure is called a node, and the two nodes are connected by a line called a link.
- a hierarchical structure of categories expressed in this tree structure is referred to as a “category tree”.
- Each node in the category tree indicates a category.
- This category tree is prepared in advance by hand and stored in a predetermined storage device (for example, a storage unit in the document classification device 10).
- the document classification device 10 determines the category of the document by sequentially processing the document from the uppermost layer (root node) to the lowermost layer (terminal node) according to the category tree.
- The root node is the first hierarchy, and the hierarchy number increases by one at each level below it: the second hierarchy, the third hierarchy, and so on.
- the hierarchy number in the category tree is represented by a natural number.
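As an illustration only (the patent does not specify an implementation), the category tree and its hierarchy numbering can be sketched as follows; the class and field names are hypothetical:

```python
# Minimal sketch of a category tree; class and field names are illustrative,
# not taken from the patent.
class CategoryNode:
    def __init__(self, name, depth):
        self.name = name          # category label, e.g. "camera"
        self.depth = depth        # hierarchy number: root = 1
        self.children = []        # child categories

    def add_child(self, name):
        child = CategoryNode(name, self.depth + 1)
        self.children.append(child)
        return child

    def is_terminal(self):
        return not self.children

# Build part of the "home appliance" example described later (FIG. 3).
root = CategoryNode("home appliance", 1)
camera_photo = root.add_child("camera & photo")
camera = camera_photo.add_child("camera")
film = camera.add_child("film camera")

print(film.depth)          # 4: this terminal node sits in the fourth hierarchy
print(film.is_terminal())  # True
```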
- Classification using a tree structure can reduce overall computation and memory usage.
- classification using a tree structure is executed locally. For this reason, once an incorrect node (category) is reached, the classification proceeds toward a node below the incorrect node, and the document is classified into a category having low relevance (error propagation).
- the document classification device 10 attempts to prevent the error propagation by executing a process using imitation learning.
- Imitation learning is a method of learning a policy that reproduces the behavior of an expert, i.e., an agent that takes ideal actions.
- a policy is a mapping from the current state to the next action, which can be approximated by a classifier.
- Imitation learning itself is well known; one such method is Dataset Aggregation (DAGGER).
- the document classification device 10 generates a classification model using imitation learning.
- the classification model is a policy in which a path (route) for guiding a document to be processed (referred to as “target document” in this specification) from a start point (for example, a root node) to a terminal node is defined.
- the document classification device 10 refers to the document database 20 to generate a classification model.
- the document database 20 is a device that stores a large number of documents.
- the “database” is a device (storage unit) that stores a data set so as to cope with any data operation (for example, extraction, addition, deletion, overwriting, etc.) from a processor or an external computer.
- the implementation method of the document database 20 is not limited. For example, a database management system or a text file may be used.
- the document classification device 10 can read the document by accessing the document database 20 via an arbitrary communication network. Note that it is not essential to separate the document classification device 10 and the document database 20, and the document classification device 10 may include the document database 20.
- the method for collecting documents stored in the document database 20 is not limited.
- the document database 20 may store web pages collected from the Internet by crawling as documents, or may store manually registered documents.
- The content of the document is not limited; examples include news (e.g., a title or body text), comments on a social networking service (SNS), and product pages on an online shopping site (e.g., a product title or description).
- a correct path in the category tree is assigned in advance to at least a part of the document stored in the document database 20.
- The correct path is the ideal (correct) path from the start point (root node) of the category tree to the correct terminal node (terminal category) that should be associated with the document. This correct path is assigned manually.
- the document database 20 also stores a document for evaluating the generated classification model.
- FIG. 1 shows a general hardware configuration of a computer 100 that functions as the document classification device 10.
- the computer 100 includes a processor 101, a main storage unit 102, an auxiliary storage unit 103, a communication control unit 104, an input device 105, and an output device 106.
- the processor 101 is an electronic component that executes an operating system and application programs.
- the main storage unit 102 is an electronic component that temporarily stores programs to be executed and data, and includes, for example, a ROM and a RAM.
- the auxiliary storage unit 103 is an electronic component that permanently stores data to be processed or processed data, and includes a storage device such as a hard disk or a flash memory.
- the communication control unit 104 is an electronic component that transmits / receives data to / from another device via a wired or wireless connection, and includes, for example, a network card or a wireless communication module.
- the input device 105 is a device that receives input from the user, and is, for example, a keyboard and a mouse.
- the output device 106 is a device that outputs data designated or processed by the processor 101 in a manner that a person can recognize, and is, for example, a monitor and a printer.
- The document classification device 10 may be composed of a single computer or a plurality of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet so that one document classification device 10 is logically constructed.
- FIG. 2 shows a functional configuration of the document classification device 10.
- the document classification device 10 includes a generation unit 11, an update unit 12, and an evaluation unit 13 as functional components.
- the generation unit 11 and the update unit 12 correspond to a classifier.
- These functional elements are realized by reading predetermined software (a document classification program P1 described later) on the processor 101 or the main storage unit 102 and executing the software.
- the processor 101 operates the communication control unit 104, the input device 105, or the output device 106 in accordance with the software, and reads and writes data in the main storage unit 102 or the auxiliary storage unit 103. Data or a database necessary for processing is stored in the main storage unit 102 or the auxiliary storage unit 103.
- the generation unit 11 is a functional element that generates a classification model by executing machine learning (first machine learning) using input data (training data) as a target document to which a correct path is assigned.
- the generation unit 11 reads a document with a correct answer path from the document database 20.
- the generation unit 11 executes machine learning using a document with a correct path as input data, thereby generating a classification model indicating a correct path to the terminal node (correct terminal node) for the document.
- The generation unit 11 executes this processing for a plurality of target documents (for example, a large number of target documents), each with an assigned correct path, thereby generating a classification model indicating the correct path for each of those documents. The processing in the generation unit 11 can be described as “supervised learning”.
- the generation unit 11 outputs the generated classification model as an “initial classification model” to the update unit 12 together with the set of processed target documents.
- the update unit 12 is a functional element that updates the classification model by executing machine learning (second machine learning) that applies the target document to which the correct path is not assigned to the initial classification model.
- the update unit 12 uses the target document input from the generation unit 11, that is, the target document used for generating the initial classification model.
- the update unit 12 executes machine learning using a target document to which a correct path is not given as input data (training data). This means that machine learning is performed without referring to the correct path.
- Every time a path from the N-th hierarchy to the (N+1)-th hierarchy is obtained by this machine learning, the update unit 12 refers to the correct path and judges whether the obtained path differs from it (N is a natural number). If the two paths differ, that is, if the machine learning result is an error, the update unit 12 sets a correction path that corrects the error based on the correct path.
- This corrected path is a path from an incorrect node in the N + 1 hierarchy to a correct node in the N + 2 hierarchy (this correct node is included in the correct path).
- The update unit 12 then learns the path from the (N+1)-th hierarchy to the (N+2)-th hierarchy taking the correction path into account. As a result of this learning, the path may return to the correct path, or may still advance to a child node of the wrong (N+1)-th-hierarchy node.
- The update unit 12 executes this process for a plurality of target documents, none of which is given a correct path. Suppose that, in processing a certain target document, the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path. In this case, based on the correct path of that document, the update unit 12 may update the classification model by setting a correction path from the (N+1)-th-hierarchy node (the incorrect node) to an (N+2)-th-hierarchy node that is not a child of it (and that is included in the correct path). Through such imitation learning, paths are generated that return a path that has once advanced in the wrong direction to the correct path.
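The correction-path rule described above can be sketched as follows; the function and variable names are hypothetical, and the example paths reproduce the Rc/R1 example given later in the description:

```python
# Sketch of the correction-path rule: when the predicted node at hierarchy
# N+1 deviates from the correct path, add an edge from that wrong node to
# the correct node at hierarchy N+2. All names are illustrative.
def correction_paths(predicted, correct):
    """predicted / correct: lists of node names, index 0 = root (hierarchy 1).
    Returns (wrong_node, target_node) pairs to add to the classification model."""
    corrections = []
    for n in range(1, len(predicted)):
        if predicted[n] != correct[n] and n + 1 < len(correct):
            # Edge not present in the category tree:
            # wrong node at hierarchy N+1 -> correct node at hierarchy N+2.
            corrections.append((predicted[n], correct[n + 1]))
    return corrections

# Path R1 of FIG. 5: a wrong turn at the second hierarchy.
rc = ["home appliance", "camera & photo", "camera", "film camera"]
r1 = ["home appliance", "TV & accessories", "camera", "film camera"]
print(correction_paths(r1, rc))  # [('TV & accessories', 'camera')]
```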
- When a correction path has already been set at that point, the update unit 12 may follow the existing correction path and proceed with the subsequent processing. In this case, of course, the update unit 12 may reuse that correction path without generating a new one.
- When the obtained path coincides with the correct path, the update unit 12 simply proceeds without setting a correction path.
- the update unit 12 outputs the classification model obtained by executing the above processing for a plurality of target documents to the evaluation unit 13 as “updated classification model”.
- The update unit 12 may, in response to an instruction from the evaluation unit 13, execute on the updated classification model the same process (that is, the second machine learning) that it executed on the initial classification model.
- This re-execution may set a further correction path for the updated classification model.
- the evaluation unit 13 is a functional element that evaluates the updated classification model.
- the evaluation of the classification model is a process for determining whether or not the classification model can classify a document into a correct category at a certain level or higher.
- the evaluation unit 13 reads out evaluation data (a set of documents for evaluation) from the document database 20 and applies the data to the updated classification model, thereby associating the category with each document. Then, the evaluation unit 13 evaluates the processing result using a predetermined evaluation method, and determines whether or not the evaluation value satisfies a predetermined standard.
- the evaluation method is not limited.
- For example, Micro F1, which is commonly used in hierarchical document classification, may be used; an average revenue loss rate (ARL: Average Revenue Loss) may also be used in addition to Micro F1.
- the standard that the evaluation value should satisfy is not limited.
- the evaluation unit 13 may determine that the evaluation value satisfies the criterion when the evaluation value is greater than or equal to a predetermined threshold value.
- the evaluation unit 13 may determine that the evaluation value satisfies the criterion when the degree of convergence of the evaluation value is within a predetermined range (when the difference from the previous evaluation value is equal to or less than a predetermined threshold).
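As a non-authoritative sketch of the Micro F1 measure mentioned above (the patent does not define its computation), micro-averaged F1 over predicted category sets can be written as:

```python
# Micro-averaged F1 over predicted category sets; a common way to score
# hierarchical classification. The label sets below are illustrative.
def micro_f1(gold_sets, pred_sets):
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    pred_total = sum(len(p) for p in pred_sets)
    gold_total = sum(len(g) for g in gold_sets)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"camera & photo", "camera", "film camera"}]
pred = [{"camera & photo", "camera", "flash"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```

The convergence-based criterion in the text would then compare successive values of this score and stop when their difference falls below a threshold.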
- When the evaluation value satisfies the criterion, the evaluation unit 13 adopts the updated classification model as the final result and outputs it.
- the output destination is not limited.
- The evaluation unit 13 may transmit the classification model to another device, output it from the output device 106 (for example, a monitor or a printer), or store it in a predetermined database.
- the output classification model can be used to classify any document.
- When the evaluation value does not satisfy the criterion, the evaluation unit 13 instructs the update unit 12 to perform the process again.
- the updating unit 12 executes the second machine learning on the updated classification model in response to this instruction.
- the standard used by the evaluation unit 13 may be set according to the attribute of the object described in the document.
- The type of object is not limited and may be any tangible object, intangible object, or event.
- An object can be said to be at least part of the content of a document.
- the attribute of the object may or may not be described in the document. For example, if the object is a product sold on an online shopping site, the attributes of the object may be the price of the product, the number of products sold, the sales amount of the product, and the like.
- the first line indicates that the data set D is initialized by assigning an empty set to the data set D indicating the classification model.
- the second to eighth lines show a loop process corresponding to generation of a classification model and one or more updates.
- The third line shows the setting of the mixed policy π_k used in the k-th iteration.
- π* denotes a policy called the “oracle,” which returns the correct path (the ideal path).
- π̂_k denotes the current policy (the policy trained on the seventh line, described later).
- β_k denotes the mixing ratio between the oracle and the current policy (in other words, the oracle's contribution ratio).
- The fourth line shows that paths on the category tree are sampled by machine learning using the mixed policy π_k.
- The fifth line indicates that a data set D_k, the set of paths newly obtained by this machine learning, is acquired.
- The sixth line indicates that the data set D (classification model) is updated by adding the data set D_k to it.
- The seventh line shows that the policy π̂_{k+1} is trained using the updated data set D.
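The loop on lines 1 to 8 can be sketched schematically as follows; all function names (`oracle`, `train`, `rollout`) are placeholders for components the patent leaves abstract, not its actual implementation:

```python
import random

# Schematic DAGGER-style loop matching lines 1-8 of the algorithm above.
def dagger(documents, oracle, train, rollout, n_iters, betas):
    dataset = []                      # line 1: D <- empty set
    policy = oracle                   # initial policy imitates the oracle
    for k in range(n_iters):          # lines 2-8
        beta = betas[k]               # line 3: mixing ratio beta_k

        def mixed(state):             # mixed policy pi_k
            return oracle(state) if random.random() < beta else policy(state)

        new_data = []
        for doc in documents:         # line 4: sample paths with pi_k
            for state in rollout(doc, mixed):
                # line 5: label every visited state with the oracle's action
                new_data.append((state, oracle(state)))
        dataset.extend(new_data)      # line 6: D <- D union D_k
        policy = train(dataset)       # line 7: train pi-hat_{k+1} on D
    return policy                     # line 9: the policy finally adopted
```

With β_1 = 1, the first iteration collects only oracle (correct) paths, which matches the description of the generation unit 11 below.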
- the first loop process corresponds to the generation unit 11.
- The generation unit 11 obtains a set of correct paths (data set D_1) according to the oracle π*, and uses that set as the data set D (classification model) as is. The generation unit 11 then obtains a policy π̂_2 corresponding to this data set D.
- the second and subsequent loop processes correspond to the update unit 12.
- The update unit 12 uses the mixed policy π_2 computed with a predetermined mixing ratio β_2. Since this mixed policy takes the current policy into account, it may take actions different from the oracle's, and a correction path may therefore be generated.
- The update unit 12 obtains the set of paths including corrections (data set D_2) and adds this set to the data set D, which holds the set of correct paths. The update unit 12 then obtains a policy π̂_3 corresponding to the updated data set D (classification model).
- the process after the third time is the same as the second time.
- the ninth line corresponds to the evaluation unit 13 that outputs the classification model adopted by the evaluation.
- FIG. 3 shows a category tree for home appliances.
- “home appliance” is a first hierarchy node (root node), and there are 3 nodes in the second hierarchy.
- Each second hierarchy node has a child node (third hierarchy node).
- some of the third hierarchy nodes have child nodes (fourth hierarchy nodes). All the fourth hierarchy nodes and the third hierarchy node having no child nodes are terminal nodes.
- An example of processing for the terminal node “film camera” in the fourth hierarchy is described below.
- FIG. 4 shows that the generation unit 11 corresponding to the first loop processing has obtained a correct answer path Rc [“home appliance” ⁇ “camera & photo” ⁇ “camera” ⁇ “film camera”].
- this correct answer path Rc is along a link of a given category tree.
- FIG. 5 shows that the update unit 12 corresponding to the second loop processing has obtained the following two paths R1 and R2.
- R1 ["Home Appliance” ⁇ “TV &Accessories” ⁇ "Camera” ⁇ “Film Camera”]
- R2 [“Home Appliance” ⁇ “Camera & Photo” ⁇ “Flash” ⁇ “Film Camera”]
- the path from the first hierarchy node “home appliance” to the second hierarchy node “TV & accessory” is different from the correct path Rc.
- The update unit 12 can generate a correction path from the second-hierarchy node “TV & accessory” to the third-hierarchy node “camera”, which is not a child node of “TV & accessory”.
- the path R1 can return to the correct path Rc to reach the end node “film camera”.
- the update unit 12 can generate a correction path from the third hierarchy node “flash” to the fourth hierarchy node “film camera” that is not a child node of the third hierarchy node “flash”. In this case, the path R2 finally reaches the end node “film camera” which is the end point of the correct path Rc.
- FIG. 6 shows that the following two paths R3 and R4 are obtained by re-execution of the update unit 12 based on an instruction from the evaluation unit 13.
- R3 [“Home Appliance” ⁇ “Personal Appliance” ⁇ “Camera” ⁇ “Film Camera”]
- R4 [“Home appliance” ⁇ “Personal appliance” ⁇ “Radio” ⁇ “Film camera”]
- the path from the first layer node “home appliance” to the second layer node “individual home appliance” is different from the correct path Rc.
- The update unit 12 can generate a correction path from the second-hierarchy node “personal appliance” to the third-hierarchy node “camera”, which is not a child node of “personal appliance”.
- the path R3 can return to the correct path Rc to reach the end node “film camera”.
- The update unit 12 may also generate a correction path from the third-hierarchy node “radio” to the fourth-hierarchy node “film camera”, which is not a child node of “radio”. In this case, the path R4 finally reaches the terminal node “film camera”, the end point of the correct path Rc.
- the correction path is not along the link of a given category tree.
- the generation unit 11 generates a classification model indicating a correct path by so-called supervised learning (that is, using an oracle) (step S11).
- the update unit 12 updates the classification model by imitation learning indicated by the algorithm (step S12).
- the evaluation unit 13 evaluates the updated classification model (step S13). If the evaluation value satisfies the criterion (YES in step S14), evaluation unit 13 outputs the classification model (step S15). On the other hand, when the evaluation value does not satisfy the standard (NO in step S14), the process returns to step S12, and the classification model is further updated.
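The control flow of steps S11 to S15 can be sketched as follows; the function names are placeholders, and the bound on the number of update rounds is an added safeguard, not part of the patent:

```python
# Sketch of the control flow of steps S11-S15; names are placeholders.
def build_classification_model(generate, update, evaluate, criterion, max_rounds=10):
    model = generate()                # S11: supervised learning with the oracle
    for _ in range(max_rounds):
        model = update(model)         # S12: imitation-learning update
        score = evaluate(model)       # S13: classify the evaluation documents
        if criterion(score):          # S14: does the score meet the standard?
            return model              # S15: output the adopted model
    return model                      # fall back after max_rounds updates
```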
- the document classification program P1 includes a main module P10, a generation module P11, an update module P12, and an evaluation module P13.
- the main module P10 is a part that comprehensively controls the generation of the classification model.
- the functions realized by executing the generation module P11, the update module P12, and the evaluation module P13 are the same as the functions of the generation unit 11, the update unit 12, and the evaluation unit 13, respectively.
- the document classification program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Alternatively, the document classification program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.
- As described above, the document classification device includes a generation unit that executes first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node for that document; and an update unit that executes second machine learning applying a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updates the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- The document classification method is executed by a document classification device including a processor, and comprises: a generation step of executing first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node for that document; and an update step of executing second machine learning that applies a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- The document classification program causes a computer to execute: a generation step of executing first machine learning using, as input data, a target document to which a correct path has been assigned in a tree structure whose nodes each indicate a document category, thereby generating a classification model indicating the correct path to the terminal node; and an update step of executing second machine learning that applies a target document without a correct path to the classification model and, when the path from an N-th-hierarchy node to an (N+1)-th-hierarchy node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th-hierarchy node to an (N+2)-th-hierarchy node that is not a child of the (N+1)-th-hierarchy node.
- According to these aspects, a classification model is first generated by machine learning using target documents to which correct answers have been assigned (so-called supervised learning). Then, in machine learning that applies a target document to the classification model without giving the correct answer, if the obtained path differs from the correct path, the process does not simply proceed to a lower node; instead, a correction path to a node of another subtree is generated based on the correct path. This correction path makes it possible to steer the classification back toward the correct answer even after it has proceeded in the wrong direction.
- By using a classification model refined in this way, the accuracy of document classification using a tree structure can be increased. For example, classification that has proceeded in the wrong direction can finally be guided to the correct terminal node (terminal category). Even when the classification cannot be guided to the correct terminal node, it can be guided to another terminal node that is highly related or similar to it (a terminal node that is a sibling of the correct terminal node).
- The update unit may set the correction path from the (N+1)-th level node to the (N+2)-th level node included in the correct path.
- In this case, the accuracy of document classification can be improved by returning a path that has once gone in the wrong direction to the correct path.
- The update unit may execute the second machine learning without using a policy that returns the correct path.
- In this case, the probability that an obtained path is wrong increases, so more correction paths come to be set in the classification model.
- As a result, the accuracy of document classification can be increased accordingly.
- Alternatively, the update unit may execute the second machine learning using a mixture of a policy that returns the correct path and a trained policy. Since the mixed policy includes clues to the correct path, path errors in the second machine learning may be reduced. As a result, fewer correction paths need to be set, and the overall time required to generate the classification model can be expected to be shortened.
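A minimal sketch of such a mixed policy (assumed interface, not from the patent: a policy maps a state to the next child node, and the mixing coefficient beta chooses between the correct-path policy and the trained policy):

```python
import random

# Minimal sketch, not the patented implementation. A policy maps a state
# to the next child node; with probability beta the correct-path policy
# pi_star is followed, otherwise the trained policy pi_hat is used.
def mixed_policy(pi_star, pi_hat, beta):
    def pi_k(state):
        return pi_star(state) if random.random() < beta else pi_hat(state)
    return pi_k

pi_star = lambda state: "correct-child"    # policy that returns the correct path
pi_hat = lambda state: "predicted-child"   # trained policy
pi_k = mixed_policy(pi_star, pi_hat, beta=1.0)
print(pi_k("root"))
# → correct-child (beta=1.0 always follows the correct-path policy)
```

Decreasing beta over iterations shifts the rollouts from the correct-path policy toward the trained policy, which is the usual way such mixtures are scheduled.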
- The document classification device may further include an evaluation unit that classifies documents using the classification model updated by the update unit and evaluates that model. This makes it possible to assess the quality of the classification model.
- When the evaluation value indicating the evaluation by the evaluation unit does not satisfy a predetermined criterion, the update unit may update the classification model again by executing the second machine learning that applies the target document to which no correct path has been given to the updated classification model. In this case, a classification model that satisfies a certain level of quality can be provided.
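The evaluate-then-update loop described above can be sketched as follows (hypothetical helper functions `evaluate` and `update`, not from the patent; the loop re-runs the second machine learning until the evaluation value meets the criterion):

```python
# Minimal sketch, not the patented implementation. `evaluate` returns an
# evaluation value for the current model; `update` re-runs the second
# machine learning. The loop stops once the predetermined criterion is met.
def train_until_criterion(model, docs, evaluate, update, criterion, max_rounds=10):
    for _ in range(max_rounds):
        if evaluate(model, docs) >= criterion:  # evaluation unit
            break
        model = update(model, docs)             # update unit re-runs learning
    return model

# Toy usage: the "model" is a number that improves by 1 per update round.
final = train_until_criterion(0, [], lambda m, d: m, lambda m, d: m + 1, criterion=3)
print(final)
# → 3
```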
- The predetermined criterion may be set according to attributes of the objects described in the documents.
- In this case, the strictness of the classification-model evaluation can be set according to the contents of the documents.
- The update unit 12 sets a correction path from the (N+1)-th level node to the (N+2)-th level node included in the correct path.
- However, the end point of the correction path need not be a node included in the correct path.
- For example, the update unit 12 may set a correction path to an (N+2)-th level node included in a subtree located between the subtree whose root is the wrong (N+1)-th level node and the subtree containing the correct path. In this case, even if the document is not classified into the correct category, the probability that it is classified into a category that has a sibling relationship with the correct category (a category highly related or similar to the correct category) becomes higher.
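A minimal sketch of this variant (assumptions, not from the patent: the (N+1)-level subtrees are ordered children of the level-N node, and `grandchildren` maps each of them to its (N+2)-level nodes):

```python
# Minimal sketch, not the patented implementation. Picks an (N+2)-level
# target from a subtree located between the wrong subtree and the subtree
# containing the correct path.
def sibling_correction_target(siblings, wrong_child, correct_child, grandchildren):
    i, j = siblings.index(wrong_child), siblings.index(correct_child)
    between = siblings[min(i, j) + 1:max(i, j)]   # subtrees between the two
    chosen = between[0] if between else correct_child
    return grandchildren[chosen][0]               # an (N+2)-level node there

siblings = ["TV & Accessories", "Personal Electronics", "Camera & Photo"]
grandchildren = {"Personal Electronics": ["Cameras", "Radios"],
                 "Camera & Photo": ["Cameras", "Flashes"]}
print(sibling_correction_target(siblings, "TV & Accessories", "Camera & Photo", grandchildren))
# → Cameras
```

Even though the chosen target is not on the correct path, it sits in a subtree adjacent to it, so the document ends up in a sibling category of the correct one.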
- The evaluation unit may be omitted.
- In that case, the update unit may output the final classification model after repeatedly updating it a predetermined number of times.
- Alternatively, the update unit may output the classification model after updating it only once.
- The document classification device may include a classification unit that classifies an arbitrary document using the finally obtained classification model. This classification unit can be said to represent the practical-use stage of the classification model.
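How such a classification unit might traverse the category tree can be sketched as follows (assumed interface, not from the patent: `children` maps a node to its candidate next nodes, including any correction edges, and `score` ranks candidates for a document):

```python
# Minimal sketch, not the patented implementation. Descends the category
# tree from the root, choosing the best-scoring child at each level, until
# a terminal node (terminal category) is reached.
def classify(doc, root, children, score):
    path, node = [root], root
    while children.get(node):
        node = max(children[node], key=lambda c: score(doc, c))
        path.append(node)
    return path

children = {"Home Electronics": ["Camera & Photo", "TV & Accessories"],
            "Camera & Photo": ["Film Cameras", "Flashes"]}
# Toy scorer: word overlap between the document and the category name.
score = lambda doc, cat: len(set(doc.lower().split()) & set(cat.lower().split()))
print(classify("an old film camera", "Home Electronics", children, score))
# → ['Home Electronics', 'Camera & Photo', 'Film Cameras']
```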
- the processing procedure of the document classification method executed by at least one processor is not limited to the example in the above embodiment.
- the document classification apparatus may omit some of the steps (processes) described above, or may execute the steps in a different order. Also, any two or more of the steps described above may be combined, or a part of the steps may be corrected or deleted.
- the document classification device may execute other steps in addition to the above steps.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
1: Initialize: D ← {}
2: for k = 1, 2, …, K do
3:     π_k ← β_k·π* + (1 − β_k)·π̂_k
4:     Sample T-step trajectories using π_k
5:     Get dataset D_k = {(φ(s), π*(s))} of states s visited by π_k
6:     Aggregate datasets: D ← D ∪ D_k
7:     Train classifier π̂_(k+1) on D
8: end for
9: Return best π̂_k on validation
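The listing above is a DAgger-style training loop. A minimal runnable sketch follows, with hypothetical stand-ins (not from the patent) for the expert policy π*, the trajectory sampling, and the classifier trainer:

```python
import random

# Minimal runnable sketch of the DAgger-style loop above (hypothetical
# stand-ins: pi_star is the expert/correct-path policy, `train` fits a
# classifier on the aggregated dataset, `states` are the visited states).
def dagger(states, pi_star, train, K=3):
    D = []                                    # 1: Initialize D <- {}
    pi_hat = lambda s: random.choice(["left", "right"])   # untrained policy
    for k in range(K):                        # 2: for k = 1..K
        beta = 0.5 ** k                       # decaying mixing coefficient
        pi_k = lambda s: pi_star(s) if random.random() < beta else pi_hat(s)  # 3
        visited = list(states)                # 4: rollouts of pi_k (stubbed here)
        D += [(s, pi_star(s)) for s in visited]   # 5-6: expert labels, aggregate
        pi_hat = train(D)                     # 7: train classifier on D
    return pi_hat                             # 9: return the learned policy

# Toy usage: the expert always goes "left"; training is a majority vote.
majority = lambda D: (lambda s, ys=[y for _, y in D]: max(set(ys), key=ys.count))
policy = dagger([0, 1, 2], lambda s: "left", majority)
print(policy(0))
# → left
```

The key property the loop preserves is dataset aggregation: each iteration labels the states actually visited with the expert's choice and retrains on the union of all collected data.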
R1: [“Home Electronics” → “TV & Accessories” → “Cameras” → “Film Cameras”]
R2: [“Home Electronics” → “Camera & Photo” → “Flashes” → “Film Cameras”]
R3: [“Home Electronics” → “Personal Electronics” → “Cameras” → “Film Cameras”]
R4: [“Home Electronics” → “Personal Electronics” → “Radios” → “Film Cameras”]
Claims (9)
- 1. A document classification device comprising:
a generation unit that generates a classification model indicating a correct path to a terminal node for a target document, by executing first machine learning using as input data the target document to which a correct path in a tree structure in which each node indicates a document category has been given; and
an update unit that executes second machine learning applying the target document to which the correct path has not been given to the classification model and, when a path from an N-th level node to an (N+1)-th level node differs from the correct path, updates the classification model by setting, based on the correct path, a correction path from the (N+1)-th level node to an (N+2)-th level node that is not a child node of the (N+1)-th level node.
- 2. The document classification device according to claim 1, wherein the update unit sets the correction path from the (N+1)-th level node to the (N+2)-th level node included in the correct path.
- 3. The document classification device according to claim 1 or 2, wherein the update unit executes the second machine learning without using a policy that returns the correct path.
- 4. The document classification device according to claim 1 or 2, wherein the update unit executes the second machine learning using a mixture of a policy that returns the correct path and a trained policy.
- 5. The document classification device according to any one of claims 1 to 4, further comprising an evaluation unit that classifies a document using the classification model updated by the update unit and evaluates the classification model.
- 6. The document classification device according to claim 5, wherein, when an evaluation value indicating the evaluation by the evaluation unit does not satisfy a predetermined criterion, the update unit re-executes the update of the classification model by executing the second machine learning that applies the target document to which the correct path has not been given to the updated classification model.
- 7. The document classification device according to claim 6, wherein the predetermined criterion is set according to an attribute of an object described in the document.
- 8. A document classification method executed by a document classification device comprising a processor, the method comprising:
a generation step of generating a classification model indicating a correct path to a terminal node for a target document, by executing first machine learning using as input data the target document to which a correct path in a tree structure in which each node indicates a document category has been given; and
an update step of executing second machine learning that applies the target document to which the correct path has not been given to the classification model and, when a path from an N-th level node to an (N+1)-th level node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th level node to an (N+2)-th level node that is not a child node of the (N+1)-th level node.
- 9. A document classification program for causing a computer to execute:
a generation step of generating a classification model indicating a correct path to a terminal node for a target document, by executing first machine learning using as input data the target document to which a correct path in a tree structure in which each node indicates a document category has been given; and
an update step of executing second machine learning that applies the target document to which the correct path has not been given to the classification model and, when a path from an N-th level node to an (N+1)-th level node differs from the correct path, updating the classification model by setting, based on the correct path, a correction path from the (N+1)-th level node to an (N+2)-th level node that is not a child node of the (N+1)-th level node.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/077,767 US11657077B2 (en) | 2016-03-03 | 2016-12-21 | Document classification device, document classification method and document classification program |
EP16892745.7A EP3425521A4 (en) | 2016-03-03 | 2016-12-21 | CLASSIFICATION DEVICE FOR DOCUMENTS, CLASSIFICATION PROCEDURE FOR DOCUMENTS AND DOCUMENT CLASSIFICATION PROGRAM |
JP2017519587A JP6148427B1 (ja) | 2016-03-03 | 2016-12-21 | 文書分類装置、文書分類方法、および文書分類プログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662303052P | 2016-03-03 | 2016-03-03 | |
US62/303,052 | 2016-03-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017149911A1 true WO2017149911A1 (ja) | 2017-09-08 |
Family
ID=59742765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2016/088160 WO2017149911A1 (ja) | 2016-03-03 | 2016-12-21 | 文書分類装置、文書分類方法、および文書分類プログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US11657077B2 (ja) |
EP (1) | EP3425521A4 (ja) |
WO (1) | WO2017149911A1 (ja) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2635259C1 (ru) * | 2016-06-22 | 2017-11-09 | Общество с ограниченной ответственностью "Аби Девелопмент" | Способ и устройство для определения типа цифрового документа |
US11551135B2 (en) | 2017-09-29 | 2023-01-10 | Oracle International Corporation | Techniques for generating a hierarchical model to identify a class among a plurality of classes |
JP7095439B2 (ja) * | 2018-07-02 | 2022-07-05 | 富士フイルムビジネスイノベーション株式会社 | 情報処理装置、情報処理システム、及び情報処理プログラム |
US20210182736A1 (en) * | 2018-08-15 | 2021-06-17 | Nippon Telegraph And Telephone Corporation | Learning data generation device, learning data generation method, and non-transitory computer readable recording medium |
US20220253591A1 (en) * | 2019-08-01 | 2022-08-11 | Nippon Telegraph And Telephone Corporation | Structured text processing apparatus, structured text processing method and program |
US11615236B1 (en) * | 2022-07-19 | 2023-03-28 | Intuit Inc. | Machine learning model based electronic document completion |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009122851A (ja) * | 2007-11-13 | 2009-06-04 | Internatl Business Mach Corp <Ibm> | データを分類する技術 |
JP2010170192A (ja) * | 2009-01-20 | 2010-08-05 | Yahoo Japan Corp | 階層構造改変処理装置、階層構造改変方法及びプログラム |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6947936B1 (en) * | 2001-04-30 | 2005-09-20 | Hewlett-Packard Development Company, L.P. | Method for a topic hierarchy classification system |
US7266548B2 (en) | 2004-06-30 | 2007-09-04 | Microsoft Corporation | Automated taxonomy generation |
US7644052B1 (en) * | 2006-03-03 | 2010-01-05 | Adobe Systems Incorporated | System and method of building and using hierarchical knowledge structures |
US20120166366A1 (en) * | 2010-12-22 | 2012-06-28 | Microsoft Corporation | Hierarchical classification system |
US9081854B2 (en) * | 2012-07-06 | 2015-07-14 | Hewlett-Packard Development Company, L.P. | Multilabel classification by a hierarchy |
US10262274B2 (en) * | 2013-07-22 | 2019-04-16 | Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi | Incremental learner via an adaptive mixture of weak learners distributed on a non-rigid binary tree |
US20150324459A1 (en) * | 2014-05-09 | 2015-11-12 | Chegg, Inc. | Method and apparatus to build a common classification system across multiple content entities |
-
2016
- 2016-12-21 WO PCT/JP2016/088160 patent/WO2017149911A1/ja active Application Filing
- 2016-12-21 US US16/077,767 patent/US11657077B2/en active Active
- 2016-12-21 EP EP16892745.7A patent/EP3425521A4/en not_active Ceased
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009122851A (ja) * | 2007-11-13 | 2009-06-04 | Internatl Business Mach Corp <Ibm> | データを分類する技術 |
JP2010170192A (ja) * | 2009-01-20 | 2010-08-05 | Yahoo Japan Corp | 階層構造改変処理装置、階層構造改変方法及びプログラム |
Non-Patent Citations (1)
Title |
---|
See also references of EP3425521A4 * |
Also Published As
Publication number | Publication date |
---|---|
US20190050755A1 (en) | 2019-02-14 |
US11657077B2 (en) | 2023-05-23 |
EP3425521A4 (en) | 2019-08-21 |
EP3425521A1 (en) | 2019-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2017149911A1 (ja) | 文書分類装置、文書分類方法、および文書分類プログラム | |
CN107944629B (zh) | 一种基于异质信息网络表示的推荐方法及装置 | |
US10614266B2 (en) | Recognition and population of form fields in an electronic document | |
WO2017216980A1 (ja) | 機械学習装置 | |
EP2909740B1 (en) | Ranking for inductive synthesis of string transformations | |
Zhang et al. | RotBoost: A technique for combining Rotation Forest and AdaBoost | |
US20070055655A1 (en) | Selective schema matching | |
US9710525B2 (en) | Adaptive learning of effective troubleshooting patterns | |
US8943084B2 (en) | Method, program, and system for converting part of graph data to data structure as an image of homomorphism | |
US10185725B1 (en) | Image annotation based on label consensus | |
US20120102417A1 (en) | Context-Aware User Input Prediction | |
US9020879B2 (en) | Intelligent data agent for a knowledge management system | |
US7343378B2 (en) | Generation of meaningful names in flattened hierarchical structures | |
US20140114949A1 (en) | Knowledge Management System | |
US10977573B1 (en) | Distantly supervised wrapper induction for semi-structured documents | |
JP6148427B1 (ja) | 文書分類装置、文書分類方法、および文書分類プログラム | |
US9720984B2 (en) | Visualization engine for a knowledge management system | |
KR20230054701A (ko) | 하이브리드 기계 학습 | |
US20140114903A1 (en) | Knowledge Management Engine for a Knowledge Management System | |
US20210097352A1 (en) | Training data generating system, training data generating method, and information storage medium | |
US20220067555A1 (en) | Creation Assisting Device, Creation Assisting Method, And Recording Medium | |
US20090030869A1 (en) | Visualization techniques for imprecise statement completion | |
US20180240356A1 (en) | Data-driven feedback generator for programming assignments | |
JP2023550510A (ja) | 推薦方法、装置、電子機器及び記憶媒体 | |
US12020008B2 (en) | Extensibility recommendation system for custom code objects |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2017519587 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2016892745 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2016892745 Country of ref document: EP Effective date: 20181004 |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16892745 Country of ref document: EP Kind code of ref document: A1 |