WO2018122931A1 - Information processing device, method, and program - Google Patents

Information processing device, method, and program

Info

Publication number
WO2018122931A1
Authority
WO
WIPO (PCT)
Prior art keywords
target data
classification
data
classified
new target
Prior art date
Application number
PCT/JP2016/088752
Other languages
French (fr)
Japanese (ja)
Inventor
真吏佳 金子
Original Assignee
株式会社Pfu
Priority date
Filing date
Publication date
Application filed by 株式会社Pfu
Priority to PCT/JP2016/088752
Publication of WO2018122931A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • This disclosure relates to a technique for classifying data.
  • Conventionally, an image classification apparatus has been proposed that derives a feature amount of each document image from a plurality of document images assigned a first classification label, executes clustering processing using the feature amounts, calculates the joint distance until the plurality of document images converge into one cluster, divides the cluster into a plurality of sub-clusters at a predetermined threshold of the joint distance, assigns a second classification label to the document images included in each sub-cluster, executes machine learning using the feature amounts and the second classification labels, and generates a classification rule that classifies a document image having a feature amount corresponding to the second classification label into the classification destination designated by the first classification label (see Patent Document 1).
  • Techniques such as this, which use the clustering result as learning input, have been proposed (see Patent Documents 1, 2, and 3).
  • A technique has also been proposed in which an information processing apparatus is provided with a feature amount calculation unit that calculates a feature amount of each of a plurality of pieces of document information to which common attribute information is assigned, a distance calculation unit that calculates, based on the calculated feature amounts, the distance in the feature amount space between the pieces of document information, and a distribution map creation unit that creates, based on the calculated distances, distribution map information in which each piece of document information is plotted in the feature amount space, thereby creating information that lets a user judge whether the classification of the document information is appropriate (see Patent Document 4).
  • the present disclosure has an object to more accurately determine whether new target data is target data of an unknown classification.
  • An example of the present disclosure is an information processing apparatus including: storage means for storing feature data of classified target data in association with the classification of the target data; target data reception means for receiving input of new target data; feature extraction means for generating feature data by extracting features from the new target data; clustering means for clustering a set of target data consisting of the classified target data and the new target data into “the number of classifications into which the classified target data is classified + 1” clusters, based on the feature data of the classified target data stored by the storage means and the feature data of the new target data generated by the feature extraction means; and query output means for outputting an inquiry about the classification of the new target data when, as a result of the clustering, a cluster containing only the new target data appears.
  • the present disclosure can be grasped as an information processing apparatus, a system, a method executed by a computer, or a program executed by a computer.
  • the present disclosure can also be understood as a program recorded on a recording medium readable by a computer, other devices, machines, or the like.
  • a computer-readable recording medium refers to a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and can be read from a computer or the like.
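  • FIG. 1 is a schematic diagram showing the configuration of the system according to the embodiment. FIG. 2 is a diagram showing an outline of the configuration of the scanner according to the embodiment. FIG. 3 is a diagram showing an outline of the functional configuration of the information processing apparatus according to the embodiment. FIG. 4 is a flowchart showing an outline of the flow of data classification processing according to the embodiment.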
  • In the present embodiment, the information processing apparatus, method, and program according to the present disclosure are described as implemented in a system that classifies image data, obtained by imaging a medium such as paper or a card with a scanner, by type of medium or by type of information recorded on the medium.
  • the information processing apparatus, method, and program according to the present disclosure can be widely used for techniques for classifying data, and the application target of the present disclosure is not limited to the example shown in the present embodiment.
  • FIG. 1 is a schematic diagram showing a configuration of a system according to the present embodiment.
  • the system according to the present embodiment includes an information processing apparatus 1 and a scanner 3.
  • The information processing apparatus 1 is a computer including a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage device 14 such as an EEPROM (Electrically Erasable and Programmable Read Only Memory), a communication unit 15 such as a NIC (Network Interface Card), an input device 16 such as a keyboard and a touch panel, and an output device 17 such as a display and a speaker.
  • FIG. 2 is a diagram showing an outline of the configuration of the scanner 3 according to this embodiment.
  • The scanner 3 according to the present embodiment is a device that acquires image data by capturing an image of a document, business card, receipt, photo, illustration, or the like set by the user. It includes a sheet feeder 36 that sends the document to the imaging unit 37, the imaging unit 37, a scan button 38, a CPU 31, a ROM 32, a RAM 33, a storage device 34, a communication unit 35, and the like.
  • In the present embodiment, the scanner 3 is exemplified as one that images a document while automatically feeding the document set on the sheet feeder 36.
  • the imaging method of the scanner is not limited.
  • the scanner may be of a type that images a document set at a reading position by a user.
  • the communication means, hardware configuration, and the like of the scanner that can employ the method according to the present embodiment are not limited to the examples in the present embodiment.
  • The present embodiment has described an example in which the scanner 3 is used as the imaging device of the present system.
  • the imaging device used in the present system is not limited to the scanner.
  • a camera may be employed as the imaging device.
  • the information processing apparatus may be distributed and implemented using a cloud or distributed computing technology, and the scanner may be built in the information processing apparatus.
  • The system shown in this embodiment classifies image data obtained with the scanner 3 according to the type of medium and the type of information recorded on the medium. If the data belongs to a classification that has already been learned, it is classified automatically using the learning results. If the data belongs to a classification that has never been learned, the system asks the user, receives the user's feedback, and learns the feedback content. The system thus provides a user interface through which the accuracy of automatic classification gradually improves.
  • The system shown in this embodiment uses unsupervised learning and supervised learning in two stages. Specifically, in the first stage it uses unsupervised learning (clustering), and when a cluster containing only the new target data is created, it determines that the classification is unknown and makes an inquiry. If it determines that the classification is not unknown, it estimates the classification using a supervised classification model in the second stage.
  • FIG. 3 is a diagram illustrating an outline of a functional configuration of the information processing apparatus 1 according to the present embodiment.
  • The information processing apparatus 1 reads a program recorded in the storage device 14 into the RAM 13 and executes it with the CPU 11, thereby functioning as an information processing apparatus including a storage unit 21, a target data reception unit 22, a feature extraction unit 23, a determination unit 24, a clustering unit 25, an estimation unit 26, a confirmation output unit 27, an inquiry output unit 28, a response reception unit 29, a classification determination unit 30, and a classification model management unit 20.
  • In the present embodiment, each function of the information processing apparatus 1 is executed by the CPU 11, which is a general-purpose processor. However, some or all of these functions may be executed by one or more dedicated processors, by a device installed at a remote location, or by a plurality of devices installed in a distributed manner using cloud technology or the like.
  • the storage unit 21 stores the feature data of the classified target data in association with the classification of the target data. Further, when the new target data classification is determined, the storage unit 21 stores the feature data of the new target data as classified target data in association with the determined classification. As a result, the accumulated number of classified target data increases, and the accuracy of the estimation process by the classification model described later improves.
  • the target data receiving unit 22 receives input of new target data.
  • the feature extraction unit 23 generates feature data by extracting features from new target data.
  • a feature vector is used as the feature data.
  • the method for converting the feature of the target data into data is not limited to a vector.
  • Before clustering, the determination unit 24 determines whether the feature data of the classified target data stored by the storage unit 21 includes feature data identical to the feature data of the new target data generated by the feature extraction unit 23. If identical feature data exists, the clustering process is skipped.
  • The clustering unit 25 clusters a set of target data consisting of the classified target data and the new target data into “the number of classifications into which the classified target data is classified + 1” clusters, based on the feature data of the classified target data stored in the storage unit 21 and the feature data of the new target data generated by the feature extraction unit 23.
  • When, as a result of the clustering, no cluster containing only the new target data appears, the estimation unit 26 determines that the new target data is likely to belong to an existing (known) classification and estimates the classification of the new target data. The estimation unit 26 makes this estimate based on the feature data of the classified target data accumulated by the storage unit 21, the classification of the classified target data, and the feature data of the new target data.
  • the confirmation output unit 27 performs an output for confirming the result of the estimation by the estimation unit 26 via the output device 17.
  • When, as a result of the clustering, a cluster containing only the new target data appears, the inquiry output unit 28 determines that the new target data is likely to belong to an unknown classification and performs, via the output device 17, an output inquiring about the classification of the new target data.
  • the response reception unit 29 receives an input of a response by the user with respect to an inquiry output or a confirmation output to the user via the input device 16.
  • the classification determination unit 30 determines the classification of new target data according to the response to the query output or confirmation output for the user.
  • the classification model management unit 20 holds a classification model used in the estimation process by the estimation unit 26. Further, the classification model management unit 20 generates or updates a classification model based on the feature data of the new target data and the classification determined by the classification determination unit 30. Thereby, the estimation part 26 can estimate the classification
  • For the classification model, a general classification algorithm such as a pattern recognition model (learning model) using supervised learning, for example an SVM (support vector machine), may be used.
  • The classification model is generated by learning from teacher data consisting of feature data (feature vectors) and the corresponding classification labels.
  • Accuracy may be verified using a method such as cross-validation, and a model with improved accuracy may be adopted.
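To make the idea of learning from teacher data concrete, the following is a minimal sketch of a supervised classification model. It deliberately swaps the SVM named in the text for a much simpler nearest-centroid classifier so that it runs with the Python standard library only; the function names `train_nearest_centroid` and `estimate` and the sample data are assumptions for illustration, not part of the disclosure.

```python
import math
from collections import defaultdict

def train_nearest_centroid(teacher_data):
    """Learn one centroid per classification label.

    teacher_data: list of (feature_vector, label) pairs (the teacher data).
    Returns a dict mapping each label to the mean of its feature vectors.
    """
    groups = defaultdict(list)
    for vector, label in teacher_data:
        groups[label].append(vector)
    return {label: [sum(column) / len(vectors) for column in zip(*vectors)]
            for label, vectors in groups.items()}

def estimate(model, vector):
    """Return the label whose centroid is closest (Euclidean) to vector."""
    return min(model, key=lambda label: math.dist(model[label], vector))

# Hypothetical teacher data: feature vectors paired with classification labels.
teacher = [([0.0, 0.0], "business card"),
           ([0.0, 2.0], "business card"),
           ([10.0, 10.0], "receipt")]
model = train_nearest_centroid(teacher)
```

A real system following the text would more likely train an SVM and compare candidate models by cross-validation; the sketch only shows the shape of the inputs and outputs.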
  • FIG. 4 is a flowchart showing an outline of the flow of data classification processing according to the present embodiment.
  • the data classification process according to the present embodiment is executed when the information processing apparatus 1 receives the image data transmitted by the scanner 3.
  • In step S101, input of new target data to be classified is accepted.
  • the scanner 3 captures the paper medium and generates image data. Further, the scanner 3 transmits the generated image data to the information processing apparatus 1.
  • the target data receiving unit 22 receives the image data transmitted from the scanner 3 and input to the information processing apparatus 1 as new target data, and records it in the RAM 13. Thereafter, the process proceeds to step S102.
  • In step S102, features of the new target data are extracted and feature data is generated.
  • the feature extraction unit 23 extracts features from the new target data received in step S101, and generates feature data.
  • The feature extraction unit 23 extracts, for example, the following as features: paper size (width, height, card size flag, receipt size flag, photo size flag, etc.), number of colors, blank page ratio, line direction, ruled lines (length, width, center coordinates, number, etc.), characters (recognition language, character rectangle, character position, character size, word frequency (Bag of Words / TF-IDF), etc.), image (used color information, density information, figure outlines, local feature quantities such as SIFT / SURF (Bag of Features), etc.), and features by business form (business card tags, receipt matching results, receipt tags, etc.).
  • The feature extraction unit 23 generates feature data (in this embodiment, a feature vector) by digitizing the extracted features. Thereafter, the process proceeds to step S103.
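As an illustration of how extracted features might be digitized into a feature vector, here is a minimal sketch covering only a handful of the features listed above. The vocabulary, the card-size tolerance, the vector layout, and the function name `extract_feature_vector` are all assumptions made for this example; the source does not specify them.

```python
from collections import Counter

# Hypothetical fixed vocabulary for the bag-of-words part (not from the source).
VOCABULARY = ["invoice", "total", "tel", "email"]

def extract_feature_vector(width_mm, height_mm, num_colors, blank_ratio, words):
    """Digitize a few features into one flat numeric vector.

    Layout (an assumption): width, height, card size flag, number of colors,
    blank page ratio, then one word count per vocabulary entry.
    """
    counts = Counter(word.lower() for word in words)
    bag_of_words = [float(counts[word]) for word in VOCABULARY]
    # Card size flag: assumed to mean roughly 91 x 55 mm, within 5 mm.
    is_card = abs(width_mm - 91) <= 5 and abs(height_mm - 55) <= 5
    card_flag = 1.0 if is_card else 0.0
    return [float(width_mm), float(height_mm), card_flag,
            float(num_colors), float(blank_ratio)] + bag_of_words

vector = extract_feature_vector(91, 55, 3, 0.4, ["Tel", "Email", "tel"])
```

In practice the components would usually be scaled to comparable ranges before distance-based clustering, since raw millimetre values would otherwise dominate the word counts.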
  • In step S103, it is determined whether there is accumulated classified target data.
  • The information processing apparatus 1 determines whether the number of classified target data stored in the storage unit 21 (or the number of classifications into which the classified target data is classified) is greater than zero. This process determines whether the new target data received in step S101 is the first target data. If the number of classified target data stored in the storage unit 21 (or the number of classifications into which the classified target data is classified) is greater than 0, the new target data accepted in step S101 is not the first target data, and the process proceeds to step S104.
  • On the other hand, if the number is 0, the new target data received in step S101 is the first target data and naturally belongs to an unknown classification, so the process proceeds to the “classification determination processing by user inquiry” in step S108.
  • In step S104, the presence or absence of classified target data having the same feature data is determined.
  • Before clustering, the determination unit 24 searches the feature data of the classified target data stored by the storage unit 21 and determines whether the stored classified target data includes data having the same feature data as the new target data generated by the feature extraction unit 23. When it is determined that such data exists, the clustering process shown in steps S105 and S106 is skipped, the process proceeds to step S107, and the estimation unit 26 makes an estimate without clustering being performed. On the other hand, if it is determined that there is no classified target data having the same feature data, the process proceeds to step S105.
  • In the present embodiment, when it is determined that there is classified target data having “identical” feature data, the clustering process is skipped to reduce the overall processing load. However, the determination condition is not necessarily limited to “identical”.
  • For example, the determination unit 24 may set a predetermined threshold as the determination condition and determine whether the feature data of the classified target data stored by the storage unit 21 includes feature data that is the same as or similar to the feature data of the new target data generated by the feature extraction unit 23.
  • However, if the determination condition is broadened beyond “identical”, the processing load of searching the feature data of the classified target data becomes large, so the condition is preferably set with the processing load in mind.
  • In the present embodiment, even when clustering is skipped, the classification is determined by the classification determination process using the classification model (step S107). Alternatively, the classification associated with the classified target data having the same (or similar) feature data may be immediately determined as the classification of the new target data without performing that process.
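The same-or-similar check described for step S104 can be sketched as a linear search with a distance threshold. This is a minimal illustration under stated assumptions: Euclidean distance as the similarity measure, a threshold of 0.0 standing for the “identical” condition, and the hypothetical function name `find_matching_classification`.

```python
import math

def find_matching_classification(new_vector, stored, threshold=0.0):
    """Search classified target data for the same or similar feature data.

    stored: list of (feature_vector, classification) pairs.
    threshold: maximum Euclidean distance that counts as a match;
        0.0 means only identical feature data matches.
    Returns the classification of the first match, or None if there is none.
    """
    for vector, classification in stored:
        if math.dist(vector, new_vector) <= threshold:
            return classification
    return None

# Hypothetical accumulated data.
stored = [([1.0, 0.0], "business card"), ([0.0, 5.0], "receipt")]
exact = find_matching_classification([1.0, 0.0], stored)
miss = find_matching_classification([1.0, 0.2], stored)
similar = find_matching_classification([1.0, 0.2], stored, threshold=0.5)
```

With a non-zero threshold every stored vector needs a distance computation instead of an equality test (which could use a hash lookup), which is one way to read the text's remark that a broader condition raises the search load.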
  • In step S105, clustering processing is performed.
  • The clustering unit 25 clusters a set of target data consisting of all the classified target data stored in the storage unit 21 (although only a part may be used depending on the data amount) and the new target data received in step S101, so that every piece of target data is an element of exactly one cluster.
  • The clustering uses the feature data of the classified target data stored by the storage unit 21 and the feature data of the new target data generated in step S102.
  • In the present embodiment, a general clustering algorithm based on the distance between the feature vectors of the target data is used, but the algorithm used for clustering is not limited.
  • In step S106, it is determined whether the cluster to which the new target data belongs includes other classified target data.
  • By determining whether the cluster to which the new target data belongs includes other classified target data, the information processing apparatus 1 estimates whether the new target data belongs to an existing classification.
  • If the cluster to which the new target data belongs includes other classified target data (that is, the new target data is estimated to belong to an existing classification), the process proceeds to the “classification determination process using a classification model” in step S107.
  • On the other hand, if the cluster to which the new target data belongs includes only the new target data and no other classified target data (that is, the new target data is estimated to be of an unknown classification), the process proceeds to the “classification determination process by user inquiry” in step S108.
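Steps S105 and S106 can be sketched as follows: cluster the classified data plus the new item into (number of classifications + 1) clusters, then check whether the new item sits alone in its cluster. The source leaves the clustering algorithm open, so the single-linkage agglomerative clustering with Euclidean distance used here is an assumption, as are the function names and sample data.

```python
import math

def agglomerative_clusters(vectors, num_clusters):
    """Single-linkage agglomerative clustering; returns a list of index sets."""
    clusters = [{i} for i in range(len(vectors))]

    def linkage(a, b):
        # Distance between two clusters = distance of their closest members.
        return min(math.dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find and merge the two closest clusters.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

def is_unknown_classification(classified_vectors, num_classifications, new_vector):
    """True if the new data ends up in a cluster that contains only itself."""
    vectors = list(classified_vectors) + [new_vector]
    new_index = len(vectors) - 1
    clusters = agglomerative_clusters(vectors, num_classifications + 1)
    return any(cluster == {new_index} for cluster in clusters)

# Hypothetical classified data belonging to two classifications.
classified = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
far = is_unknown_classification(classified, 2, [50.0, 50.0])
near = is_unknown_classification(classified, 2, [0.0, 0.5])
```

Asking for one more cluster than there are known classifications leaves room for the new item to separate out when it is far from everything else; when it is close to existing data, the extra cluster is spent elsewhere and the item is treated as belonging to a known classification.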
  • In step S107, classification determination processing using a classification model is executed.
  • Here, the classification of the target data is estimated using the classification model, and the classification of the target data is determined through confirmation of the estimation result by the user. Details of the processing will be described later with reference to FIG. 5. Thereafter, the process proceeds to step S109.
  • In step S108, a classification determination process based on a user inquiry is executed.
  • Here, the classification input by the user is determined as the classification of the target data. Details of the processing will be described later with reference to FIG. 6. Thereafter, the process proceeds to step S109.
  • In step S109, the accumulation processing of classified target data is executed.
  • the storage unit 21 stores the feature data of the new target data whose classification is determined by the classification determination unit 30 in association with the determined classification as the classified target data. That is, the feature data and its classification accumulated here are used as the classified target data and its classification in the data classification process executed when other new target data is received. Thereafter, the processing shown in this flowchart ends.
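The branching of steps S103 to S109 can be summarized in a short control-flow sketch. The individual units are passed in as callables so the example stays self-contained; the function name `classify_new_data`, its parameters, and the stub callables are hypothetical stand-ins, not interfaces defined by the source.

```python
def classify_new_data(new_features, store, has_same, is_unknown, by_model, by_inquiry):
    """Control flow of steps S103 to S109.

    store: list of (features, classification) pairs of classified target data.
    has_same, is_unknown, by_model: callables taking (features, store).
    by_inquiry: callable taking (features) and returning the user's answer.
    """
    if not store:                                    # S103: first target data
        classification = by_inquiry(new_features)            # S108
    elif has_same(new_features, store):              # S104: skip clustering
        classification = by_model(new_features, store)       # S107
    elif is_unknown(new_features, store):            # S105-S106: alone in its cluster
        classification = by_inquiry(new_features)            # S108
    else:
        classification = by_model(new_features, store)       # S107
    store.append((new_features, classification))     # S109: accumulate
    return classification

# Stub callables standing in for the real units (all hypothetical).
store = []
never = lambda features, stored: False
always = lambda features, stored: True
by_model = lambda features, stored: "estimated"
by_inquiry = lambda features: "answered"

first = classify_new_data([1.0], store, never, never, by_model, by_inquiry)
second = classify_new_data([1.0], store, always, never, by_model, by_inquiry)
third = classify_new_data([9.0], store, never, always, by_model, by_inquiry)
```

Every path ends in the accumulation step, so each processed item becomes classified target data for the next call.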
  • FIG. 5 is a flowchart showing an overview of the flow of classification determination processing based on the classification model according to the present embodiment. This flowchart explains in detail the processing shown in step S107 of FIG.
  • In steps S201 and S202, when the classification model has not been generated, the single existing classification is adopted as the estimation result.
  • the information processing apparatus 1 determines whether a classification model has been generated (step S201). When it is determined that the classification model has not been generated, since the existing classification is only one classification, the estimation unit 26 estimates the existing one classification as a classification of new target data (step S202). On the other hand, if it is determined that the classification model has been generated, the process proceeds to step S203.
  • In step S203, the classification of the new target data is estimated using the classification model.
  • The estimation unit 26 reads out the classification model from the classification model management unit 20. The estimation unit 26 then estimates the classification of the new target data by inputting its feature data into the classification model, which has been generated or updated based on the feature data of the classified target data and the classification of the classified target data. Thereafter, the process proceeds to step S204.
  • Alternatively, the estimation unit 26 may estimate the classification of the new target data by comparing the feature data of the classified target data with the feature data of the new target data and identifying the classification of the classified target data that approximates the new target data.
  • Note that the clusters generated by clustering (unsupervised) in step S105 do not necessarily match the classification estimated (supervised) in step S203. This is because, in the present embodiment, the clustering process is performed only to infer whether the new target data belongs to an existing classification and is independent of the classification estimation process, which uses a classification model generated or updated from the determined classifications.
  • In step S204, the user is asked to confirm the estimation result.
  • The confirmation output unit 27 performs output for confirming the result of the estimation made in step S203 using the classification model. Specifically, the confirmation output unit 27 outputs a message including the estimated classification, such as “Is this a ‘business card’?”. Thereafter, the process proceeds to step S205.
  • In step S205, a response from the user is accepted.
  • The response receiving unit 29 receives an input of a response to the confirmation output in step S204. Specifically, the user checks the output message including the estimated classification (for example, “Is this a ‘business card’?”) and inputs a response to it. For example, the user makes an input indicating that the estimation is correct (for example, “Yes”) when the estimation result in step S203 is correct, and inputs the correct classification when it is incorrect.
  • The user may input the classification by freely entering text (for example, “receipt”), thereby adding a new classification, or by selecting from the existing classifications output by the inquiry output unit 28.
  • The response receiving unit 29 receives these inputs from the user. Thereafter, the process proceeds to step S206.
  • In step S206, the classification of the new target data is determined according to the response.
  • The classification determination unit 30 refers to the response received in step S205 and, if an input indicating that the estimation result in step S203 is correct was received, determines the estimation result of step S203 as the classification of the new target data.
  • If instead the estimation result in step S203 was incorrect and an input indicating the correct classification was received from the user, the classification determination unit 30 determines the classification input by the user as the classification of the new target data. Thereafter, the process proceeds to step S207.
  • In steps S207 to S209, the classification model is updated when the estimation result is incorrect and the number of existing classifications is two or more. If the correct classification input by the user has been accepted because the estimation result was incorrect (“NO” in step S207), and the classified target data has two or more classifications (“YES” in step S208), the classification model management unit 20 updates the held classification model based on the feature data of the new target data and the classification input by the user (step S209). On the other hand, when the estimation result is correct (“YES” in step S207), the held classification model is not updated. Also, when the number of classifications of the classified target data is less than two (“NO” in step S208), the classification model is not updated because a classification model cannot be generated. Thereafter, the processing shown in this flowchart ends.
  • In the present embodiment, the classification model is updated when the estimation result is incorrect (“NO” in step S207). However, the classification model may also be updated when the estimation result is correct, in which case the determination in step S207 may be omitted. Note that the update timing of the classification model may be determined in consideration of the processing load in the system.
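The decision and update logic of steps S206 to S209 can be sketched in a few lines. The sketch assumes the response is the literal string "Yes" when the user confirms the estimate and the correct classification otherwise; that encoding, the function name `decide_and_maybe_update`, and the `update_model` callback are assumptions for illustration.

```python
def decide_and_maybe_update(estimated, response, num_classifications, update_model):
    """Determine the classification (S206) and update the model if needed.

    response: "Yes" if the user confirmed the estimate, otherwise the
        correct classification entered by the user (assumed encoding).
    update_model: callable invoked with the decided classification (S209).
    """
    estimate_correct = response == "Yes"
    decided = estimated if estimate_correct else response       # S206
    if not estimate_correct and num_classifications >= 2:       # S207, S208
        update_model(decided)                                   # S209
    return decided

# Record which classifications triggered a model update.
updates = []
a = decide_and_maybe_update("business card", "Yes", 3, updates.append)
b = decide_and_maybe_update("business card", "receipt", 3, updates.append)
c = decide_and_maybe_update("business card", "receipt", 1, updates.append)
```

Dropping the `not estimate_correct` condition gives the variant in which the model is updated after every confirmed classification.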
  • FIG. 6 is a flowchart showing an overview of the flow of classification determination processing by user inquiry according to the present embodiment. This flowchart explains in detail the processing shown in step S108 of FIG.
  • In step S301, the user is asked for the classification.
  • The inquiry output unit 28 performs an output for inquiring about the classification of the new target data. Specifically, the inquiry output unit 28 outputs a message such as “What kind of document is this?”. Thereafter, the process proceeds to step S302.
  • In step S302, a response from the user is accepted.
  • The response receiving unit 29 receives an input of a response to the output in step S301. Specifically, the user checks the output message and, in response, makes an input indicating the classification to which the new target data should belong.
  • The user may input a classification by freely entering text, thereby adding a new classification, or by selecting from the existing classifications output by the inquiry output unit 28.
  • The response receiving unit 29 receives the input from the user. Thereafter, the process proceeds to step S303.
  • In step S303, the classification of the new target data is determined according to the response.
  • The classification determination unit 30 refers to the response accepted in step S302 and determines the classification input by the user as the classification of the new target data. Thereafter, the process proceeds to step S304.
  • In steps S304 and S305, a classification model is generated or updated when there are two or more existing classifications.
  • The information processing apparatus 1 determines whether the number of classifications of the classified target data is two or more (step S304). If the number of classifications is less than two, a classification model cannot be generated, and the process shown in this flowchart ends. On the other hand, when the number of classifications is two or more, the classification model management unit 20 generates or updates a classification model (step S305).
  • Specifically, when the new target data received in step S101 is the first target data (“NO” in step S103), the classification model management unit 20 generates a new classification model based on the feature data of the new target data and the classification input by the user. In other cases (“NO” in step S106), the classification model management unit 20 updates the held classification model based on the feature data of the new target data and the classification input by the user. Thereafter, the processing shown in this flowchart ends.
  • The information processing apparatus 1 that has received the image data finds that the accumulated target data is zero (“NO” in step S103) and outputs the inquiry “What kind of document is this?”.
  • The user inputs a document type (for example, “business card”), and the information processing apparatus 1 generates a classification model using the input result.
  • Thereafter, the information processing apparatus 1 determines by clustering whether the classification is unknown. While the amount of stored target data is small, new target data is likely to be the only element of its cluster (“NO” in step S106), so the inquiry output “What kind of document is this?” and the user's input of the document type are repeated several times.
  • For target data determined by clustering to belong to an existing classification (“YES” in step S106), the information processing apparatus 1 estimates the classification based on the classification model and outputs it to the user for confirmation. For target data determined by clustering to be of an unknown classification (“NO” in step S106), it outputs the inquiry “What kind of document is this?”. By repeating such processing, the accuracy of data classification by the information processing apparatus 1 according to the present embodiment is improved.
  • <Effect> According to the present embodiment, in a system for classifying data, whether newly input target data belongs to an already defined classification (in other words, whether it is of an unknown classification) can be determined more accurately.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention addresses the problem of more accurately determining whether or not new target data is unknown classification target data. An information processing device 1 is provided with: a feature extraction unit 23 that generates feature data by extracting features from new target data; a clustering unit 25 that clusters a set of target data comprising classified target data and the new target data into clusters the number of which is a number obtained by adding one to the number of classifications to which the classified target data have been classified, on the basis of feature data of the classified target data and the feature data of the new target data generated by the feature extraction unit 23; and an inquiry output unit 28 that outputs an inquiry for inquiring about the classification of the new target data in the case where a cluster including only the new target data has appeared as the result of the clustering.

Description

Information processing apparatus, method, and program
The present disclosure relates to a technique for classifying data.
Conventionally, techniques that use clustering results as input for learning have been proposed (see Patent Documents 1, 2, and 3). One example is an image classification apparatus that derives a feature amount for each of a plurality of document images assigned a first classification label, executes clustering using the feature amounts, calculates the linkage distance until the document images converge into a single cluster, divides the cluster into a plurality of sub-clusters at a predetermined threshold of the linkage distance, assigns a second classification label to the document images in each sub-cluster, executes machine learning using the feature amounts and the second classification labels, and generates a classification rule that classifies a document image having a feature amount corresponding to a second classification label into the classification destination designated by the first classification label (see Patent Document 1).
A technique has also been proposed for creating information that lets a user judge whether the classification of document information is appropriate (see Patent Document 4). In this technique, an information processing apparatus includes feature amount calculation means for calculating a feature amount of each of a plurality of pieces of document information to which common attribute information is assigned, distance calculation means for calculating, based on the calculated feature amounts, the distances in feature amount space between the pieces of document information, and distribution map creation means for creating, based on the calculated distances, distribution map information in which each piece of document information is plotted in the feature amount space.
JP 2016-071412 A
JP 2014-123286 A
JP 2010-282483 A
JP 2015-026355 A
Conventionally, various automatic classification systems, such as classification by machine learning and classification by clustering, have been proposed and adopted as techniques for automatically classifying data.
However, because a conventional automatic classification system tries to assign data to one of the classifications it has already learned, it inevitably outputs an incorrect classification result when the target data belongs to none of the learned classifications.
In view of the above problem, an object of the present disclosure is to determine more accurately whether new target data is target data of an unknown classification.
An example of the present disclosure is an information processing apparatus including: storage means for storing feature data of classified target data in association with the classification of that target data; target data reception means for receiving input of new target data; feature extraction means for extracting features from the new target data to generate feature data; clustering means for clustering a set of target data consisting of the classified target data and the new target data into a number of clusters equal to the number of classifications into which the classified target data is classified plus one, based on the feature data of the classified target data stored by the storage means and the feature data of the new target data generated by the feature extraction means; and inquiry output means for performing, when a cluster containing only the new target data appears as a result of the clustering, output for inquiring about the classification of the new target data.
The present disclosure can be understood as an information processing apparatus, a system, a method executed by a computer, or a program executed by a computer. The present disclosure can also be understood as such a program recorded on a recording medium readable by a computer or another device, machine, or the like. Here, a recording medium readable by a computer or the like is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and from which a computer or the like can read that information.
According to the present disclosure, it is possible to determine more accurately whether new target data is target data of an unknown classification.
FIG. 1 is a schematic diagram showing the configuration of the system according to the embodiment.
FIG. 2 is a diagram showing an outline of the configuration of the scanner according to the embodiment.
FIG. 3 is a diagram showing an outline of the functional configuration of the information processing apparatus according to the embodiment.
FIG. 4 is a flowchart showing an outline of the flow of the data classification process according to the embodiment.
FIG. 5 is a flowchart showing an outline of the flow of the classification determination process by classification model according to the embodiment.
FIG. 6 is a flowchart showing an outline of the flow of the classification determination process by user inquiry according to the embodiment.
Hereinafter, embodiments of an information processing apparatus, method, and program according to the present disclosure will be described with reference to the drawings. The embodiments described below are examples, however, and do not limit the information processing apparatus, method, and program according to the present disclosure to the specific configurations described below. In implementation, a specific configuration suited to the mode of implementation may be adopted as appropriate, and various improvements and modifications may be made.
This embodiment describes a case in which the information processing apparatus, method, and program according to the present disclosure are implemented in a system that classifies image data, obtained by imaging a medium such as paper or a card with a scanner, by type of medium or by type of information recorded on the medium. However, the information processing apparatus, method, and program according to the present disclosure can be used widely for techniques for classifying data, and the application of the present disclosure is not limited to the example shown in this embodiment.
<System configuration>
FIG. 1 is a schematic diagram showing the configuration of the system according to this embodiment. The system includes an information processing apparatus 1 and a scanner 3. The information processing apparatus 1 is a computer including a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14 such as an EEPROM (Electrically Erasable and Programmable Read Only Memory) or an HDD (Hard Disk Drive), a communication unit 15 such as a NIC (Network Interface Card), an input device 16 such as a keyboard or touch panel, and an output device 17 such as a display or speaker.
FIG. 2 is a diagram showing an outline of the configuration of the scanner 3 according to this embodiment. The scanner 3 is an imaging apparatus that acquires image data by imaging an original set by the user, such as a document, business card, receipt, or photograph/illustration. It includes a sheet feeder 36 that feeds the original to an imaging unit 37, the imaging unit 37, a scan button 38, a CPU 31, a ROM 32, a RAM 33, a storage device 34, a communication unit 35, and the like. Although this embodiment illustrates a scanner 3 that images originals while automatically feeding them from the sheet feeder 36, the imaging method of the scanner is not limited. For example, the scanner may be of a type that images an original set at a reading position by the user. The communication means, hardware configuration, and the like of a scanner that can employ the method according to this embodiment are not limited to the examples given here. Furthermore, although this embodiment describes the scanner 3 as the imaging apparatus used in the system, the imaging apparatus is not limited to a scanner. For example, a camera may be employed.
FIG. 1 illustrates a mode in which the scanner 3 and the information processing apparatus 1 are connected via a network or a connector for peripheral devices, but the system configuration is not limited to that illustrated in FIG. 1. The information processing apparatus may be implemented in a distributed manner using cloud or distributed computing technology, and the scanner may be built into the information processing apparatus.
The system shown in this embodiment classifies image data obtained with the scanner 3 by type of medium or by type of information recorded on the medium. It provides a user interface that gradually improves automatic classification accuracy: when the data is of a classification that has been learned, the system classifies it automatically using the learning results; when the data is of a classification that has not been learned, the system asks the user, receives the user's feedback, and learns from that feedback.
For this purpose, the system employs a technique that uses unsupervised learning and supervised learning in two stages. Specifically, in the first stage the system uses unsupervised learning (clustering); if a cluster containing only the new target data is formed, the system determines that the data is of an unknown classification and makes an inquiry to the user. If the data is determined not to be of an unknown classification, the system estimates the classification in the second stage using a supervised classification model.
FIG. 3 is a diagram showing an outline of the functional configuration of the information processing apparatus 1 according to this embodiment. A program recorded in the storage 14 is read into the RAM 13 and executed by the CPU 11, whereby the information processing apparatus 1 functions as an information processing apparatus including a storage unit 21, a target data reception unit 22, a feature extraction unit 23, a determination unit 24, a clustering unit 25, an estimation unit 26, a confirmation output unit 27, an inquiry output unit 28, a response reception unit 29, a classification determination unit 30, and a classification model management unit 20. In this embodiment, each function of the information processing apparatus 1 is executed by the CPU 11, which is a general-purpose processor, but some or all of these functions may be executed by one or more dedicated processors. Some or all of these functions may also be executed, using cloud technology or the like, by an apparatus installed at a remote location or by a plurality of apparatuses installed in a distributed manner.
The storage unit 21 stores feature data of classified target data in association with the classification (class) of that target data. When the classification of new target data is determined, the storage unit 21 also stores the feature data of the new target data, in association with the determined classification, as classified target data. This increases the amount of stored classified target data and improves the accuracy of the estimation processing by the classification model described later.
The target data reception unit 22 receives input of new target data.
The feature extraction unit 23 extracts features from new target data to generate feature data. In this embodiment a feature vector is used as the feature data, but the way of representing the features of target data as data is not limited to a vector.
Before clustering, the determination unit 24 determines whether the feature data of the classified target data stored by the storage unit 21 includes feature data identical to the feature data of the new target data generated by the feature extraction unit 23. If identical feature data exists, the clustering process is skipped.
The clustering unit 25 clusters a set of target data consisting of the classified target data and the new target data into "the number of classifications into which the classified target data is classified + 1" clusters, based on the feature data of the classified target data stored by the storage unit 21 and the feature data of the new target data generated by the feature extraction unit 23.
When, as a result of clustering, the cluster containing the new target data also contains classified target data, the estimation unit 26 judges that the new target data is likely to belong to an existing (known) classification and estimates the classification of the new target data. The estimation unit 26 makes this estimate based on the feature data of the classified target data stored by the storage unit 21, the classifications of that classified target data, and the feature data of the new target data.
The confirmation output unit 27 performs, via the output device 17, output for confirming the result of the estimation by the estimation unit 26.
When a cluster containing only the new target data appears as a result of clustering, the inquiry output unit 28 judges that the new target data is likely to belong to an unknown classification and performs, via the output device 17, output for inquiring about the classification of the new target data.
The response reception unit 29 receives, via the input device 16, the user's response to an inquiry output or confirmation output.
The classification determination unit 30 determines the classification of the new target data according to the user's response to the inquiry output or confirmation output.
The classification model management unit 20 holds the classification model used in the estimation processing by the estimation unit 26. It also generates or updates the classification model based on the feature data of the new target data and the classification determined by the classification determination unit 30. The estimation unit 26 can thus estimate the classification of target data using the latest updated model.
A pattern recognition model (learning model) that uses supervised learning, for example a general-purpose classification algorithm such as an SVM (support vector machine), may be used as the classification model. In this case, the classification model is generated by training on pairs of feature data (feature vectors) and their corresponding classification labels supplied as teacher data. When the classification model is updated, its accuracy may be verified using a technique such as cross-validation, and a model with improved accuracy may be adopted.
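The accuracy check mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid classifier stands in for the SVM named in the text (it is a swapped-in technique, chosen to keep the example dependency-free), leave-one-out is used as the cross-validation scheme, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """Train a nearest-centroid model (a stand-in for an SVM) from (vector, label) pairs."""
    groups = defaultdict(list)
    for vec, label in samples:
        groups[label].append(vec)
    # One centroid per label: the per-dimension mean of that label's vectors.
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}


def predict(model, vec):
    """Return the label whose centroid is closest to vec."""
    return min(model, key=lambda label: math.dist(model[label], vec))


def leave_one_out_accuracy(samples):
    """Hold out each sample once, train on the rest, and test on the held-out sample.

    Assumes at least two samples, so that every training split is non-empty.
    """
    hits = 0
    for i, (vec, label) in enumerate(samples):
        model = train_centroids(samples[:i] + samples[i + 1:])
        hits += predict(model, vec) == label
    return hits / len(samples)
```

A score of this kind could be computed for a retrained model and used to decide whether it replaces the model currently held by the classification model management unit 20.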
<Process flow>
Next, the flow of processing executed by the system according to this embodiment will be described using flowcharts. The specific contents and order of the processing shown in the flowcharts described below are one example for carrying out the present disclosure. Specific processing contents and processing order may be selected as appropriate for the embodiment of the present disclosure.
FIG. 4 is a flowchart showing an outline of the flow of the data classification process according to this embodiment. The data classification process is triggered when the information processing apparatus 1 receives image data transmitted by the scanner 3.
In step S101, input of new target data to be classified is received. When the user sets a paper medium bearing a document, photograph, or the like on the sheet feeder or reading table of the scanner 3 and performs a scan start operation, the scanner 3 images the paper medium and generates image data. The scanner 3 then transmits the generated image data to the information processing apparatus 1. The target data reception unit 22 receives the image data transmitted by the scanner 3 and input to the information processing apparatus 1 as new target data and records it in the RAM 13. The process then proceeds to step S102.
In step S102, features of the new target data are extracted and feature data is generated. The feature extraction unit 23 extracts features from the new target data received in step S101 and generates feature data. Because the target data in this embodiment is image data, the feature extraction unit 23 extracts features such as paper size (width, height, card-size flag, receipt-size flag, photo-size flag, etc.), number of colors, blank page ratio, line direction, ruled lines (vertical and horizontal lengths, center coordinates, count, etc.), characters (recognition language, character rectangles, character positions, character sizes, word occurrence frequency (Bag of Words/TF-IDF), etc.), image (colors used, shading information, figure outlines, local feature amounts such as SIFT/SURF (Bag of Features), etc.), and form-specific features (business card tags, receipt matching results, receipt tags, etc.). The feature extraction unit 23 then generates feature data (a feature vector in this embodiment) by converting the extracted features into numerical values. The process then proceeds to step S103.
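As a rough illustration of how extracted features might be converted into a feature vector, the sketch below combines a few of the listed features (paper size, a card-size flag, number of colors, blank page ratio, and a bag-of-words term frequency). The vocabulary, the assumed card dimensions, the tolerance, and the function names are all hypothetical and are not taken from the disclosure.

```python
from collections import Counter

# Hypothetical vocabulary and card size, chosen only for illustration.
VOCAB = ["total", "tax", "tel", "invoice"]
CARD_MM = (91.0, 55.0)  # assumed business-card size in millimetres


def is_close(value, target, tol=3.0):
    """True if value is within tol millimetres of target."""
    return abs(value - target) <= tol


def build_feature_vector(width_mm, height_mm, color_count, blank_ratio, words):
    """Digitize a handful of extracted features into one flat numeric vector."""
    long_side, short_side = max(width_mm, height_mm), min(width_mm, height_mm)
    card_flag = 1.0 if (is_close(long_side, CARD_MM[0])
                        and is_close(short_side, CARD_MM[1])) else 0.0
    # Bag of words: relative frequency of each vocabulary word in the recognized text.
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values()) or 1
    bag = [counts[v] / total for v in VOCAB]
    return [float(width_mm), float(height_mm), card_flag,
            float(color_count), float(blank_ratio)] + bag
```

In practice the components would likely need scaling to comparable ranges before distance-based clustering, since raw millimetre values would otherwise dominate the word frequencies.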
In step S103, it is determined whether any classified target data has been stored. The information processing apparatus 1 determines whether the number of classified target data items stored in the storage unit 21 (or the number of classifications into which the classified target data is classified) is greater than 0. This determines whether the new target data received in step S101 is the first target data. If the number is greater than 0, the new target data is not the first target data, and the process proceeds to step S104. If the number is 0, the new target data is the first target data and necessarily belongs to an unknown classification, so the process proceeds to the "classification determination process by user inquiry" in step S108.
In step S104, it is determined whether there is classified target data having identical feature data. Before clustering, the determination unit 24 searches the feature data of the classified target data stored by the storage unit 21 to determine whether the stored classified target data includes any item whose feature data is identical to that of the new target data generated by the feature extraction unit 23. If such classified target data exists, the clustering processing shown in steps S105 and S106 is skipped and the process proceeds to step S107, where the estimation unit 26 performs estimation without clustering. If no such classified target data exists, the process proceeds to step S105.
In this embodiment, the overall processing load is reduced by skipping the clustering process when classified target data having "identical" feature data is found, but the determination condition need not be limited to "identical". For example, the determination unit 24 may set a predetermined threshold as the determination condition and determine whether the feature data of the classified target data stored by the storage unit 21 includes feature data identical or similar to the feature data of the new target data generated by the feature extraction unit 23. However, relaxing the condition beyond "identical" increases the load of searching the feature data of the classified target data, so the determination condition is preferably set in consideration of the overall processing load.
In this embodiment, when classified target data having identical (or similar) feature data is found, the classification is determined by the classification determination process using the classification model. Alternatively, the classification associated with the classified target data having the identical (or similar) feature data may be adopted immediately as the classification of the new target data, without performing the classification determination process using the classification model.
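The identical-or-similar check of step S104 can be sketched as a linear scan over the stored feature vectors. The sketch assumes Euclidean distance as the similarity measure and a hypothetical function name `find_match`; the disclosure does not specify either. A threshold of 0.0 corresponds to the "identical" condition.

```python
import math


def find_match(new_vec, stored, threshold=0.0):
    """Return the label of the first stored vector within `threshold` of new_vec.

    stored: list of (feature_vector, label) pairs.
    Returns None when no stored vector is close enough.
    """
    for vec, label in stored:
        if math.dist(vec, new_vec) <= threshold:
            return label
    return None
```

Returning the matched label supports both variants described above: the caller can either go on to the classification model or adopt the returned classification directly. For the strict "identical" case, a dictionary keyed on the vector as a tuple would avoid the scan, which reflects the point that relaxing the condition raises the search cost.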
In step S105, clustering is performed. The clustering unit 25 clusters a set of target data, consisting of all the classified target data stored in the storage unit 21 (or only part of it, depending on the amount of data) and the new target data received in step S101, so that every target data item becomes an element of exactly one cluster. The clustering uses the feature data of the classified target data stored by the storage unit 21 and the feature data of the new target data generated in step S102. In this embodiment, a general clustering algorithm based on the distances between the feature vectors of the target data is used. However, the algorithm used for clustering is not limited.
The clustering unit 25 clusters the set of target data including the new target data into "the number of classifications into which the classified target data is classified + 1" clusters. For example, when the classified target data is classified into three classifications, the clustering unit 25 clusters the set into 3 + 1 = 4 clusters. Setting the number of clusters in this way makes it possible to judge whether the new target data is likely to belong to an unknown classification. The process then proceeds to step S106.
In step S106, it is determined whether the cluster to which the new target data belongs contains classified target data. Based on the result of the clustering in step S105, the information processing apparatus 1 makes this determination and thereby estimates whether the new target data belongs to an existing classification. If the cluster to which the new target data belongs contains classified target data (that is, the new target data is estimated to belong to an existing classification), the process proceeds to the "classification determination process by classification model" in step S107. If the cluster contains only the new target data and no classified target data (that is, the new target data is estimated to be of an unknown classification), the process proceeds to the "classification determination process by user inquiry" in step S108.
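Steps S105 and S106 can be sketched together as follows. The text leaves the clustering algorithm open, so this example assumes single-linkage agglomerative clustering on Euclidean distance purely for illustration; the function names are hypothetical.

```python
import math


def cluster_into(vectors, n_clusters):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per vector and repeatedly merges the two closest
    clusters until n_clusters remain. Returns lists of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters


def is_unknown_class(stored, new_vec):
    """True if new_vec ends up alone when clustering into (known classes + 1) clusters.

    stored: list of (feature_vector, label) pairs of classified target data.
    """
    n_known = len({label for _, label in stored})
    vectors = [vec for vec, _ in stored] + [new_vec]
    new_index = len(vectors) - 1
    for cluster in cluster_into(vectors, n_known + 1):
        if new_index in cluster:
            return len(cluster) == 1
    raise AssertionError("every item belongs to exactly one cluster")
```

With two known classifications clustered around (0, 0) and (10, 10), a new vector at (50, 50) remains alone in the third cluster and is treated as unknown, whereas a new vector at (0.5, 0.5) merges with the first group and is passed on to the classification model.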
In step S107, the classification determination process using the classification model is executed. In this process, the classification of the target data is estimated using the classification model, and the classification is determined after the user has confirmed the estimation result. Details of the process are described later with reference to FIG. 5. The process then proceeds to step S109.
In step S108, the classification determination process by user inquiry is executed. In this process, the classification entered by the user is determined as the classification of the target data. Details of the process are described later with reference to FIG. 6. The process then proceeds to step S109.
In step S109, the classified target data is accumulated. The storage unit 21 stores the feature data of the new target data whose classification has been determined by the classification determination unit 30, in association with the determined classification, as classified target data. The feature data and classification stored here are then used as classified target data and its classification in the data classification process executed when further new target data is received. The process shown in this flowchart then ends.
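As a rough illustration of the accumulation in step S109, the sketch below keeps feature data together with its decided classification. It assumes feature data is a sequence of numbers and a classification is a string label; the class name `ClassifiedStore` and its methods are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field


@dataclass
class ClassifiedStore:
    """In-memory stand-in for the storage unit 21 (illustrative only)."""
    records: list = field(default_factory=list)

    def add(self, features, label):
        # Store the feature data in association with the decided classification.
        self.records.append((tuple(features), label))

    def labels(self):
        # Set of classifications seen so far; its size drives the "+1" cluster count.
        return {label for _, label in self.records}
```

Each later run of the data classification process would read `records` as the classified target data and `len(store.labels())` as the number of existing classifications.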
FIG. 5 is a flowchart outlining the flow of the classification determination process using the classification model according to this embodiment. This flowchart describes in detail the process shown in step S107 of FIG. 4.
In steps S201 and S202, if no classification model has been generated yet, the single existing classification is adopted as the estimation result. The information processing apparatus 1 determines whether a classification model has been generated (step S201). If it is determined that no classification model has been generated, only one classification exists, so the estimation unit 26 estimates that single existing classification as the classification of the new target data (step S202). If it is determined that a classification model has been generated, the process proceeds to step S203.
In step S203, the classification of the new target data is estimated using the classification model. The estimation unit 26 reads from the classification model management unit 20 the classification model that was generated or updated based on the feature data of the classified target data and the classifications of that data. The estimation unit 26 then inputs the feature data of the new target data into this classification model to estimate the classification of the new target data. The process then proceeds to step S204.
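The branch in steps S201 to S203 can be sketched as below. The sketch assumes the classification model is represented as a plain callable that maps feature data to a label, and that `None` stands for "not yet generated"; both conventions and the name `estimate_classification` are assumptions made for illustration.

```python
def estimate_classification(model, existing_labels, features):
    """Return the estimated classification for the new target data.

    model: callable(features) -> label, or None if no model exists yet (assumed).
    existing_labels: set of classifications already stored.
    """
    if model is None:
        # Steps S201/S202: without a model only one classification exists,
        # so that single classification is adopted as the estimate.
        return next(iter(existing_labels))
    # Step S203: feed the new feature data to the classification model.
    return model(features)
```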
Although this embodiment describes estimating the data classification with a classification model, other techniques may be used for the estimation. For example, the estimation unit 26 may compare the feature data of the classified target data with the feature data of the new target data and identify the classification of the classified target data that is closest to the new target data, thereby estimating the classification of the new target data.
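The model-free alternative described above amounts to a nearest-neighbour lookup. A minimal sketch, assuming Euclidean distance as the similarity measure (the source does not specify one) and a hypothetical function name `estimate_by_nearest`:

```python
from math import sqrt


def euclid(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_by_nearest(classified, new_features):
    """classified: non-empty list of (features, label) pairs.

    Returns the label of the stored item whose features are closest to the new item.
    """
    _, label = min(classified, key=lambda item: euclid(item[0], new_features))
    return label
```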
Note that the clusters generated by the (unsupervised) clustering in step S105 do not necessarily coincide with the classifications estimated by the (supervised) estimation in step S203. This is because, in this embodiment, the clustering is performed only to estimate whether the new target data belongs to an existing classification, and it is independent of the classification estimation performed by the classification model, which is generated or updated using the determined classifications.
In step S204, the user is asked to confirm the estimation result. The confirmation output unit 27 performs an output for confirming the result of the estimation made in step S203 using the classification model. Specifically, the confirmation output unit 27 outputs a message that includes the estimated classification, such as 'This is a "business card", isn't it?'. The process then proceeds to step S205.
In step S205, a response from the user is received. The response receiving unit 29 receives the input of a response to the confirmation output of step S204. Specifically, the user checks the output message containing the estimated classification (for example, 'This is a "business card", isn't it?') and enters a response. If the estimation result of step S203 is correct, the user enters an input indicating so (for example, "Yes"); if it is incorrect, the user enters an input indicating the correct classification. When indicating the correct classification, the user may either type free text (for example, "receipt") to add a new classification, or select one of the existing classifications output by the inquiry output unit 28. The response receiving unit 29 receives these inputs. The process then proceeds to step S206.
In step S206, the classification of the new target data is determined according to the response. The classification determination unit 30 refers to the response received in step S205. If an input indicating that the estimation result of step S203 is correct was received, the classification determination unit 30 determines the estimation result of step S203 as the classification of the new target data. If the estimation result of step S203 was incorrect and an input indicating the correct classification was received from the user, the classification determination unit 30 determines the classification entered in the user's response as the classification of the new target data. The process then proceeds to step S207.
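Steps S205 and S206 reduce to a small decision on the user's response. The sketch below assumes the response arrives as a text string and that "Yes" (case-insensitive) is the only affirmative input; the function name `decide_classification` and this encoding of the response are illustrative assumptions.

```python
def decide_classification(estimated_label, response):
    """Return (decided_label, estimate_was_correct) from the user's response.

    response: "Yes" to accept the estimate, otherwise the correct classification
    as typed or selected by the user (assumed encoding).
    """
    answer = response.strip()
    if answer.lower() == "yes":
        return estimated_label, True
    # Any other input is taken as the correct classification supplied by the user.
    return answer, False
```

The second element of the result corresponds to the branch taken in step S207.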
In steps S207 to S209, the classification model is updated if the estimation result was incorrect and two or more classifications exist. If the estimation result was incorrect and the correct classification entered by the user was received ("NO" in step S207), and the classified target data has two or more classifications ("YES" in step S208), the classification model management unit 20 updates the held classification model based on the feature data of the new target data and the classification entered by the user (step S209). If the estimation result was correct ("YES" in step S207), the held classification model is not updated. Likewise, if the classified target data has fewer than two classifications ("NO" in step S208), no classification model has been generated, so no update is performed. The process shown in this flowchart then ends.
In this embodiment the classification model is updated when the estimation result was incorrect ("NO" in step S207), but the model may also be updated when the estimation result was correct, by reflecting the correct estimation result in the classification model. In other words, the determination in step S207 may be omitted. The timing of classification model updates may be decided in consideration of the processing load on the system.
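The update condition of steps S207 to S209, together with the variant in which step S207 is omitted, can be written as a single predicate. The name `should_update_model` and the `update_on_correct` flag are hypothetical; the flag is just one way to express the optional variant.

```python
def should_update_model(estimate_was_correct, n_classifications, update_on_correct=False):
    """Decide whether the held classification model is updated.

    update_on_correct=True models the variant in which step S207 is omitted.
    """
    if n_classifications < 2:
        # Step S208 "NO": no model can exist with fewer than two classifications.
        return False
    # Step S207: by default only an incorrect estimate triggers an update.
    return update_on_correct or not estimate_was_correct
```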
FIG. 6 is a flowchart outlining the flow of the classification determination process by user inquiry according to this embodiment. This flowchart describes in detail the process shown in step S108 of FIG. 4.
In step S301, the user is asked for the classification. The inquiry output unit 28 performs an output for inquiring about the classification of the new target data. Specifically, the inquiry output unit 28 outputs a message such as "What type of document is this?". The process then proceeds to step S302.
In step S302, a response from the user is received. The response receiving unit 29 receives the input of a response to the output of step S301. Specifically, the user checks the output message and, in response, enters an input indicating the classification to which the new target data should belong. The user may either type free text to add a new classification, or select one of the existing classifications output by the inquiry output unit 28. The response receiving unit 29 receives the user's input. The process then proceeds to step S303.
In step S303, the classification of the new target data is determined according to the response. The classification determination unit 30 refers to the response received in step S302 and determines the classification entered in the user's response as the classification of the new target data. The process then proceeds to step S304.
In steps S304 and S305, a classification model is generated or updated if two or more classifications exist. The information processing apparatus 1 determines whether the classified target data has two or more classifications (step S304). If it has fewer than two classifications, a classification model cannot be generated, and the process shown in this flowchart ends. If it has two or more classifications, the classification model management unit 20 generates or updates the classification model (step S305).
Here, if the new target data received in step S101 is the first target data ("NO" in step S103), the classification model management unit 20 newly generates a classification model based on the feature data of the new target data and the classification entered by the user. Otherwise ("NO" in step S106), the classification model management unit 20 updates the held classification model based on the feature data of the new target data and the classification entered by the user. The process shown in this flowchart then ends.
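The generate-or-update logic of steps S304 and S305 might look like the sketch below. The source does not specify the type of classification model, so a nearest-centroid classifier is swapped in purely for illustration. `CentroidModel` and `generate_or_update` are hypothetical names, and `records` is assumed to already include the newly decided item.

```python
from math import sqrt


class CentroidModel:
    """Toy nearest-centroid classifier standing in for the classification model."""

    def __init__(self):
        self.sums = {}  # label -> (per-dimension sum, item count)

    def update(self, features, label):
        total, count = self.sums.get(label, ([0.0] * len(features), 0))
        self.sums[label] = ([t + f for t, f in zip(total, features)], count + 1)

    def predict(self, features):
        def centroid_distance(label):
            total, count = self.sums[label]
            return sqrt(sum((t / count - f) ** 2 for t, f in zip(total, features)))
        return min(self.sums, key=centroid_distance)


def generate_or_update(model, records, features, label):
    """records: list of (features, label), assumed to include the new item."""
    if len({l for _, l in records}) < 2:
        return model  # Step S304 "NO": a model cannot be generated yet.
    if model is None:
        model = CentroidModel()  # Step S305: generate from all stored data.
        for f, l in records:
            model.update(f, l)
    else:
        model.update(features, label)  # Step S305: update the held model.
    return model
```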
<Example>
The following describes the typical flow when a user actually uses the system according to the embodiment described above.
First, when the user has the scanner 3 capture an image of a document while no classified target data has been accumulated, the information processing apparatus 1 that receives the image data outputs the message "What type of document is this?", because the amount of accumulated target data is zero ("NO" in step S103). The user enters a document type in response (for example, "business card"), and the information processing apparatus 1 generates a classification model using this input.
Next, when the user has the scanner 3 capture an image of another document, the information processing apparatus 1 that receives the image data uses clustering to determine whether the document belongs to an unknown classification. While little target data has been accumulated, new target data is likely to end up in a cluster that contains no other data ("NO" in step S106), so the "What type of document is this?" message from the information processing apparatus 1 and the user's entry of a document type are repeated several times.
Once a certain amount of target data has been accumulated, new target data becomes more likely to fall into a cluster that contains other data. For target data that the clustering determines to belong to an existing classification ("YES" in step S106), the information processing apparatus 1 makes an estimate with the classification model and outputs it to the user for confirmation. For target data that the clustering determines to belong to an unknown classification ("NO" in step S106), it outputs the inquiry "What type of document is this?". As this processing is repeated, the accuracy of data classification by the information processing apparatus 1 according to this embodiment improves.
Conventionally, in machine learning, when the training data has been learned but the amount of learned data is small, generalization ability is insufficient (overfitting) and the estimation accuracy for unknown data is low. If, in this state, a score derived from the classification model (the probability that an estimate is considered correct) is used to determine whether the target data belongs to an already defined classification (that is, whether it is an unknown classification), the determination is also inaccurate, because the model that produces the score is itself unreliable. In contrast, the system of this embodiment makes this determination by clustering, which relies on the relative relationships among the data and does not depend on the classification model, so that whether a classification is unknown can be determined with high accuracy.
<Effect>
According to the information processing apparatus, method, and program of this embodiment, a system that classifies data can determine whether newly input data to be classified belongs to an already defined classification (in other words, whether it is an unknown classification). Moreover, by determining with high accuracy whether the classification is unknown and then querying the user, the system can prevent user errors (for example, the user approving an incorrect classification presented by the information processing apparatus, resulting in a misclassification) and can obtain highly reliable user feedback.
   1 Information processing apparatus
   3 Scanner

Claims (12)

  1.  An information processing apparatus comprising:
     storage means for storing feature data of classified target data in association with the classification of that target data;
     target data receiving means for receiving input of new target data;
     feature extraction means for extracting features from the new target data to generate feature data;
     clustering means for clustering a set of target data consisting of the classified target data and the new target data into a number of clusters equal to the number of classifications into which the classified target data is classified plus one, based on the feature data of the classified target data stored by the storage means and the feature data of the new target data generated by the feature extraction means; and
     inquiry output means for performing an output for inquiring about the classification of the new target data when, as a result of the clustering, a cluster containing only the new target data appears.
  2.  The information processing apparatus according to claim 1, further comprising:
     response receiving means for receiving input of a response to the output; and
     classification determination means for determining the classification of the new target data according to the response.
  3.  The information processing apparatus according to claim 2, wherein the storage means stores the feature data of the new target data whose classification has been determined by the classification determination means, in association with the determined classification, as classified target data.
  4.  The information processing apparatus according to claim 2 or 3, further comprising estimation means for estimating the classification of the new target data, when as a result of the clustering the cluster containing the new target data contains other classified target data, based on the feature data of the classified target data stored by the storage means, the classification of the classified target data, and the feature data of the new target data.
  5.  The information processing apparatus according to claim 4, wherein the estimation means estimates the classification of the new target data using a classification model generated based on the feature data of the classified target data and the classification of the classified target data, together with the feature data of the new target data.
  6.  The information processing apparatus according to claim 5, further comprising classification model management means for managing the generated classification model, the classification model management means generating or updating the classification model based on the feature data of the new target data and the classification determined by the classification determination means.
  7.  The information processing apparatus according to claim 4, wherein the estimation means estimates the classification of the new target data by comparing the feature data of the classified target data with the feature data of the new target data and identifying the classification of the classified target data that is similar to the new target data.
  8.  The information processing apparatus according to any one of claims 4 to 7, further comprising confirmation output means for performing an output for confirming the result of the estimation by the estimation means, wherein
     the response receiving means receives input of a response to the confirmation output, and
     the classification determination means determines the classification of the new target data according to the response to the confirmation output.
  9.  The information processing apparatus according to any one of claims 4 to 8, further comprising determination means for determining, before the clustering, whether the feature data of the classified target data stored by the storage means includes feature data identical or similar to the feature data of the new target data generated by the feature extraction means, wherein
     the estimation means performs the estimation without the clustering being performed when the determination means determines that identical or similar feature data exists.
  10.  The information processing apparatus according to any one of claims 1 to 9, wherein the inquiry output means further performs the output for inquiring about the classification of the new target data when the number of classifications into which the classified target data is classified, or the number of items of classified target data stored in the storage means, is zero.
  11.  A method in which a computer executes:
     a storage step of storing feature data of classified target data in association with the classification of that target data;
     a target data receiving step of receiving input of new target data;
     a feature extraction step of extracting features from the new target data to generate feature data;
     a clustering step of clustering a set of target data consisting of the classified target data and the new target data into a number of clusters equal to the number of classifications into which the classified target data is classified plus one, based on the feature data of the classified target data stored in the storage step and the feature data of the new target data generated in the feature extraction step; and
     an inquiry output step of performing an output for inquiring about the classification of the new target data when, as a result of the clustering, a cluster containing only the new target data appears.
  12.  A program for causing a computer to function as:
     storage means for storing feature data of classified target data in association with the classification of that target data;
     target data receiving means for receiving input of new target data;
     feature extraction means for extracting features from the new target data to generate feature data;
     clustering means for clustering a set of target data consisting of the classified target data and the new target data into a number of clusters equal to the number of classifications into which the classified target data is classified plus one, based on the feature data of the classified target data stored by the storage means and the feature data of the new target data generated by the feature extraction means; and
     inquiry output means for performing an output for inquiring about the classification of the new target data when, as a result of the clustering, a cluster containing only the new target data appears.
PCT/JP2016/088752 2016-12-26 2016-12-26 Information processing device, method, and program WO2018122931A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/088752 WO2018122931A1 (en) 2016-12-26 2016-12-26 Information processing device, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/088752 WO2018122931A1 (en) 2016-12-26 2016-12-26 Information processing device, method, and program

Publications (1)

Publication Number Publication Date
WO2018122931A1 true WO2018122931A1 (en) 2018-07-05

Family

ID=62707051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/088752 WO2018122931A1 (en) 2016-12-26 2016-12-26 Information processing device, method, and program

Country Status (1)

Country Link
WO (1) WO2018122931A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09153049A (en) * 1995-11-29 1997-06-10 Hitachi Ltd Method and device for supporting document classification
JP2002230012A (en) * 2000-12-01 2002-08-16 Sumitomo Electric Ind Ltd Document clustering device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112513892A (en) * 2018-07-31 2021-03-16 三菱电机株式会社 Information processing device, program, and information processing method
JP6865901B1 (en) * 2020-03-30 2021-04-28 三菱電機株式会社 Diagnostic system, diagnostic method and program
WO2021199194A1 (en) * 2020-03-30 2021-10-07 三菱電機株式会社 Diagnosis system, diagnosis method, and program
CN115349111A (en) * 2020-03-30 2022-11-15 三菱电机株式会社 Diagnostic system, diagnostic method, and program
CN113676609A (en) * 2020-05-15 2021-11-19 夏普株式会社 Image forming apparatus and document data classifying method
CN113676609B (en) * 2020-05-15 2024-05-14 夏普株式会社 Image forming apparatus and document data classification method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16925477

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16925477

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP