JP2021114091A

JP2021114091A - Program and clustering device

Info

Publication number: JP2021114091A
Application number: JP2020006034A
Authority: JP
Inventors: 正樹中川; Masaki Nakagawa; トゥアンクーングエン; Tuan Cuong Nguyen
Original assignee: Tokyo University of Agriculture and Technology NUC
Current assignee: Tokyo University of Agriculture and Technology NUC
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2021-08-05
Anticipated expiration: 2040-01-17
Also published as: JP7365697B2

Abstract

To provide a program and a clustering device that reduce graders' scoring errors and scoring variation and improve scoring efficiency. SOLUTION: In a clustering device, a processing unit includes: a scale image generation unit that reduces one handwritten pattern image to generate a plurality of scale images; a feature extraction unit that gives each of the scale images to a learning model and extracts, for each scale image, a symbol position feature relating to the appearance position of each kind of symbol; a feature vector generation unit that performs a plurality of kinds of partitioning on the symbol position feature and generates a feature vector by determining the appearance probability of each kind of symbol for each partition; an integration unit that generates an integrated feature vector by taking the maximum value of the appearance probability of the same symbol at the same location of the feature vectors generated from the respective scale images; and a classification unit that classifies a plurality of handwritten patterns into a plurality of groups based on the integrated feature vectors generated from the respective handwritten pattern images. SELECTED DRAWING: Figure 1

Description

The present invention relates to a program and a clustering device.

It is increasingly recognized in society that descriptive (free-response) questions, and not only multiple-choice questions, are needed to develop and measure the thinking ability of examinees (answerers). When descriptive questions are set, highly reliable scoring must be completed in a short period. Handwriting recognition technology makes it possible to provide scoring support that assists human graders, or to perform automatic scoring. In arithmetic and mathematics, where the need for descriptive answers is high, prototypes of learning systems that employ recognition of handwritten answers have been published (for example, Non-Patent Documents 1 to 3).

Non-Patent Document 1: Masato Suzuki et al., "Development of a Basic Mathematics Learning Support System Based on Handwritten Mathematical Expression Analysis", Institute of Electronics, Information and Communication Engineers (IEICE), IEICE Technical Report, ET2009-129, pp. 147-152 (2010)
Non-Patent Document 2: Satoshi Chiba et al., "Prototype of a Mathematics e-Learning System on a Tablet PC Using Handwritten Mathematical Expression Recognition", IPSJ Research Report, Vol. 2014-CE-127, No. 10, pp. 1-5 (2014)
Non-Patent Document 3: Wataru Konishi et al., "Automatic Arithmetic and Mathematics Scoring System Using Handwritten Mathematical Expression Recognition", IPSJ Research Report, Computers and Education Study Group Report, 2016-CE-133(7), pp. 1-7 (2016)

When scoring is done by people, scores often fluctuate (vary) if the answers are presented to the graders in random order. In a large-scale examination, each answer is scored by several graders, and if their scores differ by more than a certain amount, a third party has to score the answer, which adds the cost and time of using the third party. In automatic scoring, on the other hand, an environment in which examinees can check the scoring results and point out scoring errors is indispensable. Even on that premise, a practical approach is to score automatically only those answers that are recognized with high confidence, and to reject those with low confidence and pass them to human graders. Therefore, regardless of whether scoring support or automatic scoring is used, it is important to suppress the variation in scoring among graders.

The present invention has been made in view of the above problems, and its object is to provide a program and a clustering device capable of reducing graders' scoring errors and scoring variation and thereby improving scoring efficiency.

(1) The present invention relates to a program for classifying a plurality of handwritten patterns input by handwriting, the program causing a computer to function as: a scale image generation unit that reduces an image of one handwritten pattern in at least one stage to generate a plurality of scale images including the original image before reduction and at least one reduced image; a feature extraction unit that gives each of the plurality of scale images to a learning model using a convolutional neural network and extracts, for each of the plurality of scale images, a symbol position feature relating to the appearance position of each type of symbol; a feature vector generation unit that performs a plurality of types of partitioning on the symbol position feature and generates a feature vector by obtaining the appearance probability of each type of symbol for each partition obtained by each of the plurality of types of partitioning; an integration unit that generates an integrated feature vector by taking, at the same location of the feature vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; and a classification unit that classifies the plurality of handwritten patterns into a plurality of groups based on the integrated feature vectors generated from the respective images of the plurality of handwritten patterns. The present invention also relates to a computer-readable information storage medium storing a program for causing a computer to function as each of the above units. The present invention also relates to a clustering device including each of the above units.

According to the present invention, giving each of the plurality of scale images to the learning model and extracting a symbol position feature for each scale image makes it possible to extract the classes (types) and positions of symbols of different sizes. In addition, performing a plurality of types of partitioning on the symbol position feature and generating a feature vector by obtaining the appearance probability of each type of symbol for each partition makes it possible to detect symbols regardless of how large or small their positional deviation is. Further, because the feature vectors are generated by applying the plurality of types of partitioning to each of the symbol position features extracted from the respective scale images, position features of different sizes are brought to a common size and can be integrated. By classifying the plurality of handwritten patterns into a plurality of groups based on the integrated feature vector obtained by integrating the feature vectors generated from the respective scale images, similar handwritten patterns can be grouped (clustered) accurately, which reduces graders' scoring errors and scoring variation and improves scoring efficiency.

(2) In the program, the information storage medium, and the clustering device according to the present invention, the computer may further function as a learning unit (the device may further include a learning unit) that: gives each of the plurality of scale images to the learning model and acquires a feature map for each scale image; inputs the feature map to a classifier to obtain the symbol position feature; multiplies the symbol position feature by the feature map and averages the result to obtain a vector; inputs that vector to the classifier to generate a first appearance probability vector indicating the appearance probability of each type of symbol; generates a first integrated appearance probability vector by taking, at the same location of the first appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; and trains the learning model based on the first integrated appearance probability vector and teacher data indicating the presence or absence of each type of symbol.

According to the present invention, the learning model is trained with the first appearance probability vector, which is generated from the vector obtained by multiplying the feature map by the symbol position feature obtained by inputting the feature map to the classifier. This makes it possible to weight the appearance of the symbols that deserve attention during training, so that the learning model can be trained appropriately.

(3) In the program, the information storage medium, and the clustering device according to the present invention, the learning unit may: give each of the plurality of scale images to the learning model and acquire a feature map for each scale image; input a vector that takes the maximum value of the feature map to the classifier to generate a second appearance probability vector indicating the appearance probability of each type of symbol; generate a second integrated appearance probability vector by taking, at the same location of the second appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; generate a third integrated appearance probability vector by averaging the first integrated appearance probability vector and the second integrated appearance probability vector; and train the learning model based on the third integrated appearance probability vector and the teacher data.

According to the present invention, training the learning model using both the first appearance probability vector and the second appearance probability vector, the latter generated from the vector that takes the maximum value of the feature map, allows the learning model to be trained more appropriately.

FIG. 1 is a diagram showing an example of a functional block diagram of the clustering device of the present embodiment.
FIG. 2 is a diagram showing the flow of processing for generating an integrated feature vector from a handwritten mathematical expression pattern image.
FIG. 3 is a diagram showing an example of the configuration of the deep convolutional neural network and the classifier.
FIG. 4 is a diagram schematically showing the symbol position feature.
FIG. 5 is a diagram schematically showing an example of hierarchical spatial pooling.
FIG. 6 is a diagram showing the flow of the learning process.
FIG. 7 is a diagram showing the flow of the global attention pooling process.
FIG. 8 is a diagram in which the symbol position feature extracted from a handwritten mathematical expression pattern image is superimposed on that image.

Hereinafter, the present embodiment will be described. The embodiment described below does not unduly limit the content of the present invention described in the claims. Moreover, not all of the configurations described in the present embodiment are essential constituent requirements of the present invention.

1. Configuration
FIG. 1 shows an example of a functional block diagram of the clustering device of the present embodiment. The clustering device of the present embodiment may have a configuration in which some of the components (units) in FIG. 1 are omitted.

The input unit 160 is for inputting a handwritten mathematical expression pattern (an example of a handwritten pattern) written on paper or the like. Its function can be realized by an optical reading device (a scanner, a camera, or the like) that reads the handwritten mathematical expression pattern as an image (a black-and-white image or a grayscale image).

The storage unit 170 stores programs and various data for causing the computer to function as each unit of the processing unit 100, and also functions as a work area of the processing unit 100. Its function can be realized by a hard disk, RAM, or the like.

The display unit 190 outputs the image generated by the processing unit 100. Its function can be realized by a display such as an LCD or a CRT, or by a touch panel that also functions as the input unit 160.

The processing unit 100 (processor) performs processing such as classification (clustering) processing, learning processing, and display control based on data (image data) from the input unit 160, programs, and the like. The processing unit 100 performs these processes using the main storage in the storage unit 170 as a work area. The functions of the processing unit 100 can be realized by hardware such as various processors (CPU, DSP, etc.) and ASICs (gate arrays, etc.), or by programs. The processing unit 100 includes a scale image generation unit 110, a feature extraction unit 111, a feature vector generation unit 112, an integration unit 113, a classification unit 114, a display control unit 115, and a learning unit 116.

The scale image generation unit 110 reduces one handwritten mathematical expression pattern image in at least one stage and generates a plurality of scale images including the original image before reduction and at least one reduced image.

The feature extraction unit 111 gives each of the plurality of scale images to a learning model using a convolutional neural network and extracts, for each scale image, a symbol position feature relating to the appearance position of each type of symbol. A symbol is an element constituting a mathematical expression. Specifically, symbols are the digits, letters (Latin letters, Greek letters), operators (arithmetic, logical, set, and relational operators), signs (addition symbols, operation symbols, product symbols, integral symbols, limit symbols (lim), special symbols (∞, etc.)), functions (log, sin, cos, tan, etc.), and brackets (large, medium, small) that appear in mathematical expressions.

The feature vector generation unit 112 performs a plurality of types of partitioning on the symbol position feature and generates a feature vector by obtaining the appearance probability of each type of symbol for each partition obtained by each of the plurality of types of partitioning. Here, a plurality of types of partitioning means n × m partitionings that differ in the value of n and/or m.

The integration unit 113 generates an integrated feature vector by taking, at the same location of the feature vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol.

The classification unit 114 classifies a plurality of handwritten mathematical expression patterns into a plurality of groups based on the integrated feature vectors generated from the respective handwritten mathematical expression pattern images.

The display control unit 115 performs control to display, group by group on the display unit 190, the handwritten mathematical expression patterns that the classification unit 114 has classified into a plurality of groups.

The learning unit 116 gives each of a plurality of scale images (scale images for learning) to the learning model and acquires a feature map for each scale image. It inputs the feature map to the classifier to obtain the symbol position feature, multiplies the symbol position feature by the feature map and averages the result to obtain a vector, and inputs that vector to the classifier to generate a first appearance probability vector indicating the appearance probability of each type of symbol. It then generates a first integrated appearance probability vector by taking, at the same location of the first appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol, and trains the learning model based on the first integrated appearance probability vector and teacher data indicating the presence or absence of each type of symbol. The learning unit 116 may also give each of the plurality of scale images to the learning model, acquire a feature map for each scale image, and input a vector that takes the maximum value of the feature map to the classifier to generate a second appearance probability vector indicating the appearance probability of each type of symbol. In that case, it generates a second integrated appearance probability vector by taking, at the same location of the second appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol, generates a third integrated appearance probability vector by averaging (integrating) the first and second integrated appearance probability vectors, and trains the learning model based on the third integrated appearance probability vector and the teacher data.

2. Method of the present embodiment
FIG. 2 is a diagram showing the flow of processing for generating an integrated feature vector from an input image of a handwritten mathematical expression pattern (a handwritten mathematical expression pattern image).

First, one handwritten mathematical expression pattern image (a handwritten mathematical expression pattern image to be clustered) is reduced in at least one stage to generate a plurality of scale images including the original image before reduction and at least one reduced image. In the example shown in FIG. 1, a reduced image whose side lengths are reduced to 1/2 (the 1/4 image, having one quarter of the original area) and a reduced image whose side lengths are reduced to 1/4 (the 1/16 image) are generated from one handwritten mathematical expression pattern image, giving three scale images (the original image, the 1/4 image, and the 1/16 image). A reduced image at a smaller scale may be used instead, and reduced images at other scales may be added.
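The scale image generation step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the source does not say how the image is reduced, so block averaging is an assumed choice, and the function names `downscale` and `make_scale_images`, as well as the representation of an image as a list of rows of gray levels, are hypothetical.

```python
def downscale(image, factor):
    """Shrink a 2-D grayscale image (list of rows) by averaging
    factor x factor pixel blocks. Block averaging is an assumption;
    the source does not name a reduction method."""
    out_h = len(image) // factor   # rows left over at the bottom are dropped
    out_w = len(image[0]) // factor
    return [
        [
            sum(image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)) / (factor * factor)
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def make_scale_images(image, factors=(1, 2, 4)):
    """Return the original image plus reduced copies.

    Side-length factors 1, 2 and 4 correspond to the original image,
    the 1/4 image and the 1/16 image described in the text.
    """
    return [downscale(image, f) for f in factors]
```

With the default factors, a 4 × 8 input yields images of 4 × 8, 2 × 4, and 1 × 2 pixels.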

Next, each of the plurality of scale images is given to the learning model, which consists of a deep convolutional neural network (Deep CNN) and a classifier (Classifier), and a symbol position feature (LC features), which is a feature relating to the appearance position of each type (class) of symbol, is extracted for each scale image. FIG. 3 shows an example of the configuration of the deep convolutional neural network and the classifier. The deep convolutional neural network (Deep CNN) consists of convolutional layers CL and pooling layers PL (max pooling layers) and outputs a feature map (CNN feature map). The feature map (a group of feature maps) contains a plurality of feature maps, each of which is a two-dimensional array of features obtained by applying convolution and pooling to the two-dimensional input (the scale image). The feature map is input to the classifier. Here, the feature maps output from the last two convolutional layers CL are concatenated and input to the classifier. The classifier consists of convolutional layers CL and dropout DO and outputs the symbol position feature.

The box shown as the symbol position feature (LC features) in FIG. 2 represents a stack of planes, as shown in FIG. 4. The symbol position feature consists of a plurality of planes corresponding to the plurality of types (classes) of symbols, and each plane indicates the appearance probability of its class at each position on the input plane (scale image). The depth D (the number of planes stacked in the depth direction) is the number of classes (#classes), the height H is the height of the input plane (the number of pixels in the vertical direction), and the width W is the width of the input plane (the number of pixels in the horizontal direction).

Next, hierarchical spatial pooling (HSP) is performed on each of the symbol position features extracted from the respective scale images. In hierarchical spatial pooling, a plurality of types of partitioning are applied to each plane of the symbol position feature, and a hierarchical spatial feature vector (HS features, corresponding to the feature vector of the present invention) is generated by obtaining the appearance probability of each type (class) of symbol for each partition obtained by each type of partitioning. FIG. 5 is a diagram schematically showing an example of hierarchical spatial pooling. In this example, three types of partitioning (1 × 1, 3 × 5, and 5 × 7) are applied to each plane of the symbol position feature. For each type of partitioning, the maximum value of the appearance probability of each class is obtained for each partition (that is, max pooling is performed per partition), and a k-dimensional feature vector is generated for each partition. Here k is the number of classes (#classes), and the k-dimensional feature vector has as its elements the appearance probabilities of the k types of symbols. The k-dimensional feature vectors of all partitions are concatenated, and the resulting vectors of all types of partitioning are concatenated in turn, to generate the hierarchical spatial feature vector. The 1 × 1 partitioning yields a feature vector of 1 × k = k dimensions (the appearance probability of each class over the whole input plane), the 3 × 5 partitioning yields a feature vector of 3 × 5 × k = 15k dimensions (the appearance probability of each class in each partition when the input plane is divided into 3 × 5 partitions), and the 5 × 7 partitioning yields a feature vector of 5 × 7 × k = 35k dimensions (the appearance probability of each class in each partition when the input plane is divided into 5 × 7 partitions). Concatenating these gives a hierarchical spatial feature vector of k + 15k + 35k = 51k dimensions. The feature vector obtained from the 1 × 1 partitioning carries no position information because the planes of the symbol position feature are not divided, whereas the feature vectors obtained from the other partitionings do carry position information. Any of the above three types of partitioning may be omitted, and other types of partitioning may be added.
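The pooling described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the embodiment's implementation: the symbol position feature is represented as a hypothetical nested list of k planes of H × W probabilities, the partition boundaries are placed by integer division (the source does not say how uneven sizes are split), and the function name `hierarchical_spatial_pooling` is made up for this example.

```python
def hierarchical_spatial_pooling(lc_features, grids=((1, 1), (3, 5), (5, 7))):
    """Max-pool every class plane over several n x m partitionings.

    lc_features: list of k planes, each a list of H rows of W
    probabilities. Requires H >= max(n) and W >= max(m) so that no
    partition is empty. Returns a flat list of k * sum(n * m) values;
    with the default grids that is 51k.
    """
    height = len(lc_features[0])
    width = len(lc_features[0][0])
    vector = []
    for n_rows, n_cols in grids:
        # Assumed boundary rule: near-equal partitions via integer division.
        row_edges = [(i * height) // n_rows for i in range(n_rows + 1)]
        col_edges = [(j * width) // n_cols for j in range(n_cols + 1)]
        for r in range(n_rows):
            for c in range(n_cols):
                # One k-dimensional vector per partition: the maximum
                # probability of each class inside that partition.
                for plane in lc_features:
                    vector.append(max(
                        plane[y][x]
                        for y in range(row_edges[r], row_edges[r + 1])
                        for x in range(col_edges[c], col_edges[c + 1])
                    ))
    return vector
```

The first k entries of the result come from the 1 × 1 partitioning and therefore say only whether each class appears anywhere in the image; the remaining 50k entries localize it with increasing precision.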

Next, an integrated feature vector (multi-scale hierarchical spatial feature vector, MHS features) is generated by taking, at the same location of the hierarchical spatial feature vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol (same class); that is, max aggregation (MA) is performed. In the example shown in FIG. 2, the three hierarchical spatial feature vectors generated from the original image, the 1/4 image, and the 1/16 image are integrated to generate the integrated feature vector.
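Max aggregation is an element-wise maximum over the per-scale vectors, which works because hierarchical spatial pooling gives every scale a vector of the same length. A minimal sketch, in which the function name `max_aggregate` and the use of plain lists are assumptions made for illustration:

```python
def max_aggregate(hs_vectors):
    """Element-wise maximum over hierarchical spatial feature vectors,
    one vector per scale image. All vectors must have equal length."""
    if len({len(v) for v in hs_vectors}) != 1:
        raise ValueError("all feature vectors must have the same length")
    return [max(values) for values in zip(*hs_vectors)]


# Toy three-element vectors standing in for the 51k-dimensional ones.
original = [0.9, 0.1, 0.2]
quarter = [0.3, 0.7, 0.1]
sixteenth = [0.2, 0.4, 0.8]
integrated = max_aggregate([original, quarter, sixteenth])  # [0.9, 0.7, 0.8]
```

Each element of the result keeps the strongest response found at any scale, so a symbol detected well at only one scale still contributes to the integrated feature vector.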

In clustering the handwritten mathematical expression patterns, first, the similarity between handwritten mathematical expression patterns (the distance between their integrated feature vectors) is obtained from the integrated feature vectors generated from the respective handwritten mathematical expression pattern images. For example, the cosine distance can be used as the function for obtaining the distance between integrated feature vectors. The distance between the integrated feature vectors of two handwritten mathematical expression pattern images is obtained for every pair of handwritten mathematical expression patterns to be clustered, and similar handwritten mathematical expression patterns are then grouped into clusters. A method such as K-means++ can be used for the clustering.
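The distance computation can be sketched as below. This covers only the pairwise cosine distances, not the clustering algorithm itself. The function names `cosine_distance` and `pairwise_distances` are hypothetical, and returning a distance of 1.0 for an all-zero vector is an assumption added to avoid division by zero; the source does not discuss that case.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Symmetric matrix of cosine distances over all pairs of vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])
    return dist
```

Because the cosine distance ignores vector length, two answers whose integrated feature vectors differ only by a uniform scaling of the probabilities are treated as identical.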

In mathematical expressions, the size of a symbol changes according to its role. For example, subscripts and superscripts are smaller than symbols on the baseline. In the example shown in FIG. 3, "x" appears in different sizes in the handwritten mathematical expression pattern image. This makes symbol detection in mathematical expressions difficult. In the method of the present embodiment, each of the plurality of scale images is given to a single deep convolutional neural network (Deep CNN) and a symbol position feature (LC features) is extracted for each scale image, which makes it possible to extract the classes and positions of symbols of different sizes. This is because, for any given symbol, a scale image at one of the scales is suitable for extracting its class and position. That is, a scale image at a small scale is suitable for extracting the classes and positions of symbols written large, and a scale image at a large scale (such as the original image) is suitable for extracting the classes and positions of symbols written small.

The positional relationship between symbols is also important in mathematical expression recognition. For example, recognizing a parenthesis or a fraction bar requires not only the shape of the symbol itself but also the context of the symbols around it. Therefore, in the method of the present embodiment, to capture information about the neighborhood of a symbol as well as the symbol itself, the feature maps output from the last two convolutional layers CL of the deep convolutional neural network are combined and input to the classifier.

The method of the present embodiment also applies a plurality of types of partitioning to the symbol position feature and, for each cell obtained by each partitioning, computes the appearance probability of each symbol type to generate a hierarchical spatial feature vector (HS features). By adopting this process (hierarchical spatial pooling), a symbol can be detected regardless of how far it is displaced. That is, even when a symbol is displaced by a large amount, the feature vector obtained from a partitioning with few cells detects it at a coarse position, and when the displacement is small, the feature vector obtained from a partitioning with many cells detects it at a more precise position. To cope with symbol displacement, hierarchical spatial pooling may also superimpose a Gaussian function centered on each cell onto the per-symbol appearance probabilities of that cell.

Scale images of different sizes produce symbol position features of different sizes. In the method of the present embodiment, applying hierarchical spatial pooling to the symbol position features extracted from each of the plurality of scale images brings the features to a common size so that they can be integrated. The plurality of handwritten mathematical expression patterns are then classified into a plurality of groups based on the integrated feature vector (MHS features) obtained by integrating the hierarchical spatial feature vectors generated from the respective scale images. This allows similar handwritten mathematical expression patterns to be clustered together accurately, which reduces scoring errors and scoring variation among scorers and improves scoring efficiency.
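The hierarchical spatial pooling and the integration across scale images described in the two paragraphs above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the probability assigned to a cell is taken as the maximum over the positions in that cell (the text does not fix the aggregation), cell boundaries are computed by integer division, and the optional Gaussian weighting is omitted.

```python
def hierarchical_spatial_pooling(prob_maps, grids):
    """prob_maps: one H x W map of appearance probabilities per symbol class.
    grids: list of (rows, cols) partitionings, e.g. [(1, 1), (3, 5)].
    Returns a flat vector whose length depends only on the number of
    classes and the grids, not on H or W."""
    feats = []
    for rows, cols in grids:
        for pm in prob_maps:
            h, w = len(pm), len(pm[0])
            for r in range(rows):
                y0 = r * h // rows
                y1 = max((r + 1) * h // rows, y0 + 1)  # at least one row
                for c in range(cols):
                    x0 = c * w // cols
                    x1 = max((c + 1) * w // cols, x0 + 1)  # at least one column
                    # Assumption: cell probability = max over the cell.
                    feats.append(max(pm[y][x]
                                     for y in range(y0, y1)
                                     for x in range(x0, x1)))
    return feats

def max_aggregate(vectors):
    """Element-wise maximum of equally sized vectors (one per scale image)."""
    return [max(col) for col in zip(*vectors)]

# One class, two scale images of different sizes.
scale_a = [[[0.9, 0.1, 0.0, 0.2],
            [0.0, 0.0, 0.0, 0.0]]]      # 2 x 4 map
scale_b = [[[0.3, 0.6]]]                # 1 x 2 map
grids = [(1, 1), (1, 2)]
hs_a = hierarchical_spatial_pooling(scale_a, grids)
hs_b = hierarchical_spatial_pooling(scale_b, grids)
mhs = max_aggregate([hs_a, hs_b])
```

Because both pooled vectors have the same length, the element-wise maximum is well defined even though the input maps differ in size.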

Next, training of the deep convolutional neural network (learning model) is described. FIG. 6 shows the flow of the training process. During training, the position information of symbols is not used; instead, an appearance probability vector indicating the appearance probability of each symbol type is extracted by global attentive pooling (GAtP: Global Attentive Pooling).

First, as in clustering, one handwritten mathematical expression pattern image (a training image) is reduced in at least one step to generate a plurality of scale images including the original image before reduction and at least one reduced image. Next, each of the plurality of scale images is given to the deep convolutional neural network (Deep CNN) to obtain the feature map (CNN feature map) output from the last two convolutional layers CL. The depth (D) of the box drawn as the feature map indicates the number of features (#features) it contains, that is, the number of feature maps in the feature map group.

Next, global attentive pooling (GAtP) is applied to the feature map. This process extracts features from all positions where a symbol appears, while weighting the appearances of the class of interest. FIG. 7 shows the flow of global attentive pooling. First, the feature map (F) is input to the classifier (Classifier) to obtain the symbol position features (M_i, i = 1..k), where k is the number of classes (#classes). Next, for each symbol, the feature map (F) is multiplied by the symbol position feature to obtain k class attention feature maps (F×M_1 to F×M_k). Letting M_c denote the symbol position feature of class c (the c-th class), that is, the appearance probability of class c at each position on the input plane, multiplying the feature map (F) by the symbol position feature (M_c) gives the class attention feature map (F×M_c), which is a feature map weighted by the appearance of class c. Next, each of the class attention feature maps (F×M_1 to F×M_k) is averaged over the height H and width W (averaged per feature) to obtain the class attention vectors (Att_F1 to Att_Fk). The class attention vector (Att_Fc), obtained by multiplying the feature map (F) by the symbol position feature (M_c) and averaging, is given by the following equation (1).
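Equation (1) is not reproduced in this text. A form consistent with the description above (element-wise product of F and M_c, averaged over the H×W positions) would be the following; it is a reconstruction from the surrounding definitions, not a copy of the original figure.

```latex
\mathrm{Att\_F}_c \;=\; \frac{1}{H\,W}\sum_{x=1}^{H}\sum_{y=1}^{W} F(x,y)\,M_c(x,y) \tag{1}
```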

Here, x and y index the height (H) and width (W) dimensions of the feature map (F) and the symbol position feature (M_c).

Next, the vectors f(Att_F1) to f(Att_Fk), obtained by inputting each of the class attention vectors (Att_F1 to Att_Fk) into the classifier (Classifier), are diagonalized (Diagonalize) to obtain a first appearance probability vector (a vector y whose elements are y_1 to y_k) indicating the appearance probability of each class. The element y_c (the appearance probability of class c), which is the c-th element of the vector f(Att_Fc), is given by the following equation (2).
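Equation (2) is likewise not reproduced in this text. Given that y_c is the c-th element of f(Att_Fc), and given the one-hot vector and inner product defined immediately below, a consistent form would be the following reconstruction.

```latex
y_c \;=\; \bigl\langle f(\mathrm{Att\_F}_c),\; u_c \bigr\rangle \tag{2}
```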

Here, u_c is the one-hot vector of class c (a vector whose c-th element is 1 and whose other elements are 0), and <a, b> denotes the inner product of vectors a and b.
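The global attentive pooling steps above can be sketched in a few lines. This is a minimal sketch under assumptions: the classifier f is modeled here as a single linear layer followed by a sigmoid, which the text does not specify, and the names `classify`, `weights`, and `biases` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(feature, weights, biases):
    """Hypothetical classifier f: one linear unit plus sigmoid per class."""
    return [sigmoid(sum(w * v for w, v in zip(row, feature)) + b)
            for row, b in zip(weights, biases)]

def global_attentive_pooling(F, weights, biases):
    """F: feature map as D x H x W nested lists. Returns y = (y_1 .. y_k)."""
    D, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(weights)
    # Symbol position features M_c: class probabilities at every position.
    M = [[[0.0] * W for _ in range(H)] for _ in range(k)]
    for x in range(H):
        for y in range(W):
            probs = classify([F[d][x][y] for d in range(D)], weights, biases)
            for c in range(k):
                M[c][x][y] = probs[c]
    y_out = []
    for c in range(k):
        # Class attention vector Att_Fc: mean over positions of F * M_c.
        att = [sum(F[d][x][y] * M[c][x][y]
                   for x in range(H) for y in range(W)) / (H * W)
               for d in range(D)]
        # "Diagonalize": keep only the c-th output of f(Att_Fc).
        y_out.append(classify(att, weights, biases)[c])
    return y_out
```

Note that no symbol positions are supplied anywhere; the position features M_c are produced by the classifier itself and used only as weights.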

Next, as shown in FIG. 6, a first integrated appearance probability vector (My) is generated by taking, at each position of the first appearance probability vectors (y) generated from the respective scale images, the maximum appearance probability of the same symbol (that is, by performing max aggregation (MA)).

Next, the learning model (deep convolutional neural network) is trained with a binary entropy loss (Binary entropy loss) computed from the first integrated appearance probability vector (My) and the teacher data (a vector indicating the appearance probability, that is, the presence or absence, of each class). In the example shown in FIG. 6, the symbols "a", "b", "2", "+", and the fraction bar (frac) appear in the training pattern, so the teacher data is a vector in which the appearance probability of these classes is "1" and that of all other classes is "0".
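The max aggregation and loss computation above can be illustrated as follows. This is a sketch that assumes the "binary entropy loss" is the usual binary cross-entropy averaged over classes; the function names, the clamping constant `eps`, and the toy class order are illustrative only.

```python
import math

def max_aggregate(vectors):
    """Element-wise maximum over the per-scale probability vectors."""
    return [max(col) for col in zip(*vectors)]

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a
    multi-hot target (1 = symbol class present, 0 = absent)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

# Toy example with 4 classes in the assumed order ["a", "b", "x", "+"].
y_scales = [[0.8, 0.2, 0.1, 0.3],   # from the original image
            [0.4, 0.9, 0.2, 0.7]]   # from a reduced image
my = max_aggregate(y_scales)        # integrated vector My
target = [1, 1, 0, 1]               # "a", "b", "+" present; "x" absent
loss = binary_cross_entropy(my, target)
```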

Without pre-training, global attentive pooling (GAtP) may take a very long time to converge or, in the worst case, may not converge at all. It may also output a non-negligible probability for classes that do not appear in the handwritten mathematical expression pattern image. To address these problems, the appearance probability of each class (a second appearance probability vector) may be obtained by global max pooling (GMP: Global Max Pooling); a second integrated appearance probability vector may be generated by taking, at each position of the second appearance probability vectors generated from the respective scale images, the maximum appearance probability of the same symbol; a third integrated appearance probability vector may be generated by integrating (averaging) the second integrated appearance probability vector and the first integrated appearance probability vector (My); and the learning model may be trained based on the third integrated appearance probability vector and the teacher data. Global max pooling computes the d-dimensional vector (d = depth, #features) of equation (3), which takes the maximum over x and y for each feature of the feature map (F).
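Equation (3) is not reproduced in this text. A form consistent with the description (one maximum over all positions per feature) would be the following reconstruction, in which the name GMP_F for the resulting vector and the notation F_i for the i-th feature of F are assumptions.

```latex
\mathrm{GMP\_F} \;=\; \Bigl(\max_{x,y} F_1(x,y),\; \max_{x,y} F_2(x,y),\; \ldots,\; \max_{x,y} F_d(x,y)\Bigr) \tag{3}
```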

This vector is input to the classifier (Classifier) to obtain the second appearance probability vector. For each class, this process extracts the feature from the single most likely position. By integrating the first and second integrated appearance probability vectors, the deep convolutional neural network is trained in the direction that reduces the loss functions of both global attentive pooling and global max pooling. In the early stage of training, learning progresses by exploiting the advantages of global max pooling; once training has advanced, global attentive pooling comes into play.
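Global max pooling and the averaging of the two integrated vectors can be sketched as below. The function names are hypothetical, and the classifier step between the pooled vector and the second appearance probability vector is left out for brevity.

```python
def global_max_pooling(F):
    """F: D x H x W nested lists. Returns the d-dimensional vector holding
    the maximum of each feature over all positions (x, y)."""
    return [max(max(row) for row in fmap) for fmap in F]

def average_vectors(a, b):
    """Element-wise mean of two equally sized probability vectors, used here
    to merge the GAtP-based and GMP-based integrated vectors."""
    return [(p + q) / 2.0 for p, q in zip(a, b)]

pooled = global_max_pooling([[[1, 5], [2, 0]],
                             [[-1, -3], [-2, -4]]])
third = average_vectors([0.5, 0.25], [1.0, 0.75])
```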

3. Evaluation experiment
An experiment was conducted to evaluate the clustering method (and training method) of the present embodiment. Purity, defined by the following equation (4), was used as the evaluation metric for clustering. Since scoring is performed per cluster, a purity closer to 1 is better.
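Equation (4) is not reproduced in this text. The standard definition of purity, written with the symbols defined immediately below, would be the following; it is given as a reconstruction on the assumption that the original uses this standard form.

```latex
\mathrm{Purity}(G, C) \;=\; \frac{1}{H}\sum_{k=1}^{K}\,\max_{1 \le j \le J}\,\bigl|\,g_k \cap c_j\,\bigr| \tag{4}
```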

Here, K is the number of clusters, J is the number of actual classes, H is the number of samples, G = {g_1, g_2, ..., g_K} is the set of resulting clusters, and C = {c_1, c_2, ..., c_J} is the set of actual classes.
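As an illustration, purity in its standard form can be computed from two parallel lists of cluster and class labels. The function name and the label encoding are assumptions of this sketch.

```python
from collections import Counter

def purity(cluster_ids, class_ids):
    """For each cluster, count its most frequent true class; sum these
    counts over clusters and divide by the number of samples."""
    by_cluster = {}
    for g, c in zip(cluster_ids, class_ids):
        by_cluster.setdefault(g, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(class_ids)

mixed = purity([0, 0, 1, 1], ["a", "a", "b", "a"])
singletons = purity([0, 1, 2, 3], ["a", "a", "b", "a"])
```

The second call puts every sample in its own cluster, which trivially yields a purity of 1 and shows why the number of clusters has to be constrained when purity is used as the metric.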

In this experiment, the CROHME 2016 dataset (H. Mouchere, C. Viard-Gaudin, R. Zanibbi, U. Garain, ICFHR2016 CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions, in: 2016 15th Int. Conf. Front. Handwrit. Recognit., IEEE, 2016: pp. 607-612. doi:10.1109/ICFHR.2016.0116.) was used as the source of handwritten mathematical expression pattern images. It contains 101 types of symbols, and consists of 8,834 training samples, 986 validation samples, and 1,147 test samples. The original test samples were not used in this experiment, because they contain only one sample per expression. Instead, 620 samples covering 36 types of expressions were taken from the training samples and used as test samples; these 620 samples contain 15 to 22 samples per expression. The remaining 8,214 samples (the original 8,834 training samples minus these 620) were used for training. The validation samples were used as is. This sample set is hereinafter referred to as S-CROHME.

In this experiment, every sample (expression image) was normalized to a height of 128 pixels with its aspect ratio preserved (most samples are taller than this and are reduced to 128 pixels; the rare smaller ones are enlarged to 128 pixels). Any resulting image wider than 800 pixels was then linearly compressed horizontally to a width of 800 pixels; images narrower than 800 pixels were left unchanged. To generate images at multiple resolutions (reduced images in multiple steps), average pooling with a 2×2 mask was applied twice, yielding expression images (three scale images) with heights of 128, 64, and 32 pixels. This preprocessing is used because a mathematical expression contains multiple symbols and usually needs a height of 128 pixels and a width at least that large to remain legible. For inputs with fewer pixels (for example, 64 pixels), one can prepare expression images with heights of 64, 32, and 16 pixels, reduce the number of cells in the partitioning, or take similar measures.
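The preprocessing above can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, images are plain nested lists of pixel values, odd trailing rows or columns are simply dropped by the pooling (an assumption), and only the target size of the normalization is computed, not the resampling itself.

```python
def normalized_size(height, width, target_h=128, max_w=800):
    """Target size after scaling to target_h with fixed aspect ratio and
    then compressing the width to at most max_w."""
    new_w = round(width * target_h / height)
    return target_h, min(new_w, max_w)

def average_pool_2x2(img):
    """Halve both dimensions by averaging each 2 x 2 block of pixels."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)]
            for r in range(h)]

def make_scale_images(img, steps=2):
    """Original image plus `steps` successively pooled images."""
    scales = [img]
    for _ in range(steps):
        scales.append(average_pool_2x2(scales[-1]))
    return scales

blank = [[0.0] * 8 for _ in range(128)]
heights = [len(s) for s in make_scale_images(blank)]
```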

Table 1 shows the specific configuration of the deep convolutional neural network used in this experiment. In Table 1, #maps is the number of feature maps, k is the number of kernels (filters), s is the stride, and p is the padding. ReLU is the activation function.

In hierarchical spatial pooling (HSP), 1×1, 3×5, 3×7, and 5×7 partitionings were used. A cosine distance function was used to compute the distance between the integrated feature vectors of two expression images, and K-means++ was adopted as the clustering method.

A high purity is easy to obtain by making the number of clusters K large. For example, if K equals the number of samples H, the purity is 1. That, however, is equivalent to not clustering at all and does nothing for scoring efficiency. The number of clusters K must therefore be roughly the number of distinct expressions. In a real scoring situation, that number is unknown unless the expressions are recognized accurately. For this evaluation, K was set to the number of distinct expressions, which was known in advance; that is, S-CROHME was clustered into 36 clusters.

First, the effect of hierarchical spatial pooling (HSP) in generating the hierarchical spatial feature vectors (HS features) was evaluated. The S-CROHME test samples were clustered and the purity was measured for three settings: no partitioning (1×1 partitioning only), a single partitioning (only one of 3×5, 3×7, or 5×7), and multiple types of partitioning (hierarchical spatial pooling). The results are shown in Table 2.

Whereas the purity without partitioning was 0.92, every single partitioning achieved a purity of 0.96 or higher. Every combination of multiple partitionings achieved a higher purity than a single partitioning, which indicates that using multiple types of partitioning is more effective than using one. In particular, the combination of the 1×1 and 3×5 partitionings achieved a purity of 0.99.

Next, the effect of the choice of global pooling was evaluated. The purity was measured when training used global average pooling (GAP: Global Average Pooling), global max pooling (GMP), and the combination of global attentive pooling and global max pooling (GAtP+GMP). The results are shown in Table 3. Global average pooling obtains the appearance probability vector by inputting to the classifier (Classifier) a vector holding the average of each feature of the feature map (F); for each class, it extracts the feature from all positions where the class appears.

With partitioning, GAtP+GMP performs slightly better than GMP alone and clearly better than GAP. Without partitioning (1×1 partitioning only), GAP achieves a purity of only 0.28.

The choice of global pooling affects not only clustering but also the average symbol recognition rate. The average recognition rates of GAtP+GMP, GMP, and GAP were 0.94, 0.93, and 0.17, respectively, confirming the advantage of GAtP+GMP in recognition performance as well.

FIG. 8 shows a handwritten mathematical expression pattern image (scale image) overlaid with the symbol position features extracted from it (for each symbol type, the appearance probability at each position in the image). Higher appearance probabilities are drawn with higher brightness. The leftmost (first) column shows the original image overlaid with the symbol position features extracted from the original image, the second column shows the 1/4 image overlaid with the features extracted from the 1/4 image, the third column shows the 1/16 image overlaid with the features extracted from the 1/16 image, and the rightmost (fourth) column shows the composite of these three images (the maximum value at each position). Rows 1 to 11, from the top, show the appearance probability at each position of each symbol appearing in the expression ("x", "Σ", "P", "n", "j", "i", "a", "0", "=", ")", "("), and the bottom (twelfth) row shows their composite (the appearance probability at each position over all symbols). In FIG. 8, the reduced images (1/4 image and 1/16 image) and their symbol position features are enlarged to the size of the original image for display.

FIG. 8 shows that symbols of different sizes are each handled well by one of the three scale images. Small symbols such as "x", "n", "j", and "a" are detected in the original image; medium-sized symbols such as "(", ")", "=", and "P" are detected in the 1/4 image; and the large symbol "Σ" is detected in the 1/16 image.

The positions of the symbols are also estimated correctly. However, as the images in rows 10 and 11 show, the symbols "(" and ")" are both detected at the same locations. Because "(" and ")" appear as a pair in mathematical expressions, semi-supervised learning that is trained only on class information, without position information, cannot separate them into distinct positions. The adverse effect on clustering is nevertheless minor.

The present invention is not limited to the embodiment described above, and various modifications are possible. The present invention includes configurations that are substantially the same as those described in the embodiment (for example, configurations with the same function, method, and result, or configurations with the same purpose and effect). The present invention also includes configurations in which a non-essential part of the configuration described in the embodiment is replaced, configurations that produce the same effects or achieve the same purpose as the configuration described in the embodiment, and configurations in which a known technique is added to the configuration described in the embodiment.

The above embodiment describes the offline method (image input method), in which handwritten mathematical expression patterns are read as images by an optical reader. The present invention can also be applied to the online method (pen-point coordinate input method), in which a tablet or similar device samples the coordinates of the pen position (pen points) at fixed time intervals. In the online case, the above embodiment can be applied as is by connecting the time-series pen points, giving the resulting strokes a thickness, and converting them into an image. Alternatively, features may be extracted from the one-dimensional time-series data, and symbol position features (LC features) extracted from them as in the offline method. In this case there is no need to reduce the handwritten pattern image to prepare multiple scale images as in the offline method; an LSTM (Long Short-Term Memory) neural network can be used to extract the symbol position feature for each symbol. One-dimensional hierarchical spatial pooling then applies a plurality of types of partitioning to the symbol position feature, computes the appearance probability of each symbol type for each cell to generate a feature vector, and concatenates the feature vectors from all types of partitioning to generate the hierarchical spatial feature vector. Because multiple scale images are not prepared, max aggregation (MA) of hierarchical spatial feature vectors is unnecessary.
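The conversion of online pen-point data into an image can be sketched as follows. This is a minimal illustration under assumptions: strokes are lists of integer (x, y) pen points, consecutive points are joined by linearly interpolated pixels, and thickness is added by stamping a small square around each pixel. The function name and these rendering choices are not taken from the embodiment.

```python
def rasterize_strokes(strokes, height, width, thickness=1):
    """Render strokes (lists of (x, y) pen points) into a binary image.
    `thickness` is the half-width, in pixels, of the square stamped around
    every point on the line between consecutive pen points."""
    img = [[0] * width for _ in range(height)]

    def stamp(x, y):
        for dy in range(-thickness, thickness + 1):
            for dx in range(-thickness, thickness + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    img[yy][xx] = 1

    for stroke in strokes:
        if len(stroke) == 1:  # a single dot
            stamp(*stroke[0])
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            n = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(n + 1):
                stamp(round(x0 + (x1 - x0) * i / n),
                      round(y0 + (y1 - y0) * i / n))
    return img

line = rasterize_strokes([[(0, 0), (4, 0)]], height=3, width=5, thickness=0)
```

The resulting image can then be treated like a scanned pattern image and passed through the same scale-image generation and feature extraction as in the offline method.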

The above embodiment describes the classification of handwritten mathematical expression patterns, but the present invention is not limited to this. The present invention can be applied to the classification of any structure in which symbols are arranged in a two-dimensional space (handwritten patterns such as structural formulas of chemical substances, chemical reaction formulas, and logical formulas).

100 ... processing unit, 110 ... scale image generation unit, 111 ... feature extraction unit, 112 ... feature vector generation unit, 113 ... integration unit, 114 ... classification unit, 115 ... display control unit, 116 ... learning unit, 160 ... input unit, 170 ... storage unit, 190 ... display unit

Claims

A program for classifying a plurality of handwritten patterns, the program causing a computer to function as:
a scale image generation unit that reduces an image of one handwritten pattern in at least one step and generates a plurality of scale images including the original image before reduction and at least one reduced image;
a feature extraction unit that gives each of the plurality of scale images to a learning model using a convolutional neural network and extracts, for each of the plurality of scale images, a symbol position feature relating to the appearance position of each symbol type;
a feature vector generation unit that applies a plurality of types of partitioning to the symbol position feature, obtains the appearance probability of each symbol type for each cell obtained by each of the plurality of types of partitioning, and generates a feature vector;
an integration unit that generates an integrated feature vector by taking, at the same location of the feature vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; and
a classification unit that classifies the plurality of handwritten patterns into a plurality of groups based on the integrated feature vector generated from each of the images of the plurality of handwritten patterns.

In claim 1,
the program further causes the computer to function as a learning unit that: gives each of the plurality of scale images to the learning model and acquires a feature map for each of the plurality of scale images; multiplies the feature map by the symbol position feature obtained by inputting the feature map into a classifier, averages the result, and inputs the resulting vector to the classifier to generate a first appearance probability vector indicating the appearance probability of each symbol type; generates a first integrated appearance probability vector by taking, at the same location of the first appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; and trains the learning model based on the first integrated appearance probability vector and teacher data indicating the presence or absence of each symbol type.

In claim 2,
the learning unit: gives each of the plurality of scale images to the learning model and acquires a feature map for each of the plurality of scale images; inputs a vector of the maximum values of the feature map to the classifier to generate a second appearance probability vector indicating the appearance probability of each symbol type; generates a second integrated appearance probability vector by taking, at the same location of the second appearance probability vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; generates a third integrated appearance probability vector by averaging the first integrated appearance probability vector and the second integrated appearance probability vector; and trains the learning model based on the third integrated appearance probability vector and the teacher data.

A clustering device that classifies a plurality of handwritten patterns, comprising:
a scale image generation unit that reduces an image of one handwritten pattern in at least one step and generates a plurality of scale images including the original image before reduction and at least one reduced image;
a feature extraction unit that gives each of the plurality of scale images to a learning model using a convolutional neural network and extracts, for each of the plurality of scale images, a symbol position feature relating to the appearance position of each symbol type;
a feature vector generation unit that applies a plurality of types of partitioning to the symbol position feature, obtains the appearance probability of each symbol type for each cell obtained by each of the plurality of types of partitioning, and generates a feature vector;
an integration unit that generates an integrated feature vector by taking, at the same location of the feature vectors generated from the respective scale images, the maximum value of the appearance probability of the same symbol; and
a classification unit that classifies the plurality of handwritten patterns into a plurality of groups based on the integrated feature vector generated from each of the images of the plurality of handwritten patterns.