JP7388660B2

JP7388660B2 - Information processing device, user terminal, information processing method, and information processing program

Info

Publication number: JP7388660B2
Application number: JP2021087403A
Authority: JP
Inventors: 亮太吉橋; 智大田中; 賢治土井; 拓海藤野; 直晃山下
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2023-11-29
Anticipated expiration: 2041-05-25
Also published as: JP2022180741A

Description

The present disclosure relates to an information processing device, a user terminal, an information processing method, and an information processing program.

Techniques for recognizing characters contained in various kinds of image information are known. Character recognition from image information is called OCR (Optical Character Recognition).

For example, Patent Document 1 discloses a character recognition device that includes a character area detection unit that detects a character area in an image, a character area division unit that divides the character area into individual characters, and a character recognition unit that performs character recognition on each character in the divided areas and outputs one or more candidate recognition results per character.

However, when the character area detector and the character recognizer are separated, as in the character recognition device disclosed in Patent Document 1, processing may become slow. In addition, the use of smartphones and similar devices has grown in recent years, and there is demand for character recognition technology that runs comfortably on the computational resources such devices provide.

If characters contained in an image can be recognized on a user terminal such as a smartphone, processing can be performed even outside the coverage of a communication network. Communication charges for uploading image data can be saved. Concerns about leakage of personal information through uploaded image data can be reduced. Furthermore, when a user wants to search using characters contained in image data, the search can be performed without the user typing those characters into the user terminal.

Thus, if character recognition from image data can be performed quickly on a user terminal, many benefits can be provided to the user.

Japanese Patent Application Publication No. 2012-185722

In view of the above problems, the present disclosure aims to provide an information processing device, a user terminal, an information processing method, and an information processing program that can improve processing speed without reducing the recognition accuracy for recognition targets.

To solve the above problems and achieve this object, an information processing device according to the present disclosure includes: an acquisition unit that acquires data containing a plurality of recognition targets; an estimation unit that estimates, from the data, a plurality of regions each containing a recognition target; an extraction unit that extracts a representative feature based on the features extracted for each region; an inference unit that infers, based on the representative feature, the recognition target contained in the corresponding region; and an output unit that outputs the recognition target.

According to one aspect of the embodiment, it is possible to provide an information processing device, a user terminal, an information processing method, and an information processing program that can improve processing speed without reducing the recognition accuracy for recognition targets.

FIG. 1 is a diagram illustrating an example of information processing according to the embodiment.
FIG. 2 is a schematic diagram of an information processing device that implements the information processing according to the embodiment.
FIG. 3 is a diagram illustrating a configuration example of an information processing system according to the embodiment.
FIG. 4 is a diagram illustrating a configuration example of the information processing device according to the embodiment.
FIG. 5 is a diagram illustrating an example of information stored in the model information storage unit according to the embodiment.
FIG. 6 is a diagram illustrating an example of information stored in the pattern storage unit according to the embodiment.
FIG. 7 is a diagram illustrating a configuration example of the user terminal according to the embodiment.
FIG. 8 is a diagram showing the detection accuracy and recognition accuracy of the information processing according to the embodiment.
FIG. 9 is a diagram showing the processing speed and detection accuracy of the information processing according to the embodiment.
FIG. 10 is a diagram showing the recognition time when the information processing according to a working example is executed on a user terminal.
FIG. 11 is a flowchart illustrating an example of information processing according to the embodiment.
FIG. 12 is a hardware configuration diagram showing an example of a computer that implements the functions of the information processing device.

Modes for implementing the information processing device, user terminal, information processing method, and information processing program according to the present application (hereinafter referred to as "embodiments") are described in detail below with reference to the drawings. The information processing device, user terminal, information processing method, and information processing program according to the present application are not limited by these embodiments. In the embodiments below, identical parts are given the same reference numerals, and redundant explanations are omitted.

(Embodiment)
[1-1. Overview of information processing according to the embodiment]
First, an example of information processing according to the embodiment is described using FIG. 1. FIG. 1 is a diagram illustrating an example of information processing according to the embodiment, in which the information processing is executed by an information processing device 100. The information processing is explained step by step below.

First, the information processing device 100 acquires image data containing a plurality of recognition targets (step S1). For example, the information processing device 100 acquires image data containing the characters "peace" as recognition targets. The recognition targets are not limited to the characters "peace" and may be any other characters.

Next, the information processing device 100 executes a process of extracting features from the acquired image data (step S2). For example, the information processing device 100 may extract the features using a learning model trained on training data in which annotation labels are attached to image data containing characters.

Next, the information processing device 100 outputs a region heat map generated by the feature extraction process (step S3-1). The information processing device 100 also outputs a combined heat map generated by the feature extraction process (step S3-2). Here, the region heat map is a map that gives, for each position in the image data, the probability that a character exists there; it is generated by a feature extractor trained so that the centers of characters are activated. The combined heat map is a map that gives, for each position in the image data, the probability that a gap between characters exists there; it is generated by a feature extractor trained so that the gaps between characters are activated.

Next, the information processing device 100 determines, from the region heat map, the position at which a representative feature is to be extracted for each region (step S4). For example, for each region shown in the region heat map, the information processing device 100 determines the most strongly activated position within the region as the position at which the representative feature is extracted. Here, a representative feature is, among the features extracted from the image data, a feature that reflects the characteristics of the recognition target.

Next, the information processing device 100 infers, for each region, the recognition target corresponding to the representative feature (step S5). For example, the information processing device 100 extracts a representative feature for each region from the region heat map and infers the recognition target corresponding to the extracted representative feature based on a database that stores patterns of recognition targets corresponding to representative features.

Next, the information processing device 100 outputs the recognition result (step S6). For example, when the characters "peace" are inferred as the recognition result corresponding to the representative features extracted for each region, the information processing device 100 outputs the characters "peace".

In this way, the information processing device 100 can reduce the number of features used in the inference process and can therefore speed up the recognition process. Accordingly, processing speed can be improved without reducing the recognition accuracy for recognition targets.

[1-2. Specific example of information processing according to the embodiment]
Next, the information processing according to the embodiment is described in more detail using a specific example. FIG. 2 is a schematic diagram of a specific example of the information processing according to the embodiment. The information processing is explained step by step below.

First, the information processing device 100 acquires image data containing a plurality of recognition targets (step S10). For example, as shown in FIG. 2, the information processing device 100 acquires image data GD1 containing a character string C1, "Research Organization for the 21st Century".

Next, the information processing device 100 executes a process of extracting features from the acquired image data (step S11). For example, the information processing device 100 may execute the feature extraction process using U-net, which is one type of fully convolutional network (FCN). U-net is a network for estimating where recognition targets exist in image data.

Next, the information processing device 100 outputs the region heat map and the combined heat map obtained by the feature extraction process (step S12-1). The information processing device 100 also outputs a character feature map obtained by the feature extraction process (step S12-2). For example, the information processing device 100 outputs a region heat map RM1, obtained by the feature extraction process using U-net, which indicates for each position in the image data GD1 the probability that a character of the character string C1 exists there, and a combined heat map LM1, which indicates for each position in the image data GD1 the probability that a gap between characters of the character string C1 exists there. The information processing device 100 also outputs a character feature map CFM1 obtained by the feature extraction process using U-net. Here, the character feature map CFM1 is a map composed of a horizontal position feature W, a vertical position feature H, and a character shape feature F for the plurality of characters contained in the image data GD1.

Next, the information processing device 100 determines, from the region heat map, the positions at which representative features are to be extracted (step S13). For example, the information processing device 100 may determine, for each region in the region heat map, the most strongly activated position as the position at which the representative feature is extracted.
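Step S13 can be illustrated with a minimal sketch. The source does not say how regions are delimited in the region heat map, so the sketch assumes that a region is a 4-connected group of cells whose activation is at or above a threshold; the function names `find_regions` and `peak_positions`, the threshold value, and the toy heat map are all hypothetical.

```python
from collections import deque

def find_regions(heatmap, threshold=0.5):
    """Group cells with activation >= threshold into 4-connected regions.

    Assumption for illustration: this is one plausible way to delimit
    regions; the source does not specify the method.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or heatmap[r][c] < threshold:
                continue
            # Breadth-first flood fill starting from an unvisited active cell.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def peak_positions(heatmap, threshold=0.5):
    """Return the most strongly activated (row, col) of each region."""
    return [max(region, key=lambda p: heatmap[p[0]][p[1]])
            for region in find_regions(heatmap, threshold)]

# Toy region heat map with two character-like blobs.
toy_heatmap = [
    [0.1, 0.6, 0.9, 0.1, 0.0],
    [0.0, 0.7, 0.6, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.95],
]
print(peak_positions(toy_heatmap))  # [(0, 2), (2, 4)]
```

Each region contributes exactly one sampling position, which is what keeps the later inference step small.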

Next, the information processing device 100 extracts a representative feature at the position determined for each region (step S14). For example, the information processing device 100 may extract the character shape feature F from the character feature map CFM1 as the representative feature at each position of the image data GD1 that was determined as an extraction position. The information processing device 100 outputs a character feature vector CFV1 in which the value of the representative feature is set for each extraction position.
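A minimal sketch of step S14 follows, assuming the character shape feature F is stored as one vector per pixel of the character feature map. The layout `feature_map[row][col]`, the function name `extract_feature_vectors`, and the toy values are assumptions made for illustration, not details given in the source.

```python
def extract_feature_vectors(feature_map, positions):
    """Pick the shape-feature vector at each chosen (row, col) position.

    feature_map[row][col] is assumed to hold the character shape
    feature F for that pixel as a list of floats.
    """
    return [list(feature_map[r][c]) for r, c in positions]

# Toy 2 x 3 feature map with 2-dimensional shape features.
toy_feature_map = [
    [[0.0, 0.1], [0.2, 0.3], [0.9, 0.1]],
    [[0.4, 0.6], [0.5, 0.5], [0.7, 0.2]],
]
toy_positions = [(0, 2), (1, 0)]

vectors = extract_feature_vectors(toy_feature_map, toy_positions)
print(vectors)  # [[0.9, 0.1], [0.4, 0.6]]
```

In this toy case only two of the six per-pixel vectors are passed on, which illustrates how sampling one position per region reduces the number of features the inference step has to handle.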

Next, the information processing device 100 infers, based on a dictionary, the classification probabilities of the characters corresponding to the representative features (step S15). For example, for each position at which a representative feature was extracted, the information processing device 100 infers the classification probabilities of the characters corresponding to that representative feature based on a dictionary DC1 that stores the relationship between representative features and their corresponding characters.

Next, the information processing device 100 selects, for each representative feature, the character with the highest classification probability and outputs the selected characters as the recognition result (step S16). For example, suppose that selecting the characters with the highest classification probability for the representative features of the image data GD1 yields "Research Organization for the 21st Century". In this case, the information processing device 100 outputs the character string C1, "Research Organization for the 21st Century", as the recognition result.
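Steps S15 and S16 can be sketched as follows. The source does not describe how the dictionary DC1 turns a representative feature into classification probabilities, so this sketch assumes that the dictionary holds one prototype vector per character, scores each character by a dot product, and normalizes the scores with a softmax. The names `classify` and `recognize` and the toy dictionary are hypothetical.

```python
import math

def classify(feature, dictionary):
    """Return (best_char, probs) for one representative feature.

    Assumption for illustration: dictionary maps each character to a
    prototype vector, and probabilities are a softmax over dot products.
    """
    scores = {ch: sum(f * p for f, p in zip(feature, proto))
              for ch, proto in dictionary.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {ch: math.exp(s - top) for ch, s in scores.items()}
    total = sum(exps.values())
    probs = {ch: e / total for ch, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs

def recognize(feature_vectors, dictionary):
    """Pick the highest-probability character per position and join them."""
    return "".join(classify(f, dictionary)[0] for f in feature_vectors)

# Toy dictionary with three characters and 3-dimensional prototypes.
toy_dictionary = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
toy_vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.1, 0.7, 0.2]]

print(recognize(toy_vectors, toy_dictionary))  # acb
```

Because the classifier runs once per sampled position rather than once per pixel, its cost grows with the number of characters in the image, not with the image size.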

[1-3. Other example 1 of information processing according to the embodiment (changing the extraction position of the representative feature)]
The information processing device 100 changes the position within a region at which the representative feature is extracted, depending on the size of the recognition target contained in the region.

The processing of the information processing device 100 is described in order. The information processing device 100 acquires image data containing a plurality of recognition targets. The information processing device 100 executes a process of extracting features from the image data. Specifically, it extracts features such as a feature indicating the probability that a recognition target exists at a given position in the image data, a feature indicating the probability that a gap between recognition targets exists, and a feature indicating the characteristics of the recognition target. In this process, the information processing device 100 also estimates the regions in which recognition targets exist. Among the features extracted from the image data, the information processing device 100 determines, for each region in which a recognition target exists, the position at which the representative feature used for inferring the recognition target is to be extracted. At this time, the information processing device 100 changes the position within the region at which the representative feature is extracted, depending on the size of the recognition target contained in the region. Once the extraction position has been determined for each region, the information processing device 100 extracts the representative feature at the determined position in each region. Once the representative feature has been extracted for each region, the information processing device 100 infers the recognition target corresponding to the representative feature of each region. The information processing device 100 outputs the inferred recognition targets.

In this way, even when the position of a recognition target becomes unclear because of its size, the information processing device 100 can change the position at which the representative feature is extracted according to that size. It is therefore possible to extract representative features that sufficiently reflect the characteristics of the recognition target, which can improve recognition accuracy.

[1-4. Other example 2 of information processing according to the embodiment (using a dictionary for inference)]
The information processing device 100 includes a pattern storage unit that stores patterns of recognition targets corresponding to representative features, and infers the recognition target corresponding to a representative feature based on the patterns stored in the pattern storage unit.

The processing of the information processing device 100 is described in order. The processing up to the extraction of the representative features is the same as that described in paragraph [0031], so its description is omitted. Based on the patterns of recognition targets corresponding to representative features stored in the pattern storage unit 122, the information processing device 100 infers the classification probabilities of the recognition targets corresponding to the representative feature extracted for each region. That is, because the pattern storage unit 122 stores, for each representative feature, the pattern of the corresponding recognition target, the classification probabilities of the corresponding recognition targets can be inferred once a representative feature is given. Once the classification probabilities have been inferred, the information processing device 100 outputs the recognition target with the highest classification probability as the recognition result.

In this way, the information processing device 100 can infer the classification probabilities of the recognition targets corresponding to a representative feature and output the recognition target with the highest classification probability as the recognition result. The recognition accuracy for recognition targets can therefore be improved.

[1-5. Other example 3 of information processing according to the embodiment (performing error correction)]
The information processing device 100 includes a correction unit 135 that performs error correction on the inference results for recognition targets.

The processing of the information processing device 100 is described in order. The processing up to the inference of the classification probabilities of the recognition targets corresponding to the representative features is the same as that described in paragraph [0031], so its description is omitted. Once the classification probabilities have been inferred, the information processing device 100 uses a language model to calculate appearance probabilities for any recognition target whose classification probability is at or below a predetermined value. Here, a language model is a model that, given a word, calculates the probability that the characters composing the word appear. Once the appearance probabilities have been calculated, the information processing device 100 performs error correction by replacing the recognition target whose classification probability is at or below the predetermined value with the recognition target calculated to have the highest appearance probability. The information processing device 100 outputs the recognition targets after error correction as the recognition result.

In this way, when the classification probability of a recognition target inferred from a representative feature is low, the information processing device 100 can use the language model to replace it with a recognition target that has a high appearance probability. The recognition accuracy for characters contained in image data can therefore be improved.
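The error correction described above can be sketched as follows. The source does not specify the language model, so a toy word-probability lookup stands in for it; the function `correct_word`, the lexicon, its probabilities, the alphabet, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
# Toy stand-in for a language model: probability of a whole word.
TOY_LEXICON = {"peace": 0.6, "peach": 0.3, "place": 0.1}

def toy_word_probability(word):
    """Return the toy appearance probability of a word (0.0 if unknown)."""
    return TOY_LEXICON.get(word, 0.0)

def correct_word(chars, confidences, word_probability,
                 alphabet="abcdefghijklmnopqrstuvwxyz", threshold=0.5):
    """Replace low-confidence characters using a language model.

    A character whose classification probability is <= threshold is
    replaced by the candidate that makes the word most probable. The
    original character is tried first, so it is kept when no candidate
    scores strictly higher.
    """
    result = list(chars)
    for i, conf in enumerate(confidences):
        if conf > threshold:
            continue  # confident enough, leave as recognized
        candidates = [result[i]] + [ch for ch in alphabet if ch != result[i]]

        def score(ch):
            return word_probability("".join(result[:i] + [ch] + result[i + 1:]))

        result[i] = max(candidates, key=score)
    return "".join(result)

# The third character was recognized as "o" with low confidence.
print(correct_word("peoce", [0.9, 0.9, 0.3, 0.9, 0.9], toy_word_probability))  # peace
```

Only the positions flagged as uncertain are re-scored, so confident characters pass through unchanged and the language model is consulted only where the classifier was unsure.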

[1-6. Other example 4 of information processing according to the embodiment (specifying the extraction position of the representative feature)]
When the recognition target contained in a region is large, the information processing device 100 extracts the representative feature at the position of the centroid of the region; when the recognition target contained in a region is small, it extracts the representative feature at the position where the probability that the recognition target exists takes a local maximum.

The processing of the information processing device 100 is described in order. The processing up to the extraction of features from the image data is the same as that described in paragraph [0031], so its description is omitted. Once the features have been extracted from the image data, the information processing device 100 binarizes, using a predetermined threshold, the feature indicating the probability that a recognition target exists in regions where the recognition target is large. This connects the areas in which the probability that a recognition target exists is 1. Here, a recognition target is considered large when the length of the short side of the region containing it is equal to or greater than a predetermined threshold. The information processing device 100 determines the centroid, in other words the geometric center, of the connected area obtained by binarization, in which the existence probability of the recognition target is 1, as the position at which the representative feature is extracted in a region where the recognition target is large. For a region where the recognition target is small, the information processing device 100 determines the position where the probability that the recognition target exists takes a local maximum as the position at which the representative feature is extracted. Here, a recognition target is considered small when the length of the short side of the region containing it is equal to or less than a predetermined threshold. Once the representative feature has been extracted for each region, the information processing device 100 infers the classification probabilities of the recognition targets corresponding to the representative feature of each region. Once the classification probabilities have been inferred, the information processing device 100 outputs the recognition target with the highest classification probability as the recognition result.

In this way, even when the position of a recognition target becomes unclear because of its size, the information processing device 100 can extract the representative feature at the position considered most likely to reflect the characteristics of the recognition target. The recognition accuracy for recognition targets contained in image data can therefore be improved.
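The size-dependent choice of extraction position can be sketched as follows. The sketch assumes that each region is already available as a list of (row, col) cells that survived binarization, and that the short side is measured on the region's axis-aligned bounding box. The source defines "large" as short side at or above a threshold and "small" as at or below it; treating the boundary case as large is a choice made here. The names `short_side` and `sampling_position`, the threshold of 3 cells, and the toy data are hypothetical.

```python
def short_side(region):
    """Length of the shorter side of the region's bounding box, in cells."""
    rows = [r for r, _ in region]
    cols = [c for _, c in region]
    return min(max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)

def sampling_position(heatmap, region, size_threshold=3):
    """Choose where to sample the representative feature for one region.

    Large region: centroid (geometric center) of the binarized cells.
    Small region: cell with the highest existence probability.
    """
    if short_side(region) >= size_threshold:
        cy = sum(r for r, _ in region) / len(region)
        cx = sum(c for _, c in region) / len(region)
        return (round(cy), round(cx))
    return max(region, key=lambda p: heatmap[p[0]][p[1]])

# Toy heat map whose strongest activation sits in a corner.
toy_heat = [
    [0.99, 0.6, 0.6],
    [0.6, 0.7, 0.6],
    [0.6, 0.6, 0.6],
]
large_region = [(r, c) for r in range(3) for c in range(3)]
small_region = [(0, 0), (0, 1)]

print(sampling_position(toy_heat, large_region))  # (1, 1)
print(sampling_position(toy_heat, small_region))  # (0, 0)
```

For the large region the corner peak is ignored in favor of the center, matching the idea that activation peaks are less reliable for large characters. Note that for a strongly non-convex region the centroid can fall outside the region, which this sketch does not handle.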

[2. Information system configuration]
Next, the configuration of the information system according to the embodiment is described using FIG. 3. FIG. 3 is a diagram illustrating a configuration example of the information system according to the embodiment. As shown in FIG. 3, the information system 1 includes an information processing device 100 and a user terminal 200. The information system 1 shown in FIG. 3 may include a plurality of information processing devices 100 and a plurality of user terminals 200. The information processing device 100 and the user terminal 200 are communicably connected, by wire or wirelessly, via a predetermined communication network (network N).

The information processing device 100 is used for recognition processing of recognition targets included in image data and for training a learning model used to infer recognition targets included in image data. The information processing device 100 may be, for example, a PC (Personal Computer), a WS (Workstation), or a computer having server functions. As described later, the information processing device 100 may include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a RAM (Random Access Memory), and the like.

The user terminal 200 is an information processing device used by a user. The user terminal 200 may be, for example, a smartphone, a tablet terminal, a desktop PC, a notebook PC, a mobile phone, or a PDA (Personal Digital Assistant). The example shown in FIG. 1 illustrates a case where the user terminal 200 is a smartphone.

[3. Configuration of the information processing device]
Next, the configuration of the information processing device 100 according to the embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram illustrating a configuration example of the information processing device according to the embodiment. As shown in FIG. 4, the information processing device 100 includes a communication unit 110, a storage unit 120, and a control unit 130. The information processing device 100 may also have an input unit (for example, a keyboard or a mouse) that accepts various operations from an administrator of the information processing device 100, and a display unit (for example, a liquid crystal display) for displaying various kinds of information.

(About the communication unit 110)
The communication unit 110 is realized by, for example, a NIC (Network Interface Card). The communication unit 110 is connected to the network N by wire or wirelessly, and transmits and receives information to and from the user terminal 200.

(About the storage unit 120)
The storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM, a ROM (Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), an FRAM (registered trademark) (Ferroelectric Random Access Memory), or a flash memory, or by a storage device such as a hard disk, an SSD (Solid State Drive), or an optical disk. As shown in FIG. 4, the storage unit 120 includes a model information storage unit 121 and a pattern storage unit 122.

(About the model information storage unit 121)
The model information storage unit 121 stores various kinds of information regarding machine learning models. FIG. 5 is a diagram illustrating an example of the information stored in the model information storage unit according to the embodiment of the present disclosure. In the example shown in FIG. 5, the model information storage unit 121 has the items "model ID" and "model data".

"Model ID" indicates identification information for identifying a machine learning model. "Model data" indicates the model data of the machine learning model. For example, "model data" stores the data for a model that, given image data as input, outputs for each position in the image data the probability that a recognition target exists, the probability that a gap between recognition targets exists, and a feature amount reflecting the features of the recognition target.

In the example shown in FIG. 5, the model identified by the model ID "M1" is the machine learning model M1, and the model data "MDT1" is the model data of the machine learning model M1.

Here, when the machine learning model M1 is a neural network, the model data "MDT1" includes, for example, connection information describing how the nodes in each of the layers constituting the neural network are connected to one another, and various information such as the coupling coefficients by which the values passed between connected nodes are multiplied. The connection information includes, for example, the number of nodes in each layer, information specifying the type of node to which each node is connected, and the activation function that implements the nonlinear transformation executed at each node. The activation function may be, for example, a rectified linear unit (ReLU) function, a sigmoid function, a softmax function, an identity function, a step function, or any other function. The coupling coefficients are used in the linear transformation executed at each node. For example, they include the weights applied to an output value when, in the hidden layers of the neural network, the output value is passed from a node in one layer to a node in a deeper layer. The coupling coefficients may also include a bias vector specific to each layer.
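As an illustration of how the connection information and coupling coefficients described above interact, the following minimal Python sketch evaluates one layer: a linear transformation using weights and a bias vector, followed by an activation function. The dictionary layout of `model_data`, the function names, and the numeric values are hypothetical and are not taken from the model data "MDT1".

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, bias, activation):
    """One layer: linear transformation (coupling coefficients and bias)
    followed by the nonlinear activation.
    weights[j][i] is the coefficient from input node i to output node j."""
    pre_activation = [
        sum(w_ji * x_i for w_ji, x_i in zip(row, inputs)) + b_j
        for row, b_j in zip(weights, bias)
    ]
    return activation(pre_activation)

# Hypothetical model data for a single layer with 2 input and 2 output nodes.
model_data = {
    "weights": [[1.0, -1.0], [0.5, 0.5]],  # coupling coefficients
    "bias": [0.0, -1.0],                   # layer-specific bias vector
    "activation": relu,                    # part of the connection information
}

output = dense([2.0, 1.0], model_data["weights"], model_data["bias"],
               model_data["activation"])  # [1.0, 0.5]
```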

In other words, the model data is data for a model trained so that, when image data is input, it outputs for each position in the image data the probability that a recognition target exists, the probability that a gap between recognition targets exists, and a feature amount reflecting the features of the recognition target.

(About the pattern storage unit 122)
The pattern storage unit 122 stores information regarding the recognition target patterns corresponding to representative feature amounts. FIG. 6 is a diagram illustrating an example of the information stored in the pattern storage unit according to the embodiment of the present disclosure. In the example shown in FIG. 6, the pattern storage unit 122 has the items "pattern number" and "recognition pattern".

"Pattern number" indicates the number of the recognition pattern of the recognition target corresponding to a representative feature amount, which reflects the features of the recognition target extracted from the image data. "Recognition pattern" stores, for each recognition target pattern matched against the representative feature amount, information on the features of that pattern. For example, when the recognition target is a character, a value obtained by averaging the features of the character across various typefaces may be stored.

In the example shown in FIG. 6, the recognition pattern "RP1" is stored in association with the pattern number "NO1". In this way, the pattern storage unit 122 stores information on the features of recognition targets that are matched against representative feature amounts. That is, in order to recognize a recognition target from image data, the pattern storage unit 122 stores information on recognition target features to be matched against the representative feature amount, which reflects the features of the recognition target extracted from the image data. For example, when the recognition target is a character, information on the features of the character corresponding to the representative feature amount is stored.

(About the control unit 130)
Next, returning to FIG. 4, the control unit 130 will be described. The control unit 130 is realized by a CPU, an MPU (Micro Processing Unit), a GPU, or the like executing various programs stored in the storage device of the information processing device 100, using the RAM as a work area. The control unit 130 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

As shown in FIG. 4, the control unit 130 includes an acquisition unit 131, an estimation unit 132, an extraction unit 133, an inference unit 134, a correction unit 135, a learning unit 136, and an output unit 137.

(About the acquisition unit 131)
The acquisition unit 131 acquires image data that includes a plurality of recognition targets. For example, a recognition target may be a character included in the image data, or may be an object.

(About the estimation unit 132)
The estimation unit 132 estimates, from the image data, a plurality of regions that contain recognition targets. For example, when the recognition target is a character, the estimation unit 132 estimates the region of the image data containing the character as a bounding box. When the recognition target is an object, the estimation unit 132 estimates the region containing the object as a bounding box.

The estimation unit 132 also estimates, for each position in the image data, the probability that a recognition target exists. The estimation unit 132 may be realized by a U-Net. U-Net is a type of fully convolutional network (FCN) and is a network for estimating where a recognition target is located in an image. An FCN has a structure in which the fully connected layers of a general CNN (Convolutional Neural Network) are replaced with convolutional layers. In U-Net, deconvolution layers are provided after the convolutional layers. A deconvolution layer performs the reverse of the processing of a convolutional layer and thereby outputs a probability map of the same size as the input image data. However, deconvolution by itself cannot output the position information of the recognition target accurately. For this reason, U-Net concatenates the feature maps obtained in the convolutional layers with the feature maps of the deconvolution layers, so that the information in the convolutional feature maps is passed on to the deconvolution feature maps.
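The concatenation of convolutional and deconvolution feature maps described above can be illustrated with a toy Python sketch on nested lists. It makes several simplifying assumptions that are not part of the embodiment: 2x2 max pooling stands in for the learned convolutional downsampling, nearest-neighbor upsampling stands in for the learned deconvolution layer, and all function names are hypothetical.

```python
def max_pool2(channel):
    """2x2 max pooling on one channel (height and width assumed even)."""
    return [
        [max(channel[y][x], channel[y][x + 1],
             channel[y + 1][x], channel[y + 1][x + 1])
         for x in range(0, len(channel[0]), 2)]
        for y in range(0, len(channel), 2)
    ]

def upsample2(channel):
    """Nearest-neighbor 2x upsampling, a stand-in for a learned deconvolution."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def skip_concat(encoder_channels, decoder_channels):
    """Concatenate encoder and decoder feature maps along the channel axis."""
    return encoder_channels + decoder_channels

# One-channel 4x4 encoder feature map with fine positional detail.
encoder = [[[1, 2, 0, 0],
            [3, 4, 0, 0],
            [0, 0, 5, 6],
            [0, 0, 7, 8]]]
pooled = [max_pool2(c) for c in encoder]    # detail is lost: [[[4, 0], [0, 8]]]
decoder = [upsample2(c) for c in pooled]    # same size as input, but blocky
merged = skip_concat(encoder, decoder)      # 2 channels: detail + context
```

After upsampling, every 2x2 block holds a single value, so the exact position of each original value is no longer recoverable from the decoder channel alone. The concatenated encoder channel restores that detail, which is the purpose of the connection described in the text.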

The estimation unit 132 estimates, for each position in the image data, the probability that a recognition target exists. That is, the estimation unit 132 outputs a region heat map in which the probability that a recognition target exists is assigned to each position of the image data. The estimation unit 132 also estimates, for each position in the image data, the probability that a gap between recognition targets exists. That is, the estimation unit 132 outputs a combined heat map in which the probability that a gap between recognition targets exists is assigned to each position of the image data. The estimation unit 132 further estimates, for each position in the image data, a feature amount reflecting the features of the recognition target included in the image data. That is, the estimation unit 132 outputs a feature heat map in which a feature amount reflecting the features of the recognition target is assigned to each position of the image data. These processes of the estimation unit 132 may be realized by using a U-Net trained to output, for each position in the image data, the probability that a recognition target exists, the probability that a gap between recognition targets exists, and a feature amount reflecting the features of the recognition target.

The model used in the estimation unit 132 may be realized by an end-to-end model integrated with the model used in the inference unit 134, which is described later. In this case, the estimation unit 132 executes both the process of extracting feature amounts from the image data and the process of inferring the recognition targets corresponding to the feature amounts.

(About the extraction unit 133)
The extraction unit 133 extracts a representative feature amount based on the feature amounts extracted for each region. That is, for the feature amounts assigned to each position of the image data, the extraction unit 133 determines the position in the image data from which the representative feature amount is to be extracted, and extracts the representative feature amount at the determined position. The representative feature amount means the feature amount at a specific position of the image data in the feature heat map, in which a feature amount reflecting the features of the recognition target is assigned to each position of the image data. Because the estimation unit 132 outputs a region heat map in which the probability that a recognition target exists is assigned to each position of the image data, the position where a recognition target exists can be identified from the region heat map. The extraction unit 133 may determine a position where the probability that a recognition target exists, as shown in the region heat map, is equal to or greater than a predetermined value as the position from which the representative feature amount is extracted, and may extract the feature amount at that position as the representative feature amount.
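The threshold-based selection described in the preceding paragraph can be sketched as a lookup into the feature heat map at every position whose region probability reaches the predetermined value. The function name, the data layout (`feature_map[y][x]` holding one feature vector per position), and the sample values are assumptions made for this illustration.

```python
def extract_representative_features(region_map, feature_map, min_prob):
    """Return [((x, y), feature_vector)] for every position whose
    existence probability in the region heat map is >= min_prob."""
    picked = []
    for y, row in enumerate(region_map):
        for x, prob in enumerate(row):
            if prob >= min_prob:
                picked.append(((x, y), feature_map[y][x]))
    return picked

# Hypothetical 2x2 region heat map and feature heat map (2 values per position).
region_map = [[0.1, 0.9],
              [0.2, 0.3]]
feature_map = [[[0.0, 0.0], [0.5, 1.5]],
               [[0.0, 0.1], [0.2, 0.0]]]
picked = extract_representative_features(region_map, feature_map, 0.8)
# picked == [((1, 0), [0.5, 1.5])]
```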

When the recognition target contained in a region is large, the extraction unit 133 extracts the feature amount at the centroid of the region as the representative feature amount; when the recognition target contained in a region is small, it extracts the feature amount at the position where the probability that the recognition target exists takes a local maximum as the representative feature amount. Once the estimation unit 132 has estimated the region heat map, in which the probability that a recognition target exists is assigned to each position of the image data, the extraction unit 133 binarizes, using a predetermined threshold, the feature amount indicating the probability that a recognition target exists in each region where the recognition target is large, and connects the areas in which the probability that the recognition target exists is 1. Here, a recognition target is regarded as large when the length of the short side of the region containing it is equal to or greater than a predetermined threshold. The extraction unit 133 determines the centroid, in other words the geometric center, of the area of existence probability 1 connected by the binarization as the position from which the representative feature amount is extracted in a region where the recognition target is large. For a region where the recognition target is small, the extraction unit 133 determines the position where the probability that the recognition target exists takes a local maximum as the position from which the representative feature amount is extracted. Here, a recognition target is regarded as small when the length of the short side of the region containing it is equal to or less than a predetermined threshold. Specifically, the extraction unit 133 determines the position from which the representative feature amount is extracted by the calculation shown in equation (1) below.

Here, R in equation (1) denotes the region heat map, and N denotes the eight neighboring points in two-dimensional coordinates.
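Equation (1) itself is not reproduced in this text. One plausible reading of the surrounding description, offered only as an assumption, is that a position is selected when its value in the region heat map R exceeds the values at all eight neighboring points N. The following Python sketch implements that reading; the strict inequality, the `min_prob` floor, and the function name are illustrative choices, not details confirmed by the source.

```python
# The eight neighboring offsets (dx, dy) around a point.
NEIGHBORS = [(-1, -1), (0, -1), (1, -1),
             (-1, 0),           (1, 0),
             (-1, 1),  (0, 1),  (1, 1)]

def local_maxima(R, min_prob=0.0):
    """Return positions (x, y) where R is strictly greater than all of its
    in-bounds 8 neighbors and greater than min_prob (assumed floor)."""
    h, w = len(R), len(R[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            value = R[y][x]
            if value <= min_prob:
                continue
            is_peak = True
            for dx, dy in NEIGHBORS:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h and R[ny][nx] >= value:
                    is_peak = False
                    break
            if is_peak:
                peaks.append((x, y))
    return peaks

# Hypothetical region heat map containing two small characters.
R = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.1, 0.7, 0.1],
    [0.0, 0.0, 0.1, 0.2, 0.0],
]
peaks = local_maxima(R, min_prob=0.5)  # [(1, 1), (3, 2)]
```

Under the strict inequality a flat plateau yields no peak, which is consistent with the text's use of the centroid rather than a local maximum for large recognition targets.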

(About the inference unit 134)
The inference unit 134 infers the recognition target contained in the corresponding region based on the representative feature amount. That is, the inference unit 134 infers the classification probability of the recognition target corresponding to the representative feature amount extracted for each region by referring to the pattern storage unit 122, in which the recognition target patterns corresponding to representative feature amounts are stored. Specifically, the inference unit 134 uses equation (2) below to calculate the classification probability of each recognition target pattern for the representative feature amount extracted for each region. Once the classification probabilities have been calculated, the inference unit 134 associates the recognition target pattern with the highest classification probability with the representative feature amount extracted for each region.

Here, P in equation (2) denotes the positions in the image data from which representative feature amounts are extracted; when representative feature amounts are extracted at N positions, P = [(x₁, y₁), (x₂, y₂), (x₃, y₃), …, (x_N, y_N)]. f in equation (2) denotes the feature heat map of size H×W with F channels. C in equation (2) denotes the class number, that is, the pattern number of the recognition target. f[:, P] in equation (2) denotes the operation of taking the feature vectors at the points P and storing them in a matrix of shape (F, N). The parameters (w, b) of the linear transformation in equation (2) are learnable parameters of the model. They are optimized by backpropagation together with the other network parameters. Here, w is an F×C matrix, and b = [b₁, b₂, …, b_C] is a C-dimensional vector that is expanded by the all-ones vector 1 = [1, 1, …, 1]^T of length N. Finally, a row-wise softmax function is applied along the channel axis to calculate the classification probabilities. As a result, cls is a matrix of shape (N, C). The element (i, j) of cls is the classification probability that the representative feature amount at the i-th point is classified as the j-th recognition target. After the classification probability at each point has been calculated, a threshold may be used to exclude points that could not be classified with confidence.
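Equation (2) is not reproduced in this text, but the description of its terms suggests the form cls = softmax(f[:, P]^T w + 1 b), with the softmax applied to each row. The following standard-library Python sketch follows that reading. It is an illustration under that assumption; the function names, the nested-list layout `f[k][y][x]`, and the sample parameter values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def classify_points(f, points, w, b):
    """f[k][y][x]: feature heat map with F channels; points: N (x, y) pairs;
    w: F x C matrix; b: length-C vector. Returns cls as an N x C list."""
    cls = []
    for x, y in points:
        vec = [channel[y][x] for channel in f]           # one column of f[:, P]
        logits = [sum(vec[k] * w[k][c] for k in range(len(vec))) + b[c]
                  for c in range(len(b))]                # f[:, P]^T w + 1 b
        cls.append(softmax(logits))                      # row-wise softmax
    return cls

def confident_labels(cls, min_prob):
    """Most probable class per point, or None when its probability is
    below min_prob (the optional exclusion step)."""
    labels = []
    for row in cls:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(best if row[best] >= min_prob else None)
    return labels

# Hypothetical example with F = 2 channels, a 2x2 map, and C = 2 classes.
f = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 2.0], [0.0, 0.0]]]
P = [(0, 0), (1, 0), (1, 1)]
w = [[3.0, 0.0], [0.0, 3.0]]
b = [0.0, 0.0]
cls = classify_points(f, P, w, b)
labels = confident_labels(cls, 0.6)   # [0, 1, None]
```

The third point has an all-zero feature vector, so both classes receive probability 0.5 and the point is excluded by the threshold.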

When the recognition target is a character, processing is executed to determine the reading direction of the characters, that is, of the recognition target patterns corresponding to the representative feature amounts extracted at each extraction position. Specifically, the region heat map RM1 and the combined heat map LM1 are added together to extract simply connected regions of characters. Next, each position from which a representative feature amount is extracted is assigned to the simply connected region that contains it. For each simply connected region, the extraction positions, that is, the character positions, are sorted in ascending order of X coordinate, and text data is produced by assigning the code of the character with the highest classification probability in order from the left of the image data. Japanese and English are read from the left, but Arabic is read from the right, so for Arabic the text data is produced by assigning the codes of the characters with the highest classification probability in order from the right. Text data read in a direction that takes the vertical orientation into account may also be generated. The reading direction may be set in advance or may be determined by rules. That is, if the language is known, the direction in which the text is read may be set accordingly. For multilingual support, the reading direction may be determined dynamically for each text box in combination with character recognition methods for multiple languages.
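The grouping and ordering steps described above can be sketched as follows. The sketch assumes that the simply connected regions have already been labeled (for example, from the sum of the region heat map and the combined heat map) into a hypothetical `label_map`, where 0 means background. The function name and the `rtl` flag for right-to-left scripts are illustrative assumptions.

```python
def read_regions(label_map, chars, rtl=False):
    """label_map[y][x]: id of the simply connected region (0 = background).
    chars: list of ((x, y), character) for each extraction position.
    Returns {region_id: text}, ordered by ascending X (descending if rtl)."""
    grouped = {}
    for (x, y), ch in chars:
        region = label_map[y][x]
        if region == 0:
            continue  # position is not inside any connected region
        grouped.setdefault(region, []).append((x, ch))
    return {
        region: "".join(ch for _, ch in
                        sorted(items, key=lambda item: item[0], reverse=rtl))
        for region, items in grouped.items()
    }

# Hypothetical single-row image containing two text regions.
label_map = [[1, 1, 1, 0, 2, 2]]
chars = [((2, 0), "c"), ((0, 0), "a"), ((5, 0), "e"),
         ((1, 0), "b"), ((4, 0), "d")]
left_to_right = read_regions(label_map, chars)            # {1: "abc", 2: "de"}
right_to_left = read_regions(label_map, chars, rtl=True)  # {1: "cba", 2: "ed"}
```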

The model used in the inference unit 134 may be realized by an end-to-end model integrated with the model used in the estimation unit 132 described above. An end-to-end model is trained by connecting the input values and the output values through a single learnable pipeline. In this case, the inference unit 134 executes both the process of extracting feature amounts from the image data and the process of inferring the recognition targets corresponding to the feature amounts.

(About the correction unit 135)
The correction unit 135 performs error correction, using a language model, on the recognition targets inferred by the inference unit 134. Here, a language model is a model that, given a word, calculates the probability that the characters composing it appear. Once the inference unit 134 has inferred the classification probabilities of the recognition targets corresponding to the representative feature amounts, the correction unit 135 uses the language model to calculate appearance probabilities for any recognition target whose classification probability is equal to or less than a predetermined value. The correction unit 135 then performs error correction by replacing the recognition target whose classification probability is equal to or less than the predetermined value with the recognition target calculated to have the highest appearance probability. The correction unit 135 may be realized by an n-gram count-based language model or the like. Given a character string, an n-gram count-based language model calculates the probability that a character x appears next as a probability conditioned on the immediately preceding n-1 characters. The language model may be generated by training it on a word dictionary to estimate the appearance probabilities of the characters that compose words.
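The correction procedure described above can be sketched with a small character n-gram count model. This is a minimal sketch under stated assumptions, not the correction unit's actual implementation: the word list, the `^` start-of-word padding symbol, the 0.5 threshold, and the function names are all hypothetical, and n is assumed to be at least 2.

```python
from collections import Counter, defaultdict

def train_ngram(words, n=2):
    """Count, for each context of n-1 characters, which character follows.
    Words are padded with '^' so the first characters also have a context."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" * (n - 1) + word
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
    return counts

def correct(chars, probs, counts, n=2, min_prob=0.5):
    """Replace each character whose classification probability is <= min_prob
    with the character that most often follows the preceding n-1 characters."""
    out = list(chars)
    for i, prob in enumerate(probs):
        if prob > min_prob:
            continue  # confident prediction, keep it
        context = ("^" * (n - 1) + "".join(out[:i]))[-(n - 1):]
        followers = counts.get(context)
        if followers:
            out[i] = followers.most_common(1)[0][0]
    return "".join(out)

# Hypothetical word dictionary and a recognition result with one weak character.
dictionary = ["contractor", "contract", "actor", "tractor"]
counts = train_ngram(dictionary)
probs = [0.9] * 6 + [0.3] + [0.9] * 3      # 7th character is low confidence
fixed = correct("contragtor", probs, counts)  # "contractor"
```

In this toy dictionary the character following "a" is always "c", so the low-confidence "g" is replaced. A practical model would use a larger dictionary and a longer context.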

(About the learning unit 136)
The learning unit 136 executes learning processing to generate the learning models used by the estimation unit 132 and the inference unit 134. When the estimation unit 132 and the inference unit 134 are integrated into an end-to-end model, the learning unit 136 executes the learning processing on that end-to-end model. The learning processing may be performed using image data that includes recognition targets together with data in which annotation labels are assigned to the recognition targets included in the image data. For example, when the recognition target is a character, the learning processing may be executed using data annotated at the character level. When little data is available, the learning processing may be executed by weakly supervised learning. When the learning processing is completed, the learning unit 136 stores the learning model generated by the learning processing in the model information storage unit 121. The learning processing is performed by adjusting the parameters of the model so as to reduce the loss function L shown in equation (3) below. Stochastic gradient descent may be used to optimize the loss function L.

Here, L_det and L_rec in equation (3) denote a sum-of-squares error and a cross-entropy error, respectively. α in equation (3) denotes a hyperparameter that balances the two loss functions. Equations (4) and (5) below give the formulas for L_det and L_rec, respectively.

Here, H and W in equation (4) denote the size of the feature map, written (H, W). R, A, R_gt, and A_gt in equation (4) denote the estimated region heat map, the estimated combined heat map, and their respective ground truths. (x_gt, y_gt, c_gt) ∈ P_gt in equation (5) denotes the character-level ground truth, namely the position in the image data and the class number of the recognition target pattern. CE in equation (5) denotes the cross entropy used as the classification loss. The purpose of the learning processing is to reduce the loss function shown in equation (3).
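Equations (3) to (5) are not reproduced in this text. Based on the definitions above, one plausible form, given here only as an assumption, is L = L_det + α·L_rec, with L_det the squared error between (R, A) and (R_gt, A_gt) averaged over the H×W positions, and L_rec the cross entropy averaged over the ground-truth points P_gt. The placement of α, both normalizations, and all function names in the following Python sketch are assumptions.

```python
import math

def detection_loss(R, A, R_gt, A_gt):
    """Assumed form of L_det: squared error of both heat maps over H x W."""
    h, w = len(R), len(R[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            total += (R[y][x] - R_gt[y][x]) ** 2 + (A[y][x] - A_gt[y][x]) ** 2
    return total / (h * w)

def recognition_loss(cls_at, P_gt):
    """Assumed form of L_rec: mean cross entropy over ground-truth points.
    cls_at maps (x, y) to predicted class probabilities;
    P_gt is a list of (x_gt, y_gt, c_gt)."""
    return sum(-math.log(cls_at[(x, y)][c]) for x, y, c in P_gt) / len(P_gt)

def total_loss(l_det, l_rec, alpha):
    """Assumed form of L: detection loss plus alpha-weighted recognition loss."""
    return l_det + alpha * l_rec

# Hypothetical 1x2 heat maps and a single annotated character of class 1.
R, R_gt = [[1.0, 0.0]], [[1.0, 1.0]]
A, A_gt = [[0.0, 0.0]], [[0.0, 0.0]]
l_det = detection_loss(R, A, R_gt, A_gt)                      # 0.5
l_rec = recognition_loss({(0, 0): [0.5, 0.5]}, [(0, 0, 1)])   # ln 2
loss = total_loss(l_det, l_rec, alpha=2.0)
```

A larger α places more weight on character classification relative to heat map regression. The text does not state a value for α.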

(About the output unit 137)
The output unit 137 outputs the recognition target inferred by the inference unit 134. For example, suppose that the recognition target is a character, and that the inference unit 134 infers, from the representative feature amount extracted for each region of the image data, the classification probability of the recognition target corresponding to that representative feature amount, selects the pattern with the maximum classification probability, and thereby infers that the image data contains the characters "peace". In this case, the output unit 137 outputs the characters "peace" inferred by the inference unit 134.
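The selection of the pattern with the maximum classification probability can be sketched as follows. The pattern list, the probability values, and the function name `select_patterns` are hypothetical and serve only to illustrate the per-region argmax.

```python
def select_patterns(class_probabilities, patterns):
    """For each region, pick the pattern whose classification probability is highest."""
    selected = []
    for probs in class_probabilities:
        best = max(range(len(probs)), key=lambda k: probs[k])
        selected.append(patterns[best])
    return selected


# Hypothetical patterns and per-region classification probabilities.
patterns = ["a", "c", "e", "p"]
class_probabilities = [
    [0.05, 0.05, 0.10, 0.80],  # region 1: "p" is most probable
    [0.10, 0.10, 0.70, 0.10],  # region 2: "e"
    [0.85, 0.05, 0.05, 0.05],  # region 3: "a"
    [0.10, 0.75, 0.10, 0.05],  # region 4: "c"
    [0.05, 0.15, 0.75, 0.05],  # region 5: "e"
]
print("".join(select_patterns(class_probabilities, patterns)))  # prints "peace"
```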

The output unit 137 also outputs the recognition result obtained when the correction unit 135 corrects the recognition result inferred by the inference unit 134. For example, suppose that the recognition target is a character, and that the inference unit 134 infers, from the representative feature amount extracted for each region of the image data, the classification probability of the corresponding recognition target, selects the pattern with the maximum classification probability, and thereby infers that the image data contains the characters "contragtor". Suppose, however, that the image data actually contains the word "contractor", so that the seventh character of the recognized word is a misrecognition, and that the classification probability with which it was classified as "g" was 30%. Suppose further that, when the correction unit 135 uses a language model to calculate the appearance probability of the seventh character of the inferred word, "c" has the highest appearance probability. In this case, the correction unit 135 executes correction processing that replaces the seventh character "g" of the inferred result "contragtor" with "c". The output unit 137 then outputs "contractor", which is the recognition result after correction by the correction unit 135.
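The correction step in this example can be sketched as follows. The sketch assumes that a character is sent to the language model only when its classification probability is at or below a threshold; the threshold value of 0.5, the callable interface `lm_char_probs(prefix, suffix)`, and the tiny vocabulary that stands in for a real language model are all hypothetical.

```python
def correct_word(chars, probs, lm_char_probs, threshold=0.5):
    """Replace low-confidence characters with the language model's top candidate.

    chars: recognized characters; probs: their classification probabilities.
    lm_char_probs(prefix, suffix) returns a dict mapping candidate characters
    to appearance probabilities (a hypothetical language-model interface).
    """
    corrected = list(chars)
    for i, p in enumerate(probs):
        if p <= threshold:
            candidates = lm_char_probs("".join(corrected[:i]), "".join(chars[i + 1:]))
            if candidates:
                corrected[i] = max(candidates, key=candidates.get)
    return "".join(corrected)


def toy_language_model(prefix, suffix):
    """Stand-in for a real language model: a lookup over a tiny vocabulary."""
    vocabulary = ["contractor", "contrast", "peace"]
    counts = {}
    i = len(prefix)
    for word in vocabulary:
        same_length = len(word) == len(prefix) + 1 + len(suffix)
        if same_length and word.startswith(prefix) and word.endswith(suffix):
            counts[word[i]] = counts.get(word[i], 0) + 1
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


word = "contragtor"
probabilities = [0.9] * 6 + [0.3] + [0.9] * 3  # the seventh character is uncertain
print(correct_word(word, probabilities, toy_language_model))  # prints "contractor"
```

Characters whose classification probability exceeds the threshold are left untouched, so a confident recognition result is never overwritten by the language model.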

[4. Configuration of user terminal]
Next, the configuration of the user terminal 200 according to the embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram illustrating a configuration example of the user terminal according to the embodiment. As shown in FIG. 7, the user terminal 200 includes a communication unit 210, a storage unit 220, and a control unit 230. The user terminal 200 may also include an input unit (for example, a keyboard or a mouse) that accepts various operations from the user, and a display unit (for example, a liquid crystal display) that displays various kinds of information.

The communication unit 210 is realized by, for example, a NIC or the like. The communication unit 210 is connected to the network N by wire or wirelessly, and transmits and receives various kinds of information to and from the information processing device 100 via the network N.

The storage unit 220 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk, an SSD, or an optical disk. As shown in FIG. 7, the storage unit 220 includes a model information storage unit 221 and a pattern storage unit 222.

The model information storage unit 221 stores various kinds of information regarding the machine learning model. The information stored in the model information storage unit 221 of the user terminal 200 is the same as the information stored in the model information storage unit 121 of the information processing device 100, so its description is omitted.

The pattern storage unit 222 stores information regarding the recognition target patterns that are matched against the representative feature amount. The information stored in the pattern storage unit 222 of the user terminal 200 is the same as the information stored in the pattern storage unit 122 of the information processing device 100, so its description is omitted.

The control unit 230 is realized by, for example, a CPU, an MPU, or the like executing various programs stored in the user terminal 200, using the RAM as a work area. The control unit 230 may also be realized by an integrated circuit such as an ASIC or an FPGA.

As shown in FIG. 7, the control unit 230 includes an acquisition unit 231, a reception unit 232, an inference unit 233, and an output unit 234.

(About the acquisition unit 231)
The acquisition unit 231 acquires, from the information processing device 100, the model on which the learning processing has been executed. For example, the acquisition unit 231 may acquire the trained model from the information processing device 100 each time the model is updated (trained) by the learning processing according to the embodiment. The acquisition unit 231 may acquire the trained model from the information processing device 100 via the network N, for example.

After acquiring the trained model from the information processing device 100, the acquisition unit 231 stores it in the model information storage unit 221 as "model data".

(About the reception unit 232)
The reception unit 232 accepts, from the user, a request for recognition processing on image data that includes a plurality of recognition targets. For example, if the user, in order to recognize characters in image data GD1, specifies "character" as the recognition target and image data GD1 as the image data and inputs them to the user terminal 200, the reception unit 232 accepts the recognition target "character" and the image data "GD1" specified by the user.

(About the inference unit 233)
The inference unit 233 infers, from image data that includes a plurality of recognition targets, the recognition targets included in that image data. For example, the inference unit 233 infers the recognition targets by inputting the image data to the trained model acquired by the acquisition unit 231.

For example, when the recognition target is a character, the inference unit 233 inputs image data that includes a plurality of characters to the trained model and infers the characters included in the image data.

(About the output unit 234)
The output unit 234 outputs the recognition target inferred by the inference unit 233. For example, suppose that the recognition target is a character and that the inference unit 233 infers that the image data contains the characters "peace". In this case, the output unit 234 outputs the characters "peace" inferred by the inference unit 233.

[5. Example]
Next, an example in which the present embodiment was implemented will be described. The implementation was based on PyTorch 1.2.0 and used a virtual machine equipped with an Intel (registered trademark) Xeon (registered trademark) Silver 4114 CPU, 16 GB of RAM, and an NVIDIA (registered trademark) Tesla V100 GPU (16 GB of VRAM). The learning processing and the test processing of the present embodiment were executed on this virtual machine.

The learning processing was executed using the ICDAR2013 dataset and the ICDAR2015 dataset as training data. The ICDAR2013 dataset focuses on scene text and consists of 229 pieces of training data and 233 test images. The ICDAR2015 dataset focuses on incidental scene text and consists of 1,000 pieces of training data and 500 pieces of test data. Because the ICDAR2013 and ICDAR2015 datasets contain relatively little training data, the SynText dataset and the ICDAR2019 MLT dataset were used to supplement them. The SynText dataset consists of 800K images annotated at the character level. The ICDAR2019 MLT dataset is a multilingual dataset containing 10,000 pieces of training data.

In the test processing, to evaluate detection accuracy when the ICDAR2013 and ICDAR2015 datasets were used as test data, precision, recall, and their H-mean (harmonic mean) were calculated using DetEval, which is evaluation software for object detection algorithms. To evaluate character recognition accuracy, the Strong (100 words per image), Weak (1,000 words per dataset), and Generic (90,000 general words) dictionaries were used.
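The relationship between the three detection metrics can be illustrated with a short sketch. It shows only the final arithmetic from hypothetical match counts; how DetEval itself matches detections to ground truth is not reproduced here, and the function name and the counts are assumptions.

```python
def detection_scores(true_positives, num_detections, num_ground_truth):
    """Return (precision, recall, H-mean) from simple match counts."""
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    h_mean = 2 * precision * recall / (precision + recall)
    return precision, recall, h_mean


# Hypothetical counts: 80 correct detections out of 100 outputs and 160 ground-truth words.
precision, recall, h_mean = detection_scores(80, 100, 160)
print(f"precision={precision:.3f} recall={recall:.3f} h_mean={h_mean:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot score well by maximizing precision or recall alone.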

FIG. 8 shows the detection accuracy and recognition accuracy obtained with the ICDAR2015 dataset. FIG. 8 is a diagram showing the detection accuracy and recognition accuracy of the information processing according to the embodiment. Comparative Examples A, B, and C in FIG. 8 are existing methods; a detailed description of them is omitted. "Detection" in FIG. 8 indicates the detection accuracy of the character detection processing. "Word spotting" in FIG. 8 indicates the recognition accuracy when separate models are used for the character detection processing and the character recognition processing. "End-to-end" in FIG. 8 indicates the recognition accuracy of an end-to-end model that integrates the character detection processing and the character recognition processing. As FIG. 8 shows, the detection accuracy and recognition accuracy of the example are comparable to those of Comparative Examples A, B, and C. FIG. 8 also shows that the example runs at 12 FPS (frames per second) with a model of 310 million parameters, which makes it the fastest and the smallest of the methods shown in FIG. 8. Therefore, according to the present embodiment, it is possible to provide the information processing device 100 that can improve the processing speed without reducing the recognition accuracy of the recognition target.

Next, the processing speed and recognition accuracy when recognition processing is executed on the ICDAR2015 incidental scene text benchmark dataset using the Strong dictionary will be described. FIG. 9 is a diagram showing the processing speed and recognition accuracy of the information processing according to the embodiment. As FIG. 9 shows, the example is faster than Comparative Examples 1 to 5, and its recognition accuracy is comparable to theirs. Therefore, according to the present embodiment, it is possible to provide the information processing device 100 that can improve the processing speed without reducing the recognition accuracy of the recognition target.

Next, the processing speed when the trained model is installed in a user terminal 200 such as a smartphone will be described. First, the trained PyTorch model was incorporated into the Core ML (registered trademark) framework so that it could be executed on an iPhone (registered trademark) 11 Pro. Core ML (registered trademark) is a framework for executing model inference in iPhone (registered trademark) applications. Images with image sizes of 640, 1280, 1920, and 2560 were input to the model incorporated into the Core ML (registered trademark) framework, and recognition processing was executed using the CPU and the Neural Engine. The Neural Engine is a part of the SoC (System on a Chip) installed in the iPhone (registered trademark) 11 Pro that is specialized for machine learning processing.

FIG. 10 shows the latency of recognition processing on the user terminal 200. FIG. 10 is a diagram showing the recognition time when the information processing according to the example is executed on the user terminal. For reference, a latency of 100 msec is the limit at which the user feels that the system responds instantaneously, and a latency of 1000 msec is the limit at which the user, even if noticing a delay, still feels that processing continues without interruption. As FIG. 10 shows, the present embodiment satisfies the former latency when an accelerator such as the Neural Engine is available, and satisfies the latter latency on the CPU. Therefore, according to the present embodiment, it is possible to provide the user terminal 200 that can improve the processing speed without reducing the recognition accuracy of the recognition target.

[6. Information processing flow]
Next, the procedure of information processing by the information processing device according to the embodiment will be described with reference to FIG. 9. FIG. 9 is a flowchart illustrating an example of the information processing according to the embodiment. The information processing device 100 acquires data that includes a plurality of recognition targets (step S101). The information processing device 100 then estimates, from the data, a plurality of regions that contain recognition targets (step S102). The information processing device 100 extracts a representative feature amount based on the feature amounts extracted for each region (step S103). The information processing device 100 infers the recognition target included in the corresponding region based on the representative feature amount (step S104). The information processing device 100 outputs the inferred recognition target (step S105).
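The order of steps S101 to S105 can be sketched as a small pipeline. The three callables passed to `recognize` are hypothetical stand-ins for the estimation, extraction, and inference processing, and the toy data below is not image data; the sketch only shows how the steps hand results to one another.

```python
def recognize(data, estimate_regions, extract_representative, infer_target):
    """Run steps S102 to S104 on already-acquired data and return the targets."""
    regions = estimate_regions(data)                      # step S102
    targets = []
    for region in regions:
        feature = extract_representative(data, region)    # step S103
        targets.append(infer_target(feature))             # step S104
    return targets                                        # passed to the output in step S105


# Toy stand-ins (hypothetical): each non-empty cell plays the role of one region.
def toy_estimate_regions(data):
    return [i for i, cell in enumerate(data) if cell is not None]


def toy_extract_representative(data, region):
    return data[region]


def toy_infer_target(feature):
    return feature.lower()


acquired = [None, "P", "E", None, "A", "C", "E"]          # step S101
result = recognize(acquired, toy_estimate_regions,
                   toy_extract_representative, toy_infer_target)
print("".join(result))                                    # step S105, prints "peace"
```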

[7. Hardware configuration]
The information processing device 100 according to the embodiment described above is realized by, for example, a computer 1000 configured as shown in FIG. 10. FIG. 10 is a hardware configuration diagram showing an example of a computer that implements the functions of the information processing device. The computer 1000 is connected to an output device 1010 and an input device 1020, and has a configuration in which an arithmetic device 1030, a primary storage device 1040, a secondary storage device 1050, an output IF (Interface) 1060, an input IF 1070, and a network IF 1080 are connected by a bus 1090.

The arithmetic device 1030 operates based on programs stored in the primary storage device 1040 or the secondary storage device 1050, programs read from the input device 1020, and the like, and executes various kinds of processing. The primary storage device 1040 is a memory device, such as a RAM, that temporarily stores data used by the arithmetic device 1030 for various calculations. The secondary storage device 1050 is a storage device that stores data used by the arithmetic device 1030 for various calculations as well as various databases, and is realized by a ROM (Read Only Memory), an HDD (Hard Disk Drive), a flash memory, or the like.

The output IF 1060 is an interface for transmitting information to be output to the output device 1010, such as a monitor or a printer, that outputs various kinds of information. It is realized by, for example, a connector conforming to a standard such as USB (Universal Serial Bus), DVI (Digital Visual Interface), or HDMI (registered trademark) (High Definition Multimedia Interface). The input IF 1070 is an interface for receiving information from various input devices 1020 such as a mouse, a keyboard, and a scanner, and is realized by, for example, USB.

The input device 1020 may be a device that reads information from, for example, an optical recording medium such as a CD (Compact Disc), a DVD (Digital Versatile Disc), or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory. The input device 1020 may also be an external storage medium such as a USB memory.

The network IF 1080 receives data from other devices via the network N and sends it to the arithmetic device 1030, and also transmits data generated by the arithmetic device 1030 to other devices via the network N.

The arithmetic device 1030 controls the output device 1010 and the input device 1020 via the output IF 1060 and the input IF 1070. For example, the arithmetic device 1030 loads a program from the input device 1020 or the secondary storage device 1050 onto the primary storage device 1040 and executes the loaded program.

For example, when the computer 1000 functions as the information processing device 100, the arithmetic device 1030 of the computer 1000 realizes the functions of the control unit 130 of the information processing device 100 by executing the program loaded onto the primary storage device 1040.

[8. Configuration and effects]
The information processing device 100 according to the present disclosure includes: an acquisition unit 131 that acquires image data including a plurality of recognition targets; an estimation unit 132 that estimates, from the image data, a plurality of regions containing recognition targets; an extraction unit 133 that extracts a representative feature amount based on the feature amounts extracted for each region; an inference unit 134 that infers, based on the representative feature amount, the recognition target included in the corresponding region; and an output unit 137 that outputs the recognition target.

According to this configuration, it is possible to provide the information processing device 100 that can improve the processing speed without reducing the recognition accuracy of the recognition target.

Furthermore, the estimation unit 132 of the information processing device 100 according to the present disclosure estimates, for each position in the image data, the probability that a recognition target exists at that position.

According to this configuration, the probability that a recognition target exists can be output for each position in the image data. When determining the position from which to extract a feature amount that reflects the characteristics of the recognition target, the device can therefore choose a position where the probability that a recognition target exists is high. As a result, a feature amount that sufficiently reflects the characteristics of the recognition target can be extracted.

Furthermore, the extraction unit 133 of the information processing device 100 according to the present disclosure changes the position within the region from which the representative feature amount is extracted, depending on the size of the recognition target included in the region.

According to this configuration, the position within the region from which the representative feature amount is extracted can be changed depending on the size of the recognition target. Even when the size of the recognition target differs from region to region, a feature amount that sufficiently reflects the characteristics of the recognition target can therefore be extracted.

Furthermore, the information processing device 100 according to the present disclosure includes a pattern storage unit 122 that stores patterns of recognition targets corresponding to representative feature amounts, and the inference unit 134 infers the recognition target corresponding to a representative feature amount based on the patterns stored in the pattern storage unit 122.

According to this configuration, the information processing device 100 can accurately infer, from the image data, the recognition target corresponding to the representative feature amount.

Furthermore, in the information processing device 100 according to the present disclosure, the recognition target is a character, and the device includes a correction unit 135 that uses a language model to perform error correction on the result inferred by the inference unit 134.

According to this configuration, when the classification probability of a recognition target is less than or equal to a predetermined value, the information processing device 100 can use the language model to calculate character appearance probabilities and replace the recognition target with one that has a higher appearance probability. The recognition accuracy of the recognition target can therefore be improved.

Furthermore, when the recognition target included in a region is large, the extraction unit 133 of the information processing device 100 according to the present disclosure extracts the feature amount at the position of the center of gravity of the region as the representative feature amount. When the recognition target included in the region is small, it extracts the feature amount at the position where the probability that a recognition target exists takes its local maximum as the representative feature amount.

According to this configuration, the information processing device 100 changes the position from which the representative feature amount is extracted depending on the size of the recognition target within the region of the image data, so a feature amount that sufficiently reflects the characteristics of the recognition target can be extracted.
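The size-dependent choice of extraction position can be sketched as follows. The sketch assumes that a region is given as a list of (x, y) positions, that its size is measured by the number of positions, and that regions at or above a threshold count as large. The threshold value, the data layout, and the function name are hypothetical.

```python
def representative_position(region, prob_map, size_threshold=16):
    """Choose the position from which the representative feature amount is taken.

    region: list of (x, y) positions belonging to one estimated region.
    prob_map[y][x]: estimated probability that a recognition target exists there.
    """
    if len(region) >= size_threshold:
        # Large target: use the center of gravity of the region.
        cx = sum(x for x, _ in region) / len(region)
        cy = sum(y for _, y in region) / len(region)
        return round(cx), round(cy)
    # Small target: use the position where the existence probability peaks.
    return max(region, key=lambda p: prob_map[p[1]][p[0]])


# Hypothetical 3 x 3 probability map and two regions.
prob_map = [
    [0.2, 0.9, 0.1],
    [0.3, 0.5, 0.4],
    [0.1, 0.2, 0.1],
]
small_region = [(0, 0), (1, 0)]
large_region = [(x, y) for y in range(3) for x in range(3)]
print(representative_position(small_region, prob_map))                    # prints (1, 0)
print(representative_position(large_region, prob_map, size_threshold=4))  # prints (1, 1)
```

For a region that is not convex, the center of gravity may fall outside the region, so a practical implementation might snap it to the nearest position that belongs to the region.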

The user terminal 200 according to the present disclosure uses the model trained by the information processing device 100 described above to output recognition targets from image data that includes a plurality of recognition targets.

According to this configuration, it is possible to provide the user terminal 200 that can improve the processing speed without reducing the recognition accuracy of the recognition target.

The information processing method according to the present disclosure includes the steps of: acquiring image data including a plurality of recognition targets; estimating, from the image data, a plurality of regions containing recognition targets; extracting a representative feature amount based on the feature amounts extracted for each region; inferring, based on the representative feature amount, the recognition target included in the corresponding region; and outputting the recognition target.

According to this configuration, it is possible to provide an information processing method that can improve the processing speed without reducing the recognition accuracy of the recognition target.

The information processing program according to the present disclosure causes a computer to execute the steps of: acquiring image data including a plurality of recognition targets; estimating, from the image data, a plurality of regions containing recognition targets; extracting a representative feature amount based on the feature amounts extracted for each region; inferring, based on the representative feature amount, the recognition target included in the corresponding region; and outputting the recognition target.

According to this configuration, it is possible to provide an information processing program that can improve the processing speed without reducing the recognition accuracy of the recognition target.

The embodiments of the present application have been described above in detail with reference to the drawings, but they are merely examples. The present invention can be implemented in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention section.

The "unit" (section, module, unit) described above can be read as "means", "circuit", or the like. For example, the acquisition unit 131 can be read as an acquisition means or an acquisition circuit.

100 Information processing device
110 Communication unit
120 Storage unit
121 Model information storage unit
122 Pattern storage unit
130 Control unit
131 Acquisition unit
132 Estimation unit
133 Extraction unit
134 Inference unit
136 Learning unit
137 Output unit
200 User terminal
210 Communication unit
220 Storage unit
221 Model information storage unit
222 Pattern storage unit
230 Control unit
231 Acquisition unit
232 Reception unit
233 Inference unit
234 Output unit
N Network

Claims

An information processing device comprising:
an acquisition unit that acquires image data including a plurality of recognition targets;
an estimation unit that estimates, from the data, a plurality of regions containing the recognition targets, and that outputs, for each position in the image data, a probability that a recognition target exists and a feature amount reflecting the characteristics of the recognition target;
an extraction unit that determines a position where the probability that the recognition target exists is greater than or equal to a predetermined value as a position from which a representative feature amount is to be extracted, and extracts the feature amount at that position as the representative feature amount;
an inference unit that, in inferring the recognition target included in the corresponding region based on the representative feature amount, refers to a pattern storage unit storing patterns of recognition targets corresponding to representative feature amounts, infers a classification probability of the recognition target corresponding to the representative feature amount extracted for each region, and infers the recognition target included in the region based on the classification probability; and
an output unit that outputs the recognition target.

The information processing device according to claim 1, wherein
the extraction unit changes the position within the region from which the representative feature amount is extracted depending on the size of the recognition target included in the region.

The information processing device according to claim 1 or claim 2, further comprising
a pattern storage unit that stores patterns of recognition targets corresponding to representative feature amounts, wherein
the inference unit infers the recognition target corresponding to the representative feature amount based on the patterns stored in the pattern storage unit.

The information processing device according to any one of claims 1 to 3, wherein
the recognition target is a character, and
the information processing device further comprises a correction unit that uses a language model to perform error correction on the result of the inference of the recognition target by the inference unit.

The information processing device according to any one of claims 2 to 4, wherein
when the size of the recognition target included in the region is large, the extraction unit extracts the feature amount at the position of the center of gravity of the region as the representative feature amount, and when the size of the recognition target included in the region is small, the extraction unit extracts the feature amount at the position where the probability that the recognition target exists is at its maximum as the representative feature amount.

A user terminal that outputs recognition targets from data including a plurality of recognition targets, using a model trained by the information processing device according to any one of claims 1 to 5.

An information processing method comprising:
acquiring image data including a plurality of recognition targets;
estimating, from the data, a plurality of regions containing the recognition targets, and outputting, for each position in the image data, a probability that a recognition target exists and a feature amount reflecting the characteristics of the recognition target;
determining a position where the probability that the recognition target exists is greater than or equal to a predetermined value as a position from which a representative feature amount is to be extracted, and extracting the feature amount at that position as the representative feature amount;
in inferring the recognition target included in the corresponding region based on the representative feature amount, referring to a pattern storage unit storing patterns of recognition targets corresponding to representative feature amounts, inferring a classification probability of the recognition target corresponding to the representative feature amount extracted for each region, and inferring the recognition target included in the region based on the classification probability; and
outputting the recognition target.

An information processing program that causes a computer to execute:
acquiring image data including a plurality of recognition targets;
estimating, from the data, a plurality of regions containing the recognition targets, and outputting, for each position in the image data, a probability that a recognition target exists and a feature amount reflecting the characteristics of the recognition target;
determining a position where the probability that the recognition target exists is greater than or equal to a predetermined value as a position from which a representative feature amount is to be extracted, and extracting the feature amount at that position as the representative feature amount;
in inferring the recognition target included in the corresponding region based on the representative feature amount, referring to a pattern storage unit storing patterns of recognition targets corresponding to representative feature amounts, inferring a classification probability of the recognition target corresponding to the representative feature amount extracted for each region, and inferring the recognition target included in the region based on the classification probability; and
outputting the recognition target.