JP6815712B1

JP6815712B1 - Image processing system, image processing method, image processing program, image processing server, and learning model

Info

Publication number: JP6815712B1
Application number: JP2020128966A
Authority: JP
Inventors: 北岸　郁雄; エドワードウィリアムダニエルウィッタッカー; 田中　雅士
Original assignee: Money Forward Inc
Current assignee: Money Forward Inc
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2021-01-20
Anticipated expiration: 2040-07-30
Also published as: JP2022025843A; WO2022024835A1; JP2022027394A

Abstract

[Problem] To provide an image processing system, an image processing method, an image processing program, an image processing server, and a learning model that can extract an image of a predetermined object from a moving image without detecting the edges of the captured object, and that can appropriately extract the image of the object regardless of differences in the background on which the object is placed. [Solution] An image processing system 1 includes a coordinate acquisition unit 16 that acquires the coordinates of predetermined locations of an object based on a moving-image constituent image of a moving image capturing the object, and an image area extraction unit 18 that extracts, based on the coordinates, an image area containing the object from the moving-image constituent image. The coordinate acquisition unit 16 acquires the coordinates of the predetermined locations of the object based on the image area, and the image area extraction unit 18 projects the acquired coordinates onto the moving-image constituent image to extract the object image area of the object. [Selected drawing] FIG. 1

Description

The present invention relates to an image processing system, an image processing method, an image processing program, an image processing server, and a learning model. In particular, the present invention relates to an image processing system, an image processing method, an image processing program, an image processing server, and a learning model capable of appropriately extracting a predetermined object from a moving image.

Conventionally, a line segment extraction device that extracts line segments from an image is known, the device comprising: an edge detection unit that detects edges from the image; a first intersection identification unit that finds the intersections between the edges and a plurality of first parallel lines extending in a first direction at predetermined intervals within the image; a first intersection connection unit that, for each pair of two adjacent first parallel lines, connects the intersections on one first parallel line to those on the other with straight connecting lines; and a first line segment identification unit that extracts, as a line segment, a set of connecting lines that are joined at intersections and whose extending directions differ in angle by no more than a predetermined range (see, for example, Patent Document 1). According to the line segment extraction device described in Patent Document 1, line segments contained in an image can be extracted at high speed.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-181244

However, the line segment extraction device described in Patent Document 1 presupposes that the edges of an object are detected, and when the object is rectangular, at least three sides of the object must be extracted. In addition, the line segment extraction device described in Patent Document 1 extracts every rectangular area, regardless of the type of object. Furthermore, with the line segment extraction device described in Patent Document 1, the edges of an object can be difficult to recognize depending on the combination of the object and the background on which it is placed (for example, when the color of the object and the background color are substantially the same), and in that case it can be difficult to recognize the presence of the object.

Accordingly, an object of the present invention is to provide an image processing system, an image processing method, an image processing program, an image processing server, and a learning model that can extract an image of a predetermined object from a moving image without detecting the edges of the captured object, and that can appropriately extract the image of the object regardless of differences in the background on which the object is placed.

To achieve the above object, the present invention provides an image processing system comprising: a coordinate acquisition unit that acquires the coordinates of predetermined locations of an object based on a moving-image constituent image of a moving image capturing the object; and an image area extraction unit that extracts, based on the coordinates, an image area containing the object from the moving-image constituent image, wherein the coordinate acquisition unit acquires the coordinates of the predetermined locations of the object based on the image area, and the image area extraction unit projects the acquired coordinates onto the moving-image constituent image to extract an object image area of the object.

To achieve the above object, the present invention also provides an image processing method for an image processing system, the method comprising: a coordinate acquisition step of acquiring the coordinates of predetermined locations of an object based on a moving-image constituent image of a moving image capturing the object; an image area extraction step of extracting, based on the coordinates, an image area containing the object from the moving-image constituent image; a step of acquiring the coordinates of the predetermined locations of the object based on the image area; and a step of projecting the acquired coordinates onto the moving-image constituent image to extract an object image area of the object.

To achieve the above object, the present invention also provides an image processing program for an image processing system, the program causing a computer to implement: a coordinate acquisition function of acquiring the coordinates of predetermined locations of an object based on a moving-image constituent image of a moving image capturing the object; an image area extraction function of extracting, based on the coordinates, an image area containing the object from the moving-image constituent image; a function of acquiring the coordinates of the predetermined locations of the object based on the image area; and a function of projecting the acquired coordinates onto the moving-image constituent image to extract an object image area of the object.

To achieve the above object, the present invention also provides an image processing server comprising: a coordinate acquisition unit that acquires the coordinates of predetermined locations of an object based on a moving-image constituent image of a moving image capturing the object; and an image area extraction unit that extracts, based on the coordinates, an image area containing the object from the moving-image constituent image, wherein the coordinate acquisition unit acquires the coordinates of the predetermined locations of the object based on the image area, and the image area extraction unit projects the acquired coordinates onto the moving-image constituent image to extract an object image area of the object.

Furthermore, to achieve the above object, the present invention provides a learning model that causes a processor to function so that, when a captured image is input, it outputs one or more rectangular areas, each centered on one of one or more corners of a predetermined object, in order to identify whether an object contained in the captured image is the predetermined object. The learning model is trained using, as training data, images containing the predetermined object, background images on which the predetermined object may be placed, and combinations of images containing the predetermined object with the background images. In the training, one or more rectangular areas centered on the corners of the predetermined object are formed, each sized so that the side perpendicular to the shortest straight line from the center to the outer edge of the image containing the predetermined object touches that outer edge, and the formed rectangular areas and the coordinates of their centers are used to identify the predetermined object in the image.
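The sizing rule for the corner-centered rectangular areas can be illustrated with a minimal sketch. It assumes the areas are squares and that "the outer edge of the image" means the border of the training image. The names `corner_box` and `corner_boxes` are hypothetical and do not appear in the specification.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) in image pixels
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def corner_box(corner: Point, image_w: float, image_h: float) -> Box:
    """Return a square centered on `corner` whose nearest side touches the image border.

    Assumption: the half side length equals the shortest distance from the
    corner to any of the four image borders.
    """
    x, y = corner
    half = min(x, y, image_w - x, image_h - y)
    return (x - half, y - half, x + half, y + half)


def corner_boxes(corners: List[Point], image_w: float, image_h: float) -> List[Box]:
    """Build one label box per object corner (for example, four per receipt)."""
    return [corner_box(c, image_w, image_h) for c in corners]
```

Under this reading, a corner close to the image border yields a small box and a corner near the image center yields a large one, and the center of each box is the corner coordinate itself.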

According to the present invention, it is possible to provide an image processing system, an image processing method, an image processing program, an image processing server, and a learning model that can extract an image of a predetermined object from a moving image without detecting the edges of the captured object, and that can appropriately extract the image of the object regardless of differences in the background on which the object is placed.

A schematic diagram of the image processing system according to the present embodiment.
A functional block diagram of the image processing system according to the present embodiment.
A schematic diagram of the labeling method for a predetermined object in the learning model generated by the learning model generation unit according to the present embodiment.
A schematic diagram of the first step of the processing of the image processing system according to the present embodiment.
A schematic diagram of the reason for providing a margin area.
A schematic diagram of the second step of the processing of the image processing system according to the present embodiment.
A flow diagram of the processing of the image processing system according to the present embodiment.

[Embodiment]
FIG. 1 shows an outline of an image processing system according to an embodiment of the present invention.

[Overview of image processing system 1]

The image processing system 1 according to the present embodiment captures a moving image of an area containing a predetermined object, and automatically and appropriately extracts from the captured moving image the predetermined object and/or the information written on it. For example, suppose a moving image is captured of an imaging area containing a plurality of receipts (which may differ from one another in shape, size, and the layout of what is written on their surfaces), business cards, other pieces of paper, and rectangular objects or areas (for example, smartphones or the keys of a computer keyboard). From among these, the image processing system 1 can automatically extract, in real time, a specific object, for example the plurality of receipts and/or the information written on them, and convert it into digital information usable by a computer.

The description below uses the example shown in FIG. 1(a), in which the image processing system 1 includes an information terminal 2 having a camera that captures a moving image of a predetermined imaging area and a server 3 that executes predetermined information processing, with the information terminal 2 and the server 3 connected via a communication network 4 so that they can communicate bidirectionally. Here, the predetermined object extracted by the image processing system 1 according to the present embodiment is assumed, as an example, to be a receipt, which comes in various shapes and sizes and is produced in various styles and formats.

For example, suppose a plurality of objects (for example, an object 80 and an object 82) are placed on a desk 90. The objects may be placed at a given location by the user, for example. The image processing system 1 then captures a moving image of the area containing them with the camera of the information terminal 2. In the example of FIG. 1(a), the object 80 (for example, a business card) and the object 82 (for example, a receipt) are placed on the desk 90. Part of the object 82 may be bent. The image processing system 1 then extracts, from the moving image in which the objects were captured, one or more moving-image constituent images that make up the moving image. Next, the image processing system 1 resizes each of the extracted moving-image constituent images to generate one or more resized images.

Next, the image processing system 1 acquires, from the resized image, the coordinates of predetermined locations of the object to be extracted. These are the coordinates of the predetermined locations in the resized image. For this purpose, the image processing system 1 prepares in advance a learning model for determining whether an object contained in an image is the predetermined object. For example, when the object to be extracted is a receipt and an image contains both a receipt and an object other than a receipt, the learning model can be used to recognize the receipt as a receipt and to recognize the other object as not being a receipt.

Here, in the present embodiment, a learning model is prepared in advance that includes an association between one or more rectangular areas (that is, bounding boxes), each centered on one of one or more predetermined locations of the object to be extracted, and the category of that object. In other words, unlike a conventional learning model, which associates a rectangular area enclosing the entire object to be extracted with the category of that object, the present embodiment constructs and uses a learning model that includes an association between a plurality of rectangular areas, each enclosing one of a plurality of parts of a single object to be extracted, and the category of that object. For example, the learning model treats four square areas centered on the four corners of a receipt as one set, associates that set with the object category "receipt", and, when a moving-image constituent image is input, outputs the image of the area occupied by the receipt in the moving-image constituent image and/or the coordinates of its four corners.

As an example, this learning model is generated based on information such as a large number of previously acquired images of the predetermined object, features of the predetermined object such as its corners and feature points, and background images on which the object may be placed, and is used to determine whether an object contained in a moving-image constituent image is the predetermined object. Alternatively, the image processing system 1 may use a table that stores information on the features of a predetermined object in association with an identifier identifying that object, and determine whether each of the one or more objects contained in a moving-image constituent image is the predetermined object. In the present embodiment, however, it is preferable to use the learning model for this determination, so as to handle predetermined objects of various shapes and sizes flexibly, quickly, and accurately.

The image processing system 1 then uses the learning model to acquire, from the resized image, the coordinates of the predetermined locations of the object to be extracted, for example the coordinates of the four corners when the object is rectangular. In doing so, the image processing system 1 uses the learning model to determine whether the object is a receipt, and/or to acquire the coordinates of the four corners of the receipt, based on the square areas centered on each of the four corners. Because the image processing system 1 captures a moving image, if the moving image is captured while the information terminal 2 is being moved, for example, some moving-image constituent images may not contain the entire object to be extracted, and resized images that do not contain the entire object may therefore also be generated. The image processing system 1 therefore uses the learning model to select a resized image that contains all of the predetermined locations of the object to be extracted, and acquires the coordinates of the predetermined locations of the object from the selected resized image.

For simplicity, the following description mainly covers the case in which the image processing system 1 acquires the coordinates of the four corners of an object and performs its processing, but the image processing system 1 can also acquire the coordinates of only some of the corners and estimate the coordinates of the remaining corners. That is, the image processing system 1 need not acquire the coordinates of all four corners of the object 82. In this case, the image processing system 1 acquires the coordinates of some of the corners of the object 82 and estimates the corners whose coordinates were not acquired from the coordinates of the acquired corners (for example, when the coordinates of three corners are acquired, it can estimate the coordinates of the remaining corner, and when the coordinates of two diagonally opposite corners are acquired, it can estimate the coordinates of the remaining two corners).
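The specification does not state how a missing corner is estimated. One simple possibility, shown here only as a sketch, is to assume the object appears approximately as a parallelogram in the image and complete the fourth corner from three consecutive ones. The function name `estimate_fourth_corner` is hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y)


def estimate_fourth_corner(a: Point, b: Point, c: Point) -> Point:
    """Estimate the corner opposite `b`, given consecutive corners a, b, c.

    Assumption (not from the specification): the object projects to a
    parallelogram, so the missing corner d satisfies d = a + c - b.
    """
    return (a[0] + c[0] - b[0], a[1] + c[1] - b[1])
```

Under strong perspective a receipt is not an exact parallelogram, so an estimate of this kind is only approximate. Estimating two corners from a diagonal pair would need further assumptions, such as the object's orientation.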

Specifically, in the example of FIG. 1(b), the image processing system 1 acquires the coordinates of at least some of the four corners (corner 150, corner 152, corner 154, and corner 156) of the object 82 (the receipt) contained in the resized image 100. By contrast, using the learning model, the image processing system 1 does not acquire the coordinates of the four corners of the object 80 (the business card), which is not an extraction target. When part of the receipt is missing from a resized image, that is, when some of the four corners of the receipt are not contained in the resized image, the image processing system 1 may skip that resized image and acquire the coordinates from a resized image that contains all four corners of the receipt. In addition, even when part of the object 82 is bent (that is, even when part of the object 82 is lifted off the desk 90), as long as the resized image 100 contains the corners of the object 82, or some of them, the image processing system 1 acquires from the resized image 100 either the coordinates of the four corners of the object 82, or the coordinates of some corners together with the coordinates of the remaining corners estimated from them.

In the learning model used by the image processing system 1, training can also be performed on images in which the object to be extracted is superimposed on various background images. As a result, the image processing system 1 can appropriately acquire the coordinates of the predetermined locations of the object 82 even when the outer edge of the object 82 is difficult to recognize because of the color of the desk 90 that forms the background.
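As a rough illustration of this kind of training data, the sketch below pastes an object image onto a background image and returns the corner coordinates that would serve as labels. It assumes grayscale images stored as nested lists and an axis-aligned paste with no rotation or perspective. The function `superimpose` is hypothetical and is not described in the specification.

```python
from typing import List, Tuple

Image = List[List[int]]   # grayscale pixels, one inner list per row
Point = Tuple[int, int]   # (x, y)


def superimpose(background: Image, obj: Image, left: int, top: int) -> Tuple[Image, List[Point]]:
    """Paste `obj` onto a copy of `background` and return the composite with corner labels.

    Corners are returned clockwise from the top-left pixel of the pasted object.
    """
    obj_h, obj_w = len(obj), len(obj[0])
    bg_h, bg_w = len(background), len(background[0])
    if left < 0 or top < 0 or left + obj_w > bg_w or top + obj_h > bg_h:
        raise ValueError("object must fit inside the background")
    composite = [row[:] for row in background]  # leave the background untouched
    for dy, obj_row in enumerate(obj):
        composite[top + dy][left:left + obj_w] = obj_row
    right, bottom = left + obj_w - 1, top + obj_h - 1
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return composite, corners
```

Repeating the paste over many backgrounds and positions would give labeled examples in which the object's outline has low contrast against its surroundings, which is the situation the paragraph above describes.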

Next, the image processing system 1 projects the coordinates acquired from the resized image 100 onto the original moving-image constituent image from which that resized image was generated, and uses the resulting coordinates (for example, coordinates 150a, 152a, 154a, and 156a shown in FIG. 1(c)) to extract, from the original moving-image constituent image, an image area containing the object 82 to be extracted. In doing so, the image processing system 1 may extract an image area that includes a predetermined margin area around the object 82.
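A minimal sketch of this projection and margin crop follows. It assumes the resized image is a plain rescale of the whole original image (no letterboxing), that the image area is an axis-aligned rectangle, and that the margin is a fixed number of pixels. The function names and the margin value used in the usage note are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y)
Size = Tuple[float, float]                # (width, height)
Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def project_to_original(points: List[Point], resized: Size, original: Size) -> List[Point]:
    """Scale coordinates found in the resized image back to the original image."""
    sx = original[0] / resized[0]
    sy = original[1] / resized[1]
    return [(x * sx, y * sy) for x, y in points]


def crop_rect_with_margin(points: List[Point], original: Size, margin: float) -> Rect:
    """Bounding rectangle of the projected corners, grown by `margin` and clamped to the image."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (
        max(0, min(xs) - margin),
        max(0, min(ys) - margin),
        min(original[0], max(xs) + margin),
        min(original[1], max(ys) + margin),
    )
```

For example, corners found at (30, 30) and (90, 120) in a 300 x 300 resized image map to (120, 90) and (360, 360) in a 1200 x 900 original, and a 50-pixel margin widens the crop to (70, 40, 410, 410).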

The image processing system 1 then resizes the extracted image area again to generate a resized image area. Next, it uses the learning model again to acquire, once more, the coordinates of the predetermined locations of the object to be extracted from the resized image area. These are the coordinates of the predetermined locations in the resized image area. The image processing system 1 then projects the coordinates acquired from the resized image area onto the original moving-image constituent image from which the image area underlying the resized image area was extracted, and uses the resulting coordinates to extract the object image area of the object 82 from that original moving-image constituent image. In this way, the image processing system 1 can appropriately extract, in real time, the image of the object 82 captured in the moving image. The image processing system 1 may apply predetermined image processing before extracting the object image area. For example, when part of a receipt is bent and lifted off the plane on which the receipt is placed, the text and figures shown on the lifted part may be distorted in the object image area. The image processing system 1 therefore applies image processing to the object image area to remove such distortion. The image processing system 1 then stores the object image area as data readable by, for example, optical character recognition (OCR).
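The second projection differs from a plain rescale because the resized image area covers only a cropped part of the original moving-image constituent image, so the crop's offset has to be added back. The sketch below assumes an axis-aligned crop rectangle given in original-image pixels. The name `crop_to_frame` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y)
Size = Tuple[float, float]                # (width, height)
Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def crop_to_frame(points: List[Point], crop: Rect, resized: Size) -> List[Point]:
    """Map coordinates found in the resized image area to original-image coordinates.

    Each point is first scaled from the resized area back to the crop's own
    size, then shifted by the crop's top-left offset in the original image.
    """
    left, top, right, bottom = crop
    sx = (right - left) / resized[0]
    sy = (bottom - top) / resized[1]
    return [(left + x * sx, top + y * sy) for x, y in points]
```

For example, with a crop of (100, 50, 400, 350) resized to 150 x 150, a corner found at (25, 25) in the resized image area lies at (150, 100) in the original image.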

Furthermore, based on that data, the image processing system 1 can read the information written on the surface of the predetermined object contained in the moving image and output what it has read to, for example, the display unit of the information terminal 2. For example, when the predetermined object is a receipt, the image processing system 1 may actually read the OCR-readable data generated by capturing a moving image whose imaging area contains the receipt, and output the reading result to the display unit of the information terminal 2 or the like. In this case, the image processing system 1 can output the specific contents of each receipt, such as the date and issuing company stated on the receipt, the amount, and the items purchased. The image processing system 1 can also store the information it has read and pass the stored information to an accounting system, household accounting system, or the like outside the image processing system 1 (the image processing system 1 may also pass the read information directly to such an external system).

As a result, with the image processing system 1, there is no need to photograph the objects one at a time or to scan them with a scanner; simply capturing a moving image of the objects arranged on a desk or the like is enough to identify each of them and appropriately extract the information on the surface of each. Therefore, when the objects are, for example, receipts produced in various shapes and styles, the system can reduce the accounting and bookkeeping effort, and improve usability, for accounting firms that must process huge numbers of receipts, sole proprietors who must process many receipts, and the many people who keep household accounts.

In particular, the image processing system 1 according to the present embodiment extracts the image of the object to be extracted (that is, the object image area) through two steps. In the first step, it extracts a moving-image constituent image (original image) from the captured moving image, resizes it to generate a resized image, acquires the coordinates of the predetermined locations of the object to be extracted from the resized image, and projects the acquired coordinates onto the moving-image constituent image (original image) to extract an image area containing the object. In the second step, it resizes this image area again to generate a resized image area, acquires the coordinates of the predetermined locations of the object from the resized image area, and projects the acquired coordinates onto the moving-image constituent image (original image) to extract the object image area of the object. By going through the first and second steps, the image of the object can be extracted with high accuracy even when edge detection of the object is difficult.

Here, in the present embodiment, a learning model is constructed that uses bounding boxes of a plurality of parts of the object to be extracted rather than a bounding box of the entire object. This is based on the present inventors' finding, obtained through diligent research, that constructing and using a learning model with a plurality of bounding boxes centered on characteristic parts of the object, rather than a single bounding box containing the entire object, allows the object to be extracted with very high accuracy and also improves the processing speed of the system.

That is, in the first step the image processing system 1 coarsely extracts, from the moving image constituent image, an image area including the predetermined object by using a plurality of bounding boxes centered on a plurality of characteristic parts (for example, corners) of the predetermined object. In the second step, based on the coarsely extracted image area, the system precisely extracts the object image area including the predetermined object, again using a plurality of bounding boxes centered on the characteristic parts of the object. The reason is that the estimation of a bounding box area for an object itself contains an error. In the present embodiment, therefore, the process using bounding boxes is repeated (that is, at least two steps, the first step and the second step, are executed), which reduces this error and allows the object to be detected with high accuracy. For object detection with bounding boxes, the Single Shot MultiBox Detector (SSD), which detects objects in an image with a single deep neural network, can be used as one example. Objects come in many rectangular forms (for example, business cards, receipts, keyboard keys, and smartphones). By constructing in advance a learning model for the object to be extracted (a receipt in the above example), the image processing system 1 can appropriately extract that object from the moving image and can prevent the detection and extraction of unintended rectangular areas.

In the present embodiment, the objects have planar shapes, which may be identical to or different from one another. The shape of an object is not particularly limited; for example, it may be a quadrilateral, and at least part of its four corners or four sides may be missing. The shape is also not limited as long as it has corners (that is, vertices), and may be a polygon such as a triangle, pentagon, or hexagon, or may partly include an arc. The size of an object is likewise not particularly limited. Various kinds of information (text, graphics, handwritten characters and numerals, figures, and the like) may be printed and/or written on the surface of an object in various formats. Examples of objects include, but are not limited to, quotations, invoices, receipts, and/or business cards. When the object is a receipt or the like, the information on its surface includes the date of issue, time of issue, addressee, amount, description, issuer name, and/or issuer telephone number. Accordingly, the moving image captured by the image processing system 1 according to the present embodiment may include images of a plurality of objects of various shapes and sizes with various information on their surfaces. That is, the shape, size, and/or surface information of the objects imaged by the image processing system 1 may differ from object to object. From the plurality of objects, the image processing system 1 can extract only the objects of a predetermined category.

The information terminal 2 may be a mobile communication terminal, a smartphone, a notebook computer, and/or a tablet PC, or may be an information terminal such as a PC, a watch, or the like that can be connected to an imaging device capable of capturing moving images. The communication network 4 is a communication network such as a mobile phone network and/or the Internet, and may also include communication networks such as a wired LAN and a wireless LAN. The details of the image processing system 1 according to the present embodiment are described below. Note that the names, numerical values, quantities, and the like in the above and following descriptions are merely examples, and the invention is not limited to them.

[Details of the configuration of the image processing system 1]
FIG. 2 shows an example of the functional configuration of the image processing system according to the embodiment of the present invention. The following description mainly uses an example in which the object to be extracted is a receipt.

＜Outline of the configuration of the image processing system 1＞
The image processing system 1 includes a moving image capturing unit 10 that captures a moving image, a constituent image extraction unit 12 that extracts moving image constituent images from the moving image, a resizing processing unit 14 that resizes an image, a coordinate acquisition unit 16 that acquires the coordinates of predetermined locations of a predetermined object from an image, an image area extraction unit 18 that extracts an image area, an image processing unit 20 that applies predetermined processing to an image, a direction adjusting unit 22 that adjusts the direction of an object in an image, an information storage unit 24 that stores predetermined information, a learning model generation unit 26 that generates a learning model, an input unit 28 that accepts input of predetermined information, an output unit 30 that outputs predetermined information, and a reading unit 32 that reads text data and the like on the surface of an object.

The components of the image processing system 1 need not all be located at physically the same place; some of them may be installed at physically separate locations. For example, the image processing system 1 may consist of the information terminal 2 alone (that is, it may be configured locally only), or may include the information terminal 2 and a server 3 connected to the information terminal 2 via the communication network 4 or the like. When the image processing system 1 includes the information terminal 2 and the server 3, the information terminal 2 may have some of the components and the server 3 the remaining components. In this case, for example, the information terminal 2 can capture a moving image and supply it to the server 3, which executes the predetermined processing. The server may be a group of servers, in which case each server takes on some or all of the components other than the moving image capturing unit 10. For example, part of the processing of the image processing system 1 may be executed on the information terminal 2 (for example, from capturing the moving image up to generating the resized image), and the remaining processing may be executed on one or more other servers (for example, the processing after generation of the resized image). The image processing system 1 may also consist of an imaging device having the moving image capturing unit 10 and one or more image processing devices having the other components. When the "one or more image processing devices" are a plurality of processing devices, the components other than the moving image capturing unit 10 can be allocated to the processing devices as appropriate according to their information processing capacity and the functions they are to perform.

＜Details of the configuration of the image processing system 1＞
(Moving image capturing unit 10, constituent image extraction unit 12)
The moving image capturing unit 10 captures a moving image of an imaging region, that is, of one or more objects included in the imaging region. It can image the target from directly above (that is, at a depression angle of 90 degrees) or at a depression angle of less than 90 degrees. The moving image capturing unit 10 may adjust the frame rate as appropriate when capturing. It supplies the captured moving image to the constituent image extraction unit 12. The constituent image extraction unit 12 extracts a plurality of moving image constituent images from the moving image received from the moving image capturing unit 10. Here, a moving image constituent image is a frame image, a field image, or an image of any other format that makes up a moving image. The constituent image extraction unit 12 supplies the extracted moving image constituent images to the resizing processing unit 14 and the image area extraction unit 18.

(Resizing processing unit 14)
The resizing processing unit 14 applies resizing to an image to generate a resized image. Specifically, it resizes the moving image constituent image extracted by the constituent image extraction unit 12. For example, the resizing processing unit 14 generates a resized image that is smaller than the moving image constituent image; in doing so, it may transform a rectangular moving image constituent image into a square resized image. As an example, a moving image constituent image with vertical and horizontal pixel counts of 3000 px × 2000 px is resized to 300 px × 300 px. Resizing improves the processing speed. The resizing processing unit 14 supplies the resized image to the coordinate acquisition unit 16.
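As a rough illustration of the resizing described above, the following Python sketch shrinks an image held as a nested list and reports the per-axis scale factors. The names `scale_factors` and `resize_nearest`, the nested-list image representation, and the choice of nearest-neighbor sampling are all assumptions made for illustration; the embodiment does not specify an interpolation method.

```python
def scale_factors(src_size, dst_size):
    """Return (sx, sy): how many source pixels one resized pixel spans per axis."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return (src_w / dst_w, src_h / dst_h)


def resize_nearest(image, out_w, out_h):
    """Resize a row-major nested-list image by nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A rectangular frame mapped to a square: the two axes get different factors.
frame = [[row * 10 + col for col in range(6)] for row in range(4)]  # 6 wide, 4 high
small = resize_nearest(frame, 2, 2)
factors = scale_factors((6, 4), (2, 2))
```

Because a rectangular image is squeezed into a square, the horizontal and vertical scale factors differ, so coordinates found in the resized image have to be scaled back separately along each axis.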

(Coordinate acquisition unit 16)
The coordinate acquisition unit 16 acquires the coordinates of predetermined locations of an object based on a moving image constituent image of a moving image in which the object is captured. It acquires these coordinates from the image by using a learning model prepared in advance, which is described later. The coordinates of a predetermined location are the coordinates of the center of a rectangular region that is centered on a corner of the object and is sized so that one of its sides touches the outer edge of the moving image constituent image, or of a generated image produced from the moving image constituent image; the side in question is the one perpendicular to the shortest straight line from the center to that outer edge. One or more such rectangular regions are formed. Specifically, the coordinate acquisition unit 16 acquires the coordinates of the predetermined locations of the object from the resized image, as the generated image, received from the resizing processing unit 14. The coordinates of the predetermined locations are the coordinates of characteristic parts of the object; for a rectangular object, for example, they are the coordinates of the four corners or of at least some of the corners. In other words, each is the center of a rectangular region (for example, a square) centered on a corner of the object and sized so that the side perpendicular to the shortest line from the center to the outer edge of the resized image touches that outer edge. When the coordinate acquisition unit 16 acquires the coordinates of only some of the predetermined locations from the resized image, it estimates the coordinates of the remaining locations based on the learning model. As an example, when the object is rectangular, the coordinate acquisition unit 16 acquires the coordinates of three corners and estimates the coordinates of the remaining corner from them. Because a moving image consists of a plurality of moving image constituent images, the resizing processing unit 14 also generates a plurality of resized images. When the coordinate acquisition unit 16 receives a plurality of resized images, it may select a resized image from which all the coordinates of the predetermined locations of the object to be extracted can be acquired, and acquire the coordinates from the selected image. The coordinate acquisition unit 16 supplies information on the acquired coordinates to the image area extraction unit 18.
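The geometry of the corner-centered region can be made concrete with a small Python sketch. It assumes a square region, so the half-width equals the shortest distance from the corner to the image edge. The function names are hypothetical. The parallelogram completion used to estimate a missing fourth corner is also an assumption: the text only states that the remaining corner is estimated from the other three, without naming a method.

```python
def corner_box(cx, cy, img_w, img_h):
    """Square centered on (cx, cy) whose nearest side touches the image edge."""
    half = min(cx, cy, img_w - cx, img_h - cy)  # shortest distance to the outer edge
    return (cx - half, cy - half, cx + half, cy + half)  # (left, top, right, bottom)


def estimate_fourth_corner(a, b, c):
    """Given consecutive corners a, b, c, return d so that a-b-c-d is a parallelogram."""
    return (a[0] + c[0] - b[0], a[1] + c[1] - b[1])


# A corner at (60, 100) in a 300 x 300 resized image is 60 px from the left edge.
box = corner_box(60, 100, 300, 300)

# Three known corners of a receipt (top-left, top-right, bottom-right).
missing = estimate_fourth_corner((10, 10), (110, 10), (110, 210))
```

The parallelogram completion is exact only when the object appears undistorted; for an obliquely imaged object it gives an approximation at best.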

(Image area extraction unit 18)
The image area extraction unit 18 extracts, from the moving image constituent image, an image area that includes the object to be extracted, based on the coordinates acquired by the coordinate acquisition unit 16. Specifically, it projects the coordinates of the predetermined locations acquired from the resized image onto the moving image constituent image as it was before resizing, and extracts the image area of that moving image constituent image that includes the object. In doing so, the image area extraction unit 18 can add a predetermined margin; that is, it can extract as the image area a region that includes a predetermined margin outside the region specified by the projected coordinates. The image area extraction unit 18 supplies the extracted image area to the resizing processing unit 14.
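The projection and margin described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the projection is taken to be a per-axis rescaling, the extracted area is taken to be the axis-aligned box enclosing the projected corners, and the function names, the 50 px margin, and the sample sizes are hypothetical.

```python
import math


def project_point(pt, resized_size, original_size):
    """Scale a point found in the resized image back to original-image pixels."""
    rw, rh = resized_size
    ow, oh = original_size
    return (pt[0] * ow / rw, pt[1] * oh / rh)


def region_with_margin(points, original_size, margin):
    """Axis-aligned box around the points, grown by margin and clamped to the image."""
    ow, oh = original_size
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left = max(0, math.floor(min(xs)) - margin)
    top = max(0, math.floor(min(ys)) - margin)
    right = min(ow, math.ceil(max(xs)) + margin)
    bottom = min(oh, math.ceil(max(ys)) + margin)
    return (left, top, right, bottom)


resized = (300, 300)      # (width, height) of the resized image
original = (3000, 2000)   # (width, height) of the moving image constituent image
corners_small = [(30, 60), (150, 60), (150, 240), (30, 240)]
corners_full = [project_point(p, resized, original) for p in corners_small]
area = region_with_margin(corners_full, original, margin=50)
```

Clamping keeps the extracted area inside the original image when the object lies close to the edge of the frame.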

The resizing processing unit 14 then resizes the image area received from the image area extraction unit 18 to generate a resized image area. That is, one resized image is generated from one moving image constituent image, and one image area is extracted from that moving image constituent image using the coordinates acquired from the resized image. Resizing this image area yields the resized image area, so the predetermined region (the image area) extracted from the moving image constituent image is itself subjected to resizing again. The resizing processing unit 14 supplies the resized image area to the coordinate acquisition unit 16.

Next, the coordinate acquisition unit 16 acquires the coordinates of the predetermined locations of the object to be extracted from the resized image area, as the generated image. As before, these are the coordinates of characteristic parts of the object; for a rectangular object, for example, the coordinates of the four corners or of at least some of the corners. Specifically, each is the center of a rectangular region centered on a corner of the object and sized so that the side perpendicular to the shortest line from the center to the outer edge of the resized image area touches that outer edge. Here too, the coordinate acquisition unit 16 uses the learning model described later to acquire the coordinates from the resized image area. When it acquires the coordinates of only some of the predetermined locations, it estimates the remaining ones based on the learning model; as an example, for a rectangular object it acquires the coordinates of three corners and estimates the remaining corner from them. The coordinate acquisition unit 16 supplies information on the acquired coordinates to the image area extraction unit 18. The image area extraction unit 18 then projects the coordinates acquired from the resized image area onto the moving image constituent image and extracts the object image area of the object to be extracted. It supplies the extracted object image area to the image processing unit 20 and the information storage unit 24.
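In the second pass, coordinates are found in a resized copy of a cropped area, so projecting them onto the moving image constituent image involves both a rescaling and the offset of the crop. The Python sketch below shows one way to express this; the function name, the `(left, top, right, bottom)` box convention, and the sample numbers are assumptions for illustration only.

```python
def project_from_region(pt, resized_size, region):
    """Map a point in the resized image area to original-image pixel coordinates.

    region is (left, top, right, bottom) of the cropped area in the original image.
    """
    left, top, right, bottom = region
    rw, rh = resized_size
    x = left + pt[0] * (right - left) / rw
    y = top + pt[1] * (bottom - top) / rh
    return (x, y)


region = (250, 350, 1550, 1650)  # hypothetical area cropped in the first pass
corner = project_from_region((15, 30), (300, 300), region)

# Original pixels covered by one resized pixel, horizontally, in each pass.
first_pass_step = 3000 / 300
second_pass_step = (region[2] - region[0]) / 300
```

With these sample numbers, one pixel of the resized image area covers fewer original pixels than one pixel of the first resized image. That is one plausible reading of why repeating the bounding-box process reduces the error, although the text does not state the mechanism.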

(Image processing unit 20)
The image processing unit 20 applies predetermined image processing (for example, correction of blur, distortion, rotation, and the like) to the image area extracted by the image area extraction unit 18 to generate the object image area. Alternatively, the image area extraction unit 18 may use the extracted image area as the object image area without image processing by the image processing unit 20. The image processing unit 20 processes the object image area so that the reading unit 32, described later, can read and input information properly. For example, the object image area may contain the object in a shape deformed from its original shape: when the object is a receipt imaged from an oblique angle, the moving image contains an image of the receipt that is a quadrilateral but not a rectangle. In such a case, the image processing unit 20 transforms the object into a rectangle by processing such as an affine transformation. The image processing unit 20 thereby removes keystoning, the phenomenon in which an object imaged from an oblique angle appears in the object image area as a trapezoid. The image processing unit 20 can also apply image processing such as binarization and sharpening to the object image area so that the reading unit 32 reads a clearer image. The image processing unit 20 supplies the processed image to the direction adjusting unit 22.
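The text names an affine transformation "and the like" for this correction. An affine map preserves parallel lines, so the sketch below instead uses a projective transform (homography), which can map an arbitrary quadrilateral to a rectangle. This is a swapped-in technique chosen for illustration, not a method confirmed by the text. The function names and the sample corner coordinates are hypothetical, and only point mapping is shown, not pixel resampling.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def homography(src, dst):
    """3x3 projective transform mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, pt):
    """Apply the transform to one point, including the perspective division."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Trapezoidal receipt corners (as imaged) and the upright rectangle they should map to.
quad = [(100, 50), (300, 50), (350, 400), (50, 400)]
rect = [(0, 0), (200, 0), (200, 350), (0, 350)]
H = homography(quad, rect)
```

Warping an actual image would apply the inverse of this mapping to every output pixel and sample the source image at the resulting position.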

(Direction adjusting unit 22)
The direction adjusting unit 22 adjusts the direction of the predetermined object included in the object image area. The predetermined objects in images processed by the image processing unit 20 are not necessarily oriented in a single predetermined direction. The direction adjusting unit 22 therefore aligns the orientation of the predetermined object in the image with a predetermined direction so that the reading unit 32, described later, can capture and input information properly. For example, taking the long side of the rectangle as the reference, an image transformed into a rectangle by the affine transformation in the image processing unit 20 can be in one of four orientations, such as 0°, 90°, 180°, and 270°. As an example, the direction adjusting unit 22 rotates the image so that the predetermined object (for example, a rectangular object such as a receipt) in the image obtained after image processing such as the affine transformation is in portrait orientation in front view (that is, with the short sides horizontal and the long sides vertical when the display of the information terminal 2 or the like is viewed from the front). The direction adjusting unit 22 can thereby align the predetermined objects in object image areas with the predetermined direction. It supplies the direction-adjusted object image area to the information storage unit 24 and the reading unit 32.
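The rotation to portrait orientation can be sketched in Python as below. The image is assumed to be a row-major nested list, and the function names are hypothetical. Comparing width and height only distinguishes portrait from landscape; telling an upright image from one turned by 180° needs additional information about which side is the top.

```python
def rotate_quarter_turns(image, turns):
    """Rotate a row-major nested-list image clockwise by turns * 90 degrees."""
    result = [list(row) for row in image]
    for _ in range(turns % 4):
        # Reversing the rows and transposing gives one clockwise quarter turn.
        result = [list(row) for row in zip(*result[::-1])]
    return result


def to_portrait(image):
    """Return the image with its short side horizontal and long side vertical."""
    height, width = len(image), len(image[0])
    return rotate_quarter_turns(image, 1) if width > height else image


landscape = [[1, 2, 3],
             [4, 5, 6]]           # 3 wide, 2 high
portrait = to_portrait(landscape)  # 2 wide, 3 high
```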

The direction adjusting unit 22 can also generate a learning model by training on data of a plurality of predetermined objects (for example, image data of rectangular objects whose short sides run horizontally in front view) that are randomly assigned to a predetermined set of classes according to a predetermined rule. This learning model can infer which direction is the top of a predetermined object. Once the top direction is recognized, the area of the predetermined object can easily be transformed into a rectangle. This learning model may also be combined with a recognition method based on Tesseract, and the combination yields higher accuracy.

(Information storage unit 24)
The information storage unit 24 stores the direction-adjusted object image area, that is, an object image area suitable for reading by the reading unit 32. For example, the information storage unit 24 can store, in association with a user ID that identifies the user, the object image area together with information such as the date and time at which the moving image containing it was captured. The information stored in the information storage unit 24 can be supplied to the information terminal 2 or to an external server (for example, a server outside the image processing system 1 that is used for accounting or bookkeeping). The information terminal 2 need not include the information storage unit 24; in that case, the information storage unit 24 may be provided in an external server connected to the information terminal 2 for bidirectional communication via the communication network 4.

(Learning model generation unit 26)
Based on the learning model, the coordinate acquisition unit 16 determines whether the resized image received from the resizing processing unit 14 includes the predetermined object, and acquires the coordinates of at least some of the predetermined locations of that object. The coordinate acquisition unit 16 likewise acquires, based on the learning model, the coordinates of at least some of the predetermined locations of the predetermined object included in the resized image area received from the resizing processing unit 14. The coordinate acquisition unit 16 can make this determination using known image recognition techniques or machine learning.

That is, the coordinate acquisition unit 16 uses a learning model prepared by learning the features of the predetermined object in advance to determine whether the moving image constituent image, the resized image, and/or the resized image area includes the predetermined object. It does not recognize an object lacking the features of the predetermined object as the predetermined object. The determination can be made by inference with a learning model constructed by training a neural network on a large number of images of the predetermined object and the like. For example, the coordinate acquisition unit 16 uses the learning model together with the in-image features of objects extracted from the moving image constituent image, the resized image, and/or the resized image area to determine whether the predetermined object is present in that image, and whether each object included in it is the predetermined object.

Specifically, the learning model generation unit 26 uses, as teacher data, images including the predetermined object, background images on which the predetermined object can be placed, and combinations of an image including the predetermined object with such a background image, and learns from them by, as an example, a gradient method. The learning model generation unit 26 thereby generates a learning model that, when a moving image (a captured image) is input, outputs one or more rectangular areas centered on one or more corners of the predetermined object, the coordinates of each center, and/or an image of the predetermined object, in order to identify whether or not an object included in a moving image constituent image, a resized image, and/or a resized image area of the moving image is the predetermined object. The learning model generation unit 26 may also generate the learning model by using, as teacher data, a moving image captured by the moving image capturing unit 10 and/or a moving image acquired outside the image processing system 1 or in another image processing system 1 different from the image processing system 1.

More specifically, the learning model generation unit 26 uses images including the predetermined object, background images, and the like to generate a learning model that determines whether or not an object included in an image is the predetermined object, with rectangular areas (that is, bounding boxes) centered on one or more predetermined locations of the predetermined object as the extraction target. Rather than targeting a single bounding box that contains the entire predetermined object, as in the conventional approach, the learning model generation unit 26 generates a learning model that targets a set of bounding boxes, each centered on one of one or more predetermined locations of the predetermined object. That is, the learning model generation unit 26 does not generate a learning model that outputs the area the predetermined object occupies in an image based on one image containing the entire object; it generates a learning model that outputs that area based on a plurality of bounding boxes, each centered on one of a plurality of predetermined locations of the predetermined object. For example, when an image containing one or more objects including a receipt is input, the learning model generation unit 26 generates a learning model whose extraction targets are four bounding boxes centered on the four corners of the receipt, the image of the receipt recognized from the four bounding boxes, and/or the coordinates of the four corners. The learning model generation unit 26 may instead generate a learning model whose extraction targets are one or more bounding boxes centered on some of the corners of the receipt, the image of the receipt recognized from the one or more bounding boxes, and/or the coordinates of the one or more corners.

The learning model generation unit 26 may generate the learning model by applying data augmentation to images including the predetermined object, thereby artificially increasing the amount of training data. For example, as images including the predetermined object, the learning model generation unit 26 can use not only an object image including the predetermined object but also a deformed image obtained by deforming the object image (for example, an image in which a part of the predetermined object is missing, an image in which the object image is rotated by a predetermined angle, or an image in which the object image is distorted), a noise image obtained by adding predetermined noise to the object image, an object image including a plurality of predetermined objects, and the like. Further, the learning model generation unit 26 can use not only an image of one predetermined object captured from the front but also images of that object captured from various angles, or can transform the image captured from the front into images as captured from various angles. As an object image including a plurality of predetermined objects, it is possible to use an image in which one predetermined object overlaps another predetermined object, an image in which some of the predetermined objects extend beyond the imaging area so that only a part of such an object lies within the imaging area, and the like. When a plurality of predetermined objects are included in the object image, any one of them can be learned as the predetermined object to be recognized (for example, the model can be trained to recognize the leftmost or rightmost predetermined object as the predetermined object included in the object image).

Further, the learning model generation unit 26 can also generate the learning model by superimposing various background images on the image of the predetermined object. As the background images, a wide variety of background images that differ in color, lightness, luminance, contrast, and/or the presence or absence of light reflection can be used. This is because receipts may be placed in many different environments. For example, a receipt may be placed on a desk that is white or brown, on a desk that reflects fluorescent light depending on the indoor environment, or on carpets of various colors and surface textures. The learning model generation unit 26 therefore superimposes various background images on the image of the predetermined object when generating the learning model.
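As an illustration of the background superimposition described above, the following is a minimal sketch that pastes an object image onto a copy of a background image and returns the resulting corner coordinates, which could then serve as labels. The function name `composite`, the representation of images as nested lists of pixel values, and the axis-aligned placement are assumptions made for this sketch; the specification does not state how the superimposition is implemented.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values (hypothetical representation)


def composite(background: Image, obj: Image, left: int, top: int) -> Tuple[Image, List[Tuple[int, int]]]:
    """Paste `obj` onto a copy of `background` with its top-left pixel at (left, top).

    Returns the composited image and the (x, y) pixel coordinates of the four
    corners of the pasted object, in clockwise order from the top-left corner.
    """
    obj_h, obj_w = len(obj), len(obj[0])
    bg_h, bg_w = len(background), len(background[0])
    if left < 0 or top < 0 or left + obj_w > bg_w or top + obj_h > bg_h:
        raise ValueError("object does not fit inside the background")

    # Copy every row so the original background can be reused for other samples.
    out = [row[:] for row in background]
    for r in range(obj_h):
        out[top + r][left:left + obj_w] = obj[r]

    corners = [
        (left, top),
        (left + obj_w - 1, top),
        (left + obj_w - 1, top + obj_h - 1),
        (left, top + obj_h - 1),
    ]
    return out, corners
```

Calling `composite` repeatedly with different backgrounds and offsets would yield many labeled samples from a single object image.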

When generating a learning model that outputs the area the predetermined object occupies in an image based on bounding boxes centered on predetermined locations of the predetermined object, the learning model generation unit 26 labels each of the bounding boxes centered on the one or more predetermined locations.

FIG. 3 shows an outline of the method of labeling a predetermined object in the learning model generated by the learning model generation unit according to the present embodiment.

The learning model generation unit 26 according to the present embodiment acquires the coordinates of predetermined locations of an object included in an image, forms rectangular areas (that is, bounding boxes, for example square in shape) whose center coordinates are the acquired coordinates, and uses the one or more rectangular areas thus formed as training data for identifying the object. With this learning model, the coordinate acquisition unit 16 only needs to calculate the center of each rectangular area (bounding box) to obtain the coordinates of the correct corners of the area occupied by the predetermined object, which simplifies the calculation of the corner positions.
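The center calculation mentioned above can be sketched as follows. The `Box` type and the `box_center` function are hypothetical names used only for illustration, and the (x_min, y_min, x_max, y_max) box format is an assumption; a detector may report boxes in another format.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (assumed format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center of the box, which is taken as the corner coordinate."""
    return ((box.x_min + box.x_max) / 2.0, (box.y_min + box.y_max) / 2.0)
```

Because each box is centered on a corner by construction, no edge detection or line intersection is needed to locate the corner.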

That is, when the image 130 includes a predetermined object (for example, an object 84 that is a receipt), the learning model generation unit 26 forms one or more rectangular areas centered on the coordinates of the corners of the predetermined object 84, each sized so that the side perpendicular to the shortest straight line from the center to the outer edge of the image 130 touches that outer edge. For example, as shown in FIG. 3, when the image 130 includes the predetermined object 84, the learning model generation unit 26 forms rectangular areas (that is, a rectangular area 170, a rectangular area 172, a rectangular area 174, and a rectangular area 176) centered on the coordinates of the four corners of the object 84 (that is, coordinates 160, coordinates 162, coordinates 164, and coordinates 166). In this case, the size of each rectangular area is defined by the distance from the coordinates of its center to the outer edge of the image 130. For example, the rectangular area 170 is a square centered on the coordinates 160 of a corner of the object 84 and sized so that its side 170a, which is perpendicular to the shortest straight line from the center to the outer edge of the image 130, touches the outer edge of the image 130. The other rectangular areas are formed in the same manner. The learning model generation unit 26 then uses, as teacher data, a large number of images of the predetermined object to be extracted and a large number of images in which the image of the predetermined object is superimposed on a background image, and, based on the rectangular areas formed as described above and the coordinates of their centers, generates a learning model for identifying the predetermined object in an image and outputting one or more rectangular areas centered on one or more corners of the predetermined object, the center coordinates of each rectangular area, and/or an image of the predetermined object.
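The sizing rule for a labeling rectangle can be read as: the half side length of the square equals the shortest distance from the corner to any edge of the image. The following is a minimal sketch under that reading; `corner_label_box` is a hypothetical name, and the coordinate convention (origin at the top-left of the image, x to the right, y downward) is an assumption.

```python
from typing import Tuple


def corner_label_box(cx: float, cy: float, img_w: float, img_h: float) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the largest square centered on
    (cx, cy) that still fits inside an image of size img_w x img_h."""
    if not (0 <= cx <= img_w and 0 <= cy <= img_h):
        raise ValueError("corner lies outside the image")
    # Shortest distance from the corner to the left, top, right, and bottom edges.
    half = min(cx, cy, img_w - cx, img_h - cy)
    return (cx - half, cy - half, cx + half, cy + half)
```

For a corner at (100, 50) in a 300 x 300 image, the nearest edge is the top edge at a distance of 50, so the square spans (50, 0) to (150, 100) and its top side touches the image edge.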

The learning model generation unit 26 defines the width of each rectangular area by the distance at which it touches the outer edge of the image 130 for the following reason. Through various studies, the present inventors found that forming one or more rectangular areas centered on the coordinates of the corners of the object 84, each sized so that the side perpendicular to the shortest straight line from the center to the outer edge of the image 130 containing the object 84 touches that outer edge (that is, maximizing the size of each rectangular area within the image 130 while keeping its shape square), increases the accuracy of determining whether or not one or more objects included in the image are the predetermined object.

In other words, unlike the conventional technique of identifying a predetermined object by using a rectangular area containing the entire object, the learning model according to the present embodiment can associate a set of a plurality of (for example, four) bounding boxes with the category of the predetermined object (for example, receipt) in order to identify the predetermined object and output an image of the object. In the image processing system 1, when an image (for example, a moving image constituent image, a resized image, and/or a resized image area) is input, inference processing based on the four bounding boxes is executed using the learning model, which makes it possible to identify whether or not an object included in the image is the predetermined object and to output the coordinates of predetermined locations of the object and the image area of the object.

The method of detecting and forming the bounding boxes is not limited. For example, YOLO, Fast R-CNN, Single Shot Multi Detection (SSD), and the like can be used.

The coordinate acquisition unit 16 then determines, based on the learning model generated by the learning model generation unit 26, whether or not an object included in the resized image is the predetermined object and, if it is, acquires the coordinates of the predetermined locations of the predetermined object. The coordinate acquisition unit 16 also acquires, based on the learning model, the coordinates of the predetermined locations of the predetermined object included in the resized image area. The coordinate acquisition unit 16 then supplies information on the acquired coordinates to the image area extraction unit 18.

(Reading unit 32)
The reading unit 32 reads various information written on the surface of the object included in the direction-adjusted object image area received from the direction adjusting unit 22. The reading unit 32 reads this information by using, for example, Optical Character Recognition/Reader (OCR). As an example, when the object is a receipt, the information read by the reading unit 32 includes a date, an amount of money, a telephone number, and the like. The reading unit 32 can store the read information in the information storage unit 24. The information storage unit 24 stores the read information, for example, in association with a user ID, together with information on the imaging date and imaging time of the object image area used for reading.

(Input unit 28)
The input unit 28 receives input of various information and predetermined instructions from the user. The input unit 28 is, for example, a touch panel, a keyboard, a mouse, a microphone, or a gesture sensor of the information terminal 2. The input unit 28 supplies the predetermined instruction to a predetermined component of the image processing system 1. Each component that receives the predetermined instruction performs its predetermined function.

(Output unit 30)
The output unit 30 outputs the results of the various processes executed in the image processing system 1. The output unit 30 outputs the processing results and stored information in a form perceptible to the user. Specifically, the output unit 30 outputs the processing results and stored information as still images, moving images, sound, text, and/or physical phenomena such as vibration and light. For example, the output unit 30 is a display unit, a speaker, or the like of the information terminal 2.

[Processing flow of image processing system 1]
FIG. 4 shows an outline of the first step of the processing of the image processing system according to the present embodiment, FIG. 5 shows an outline of the reason for providing the margin area, and FIG. 6 shows an outline of the second step of the processing of the image processing system according to the present embodiment. FIG. 7 shows an outline of the overall processing flow of the image processing system according to the present embodiment.

First, as shown in FIG. 7, the learning model generation unit 26 acquires or generates teacher data containing combinations of a feature amount of a predetermined object (for example, a receipt) and the category of the predetermined object (for example, receipt). The feature amount is, for example, a set of four bounding boxes based on the coordinates of the corners of the object, or a set of one or more (preferably two or more) bounding boxes based on the coordinates of some of the corners of the object. Based on the acquired or generated teacher data, the learning model generation unit 26 generates a learning model that takes a resized image or a resized image area as input and outputs one or more rectangular areas centered on one or more corners of the predetermined object included in the resized image or resized image area and/or an image of the predetermined object (step 10; hereinafter, "step" is abbreviated as "S").

Then, for example, a camera serving as the moving image capturing unit 10 of the information terminal 2 captures a moving image 110 of a plurality of objects (the predetermined object and/or other objects different from the predetermined object) (S12). As an example, as shown in FIG. 4(a), the moving image capturing unit 10 captures a moving image 110 of an object 86 (for example, a receipt). The example of FIG. 4(a) shows that the moving image 110 is composed of a plurality of moving image constituent images (for example, a moving image constituent image 120a, a moving image constituent image 120b, and a moving image constituent image 120c). In this case, the moving image captured by the moving image capturing unit 10 may be a moving image of a plurality of objects arranged on a plane or a moving image of a plurality of objects being turned over one by one. When a plurality of objects are arranged on a plane, the objects need not be oriented in the same direction, and one object may partially overlap another. Further, the moving image capturing unit 10 may move the imaging area horizontally or vertically. The size of the moving image constituent images is not limited.

Next, the constituent image extraction unit 12 converts the moving image captured by the moving image capturing unit 10 and extracts a plurality of moving image constituent images (S14). The resizing processing unit 14 then applies resizing processing to the extracted moving image constituent images to generate resized images (S16). For example, as shown in FIG. 4(b), the resizing processing unit 14 generates a resized image 140a by resizing the moving image constituent image 120a, a resized image 140b by resizing the moving image constituent image 120b, and a resized image 140c by resizing the moving image constituent image 120c.

Subsequently, the coordinate acquisition unit 16 uses the learning model 260 generated in advance by the learning model generation unit 26 to determine whether or not the resized image includes the predetermined object (for example, a receipt) and, if it does, acquires the coordinates of one or more corners (typically four corners) of the predetermined object in the resized image (S18). Here, when acquiring the coordinates of the predetermined object, the coordinate acquisition unit 16 determines whether or not a predetermined number of coordinates can be acquired (S20). For example, when the predetermined object is a rectangular receipt, the coordinate acquisition unit 16 determines whether or not it can acquire the coordinates of the four corners of one predetermined object (that is, four coordinates) or the coordinates of some of the corners (typically two or more corners). When the coordinate acquisition unit 16 determines that the predetermined number of coordinates cannot be acquired (No in S20), the constituent image extraction unit 12 extracts another moving image constituent image from the moving image 110 (S14). On the other hand, when the coordinate acquisition unit 16 determines that the predetermined number of coordinates can be acquired (Yes in S20), the coordinate acquisition unit 16 acquires the predetermined number of coordinates. When the coordinate acquisition unit 16 has acquired the coordinates of only some of the corners of the predetermined object in the resized image, it estimates and acquires the coordinates of the remaining corners by using the learning model.
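The loop formed by S14, S18, and S20 can be sketched as follows. This is a simplified illustration only: `first_usable_frame` and `detect_corners` are hypothetical names, the detector is passed in as a plain callable standing in for inference with the learning model, and the estimation of missing corners by the learning model is not modeled.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Point = Tuple[float, float]


def first_usable_frame(
    frames: Iterable[Frame],
    detect_corners: Callable[[Frame], Sequence[Point]],
    required: int = 4,
) -> Optional[Tuple[int, List[Point]]]:
    """Return (frame index, corners) for the first frame in which at least
    `required` corner coordinates are detected, or None if no frame qualifies."""
    for index, frame in enumerate(frames):
        corners = list(detect_corners(frame))  # corresponds to S18
        if len(corners) >= required:           # corresponds to Yes in S20
            return index, corners
        # No in S20: move on to another constituent image (S14).
    return None
```

Setting `required` to 2 would correspond to the case in which only some of the corners need to be detected directly.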

For example, as shown in FIG. 4(c), the resized image 140a and the resized image 140b each include only a part of the predetermined object, and some of the four corners of the object are not included in the resized image 140a or the resized image 140b. The coordinate acquisition unit 16 therefore determines that the coordinates of the predetermined locations of the predetermined object cannot be acquired from the resized image 140a and the resized image 140b. The resized image 140c, on the other hand, includes the entire predetermined object 86. The coordinate acquisition unit 16 therefore determines that the coordinates of the predetermined locations (that is, the four corners) of the object 86 (that is, coordinates 160a, coordinates 162a, coordinates 164a, and coordinates 166a) can be acquired from the resized image 140c, and acquires these coordinates.

Then, as shown in FIG. 4(d), the image area extraction unit 18 projects the coordinates acquired by the coordinate acquisition unit 16 onto the moving image constituent image 120c (that is, the moving image constituent image from which the resized image 140c was produced) (S22) and specifies the area of the object 86 in the moving image constituent image 120c. Further, as shown in FIG. 4(e), the image area extraction unit 18 acquires an image area 144 including the object 86 (S24). Here, the image area extraction unit 18 acquires, as the image area 144, an area that includes a predetermined margin area 180 around the object 86. The reason for this is described with reference to FIG. 5.

First, as shown in FIG. 5(a), assume that the moving image constituent image 120 includes an object 88. The resizing processing unit 14 resizes the moving image constituent image 120 to generate a resized image 142, as shown in FIG. 5(b). The coordinate acquisition unit 16 then uses the learning model to acquire the coordinates of the four corners of the object 88 (that is, coordinates 160b, coordinates 162b, coordinates 164b, and coordinates 166b) from the resized image 142. Subsequently, the image area extraction unit 18 projects the coordinates acquired by the coordinate acquisition unit 16 onto the moving image constituent image 120 to acquire the image area.

In this case, because the coordinates acquired from the resized image 142 are projected onto the larger moving image constituent image 120 from before resizing, the position of each coordinate may deviate from its actual position. As an example, assume that the moving image constituent image 120 measures 3840 px vertically by 2160 px horizontally and that the resized image 142 measures 300 px by 300 px. Relative to the resized image 142, the moving image constituent image 120 is then 7.2 times larger in the horizontal direction and 12.8 times larger in the vertical direction. Consequently, when the coordinates acquired from the resized image 142 are projected onto the moving image constituent image 120, the projected positions may deviate from the actual positions. For example, the coordinates 160c obtained by projecting the coordinates 160b of the resized image 142 onto the moving image constituent image 120 can be off by a certain number of pixels, as indicated by the black circle in FIG. 5(c). The same applies to the other coordinates (coordinates 162c, coordinates 164c, and coordinates 166c). As a result, the rectangular image area that the image area extraction unit 18 defines by projecting the coordinates acquired by the coordinate acquisition unit 16 onto the moving image constituent image 120 may be defined as an image area 144a (the area bounded by the dotted line in FIG. 5(c)) or as an image area 144b (the area bounded by the dash-dot line in FIG. 5(c)), and may thus deviate from the actual image area of the object 88. Therefore, when projecting the coordinates acquired by the coordinate acquisition unit 16 onto the moving image constituent image and specifying the area of the object in the moving image constituent image, the image area extraction unit 18 acquires, as the image area, an area that includes a predetermined margin area around the object (that is, it extracts the image area coarsely). The size of the margin area may be determined according to, for example, the reduction ratio of the resizing performed by the resizing processing unit 14 or the ratio between the size of the moving image constituent image and the size of the resized image.
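The projection and the coarse extraction with a margin can be illustrated with the numbers of the example above (a frame 2160 px wide and 3840 px high, resized to 300 px by 300 px). The following is a minimal sketch; the function names, the (width, height) size convention, and the margin value of 13 px used in the usage note are assumptions for illustration, not values given in the specification.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]
Size = Tuple[int, int]  # (width, height) in pixels


def project_to_frame(point: Point, resized_size: Size, frame_size: Size) -> Point:
    """Scale a coordinate found in the resized image back to the original frame."""
    x, y = point
    return (x * frame_size[0] / resized_size[0], y * frame_size[1] / resized_size[1])


def region_with_margin(points: Iterable[Point], frame_size: Size, margin: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the axis-aligned region that encloses
    the projected points, widened by `margin` pixels and clamped to the frame."""
    pts = list(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    left = max(0, math.floor(min(xs)) - margin)
    top = max(0, math.floor(min(ys)) - margin)
    right = min(frame_size[0], math.ceil(max(xs)) + margin)
    bottom = min(frame_size[1], math.ceil(max(ys)) + margin)
    return left, top, right, bottom
```

With these sizes, one pixel of the resized image corresponds to 7.2 px horizontally and 12.8 px vertically in the frame, so a margin of about 13 px would cover a one-pixel error in the resized image; the point (100, 100) in the resized image projects to (720, 1280) in the frame.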

Subsequently, when the image area extraction unit 18 has not yet acquired the image area a predetermined number of times (for example, twice) (No in S26), the resizing processing unit 14 resizes the image area 144 extracted by the image area extraction unit 18 to generate a resized image area (S16). That is, the resized image area is generated from the image area 144 obtained in the first step. For example, the resizing processing unit 14 resizes the image area 144 shown in FIG. 6(a), which includes the margin area 180 around the object 86, to generate the resized image area 146 shown in FIG. 6(b). The size of the resized image area is not limited and may be, for example, 300 px by 300 px.

Subsequently, the coordinate acquisition unit 16 uses the learning model 260 to acquire the coordinates of one or more corners (typically, four corners) of the predetermined object included in the resized image area (S18). Here, when acquiring the coordinates of the predetermined object, the coordinate acquisition unit 16 determines whether or not a predetermined number of coordinates can be acquired (S20). However, since S20 has already been passed once, the coordinate acquisition unit 16 may skip this determination and simply acquire the predetermined number of coordinates.

For example, as shown in FIG. 6(c), the coordinate acquisition unit 16 uses the learning model 260 to acquire the coordinates of four centers (that is, coordinates 160d, 162d, 164d, and 166d). These are the centers obtained when four rectangular regions are formed, each centered on one of the four corners of the object 86 and each sized so that the side perpendicular to the shortest straight line from the center to the outer edge of the resized image area 146 touches that outer edge. When the coordinate acquisition unit 16 acquires the coordinates of only some of the corners of the predetermined object in the resized image area 146, it estimates and acquires the coordinates of the remaining corners using the learning model.
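The geometry of one such rectangular region can be sketched as follows. This is an illustrative reading, not the patented implementation: the region is assumed to be a square whose half-side equals the shortest distance from the corner to the image edge, and the function names are hypothetical.

```python
def corner_box(cx, cy, width, height):
    """Square region centered on a corner (cx, cy) of the object, sized so
    that one side touches the nearest outer edge of a width x height image.
    Returns (left, top, right, bottom)."""
    # Shortest distance from the center to any of the four image edges.
    d = min(cx, cy, width - cx, height - cy)
    return (cx - d, cy - d, cx + d, cy + d)


def box_center(box):
    """Recover the corner coordinate as the center of the region."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


# A corner at (40, 100) in a 300 x 300 resized image is closest to the left
# edge (distance 40), so the region spans x = 0..80 and y = 60..140.
box = corner_box(40, 100, 300, 300)
center = box_center(box)
```

Under this reading, the label for a corner is fully determined by the corner position and the image size, and the corner coordinate is read back as the center of the region.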

Then, as shown in FIG. 6(d), the image area extraction unit 18 projects the coordinates acquired by the coordinate acquisition unit 16 onto the moving image constituent image 120c (that is, the moving image constituent image 120c containing the image area 144 from which the resized image area 146 was generated) (S22), and acquires the image area of the object 86 in the moving image constituent image 120c (S24). Because the coordinate acquisition unit 16 has reacquired the coordinates of the four corners of the object 86 from the image area coarsely extracted in the first step, the deviation from the actual four corner coordinates of the object 86 in the original moving image constituent image 120c can be kept small even when the coordinates are projected onto that original image.
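The projection step (S22) amounts to undoing the resize and the crop offset. A minimal sketch, assuming the image area is an axis-aligned rectangle in frame coordinates and the resize is a plain per-axis scaling (the function and parameter names are hypothetical):

```python
def project_to_frame(point, region, resized_w, resized_h):
    """Map a point from resized-image coordinates back to the coordinates of
    the original moving image constituent image.

    region: (left, top, right, bottom) of the cropped image area in the frame.
    """
    left, top, right, bottom = region
    x, y = point
    return (
        left + x * (right - left) / resized_w,
        top + y * (bottom - top) / resized_h,
    )


# A 600 x 600 image area starting at (100, 200), resized to 300 x 300:
# the resized point (150, 75) maps back to (400.0, 350.0) in the frame.
frame_point = project_to_frame((150, 75), (100, 200, 700, 800), 300, 300)
```

For the first pass, the region is the whole frame, so the same function covers both passes.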

Then, since the image area extraction unit 18 has now acquired the image area the predetermined number of times (for example, twice) (Yes in S26), the image processing unit 20 performs predetermined image processing on the acquired image area (S28). The image area extraction unit 18 thereby extracts the object image area 148 (S30) and stores it in, for example, the information storage unit 24.
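The overall flow of S16 to S26 can be sketched as a short coarse-to-fine loop. Everything here is an assumption for illustration: `fake_detect` stands in for the learning model 260 and simply rounds known corner positions to whole pixels of the resized image, which mimics the loss of precision caused by downscaling; the margin value and names are hypothetical.

```python
# Ground-truth corners in frame coordinates, used only by the stand-in detector.
TRUE_CORNERS = [(403, 211), (897, 215), (893, 1003), (399, 999)]


def fake_detect(region, size):
    """Stand-in for the learning model: returns the corners in the coordinates
    of the size x size resized image, rounded to whole pixels."""
    left, top, right, bottom = region
    return [
        (round((x - left) * size / (right - left)),
         round((y - top) * size / (bottom - top)))
        for x, y in TRUE_CORNERS
    ]


def locate_corners(frame_w, frame_h, detect, passes=2, size=300, margin=20):
    """Detect on a resized image, project back to the frame, crop with a
    margin, and repeat for the given number of passes."""
    region = (0, 0, frame_w, frame_h)  # the first pass uses the whole frame
    corners = []
    for i in range(passes):
        left, top, right, bottom = region
        sx, sy = (right - left) / size, (bottom - top) / size
        # Project the detected corners back onto the frame (S22).
        corners = [(left + x * sx, top + y * sy) for x, y in detect(region, size)]
        if i < passes - 1:
            # Coarse image area with a margin, used as input to the next pass.
            xs = [x for x, _ in corners]
            ys = [y for _, y in corners]
            region = (max(0, min(xs) - margin), max(0, min(ys) - margin),
                      min(frame_w, max(xs) + margin), min(frame_h, max(ys) + margin))
    return corners


def max_error(corners):
    """Largest per-axis deviation from the ground-truth corners, in frame pixels."""
    return max(max(abs(x - tx), abs(y - ty))
               for (x, y), (tx, ty) in zip(corners, TRUE_CORNERS))


one_pass = max_error(locate_corners(1920, 1080, fake_detect, passes=1))
two_pass = max_error(locate_corners(1920, 1080, fake_detect, passes=2))
```

In this toy setting the second pass works on a smaller crop, so one pixel of the resized image corresponds to fewer frame pixels and the projected corners land closer to the true ones.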

[Image processing program]
Each component of the image processing system 1 according to the present embodiment shown in FIGS. 1 to 7 can be realized by causing an arithmetic processing unit such as a central processing unit (CPU) to execute a program (that is, an image processing program), in other words, by software processing. The components can also be realized by writing the program in advance into hardware serving as an electronic component, such as an integrated circuit (IC). Software and hardware can also be used together.

The image processing program according to the present embodiment can be incorporated in advance into, for example, an IC or a ROM. The image processing program can also be recorded, as a file in an installable or executable format, on a computer-readable recording medium such as a magnetic recording medium, an optical recording medium, or a semiconductor recording medium, and provided as a computer program. The recording medium storing the program may be a non-transitory recording medium such as a CD-ROM or a DVD. Furthermore, the image processing program can be stored in advance on a computer connected to a communication network such as the Internet and provided by download via the communication network.

The image processing program according to the present embodiment acts on the CPU or the like so that the image processing program functions as the moving image imaging unit 10, the constituent image extraction unit 12, the resizing processing unit 14, the coordinate acquisition unit 16, the image area extraction unit 18, the image processing unit 20, the direction adjustment unit 22, the information storage unit 24, the learning model generation unit 26, the input unit 28, the output unit 30, and the reading unit 32 described with reference to FIGS. 1 to 7.

(Effects of the embodiment)
The image processing system 1 according to the present embodiment uses a learning model constructed by learning from a data set in which one or more rectangular regions, each centered on the coordinates of a predetermined location of the predetermined object to be extracted, are labeled on that object. The image processing system 1 extracts the object image area through two steps. In the first step, it extracts a moving image constituent image from the moving image, generates a resized image from the extracted moving image constituent image, uses the learning model to acquire the coordinates of the predetermined location of the predetermined object from the resized image, and projects the acquired coordinates onto the moving image constituent image to extract an image area including the object. In the second step, it resizes this image area to generate a resized image area, uses the learning model to acquire the coordinates of the predetermined location of the object from the resized image area, and projects the acquired coordinates onto the moving image constituent image to extract the object image area of the object. As a result, according to the image processing system 1, merely capturing a moving image of a plurality of objects placed untidily on, for example, a desk makes it possible to extract the image of the target object (for example, a receipt) accurately, quickly, and appropriately, and to store it in the information storage unit 24 as data for acquiring the various information written on the object or as data for image processing.

Consider, for example, the case where the object to be extracted is a receipt. Compared with the business cards targeted by the conventional technique, receipts wrinkle more easily and are more often bent; they are also thinner than business cards, so their edges are harder to recognize when they are placed on a desk or the like. For example, when the edge of a receipt is hard to detect because of the color relationship between the background and the receipt (as one example, when the receipt is white and the desk forming the background is also white), the conventional technique cannot detect the edge properly and cannot extract the object area from the moving image. Furthermore, unlike business cards, receipts come in a wide variety of aspect ratios. In such cases, a technique premised on edge detection, like the conventional technique, detects what is really a single receipt as a plurality of separate areas divided at the creases when the receipt is wavy or bent (for example, the object 82 shown in FIG. 1(b)). In addition, because the aspect ratio of business cards is substantially constant, it is easy to estimate the area of a business card in a moving image constituent image using a bounding box; receipts, however, vary widely in aspect ratio, so estimating the area of a receipt with a bounding box is difficult in the conventional technique premised on edge detection.

In contrast, the image processing system 1 according to the present embodiment does not require edge detection and can recognize whether or not an object is a receipt based on the coordinates of predetermined locations of the object and the bounding boxes centered on those coordinates. Therefore, even when receipts that are wavy or bent, or a plurality of receipts with differing aspect ratios, are captured in a moving image, each can be appropriately recognized and detected as an individual receipt.

Further, since the learning model 260 according to the present embodiment has been trained on images of receipts with various aspect ratios, the image processing system 1 can appropriately recognize each of a plurality of receipts whose aspect ratios are not constant as a receipt. Furthermore, since the learning model 260 has been trained on the coordinates of the four corners of receipts in various formats and on the bounding boxes containing those four corners, the image processing system 1 can appropriately recognize a receipt as a single receipt even when, for example, a plurality of areas displaying information are printed on it with a large blank space between one area and another. In addition, since the learning model 260 has been trained not only on receipt images but also on receipt images superimposed on various background images, the image processing system 1 can appropriately extract the image of a receipt even when the contrast between the background and the receipt is small.
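Superimposing a receipt image on a background image to build a labeled training sample can be sketched as follows. This is only an illustration under simplifying assumptions: images are plain nested lists of pixel values, the receipt is pasted without rotation or distortion, and the corner labels are taken as the outermost receipt pixels. The function name is hypothetical, and the actual data preparation is not detailed in the text above.

```python
def compose_sample(background, receipt, left, top):
    """Paste `receipt` onto a copy of `background` with its top-left pixel at
    (left, top). Returns the composed image and the four corner labels in
    clockwise order starting from the top-left corner."""
    h, w = len(receipt), len(receipt[0])
    if left < 0 or top < 0 or top + h > len(background) or left + w > len(background[0]):
        raise ValueError("receipt does not fit inside the background")
    image = [row[:] for row in background]  # leave the background untouched
    for dy, row in enumerate(receipt):
        image[top + dy][left:left + w] = row
    corners = [
        (left, top),
        (left + w - 1, top),
        (left + w - 1, top + h - 1),
        (left, top + h - 1),
    ]
    return image, corners


# A 5 x 4 background of zeros with a 3 x 2 "receipt" of nines pasted at (1, 1).
background = [[0] * 5 for _ in range(4)]
receipt = [[9] * 3 for _ in range(2)]
image, corners = compose_sample(background, receipt, 1, 1)
```

Repeating this with many backgrounds and positions yields samples in which the same receipt appears under different contrast conditions, each with known corner coordinates.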

Specifically, in the image processing system 1 according to the present embodiment, a learning model 260 in which the predetermined object to be extracted is a receipt was prepared, and a test was performed by having the moving image imaging unit 10 capture an area containing a receipt, a business card, and a smartphone. As a result, the image processing system 1 appropriately extracted the object image area of the receipt from the moving image constituent images in real time. On the other hand, the image processing system 1 did not recognize the business card or the smartphone as a receipt.

Although embodiments of the present invention have been described above, the embodiments described above do not limit the invention according to the claims. It should also be noted that not all combinations of the features described in the embodiments are essential to the means for solving the problem addressed by the invention. Furthermore, the technical elements of the embodiments described above may be applied alone, or may be applied divided into a plurality of parts, such as program components and hardware components.

The image processing system according to the present embodiment can also be described by the following appended item, which should not be confused with the claims.
(Appendix 1)
An image processing system comprising:
a moving image imaging unit that captures a moving image of an object;
a constituent image extraction unit that extracts a moving image constituent image from the moving image;
a resizing processing unit that resizes the moving image constituent image to generate a resized image;
a coordinate acquisition unit that acquires, from the resized image, the coordinates of a predetermined location of the object; and
an image area extraction unit that projects the coordinates onto the moving image constituent image and extracts, from the moving image constituent image, an image area including the object,
wherein the resizing processing unit resizes the image area to generate a resized image area,
the coordinate acquisition unit reacquires the coordinates of the predetermined location of the object from the resized image area, and
the image area extraction unit projects the reacquired coordinates onto the moving image constituent image to extract the object image area of the object.

1 Image processing system
2 Information terminal
3 Server
4 Communication network
10 Moving image imaging unit
12 Constituent image extraction unit
14 Resizing processing unit
16 Coordinate acquisition unit
18 Image area extraction unit
20 Image processing unit
22 Direction adjustment unit
24 Information storage unit
26 Learning model generation unit
28 Input unit
30 Output unit
32 Reading unit
80, 82, 84, 86, 88 Object
90 Desk
100 Resized image
110 Moving image
120, 120a, 120b, 120c Moving image constituent image
130 Image
140a, 140b, 140c Resized image
142 Resized image
144, 144a, 144b Image area
146 Resized image area
148 Object image area
150, 152, 154, 156 Corner
150a, 152a, 154a, 156a Coordinates
160, 162, 164, 166 Coordinates
160a, 162a, 164a, 166a Coordinates
160b, 162b, 164b, 166b Coordinates
160c, 162c, 164c, 166c Coordinates
160d, 162d, 164d, 166d Coordinates
170, 172, 174, 176 Rectangular region
170a Side
180 Margin area
260 Learning model

Claims

An image processing system comprising:
a resizing processing unit that resizes a moving image constituent image of a moving image in which an object is captured to generate a resized image;
a coordinate acquisition unit that acquires the coordinates of a corner of the object from the resized image; and
an image area extraction unit that projects the coordinates acquired from the resized image onto the moving image constituent image and extracts an image area including the object from the moving image constituent image,
wherein the resizing processing unit resizes the image area to generate a resized image area,
the coordinate acquisition unit reacquires the coordinates of the corner of the object from the resized image area,
the image area extraction unit projects the reacquired coordinates onto the moving image constituent image to extract the object image area of the object, and
the coordinates of the corner are the coordinates of the center obtained when one or more rectangular regions centered on a corner of the object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the resized image touches the outer edge of the resized image.

The image processing system according to claim 1, wherein the image area extraction unit extracts the image area to which a predetermined margin area is added from the moving image constituent image.

The image processing system according to claim 1 or 2, wherein the coordinate acquisition unit acquires the coordinates of the corner of the object by using a learning model prepared in advance.

The image processing system according to any one of claims 1 to 3, further comprising an image processing unit that performs predetermined image processing on the object image area.

An image processing system comprising:
a coordinate acquisition unit that acquires the coordinates of a predetermined location of an object based on a moving image constituent image of a moving image in which the object is captured; and
an image area extraction unit that extracts an image area including the object from the moving image constituent image based on the coordinates,
wherein the coordinate acquisition unit acquires the coordinates of the predetermined location of the object based on the image area,
the image area extraction unit projects the acquired coordinates onto the moving image constituent image to extract the object image area of the object, and
the coordinates of the predetermined location are the coordinates of the center obtained when one or more rectangular regions centered on a corner of the object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the moving image constituent image, or to the outer edge of a generated image generated based on the moving image constituent image, touches the outer edge of the moving image constituent image or the outer edge of the generated image.

An image processing method for an image processing system, comprising:
a resizing step of resizing a moving image constituent image of a moving image in which an object is captured to generate a resized image;
a coordinate acquisition step of acquiring the coordinates of a corner of the object from the resized image;
an image area extraction step of projecting the coordinates acquired from the resized image onto the moving image constituent image and extracting an image area including the object from the moving image constituent image;
a step of resizing the image area to generate a resized image area;
a step of reacquiring the coordinates of the corner of the object from the resized image area; and
a step of projecting the reacquired coordinates onto the moving image constituent image to extract the object image area of the object,
wherein the coordinates of the corner are the coordinates of the center obtained when one or more rectangular regions centered on a corner of the object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the resized image touches the outer edge of the resized image.

An image processing program for an image processing system, the program causing a computer to realize:
a resizing function of resizing a moving image constituent image of a moving image in which an object is captured to generate a resized image;
a coordinate acquisition function of acquiring the coordinates of a corner of the object from the resized image;
an image area extraction function of projecting the coordinates acquired from the resized image onto the moving image constituent image and extracting an image area including the object from the moving image constituent image;
a function of resizing the image area to generate a resized image area;
a function of reacquiring the coordinates of the corner of the object from the resized image area; and
a function of projecting the reacquired coordinates onto the moving image constituent image to extract the object image area of the object,
wherein the coordinates of the corner are the coordinates of the center obtained when one or more rectangular regions centered on a corner of the object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the resized image touches the outer edge of the resized image.

An image processing server comprising:
a resizing processing unit that resizes a moving image constituent image of a moving image in which an object is captured to generate a resized image;
a coordinate acquisition unit that acquires the coordinates of a corner of the object from the resized image; and
an image area extraction unit that projects the coordinates acquired from the resized image onto the moving image constituent image and extracts an image area including the object from the moving image constituent image,
wherein the resizing processing unit resizes the image area to generate a resized image area,
the coordinate acquisition unit reacquires the coordinates of the corner of the object from the resized image area,
the image area extraction unit projects the reacquired coordinates onto the moving image constituent image to extract the object image area of the object, and
the coordinates of the corner are the coordinates of the center obtained when one or more rectangular regions centered on a corner of the object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the resized image touches the outer edge of the resized image.

A learning model that causes a processor to function so that, when a captured image is input, one or more rectangular regions centered on one or more corners of a predetermined object are output in order to identify whether or not an object included in the captured image is the predetermined object,
wherein the learning model has been trained using, as teacher data, an image including the predetermined object, a background image on which the predetermined object can be placed, and a combination of the image including the predetermined object and the background image, and
in the training, one or more rectangular regions centered on a corner of the predetermined object are formed, each rectangular region being sized so that its side perpendicular to the shortest straight line from the center to the outer edge of the image including the predetermined object touches the outer edge of that image, and the predetermined object in the image is identified using the formed rectangular regions and the coordinates of their centers.