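WO2022202178A1 - Learning data generation device, learning data generation system, and learning data generation method for machine learning - Google Patents
Learning data generation device, learning data generation system, and learning data generation method for machine learning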
- Publication number
- WO2022202178A1 (PCT/JP2022/009062)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- learning data
- unit
- frame
- data generation
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- the present invention relates to a device that generates learning data for machine learning in image analysis, and more particularly to a learning data generation device, a learning data generation system, and a learning data generation method that automate collection and annotation of learning data.
- For the task of detecting objects (the object detection task) in temporally continuous image data such as moving images and live footage (hereinafter simply "video"), the application of machine-learning-based AI (Artificial Intelligence, hereinafter simply "AI") is progressing (see Non-Patent Documents 1 and 2).
- Supervised learning is often used as a method for learning AI.
- Training data is a pair consisting of an input (for example, image data) for a certain task and the expected output value (teacher data) for that input, from which the AI learns.
- adding teacher data is called annotation, and estimating an output value for an input using a learned AI (learned model) is called inference.
- Inference accuracy may degrade when, for example, objects that do not exist in the learning data appear in the actual operating environment. Therefore, in order to put AI into practical use, robustness against environmental differences and the ability to make correct inferences even for unknown inputs (generalization performance) are required.
- Patent Document 1 discloses determining the category of an object from an image of the object.
- the expected inference accuracy may not be obtained due to differences in the environment.
- Inference is performed frame by frame, so the target may be detected in one frame while the same target goes undetected in another frame (a missed detection).
- the conventional technology has the problem that it takes time and effort to collect overlooked image data and annotate the collected image data in order to improve the accuracy of inference.
- Patent Document 1 does not describe automating the work of collecting and annotating overlooked image data.
- To solve the problems of the conventional example, the present invention is a learning data generation device for generating learning data for machine learning in image analysis, comprising: an inference unit that infers object detection in video frames using a trained model; an interpolation unit that detects, from a plurality of temporally successive inferred video frames, missed video frames in which object detection by the trained model failed, and interpolates the detection results of the missed frames from the plurality of video frames; and a generation unit that generates learning data using the interpolated detection results.
- In the learning data generation device, the interpolation unit acquires the identification number of the missed video frame and estimates the position and type of the object from a plurality of temporally successive video frames, and the generation unit generates learning data by associating the missed frame corresponding to the identification number with the estimated position and type of the object.
- The learning data generation device further comprises a video data storage unit that stores a plurality of temporally successive video frames, a learning data storage unit that stores the generated learning data, and a frame extraction unit that extracts the missed video frame based on the identification number of the missed video frame detected by the interpolation unit and outputs it to the generation unit.
- a learning data generation system is characterized by comprising the learning data generation device described above and a photographing device that captures a video and provides video frames to the learning data generation device.
- The present invention is also a learning data generation method for generating learning data for machine learning in image analysis, characterized by: inferring object detection in video frames with a trained model; detecting, from a plurality of temporally successive inferred video frames, missed video frames in which object detection by the trained model failed; interpolating the detection results of the detected missed frames from the plurality of video frames; and generating learning data using the interpolated detection results.
- The learning data generation device for generating learning data for machine learning in image analysis comprises an inference unit that infers object detection in video frames using a trained model, an interpolation unit that detects missed video frames in which object detection by the trained model failed and interpolates the detection results of the detected missed frames from a plurality of video frames, and a generation unit that generates learning data using the interpolated detection results.
- FIG. 2 is a flowchart showing the schematic processing in this device.
- FIG. 3 is an explanatory diagram showing an example of a video.
- FIG. 4 is an explanatory diagram showing an example of video of an inference result by the object detection AI.
- FIG. 5 is an explanatory diagram showing an example of numerical data of an inference result by the object detection AI.
- FIG. 6 is an explanatory diagram showing an example of numerical data of detection results after interpolation.
- FIG. 7 is an explanatory diagram showing an example of video of an interpolated detection result.
- FIG. 8 is a flowchart of the interpolation processing.
- FIG. 9 is an explanatory diagram showing an example of specifying a position based on the one frame immediately before and the one frame immediately after.
- FIG. 10 is an explanatory diagram showing an example of specifying a position based on several frames immediately before and after.
- A learning data generation device (this device) according to an embodiment of the present invention generates learning data for machine learning in image analysis. An inference unit infers object detection in video frames using a trained model; an interpolation unit detects, from a plurality of temporally successive inferred video frames, missed video frames in which object detection by the trained model failed, and interpolates the detection results of the missed frames from the plurality of video frames; and a generation unit generates learning data using the interpolated detection results. This improves the efficiency of AI creation work.
- FIG. 1 is a schematic diagram of the configuration of this device.
- As shown in FIG. 1, the device 1 is implemented by an information processing apparatus such as a computer, and basically comprises a control unit 10 and a storage unit 20.
- The control unit 10 includes a video acquisition unit 11, an object detection unit 12, a missed frame extraction unit 13, and a learning data generation unit 14.
- The storage unit 20 includes a video data storage unit 21, a detection result storage unit 22, and a learning data storage unit 23. Each part is described specifically below.
- The video acquisition unit 11 reads temporally successive video frames (video data/image data) from the video data storage unit 21 of the storage unit 20 and outputs them to the object detection unit 12.
- The object detection unit 12 performs object detection inference on the input video data using a trained AI model, detects missed frames, and performs interpolation processing.
- The object detection unit 12 includes an inference unit 121 and an interpolation unit 122.
- The processing by the inference unit 121 and the processing by the interpolation unit 122 may be performed in parallel, or the processing by the interpolation unit 122 may be performed after the processing by the inference unit 121.
- The inference unit 121 performs inference with the object detection AI: it infers the position and type of each object in every video frame using a trained object detection AI (trained model).
- As the method/algorithm for detecting objects, any publicly known one may be selected as long as it is based on machine learning. Examples include YOLO (You Only Look Once; see Non-Patent Document 1) and SSD (Single Shot MultiBox Detector; see Non-Patent Document 2).
- The interpolation unit 122 interpolates the inference results of the inference unit 121: when the object detection AI misses an object, it fills in the result using a method/algorithm capable of such interpolation. That is, it estimates the position and type of an object missed by the object detection AI.
- As the interpolation method/algorithm, one or more known ones can be selected arbitrarily, as long as they exploit the temporal context (passage of time) in the video or in the inference results of the object detection AI.
- Examples include tracking using the Kanade-Lucas-Tomasi algorithm (Carlo Tomasi, Takeo Kanade, "Detection and Tracking of Point Features," Technical Report CMU-CS-91-132, April 1991) and the Kalman filter (Rudolf Emil Kalman, "On the general theory of control systems," Proc. the 1st IFAC World Congress, August 1960).
- The object detection unit 12 records the following information for each object in each video frame, but only for objects that were missed by the object detection AI and filled in by the interpolation processing.
- the first is the frame number, which is the identification information of the frame.
- frame time information may be used instead of the frame number as the frame identification information.
- the second is the information of the bounding box, which is the positional information of the detected object.
- This information represents the position of an object within a video frame, and is a set of coordinate values (for example, left edge, top edge, right edge, bottom edge, etc.) indicating the object area.
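The recorded items described above can be pictured as a small record type. The following is only a minimal sketch: the class name `InterpolatedRecord`, the field names, the pixel units, and the `area` helper are illustrative assumptions and do not appear in the patent text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class InterpolatedRecord:
    """One interpolated (previously missed) object in one video frame.

    Hypothetical structure for illustration; not defined in the source.
    """
    frame_number: int                        # frame identification information
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom), assumed pixels
    frame_time: Optional[float] = None       # time may identify the frame instead

    def area(self) -> float:
        # Area of the object region; clamps to zero for degenerate boxes.
        left, top, right, bottom = self.bbox
        return max(0.0, right - left) * max(0.0, bottom - top)


# Example: an interpolated detection recorded for frame number 3.
record = InterpolatedRecord(frame_number=3, bbox=(30.0, 50.0, 70.0, 90.0))
```

Keeping the frame identifier alongside the bounding box is what lets a later step look up the matching video frame by number or time.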
- The missed frame extraction unit 13 extracts from the video data storage unit 21 the video frame corresponding to the frame number (or time), that is, the first item (frame identification information) recorded by the object detection unit 12, and outputs the extracted video frame to the learning data generation unit 14.
- The learning data generation unit 14 annotates the missed objects in missed video frames. For a missed video frame input from the missed frame extraction unit 13, it uses the corresponding detection result stored in the detection result storage unit 22 as the teacher data as is, and stores the pair of the missed video frame and the teacher data in the learning data storage unit 23 as learning data for re-learning.
- the video data storage unit 21 stores video frames over time, and each video frame is given a frame number, a shooting time, or both.
- The video frames in the video data storage unit 21 are read by the video acquisition unit 11; in addition, when accessed by the missed frame extraction unit 13, the video frames to be interpolated are read out based on the corresponding frame number or time.
- The detection result storage unit 22 stores the detection results of the inference unit 121 of the object detection unit 12 and the interpolation results (interpolation content) of the interpolation unit 122. Further, when accessed by the learning data generation unit 14, the detection result storage unit 22 outputs the interpolation result that becomes the teacher data. The teacher data acquisition processing is described later.
- the learning data storage unit 23 stores missed video frames and teacher data output from the learning data generation unit 14 as learning data.
- the learning data stored in the learning data storage unit 23 becomes learning data for re-learning.
- A learning data generation system (this system) may be configured by connecting a photographing device such as a camera to the device 1 and storing the video data captured by the photographing device in the video data storage unit 21 of the storage unit 20.
- the first condition is that there is no occluder between the object and the camera.
- the second condition is that objects do not overlap or contact each other.
- the third condition is that the speed of the object is constant or the change in speed of the object is sufficiently small relative to the number of frames per second of the image.
- FIG. 2 is a flowchart showing a schematic processing in this device.
- The inference unit 121 of the object detection unit 12 performs inference processing for object detection (S1), and the interpolation unit 122 then interpolates the inference results (S2).
- The missed frame extraction unit 13 extracts from the video data storage unit 21, by frame number or the like, the video frames for which a missed object has been interpolated (S3). The learning data generation unit 14 then acquires the detection result corresponding to each extracted missed video frame from the detection result storage unit 22, uses that detection result as teacher data, and generates learning data by pairing the missed video frame with the teacher data (S4), after which the process ends.
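The flow from inference to learning data generation can be sketched roughly as follows. This is a sketch under stated assumptions, not the patent's implementation: the function `generate_learning_data`, the `infer` and `interpolate` callables, and the use of strings as stand-ins for image data are all hypothetical, and each frame is assumed to contain at most one object.

```python
from typing import Callable, Dict, List, Optional, Tuple

Frame = str                                    # placeholder for image data (assumption)
Detection = Tuple[float, float, float, float]  # bounding box only, for brevity


def generate_learning_data(
    frames: Dict[int, Frame],
    infer: Callable[[Frame], Optional[Detection]],
    interpolate: Callable[[Dict[int, Detection], List[int]], Dict[int, Detection]],
) -> List[Tuple[Frame, Detection]]:
    """Return (missed frame, teacher data) pairs for re-learning."""
    # S1: run the trained model on every frame; None means nothing was detected.
    detections: Dict[int, Detection] = {}
    for number, frame in frames.items():
        result = infer(frame)
        if result is not None:
            detections[number] = result

    # S2: estimate detections for frames the model missed.
    interpolated = interpolate(detections, sorted(frames))

    # S3: pull the missed frames out of storage by frame number.
    missed = {number: frames[number] for number in sorted(interpolated)}

    # S4: pair each missed frame with its interpolated result as teacher data.
    return [(missed[number], interpolated[number]) for number in missed]
```

Passing the detector and the interpolation routine in as callables mirrors the text's point that both the detection method and the interpolation method can be chosen freely.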
- The device 1 can thus automate the detection of missed video frames whose detection targets are interpolated by the interpolation unit 122, and can also automate annotation by using the inference results of the frames temporally preceding and following the missed video frames.
- FIG. 3 is an explanatory diagram showing an example of a video.
- FIG. 4 is an explanatory diagram showing an example of an inference result by the object detection AI.
- FIG. 5 is an explanatory diagram showing an example of numerical data of the inference result by the object detection AI.
- FIG. 6 is an explanatory diagram showing an example of numerical data of the detection result after interpolation.
- FIG. 7 is an explanatory diagram showing an example of video of the interpolated detection result.
- FIG. 3 shows an example of a video for object detection: a five-frame video of a car traveling at constant speed in front of a building in the background. Inference is performed on the video in FIG. 3 by a trained object detection AI that has learned cars in advance, and FIG. 4 shows the bounding boxes (1) obtained as the inference result superimposed on the video.
- FIG. 5 shows the inference results shown in FIG. 4 as numerical data in tabular form. Each row of the table in FIG. 5 represents information about each detected object.
- FIG. 6 shows the results of interpolation processing performed on the inference results of the object detection AI.
- the row surrounded by a thick frame is the interpolated detection result (2).
- FIG. 7 shows the interpolated detection result (2) superimposed on the image.
- In this example, the bounding box (1) properly overlaps the car region in the frame, and the interpolation processing works ideally. As a result, the image of frame number 3 is obtained as the missed frame, together with the interpolated detection result, and this pair is used as learning data.
- FIG. 8 is a flowchart of interpolation processing. Although various methods and algorithms can be applied to the contents of the interpolation processing, FIG. 8 shows a simple example.
- The interpolation processing algorithm checks the inference results against the frame numbers in ascending order, and if there is no detection result for a certain frame number (that is, a detection was missed), it interpolates based on the detection results of the preceding and succeeding frames.
- The i-th frame is checked to determine whether a detection result exists for that frame (S14). If there is no detection result in determination S14 (No), the frame is regarded as a missed frame candidate. If there is a detection result in determination S14 (Yes), the frame number is incremented (S15), and the process returns to determination S13.
- The candidate is then verified to see whether it really is a missed frame, that is, whether the object was not detected in that frame even though it was detected in both the previous frame and the next frame (S16).
- the bounding box of the object is set to the average value of the previous frame and the next frame, and the class of the object is set to the value of the previous frame.
- the frame number is incremented (S18), and the process returns to the determination process S13.
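The simple interpolation described above can be sketched as follows. This is an illustrative sketch only: the names `Detection` and `interpolate_missed_frames` are hypothetical, frame numbers are assumed to start at 1, and each frame is assumed to contain at most one object.

```python
from typing import Dict, NamedTuple, Tuple


class Detection(NamedTuple):
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom)
    label: str                               # object class


def interpolate_missed_frames(
    detections: Dict[int, Detection], num_frames: int
) -> Dict[int, Detection]:
    """Return interpolated detections keyed by missed frame number."""
    interpolated: Dict[int, Detection] = {}
    # Walk the frame numbers in ascending order.
    for i in range(1, num_frames + 1):
        if i in detections:
            continue  # a detection exists, so this frame was not missed
        # Candidate missed frame: confirm the object was detected in both
        # the previous frame and the next frame.
        prev_det = detections.get(i - 1)
        next_det = detections.get(i + 1)
        if prev_det is None or next_det is None:
            continue
        # Bounding box: average of the previous and next frames.
        bbox = tuple((p + n) / 2 for p, n in zip(prev_det.bbox, next_det.bbox))
        # Class: taken from the previous frame.
        interpolated[i] = Detection(bbox, prev_det.label)
    return interpolated


# Five frames of a car moving right at constant speed; frame 3 was missed.
example = {
    1: Detection((10.0, 50.0, 50.0, 90.0), "car"),
    2: Detection((20.0, 50.0, 60.0, 90.0), "car"),
    4: Detection((40.0, 50.0, 80.0, 90.0), "car"),
    5: Detection((50.0, 50.0, 90.0, 90.0), "car"),
}
result = interpolate_missed_frames(example, num_frames=5)
```

Here frame 3 receives the box (30, 50, 70, 90) with class "car". A frame with no detection at the very start or end of the video is left alone, because it lacks one of the two neighbouring detections needed for the check.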
- FIG. 9 is an explanatory diagram showing an example of specifying a position based on the one frame immediately before and the one frame immediately after.
- FIG. 10 is an explanatory diagram showing an example of specifying a position based on several frames immediately before and after.
- The position (coordinates of the bounding box) is estimated by referring only to the immediately preceding frame (i-1) and the immediately succeeding frame (i+1). This is calculated by Equation (1) shown in FIG. 9.
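Equation (1) itself is not reproduced in this text. Given the rule stated for the interpolation processing, where the bounding box is set to the average of the previous and next frames, it presumably takes a form like the following. The symbols $b_k$ and $\hat{b}_i$ are assumptions introduced here for illustration.

```latex
\hat{b}_i = \frac{b_{i-1} + b_{i+1}}{2},
\qquad
b_k = \left( x^{\mathrm{left}}_k,\; y^{\mathrm{top}}_k,\; x^{\mathrm{right}}_k,\; y^{\mathrm{bottom}}_k \right)
```

As an illustrative calculation with made-up values, $b_2 = (20, 50, 60, 90)$ and $b_4 = (40, 50, 80, 90)$ give $\hat{b}_3 = (30, 50, 70, 90)$. The average is exact only when the object moves at constant speed, which matches the third condition stated earlier.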
- In this device, the inference unit 121 infers object detection in video frames using a trained model; the interpolation unit 122 detects, from a plurality of temporally successive inferred video frames, missed video frames in which object detection by the trained model failed, and interpolates the detection results of those missed frames from the plurality of video frames; and the learning data generation unit 14 generates learning data using the interpolated detection results. The collection and annotation of missed frames can therefore be automated, which makes it easier to improve inference accuracy, reduces the burden on workers and the risk of human error in AI creation, and improves efficiency and reduces man-hours.
- the present invention is suitable for a learning data generation device, a learning data generation system, and a learning data generation method that automate the work of collecting and annotating overlooked image data.
- 1... learning data generation device, 10... control unit, 11... video acquisition unit, 12... object detection unit, 13... missed frame extraction unit, 14... learning data generation unit, 20... storage unit, 21... video data storage unit, 22... detection result storage unit, 23... learning data storage unit, 121... inference unit, 122... interpolation unit
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023508880A (JPWO2022202178A1) | 2021-03-23 | 2022-03-03 | |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021048089 | 2021-03-23 | | |
| JP2021-048089 | 2021-03-23 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022202178A1 (ja) | 2022-09-29 |
Family
ID=83396925
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2022/009062 Ceased WO2022202178A1 (ja) | 2021-03-23 | 2022-03-03 | 機械学習用の学習データ生成装置、学習データ生成システム及び学習データ生成方法 |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2022202178A1 |
| WO (1) | WO2022202178A1 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
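| JP2016001397A (ja) * | 2014-06-11 | 2016-01-07 | Canon Inc. | Image processing device, image processing method, and computer program |
| JP2017168029A (ja) * | 2016-03-18 | 2017-09-21 | KDDI Corporation | Device, program, and method for predicting the position of a survey target based on action value |
| JP2018112996A (ja) * | 2017-01-13 | 2018-07-19 | Canon Inc. | Video recognition device, video recognition method, and program |
| JP2021012446A (ja) * | 2019-07-04 | 2021-02-04 | KDDI Corporation | Learning device and program |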
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
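| JP4206369B2 (ja) * | 2004-07-15 | 2009-01-07 | Japan Broadcasting Corporation (NHK) | Time-series data complementing device, method therefor, and program therefor |
| JP7065557B2 (ja) * | 2018-12-05 | 2022-05-12 | KDDI Corporation | Video analysis device, program, and method for tracking persons |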
-
2022
- 2022-03-03 JP JP2023508880A patent/JPWO2022202178A1/ja active Pending
- 2022-03-03 WO PCT/JP2022/009062 patent/WO2022202178A1/ja not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| HIRONORI HATTORI, IKUHISA MITSUGAMI, MASAYUKI MUKUNOKI, MICHIHIKO MINOH: "Scene Adaptation Method of HOG Human Detector for Fixed Camera Video", IEICE TECHNICAL REPORT, PRMU, vol. 109, no. 471 (PRMU2009-261), 8 March 2010 (2010-03-08), JP, pages 163 - 168, XP009539961 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2022202178A1 | 2022-09-29 |
Similar Documents
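| Publication | Title | Publication Date |
|---|---|---|
| Doersch et al. | Tap-vid: A benchmark for tracking any point in a video | |
| Wang et al. | Monocular 3d object detection with depth from motion | |
| NL2024682B1 (en) | Assembly monitoring method and device based on deep learning, and readable storage medium | |
| US11967088B2 (en) | Method and apparatus for tracking target | |
| Biresaw et al. | ViTBAT: Video tracking and behavior annotation tool | |
| Nawaz et al. | A protocol for evaluating video trackers under real-world conditions | |
| Usmani et al. | A reinforcement learning based adaptive ROI generation for video object segmentation | |
| Wang et al. | A semi-automatic video labeling tool for autonomous driving based on multi-object detector and tracker | |
| CN118314123B (zh) | Welding radiograph defect detection method, apparatus, computer device, and medium | |
| Odobez et al. | Embedding motion in model-based stochastic tracking | |
| CN117078927A (zh) | Joint target annotation method, apparatus, device, and storage medium | |
| CN116420163A (zh) | Recognition system, recognition method, program, learning method, trained model, distillation model, and learning dataset generation method | |
| CN112037267A (zh) | Method for generating a panoramic graphic of product placement positions based on video target tracking | |
| JP2022027439A (ja) | Image/video analysis method using activity signatures and analysis system therefor | |
| CN107833240B (zh) | Target motion trajectory extraction and analysis method guided by multiple tracking cues | |
| CN114529587A (zh) | Video target tracking method, apparatus, electronic device, and storage medium | |
| WO2022202178A1 (ja) | Learning data generation device, learning data generation system, and learning data generation method for machine learning | |
| CN110866939A (zh) | Robot motion state recognition method based on camera pose estimation and deep learning | |
| Amini-Naieni et al. | Open-world object counting in videos | |
| CN120088711A (zh) | Multi-camera video crowd analysis system and method based on a large model | |
| CN114037950A (zh) | Multi-pedestrian tracking method and apparatus based on pedestrian and head detection | |
| CN119672766A (zh) | Method for hand joint detection in an eyewear device, storage medium, electronic device, and product | |
| CN119131524A (zh) | Image processing model training and target extraction methods, and related apparatus | |
| US20240428588A1 (en) | System and Method for Multi-spot Beam Tracking | |
| JP5144706B2 (ja) | Image processing device | |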
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22774974; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2023508880; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22774974; Country of ref document: EP; Kind code of ref document: A1 |