WO2024247193A1 - Image evaluation device, method, and program - Google Patents
- Publication number
- WO2024247193A1 (PCT/JP2023/020354)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- person
- movement
- model
- video
- relating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
Definitions
- Embodiments of the present invention relate to an image evaluation device, method, and program.
- AQA: action quality assessment
- Non-Patent Document 1 describes a technology that takes as input two movements with different scores that are executions of the same technique, and learns to predict the difference in score, focusing in particular on the subtle differences in quality between similar movements.
- input data is input into a model to obtain an estimated score, a loss based on this estimated score and the true score (here, the MSE (Mean Squared Error) between the estimated score and the true score) is calculated, and the model parameters are updated using backpropagation.
- MSE: Mean Squared Error
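The loss and update step described above can be written out as follows. This is only a sketch: the batch size $N$, the learning rate $\eta$, and the use of plain gradient descent are assumptions for illustration; the source states only that the MSE between estimated and true scores is computed and that the parameters are updated by backpropagation.

```latex
\mathcal{L}_{\mathrm{MSE}}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N} \bigl( \hat{s}_i(\theta) - s_i \bigr)^2,
\qquad
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \mathcal{L}_{\mathrm{MSE}}(\theta)
```

Here $\hat{s}_i(\theta)$ is the estimated score for the $i$-th input, $s_i$ is its true score, and $\theta$ denotes the model parameters.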
- When a first original video of a scene that includes human movement and a second original video of a different scene are input into individual models that share parameter weighting, and the difference in scores between the original videos is predicted, the prediction may focus on the difference in background if the backgrounds of the two videos differ, making it impossible to appropriately evaluate the quality of human movement.
- This invention was made in light of the above-mentioned circumstances, and its purpose is to provide an image evaluation device, method, and program that can appropriately evaluate the quality of human movements in moving images.
- An image evaluation device includes a first update unit that inputs a video of a first person's movement and a video of a second person's movement into a model that determines the features of each of the videos, calculates an estimate of the difference between the evaluation score of the first person's movement in the video of the first person's movement and the evaluation score of the second person's movement in the video of the second person's movement based on the features output from the model, and updates parameters of the model so that the calculated estimate approaches a true value; and a second update unit that inputs into the model each of the videos when a mask process is applied to a person-related area for at least one of the video of the first person's movement and the video of the second person's movement, calculates an estimate of the difference between the evaluation score of the first person's movement and the evaluation score of the second person's movement in the video of the output model based on the features output from the model in response to the input, and updates parameters of the model so that the calculated estimate approaches a
- An image evaluation method is a method performed by an image evaluation device, comprising: a first update unit of the image evaluation device inputs a video relating to the movement of a first person and a video relating to the movement of a second person into a model that determines features of the respective videos, calculates an estimate of the difference between the evaluation score of the first person's movement in the video relating to the movement of the first person and the evaluation score of the second person's movement in the video relating to the movement of the second person based on the features output from the model, and updates parameters of the model so that the calculated estimate approaches a true value; and a second update unit of the image evaluation device inputs into the model each of the videos when a mask process is applied to a person-related area for at least one of the video relating to the movement of the first person and the video relating to the movement of the second person, calculates an estimate of the difference between the evaluation score of the first person's movement in the video relating to the output model based on the features output from
- the present invention makes it possible to appropriately evaluate the quality of human movements in moving images.
- FIG. 1 is a diagram showing an application example of an image evaluation device according to an embodiment of the present invention.
- FIG. 2 is a diagram showing an example of an original image.
- FIG. 3 is a diagram showing an example of an athlete mask image.
- FIG. 4 is a diagram illustrating an example of a learning process of the first pattern performed by the learning unit.
- FIG. 5 is a diagram illustrating an example of a learning process of the second pattern performed by the learning unit.
- FIG. 6 is a diagram illustrating an example of a learning process of the third pattern performed by the learning unit.
- FIG. 7 is a block diagram showing an example of the hardware configuration of an image evaluation device according to one embodiment of the present invention.
- FIG. 1 is a diagram showing an application example of an image evaluation device according to an embodiment of the present invention.
- an image evaluation device (video evaluation device) 100 includes a learning processing unit 10 and an estimation processing unit 20.
- the learning processing unit 10 includes an input unit 11, a learning unit (update unit) 12, and a learning model DB (database) 13.
- the estimation processing unit 20 includes an input unit 21, an estimation unit 22, and a learning model DB 23.
- the learning model DB 13 of the learning processing unit 10 and the learning model DB 23 of the estimation processing unit 20 store models used for extracting features of a plurality of moving images including human actions. In addition, these may be a common model DB, i.e., a single model DB.
- the input unit 21 of the estimation processing unit 20 inputs a plurality of video images including human movements.
- the estimation unit 22 inputs the input video images into a model stored in the learning model DB 23, and estimates the output from the model, which is an evaluation score of the human movements in the video images.
- the input unit 11 of the learning processing unit 10 inputs a plurality of videos including human movements for the purpose of learning the parameters of the model stored in the learning model DB 13, and also inputs true values of the scores of the evaluation of the movements of the humans in the videos.
- the learning unit 12 inputs the input videos to the model stored in the learning model DB 13, and estimates the scores of the evaluation of the movements of the humans in the videos, which are the output from the model. Then, based on the difference between the estimated score and the true value of the score, the learning unit 12 updates the parameters of the model so that the difference approaches zero or becomes zero, and stores this updated model in the learning model DB 23 of the estimation processing unit 20.
- the model parameters are learned using two types of video: an original video and an athlete masked video in which the area showing the human movements in the original video, in this case the athlete's movements, is masked.
- FIG. 2 is a diagram showing an example of an original image.
- original images A and B shown in FIG. 2 can be used as original images to learn the parameters of the model.
- Original video A is an original video showing the action of high diving athlete A in a swimming competition, that is, the action of the athlete diving from a high diving board into the water.
- Original video B is an original video showing the movements of high diving athlete B, and the athlete and the timing of the competition are different from those in original video A.
- the movements of the athlete in original video A and original video B are assumed to be movements related to the execution of the same type of high diving technique.
- FIG. 3 is a diagram showing an example of an athlete mask image.
- athlete mask image A and athlete mask image B shown in FIG. 3 may be used as the athlete mask images for learning the model parameters.
- Athlete mask image A is an image generated externally or within the learning processing unit 10 by masking the area of athlete A in the original image A by filling in a rectangular area.
- Athlete mask image B is an image generated externally or within the learning processing unit 10 by masking the area of athlete B in the original image B.
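The rectangular fill described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the frame representation (a list of pixel rows), the function names `mask_rectangle` and `mask_video`, the box format `(top, left, bottom, right)`, and the fill value 0 are all assumptions made for the example. How the athlete's bounding box is obtained is outside the sketch.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]           # assumed: one grayscale frame as rows of pixel values
Box = Tuple[int, int, int, int]   # assumed: (top, left, bottom, right), end-exclusive


def mask_rectangle(frame: Frame, top: int, left: int,
                   bottom: int, right: int, fill: int = 0) -> Frame:
    """Return a copy of `frame` with the given rectangle filled with `fill`."""
    masked = [row[:] for row in frame]  # copy so the original frame is untouched
    for y in range(max(top, 0), min(bottom, len(masked))):
        row = masked[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = fill
    return masked


def mask_video(frames: Sequence[Frame], boxes: Sequence[Box],
               fill: int = 0) -> List[Frame]:
    """Mask every frame with its own athlete bounding box (one box per frame)."""
    return [mask_rectangle(f, *b, fill=fill) for f, b in zip(frames, boxes)]
```

A per-frame box is used because the athlete moves during the dive; a single fixed box for the whole video would be a simpler but coarser alternative.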
- the input unit 11 of the learning processing unit 10 inputs the original video A, athlete mask video A, original video B, athlete mask video B, the true value of the competition score A, and the true value of the competition score B, and passes them to the learning unit 12.
- the true value of the competition score A is the true value of the score of the evaluation of the athlete's movement in the original video A.
- the true value of the competition score B is the true value of the score of the evaluation of the athlete's movement in the original video B.
- the number of athletes in the original images A and B may be one or more.
- the original images may be images showing multiple athletes, such as synchronized swimming.
- there is no particular limitation on the method of acquiring the athlete's area when generating the athlete mask images A and B.
- the mask images may be generated automatically using the technology disclosed in "K. He et al. Mask R-CNN. In ICCV, 2017.", or may be generated manually while visually checking all frames of the original video.
- the method of giving the scores for the above evaluation is not particularly limited. For example, the scores given by a judge to the actions in the video may be used.
- the learning unit 12 learns the model parameters multiple times by performing the following first pattern learning process, second pattern learning process, and third pattern learning process.
- FIG. 4 is a diagram illustrating an example of a learning process of the first pattern performed by the learning unit.
- the learning unit 12 inputs the original videos A and B, i.e., a plurality of original videos, into the model as in the conventional method, and learns the parameters of the model so that the difference between the competition scores output from the model can be appropriately predicted.
- This difference is the difference between the competition score A, which is the predicted value of the score of the evaluation of the athlete's movement in the original video A, and the competition score B, which is the predicted value of the score of the evaluation of the athlete's movement in the original video B.
- the model stored in the learning model DB13 includes a first DNN and a second DNN.
- the first DNN can input a first image
- the second DNN can input a second image.
- the parameters of the model include weighting between nodes of the first DNN and weighting between nodes of the second DNN, and these weightings are shared with each other.
- the first DNN outputs feature f1, which is a feature of original video A
- the second DNN outputs feature f2 , which is a feature of original video B.
- These features are input to layers in the learning unit 12, and a predicted value of the difference between competition scores A and B is output from these layers.
- the model parameters are learned so that this predicted value approaches or becomes the same as the difference between the true value of the competition score A and the true value of the competition score B.
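The first-pattern forward pass can be illustrated with a toy model. This is a sketch under stated assumptions: the videos are reduced to flat numeric vectors, the two DNNs are replaced by a single linear map (`SharedEncoder`) so that weight sharing is explicit, and the "layers" are replaced by one linear head applied to the concatenation of f1 and f2. None of these choices come from the source, which leaves the DNN and layer structure open.

```python
from typing import List, Sequence


class SharedEncoder:
    """Toy stand-in for the first and second DNN: both branches use the same weights."""

    def __init__(self, weights: Sequence[Sequence[float]]):
        self.weights = weights  # shape: feature_dim x input_dim

    def __call__(self, x: Sequence[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def predict_difference(encoder: SharedEncoder, head: Sequence[float],
                       video_a: Sequence[float], video_b: Sequence[float]) -> float:
    """Predict (score A - score B) from features f1 and f2."""
    f1 = encoder(video_a)   # feature of the first input
    f2 = encoder(video_b)   # feature of the second input, same weights
    joint = f1 + f2         # list concatenation (assumed fusion method)
    return sum(h * v for h, v in zip(head, joint))


def squared_error(prediction: float, true_a: float, true_b: float) -> float:
    """Loss against the true difference between competition scores A and B."""
    return (prediction - (true_a - true_b)) ** 2
```

Because the encoder object is shared, any parameter update driven by this loss affects both branches identically, which is what sharing the weighting between the first and second DNN amounts to.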
- FIG. 5 is a diagram illustrating an example of a learning process of the second pattern performed by the learning unit.
- the learning unit 12 inputs athlete mask image A and athlete mask image B, i.e., multiple athlete mask images, into the model, and assumes that if the athlete is not visible, the score is 0 points, and learns the model parameters so that the difference between the two competition scores is predicted to be 0 points.
- the first DNN outputs feature f1, which is a feature of athlete mask image A
- the second DNN outputs feature f2 , which is a feature of athlete mask image B.
- FIG. 6 is a diagram illustrating an example of a learning process of the third pattern performed by the learning unit.
- the learning unit 12 inputs a combination of original video A and athlete mask video B, or a combination of original video B and athlete mask video A, and assumes that if the athlete is not visible, the competition score is 0 points, and learns the model parameters so that the competition score of the input original video is appropriately predicted as the difference between the two competition scores.
- for example, when a combination of original video A and athlete mask video B is input, the first DNN outputs feature f1, which is a feature of original video A, and the second DNN outputs feature f2, which is a feature of athlete mask video B.
- These features are input to layers in the learning unit 12, and a predicted value of the difference in competition scores is output from these layers.
- the model parameters are learned so that this predicted value approaches or becomes the same as the true value of the competition score for the athlete's movement in the original video A.
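The training targets of the three patterns follow one rule: a masked input is treated as scoring 0 points, and the target is the difference between the two effective scores. A minimal sketch, assuming the target is computed as (first input) minus (second input); the function name `target_difference` and this sign convention are illustrative assumptions, not taken from the source.

```python
def target_difference(score_first: float, score_second: float,
                      first_masked: bool, second_masked: bool) -> float:
    """Target for the predicted score difference under the three patterns.

    Pattern 1 (no mask):     true score of first - true score of second
    Pattern 2 (both masked): 0
    Pattern 3 (one masked):  the true score of the unmasked input
                             (with its sign set by the input order)
    """
    effective_first = 0.0 if first_masked else score_first
    effective_second = 0.0 if second_masked else score_second
    return effective_first - effective_second
```

For example, with original video A (true score 85.5) as the first input and athlete mask video B as the second, the target is 85.5, matching the third-pattern description above.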
- the structure of the DNN and layers is not limited.
- the first and second DNNs may be, for example, I3D (Two-Stream Inflated 3D ConvNet) disclosed in "Carreira et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017.".
- the layers may be multiple fully connected layers, or the model structure disclosed in the above-mentioned non-patent document 1 may be applied.
- a loss function is calculated from the difference between the predicted competition score and the true competition score, and the network model is learned by error backpropagation.
- the weighting of the parameters of this learned model is stored in the learning model DB 13 and is also stored in the learning model DB 23 of the estimation processing unit 20.
- the type of loss function is not particularly limited. For example, L1 distance, L2 distance, or the sum of these may be used as the loss function.
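The loss options named above can be sketched over a batch of predicted and true score differences. This is an illustration only: the function names are hypothetical, and the source does not say whether the L2 distance is used squared or unsquared; the unsquared Euclidean form is assumed here.

```python
import math
from typing import Sequence


def l1_distance(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Sum of absolute errors."""
    return sum(abs(p - t) for p, t in zip(preds, targets))


def l2_distance(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Euclidean norm of the error vector (assumed unsquared)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)))


def combined_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Sum of the L1 and L2 distances."""
    return l1_distance(preds, targets) + l2_distance(preds, targets)
```

With predictions `[3.0, 0.0]` and targets `[0.0, 4.0]`, the errors are 3 and -4, so the L1 distance is 7, the L2 distance is 5, and the combined loss is 12.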
- the order of execution of the learning processes using the above-mentioned first, second and third patterns is not particularly limited, and the timing of updating the weighting of the parameters in the learning processes using the various patterns is not particularly limited.
- for example, in order to stabilize the learning, the learning process may first be performed using the first pattern, and in the following 50 epochs, the learning process may be performed using all of the first, second and third patterns. In this way, the combination may be changed according to the number of learning epochs, i.e., the number of times the learning is performed.
- alternatively, the learning process pattern may be selected at random throughout all epochs.
- all of the learning processes of the first, second, and third patterns may be performed within the first epoch, or the learning process may be performed using one of the patterns for each epoch, such as the first pattern in the first epoch and the second pattern in the second epoch.
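The scheduling options above could be expressed as a small helper. This is a hedged sketch: `warmup_epochs` is a hypothetical parameter (the source does not give the length of the initial first-pattern phase), and the function names are illustrative.

```python
import random
from typing import List


def patterns_for_epoch(epoch: int, warmup_epochs: int) -> List[int]:
    """Staged schedule: first pattern only during warm-up, then all three patterns."""
    if epoch < warmup_epochs:
        return [1]
    return [1, 2, 3]


def round_robin_pattern(epoch: int) -> int:
    """One pattern per epoch: pattern 1 in the first epoch, 2 in the second, and so on."""
    return epoch % 3 + 1


def random_pattern(rng: random.Random) -> int:
    """Random schedule: pick one of the three patterns uniformly."""
    return rng.choice([1, 2, 3])
```

Passing an explicit `random.Random` instance keeps the random schedule reproducible when a seed is fixed.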
- the estimation process by the estimation processing unit 20 after the above learning process is performed will be described.
- a combination of original video A and original video B is input to the input unit 21, and the estimation unit 22 inputs this input video into a model stored in the learning model DB 23.
- the first DNN outputs feature f1, which is a feature of original video A
- the second DNN outputs feature f2, which is a feature of original video B.
- the learning process using athlete mask video promotes the extraction of features in video related to human movements, thereby achieving the effect of generalizing to human movements, making it possible to accurately score sports competitions, for example.
- model parameters are learned by executing a learning process using all of the first, second and third patterns.
- the present invention is not limited to this, and the model parameters can be learned by executing a learning process using some of the first, second and third patterns, which necessarily includes a learning process for the first pattern, i.e., by executing a learning process for the first pattern and then executing a learning process for either the second or third pattern, thereby facilitating the extraction of features in the video related to the human movements and achieving the effect of generalizing to human movements.
- the learning process of the first pattern is always included because, as described above, the estimation process takes a combination of original videos as input and outputs a predicted value; including the first pattern avoids a decrease in estimation accuracy caused by a mismatch between the inputs used in the learning process and those used in the estimation process.
- the accuracy of learning the model parameters when a learning process is executed using all of the first, second and third patterns as described above is higher than the accuracy of learning the model parameters when a learning process is executed using some of the first, second and third patterns as described above.
- FIG. 7 is a block diagram showing an example of a hardware configuration of an image evaluation device according to an embodiment of the present invention.
- the image evaluation device 100 is configured, for example, by a server computer or a personal computer, and has a hardware processor 111A such as a CPU (Central Processing Unit).
- a program memory 111B, a data memory 112, an input/output interface 113, and a communication interface 114 are connected to the hardware processor 111A via a bus 115.
- the communication interface 114 includes, for example, one or more wireless communication interface units, and enables the transmission and reception of information to and from the communication network NW.
- as the wireless interface, for example, an interface that adopts a low-power wireless data communication standard such as a wireless LAN (Local Area Network) is used.
- An input device 200 and an output device 300 that are attached to the image evaluation apparatus 100 and used by a user or the like are connected to the input/output interface 113 .
- the input/output interface 113 can take in operation data input by a user or the like through an input device 200 such as a keyboard, a touch panel, a touchpad, a mouse, or the like, and can output and display output data to an output device 300 including a display device using liquid crystal or organic EL (Electro Luminescence), etc.
- the input device 200 and the output device 300 may be devices built into the image evaluation device 100, or may be input devices and output devices of other information terminals that can communicate with the image evaluation device 100 via the network NW.
- the program memory 111B is a non-transitory tangible storage medium that combines a non-volatile memory that can be written to and read from at any time, such as a hard disk drive (HDD) or solid state drive (SSD), with a non-volatile memory such as a ROM, and can store programs necessary to execute various control processes, etc., according to one embodiment.
- the data memory 112 is a tangible storage medium that is, for example, a combination of the above-mentioned non-volatile memory and a volatile memory such as RAM, and can be used to store various data or information acquired and created during various processes.
- the image evaluation device 100 can be configured as a data processing device having the various units shown in FIG. 1 as software-based processing function units.
- the information storage unit used as a work memory by each part of the image evaluation device 100 can be configured by using the data memory 112 shown in FIG. 7.
- these configured storage areas are not essential components within the image evaluation device 100, and may be areas provided in a storage device such as an external storage medium such as a USB (Universal Serial Bus) memory, or a database server located in the cloud.
- processing function units in the above-mentioned components can be realized by having the hardware processor 111A read and execute a program stored in the program memory 111B. Note that some or all of these processing function units may be realized in a variety of other forms, including integrated circuits such as application specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).
- the methods described in each embodiment can be stored as a program (software means) that can be executed by a computer on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), optical disk (CD-ROM, DVD, MO, etc.), semiconductor memory (ROM, RAM, Flash memory, etc.), and can be distributed by transmitting it via a communication medium.
- the programs stored on the medium also include a setting program that configures the software means (including not only execution programs but also tables and data structures) that the computer executes.
- the computer that realizes this device reads the program recorded on the recording medium, and in some cases configures the software means using the setting program, and executes the above-mentioned processing by having the operation controlled by this software means.
- the recording medium referred to in this specification is not limited to one for distribution, but also includes storage media such as magnetic disks and semiconductor memories installed inside the computer or in devices connected via a network.
- the present invention is not limited to the above-described embodiments, and can be modified in various ways during implementation without departing from the gist of the invention.
- the embodiments may also be implemented in appropriate combination, in which case the combined effects can be obtained.
- the above-described embodiments include various inventions, and various inventions can be extracted by combinations selected from the multiple constituent elements disclosed. For example, if the problem can be solved and an effect can be obtained even if some constituent elements are deleted from all the constituent elements shown in the embodiments, the configuration from which these constituent elements are deleted can be extracted as an invention.
- 100: Image evaluation device
- 10: Learning processing unit
- 11, 21: Input unit
- 12: Learning unit
- 13, 23: Learning model DB (database)
- 20: Estimation processing unit
- 22: Estimation unit
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/020354 WO2024247193A1 (ja) | 2023-05-31 | 2023-05-31 | Image evaluation device, method, and program |
| JP2025523143A JPWO2024247193A1 | 2023-05-31 | 2023-05-31 | |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/020354 WO2024247193A1 (ja) | 2023-05-31 | 2023-05-31 | Image evaluation device, method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024247193A1 (ja) | 2024-12-05 |
Family
ID=93657299
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/020354 Ceased WO2024247193A1 (ja) | 2023-05-31 | 2023-05-31 | Image evaluation device, method, and program |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2024247193A1 |
| WO (1) | WO2024247193A1 |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
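| WO2022244135A1 (ja) * | 2021-05-19 | 2022-11-24 | 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation) | Learning device, estimation device, learning model data generation method, estimation method, and program |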
-
2023
- 2023-05-31 JP JP2025523143A patent/JPWO2024247193A1/ja active Pending
- 2023-05-31 WO PCT/JP2023/020354 patent/WO2024247193A1/ja not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
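| WO2022244135A1 (ja) * | 2021-05-19 | 2022-11-24 | 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation) | Learning device, estimation device, learning model data generation method, estimation method, and program |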
Non-Patent Citations (2)
| Title |
|---|
| CHEN TING, KORNBLITH SIMON, NOROUZI MOHAMMAD, HINTON GEOFFREY: "A Simple Framework for Contrastive Learning of Visual Representations", 1 July 2020 (2020-07-01), pages 1 - 20, XP093037179, Retrieved from the Internet <URL:https://arxiv.org/pdf/2002.05709.pdf> [retrieved on 20230404], DOI: 10.48550/arXiv.2002.05709 * |
| CHEN ZEKAI; AGARWAL DEVANSH; AGGARWAL KSHITIJ; SAFTA WIEM; BALAN MARIANN MICSINAI; BROWN KEVIN: "Masked Image Modeling Advances 3D Medical Image Analysis", 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), IEEE, 2 January 2023 (2023-01-02), pages 1969 - 1979, XP034291047, DOI: 10.1109/WACV56688.2023.00201 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2024247193A1 | 2024-12-05 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Zhou et al. | Classroom learning status assessment based on deep learning | |
| US11487999B2 (en) | Spatial-temporal reasoning through pretrained language models for video-grounded dialogues | |
| CN113705297B (zh) | Training method and apparatus for detection model, computer device, and storage medium | |
| CN114925748B (zh) | Model training and modality information prediction methods, related apparatus, device, and medium | |
| CN109214337B (zh) | Crowd counting method, apparatus, device, and computer-readable storage medium | |
| CN110880019A (zh) | Method for training a target-domain classification model through unsupervised domain adaptation | |
| CN113112577B (zh) | Training method for transition frame prediction model and transition frame prediction method | |
| Nimmagadda et al. | Cricket score and winning prediction using data mining | |
| JP7547652B2 (ja) | Method and apparatus for action recognition | |
| Urtans et al. | Survey of deep q-network variants in pygame learning environment | |
| CN113408226B (zh) | Deep-learning-based method and system for estimating bump current of a chip power supply network | |
| CN108921023A (zh) | Method and device for determining low-quality portrait data | |
| US12136356B2 (en) | Using collected user data and cognitive information to generate a scene set in a training environment | |
| US20220215228A1 (en) | Detection method, computer-readable recording medium storing detection program, and detection device | |
| WO2024247193A1 (ja) | Image evaluation device, method, and program | |
| US20250201357A1 (en) | Information processing apparatus, information processing system, and method | |
| KR102127449B1 (ko) | Method, apparatus, and computer program for generating a survival rate prediction model | |
| CN120873844A (zh) | Tumor classification model training and use methods, apparatus, device, medium, and product | |
| CN116246201B (zh) | Weakly supervised video instance segmentation method based on pixel erasing | |
| CN119026658A (zh) | Model training method, system, and chip | |
| KR102792485B1 (ko) | Exit ensemble distillation method and apparatus based on a neural network having multiple exits | |
| TWI757965B (zh) | Deep learning method for augmented reality motion-sensing game machine | |
| US20230019194A1 (en) | Deep Learning in a Virtual Reality Environment | |
| US20230316731A1 (en) | Information processing apparatus, information processing method, and non-transitory computer-readable storage medium | |
| CN115830716B (zh) | Skeleton-based human action recognition method, apparatus, and device based on derived data distillation | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23939668; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025523143; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 2025523143; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |