WO2024057469A1 - Video processing system, video processing device, and video processing method - Google Patents
Video processing system, video processing device, and video processing method
- Publication number
- WO2024057469A1 (PCT/JP2022/034510)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- frames
- input
- difference information
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Definitions
- the present disclosure relates to a video processing system, a video processing device, and a video processing method.
- a technology has been developed in which an edge-side terminal sends a captured video to a center-side server, and the center-side server uses an AI engine to recognize objects in the video. For example, the server recognizes the type of work performed by the worker.
- the edge-side terminal changes the frame rate of the video by frame filtering or the like in order to efficiently use calculation resources and network bandwidth.
- Patent Document 1 discloses a technology that performs video scene recognition from time-series frames extracted from video using a deep learning algorithm such as RNN (Recurrent Neural Network).
- the server on the center side does not support changes in the frame rate of the video, so it cannot recognize objects in a way that accounts for such changes, and there was room for improvement in the accuracy of recognizing objects in the video.
- the present disclosure aims to provide a video processing system, a video processing device, and a video processing method that can be expected to improve the recognition accuracy of objects in videos.
- the video processing system of the present disclosure includes: video acquisition means for acquiring an input video; time difference information acquisition means for acquiring first time difference information between frames of the input video; and recognition means for recognizing an object within the input video by inputting the input video and the first time difference information between the frames of the input video into a trained recognition model trained using a training video and second time difference information between frames of the training video.
- the video processing device of the present disclosure includes: video acquisition means for acquiring an input video; time difference information acquisition means for acquiring first time difference information between frames of the input video; and recognition means for recognizing an object within the input video by inputting the input video and the first time difference information between the frames of the input video into a trained recognition model trained using a training video and second time difference information between frames of the training video.
- in the video processing method of the present disclosure, a computer: acquires an input video; acquires first time difference information between frames of the input video; and recognizes an object within the input video by inputting the input video and the first time difference information between the frames of the input video into a trained recognition model trained using a training video and second time difference information between frames of the training video.
- FIG. 1 is a block diagram showing the configuration of a video processing system according to an overview of an embodiment.
- FIG. 2 is a block diagram showing the configuration of a video processing device according to an overview of an embodiment.
- FIG. 3 is a flowchart illustrating a video processing method according to an overview of an embodiment.
- FIG. 4 is a block diagram showing the configuration of a video processing system according to a first embodiment.
- FIG. 5 is a block diagram showing the configuration of a terminal according to the first embodiment.
- FIG. 6 is a block diagram showing the configuration of a center server according to the first embodiment.
- FIG. 7 is a flowchart showing the operation of the video processing system according to the first embodiment.
- FIG. 8 is a diagram illustrating an example of input information of a trained recognition model according to the first embodiment.
- FIG. 9 is a diagram illustrating an example of the configuration and recognition operation of a trained recognition model according to the first embodiment.
- FIG. 10 is a block diagram showing the configuration of a center server according to a second embodiment.
- FIG. 11 is a flowchart illustrating an example of the operation of the video processing system according to the second embodiment.
- FIG. 12 is a diagram illustrating an example of the configuration and recognition operation of a trained recognition model according to the second embodiment.
- FIG. 13 is a diagram illustrating an example of a first learning operation of a recognition model according to a second embodiment.
- FIG. 14 is a diagram illustrating an example of a second learning operation of the recognition model according to the second embodiment.
- FIG. 15 is a diagram showing another example of the configuration of a trained recognition model and recognition operation according to the second embodiment.
- FIG. 16 is a block diagram showing the configuration of a computer according to the present embodiment.
- FIG. 1 is a block diagram showing the configuration of a video processing system 10 according to an overview of the embodiment.
- the video processing system 10 is applicable to, for example, a remote monitoring system that collects video via a network and recognizes the video.
- the video processing system 10 includes a video acquisition section 11, a time difference information acquisition section 12, and a recognition section 13.
- the video acquisition unit 11 acquires input video.
- the time difference information acquisition unit 12 acquires first time difference information between frames of the input video.
- the recognition unit 13 inputs the input video and the first time difference information between the frames of the input video to the trained recognition model, which has been trained using the training video and the second time difference information between the frames of the training video, and recognizes objects in the input video.
- FIG. 2 is a block diagram showing the configuration of a video processing device 20 according to an overview of the embodiment.
- the video processing device 20 includes the video acquisition section 11, time difference information acquisition section 12, and recognition section 13 shown in FIG.
- the video processing device 20 may be realized by edge computing, and part or all of the video processing device 20 may be placed on the edge or in the cloud.
- the video acquisition unit 11 and the time difference information acquisition unit 12 may be placed in the edge terminal, and the recognition unit 13 may be placed in the cloud server.
- each function may be distributed and arranged in the cloud.
- the video processing device 20 may be realized using virtualization technology such as a virtualization server.
- part or all of the video processing device 20 may be placed on the site or on the server side.
- the site where the terminal is installed, the device located near the site, or the device close to the terminal as a network layer is considered to be the device placed on the site side.
- devices located far from the site are placed on the center side. Since devices placed on the center side may be placed on the cloud, the center side is sometimes referred to as the cloud side.
- FIG. 3 is a flowchart illustrating a video processing method according to an overview of the embodiment.
- the video processing method according to the embodiment is executed by the video processing system 10 in FIG. 1 or the video processing device 20 in FIG. 2.
- an input video is acquired (step S11).
- first time difference information between frames of the input video is acquired (step S12).
- the input video and the first time difference information between the frames of the input video are input to the trained recognition model that has been trained using the training video and the second time difference information between the frames of the training video, and objects in the input video are recognized (step S13).
- the video processing system 10 can recognize objects in response to changes in the frame rate of the video by considering time difference information between frames of the input video.
- the video processing system 10 can therefore be expected to improve the recognition accuracy of objects in videos.
- FIG. 4 is a block diagram showing the configuration of the video processing system 1 according to the first embodiment.
- the video processing system 1 is a system that monitors an area where the video is taken using a video taken by a camera.
- the system will be described as a system for remotely monitoring the work of workers at the site.
- the site may be an area where people and machines operate, such as a work site such as a construction site, a public square where people gather, or a school.
- the work will be described as construction work, civil engineering work, etc., but is not limited thereto.
- the video processing system can be said to be a video processing system that processes videos, and also an image processing system that processes images.
- the video processing system 1 includes a plurality of terminals 100, a center server 200, a base station 300, and an MEC 400.
- the terminal 100, base station 300, and MEC 400 are placed on the field side, and the center server 200 is placed on the center side.
- the center server 200 is located in a data center or the like that is located away from the site.
- the field side is the edge side of the system, and the center side is also the cloud side.
- Terminal 100 and base station 300 are communicably connected via network NW1.
- the network NW1 is, for example, a wireless network such as 4G, local 5G/5G, LTE (Long Term Evolution), or wireless LAN.
- Base station 300 and center server 200 are communicably connected via network NW2.
- the network NW2 includes, for example, core networks such as 5GC (5th Generation Core network) and EPC (Evolved Packet Core), the Internet, and the like. It can also be said that the terminal 100 and the center server 200 are communicably connected via the base station 300.
- the base station 300 and MEC 400 are communicably connected by any communication method; the base station 300 and MEC 400 may also be a single device.
- the terminal 100 is a terminal device connected to the network NW1, and is also a video generation device that generates on-site video.
- the terminal 100 acquires an image captured by a camera 101 installed at the site, and transmits the acquired image to the center server 200 via the base station 300.
- the camera 101 may be placed outside the terminal 100 or inside the terminal 100.
- the terminal 100 compresses the video from the camera 101 to a predetermined bit rate and transmits the compressed video.
- the terminal 100 has a compression efficiency optimization function 102 that optimizes compression efficiency and a video distribution function 103.
- the compression efficiency optimization function 102 performs ROI control to control the image quality of a ROI (Region of Interest).
- the compression efficiency optimization function 102 reduces the bit rate by lowering the image quality of the region around the ROI while maintaining the image quality of the ROI including the person or object.
- the video distribution function 103 distributes the quality-controlled video to the center server 200.
- the base station 300 is a base station device of the network NW1, and is also a relay device that relays communication between the terminal 100 and the center server 200.
- the base station 300 is a local 5G base station, a 5G gNB (next Generation Node B), an LTE eNB (evolved Node B), a wireless LAN access point, or the like, but may also be another relay device.
- MEC 400 is an edge processing device placed on the edge side of the system.
- the MEC 400 is an edge server that controls the terminal 100, and has a compression bit rate control function 401 and a terminal control function 402 that control the bit rate of the terminal.
- the compression bit rate control function 401 controls the bit rate of the terminal 100 through adaptive video distribution control and QoE (quality of experience) control.
- the compression bit rate control function 401 predicts the recognition accuracy that will be obtained while suppressing the bit rate according to the communication environment of the networks NW1 and NW2, and sets the bit rate to the camera 101 of each terminal 100 so as to improve the recognition accuracy.
- the terminal control function 402 controls the terminal 100 to distribute video at the assigned bit rate.
- the terminal 100 encodes the video at the assigned bit rate and distributes the encoded video.
- the center server 200 is a server installed on the center side of the system.
- the center server 200 may be one or more physical servers, or may be a cloud server built on the cloud or other virtualized servers.
- the center server 200 is a monitoring device that monitors on-site work by recognizing people's work from on-site camera images.
- the center server 200 is also a video recognition device that recognizes the actions of people in the video transmitted from the terminal 100.
- the center server 200 has a video recognition function 201, an alert generation function 202, a GUI drawing function 203, and a screen display function 204.
- the video recognition function 201 inputs the video transmitted from the terminal 100 into an AI engine (for example, a trained recognition model) to recognize the type of work performed by the worker, that is, the type of behavior of the person.
- the alert generation function 202 generates an alert in response to the recognized work.
- the GUI drawing function 203 displays a GUI (Graphical User Interface) on the screen of a display device.
- the screen display function 204 displays images of the terminal 100, recognition results, alerts, etc. on the GUI.
- the configuration of each device is an example, and other configurations may be used as long as the operation according to the present embodiment described later is possible.
- some functions of the terminal 100 may be placed in the center server 200 or other devices, or some functions of the center server 200 may be placed in the terminal 100 or other devices.
- the video processing system 1 is an embodiment of the video processing system 10 according to the outline of the embodiment.
- the center server 200 embodies the video processing device 20 according to the outline of the embodiment.
- FIG. 5 is a block diagram showing the configuration of the terminal 100 of the video processing system 1 according to the first embodiment.
- the terminal 100 includes a video acquisition section 110, a frame filter section 120, an encoding section 130, and a terminal communication section 140.
- the video acquisition unit 110 acquires the video captured by the camera 101 (also referred to as input video).
- the input video is, for example, data obtained by photographing a person who is a worker working at a site, a work object used by the person, and the like.
- the input video includes time-series frames.
- the frame filter unit 120 filters (selects) time-series frames included in the input video.
- the frame filter unit 120 performs filtering, for example, to adjust the bit rate of video transmitted to the center server 200.
- frames of the input video that are not selected by the filtering are skipped.
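The filtering step can be illustrated with a minimal sketch. The fixed "keep one frame in every `keep_every`" policy and the names `filter_frames` and `keep_every` are assumptions made for illustration; the text does not state how the frame filter unit 120 chooses which frames to keep.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def filter_frames(
    frames: Sequence[Frame], keep_every: int
) -> Tuple[List[Frame], List[int]]:
    """Keep one frame out of every `keep_every`; all other frames are skipped.

    Returns the kept frames together with their original indices, so that a
    later stage can tell how many frames were skipped between two kept frames.
    The fixed decimation policy is a hypothetical stand-in for the filter.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    kept_indices = list(range(0, len(frames), keep_every))
    return [frames[i] for i in kept_indices], kept_indices


frames = [f"frame{i}" for i in range(7)]
kept, kept_indices = filter_frames(frames, keep_every=3)
print(kept)          # ['frame0', 'frame3', 'frame6']
print(kept_indices)  # [0, 3, 6]
```

Returning the original indices alongside the frames keeps the information needed to derive inter-frame time differences downstream.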
- the encoding unit 130 encodes the filtered input video.
- the encoding unit 130 changes the frame rate of the input video by filtering frames.
- the encoding unit 130 may encode the input video so that the focused region of the frame has higher image quality than other regions.
- the encoding unit 130 detects an object in the input video using a trained neural network model (for example, a convolutional neural network model), and surrounds the detected object with a box.
- the encoding unit 130 is not limited to using a box, and may surround the detected object with a circle, an ellipse, an irregular shape matching a silhouette, or the like. The encoding unit 130 then recognizes the object within the box.
- the encoding unit 130 extracts objects whose class is person or work object from among the recognized objects, and determines the inside of the box of the extracted object as a gaze region.
- the encoding unit 130 encodes the input video so that the region of interest has higher image quality than other regions.
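As a rough illustration of how the gaze region is chosen, the sketch below filters detection results by class and tests whether a pixel falls inside a selected box. The `Detection` type, the box layout, and the label strings are assumptions made for this example; the detector and the encoder's quality control are not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]

# Hypothetical class labels; the text only names "person" and "work object".
GAZE_CLASSES = {"person", "work_object"}


@dataclass
class Detection:
    label: str
    box: Box


def gaze_regions(detections: List[Detection]) -> List[Box]:
    """Return the boxes of detections whose class is a person or a work object."""
    return [d.box for d in detections if d.label in GAZE_CLASSES]


def in_gaze_region(x: int, y: int, regions: List[Box]) -> bool:
    """True if pixel (x, y) lies inside any gaze region (bounds inclusive)."""
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in regions)


detections = [
    Detection("person", (10, 20, 50, 120)),
    Detection("tree", (200, 0, 260, 150)),
    Detection("work_object", (60, 90, 90, 130)),
]
regions = gaze_regions(detections)
print(regions)  # [(10, 20, 50, 120), (60, 90, 90, 130)]
```

An encoder could use a test like `in_gaze_region` to assign higher quality inside the selected boxes and lower quality elsewhere.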
- the terminal communication unit 140 transmits the encoded data to the center server 200.
- FIG. 6 is a block diagram showing an example of the configuration of the center server 200 of the video processing system 1 according to the first embodiment.
- the center server 200 includes a center communication section 210, a decoding section 220, a time difference information acquisition section 230, a recognition section 240, a storage section 250, and a learning section 260.
- the center communication unit 210 receives encoded data transmitted from the terminal 100 via the base station 300.
- the center communication unit 210 is an interface capable of communicating with the Internet or a core network, and is, for example, a wired interface for IP communication, but may be a wired or wireless interface of any other communication method.
- the decoding unit 220 decodes the encoded data received from the terminal 100.
- the decoding unit 220 decodes the video using a video coding method that corresponds to the encoding method of the terminal 100, for example, H.264 or H.265.
- the decoding unit 220 performs decoding according to the compression rate of each region in the frame, and generates a decoded input video.
- the time difference information acquisition unit 230 acquires time difference information ΔT (ΔT is a natural number) based on time stamp information acquired from a video compression codec or the like.
- the time difference information ΔT corresponds to a predetermined frame included in the input video and is information representing the time difference between the predetermined frame and the previous frame. That is, the time difference information ΔT is 1 if no frame is skipped between the predetermined frame and the previous frame. On the other hand, when n frames are skipped between the predetermined frame and the previous frame, the time difference information ΔT becomes 1+n.
- the time stamp information is information indicating the timing at which each frame included in the input video was photographed by the camera 101. Note that the timestamp information may be information indicating the timing at which each frame was encoded by the encoding unit 130 of the terminal 100.
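The relation between time stamps and ΔT can be sketched as follows. This is a minimal example that assumes a known, constant capture interval (`frame_interval`) and assigns ΔT = 1 to the first frame; neither assumption comes from the text, and the function name is hypothetical.

```python
from typing import List


def time_differences(timestamps: List[float], frame_interval: float) -> List[int]:
    """Return ΔT for each frame: 1 plus the number of frames skipped before it.

    `timestamps` are capture times in seconds of the frames that survived
    filtering; `frame_interval` is the assumed constant capture interval.
    """
    if frame_interval <= 0:
        raise ValueError("frame_interval must be positive")
    # Assumption: the first frame has no predecessor, so it gets ΔT = 1.
    deltas = [1]
    for prev, cur in zip(timestamps, timestamps[1:]):
        # Rounding absorbs floating-point error in the timestamp arithmetic.
        steps = round((cur - prev) / frame_interval)
        deltas.append(max(1, steps))
    return deltas


# Captured at 30 fps; frames 2, 5 and 6 were skipped by the filter.
timestamps = [0 / 30, 1 / 30, 3 / 30, 4 / 30, 7 / 30]
deltas = time_differences(timestamps, frame_interval=1 / 30)
print(deltas)  # [1, 1, 2, 1, 3]
```

In the example, one skipped frame yields ΔT = 2 and two skipped frames yield ΔT = 3, matching the 1+n rule.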
- the recognition unit 240 inputs time-series frames included in the input video and time difference information ΔT between frames of the input video as input information to the trained recognition model M1, and recognizes objects in the input video.
- the recognition unit 240 recognizes, for example, the type of work performed by a worker in the input video, that is, the type of behavior of a person.
- the trained recognition model M1 is a recurrent neural network (RNN) model that receives time-series frames included in the input video, and includes a plurality of cells of the RNN.
- the plurality of cells of the RNN receive, as input, parameters corresponding to the time difference information between frames of the input video. More specifically, the plurality of cells of the RNN receive parameters obtained when a decoder decodes the time difference information between frames of the input video.
- the storage unit 250 stores the trained recognition model M1.
- the learning unit 260 generates a trained recognition model M1 by learning using the time difference information ΔT between the frames of the learning video and the correct answer data.
- FIG. 7 is a flowchart showing the operation of the video processing system 1 according to the first embodiment.
- the video acquisition unit 110 of the terminal 100 of the video processing system 1 acquires an input video captured by the camera 101 (step S101).
- the input video includes time-series frames.
- the frame filter unit 120 filters time-series frames included in the input video (step S102). Here, frames of the input video that are not selected by the filtering are skipped.
- the encoding unit 130 encodes the filtered input video (step S103).
- the terminal communication unit 140 transmits the encoded data to the center server 200 via the base station 300 (step S104).
- the center communication unit 210 of the center server 200 receives the encoded data from the terminal 100 (step S105).
- the decoding unit 220 obtains the input video by decoding the encoded data (step S106).
- the time difference information acquisition unit 230 acquires time difference information ΔT between frames corresponding to the frames of the input video (step S107). Specifically, the time difference information acquisition unit 230 acquires time difference information ΔT based on time stamp information acquired from a video compression codec or the like.
- the time stamp information is, for example, information about the timing at which each frame included in the input video was photographed by the camera 101.
- the recognition unit 240 inputs the time-series frames included in the input video and the time difference information ΔT corresponding to the frames of the input video as input information to the trained recognition model M1 (step S108).
- FIG. 8 is a diagram showing an example of input information input to the trained recognition model M1.
- the input information includes time-series frames included in the input video and time difference information ΔT corresponding to the frames.
- the time difference information ΔT represents the time difference between the corresponding predetermined frame and the previous frame.
- the time difference information ΔT is 1 if no frame is skipped between the corresponding predetermined frame and the previous frame.
- the time difference information ΔT becomes 1+n when n frames are skipped between the corresponding predetermined frame and the previous frame.
- the recognition unit 240 recognizes the object in the input video using the learned recognition model M1 (step S109).
- the recognition unit 240 recognizes, for example, the type of work performed by a worker in the input video, that is, the type of behavior of a person.
- FIG. 9 is a diagram showing an example of the configuration and recognition operation of the trained recognition model M1.
- the trained recognition model M1 is a recurrent neural network (RNN) model, and includes a plurality of time-series cells M11 of the RNN.
- Cell M11 corresponds to the middle layer of the RNN when the structure of the RNN is classified into an input layer, a middle layer, and an output layer.
- the trained recognition model M1 includes a decoder M12 corresponding to each cell M11.
- the decoder M12 receives input of the time difference information ΔT, and outputs parameters obtained by decoding the input time difference information ΔT to the cell M11.
- cell M11 receives input of the frame, the state vector output by cell M11 at a previous time, and the parameter information output by decoder M12, and outputs the state vector to cell M11 at a later time.
- the initial state of the state vector input to the cell M11 may be such that all elements are 0, for example.
- the decoder M12 receives an input of time difference information ΔT of 1, and outputs a parameter obtained by decoding the time difference information ΔT of 1 to the cell M11.
- that is, the time difference information ΔT input to decoder M12 at time t is 1.
- cell M11 receives input of the frame, the state vector output by cell M11 at time t-1, and the parameter, and outputs the state vector to cell M11 at time t+1.
- decoder M12 receives input of time difference information ΔT of 2, and outputs a parameter obtained by decoding time difference information ΔT of 2 to cell M11.
- that is, the time difference information ΔT input to decoder M12 at time t+1 is 2.
- cell M11 receives input of the frame and the state vector and parameters output by cell M11 at time t, and outputs the state vector to cell M11 at time t+2.
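The data flow between decoder M12 and cell M11 can be sketched in plain Python. This is only an illustration under stated assumptions: the frame is represented by a small feature vector, the decoder is a single tanh layer applied to ΔT, and the cell adds a linear projection of the decoded parameters before its own tanh. The text does not specify these internals, the weights here are random rather than trained, and `DeltaDecoder`, `Cell`, and `run` are hypothetical names.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DeltaDecoder:
    """Stand-in for decoder M12: maps the scalar ΔT to a parameter vector."""

    def __init__(self, param_dim: int, rng: random.Random) -> None:
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(param_dim)]
        self.b = [0.0] * param_dim

    def __call__(self, delta_t: int) -> Vector:
        return [math.tanh(w * delta_t + b) for w, b in zip(self.w, self.b)]


class Cell:
    """Stand-in for cell M11: combines frame features, previous state, parameters."""

    def __init__(self, input_dim: int, hidden_dim: int, param_dim: int,
                 rng: random.Random) -> None:
        self.w_x = random_matrix(hidden_dim, input_dim, rng)
        self.w_h = random_matrix(hidden_dim, hidden_dim, rng)
        self.w_p = random_matrix(hidden_dim, param_dim, rng)

    def __call__(self, frame: Vector, prev_state: Vector, params: Vector) -> Vector:
        pre = [
            a + b + c
            for a, b, c in zip(
                matvec(self.w_x, frame),
                matvec(self.w_h, prev_state),
                matvec(self.w_p, params),
            )
        ]
        return [math.tanh(v) for v in pre]


def run(frames: List[Vector], deltas: List[int], cell: Cell,
        decoder: DeltaDecoder, hidden_dim: int) -> Vector:
    """Unroll the cells over time and return the final state vector."""
    state = [0.0] * hidden_dim  # all-zero initial state, as in the text
    for frame, delta_t in zip(frames, deltas):
        state = cell(frame, state, decoder(delta_t))
    return state


rng = random.Random(0)
cell = Cell(input_dim=3, hidden_dim=4, param_dim=2, rng=rng)
decoder = DeltaDecoder(param_dim=2, rng=rng)
frames = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.0], [0.0, 0.4, 0.1]]
state_no_skip = run(frames, [1, 1, 1], cell, decoder, hidden_dim=4)
state_with_skip = run(frames, [1, 2, 1], cell, decoder, hidden_dim=4)
print(state_no_skip)
print(state_with_skip)
```

Running the same frames with different ΔT sequences produces different final states, which is the effect the decoded parameters are meant to have on the cells.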
- the learning unit 260 inputs time-series frames included in the learning video and time difference information ΔT corresponding to the frames to the recognition model M1.
- the training video includes, for example, a time series of frames in which frame skipping occurs according to a predetermined pattern.
- the configuration of the recognition model M1 has been described above (see FIG. 9).
- the learning unit 260 learns the recognition model M1 by comparing the output result of the recognition model M1 with correct data, and generates a learned recognition model M1.
- the trained recognition model M1 of the video processing system 1 decodes the time difference information ΔT and dynamically determines the parameters to be input to the cell M11.
- the trained recognition model M1 can improve object recognition accuracy by reflecting inter-frame time difference information in object recognition, taking into account cases where frames are skipped due to changes in the video frame rate or the like.
- the video processing system 2 includes a plurality of terminals 100, a center server 200, a base station 300, and an MEC 400, similarly to the video processing system 1 according to the first embodiment.
- the center server 200 of the video processing system 2 differs from the center server 200 of the video processing system 1 in the following configuration.
- FIG. 10 is a diagram showing the configuration of the center server 200 of the video processing system 2.
- the center server 200 of the video processing system 2 includes a center communication section 210, a decoding section 220, a time difference information acquisition section 230, a recognition section 270, a storage section 280, and a learning section 290.
- the recognition unit 270 inputs time-series frames included in the input video and time difference information ΔT between frames of the input video as input information to the trained recognition model M2, and recognizes objects in the input video.
- the trained recognition model M2 includes a plurality of cells of a recurrent neural network (RNN) that inputs time-series frames included in the input video and inputs and outputs state vectors in a time-series manner.
- the trained recognition model M2 inserts a state predictor, which predicts a state vector based on the inter-frame time difference information ΔT, between predetermined cells, such as between cells where frame skipping has occurred.
- the storage unit 280 stores the recognition model M2.
- the learning unit 290 trains the recognition model M2, into which the state predictor is inserted, using the time-series frames included in the training video in which frame skipping occurred in a predetermined pattern, the time difference information between the frames of the training video, and the correct answer data.
- the learning unit 290 learns a plurality of cells of the trained recognition model M2 using the correct data and time-series frames that are not skipped and included in the training video. Then, the learning unit 290 inputs time-series frames that are not skipped and included in the training video to a plurality of cells of the learned recognition model M2. In that case, the learning unit 290 learns the state predictor using the state vector output by the plurality of cells at time t (t is a natural number) and the state vector output at time t+N (N is a natural number).
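One way to picture the data used to train the state predictor is as triples of (state at time t, N, state at time t+N) collected from the states that the cells output on an unskipped sequence. The sketch below only builds such triples; the function name, the `max_gap` limit, and the choice to enumerate every N up to that limit are assumptions, and how N maps to ΔT is not specified in this passage.

```python
from typing import List, Tuple

Vector = List[float]
Example = Tuple[Vector, int, Vector]


def predictor_training_pairs(states: List[Vector], max_gap: int) -> List[Example]:
    """Build (state at t, N, state at t+N) examples from unskipped cell outputs.

    `states[t]` is the state vector output by the cells at time t when the
    unskipped training frames are fed in order. `max_gap` bounds N and is a
    hypothetical hyperparameter.
    """
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    examples: List[Example] = []
    for t in range(len(states)):
        for n in range(1, max_gap + 1):
            if t + n < len(states):
                examples.append((states[t], n, states[t + n]))
    return examples


# Toy one-dimensional states for four consecutive times.
states = [[0.0], [1.0], [2.0], [3.0]]
examples = predictor_training_pairs(states, max_gap=2)
for source, n, target in examples:
    print(source, n, target)
```

A predictor trained on such triples learns to map an earlier state and a gap to the state the cells would have reached had the intermediate frames been available.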
- the recognition unit 270 may input the input video, the time difference information between the frames of the input video, and the movement between frames of the input video into a trained recognition model that has been trained using the time difference information between the frames of the learning video and the movement between frames, and recognize objects in the input video.
- the learning unit 290 trains the recognition model M2, into which the state predictor has been inserted, using the time-series frames included in the learning video in which frame skipping occurred in a predetermined pattern, the time difference information between the frames of the learning video, the movement between frames of the input video, and the correct answer data.
- FIG. 11 is a flowchart illustrating an example of the operation of the video processing system 2 according to the second embodiment. As shown in FIG. 11, first, the video processing system 2 executes the processing from step S101 to step S107 (see FIG. 7) described above. Description of the processing from step S101 to step S107 will be omitted.
- the recognition unit 270 of the center server 200 inputs the time-series frames included in the input video and the time difference information ΔT corresponding to the frames as input information to the trained recognition model M2 (step S201).
- An example of the input information is described above (see FIG. 8).
- the recognition unit 270 uses time difference information ΔT where ΔT ≥ 1 as input information to the trained recognition model M2.
- the recognition unit 270 recognizes the object in the input video using the learned recognition model M2 (step S202).
- the recognition unit 270 recognizes, for example, the type of work performed by a worker in the input video, that is, the type of behavior of a person.
- FIG. 12 is a diagram illustrating an example of the configuration and recognition operation of the trained recognition model M2 according to the second embodiment.
- the learned recognition model M2 is a recurrent neural network (RNN) and includes a plurality of time-series RNN cells M21.
- cell M21 receives input of a frame and a state vector output by cell M21 at a previous time, and outputs the state vector to cell M21 at a later time.
- the initial state of the state vector may be such that all elements are 0, for example.
- a state predictor M22 is inserted between the cell M21 at a predetermined time and the cell M21 at the previous time.
- The occurrence of frame skipping can be determined from the time difference information ΔT corresponding to the frame input to the cell M21 at a predetermined time.
- The inserted state predictor M22 receives input of the state vector output by the cell M21 at the previous time and the time difference information ΔT corresponding to the frame input to the cell M21 at the predetermined time. The state predictor M22 then predicts the state vector and outputs the predicted state vector to the cell M21 at the predetermined time.
- a frame skip occurs between the frame input to the cell M21 at time t+1 and the frame input to the cell M21 at time t.
- the learned recognition model M2 inserts the state predictor M22 between the cell M21 at time t+1 and the cell M21 at time t.
- The state predictor M22 receives input of the state vector output by the cell M21 at time t and the time difference information ΔT, here 2, corresponding to the frame input to the cell M21 at time t+1.
- The input time difference information ΔT is 2 because one frame is skipped between the frame input to the cell M21 at time t+1 and the frame input to the cell M21 at time t.
- State predictor M22 predicts a state vector and outputs the predicted state vector to cell M21 at time t+1.
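- The recognition flow of FIG. 12 can be sketched as a loop over cells in which the state predictor is applied only when ΔT indicates a skip. This is a toy illustration in plain Python: `rnn_cell`, `state_predictor`, their weights, and the exponential-decay form of the predictor are all hypothetical stand-ins, since the disclosure does not specify the internals of the cell M21 or the state predictor M22.

```python
import math


def rnn_cell(frame_feature, state, w_in=0.5, w_rec=0.8):
    """Toy stand-in for cell M21: mixes one frame feature into the state."""
    return [math.tanh(w_rec * s + w_in * frame_feature) for s in state]


def state_predictor(state, delta_t, decay=0.9):
    """Toy stand-in for state predictor M22.

    Extrapolates the state across the (delta_t - 1) skipped frames.
    The exponential decay is an assumption made only for this sketch.
    """
    factor = decay ** (delta_t - 1)
    return [factor * s for s in state]


def run_model(frame_features, delta_ts, state_size=2):
    """Unroll the cells; insert the predictor wherever ΔT > 1."""
    state = [0.0] * state_size  # initial state vector: all elements 0
    predictor_used = []
    for feature, delta_t in zip(frame_features, delta_ts):
        skipped = delta_t > 1
        if skipped:
            # Predict the state the skipped frames would have produced.
            state = state_predictor(state, delta_t)
        state = rnn_cell(feature, state)
        predictor_used.append(skipped)
    return state, predictor_used


final_state, used = run_model([1.0, 0.5, 0.2], [1, 1, 2])
print(used)  # [False, False, True]
```

In this sketch the predictor sits between the cell at time t and the cell at time t+1 only for the third frame, whose ΔT is 2.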
- FIG. 13 is a diagram illustrating an example of the first learning operation of the recognition model M2.
- The learning unit 290 inputs time-series frames included in the learning video and the time difference information ΔT (ΔT ≥ 1) corresponding to the frames to the recognition model M2.
- the learning unit 290 inputs time-series frames included in the learning video to a plurality of cells M21 of the recognition model M2.
- Frame skipping occurs in the input time-series frames according to a predetermined pattern.
- When frame skipping occurs, the learning unit 290 inserts a state predictor M22 between the cell M21 at a predetermined time and the cell M21 at the previous time.
- The learning unit 290 inputs the time difference information ΔT corresponding to the frame input to the cell M21 at the predetermined time to the state predictor M22. For example, one frame skip occurs between the frame input to the cell M21 at time t+1 and the frame input to the cell M21 at time t.
- In this case, the learning unit 290 inserts the state predictor M22 between the cell M21 at time t+1 and the cell M21 at time t, and inputs 2 as the time difference information ΔT corresponding to the frame input to the cell M21 at time t+1.
- the learning unit 290 learns the recognition model M2 with the state predictor M22 inserted by comparing the output result of the recognition model M2 with the state predictor M22 inserted with the correct data. Note that the learning unit 290 may perform learning of the state predictor M22 separately from learning of the recognition model M2.
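- Preparing learning frames in which skipping occurs in a predetermined pattern can be sketched as follows. This is a minimal illustration, assuming the pattern is a repeating keep/drop mask applied to a skip-free learning video; the function name `apply_skip_pattern` and the mask representation are hypothetical, not taken from the disclosure.

```python
def apply_skip_pattern(frames, keep_pattern):
    """Drop frames according to a repeating keep/drop pattern.

    Returns the kept frames and, for each kept frame, the ΔT to the
    previously kept frame (the first kept frame gets ΔT = 1).
    """
    if not any(keep_pattern):
        raise ValueError("pattern must keep at least one frame")
    kept, deltas = [], []
    last_kept_index = None
    for index, frame in enumerate(frames):
        if not keep_pattern[index % len(keep_pattern)]:
            continue  # this frame is skipped
        deltas.append(1 if last_kept_index is None else index - last_kept_index)
        kept.append(frame)
        last_kept_index = index
    return kept, deltas


frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
kept, deltas = apply_skip_pattern(frames, [True, True, False])
print(kept)    # ['f0', 'f1', 'f3', 'f4']
print(deltas)  # [1, 1, 2, 1]
```

The kept frames and their ΔT values correspond to what is fed to the cells M21 and to the state predictor M22 during the first learning operation.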
- FIG. 14 is a diagram illustrating an example of the second learning operation of the recognition model M2.
- the learning unit 290 inputs time-series frames included in the learning video to a plurality of cells M21 of the recognition model M2. Frame skipping has not occurred in the input time series frames.
- the learning unit 290 then learns the recognition model M2 by comparing the output result of the recognition model M2 with the correct data.
- the learning unit 290 inputs the time-series frames included in the learning video to the plurality of cells M21 of the learned recognition model M2. Frame skipping has not occurred in the input time series frames.
- The learning unit 290 acquires a data set consisting of the state vector output from the cell M21 at time t and the state vector output from the cell M21 at time t+N (N is a natural number), and trains the state predictor M22 using the acquired data set as learning data. Specifically, the learning unit 290 trains the state predictor M22 by regression so that its output, when the state vector at time t and N are input, approaches the state vector at time t+N.
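- The second learning operation can be sketched as building (state at t, N, state at t+N) triples from a skip-free run and fitting a regressor to them. The disclosure only states that regression is used; the per-N scalar gain fitted by least squares below is an assumption chosen to keep the sketch short, and `build_pairs` and `fit_gain_per_n` are hypothetical names.

```python
def build_pairs(states, max_n):
    """Collect (state_t, N, state_t_plus_N) triples from a skip-free run."""
    pairs = []
    for t in range(len(states)):
        for n in range(1, max_n + 1):
            if t + n < len(states):
                pairs.append((states[t], n, states[t + n]))
    return pairs


def fit_gain_per_n(pairs):
    """Least-squares fit of a scalar gain g_N with state_{t+N} ≈ g_N * state_t.

    For each N, g_N minimizes the summed squared error over all triples;
    the closed form is sum(h_t . h_{t+N}) / sum(h_t . h_t).
    """
    numerator, denominator = {}, {}
    for h_t, n, h_tn in pairs:
        numerator[n] = numerator.get(n, 0.0) + sum(a * b for a, b in zip(h_t, h_tn))
        denominator[n] = denominator.get(n, 0.0) + sum(a * a for a in h_t)
    return {n: numerator[n] / denominator[n] for n in numerator if denominator[n] > 0.0}


# State vectors output by the trained cells for skip-free frames (toy values).
states = [[1.0, 2.0], [0.5, 1.0], [0.25, 0.5], [0.125, 0.25]]
gains = fit_gain_per_n(build_pairs(states, max_n=2))
print(gains)  # {1: 0.5, 2: 0.25}
```

A practical state predictor would more likely be a small neural network taking the state vector and N as inputs; the fitting target, the state vector at time t+N, stays the same.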
- FIG. 15 is a diagram showing another example of the recognition operation of the trained recognition model M2 according to the second embodiment.
- the trained recognition model M2 is a recurrent neural network (RNN) and includes a plurality of time-series RNN cells M21.
- cell M21 receives input of a frame and a state vector output by cell M21 at a previous time, and outputs the state vector to cell M21 at a later time.
- the initial state of the state vector may be such that all elements are 0, for example.
- a state predictor M23 is inserted between a predetermined cell M21 and a cell M21 at a previous time.
- The state predictor M23 receives input of the state vector output by the cell M21 at the previous time, the time difference information ΔT corresponding to the frame input to the cell M21 at the predetermined time, and a motion vector.
- a motion vector is information obtained by vectorizing the difference between a frame at a predetermined time and a frame at a previous time, that is, motion.
- State predictor M23 predicts a state vector and outputs the predicted state vector to cell M21 at a predetermined time.
- a frame skip occurs between the frame input to the cell M21 at time t+1 and the frame input to the cell M21 at time t.
- the learned recognition model M2 inserts a state predictor M23 between the cell M21 at time t+1 and the cell M21 at time t.
- The state predictor M23 receives input of a motion vector as well as the state vector output by the cell M21 at time t and the time difference information ΔT corresponding to the frame input to the cell M21 at time t+1.
- the input motion vector represents the difference between the frame input to the cell M21 at time t and the frame input to the cell M21 at time t+1, that is, the motion.
- State predictor M23 predicts a state vector and outputs the predicted state vector to cell M21 at time t+1.
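- The inputs to the state predictor M23 can be sketched as follows. The disclosure describes the motion vector only as the vectorized difference between two frames; the per-pixel subtraction, the flat concatenation of inputs, and the names `motion_vector` and `predictor_input` are assumptions made for this illustration.

```python
def motion_vector(prev_frame, cur_frame):
    """Vectorize the difference between two same-sized grayscale frames.

    Frames are lists of rows of pixel values; the result is the flattened
    per-pixel difference (current minus previous).
    """
    if len(prev_frame) != len(cur_frame):
        raise ValueError("frames must have the same number of rows")
    vector = []
    for prev_row, cur_row in zip(prev_frame, cur_frame):
        if len(prev_row) != len(cur_row):
            raise ValueError("rows must have the same length")
        vector.extend(c - p for p, c in zip(prev_row, cur_row))
    return vector


def predictor_input(state, delta_t, motion):
    """Concatenate state vector, ΔT and motion vector into one input list."""
    return list(state) + [float(delta_t)] + list(motion)


frame_t = [[10, 10], [10, 10]]   # frame input to the cell at time t
frame_t1 = [[10, 12], [9, 10]]   # frame input to the cell at time t+1
motion = motion_vector(frame_t, frame_t1)
print(motion)                                # [0, 2, -1, 0]
print(predictor_input([0.3, -0.1], 2, motion))  # [0.3, -0.1, 2.0, 0, 2, -1, 0]
```

Supplying the motion alongside ΔT gives the predictor information about what changed across the skipped frames, not only how many were skipped.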
- The learning unit 290 inputs time-series frames included in the learning video, the time difference information ΔT (ΔT ≥ 1) corresponding to the frames, and a motion vector to the recognition model M2.
- the learning unit 290 learns the recognition model M2 with the state predictor M22 inserted by comparing the output result of the recognition model M2 with the state predictor M22 inserted with the correct data. Note that the learning unit 290 may perform learning of the state predictor M22 separately from learning of the recognition model M2.
- The trained recognition model M2 of the video processing system 2 inserts the state predictor M22 or M23 between the cells M21 when a frame skip occurs, and predicts the state vector.
- The trained recognition model M1 can improve object recognition accuracy by reflecting frame time difference information in object recognition, taking into account cases where frames are skipped due to, for example, changes in the video frame rate.
- Each configuration in the embodiments described above is configured by hardware, software, or both, and may be configured from one piece of hardware or software, or from multiple pieces of hardware or software.
- Each device and each function (processing) may be realized by a computer 1000 having a processor 1001 such as a CPU (Central Processing Unit) and a memory 1002 as a storage device, as shown in FIG.
- a program for performing the method (video processing method) in the embodiment may be stored in the memory 1002, and each function may be realized by having the processor 1001 execute the program stored in the memory 1002.
- These programs include instructions (or software code) that, when loaded into a computer, cause the computer to perform one or more of the functions described in the embodiments.
- the program may be stored on a non-transitory computer readable medium or a tangible storage medium.
- Computer-readable or tangible storage media may include random-access memory (RAM), read-only memory (ROM), flash memory, solid-state drive (SSD) or other memory technology, CD-ROM, digital versatile disc (DVD), Blu-ray disc or other optical disc storage, and magnetic cassette, magnetic tape, magnetic disc storage or other magnetic storage devices.
- the program may be transmitted on a transitory computer-readable medium or a communication medium.
- Transitory computer-readable or communication media include electrical, optical, acoustic, or other forms of propagating signals.
- a video acquisition means for acquiring an input video
- time difference information acquisition means for acquiring first time difference information between frames of the input video
- a video processing system comprising: recognition means for recognizing an object in an input video.
- The video processing system according to appendix 1, wherein the trained recognition model is a model including a plurality of cells of a recurrent neural network (RNN) that receive time-series frames included in the input video, and the plurality of cells receive parameters corresponding to the first time difference information between frames of the input video.
- The video processing system according to appendix 2, wherein the plurality of cells receive a parameter obtained by decoding the first time difference information between frames of the input video.
- The video processing system according to appendix 1, wherein the trained recognition model includes a plurality of cells of a recurrent neural network that receive time-series frames included in the input video and input and output state vectors in time series, and is a model in which a state predictor that predicts a state vector based on the first time difference information between frames of the input video is inserted between predetermined cells.
- The video processing system according to appendix 4, wherein the trained recognition model into which the state predictor is inserted is trained using time-series frames, included in the training video, in which frame skipping occurs in a predetermined pattern, second time difference information between frames of the training video, and correct answer data.
- The video processing system according to appendix 4, wherein the plurality of cells of the trained recognition model are trained using correct answer data and time-series frames, included in the training video, in which no frame skipping occurs, and the state predictor inserted into the trained recognition model is trained using the state vector output by the plurality of cells at time t (t is a natural number) and the state vector output at time t+N (N is a natural number) when time-series frames, included in the training video, in which no frame skipping occurs are input to the plurality of trained cells.
- the recognition means is The trained recognition model trained using the second time difference information between the frames of the training video and the movement between the frames of the training video is added to the trained recognition model using the second time difference information between the frames of the training video.
- The video processing device according to appendix 8, wherein the trained recognition model is a model including a plurality of cells of a recurrent neural network (RNN) that receive time-series frames included in the input video, and the plurality of cells receive parameters corresponding to the first time difference information between frames of the input video.
- The video processing device according to appendix 9, wherein the plurality of cells receive a parameter obtained by decoding the first time difference information between frames of the input video.
- The video processing device according to appendix 8, wherein the trained recognition model includes a plurality of cells of a recurrent neural network that receive time-series frames included in the input video and input and output state vectors in time series, and is a model in which a state predictor that predicts a state vector based on the first time difference information between frames of the input video is inserted between predetermined cells.
- The video processing device according to appendix 11, wherein the trained recognition model into which the state predictor is inserted is trained using time-series frames, included in the training video, in which frame skipping occurs in a predetermined pattern, second time difference information between frames of the training video, and correct answer data.
- The plurality of cells of the trained recognition model are trained using correct answer data and time-series frames, included in the training video, in which no frame skipping occurs,
- the state predictor inserted into the learned recognition model is State vectors and times that are output by the plurality of cells at time t (t is a natural number) when time-series frames that are included in the training video and do not have frame skips are input to the plurality of learned cells.
- The video processing device according to appendix 8, wherein the recognition means inputs the first time difference information and the movement between frames of the input video into the trained recognition model, which was trained using the second time difference information between frames of the training video and the movement between frames of the training video, and recognizes an object in the input video.
- A video processing method in which a computer: acquires an input video; acquires first time difference information between frames of the input video; and inputs the input video and the first time difference information between frames of the input video into a trained recognition model trained using a training video and second time difference information between frames of the training video, to recognize an object in the input video.
- The video processing method according to appendix 15, wherein the trained recognition model is a model including a plurality of cells of a recurrent neural network (RNN) that receive time-series frames included in the input video, and the plurality of cells receive parameters corresponding to the first time difference information between frames of the input video.
- The video processing method according to appendix 16, wherein the plurality of cells receive a parameter obtained by decoding the first time difference information between frames of the input video.
- The video processing method according to appendix 15, wherein the trained recognition model includes a plurality of cells of a recurrent neural network that receive time-series frames included in the input video and input and output state vectors in time series, and is a model in which a state predictor that predicts a state vector based on the first time difference information between frames of the input video is inserted between predetermined cells.
- The video processing method according to appendix 18, wherein the trained recognition model into which the state predictor is inserted is trained using time-series frames, included in the training video, in which frame skipping occurs in a predetermined pattern, second time difference information between frames of the training video, and correct answer data.
- The plurality of cells of the trained recognition model are trained using correct answer data and time-series frames, included in the training video, in which no frame skipping occurs, and the state predictor inserted into the trained recognition model is trained using the state vector output by the plurality of cells at time t (t is a natural number) and the state vector output at time t+N (N is a natural number) when time-series frames, included in the learning video, in which no frame skipping occurs are input to the plurality of trained cells.
- the computer is The trained recognition model trained using the second time difference information between the frames of the training video and the movement between the frames of the training video is added to 15.
- Video processing system
- 11 Video acquisition section
- 12 Time difference information acquisition section
- 13 Recognition section
- 20 Video processing device
- 100 Terminal
- 101 Camera
- 102 Compression efficiency optimization function
- 110 Video acquisition section
- 120 Frame filter section
- 130 Encoding section
- 140
- 200 Center server
- 201 Video recognition function
- 202 Alert generation function
- 203 GUI drawing function
- 204 Screen display function
- 210 Center communication unit
- 220 Decoding unit
- 230 Time difference information acquisition unit
- 240, 270 Recognition unit
- 250, 280 Storage unit
- 260, 290 Learning unit
- 300 Base station
- 401 Compression bit rate control function
- 1000 Computer
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2024546614A JPWO2024057469A1 | 2022-09-15 | 2022-09-15 | |
| US18/857,236 US20250259432A1 (en) | 2022-09-15 | 2022-09-15 | Video processing system, video processing apparatus, and video processing method |
| PCT/JP2022/034510 WO2024057469A1 (ja) | 2022-09-15 | 2022-09-15 | Video processing system, video processing apparatus, and video processing method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/034510 WO2024057469A1 (ja) | 2022-09-15 | 2022-09-15 | Video processing system, video processing apparatus, and video processing method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024057469A1 (ja) | 2024-03-21 |
Family
ID=90274577
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/034510 Ceased WO2024057469A1 (ja) | 2022-09-15 | 2022-09-15 | Video processing system, video processing apparatus, and video processing method |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20250259432A1 |
| JP (1) | JPWO2024057469A1 |
| WO (1) | WO2024057469A1 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
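| JP2012073971A (ja) * | 2010-09-30 | 2012-04-12 | Fujifilm Corp | Moving image object detection device, method, and program |
| JP2018005638A (ja) * | 2016-07-04 | 2018-01-11 | Nippon Telegraph and Telephone Corporation | Video recognition model learning device, video recognition device, method, and program |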
| US20180308253A1 (en) * | 2017-04-25 | 2018-10-25 | Samsung Electronics Co., Ltd. | Method and system for time alignment calibration, event annotation and/or database generation |
-
2022
- 2022-09-15 JP JP2024546614A patent/JPWO2024057469A1/ja active Pending
- 2022-09-15 US US18/857,236 patent/US20250259432A1/en active Pending
- 2022-09-15 WO PCT/JP2022/034510 patent/WO2024057469A1/ja not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| US20250259432A1 (en) | 2025-08-14 |
| JPWO2024057469A1 | 2024-03-21 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22958793; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 18857236; Country of ref document: US |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024546614; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 18857236; Country of ref document: US |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22958793; Country of ref document: EP; Kind code of ref document: A1 |