WO2023249034A1 - Image processing method, computer program, and image processing device - Google Patents
Image processing method, computer program, and image processing device
- Publication number
- WO2023249034A1 (PCT/JP2023/022855)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- scene change
- change position
- frames
- candidate
- moving image
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/254—Analysis of motion involving subtraction of images
Definitions
- the present invention relates to an image processing method, a computer program, and an image processing device that perform image processing on moving images.
- Image processing techniques that detect changes in scenes (scenes, shots, etc.) in moving images are widely used.
- for example, pixel values of chronologically consecutive frames among the plurality of frames constituting a moving image are compared, and whether or not a scene change has occurred is determined based on the comparison result.
- methods have also been developed that use learning models that have been machine learned in advance to detect scene changes.
- Patent Document 1 proposes a method of detecting scene changes within a video, detecting fast-moving scenes by counting the number of consecutive scene changes, and selecting key frames from the scenes.
- the present invention has been made in view of the above circumstances, and its purpose is to provide an image processing method, a computer program, and an image processing device that can be expected to accurately detect scene changes from moving images.
- in the image processing method according to the first aspect, an image processing device detects scene change position candidates from a moving image, detects candidate frames that can serve as main frames of a scene from among the frames constituting the moving image, determines a main frame from among the candidate frames, and determines a scene change position from among the scene change position candidates based on the chronological order of the scene change position candidates and the main frames.
- the image processing method according to the second aspect is the image processing method according to the first aspect, in which, when no main frame exists between two scene change position candidates arranged in chronological order, the scene change position is determined by excluding one of the two scene change position candidates.
- the image processing method according to the third aspect is the image processing method according to the first aspect or the second aspect, in which, when a scene change position candidate exists between two similar main frames arranged in chronological order, the scene change position is determined by excluding that candidate.
- the image processing method according to the fourth aspect is the image processing method according to any one of the first to third aspects, in which a statistical value is calculated for each frame, and a scene change position candidate is detected based on the difference between the statistical values of two frames.
- the image processing method according to the fifth aspect is the image processing method according to any one of the first to fourth aspects, in which a hash value is calculated for each frame, and a scene change position candidate is detected based on the difference between the hash values of two frames.
- the image processing method according to the sixth aspect is the image processing method according to any one of the first to fifth aspects, in which edges are extracted from each frame, and candidate frames are detected based on the change in edges between two frames.
- the image processing method according to the seventh aspect is the image processing method according to any one of the first to sixth aspects, in which feature points are extracted from the candidate frames, and the main frame is determined from among the candidate frames by excluding candidate frames based on the result of comparing feature points among a plurality of candidate frames.
- the image processing method according to the eighth aspect is the image processing method according to any one of the first to seventh aspects, in which information regarding the determined scene change position and main frame is stored in association with the moving image, a selection of a scene change position or main frame is accepted, and the moving image is reproduced based on the selected scene change position or main frame.
- the image processing method according to the ninth aspect is the image processing method according to the eighth aspect, in which information regarding the determined scene change position and main frame and text information regarding the moving image are stored in association with each other.
- the image processing method according to the tenth aspect is the image processing method according to the ninth aspect, in which a moving image of construction or repair work on air conditioning-related equipment is acquired, a scene change position and a main frame are determined for the acquired moving image, and information regarding the determined scene change position and main frame is stored in association with text information regarding the air conditioning-related equipment.
- the image processing method according to the eleventh aspect is the image processing method according to any one of the first to tenth aspects, in which partial moving images are extracted based on the determined scene change positions and main frames, and the extracted partial moving images are combined to generate a summary moving image.
- the image processing method according to the twelfth aspect is the image processing method according to any one of the first to eleventh aspects, in which the determined main frame is input into a learning model that classifies the type of construction or repair based on the input of a main frame of a moving image capturing construction or repair work on air conditioning-related equipment, the classification result output by the learning model is obtained, and the title of the moving image or of a scene included in the moving image is determined based on the obtained classification result.
- a computer program causes a computer to execute processing of detecting scene change position candidates from a moving image, detecting candidate frames that can serve as main frames of a scene from among the frames constituting the moving image, determining a main frame from among the candidate frames, and determining a scene change position from among the scene change position candidates based on the chronological order of the scene change position candidates and the main frames.
- an image processing device includes a scene change position candidate detection unit that detects scene change position candidates from a moving image, a candidate frame detection unit that detects candidate frames that can serve as main frames of a scene from among the frames constituting the moving image, a main frame determination unit that determines a main frame from among the candidate frames, and a scene change position determination unit that determines a scene change position from among the scene change position candidates.
- scene changes can be detected accurately from moving images.
- FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to the present embodiment.
- FIG. 2 is a block diagram showing the configuration of a server device according to the present embodiment.
- FIG. 3 is a block diagram showing the configuration of a terminal device according to the present embodiment.
- FIG. 4 is a schematic diagram for explaining scene change position and key frame detection processing performed by the information processing system according to the present embodiment.
- FIG. 5 is a schematic diagram showing an example of an HSL histogram.
- FIG. 6 is a schematic diagram showing an example of a calculation result of the degree of difference between frames in a moving image.
- FIG. 7 is a schematic diagram showing an example of edge extraction.
- FIG. 8 is a schematic diagram showing an example of a calculation result of an edge change rate between frames in a moving image.
- FIG. 9 is a schematic diagram showing an example of key points extracted from candidate frames.
- FIG. 10 is a schematic diagram showing an example of key point matching results.
- FIG. 11 is a schematic diagram for explaining a method of determining a scene change position by the server device.
- FIGS. 12 and 13 are flowcharts showing the procedure of the processing performed by the server device in the present embodiment.
- FIG. 14 is a schematic diagram showing an example of a playback screen displayed by a terminal device.
- FIG. 15 is a schematic diagram for explaining a learning model used by the server device according to the present embodiment.
- FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to this embodiment.
- a worker 102 who performs work such as installation or repair of an air conditioner 101 photographs the progress of the work using a camera 103 mounted on a headset or the like worn on his or her head.
- in this embodiment, photography is performed using the camera 103 mounted on a wearable device such as a headset worn by the worker 102; however, the invention is not limited to this, and a camera 103 may be installed around the air conditioning equipment 101 and the worker 102 to photograph the work.
- the air conditioning equipment 101 is not limited to an air conditioner, and may be various air conditioning-related equipment, such as an outdoor unit of an air conditioner, a ventilation device, a circulator, an air purifier, a heater, or a dehumidifying dryer. Further, the camera 103 may photograph work such as construction or repair of various equipment other than air conditioning-related equipment, and may photograph various kinds of work other than construction or repair of such equipment.
- the moving image taken by the camera 103 is provided to the server device 1.
- the server device 1 acquires moving images photographed by one or more workers, and stores the acquired moving images in a database.
- as the method for providing moving images from the camera 103 to the server device 1, for example, if the camera 103 is equipped with a communication function, a method may be adopted in which the moving images are transmitted directly from the camera 103 to the server device 1 through wired or wireless communication.
- alternatively, the camera 103 may record a moving image on a recording medium such as a memory card or an optical disk, and the moving image may be provided from the camera 103 to the server device 1 via the recording medium.
- a terminal device such as a PC (personal computer) or a smartphone may also be interposed between the camera 103 and the server device 1, with the terminal device acquiring a moving image from the camera 103 and transmitting it to the server device 1; any method may be used to provide moving images from the camera 103 to the server device 1.
- the server device 1 can communicate with one or more terminal devices 3 via a network such as a LAN (Local Area Network) or the Internet.
- the terminal device 3 is a general-purpose information processing device such as a PC or a smartphone, and in this embodiment it is used by an unskilled user who is learning work such as construction or repair of the air conditioning equipment 101 to view videos of the work performed by skilled workers.
- based on a request from the terminal device 3, the server device 1 acquires a desired moving image from among the plurality of moving images stored in the database and transmits it to the terminal device 3.
- the terminal device 3 displays (plays) the moving image received from the server device 1.
- the server device 1 detects scenes (scenes, shots, etc.) and key frames (main frames) from the video image acquired from the camera 103, and stores information regarding these detection results together with the video image.
- a moving image obtained by shooting with the camera 103 is composed of roughly several dozen frames (still images) per second, and a key frame is a frame that, among these frames, holds important information about the scene.
- the server device 1 transmits information regarding scenes and key frames to the terminal device 3 along with the moving image.
- the terminal device 3 receives the information about scenes and key frames along with the moving image from the server device 1, and, for example, when playing back the moving image, can accept a selection of a scene or key frame from the user and start playback of the moving image from the selected scene or key frame.
- the server device 1 first performs a process of detecting scene change position candidates from the moving image captured by the camera 103.
- the server device 1 also performs a process of detecting candidate frames that can become key frames from the moving images captured by the camera 103.
- the server device 1 performs a process of determining a key frame by, for example, excluding similar candidate frames from among the plurality of candidate frames detected from the moving image.
- the server device 1 determines whether or not a key frame exists between two consecutive scene change position candidates among the plurality of scene change position candidates detected from the moving image, and based on this, performs processing to determine the scene change positions in the moving image.
- the server device 1 can be expected to accurately detect scene change positions and key frames suitable for moving images.
- FIG. 2 is a block diagram showing the configuration of the server device 1 according to this embodiment.
- the server device 1 according to the present embodiment includes a processing section 11, a storage section 12, a communication section (transceiver) 13, and the like. Note that although the present embodiment will be described assuming that the processing is performed by one server device, the processing may be performed in a distributed manner by a plurality of server devices.
- the processing unit 11 is configured using an arithmetic processing device such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), a GPU (Graphics Processing Unit), or a quantum processor, together with a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The processing unit 11 reads out and executes the server program 12a stored in the storage unit 12, thereby performing various processes such as detecting scene change positions and key frames from the moving images taken by the camera 103 and providing the moving images stored in the database to the terminal device 3.
- the storage unit 12 is configured using a large-capacity storage device such as a hard disk, for example.
- the storage unit 12 stores various programs executed by the processing unit 11 and various data necessary for processing by the processing unit 11.
- the storage unit 12 stores a server program 12a executed by the processing unit 11.
- the storage unit 12 is also provided with a moving image DB (database) 12b that stores moving images captured by the camera 103.
- the server program (program product) 12a is provided in a form recorded on a recording medium 99 such as a memory card or an optical disk, and the server device 1 reads the server program 12a from the recording medium 99 and stores it in the storage unit 12.
- the server program 12a may be written into the storage unit 12, for example, during the manufacturing stage of the server device 1.
- the server program 12a may be distributed by another remote server device, and the server device 1 may obtain it through communication.
- the server program 12a may be recorded on the recording medium 99 by a writing device and written into the storage unit 12 of the server device 1.
- the server program 12a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 99.
- the moving image DB 12b is a database that stores and accumulates moving images captured by the camera 103.
- the moving image DB 12b also stores information regarding scene change positions and key frames detected from the moving images in association with these moving images.
- the communication unit 13 communicates with various devices via a network N including a mobile phone communication network, a wireless LAN (Local Area Network), the Internet, and the like.
- the communication unit 13 communicates with one or more terminal devices 3 and the camera 103 via the network N.
- the communication unit 13 transmits data provided from the processing unit 11 to other devices, and also provides data received from other devices to the processing unit 11.
- the storage unit 12 may be an external storage device connected to the server device 1.
- the server device 1 may be a multicomputer including a plurality of computers, or may be a virtual machine virtually constructed by software.
- the server device 1 is not limited to the above configuration, and may include, for example, a reading unit that reads information stored in a portable storage medium, an input unit that accepts operation input, and a display unit that displays images.
- in the server device 1, a scene change position candidate detection section 11a, a candidate frame detection section 11b, a key frame determination unit 11c, a scene change position determination unit 11d, a DB processing unit 11e, and the like are implemented in the processing unit 11 as software-based functional units.
- functional units related to moving images are illustrated as functional units of the processing unit 11, and functional units related to other processes are omitted.
- the scene change position candidate detection unit 11a performs processing to detect candidates for positions where the scene (scene, shot, etc.) changes from the moving image captured by the camera 103.
- for example, the scene change position candidate detection unit 11a compares two chronologically consecutive frames among the frames constituting a moving image and calculates a value indicating the difference between the two frames. When this calculated value exceeds a predetermined threshold, the scene change position candidate detection unit 11a determines that a scene change has occurred in the moving image, and treats the earlier of these two frames as the last frame of one scene and the later frame as the first frame of the next scene.
- the scene change position candidate detection unit 11a sets the position of the first (or last) frame of a scene as a scene change position candidate of the moving image, and stores information indicating this position (such as the time or frame number from the beginning of the moving image).
- the scene change position candidate detection unit 11a calculates two types of values as values indicating the difference between two frames.
- the first value indicating the difference between the two frames is the "Bhattacharyya Distance."
- the Bhattacharyya distance is one of the measures for determining the distance between two probability distributions.
- the scene change position candidate detection unit 11a creates HSL (hue, saturation, brightness) histograms for a plurality of pixel values included in the frames for each of the two frames to be compared.
- the scene change position candidate detection unit 11a calculates the Bhattacharyya distance indicating the difference between the two HSL histograms created for the two frames.
- the scene change position candidate detection unit 11a can select the positions of these two frames as scene change position candidates when the calculated Bhattacharyya distance exceeds a predetermined threshold.
- the second value indicating the difference between the two frames is the "pHash (Perceptual Hash) distance."
- a hash value is a value of a predetermined length (for example, 64 bits or 256 bits) calculated by performing predetermined arithmetic processing on input information, and pHash is a hash value that reflects the characteristics of the input image.
- the scene change position candidate detection unit 11a calculates pHash from each of the two frames, and calculates the distance (for example, Hamming distance) between the two calculated pHash.
- the scene change position candidate detection unit 11a can determine the positions of these two frames as scene change position candidates when the calculated pHash distance exceeds a predetermined threshold.
- the scene change position candidate detection unit 11a detects scene change position candidates based on the Bhattacharyya distance and scene change position candidates based on the pHash distance, and can use the candidates detected by at least one of these methods as the final scene change position candidates.
- the scene change position candidate detection unit 11a may use the scene change position candidates detected as scene change position candidates by both methods as the final scene change position candidates.
- the scene change position candidate detection unit 11a may detect scene change position candidates based only on the Bhattacharyya distance or only on the pHash distance, or may detect scene change position candidates using other methods.
- alternatively, the scene change position candidate detection unit 11a may calculate the average value or total value of the calculated Bhattacharyya distance and pHash distance as a degree of difference, and set the position of the two frames as a scene change position candidate when the calculated degree of difference exceeds a threshold value.
- the candidate frame detection unit 11b performs a process of detecting candidate frames that can become key frames from the moving image captured by the camera 103.
- the candidate frame detection unit 11b performs processing to extract edges for each frame constituting the moving image, compares the edges of two chronologically consecutive frames, and calculates the rate of change of the edges between the two frames. If the calculated edge change rate is smaller than a predetermined threshold (that is, if the edge change is small), the candidate frame detection unit 11b sets the chronologically earlier frame (or the later frame) of these two frames as a candidate frame that can become a key frame.
- the key frame determination unit 11c performs a process of determining the key frames of a moving image from among the candidate frames detected by the candidate frame detection unit 11b.
- the key frame determining unit 11c extracts feature amounts from the plurality of candidate frames detected from a moving image and compares the feature amounts to search for similar candidate frames among them.
- the key frame determination unit 11c leaves any one candidate frame as the final key frame from among the plurality of similar candidate frames, and excludes the other candidate frames. For example, when there are two similar candidate frames, the key frame determining unit 11c can leave the earlier candidate frame in chronological order and exclude the later candidate frame. For example, when there are three similar candidate frames, the key frame determining unit 11c can leave the chronologically middle candidate frame and exclude the preceding and succeeding candidate frames.
- in this embodiment, the key frame determining unit 11c extracts key points using ORB (Oriented FAST and Rotated BRIEF) as the feature amounts of each candidate frame. For example, the key frame determining unit 11c performs matching of the key points extracted from two candidate frames, calculates a value such as the number or proportion of matching key points between the two candidate frames, and determines whether the two candidate frames are similar based on whether this value exceeds a threshold value. Note that the key frame determining unit 11c may extract feature amounts other than ORB key points to determine whether candidate frames are similar.
- the scene change position determination unit 11d performs processing to determine the final scene change positions from among the scene change position candidates detected by the scene change position candidate detection unit 11a, based on the key frames determined by the key frame determination unit 11c.
- the scene change position determination unit 11d examines the chronological order relationship between the scene change position candidates and the key frames, and determines whether or not a key frame exists between two chronologically adjacent scene change position candidates.
- in the present embodiment, the condition is that a scene delimited by scene change positions (that is, the portion of the moving image from the earlier scene change position to the later scene change position) includes at least one key frame.
- if no key frame exists between two scene change position candidates, the scene change position determination unit 11d determines that at least one of these scene change position candidates is inappropriate and excludes it.
- the scene change position determination unit 11d repeatedly performs the above process on all scene change position candidates detected from the moving image, excludes inappropriate candidates, and determines the remaining candidates as the final scene change positions.
- alternatively, the scene change position determination unit 11d may determine the scene change position candidates detected by the scene change position candidate detection unit 11a as the final scene change positions. In this case, if a key frame does not exist between two scene change positions, the scene change position determining unit 11d excludes the portion of the moving image (that is, the scene) between the two scene change positions from the entire moving image. Note that the exclusion of a scene may be performed, for example, by removing the data of this scene from the moving image data to generate moving image data with a shortened playback time, or by removing information regarding the relevant scene from the scene configuration information held for the moving image without changing the moving image data itself.
- the DB processing unit 11e stores the moving image captured by the camera 103 in the moving image DB 12b of the storage unit 12 in association with information regarding the scene change positions determined by the scene change position determining unit 11d and the key frames determined by the key frame determining unit 11c for this moving image.
- the DB processing unit 11e also receives a request to play back a moving image from the terminal device 3, reads the data of the requested moving image from the moving image DB 12b, and transmits the read moving image and the information regarding the scene change positions and key frames associated with this moving image to the requesting terminal device 3.
- FIG. 3 is a block diagram showing the configuration of the terminal device 3 according to the present embodiment.
- the terminal device 3 according to the present embodiment includes a processing section 31, a storage section 32, a communication section (transceiver) 33, a display section 34, an operation section 35, and the like.
- the terminal device 3 is a device used by, for example, an unskilled user who learns techniques such as construction or repair of the air conditioning equipment 101, and can be configured using an information processing device such as a smartphone, a tablet terminal device, or a personal computer.
- the processing unit 31 is configured using an arithmetic processing device such as a CPU or MPU, a ROM, a RAM, and the like.
- the processing unit 31 reads and executes the program 32a stored in the storage unit 32, thereby performing processes such as searching for moving images stored in the moving image DB 12b of the server device 1 and displaying (playing back) these moving images.
- the storage unit 32 is configured using, for example, a nonvolatile memory element such as a flash memory or a storage device such as a hard disk.
- the storage unit 32 stores various programs executed by the processing unit 31 and various data necessary for processing by the processing unit 31.
- the storage unit 32 stores a program 32a executed by the processing unit 31.
- the program 32a is distributed by a remote server device or the like, and the terminal device 3 acquires it through communication and stores it in the storage unit 32.
- the program 32a may be written into the storage unit 32, for example, during the manufacturing stage of the terminal device 3.
- the program 32a recorded on a recording medium 98 such as a memory card or an optical disk may also be read by the terminal device 3 and stored in the storage unit 32.
- the program 32a may be recorded on the recording medium 98 by a writing device and written into the storage unit 32 of the terminal device 3.
- the program 32a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 98.
- the communication unit 33 communicates with various devices via a network N including a mobile phone communication network, wireless LAN, the Internet, and the like.
- the communication unit 33 communicates with the server device 1 via the network N.
- the communication unit 33 transmits data provided from the processing unit 31 to other devices, and also provides data received from other devices to the processing unit 31.
- the display unit 34 is configured using a liquid crystal display or the like, and displays various images, characters, etc. based on the processing of the processing unit 31.
- the operation unit 35 accepts user operations and notifies the processing unit 31 of the accepted operations.
- the operation unit 35 receives user operations using input devices such as mechanical buttons or a touch panel provided on the surface of the display unit 34.
- the operation unit 35 may be an input device such as a mouse and a keyboard, and these input devices may be configured to be detachable from the terminal device 3.
- in the terminal device 3, a search processing section 31a, a display processing section 31b, and the like are implemented as software-based functional units by the processing section 31 reading out and executing the program 32a stored in the storage section 32.
- the program 32a may be a program dedicated to the information processing system according to this embodiment, or may be a general-purpose program such as an Internet browser or a web browser.
- the search processing unit 31a performs search processing on a large number of moving images stored in the moving image DB 12b of the server device 1.
- the search processing unit 31a receives input of various search conditions from the user, and transmits the received search conditions to the server device 1.
- the server device 1, which has received the search conditions from the terminal device 3, extracts moving images corresponding to the search conditions from the moving image DB 12b, and transmits list information of the extracted moving images and the like to the terminal device 3 as search results.
- the terminal device 3 receives and displays the search results from the server device 1, and the search processing unit 31a accepts a selection of a moving image to be played back from the user based on the search results and requests the selected moving image from the server device 1.
- the display processing unit 31b performs display processing such as displaying a screen that accepts input of search conditions, displaying information transmitted as search results from the server device 1, and playing back (displaying) moving images.
- the server device 1 reads the requested moving image and various information related to this moving image (for example, scene change position and key frame information) from the moving image DB 12b, and transmits them to the requesting terminal device 3.
- the display processing unit 31b of the terminal device 3 that has received the moving image from the server device 1 plays back the received moving image and displays it on the display unit 34.
- the display processing unit 31b receives the information transmitted together with the moving image, and based on the scene change position and key frame information included in the received information, may perform processing such as skipping the playback position of the moving image to a scene change position or key frame specified by the user.
- FIG. 4 is a schematic diagram for explaining scene change position and key frame detection processing performed by the information processing system according to the present embodiment.
- the moving image handled by the information processing system according to the present embodiment is, for example, a series of about several dozen frames (still images) per second.
- a moving image can also be divided into multiple scenes.
- a scene can be called a scene or a shot in video production, for example, and is a unit that defines one section of the action of a person, object, etc. captured in a moving image.
- one scene is a series of multiple frames including at least one key frame, as shown in the lower part of FIG. 4, and its first frame and last frame are treated as scene change positions.
- the server device 1 of the information processing system acquires a moving image, taken by the camera 103, of work such as construction or repair of the air conditioner 101 via communication or a recording medium, and stores the acquired moving image in the moving image DB 12b. At this time, the server device 1 performs processing to detect the scene change positions and key frames of the acquired moving image, and stores information regarding the detected scene change positions and key frames in the moving image DB 12b in association with the moving image.
- for the moving image acquired from the camera 103, the server device 1 first performs a process of detecting scene change position candidates included in the moving image.
- the server device 1 calculates a value indicating the difference between two chronologically consecutive frames for the plurality of frames constituting the moving image, and determines whether the calculated value exceeds a predetermined threshold. If the value indicating the difference between frames exceeds the threshold, the server device 1 selects the position of these two consecutive frames as a scene change position candidate.
- the server device 1 calculates two values, the Bhattacharyya distance of the HSL histogram and the distance of the hash value, as values indicating the difference between frames.
- FIG. 5 is a schematic diagram showing an example of an HSL histogram, showing the distribution of the number of pixels with respect to the values of H (hue), S (saturation), and L (luminance) for one sample frame.
- in FIG. 5, the horizontal axis represents the HSL value and the vertical axis represents the number of pixels; the distribution of H is shown by a solid line, that of S by a broken line, and that of L by a chain line.
- the server device 1 calculates an HSL histogram for each frame making up a moving image.
- the server device 1 can calculate the HSL histogram by converting the pixel values of a frame given as, for example, RGB values into HSL values, and counting the number of pixels included in the frame for each HSL value.
- the server device 1 calculates HSL histograms for all frames included in the moving image, and calculates a value indicating the difference in the HSL histogram between each frame and the previous chronologically continuous frame.
- the server device 1 calculates the Bhattacharyya distance as a value indicating the difference in HSL histograms. Note that the method for calculating the Bhattacharyya distance is an existing technique, so a detailed explanation will be omitted.
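- as a non-authoritative illustration, the following Python sketch computes a per-frame HSL histogram with OpenCV and compares two histograms with the Bhattacharyya distance; OpenCV's HLS color space and the bin count are assumptions, since the patent names neither a library nor parameter values.

```python
import cv2
import numpy as np

def hsl_histogram(frame_bgr, bins=64):
    """Concatenated, normalized H/S/L histograms of one BGR frame.

    OpenCV's HLS conversion stands in for the patent's HSL (hue,
    saturation, brightness) histogram; the bin count is illustrative.
    """
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)
    hists = []
    for ch, upper in ((0, 180), (2, 256), (1, 256)):  # H, S, L channel indices
        hists.append(cv2.calcHist([hls], [ch], None, [bins], [0, upper]))
    hist = np.concatenate(hists).astype(np.float32)
    cv2.normalize(hist, hist)  # make frames of different sizes comparable
    return hist

def bhattacharyya(hist1, hist2):
    """Bhattacharyya distance between two histograms (0 means identical)."""
    return cv2.compareHist(hist1, hist2, cv2.HISTCMP_BHATTACHARYYA)
```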
- in this embodiment, the server device 1 calculates the HSL histogram of each frame and the Bhattacharyya distance between frames; however, the server device 1 is not limited to this, and may calculate a statistical value other than the HSL histogram, or may calculate a value other than the Bhattacharyya distance as the value indicating the difference between frames.
- the server device 1 calculates hash values, such as pHash, for all frames included in the moving image.
- pHash is a hash value calculated by frequency-converting an image using a discrete cosine transform or the like and extracting its low-frequency components.
- the server device 1 calculates the Hamming distance of pHash between each frame and the previous frame that is consecutive in time series as a value indicating the difference between the frames. pHash has a characteristic that the more similar two images are, the smaller the Hamming distance of pHash between the two images becomes. Note that the pHash and Hamming distance calculation methods are existing techniques, so detailed explanations will be omitted.
- in this embodiment, the server device 1 calculates the pHash of each frame and the Hamming distance of pHash between frames; however, the server device 1 is not limited to this, and may calculate a hash value other than pHash (for example, aHash (Average Hash)), or may calculate a value other than the Hamming distance as the value indicating the difference between frames.
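- similarly, a minimal pHash sketch following the DCT-based construction described above, reusing the imports from the previous sketch; the 32x32 resize and the 8x8 low-frequency block are common choices, not values given in the patent.

```python
def phash_bits(frame_bgr):
    """64-bit perceptual hash: signs of DCT low-frequency coefficients
    relative to their median, returned as a boolean vector."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32), interpolation=cv2.INTER_AREA)
    dct = cv2.dct(np.float32(small))   # frequency conversion
    low = dct[:8, :8].flatten()        # keep 64 low-frequency components
    return low > np.median(low)

def hamming(bits1, bits2):
    """Number of differing bits; similar frames yield a small distance."""
    return int(np.count_nonzero(bits1 != bits2))
```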
- in this embodiment, the server device 1 calculates two values, the Bhattacharyya distance of the HSL histogram and the Hamming distance of pHash, between each frame and the chronologically preceding frame, as values indicating the difference between the two frames.
- the server device 1 then calculates the average value (or a weighted average value) of the two values, the Bhattacharyya distance and the Hamming distance, and uses the calculated average value as the degree of difference between frames.
- FIG. 6 is a schematic diagram showing an example of a calculation result of the degree of dissimilarity between frames in a moving image, and is a graph showing a change in the degree of dissimilarity over the playback time of a moving image.
- the horizontal axis of the graph in FIG. 6 is time (playback time of a moving image), and the vertical axis is the degree of difference between frames.
- the server device 1 calculates the degree of difference from the previous frame for all frames included in the moving image, and determines whether this degree of difference exceeds a predetermined threshold.
- in FIG. 6, the waveform indicated by a solid line shows the change in the degree of difference, and the dashed horizontal line indicates the threshold value.
- the server device 1 sets the point in time at which the calculated degree of difference exceeds a predetermined threshold as a candidate for the scene change position.
- in FIG. 6, the dashed-dotted vertical lines indicate the points in time that the server device 1 selects as scene change position candidates.
- note that in this embodiment, the server device 1 is restricted from detecting a new scene change position candidate for a predetermined period after detecting a scene change position candidate. By setting such a restriction, detection of many candidates at similar scene change positions can be expected to be suppressed.
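- putting the two measures together, a hedged sketch of the candidate detection loop described above; the weights, threshold, and refractory period are placeholder values, and the helper functions are the sketches shown earlier.

```python
def detect_scene_change_candidates(frames, fps, threshold=0.5,
                                   w_hist=0.5, w_hash=0.5, refractory_sec=1.0):
    """Return frame indices whose weighted dissimilarity to the previous
    frame exceeds a threshold, suppressing hits inside a refractory window."""
    candidates, last_hit = [], None
    prev_hist = prev_bits = None
    for i, frame in enumerate(frames):
        hist, bits = hsl_histogram(frame), phash_bits(frame)
        if prev_hist is not None:
            d = (w_hist * bhattacharyya(prev_hist, hist)
                 + w_hash * hamming(prev_bits, bits) / 64.0)  # scale bits to [0, 1]
            recently = last_hit is not None and (i - last_hit) / fps < refractory_sec
            if d > threshold and not recently:
                candidates.append(i)  # frame i starts a new scene candidate
                last_hit = i
        prev_hist, prev_bits = hist, bits
    return candidates
```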
- the server device 1 performs a process of detecting candidate frames that are key frame candidates from all frames included in the moving image captured by the camera 103. Note that the candidate frame detection process may be performed before, after, or simultaneously with the scene change position detection process described above.
- the server device 1 according to the present embodiment performs image processing to extract edges for a plurality of frames constituting a moving image.
- FIG. 7 is a schematic diagram showing an example of edge extraction.
- FIG. 7 shows an example of an image (frame image) corresponding to one frame included in a moving image, together with a binary image (edge image) of the edges extracted from this frame image, in which pixels corresponding to edges are white and pixels other than edges are black.
- the server device 1 can extract edges from each frame image by performing edge detection processing (Canny Edge Detection) using the Canny method, for example, on each frame included in a moving image.
- image processing for extracting edges from an image is an existing technology, so a detailed explanation will be omitted. Note that the server device 1 may extract edges from frame images using any image processing method.
- the server device 1 that has extracted the edges of each frame included in the video image calculates the edge change rate by comparing the edges of the two chronologically consecutive frames.
- the server device 1 compares the two edge images extracted from two frame images, calculates, for example, the total number of edge pixels that changed to non-edge pixels and of non-edge pixels that changed to edge pixels, calculates the ratio of this total to the total number of pixels in one frame, and takes the calculated ratio as the edge change rate.
- the edge change rate may be any value as long as it is an index indicating how much the edges change between frames; the above calculation method is one example, and the server device 1 may calculate the edge change rate using any method.
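- a sketch of one possible edge change rate, using the Canny edge detection named above; the Canny thresholds are illustrative, and any edge extractor could be substituted.

```python
def edge_change_rate(prev_bgr, cur_bgr, lo=100, hi=200):
    """Fraction of pixels whose edge/non-edge status flips between frames."""
    e1 = cv2.Canny(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY), lo, hi)
    e2 = cv2.Canny(cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2GRAY), lo, hi)
    changed = np.count_nonzero(e1 != e2)  # edge->non-edge plus non-edge->edge
    return changed / e1.size
```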
- FIG. 8 is a schematic diagram showing an example of calculation results of the edge change rate between frames in a moving image, and is a graph showing changes in the edge change rate over the playback time of the moving image.
- the horizontal axis of the graph in FIG. 8 is time (playback time of a moving image), and the vertical axis is the edge change rate between frames.
- the server device 1 calculates the edge change rate by comparing all frames included in the video with the previous frame, and determines whether this edge change rate exceeds a predetermined threshold. In FIG. 8, the horizontal straight line indicates the threshold value.
- the server device 1 sets a frame at a point in time when the calculated edge change rate is less than the predetermined threshold as a candidate frame that can become a key frame.
- in FIG. 8, the vertical straight lines indicate the points in time of the frames that the server device 1 determines to be candidate frames.
- FIG. 8 shows that three candidate frames are detected in the first half of the video, and one candidate frame is detected in the second half of the video. Note that in addition to these four candidate frames, there are other times in FIG. 8 when the edge change rate is less than the threshold, and the server device 1 may detect candidate frames at these times as well.
- after detecting candidate frames, the server device 1 performs processing to determine the final key frames by excluding similar candidate frames from among the multiple candidate frames detected from all the frames included in the moving image. Note that although the server device 1 performs the key frame determination process after the candidate frame detection process, the key frame determination process may be performed before, after, or in parallel with the scene change position candidate detection process.
- the server device 1 extracts the feature amount of each candidate frame, compares the feature amounts of two candidate frames to calculate a degree of similarity, and determines that these two candidate frames are similar when the calculated degree of similarity exceeds a threshold value.
- the server device 1 extracts key points based on the ORB as feature amounts of each candidate frame.
- ORB is a method that combines key point detection using FAST (Features from Accelerated Segment Test) and feature descriptors using BRIEF (Binary Robust Independent Elementary Features). Since these technologies such as ORB, FAST, and BRIEF are already existing, detailed explanations thereof will be omitted.
- FIG. 9 is a schematic diagram showing an example of key points extracted from candidate frames.
- the two images shown in FIG. 9 are obtained by extracting key points from two similar candidate frames (candidate frame 1 and candidate frame 2), and the extracted key points are shown as circular dots on the images.
- FIG. 10 is a schematic diagram showing an example of key point matching results.
- the example shown in FIG. 10 shows a key point matching result by connecting corresponding (matching) key points of the two candidate frames shown in FIG. 9 with straight lines.
- the server device 1 counts the total number of key points extracted from the two candidate frames and the number of matching key points between the two candidate frames, and calculates the ratio of the number of matches to the total number of key points as the degree of similarity. The server device 1 determines whether the calculated degree of similarity exceeds a predetermined threshold, and can determine that the two candidate frames are similar if the degree of similarity exceeds the threshold.
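- this similarity computation can be sketched with OpenCV's ORB and a brute-force Hamming matcher; the feature count and match-distance cutoff below are assumptions, not values from the patent.

```python
def orb_similarity(frame1_bgr, frame2_bgr, n_features=500, max_dist=40):
    """Ratio of matched ORB key points to the total extracted from both frames."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(cv2.cvtColor(frame1_bgr, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = orb.detectAndCompute(cv2.cvtColor(frame2_bgr, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return 0.0  # no key points found in at least one frame
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = [m for m in matcher.match(des1, des2) if m.distance <= max_dist]
    total = len(kp1) + len(kp2)
    return len(matches) / total if total else 0.0
```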
- the server device 1 determines a candidate frame for which no similar candidate frame exists as a key frame included in the moving image. When a plurality of similar candidate frames are included in the moving image, the server device 1 appropriately selects one candidate frame from among these candidate frames, determines it as a key frame, and excludes the remaining candidate frames from the key frame candidates. At this time, if there are two similar candidate frames, the server device 1 sets the chronologically earlier candidate frame as the key frame and excludes the later candidate frame from the key frame candidates.
- further, if there are three similar candidate frames, the server device 1 sets the chronologically second candidate frame as the key frame, and excludes the first and third candidate frames from the key frame candidates.
- note that the method of selecting one candidate frame as a key frame from among a plurality of similar candidate frames is not limited to the above; the server device 1 may select one key frame from a plurality of similar candidate frames by any method, as in the sketch below.
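- one way to realize the selection rule above (the earlier of two similar frames, the middle of three) is to group chronologically consecutive similar candidates and keep one representative per group; `similar(a, b)` is assumed to wrap the orb_similarity sketch with a threshold.

```python
def select_key_frames(candidates, similar):
    """Group runs of mutually similar candidate frame indices and keep
    one per group: the earlier of two, the middle of three or more."""
    if not candidates:
        return []
    groups, group = [], [candidates[0]]
    for idx in candidates[1:]:
        if similar(group[-1], idx):
            group.append(idx)   # still in the same run of similar frames
        else:
            groups.append(group)
            group = [idx]
    groups.append(group)
    return [g[0] if len(g) <= 2 else g[len(g) // 2] for g in groups]
```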
- after detecting scene change position candidates and key frames from the moving image, the server device 1 performs a process of determining the scene change positions from among the scene change position candidates. In the present embodiment, the server device 1 determines the final scene change positions from among the scene change position candidates based on the chronological order relationship between the scene change position candidates and the key frames, thereby determining the scenes included in the moving image. In this embodiment, a scene of the moving image is subject to the condition that it includes at least one key frame, as shown in the lower part of FIG. 4.
- FIG. 11 is a schematic diagram for explaining a method of determining a scene change position by the server device 1.
- the illustrated example shows a state in which the server device 1 has detected three scene change position candidates 1 to 3 and two key frames 1 and 2 by performing the above-described processing on a moving image. In chronological order, scene change position candidates 1, 2, and 3 are arranged in this order, and key frames 1 and 2 are arranged in this order between scene change position candidates 2 and 3. No key frame exists between scene change position candidates 1 and 2.
- in FIG. 11, the moving image consisting of the multiple frames existing between scene change position candidates 1 and 2 is set as scene candidate 1, and the moving image consisting of the multiple frames (including the two key frames 1 and 2) existing between scene change position candidates 2 and 3 is set as scene candidate 2.
- a moving image includes one or more scenes, and one scene includes one or more key frames.
- scene candidate 1, which has scene change position candidates 1 and 2 as its preceding and succeeding scene change positions, does not include a key frame and therefore does not qualify as a scene in this embodiment. If no key frame is included between two chronologically consecutive scene change position candidates, the server device 1 removes one of the two scene change position candidates, thereby connecting the scene candidate that does not include a key frame to a scene candidate that does. In the example shown in FIG. 11, the server device 1 excludes, for example, scene change position candidate 2, thereby connecting scene candidates 1 and 2, which exist before and after scene change position candidate 2, into one scene, and sets scene change position candidates 1 and 3 as the final scene change positions.
- in this embodiment, for scene candidate 1, which does not include a key frame, the server device 1 excludes the chronologically later scene change position candidate 2, and connects scene candidate 1 to the chronologically later scene candidate 2.
- alternatively, the server device 1 may exclude the chronologically earlier scene change position candidate 1 and connect scene candidate 1 to the preceding scene candidate.
- the server device 1 may exclude candidates for scene change positions that are earlier in chronological order, or may exclude candidates for scene change positions that are later.
- which scene change position candidate the server device 1 excludes may be determined in advance, for example, or one of them may be selected depending on the length of the preceding and succeeding scene candidates, the number of key frames, and the like.
- further, if a scene change position candidate exists between two similar key frames, the server device 1 may exclude this scene change position candidate.
- the server device 1 can determine whether two key frames are similar using the degree of similarity calculated when determining a key frame from candidate frames.
- for example, the server device 1 calculates the degree of similarity based on the key point matching results for two chronologically consecutive key frames, and determines that the two key frames are similar if the calculated degree of similarity exceeds a predetermined threshold (a value smaller than the threshold used when determining whether candidate frames are similar).
- the server device 1 can exclude scene change position candidates that exist between two similar key frames, and connect the scene candidates before and after this scene change position candidate to form one scene.
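- the two exclusion rules can be combined into one filtering pass, sketched below; dropping the later boundary of a key-frame-less scene is one of the options the description allows, and `keys_similar(a, b)` is an assumed key-frame similarity test.

```python
def determine_scene_changes(change_candidates, key_frames, keys_similar):
    """Keep only scene change positions such that every scene contains a key
    frame (rule 1) and no position sits between two similar key frames (rule 2)."""
    changes, keys = sorted(change_candidates), sorted(key_frames)
    if not changes:
        return []
    # Rule 1: no key frame since the last kept position -> drop the later
    # boundary, merging the empty scene into the following one.
    kept = [changes[0]]
    for c in changes[1:]:
        if any(kept[-1] <= k < c for k in keys):
            kept.append(c)
    # Rule 2: drop a position whose nearest key frames on both sides are similar.
    final = []
    for c in kept:
        before = [k for k in keys if k < c]
        after = [k for k in keys if k >= c]
        if before and after and keys_similar(before[-1], after[0]):
            continue
        final.append(c)
    return final
```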
- alternatively, if no key frame exists between two scene change positions, the server device 1 may exclude the scene between the two scene change positions from the entire moving image.
- that is, since scene candidate 1 shown in FIG. 11 contains no key frame, the server device 1 may estimate that the moving image during this period does not contain important information, and exclude scene candidate 1, between scene change position candidates 1 and 2, from the moving image. Further, at this time, the server device 1 may exclude either scene change position candidate 1 or 2 together with scene candidate 1.
- after determining all the scene change positions and key frames included in the moving image, the server device 1 stores information regarding the determined scene change positions and key frames in the moving image DB 12b in association with the moving image. Further, when the server device 1 receives a request from the terminal device 3 to transmit a moving image stored in the moving image DB 12b, the server device 1 reads the requested moving image and the information regarding the scene change positions and key frames associated with it from the moving image DB 12b, and transmits them to the requesting terminal device 3. The terminal device 3 can, for example, accept a scene selection from the user using the information regarding the scene change positions and key frames received from the server device 1 along with the moving image, and play back and display the moving image from the selected scene.
- the processing unit 11 of the server device 1 acquires a moving image of the construction or repair of the air conditioning equipment 101 captured by the camera 103, for example by communicating with the camera 103 through the communication unit 13 (step S1).
- the scene change position candidate detection unit 11a of the processing unit 11 calculates an HSL histogram of each frame included in the video image acquired in step S1 (step S2).
- the scene change position candidate detection unit 11a calculates the Bhattacharyya distance between the HSL histograms of two consecutive frames in time series based on the HSL histogram of each frame calculated in step S2 (step S3).
- the scene change position candidate detection unit 11a calculates pHash of each frame included in the video image acquired in step S1 (step S4).
- the scene change position candidate detection unit 11a calculates the Hamming distance between the pHash of two chronologically consecutive frames based on the pHash of each frame calculated in step S4 (step S5).
- the scene change position candidate detection unit 11a calculates the degree of difference between two chronologically consecutive frames based on the Bhattacharyya distance calculated in step S3 and the Hamming distance calculated in step S5 (step S6).
- the sum or average value of the Bhattacharyya distance and Hamming distance may be used as the degree of dissimilarity.
- the scene change position candidate detection unit 11a compares the degree of difference calculated in step S6 with a predetermined threshold, and selects the positions of two frames (or the position between these two frames) whose degree of difference exceeds the threshold as scene change position candidates (step S7).
- the candidate frame detection unit 11b of the processing unit 11 extracts edges of each frame included in the video image acquired in step S1 (step S8).
- the candidate frame detection unit 11b calculates the edge change rate between each frame and the frame immediately before this frame in time series based on the edges extracted in step S8 (step S9).
- the candidate frame detection unit 11b compares the edge change rate calculated in step S9 with a predetermined threshold, and detects frames whose edge change rate does not exceed the threshold as candidate frames that are key frame candidates (step S10).
- the key frame determining unit 11c of the processing unit 11 extracts ORB key points for each candidate frame detected in step S10 (step S11).
- the key frame determining unit 11c performs key point matching between candidate frames based on the key points of each candidate frame extracted in step S11 (step S12).
- the key frame determining unit 11c calculates the degree of similarity between candidate frames based on the key point matching result in step S12 (step S13).
- the key frame determination unit 11c determines the key frames by, for each group of candidate frames whose degree of similarity exceeds a threshold, selecting one frame as a key frame and excluding the other candidate frames (step S14).
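Steps S11 to S14 could be sketched as follows with OpenCV's ORB detector and a brute-force Hamming matcher; the similarity measure (matched key points over the smaller key point count) and the policy of keeping the first frame of each group of similar candidates are illustrative assumptions:

```python
import cv2

def select_key_frames(candidate_frames, threshold=0.6):
    """Steps S11-S14: drop candidates too similar to an already kept one."""
    orb = cv2.ORB_create()
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    feats = [orb.detectAndCompute(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), None)
             for f in candidate_frames]              # (keypoints, descriptors)
    kept = []
    for i, (kp_i, des_i) in enumerate(feats):
        similar = False
        for j in kept:
            kp_j, des_j = feats[j]
            if des_i is None or des_j is None:
                continue                             # no key points to match
            matches = matcher.match(des_i, des_j)    # step S12
            similarity = len(matches) / max(1, min(len(kp_i), len(kp_j)))
            if similarity > threshold:               # step S13
                similar = True
                break
        if not similar:
            kept.append(i)                           # kept as a key frame
    return kept
```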
- the scene change position determination unit 11d of the processing unit 11 searches, based on the scene change position candidates detected in step S7 and the key frames determined in step S14, for scene candidates that lie between two chronologically consecutive scene change position candidates but contain no key frame (step S15). The scene change position determination unit 11d removes one of the two scene change position candidates that delimit each scene candidate without a key frame found in step S15 (step S16). The scene change position determination unit 11d then determines the scene change position candidates that were not removed in step S16 as the final scene change positions (step S17).
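A sketch of steps S15 to S17 follows, with the scene change position candidates and key frames given as frame indices; dropping the opening boundary of a scene candidate without a key frame, rather than the closing one, is an illustrative choice, since the disclosure only requires that one of the two candidates be removed:

```python
def determine_scene_change_positions(candidates, key_frames):
    """Steps S15-S17: keep only boundaries whose scene contains a key frame."""
    positions = sorted(candidates)
    keys = sorted(key_frames)
    kept = []
    for i, pos in enumerate(positions):
        end = positions[i + 1] if i + 1 < len(positions) else float("inf")
        # the scene candidate [pos, end) must contain at least one key frame
        if any(pos <= k < end for k in keys):
            kept.append(pos)
        # otherwise pos is excluded, merging the scene into its neighbor
    return kept
```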
- the DB processing unit 11e of the processing unit 11 stores the moving image acquired in step S1, the scene change positions determined in step S17, and the key frame information determined in step S14 in the moving image DB 12b in association with one another (step S19), and ends the process.
- the moving image DB 12b also stores, in association with each moving image, various information such as character strings for the title or description given by the photographer of the moving image, the shooting date and time of the moving image, and the shooting location of the moving image.
- the moving image DB 12b stores moving images of work such as construction or repair of the air conditioning equipment 101, and stores, in association with each moving image, character information such as the name or product number of the air conditioning equipment 101 that was the target of the work.
- the terminal device 3 accepts input of a character string serving as a keyword from the user, transmits the accepted character string to the server device 1, and requests a search for a moving image.
- the server device 1 searches the moving image DB 12b for moving images whose title, description, or associated character information such as the name or product number of the air conditioning equipment 101 contains the given keyword character string.
- Information on the matching moving images is transmitted as search results to the requesting terminal device 3.
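As a minimal sketch of this keyword search, assuming the moving image DB is held as a list of dictionaries, the matching could look as follows; the field names are illustrative assumptions, not the schema of the moving image DB 12b:

```python
def search_moving_images(video_db, keyword):
    """Return records whose associated text contains the keyword."""
    fields = ("title", "description", "equipment_name", "product_number")
    return [rec for rec in video_db
            if any(keyword in str(rec.get(f, "")) for f in fields)]
```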
- the terminal device 3 that has received the search results from the server device 1 displays a list of information such as the title and shooting date and time of the moving images that correspond to the input keyword.
- the terminal device 3 accepts from the user a selection of a moving image to be played from among the moving images displayed as a list as a search result, and requests the server device 1 to transmit the selected moving image.
- the server device 1 reads out the selected moving image and the information on the scene change positions and key frames associated with this moving image from the moving image DB 12b, and transmits them to the requesting terminal device 3.
- FIG. 14 is a schematic diagram showing an example of a playback screen by the terminal device 3.
- the illustrated playback screen is provided with a moving image display area for displaying moving images at the upper center of the screen, and four operation buttons are provided horizontally below this area.
- the four operation buttons are, from the left, a button for returning to the beginning of the scene (back button), a button for playing the moving image (play button), a button for skipping to the next scene (skip button), and a button for stopping playback of the moving image (stop button).
- the terminal device 3 receives the user's operations on these operation buttons and performs processing such as playing and stopping the moving image.
- the terminal device 3 changes scenes of the moving image with the back button and the skip button based on the scene change positions received from the server device 1. For example, when the back button is operated, the terminal device 3 starts playback from the closest scene change position before the current playback time of the moving image. Further, for example, when the skip button is operated, the terminal device 3 starts playback from the closest scene change position after the current playback time of the moving image.
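The back and skip behavior could be sketched as follows, assuming the scene change positions are held as playback times in seconds; the function names are illustrative, not taken from the terminal device 3 implementation:

```python
def seek_back(scene_changes, current_time):
    """Nearest scene change position before the current playback time."""
    earlier = [t for t in scene_changes if t < current_time]
    return max(earlier) if earlier else 0.0

def seek_skip(scene_changes, current_time):
    """Nearest scene change position after the current playback time."""
    later = [t for t in scene_changes if t > current_time]
    return min(later) if later else current_time

# e.g. with scene changes at 0.0, 42.0 and 97.5 seconds:
# seek_back([0.0, 42.0, 97.5], 60.0)  -> 42.0
# seek_skip([0.0, 42.0, 97.5], 60.0)  -> 97.5
```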
- the terminal device 3 displays a progress bar indicating the playback time below the four operation buttons, and indicates a break indicating the scene change position of the moving image on the progress bar.
- the illustrated example is a case where the moving image includes three scenes, and the progress bar shows two breaks, drawn as thick vertical lines, indicating the scene change positions.
- the terminal device 3 displays reduced images of one or more key frames included in this moving image in an appropriate arrangement below the progress bar.
- the terminal device 3 indicates the temporal timing at which these key frames appear in the moving image by arrows connecting each key frame image to the progress bar; each arrow points to the position, within the playback time represented by the progress bar, at which the corresponding key frame appears.
- the method of displaying a moving image shown in FIG. 14 is an example and is not limited to this, and the terminal device 3 may display a moving image using any method.
- the server device 1 generates a digest moving image (summary moving image) of a moving image based on the scene change positions and key frames of the moving image determined by the processing described above.
- the server device 1 generates a digest moving image whose playback time is shorter than that of the original moving image by extracting (cutting out) one or more partial moving images from the moving image and concatenating them.
- the server device 1 extracts, from the entire moving image, a partial moving image of a predetermined length (for example, several seconds to several tens of seconds) starting at each scene change position, and a partial moving image of a predetermined length before and after each key frame.
- the server device 1 generates the digest moving image by connecting the plurality of partial moving images extracted from the moving image in chronological order.
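The digest assembly could be sketched as follows, with positions given as frame indices; the clip lengths, and the fact that overlapping clips are concatenated without being merged, are assumptions made only for illustration:

```python
def build_digest(frames, scene_changes, key_frames,
                 scene_clip=90, key_clip=45):
    """Concatenate short clips taken at scene changes and around key frames."""
    spans = []
    for pos in scene_changes:
        spans.append((pos, min(pos + scene_clip, len(frames))))
    for k in key_frames:
        spans.append((max(0, k - key_clip // 2),
                      min(k + key_clip // 2, len(frames))))
    digest = []
    for start, end in sorted(spans):      # chronological order; overlapping
        digest.extend(frames[start:end])  # spans are not merged in this sketch
    return digest
```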
- the server device 1 stores the generated digest moving image in the moving image DB 12b in association with the original moving image.
- the server device 1 may generate the digest moving image when it acquires the moving image from the camera 103, in response to a request from the terminal device 3, or at another appropriate timing.
- when transmitting search results in response to a moving image search request from the terminal device 3, the server device 1 may also read the digest moving images of the matching moving images from the moving image DB 12b and transmit them to the terminal device 3 as part of the search results.
- the terminal device 3 may display the digest moving images in a list as the result of the moving image search, along with information such as the titles of the moving images that match the search conditions.
- the server device 1 stores a learning model generated by machine learning in the storage unit 12. After detecting the key frames included in a moving image, the server device 1 inputs each of the detected key frames to the learning model and obtains the classification result output from the learning model.
- the type of construction or repair of the air conditioning equipment 101 output by the learning model is, for example, construction of an outdoor unit of an air conditioner or repair of an indoor unit of an air conditioner.
- the server device 1 classifies each key frame included in a moving image into a construction or repair type using the learning model, obtains the classification results for all the key frames, and generates a title for the moving image and titles for its scenes based on the obtained classification results.
- the server device 1 may also use a learning model that accepts a key frame as input and generates a character string to be used as the title of the key frame.
- such learning models include, for example, a combination of a model such as a CNN that converts an image into features with a model that generates a title string from the features, such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), BERT (Bidirectional Encoder Representations from Transformers), or GPT-3 (Generative Pre-trained Transformer-3).
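As a rough sketch of how such classification results might be turned into a title, the following assumes a PyTorch image classifier standing in for the trained learning model; the class labels, the preprocessing contract, and the majority vote over key frames are all illustrative assumptions rather than details of this disclosure:

```python
import torch
from collections import Counter

CLASS_NAMES = ["outdoor unit construction", "indoor unit repair"]  # assumed

def classify_key_frames(model: torch.nn.Module, key_frames):
    """key_frames: iterable of preprocessed (3, H, W) float tensors."""
    model.eval()
    labels = []
    with torch.no_grad():
        for x in key_frames:
            logits = model(x.unsqueeze(0))     # add a batch dimension
            labels.append(CLASS_NAMES[int(logits.argmax(dim=1))])
    return labels

def title_from_labels(labels):
    """Majority vote over the per-key-frame classification results."""
    return Counter(labels).most_common(1)[0][0]
```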
- the server device 1 stores the title of the moving image generated based on the key frames, and the titles of the scenes included in the moving image, in the moving image DB 12b in association with the moving image. Note that when, for example, the photographer inputs the title of the moving image or of a scene, the server device 1 stores the input title in the moving image DB 12b and need not generate a title using the learning model. Further, the terminal device 3 can display the title of the moving image, for example at the top of the playback screen shown in FIG. 14, and can display the titles of the scenes in association with one or more key frame images.
- as described above, in the information processing system according to this embodiment, the server device 1 detects scene change position candidates from a moving image captured by the camera 103, detects, from among the plurality of frames constituting the moving image, candidate frames that can serve as key frames (main frames) of a scene, determines key frames from among the candidate frames, and determines scene change positions from among the scene change position candidates based on the chronological order of the scene change position candidates and the key frames. As a result, the information processing system according to this embodiment can be expected to accurately detect scene changes from moving images.
- in the information processing system according to this embodiment, when no key frame exists between two chronologically consecutive scene change position candidates, the server device 1 determines the scene change positions by excluding one of these two candidates. Furthermore, when a scene change position candidate exists between two chronologically consecutive and similar key frames, the server device 1 may determine the scene change positions by excluding that candidate. As a result, the information processing system according to this embodiment can be expected to accurately determine scene change positions from among the scene change position candidates detected from a moving image.
- in the information processing system according to this embodiment, the server device 1 calculates a statistical value (HSL histogram) of each frame included in a moving image and detects scene change position candidates based on the difference between the statistical values of two chronologically consecutive frames. Further, the server device 1 calculates a hash value (pHash) of each frame included in a moving image and detects scene change position candidates based on the difference between the hash values of two chronologically consecutive frames. As a result, the information processing system according to this embodiment can be expected to accurately detect scene change position candidates from moving images.
- the server device 1 stores information on the determined scene change positions and key frames in the moving image DB 12b in association with the moving image.
- the terminal device 3 receives a selection of a scene change position or a key frame from a user, and reproduces a moving image based on the received scene change position or key frame.
- the information processing system according to this embodiment can therefore be expected to play back a moving image from the scene or position that the user requires.
- in the information processing system according to this embodiment, the server device 1 stores the information on the determined scene change positions and key frames in the moving image DB 12b in association with character information such as the title, description, shooting date and time, and shooting location of the moving image, and the name or product number of the air conditioning equipment (air conditioning related device) 101 to be constructed or repaired. This allows the user to search for a moving image by inputting a keyword or the like matched against the character information associated with the moving image.
- in the information processing system according to this embodiment, the server device 1 extracts partial moving images from the moving image based on the determined scene change positions and key frames, and combines the extracted partial moving images to generate a digest moving image.
- as a result, the information processing system according to this embodiment can provide a summary moving image to the user, and the user can be expected to easily grasp the outline of a moving image from the summary even when the moving image has a long playback time.
- in the information processing system according to this embodiment, the server device 1 inputs the determined key frames into a learning model trained in advance by machine learning, and determines the title of the moving image or of a scene included in the moving image based on the information output by the learning model.
- the information processing system according to the present embodiment can automatically add a title to a moving image even if the photographer of the moving image does not input a title.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
Description
FIG. 1 is a schematic diagram for explaining an overview of the information processing system according to this embodiment. In the information processing system according to this embodiment, a worker 102 who performs work such as construction or repair of air conditioning equipment 101 captures the work with a camera 103 mounted on a headset or the like worn on the worker's head. In this embodiment, shooting is performed with the camera 103 mounted on a wearable device such as a headset worn by the worker 102, but this is not limiting; the camera 103 may instead be installed around the air conditioning equipment 101 and the worker 102 to capture the work. The air conditioning equipment 101 shown in FIG. 1 is an indoor unit of an air conditioner, but this is not limiting; the air conditioning equipment 101 may be any of various air conditioning related devices such as an outdoor unit of an air conditioner, a ventilator, a circulator, an air purifier, a heater, or a dehumidifying dryer. The camera 103 may also capture construction, repair, or other work on various devices other than air conditioning related devices, and may capture various work other than construction or repair on these devices.
FIG. 2 is a block diagram showing the configuration of the server device 1 according to this embodiment. The server device 1 according to this embodiment includes a processing unit 11, a storage unit (storage) 12, a communication unit (transceiver) 13, and the like. Although this embodiment is described as processing being performed by a single server device, the processing may be distributed among a plurality of server devices.
FIG. 4 is a schematic diagram for explaining the scene change position and key frame detection processing performed by the information processing system according to this embodiment. As shown in the upper part of FIG. 4, a moving image handled by the information processing system according to this embodiment is, for example, a sequence of several tens of frames (still images) per second. A moving image can be divided into a plurality of scenes. In this embodiment, a scene is what may be called a scene or a shot in video production, for example, and is a unit in which the action of a person or object shown in the moving image forms one segment. As shown in the lower part of FIG. 4, one scene in this embodiment is a sequence of frames including at least one key frame, and its first and last frames are treated as scene change positions.
For a moving image acquired from the camera 103, the server device 1 first performs processing to detect scene change position candidates included in the moving image. For the plurality of frames constituting the moving image, the server device 1 according to this embodiment calculates a value indicating the difference between two chronologically consecutive frames and determines whether the calculated value exceeds a predetermined threshold. When the value indicating the difference between the frames exceeds the threshold, the server device 1 takes the position of the two consecutive frames as a scene change position candidate. In this embodiment, the server device 1 calculates two values indicating the difference between frames: the Bhattacharyya distance between HSL histograms and the distance between hash values.
The server device 1 performs processing to detect, from all the frames included in the moving image captured by the camera 103, candidate frames that are key frame candidates. The candidate frame detection processing may be performed before, after, or at the same time as the scene change position detection processing described above. The server device 1 according to this embodiment performs image processing to extract edges from the plurality of frames constituting the moving image.
For the candidate frames detected from all the frames included in the moving image, when a plurality of candidate frames exist within a predetermined time, for example, the server device 1 performs processing to determine the final key frames by removing similar candidate frames from among them. The server device 1 performs the key frame determination processing after the candidate frame detection processing, but the key frame determination processing may be performed before, after, or in parallel with the scene change position candidate detection processing.
After detecting the scene change position candidates and the key frames from the moving image, the server device 1 performs processing to determine the scene change positions from among the scene change position candidates. In this embodiment, the server device 1 determines the scenes included in the moving image by determining the final scene change positions from the scene change position candidates based on the chronological order relationship between the scene change position candidates and the key frames. In this embodiment, a scene of a moving image is required to include at least one key frame, as shown in the lower part of FIG. 4.
In the information processing system according to this embodiment, the server device 1 stores information on the scene change positions and key frames of a moving image determined by the above processing in the moving image DB 12b in association with the moving image. The moving image DB 12b also stores, in association with each moving image, various information such as a character string for the title or description given by the photographer of the moving image, the shooting date and time of the moving image, and the shooting location of the moving image. In this embodiment, the moving image DB 12b stores moving images of work such as construction or repair of the air conditioning equipment 101, and character information such as the name or product number of the air conditioning equipment 101 targeted by the work is stored in association with each moving image.
In the information processing system according to this embodiment, the server device 1 performs processing to generate a digest moving image (summary moving image) of a moving image based on the scene change positions and key frames determined by the above processing. The server device 1 generates a digest moving image whose playback time is shorter than that of the original moving image by extracting (cutting out) one or more partial moving images from the moving image and concatenating them.
In the information processing system according to this embodiment, the server device 1 can automatically generate the title of a moving image, the titles of scenes included in the moving image, the titles of key frames included in the moving image, and the like, using a machine-learned learning model, so-called AI (Artificial Intelligence). FIG. 15 is a schematic diagram for explaining the learning model used by the server device 1 according to this embodiment. The learning model used by the server device 1 according to this embodiment is a learning model trained in advance by machine learning to accept a key frame included in a moving image as input and to output, as a classification result, the type of construction or repair of the air conditioning equipment 101 shown in the key frame.
In the information processing system according to this embodiment configured as described above, the server device 1 detects scene change position candidates from a moving image captured by the camera 103, detects candidate frames that can serve as key frames (main frames) of scenes from among the plurality of frames constituting the moving image, determines key frames from among the candidate frames, and determines scene change positions from among the scene change position candidates based on the chronological order of the scene change position candidates and the key frames. As a result, the information processing system according to this embodiment can be expected to accurately detect scene changes from moving images.
3 terminal device
11 processing unit
11a scene change position candidate detection unit
11b candidate frame detection unit
11c key frame determination unit (main frame determination unit)
11d scene change position determination unit
11e DB processing unit
12 storage unit
12a server program (computer program)
12b moving image DB
13 communication unit
31 processing unit
31a search processing unit
31b display processing unit
32 storage unit
32a program
33 communication unit
34 display unit
35 operation unit
101 air conditioning equipment (air conditioning related device)
102 worker
103 camera
N network
Claims (14)
- An image processing method, wherein an image processing device
detects scene change position candidates from a moving image,
detects, from among the frames constituting the moving image, candidate frames that can serve as main frames of a scene,
determines main frames from among the candidate frames, and
determines a scene change position from among the scene change position candidates based on the chronological order of the scene change position candidates and the main frames.
- The image processing method according to claim 1, wherein, when no main frame exists between two scene change position candidates arranged in chronological order, the scene change position is determined by excluding one of the two scene change position candidates.
- The image processing method according to claim 1 or claim 2, wherein, when a scene change position candidate exists between two similar main frames arranged in chronological order, the scene change position is determined by excluding that scene change position from the candidates.
- The image processing method according to any one of claims 1 to 3, wherein a statistical value of each frame is calculated, and scene change position candidates are detected based on the difference between the statistical values of two frames.
- The image processing method according to any one of claims 1 to 4, wherein a hash value of each frame is calculated, and scene change position candidates are detected based on the difference between the hash values of two frames.
- The image processing method according to any one of claims 1 to 5, wherein edges are extracted from each frame, and candidate frames are detected based on the change in edges between two frames.
- The image processing method according to any one of claims 1 to 6, wherein feature points are extracted from the candidate frames, and main frames are determined from among the candidate frames by excluding candidate frames based on the result of comparing feature points among a plurality of candidate frames.
- The image processing method according to any one of claims 1 to 7, wherein information on the determined scene change positions and main frames is stored in association with the moving image, a selection of a scene change position or a main frame is accepted, and the moving image is played back based on the selected scene change position or main frame.
- The image processing method according to claim 8, wherein information on the determined scene change positions and main frames is stored in association with character information on the moving image.
- The image processing method according to claim 9, wherein a moving image of construction or repair work on air conditioning related equipment is acquired, scene change positions and main frames are determined for the acquired moving image, and information on the determined scene change positions and main frames is stored in association with character information on the air conditioning related equipment.
- The image processing method according to any one of claims 1 to 10, wherein partial moving images are extracted from the moving image based on the determined scene change positions and main frames, and the extracted partial moving images are combined to generate a summary moving image.
- The image processing method according to any one of claims 1 to 11, wherein the determined main frames are input to a learning model that classifies the type of construction or repair in response to input of a main frame of a moving image of construction or repair work on air conditioning related equipment, a classification result output by the learning model is obtained, and a title of the moving image or of a scene included in the moving image is determined based on the obtained classification result.
- A computer program that causes a computer to execute processing of:
detecting scene change position candidates from a moving image,
detecting, from among the frames constituting the moving image, candidate frames that can serve as main frames of a scene,
determining main frames from among the candidate frames, and
determining a scene change position from among the scene change position candidates based on the chronological order of the scene change position candidates and the main frames.
- An image processing device comprising:
a scene change position candidate detection unit that detects scene change position candidates from a moving image,
a candidate frame detection unit that detects, from among the frames constituting the moving image, candidate frames that can serve as main frames of a scene,
a main frame determination unit that determines main frames from among the candidate frames, and
a scene change position determination unit that determines a scene change position from among the scene change position candidates based on the chronological order of the scene change position candidates and the main frames.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP23827217.3A EP4546265A1 (en) | 2022-06-23 | 2023-06-21 | Image processing method, computer program, and image processing device |
CN202380047958.3A CN119404224A (zh) | 2022-06-23 | 2023-06-21 | Image processing method, computer program, and image processing device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022-101248 | 2022-06-23 | ||
JP2022101248A JP7429016B2 (ja) | 2022-06-23 | 2022-06-23 | Image processing method, computer program, and image processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023249034A1 (ja) | 2023-12-28 |
Family
ID=89379977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2023/022855 WO2023249034A1 (ja) | Image processing method, computer program, and image processing device | 2022-06-23 | 2023-06-21 |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP4546265A1 (ja) |
JP (1) | JP7429016B2 (ja) |
CN (1) | CN119404224A (ja) |
WO (1) | WO2023249034A1 (ja) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10257436A * (ja) | 1997-03-10 | 1998-09-25 | Atsushi Matsushita | Method for automatic hierarchical structuring of moving images and browsing method using the same |
JP2001527304A * (ja) | 1997-12-19 | 2001-12-25 | シャープ株式会社 | Method for hierarchical summarization and browsing of digital video |
JP2003519946A (ja) | 1999-12-30 | 2003-06-24 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Method and device for detecting fast-moving scenes |
JP2005151069A * (ja) | 2003-11-14 | 2005-06-09 | Funai Electric Co Ltd | Recording and playback device |
WO2007039995A1 * (ja) | 2005-09-30 | 2007-04-12 | Pioneer Corporation | Digest creation device and program therefor |
JP2008109290A * (ja) | 2006-10-24 | 2008-05-08 | Sony Corp | Content character information acquisition method, content character information acquisition program, content character information acquisition device, and video content recording device |
US20200349357A1 * (en) | 2018-01-17 | 2020-11-05 | Group Ib, Ltd | Method of creating a template of original video content |
JP2021131738A * (ja) | 2020-02-20 | 2021-09-09 | 株式会社安藤・間 | Process discrimination system and process discrimination method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004032551A (ja) | 2002-06-27 | 2004-01-29 | Seiko Epson Corp | Image processing method, image processing device, and projector |
CN106503693B (zh) | 2016-11-28 | 2019-03-15 | 北京字节跳动科技有限公司 | Method and device for providing a video cover |
CN112651336B (zh) | 2020-12-25 | 2023-09-29 | 深圳万兴软件有限公司 | Key frame determination method, device, and computer-readable storage medium |
CN114187558A (zh) | 2021-12-20 | 2022-03-15 | 深圳万兴软件有限公司 | Video scene recognition method, apparatus, computer device, and storage medium |
-
2022
- 2022-06-23 JP JP2022101248A patent/JP7429016B2/ja active Active
-
2023
- 2023-06-21 CN CN202380047958.3A patent/CN119404224A/zh active Pending
- 2023-06-21 EP EP23827217.3A patent/EP4546265A1/en active Pending
- 2023-06-21 WO PCT/JP2023/022855 patent/WO2023249034A1/ja active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10257436A * (ja) | 1997-03-10 | 1998-09-25 | Atsushi Matsushita | Method for automatic hierarchical structuring of moving images and browsing method using the same |
JP2001527304A * (ja) | 1997-12-19 | 2001-12-25 | シャープ株式会社 | Method for hierarchical summarization and browsing of digital video |
JP2003519946A (ja) | 1999-12-30 | 2003-06-24 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | Method and device for detecting fast-moving scenes |
JP2005151069A * (ja) | 2003-11-14 | 2005-06-09 | Funai Electric Co Ltd | Recording and playback device |
WO2007039995A1 * (ja) | 2005-09-30 | 2007-04-12 | Pioneer Corporation | Digest creation device and program therefor |
JP2008109290A * (ja) | 2006-10-24 | 2008-05-08 | Sony Corp | Content character information acquisition method, content character information acquisition program, content character information acquisition device, and video content recording device |
US20200349357A1 * (en) | 2018-01-17 | 2020-11-05 | Group Ib, Ltd | Method of creating a template of original video content |
JP2021131738A * (ja) | 2020-02-20 | 2021-09-09 | 株式会社安藤・間 | Process discrimination system and process discrimination method |
Also Published As
Publication number | Publication date |
---|---|
JP7429016B2 (ja) | 2024-02-07 |
CN119404224A (zh) | 2025-02-07 |
JP2024002193A (ja) | 2024-01-11 |
EP4546265A1 (en) | 2025-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yao et al. | Oscar: On-site composition and aesthetics feedback through exemplars for photographers | |
CN101523412B (zh) | Face-based image clustering | |
US8995725B2 (en) | On-site composition and aesthetics feedback through exemplars for photographers | |
CN106462744B (zh) | Rule-based video importance analysis | |
US10410677B2 (en) | Content management system, management content generating method, management content play back method, and recording medium | |
CN109063611B (zh) | Method and device for processing face recognition results based on video semantics | |
US20070074244A1 (en) | Method and apparatus for presenting content of images | |
US8345742B2 (en) | Method of processing moving picture and apparatus thereof | |
CN103365936A (zh) | Video recommendation system and method | |
US9721613B2 (en) | Content management system, management content generation method, management content reproduction method, program and recording medium | |
TWI764240B (zh) | Intelligent video editing method and system | |
CN111986180B (zh) | Face forgery video detection method based on a multi-related-frame attention mechanism | |
JP2006172437A (ja) | Method for determining the position of a segment boundary in a stream of data, method for determining a segment boundary by comparing a data subset with neighboring data subsets, program of computer-executable instructions, and system or apparatus for identifying boundaries and non-boundaries in a stream of data | |
CN110866563B (zh) | Similar video detection and recommendation method, electronic device, and storage medium | |
CN110569773A (zh) | Two-stream network action recognition method based on spatio-temporal saliency action attention | |
CN105095853A (zh) | Image processing device and image processing method | |
CN108921023A (zh) | Method and device for determining low-quality portrait data | |
Ma et al. | An universal image attractiveness ranking framework | |
Cheng et al. | Re-compose the image by evaluating the crop on more than just a score | |
JP7429016B2 (ja) | Image processing method, computer program, and image processing device | |
JP6410427B2 (ja) | Information processing device, information processing method, and program | |
WO2020217425A1 (ja) | Training data generation device | |
JP2006217046A (ja) | Video index image generation device and program for generating video index images | |
CN111626409B (zh) | Data generation method for image quality detection | |
JP7667724B2 (ja) | Computer system and method for analyzing body movements of a person exercising | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23827217 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202417103248 Country of ref document: IN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023827217 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2023827217 Country of ref document: EP Effective date: 20250123 |
|
WWP | Wipo information: published in national office |
Ref document number: 2023827217 Country of ref document: EP |