WO2024052964A1 - Video synchronization device, video synchronization method, and video synchronization program - Google Patents

Video synchronization device, video synchronization method, and video synchronization program

Info

Publication number
WO2024052964A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
delay
feature extraction
input
unit
Prior art date
Application number
PCT/JP2022/033307
Other languages
French (fr)
Japanese (ja)
Inventor
隆行 黒住
優花 芹澤
馨亮 長谷川
真二 深津
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2022/033307 priority Critical patent/WO2024052964A1/en
Publication of WO2024052964A1 publication Critical patent/WO2024052964A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242Synchronization processes, e.g. processing of PCR [Program Clock References]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • One aspect of the present invention relates to a video synchronization device, a video synchronization method, and a video synchronization program.
  • In recent years, video/audio playback equipment has come into use that digitizes video and audio shot and recorded at a certain point, transmits them in real time to a remote location via communication lines such as IP (Internet Protocol) networks, and plays back the video and audio at the remote location.
  • For example, online live streaming and public viewing, which transmit in real time the video and audio of live music events held at music venues or of sports matches held at competition venues to remote locations, have become popular.
  • Such video/audio transmission is not limited to one-to-one one-way transmission.
  • Two-way transmission is also being carried out, in which video and audio are transmitted from the venue where the live music event is being held (hereinafter referred to as the event venue) to multiple remote locations; at each of those remote locations, video and audio such as the cheers of the audience enjoying the live performance are in turn shot and recorded, transmitted to the event venue and to the other remote locations, and output from large video display devices and speakers at each site.
  • In such two-way video/audio transmission, when a customer enjoying the video of a live music event at a remote location connects to the event venue and waves a penlight, claps, or dances along with the music and with the other audience members at the event venue or at other remote locations, it is difficult to present that video in sync with the performers and audience at the event venue and with the audience at the other remote locations.
  • The transmission between a remote location and the event venue involves delay time caused by various factors such as communication time, video processing time, and the reaction time of the audience at the remote location. Therefore, it is difficult to synchronize, in real time, videos that include the movements of spectators at remote locations.
  • Non-Patent Document 1 describes a method of synchronizing videos based on a synchronization signal embedded in a video signal.
  • However, the method of Non-Patent Document 1 synchronizes the videos being viewed, and it is difficult to synchronize videos based on the actions within the videos.
  • This invention has been made in view of the above circumstances, and its purpose is to provide a technology that can synchronize videos based on the motion contained in the videos.
  • In one embodiment of this invention, the video synchronization device includes a video feature extraction unit that extracts video features from a plurality of input videos, a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos, and a delay correction unit that synchronizes the plurality of input videos by correcting the delay times of the plurality of input videos using the delay time.
  • videos can be synchronized based on motion included in the videos.
  • FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of the software configuration of a server that constitutes the video synchronization system according to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of an image of an audience at a remote location according to the first embodiment.
  • FIG. 4 is a diagram showing an example of a video at an event venue according to the first embodiment.
  • FIG. 5 is a conceptual diagram showing video feature extraction by the server according to the first embodiment.
  • FIG. 6 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the first embodiment.
  • FIG. 7 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 8 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 9 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 10 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 11 is a conceptual diagram showing video synchronization processing of the server according to the first embodiment.
  • FIG. 12 is a block diagram illustrating an example of the software configuration of a server configuring the video synchronization system according to the second embodiment.
  • FIG. 13 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the second embodiment.
  • FIG. 14 is a diagram illustrating an example of a method of photographing a video at an event venue according to the first and second embodiments.
  • FIG. 15 is a diagram illustrating an application example of the video synchronization system according to the first and second embodiments.
  • FIG. 16 is a diagram illustrating an example of server processing according to the first and second embodiments.
  • FIG. 17 is a diagram illustrating an example of a DNN structure implemented in the video feature extraction unit according to the first and second embodiments.
  • FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit according to the first and second embodiments.
  • FIG. 19 is a diagram illustrating an example of learning based on a phase difference performed by the learning unit according to the first and second embodiments.
  • FIG. 20 is a diagram illustrating an example of a time series search by the delay estimator according to the first and second embodiments.
  • In the following, it is assumed that, for a live music event (hereinafter also referred to as an event) held at a venue such as a live music hall, the input videos of multiple audience members watching the live performance from remote locations (hereinafter referred to as remote audience) are synchronized based on the characteristics of the movements in the videos.
  • FIG. 3 shows images of multiple remote spectators.
  • FIG. 3 shows a situation in which multiple remote spectators are excited using penlights.
  • In FIG. 3, the videos of a plurality of remote audience members are aggregated in a 5 × 5 matrix; each individual video is cut out from such an aggregated video and used.
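  • As an illustration of how individual videos might be cut out of such an aggregated frame, the following is a minimal Python sketch (not part of the publication); the tile counts and the frame layout are assumptions for illustration only.

```python
import numpy as np

def split_grid(frame, rows=5, cols=5):
    """Split an aggregated frame of shape (H, W, 3) into rows x cols
    equally sized audience tiles, returned in row-major order."""
    h, w = frame.shape[0] // rows, frame.shape[1] // cols
    return [frame[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

# e.g. tiles = split_grid(aggregated_frame)  # 25 individual remote-audience tiles per frame
```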
  • an image of a crowd at an event venue as shown in FIG. 4 may be used as the input image.
  • FIG. 4 shows a crowd at an event venue being excited using penlights.
  • a part of the video of the crowd at the event venue may be cut out and used as the input video, or the entire video may be used as the input video.
  • The input video is assumed to be a video of audience members holding a distinctive item such as a penlight whose movement is highly visible, but it may also be a video of audience members clapping or dancing without holding anything.
  • the first embodiment is an embodiment in which a plurality of videos are synchronized by using characteristics of videos of a remote audience.
  • FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
  • the video synchronization system S includes a server 1, an audio output device 101, a video output device 102, and a plurality of audience terminals 2 to 2n.
  • the server 1, the audio output device 101, the video output device 102, and the plurality of audience terminals 2 to 2n can communicate with each other via an IP network.
  • the server 1 is an electronic device that collects data and processes the collected data.
  • Electronic devices include computers.
  • the audio output device 101 is a device that includes a speaker that reproduces and outputs audio.
  • the audio output device 101 is, for example, a device that outputs audio at an event venue.
  • the video output device 102 is a device that includes a display that plays and displays video.
  • the display is a liquid crystal display.
  • the video output device 102 is, for example, a device that plays and displays video at an event venue.
  • Each of the spectator terminals 2 to 2n is a terminal used by each of a plurality of remote spectators.
  • Each of the spectator terminals 2 to 2n is an electronic device having an input function, a display function, and a communication function.
  • each of the audience terminals 2 to 2n is a tablet terminal, a smartphone, a PC (Personal Computer), or the like, but is not limited to these.
  • the spectator terminal 2 is an example of a terminal.
  • the server 1 includes a control section 11, a program storage section 12, a data storage section 13, a communication interface 14, and an input/output interface 15. Each element included in the server 1 is connected to each other via a bus.
  • the control unit 11 corresponds to the central part of the server 1.
  • the control unit 11 includes a processor such as a central processing unit (CPU).
  • the control unit 11 includes a ROM (Read Only Memory) as a nonvolatile memory area.
  • the control unit 11 includes a RAM (Random Access Memory) as a volatile memory area.
  • the processor expands the program stored in the ROM or the program storage unit 12 into the RAM.
  • the control unit 11 realizes each functional unit described below by the processor executing the program loaded in the RAM.
  • the control unit 11 constitutes a computer.
  • the program storage unit 12 is configured of a non-volatile memory that can be written to and read from at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), as a storage medium.
  • the program storage unit 12 stores programs necessary to execute various control processes.
  • the program storage unit 12 stores a program that causes the server 1 to execute processing by each functional unit implemented in the control unit 11, which will be described later.
  • the program storage unit 12 is an example of storage.
  • the data storage unit 13 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD or an SSD, as a storage medium.
  • the data storage unit 13 is an example of a storage or a storage unit.
  • the communication interface 14 includes various interfaces that communicatively connect the server 1 to other electronic devices using communication protocols defined by IP networks.
  • the input/output interface 15 is an interface that enables communication between the server 1 and each of the audio output device 101 and the video output device 102.
  • the input/output interface 15 may include a wired communication interface or a wireless communication interface.
  • the hardware configuration of the server 1 is not limited to the above-mentioned configuration.
  • the server 1 allows the above-mentioned components to be omitted and changed, and new components to be added as appropriate.
  • FIG. 2 is a block diagram showing an example of the software configuration of the server 1 that constitutes the video synchronization system according to the first embodiment.
  • the server 1 includes a video feature extraction section 110, a delay estimation section 111, a delay correction section 112, and a learning section 114.
  • Each functional unit is realized by execution of a program by the control unit 11. It can also be said that each functional unit is included in the control unit 11 or the processor. Each functional unit can be read as the control unit 11 or a processor.
  • Although three video feature extraction units 110 are illustrated in FIG. 2, the number of video feature extraction units 110 is not limited to this. In the following description, it is assumed that each of a plurality of input videos is processed by a different video feature extraction unit 110, but they may all be processed by a single video feature extraction unit 110.
  • the video feature extraction unit 110 extracts video features from the input video.
  • the input video includes, for example, multiple remote audience videos.
  • The input video includes, for example, a video obtained by cutting out an individual video from a 5 × 5 matrix video as shown in FIG. 3.
  • The input video may also include a video of a crowd at an event venue, as shown in FIG. 4.
  • Video features are features seen in the input video.
  • the video features include, for example, human movements, objects, human facial expressions, etc. included in the input video.
  • For example, the video features include human movements such as waving a penlight, lifting a towel, raising a hand, and waving a hand from side to side.
  • Video features may include objects such as penlights, towels, etc.
  • the video features may include human facial expressions such as smiling faces and crying faces.
  • Video features include features that indicate action or movement in the video.
  • The video feature extraction unit 110 performs feature extraction while shifting the input video, for example, as shown in FIG. 5. FIG. 5 is a conceptual diagram showing video feature extraction by the server 1 according to the first embodiment. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video based on the video clipping window width. The video feature extraction unit 110 determines the starting point of the video clipping window width based on the clipping interval. The video feature extraction unit 110 extracts features from the input video within a certain video clipping window width, shifts the window by the clipping interval, and then extracts features from the input video within the next video clipping window width.
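  • The clipping behaviour described above can be pictured with the following Python sketch (an illustrative assumption, not the publication's code); `extractor` stands in for whatever feature extraction method is used, and the default window width and interval are arbitrary.

```python
import numpy as np

def clip_windows(frames, window_width, interval):
    """Yield clips of `window_width` consecutive frames, with the start
    point of each clip advanced by the clipping `interval`."""
    start = 0
    while start + window_width <= len(frames):
        yield frames[start:start + window_width]
        start += interval

def extract_features(frames, extractor, window_width=16, interval=4):
    # One feature vector per shifted window of the input video.
    return [extractor(clip) for clip in clip_windows(np.asarray(frames), window_width, interval)]
```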
  • the video feature extraction unit 110 may use machine learning, for example, to extract video features.
  • the video feature extraction unit 110 may perform feature extraction using a known method described in Non-Patent Document 2 or Non-Patent Document 3.
  • the video feature extraction unit 110 may use a video feature extraction method learned in advance by associating rhythmic sounds with video. In this case, the video feature extraction unit 110 can extract video features that are more related to rhythm.
  • Non-patent document 2 Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi and Noboru Harada. Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. Proc. Interspeech 2020, pp. 1446-1450, 2020.
  • Non-patent document 3 Masahiro Yasuda, Yasutoshi Oishi, Yuma Koizumi, Noboru Harada "Cross-modal sound search based on specific co-occurrence relationships indicated by weak labels" Proceedings of the Acoustical Society of Japan Research Conference, Autumn 2020 ROMBUNNO. 2-1-2
  • the video feature extraction unit 110 may extract video features based on feature extraction by learning performed by the learning unit 114, which will be described later.
  • the feature extraction by learning performed by the learning unit 114 is a video feature extraction method obtained by learning.
  • the video feature extraction method can also be called a trained model for extracting video features.
  • the video feature extraction unit 110 can extract a feature vector indicating an action or movement in the video as the video feature.
  • the video feature extraction unit 110 does not have to always perform feature extraction at the same feature extraction interval or density.
  • the video feature extraction unit 110 may provide at least two types of feature extraction intervals or densities. For example, the video feature extraction unit 110 may narrow the interval or increase the density of feature extraction for one of the two videos, and widen the interval or lower the density of feature extraction for the other video.
  • the delay estimation unit 111 estimates a relative delay time from video features extracted from a plurality of input videos.
  • The delay estimation unit 111 compares a plurality of video features in time series and, based on the distance between the video features or the similarity between the video features, determines which video feature at each time is closest to the video feature at which time.
  • a search method that compares a plurality of video features in chronological order and determines which video feature at each time is closest to the video feature at which time is referred to as a time-series search.
  • the delay estimation unit 111 may estimate at least one of the relative delay time and speed using voting.
  • The delay estimating unit 111 may estimate the timing deviation of the movements of the plurality of remote spectators, that is, the delay time, by determining which time's features the features at each time are closest to, and voting for that time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transformation.
  • The Hough transform is a method of drawing, on the a-b plane shown in FIG. 8, a straight line corresponding to each point (x_i, y_i) on the x-y plane shown in FIG. 7, and determining the slope and intercept of the line by voting.
  • The delay estimation unit 111 determines the intersection point (a_0, b_0) of the straight lines drawn on the a-b plane by voting on cells divided into a grid.
  • the video feature extraction unit 110 cuts out the input video at regular intervals and converts the input video into time-series feature vectors as time-series video features.
  • the delay estimation unit 111 measures the distance between the feature vectors of each person at each time, and plots the pair of closest times.
  • The delay estimation unit 111 obtains the slope a_0 and intercept b_0 of this straight line by the Hough transform.
  • The delay estimation unit 111 determines the intercept b_0 that received the largest number of votes as the estimated delay time.
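  • The voting described with reference to FIGS. 7 to 10 can be sketched as follows (a simplified Python illustration under assumed grid resolutions; the quantization actually used by the device is not specified here). Each matched time pair (x_i, y_i) votes for the cells (a, b) satisfying b = y_i - a·x_i, and the intercept b_0 of the most-voted cell is taken as the estimated delay.

```python
import numpy as np

def hough_delay(pairs, a_grid=np.linspace(0.5, 2.0, 31), b_grid=np.arange(-5.0, 5.0, 0.1)):
    """pairs: iterable of matched times (x_i, y_i) between two feature sequences.
    Returns (a0, b0): the slope (relative speed) and intercept (delay) with the most votes."""
    votes = np.zeros((len(a_grid), len(b_grid)), dtype=int)
    for x, y in pairs:
        for ai, a in enumerate(a_grid):
            b = y - a * x                            # line in (a, b) space for this point
            bi = int(np.argmin(np.abs(b_grid - b)))  # nearest grid cell along the b axis
            votes[ai, bi] += 1
    ai, bi = np.unravel_index(np.argmax(votes), votes.shape)
    return a_grid[ai], b_grid[bi]
```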
  • In the above description, the delay estimator 111 estimates the delay time between two videos; when determining the delay times of multiple videos, multiple pairs may be extracted from the set and the delay time may be determined for each pair.
  • the delay estimation unit 111 may perform matching between feature vectors using the method described in Non-Patent Document 2 or Non-Patent Document 3. In that case, the delay estimation unit 111 may estimate at least one of the relative delay time and speed using the distance.
  • the delay estimation unit 111 may use a distance measure such as Euclidean distance.
  • Here, a search method that compares multiple video features in chronological order and determines, based on a distance between the video features such as the Euclidean distance, which video feature at each time is closest to the video feature at which time is referred to as a distance-based search.
  • the delay estimation unit 111 pairs the two types of intervals or two types of densities to estimate the relative delay time.
  • For example, when estimating the delay between two videos, the delay estimation unit 111 may estimate the delay time by narrowing the feature extraction interval or increasing the feature extraction density for one video, and widening the feature extraction interval or lowering the feature extraction density for the other video.
  • the delay correction unit 112 uses the delay time to correct the delay times of the plurality of videos, and synchronizes the plurality of videos.
  • The delay correction unit 112 corrects the video playback times based on the estimated delay times of the plurality of videos. For example, the delay correction unit 112 creates a synchronized video by inserting the delay-time difference into the playback time of the video of the remote audience member with the shortest delay time so that it matches the video of the remote audience member with the longest delay time.
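  • A minimal sketch of this playback-time correction (illustrative only; the variable names are assumptions) is shown below: the least-delayed streams receive extra playback delay so that all streams line up with the most-delayed one.

```python
def playback_offsets(estimated_delays):
    """estimated_delays: dict mapping a video id to its estimated delay time (seconds).
    Returns the extra delay to insert into each video's playback time so that
    every video lines up with the most-delayed one."""
    longest = max(estimated_delays.values())
    return {vid: longest - d for vid, d in estimated_delays.items()}

# e.g. playback_offsets({'remote_A': 0.0, 'remote_B': 0.3})
#      -> {'remote_A': 0.3, 'remote_B': 0.0}  (insert 0.3 s into remote_A's playback)
```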
  • the delay correction unit 112 may perform delay correction so that all the images match the images of one of the spectators shown in FIG. 3, for example, the remote audience in the upper left.
  • Alternatively, the delay correction unit 112 may extract a small number of representative samples from a grouped set of remote audience members, treat the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of the videos of all remote spectators in the group.
  • By correcting the playback times of the videos based on the delay times, the delay correction unit 112 can create, from the videos of the multiple spectators shown in the left diagram of FIG. 11, a video in which the movements of the multiple audience members are aligned, as shown in the right diagram.
  • "reproduction” may be read as "output” or "transmission”.
  • the learning unit 114 performs learning based on the phase of the learning video or learning based on the phase difference. Learning based on phase and learning based on phase difference will be described later.
  • the learning unit 114 executes learning of learning data including a plurality of learning videos and phases associated with each of the plurality of learning videos.
  • the phase is a phase corresponding to a part of the learning video.
  • For example, part of the learning video is a penlight included in the learning video. Since the penlight is swung by a person, the phase corresponds to the position of the penlight. When the penlight is swung to the left of the person in the video, the phase may be set to 0 [rad]; when the penlight is swung to the right of the person in the video, the phase may be set to π [rad]; and when the penlight is swung in front of the person in the video, the phase may be set to π/2 [rad]. Note that the part of the learning video associated with a phase is not limited to a penlight, and may be various human movements, objects, human facial expressions, etc. in the video.
  • the learning data may be stored in the data storage unit 13 or may be stored in an electronic device different from the server 1.
  • the learning unit 114 can obtain a video feature extraction method by performing learning.
  • The processing procedure described below is only an example, and each process may be changed where possible. Further, steps may be omitted, replaced, or added as appropriate depending on the embodiment.
  • FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server 1 according to the first embodiment.
  • images from cameras of multiple remote spectators and images used for video synchronization are input, and a video in which the images of multiple remote spectators are synchronized is output.
  • Images from remote audience cameras and images used for video synchronization are examples of input images. It is assumed that the input video is a remote audience video obtained from the spectator terminals 2 to 2n.
  • the synchronized video is output via the video output device 102 at the event venue. The synchronized video may be output to the audience terminals 2 to 2n.
  • the control unit 11 obtains input images obtained from the audience terminals 2 to 2n (step S1).
  • the video feature extraction unit 110 waits for an input video (step S2).
  • the video feature extraction unit 110 extracts video features from the input video (step S3).
  • In step S3, for example, the video feature extraction unit 110 obtains an input video.
  • the video feature extraction unit 110 extracts video features while shifting the input video.
  • the video feature extraction unit 110 may extract video features using machine learning, for example.
  • the video feature extraction unit 110 may extract video features using a known method described in Non-Patent Document 2 or Non-Patent Document 3.
  • the video feature extraction unit 110 may use a video feature extraction method learned in advance by associating rhythmic sounds with video.
  • the video feature extraction unit 110 may perform feature extraction on a plurality of input videos at different intervals or densities.
  • the video feature extraction unit 110 may extract video features by providing at least two types of feature extraction intervals or densities.
  • the video feature extraction unit 110 may narrow the feature extraction interval for one video and widen the feature extraction interval for the other video.
  • the video feature extraction unit 110 may increase the density of feature extraction for one video and lower the density of feature extraction for the other video.
  • the video feature extraction unit 110 determines whether feature extraction has been performed for all input videos (step S4). If the video feature extraction unit 110 determines that feature extraction has been performed for all input videos (step S4: YES), the process transitions from step S4 to step S5. If the video feature extraction unit 110 determines that feature extraction has not been performed for all input videos (step S4: NO), the process transitions from step S4 to step S2.
  • the delay estimation unit 111 estimates a relative delay time from video features extracted from a plurality of input videos (step S5).
  • In step S5, for example, the delay estimation unit 111 collates the plurality of video features and, based on the distance between the video features or the similarity between the video features, determines which video feature at each time is closest to the video feature at which time.
  • The delay estimation unit 111 estimates the delay time between two videos based on the time of the video feature that is closest, in distance or in similarity, to the video feature of a certain video.
  • the delay estimation unit 111 may extract a plurality of pairs from a set of a plurality of input videos and estimate the delay time for each pair.
  • the delay estimating unit 111 may estimate at least one of the relative delay time and speed using voting, for example.
  • the delay estimating unit 111 determines the time difference between the timings of the movements of the plurality of remote spectators, that is, the delay time, by determining which time the feature of each time is closest to, and voting for that time.
  • the delay estimation unit 111 estimates the delay time between two videos based on the voted time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transform.
  • the delay estimation unit 111 may perform delay estimation by setting the relative speed as 1.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of feature extraction intervals or the two types of feature extraction densities to estimate the relative delay time.
  • the delay correction unit 112 waits for input video (step S6).
  • the delay correction unit 112 uses the delay time estimated by the delay estimation unit 111 to correct the delay times of the plurality of input videos, and synchronizes the plurality of input videos (step S7).
  • In step S7, for example, the delay correction unit 112 obtains an input video.
  • the delay correction unit 112 performs delay correction based on a plurality of input videos.
  • the delay correction unit 112 performs delay correction based on the time determined from a plurality of input videos.
  • the delay correction unit 112 corrects the playback times of the plurality of input videos based on the delay times estimated for the plurality of input videos.
  • For example, the delay correction unit 112 may create a synchronized video by inserting the delay-time difference into the playback time of the video of the remote audience member with the shortest delay time so that it matches the video of the remote audience member with the longest delay time.
  • the delay correction unit 112 may perform delay correction to match all other videos to a predetermined video.
  • The delay correction unit 112 may perform delay correction based on delay times calculated by grouping a plurality of videos. In this case, the delay correction unit 112 may extract a small number of samples from the grouped set, set the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of all videos in the group accordingly.
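  • One possible reading of this group-based correction, sketched in Python under the assumption that a fixed number of representative samples is drawn at random:

```python
import random

def group_delay(delays_in_group, num_samples=3):
    """Estimate a single delay time for a whole group of remote-audience videos
    as the average delay of a few representative samples."""
    samples = random.sample(list(delays_in_group), min(num_samples, len(delays_in_group)))
    return sum(samples) / len(samples)   # applied to every video in the group
```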
  • the delay correction unit 112 determines whether delay correction has been performed for all input videos (step S8). If the delay correction unit 112 determines that delay correction has been performed on all input videos (step S8: YES), the process ends. If the delay correction unit 112 determines that delay correction has not been performed on all input videos (step S8: NO), the process transitions from step S8 to step S6.
  • the control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15.
  • the video output device 102 outputs video that has undergone delay correction.
  • The control unit 11 may output the delay-corrected video to the audience terminals 2 to 2n via the IP network.
  • the audience terminals 2 to 2n output video images on which delay correction has been performed.
  • As described above, the server 1 extracts video features from a plurality of input videos, estimates a relative delay time from the video features extracted from the plurality of input videos, and synchronizes the plurality of input videos by correcting their delay times using the delay time. Therefore, the server 1 can estimate the delay times between multiple videos based on features such as motion included in the input videos, correct the playback times of the multiple videos based on the delay times, and play back a video in which the audience members' actions appear aligned. Thereby, the server 1 can synchronize the videos based on the motion included in the videos.
  • the second embodiment is an embodiment in which a sound feature is extracted from an input sound and a plurality of videos are synchronized based on the input sound.
  • FIG. 12 is a block diagram showing an example of the software configuration of the server 1 configuring the video synchronization system according to the second embodiment.
  • the server 1 includes a video feature extraction section 110, a delay estimation section 111, a delay correction section 112, a sound feature extraction section 113, and a learning section 114.
  • Each functional unit is realized by execution of a program by the control unit 11. It can also be said that each functional unit is included in the control unit 11 or the processor. Each functional unit can be read as the control unit 11 or a processor.
  • Although three video feature extraction units 110 are illustrated in FIG. 12, the number of video feature extraction units 110 is not limited to this. In the following description, it is assumed that each of a plurality of input videos is processed by a different video feature extraction unit 110, but they may all be processed by a single video feature extraction unit 110.
  • the sound feature extraction unit 113 extracts sound features from the input sound.
  • the input sound is the sound played at the venue.
  • the input sound is a sound that serves as a reference for sound characteristics.
  • the sound feature extraction unit 113 may perform sound feature extraction using the method described in Non-Patent Document 2 or 3.
  • the sound feature extraction unit 113 may perform matching of feature vectors of different modals on a common feature space.
  • the sound feature extraction unit 113 may provide at least two types of sound feature extraction intervals or densities. For example, the sound feature extraction unit 113 may narrow the interval or increase the density of feature extraction for one of the two videos, and widen the interval or lower the density of feature extraction for the other video. Note that the sound feature extraction unit 113 may extract sound features from the input video.
  • the delay estimation unit 111 estimates the delay time from the sound of the video by comparing the video features and sound features of a plurality of videos.
  • the delay estimating unit 111 estimates relative delay times between a plurality of input videos and sounds from sound features.
  • the delay estimation unit 111 adjusts the playback times of the plurality of videos according to the input sound.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of intervals or the two types of densities to estimate the relative delay time. For example, when estimating the delay between two videos, the delay estimating unit 111 may estimate the delay time by narrowing the interval or increasing the density of feature extraction for one sound, and widening the interval or lowering the density of feature extraction for the other sound.
  • the delay correction unit 112 uses the delay time to correct the delay times of the plurality of videos, and synchronizes the plurality of videos.
  • The delay correction unit 112 corrects the delay times of the plurality of videos based on the sound.
  • the delay correction unit 112 adjusts the playback times of the plurality of videos in accordance with the sound.
  • FIG. 13 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the second embodiment.
  • images from cameras of multiple remote audience members and input sounds are input, and a video in which the images of multiple remote audience members are synchronized is output.
  • the image from the remote spectator's camera is an example of the input image.
  • the input video is a remote audience video obtained from the spectator terminals 2 to 2n.
  • the synchronized video is output via the video output device 102 at the event venue.
  • the synchronized video may be output to the audience terminals 2 to 2n.
  • the input sound is, for example, sound obtained from the audio output device 101.
  • the control unit 11 acquires input sound (step S101).
  • the input sound is, for example, reproduced sound played at an event venue.
  • the sound feature extraction unit 113 waits for input sound (step S102).
  • the sound feature extraction unit 113 extracts sound features from the input sound (step S103).
  • the sound feature extraction unit 113 extracts sound features using a known method.
  • the sound feature extraction unit 113 may perform feature extraction on the input sound at different intervals or densities.
  • the sound feature extraction unit 113 may extract sound features by providing at least two types of feature extraction intervals or densities. Types of spacing or density include wide spacing, narrow spacing, high density, low density, and the like.
  • the control unit 11 obtains input images obtained from the audience terminals 2 to 2n (step S104).
  • the video feature extraction unit 110 waits for an input video (step S105).
  • the video feature extraction unit 110 extracts video features from the input video similarly to step S3 (step S106).
  • the video feature extraction unit 110 determines whether feature extraction has been performed for all input videos (step S107). If the video feature extraction unit 110 determines that feature extraction has been performed for all input videos (step S107: YES), the process transitions from step S107 to step S108. If the video feature extraction unit 110 determines that feature extraction has not been performed for all input videos (step S107: NO), the process transitions from step S107 to step S105.
  • the delay estimation unit 111 estimates the relative delay times of the plurality of input videos and sounds from the sound features (step S108).
  • In step S108, for example, the delay estimation unit 111 compares the video features of the input videos with the sound features.
  • The delay estimation unit 111 estimates the delay times of the plurality of videos relative to the sound based on the result of the comparison. For example, the delay estimating unit 111 collates the plurality of video features with the sound features and, based on the distance between a video feature and a sound feature or the similarity between them, determines which sound feature at which time each video feature at each time is closest to.
  • The delay estimating unit 111 estimates the delay time of a video relative to the sound based on the time of the sound feature that is closest, in distance or in similarity, to the video feature of that video.
  • the delay estimating unit 111 may estimate at least one of the relative delay time and speed using voting, for example.
  • the delay estimating unit 111 determines the timing deviation of the movements of the plurality of remote spectators, that is, the delay time, by determining which time the video feature at each time is closest to the sound feature at which time, and voting for that time.
  • the delay estimation unit 111 estimates the delay time from the sound of the video based on the voted time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transform.
  • the delay estimation unit 111 may perform delay estimation by setting the relative speed as 1.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of feature extraction intervals or the two types of feature extraction densities to estimate the relative delay time.
  • the delay correction unit 112 waits for input video (step S109).
  • the delay correction unit 112 uses the delay time estimated by the delay estimation unit 111 to correct the delay time of the plurality of input videos, and synchronizes the plurality of input videos (step S110).
  • In step S110, for example, the delay correction unit 112 obtains an input video.
  • the delay correction unit 112 performs delay correction based on the input sound.
  • the delay correction unit 112 performs delay correction based on the delay time from the sound of a plurality of input videos.
  • the delay correction unit 112 corrects the playback times of a plurality of input videos based on the delay time.
  • the delay correction unit 112 determines whether delay correction has been performed for all input videos (step S111). If the delay correction unit 112 determines that delay correction has been performed on all input videos (step S111: YES), the process ends. If the delay correction unit 112 determines that delay correction has not been performed on all input videos (step S111: NO), the process transitions from step S111 to step S109.
  • the control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15.
  • the video output device 102 outputs video that has undergone delay correction.
  • The control unit 11 may output the delay-corrected video to the audience terminals 2 to 2n via the IP network.
  • the audience terminals 2 to 2n output video images on which delay correction has been performed.
  • As described above, the server 1 extracts video features from a plurality of input videos, estimates a relative delay time from the video features extracted from the plurality of input videos, and synchronizes the plurality of input videos by correcting their delay times using the delay time. Furthermore, the server 1 can extract sound features from the input sound and estimate the relative delay times between the plurality of videos and the sound from the sound features. Therefore, the server 1 can estimate the delay times of the multiple videos relative to the input sound based on features such as motion included in the input videos and the sound features of the input sound, and correct the playback times of the multiple videos based on the delay times. Thereby, the server 1 can adjust the playback times of the videos in accordance with the input sound, and can synchronize the videos with the sound.
  • FIG. 14 is a diagram illustrating an example of a method of photographing a video at an event venue according to the first and second embodiments.
  • a camera installed inside the event venue photographs the crowd inside the venue.
  • an image of the crowd as shown in FIG. 4 is captured by a camera in the venue.
  • cameras in the venue are installed on the stage side of the venue, and are installed to take pictures of the audience seats.
  • the number of cameras in the venue is not limited to one, and a plurality of cameras may be installed.
  • the crowd image may be an image selected from images captured by at least one camera.
  • FIG. 15 is a diagram illustrating an application example of the video synchronization system S according to the first and second embodiments.
  • FIG. 15 shows an example of a system where remote audience members participate in a live streaming event. Remote spectators participate in the event using a PC that allows them to watch the live broadcast while filming themselves with a camera.
  • the roles of the devices and functions in (1) to (5) in FIG. 15 are as follows.
  • (1) Remote spectators will continue to transmit camera images of their faces and upper bodies during the event.
  • (1) is a function performed by a plurality of audience terminals 2 to 2n.
  • (2) Images from remote audience members are arranged horizontally, vertically, or in a grid, aggregated, and processed into an easy-to-use format.
  • (2) is a function provided by the server 1.
  • (3) Display the video created in (2) at the event venue.
  • (3) is a function provided by the server 1.
  • (4) Deliver the captured video of the event venue in which the remote audience's video is reflected to the remote audience's PC as a live distribution video.
  • (4) is a function provided by the server 1.
  • (5) The live streaming video is displayed on the remote audience member's PC screen.
  • (5) is a function performed by a plurality of audience terminals 2 to 2n.
  • Using the video of the remote audience in this way not only allows the audience and performers at the venue to see how excited the remote audience is, but can also be used for interaction between the performers and audience at the venue and the remote audience, which is thought to increase the utility value of interactive video distribution. Furthermore, at live music events and sports events, it is common for spectators to enjoy the event by using penlights and cheering goods to synchronize their movements with one another. However, it is difficult to match the remote audience's videos with the on-site experience because of the delay that occurs between the live venue and the remote environment.
  • The first and second embodiments adjust the delay time caused by various factors such as communication time, video processing time, and the reaction time of the remote audience, in order to create harmonious, synchronized videos from the camera images of multiple remote audience members viewing a live-streamed event, without sacrificing the sense of rhythm, and to use them in venue production.
  • The movements of a plurality of audience members, such as clapping hands or waving a penlight, are aligned in time to produce a harmonious video.
  • FIG. 16 is a diagram illustrating a processing example of the server 1 according to the first and second embodiments.
  • In order to synchronize the videos so that the movements of the audience are aligned, the server 1 needs to focus on and align the objects that serve as the reference for synchronizing people, that is, the partial movements in the videos of people clapping their hands or waving their penlights.
  • the server 1 uses a 3-Dimensional Convolutional Neural Network (3D-CNN) to extract spatial and temporal features, and performs a search based on the obtained features.
  • the video feature extraction unit 110 extracts video features X from the input video X based on feature extraction by phase-based learning or phase difference-based learning.
  • the video feature extraction unit 110 extracts the video feature Y from the input video Y based on feature extraction by phase-based learning or phase difference-based learning.
  • the delay estimation unit 111 estimates a relative delay time from the video feature X and the video feature Y.
  • the delay correction unit 112 synchronizes the reproduced video X based on the input video X and the reproduced video Y based on the input video Y using the estimated delay time.
  • For example, the delay correction unit 112 corrects the playback time of the playback video X by inserting into it the delay time of the playback video Y estimated relative to the playback video X, and creates a video in which the playback video X and the playback video Y appear synchronized.
  • FIG. 17 is a diagram illustrating an example of a Deep Neural Network (DNN) structure implemented in the video feature extraction unit 110 according to the first and second embodiments.
  • the video feature extraction unit 110 uses the DNN structure of the 3D-CNN in order to extract video features based on phase-based learning or phase difference-based learning.
  • The DNN structure of the 3D-CNN is ResNet18-3D (R3D-18) described in Non-Patent Document 5, trained using Kinetics-400 described in Non-Patent Document 4, which is used for human action classification tasks; the top three layers are used.
  • the image features extracted from R3D-18 are encoded into a G-dimensional latent space through a fully connected layer and a pooling layer.
  • The video encoder f encodes an input video x, a clip of height H, width W, and P frames, into the latent space as a G-dimensional feature vector f(x).
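  • One way such an encoder could be assembled, sketched with PyTorch/torchvision under the assumption that the pretrained R3D-18 backbone is reused up to its pooled 512-dimensional feature and followed by a fully connected projection into the G-dimensional latent space (the exact layers retained by the publication may differ):

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoEncoder(nn.Module):
    def __init__(self, latent_dim=128):                 # latent_dim plays the role of G
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")      # Kinetics-400 pretrained R3D-18 (torchvision >= 0.13)
        backbone.fc = nn.Identity()                      # drop the 400-class action head
        self.backbone = backbone                         # yields a pooled 512-d feature per clip
        self.proj = nn.Linear(512, latent_dim)           # fully connected layer into the latent space

    def forward(self, clip):                             # clip: (batch, 3, P, H, W)
        return self.proj(self.backbone(clip))            # (batch, G)
```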
  • The loss function used for DNN training is the triplet loss L = max(d_p - d_n + m, 0), where m is a positive constant called the margin parameter. In learning based on the triplet loss, learning is performed using a set consisting of an Anchor x_a serving as the reference, a Positive x_p in the same category as the Anchor, and a Negative x_n in a different category from the Anchor.
  • Each input is output as a vector in the embedding space by CNN (transform f).
  • The distance d_p between the Anchor and the Positive and the distance d_n between the Anchor and the Negative are measured using a distance function d. Note that the Euclidean distance is used as the distance function d.
  • a triplet is constructed by selecting positive samples and negative samples only in the mini-batch using the method described in Non-Patent Document 6.
  • negative samples are selected according to the following semi-hard negative condition (formula (1)).
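  • Formula (1) is not reproduced in this text; a commonly used semi-hard condition is d_p < d_n < d_p + m, and the following NumPy sketch (an illustrative assumption, with Euclidean distance as stated above) shows the triplet loss together with the selection of such a negative within a mini-batch.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(d_p - d_n + m, 0) with Euclidean distance d."""
    d_p = np.linalg.norm(f_a - f_p)
    d_n = np.linalg.norm(f_a - f_n)
    return max(d_p - d_n + margin, 0.0)

def semi_hard_negative(f_a, f_p, batch_negatives, margin=0.2):
    """Return a negative from the mini-batch satisfying d_p < d_n < d_p + margin, if any."""
    d_p = np.linalg.norm(f_a - f_p)
    for f_n in batch_negatives:
        d_n = np.linalg.norm(f_a - f_n)
        if d_p < d_n < d_p + margin:
            return f_n
    return None
```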
  • Non-patent document 4 W. Kay et al. “The Kinetics Human Action Video Dataset”. Computing Research Repository, abs/1705.06950, 2017.
  • Non-patent document 5 D. Tran et al. “A Closer Look at Spatiotemporal Convolutions for Action Recognition”. Proc. of IEEE International Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Non-patent document 6 F. Faghri et al. “VSE++: Improving visual-semantic embeddings with hard negatives”. Proc. of the British Machine Vision Conf. (BMVC), 2018
  • FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit 114 according to the first and second embodiments.
  • The upper half of FIG. 18 shows the left-and-right swing of the penlight viewed from the front, and the lower half shows the corresponding phase expressed in radians.
  • One back-and-forth swing of the penlight from side to side is treated as one cycle, and each position is associated with a phase.
  • As long as one period can be defined for a periodic movement such as dancing, hand movements, or object movements, each position can similarly be associated with a phase.
  • In learning, it is necessary to determine whether each pair is Positive or Negative. Here, the phase range is divided into four divisions (corresponding to quadrants) in advance; a pair is determined to be Positive if both fall in the same division and Negative if they fall in different divisions. This criterion is referred to here as phase-based learning.
  • FIG. 19 is a diagram illustrating an example of learning based on phase differences performed by the learning unit 114 according to the first and second embodiments. This is a criterion in which a pair is Positive when the phase difference between the two is less than π/2, and Negative when it is π/2 or more. This is referred to here as learning based on phase differences.
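  • The two labelling rules can be expressed compactly as follows (a Python sketch; the division boundaries are assumptions consistent with the quadrant and π/2 descriptions above).

```python
import math

def positive_by_phase(phase_a, phase_b):
    """Phase-based rule: Positive when both phases fall in the same of the
    four divisions (quadrants) of [0, 2*pi)."""
    quadrant = lambda p: int((p % (2 * math.pi)) // (math.pi / 2))
    return quadrant(phase_a) == quadrant(phase_b)

def positive_by_phase_difference(phase_a, phase_b):
    """Phase-difference rule: Positive when the wrapped difference is below pi/2."""
    diff = abs(phase_a - phase_b) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)
    return diff < math.pi / 2
```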
  • FIG. 20 is a diagram illustrating an example of a time-series search by the delay estimation unit 111 according to the first and second embodiments.
  • the delay estimating unit 111 compares the feature vectors of the latent space obtained as video features through the processing of the video feature extracting unit 110 in time series.
  • the delay estimation unit 111 extracts feature vectors for person X and person Y in time series.
  • The delay estimating unit 111 extracts the feature vectors in time series, starting from a time offset by the delay time t_0.
  • The delay estimating unit 111 compares the feature vector F_x(t + t_0) of person X with the feature vector F_Y(t) of person Y while shifting the delay time t_0.
  • The delay estimation unit 111 determines the delay time t_0 at which the distance D(F_x, F_Y, t_0) is smallest.
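  • The time-series search can be sketched as follows (illustrative only; a non-negative integer delay in frames, equal feature dimensions, and a mean-distance definition of D are assumed).

```python
import numpy as np

def estimate_delay(feat_x, feat_y, max_shift):
    """feat_x, feat_y: arrays of shape (T, G), time series of latent feature vectors.
    Returns the shift t0 minimising D(F_x, F_Y, t0) = mean_t ||F_x(t + t0) - F_Y(t)||."""
    best_t0, best_d = 0, float("inf")
    for t0 in range(max_shift + 1):
        n = min(len(feat_x) - t0, len(feat_y))
        if n <= 0:
            break
        d = np.mean(np.linalg.norm(feat_x[t0:t0 + n] - feat_y[:n], axis=1))
        if d < best_d:
            best_t0, best_d = t0, d
    return best_t0
```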
  • The delay correction unit 112 adjusts the playback time of the playback video X of person X by inserting the delay time t_0 obtained by the processing of the delay estimation unit 111 into the playback time.
  • the delay correction unit 112 can create an image in which the reproduced image X of the person X and the reproduced image Y of the person Y appear to be synchronized. Note that in the distance-based search, the delay estimation unit 111 can use Euclidean distance as the distance function D.
  • the video synchronization device may be realized by one device as explained in the above example, or may be realized by a plurality of devices with distributed functions.
  • the program may be transferred while being stored in the electronic device, or may be transferred without being stored in the electronic device. In the latter case, the program may be transferred via a network or may be transferred while being recorded on a recording medium.
  • the recording medium is a non-transitory tangible medium.
  • the recording medium is a computer readable medium.
  • the recording medium may be any medium capable of storing a program and readable by a computer, such as a CD-ROM or a memory card, and its form is not limited.
  • the present invention is not limited to the above-described embodiments as they are, but can be embodied by modifying the constituent elements at the implementation stage without departing from the spirit of the invention.
  • various inventions can be formed by appropriately combining the plurality of components disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiments. Furthermore, components from different embodiments may be combined as appropriate.
  • the embodiments described above may be applied not only to electronic devices but also to methods performed by electronic devices.
  • the above-described embodiments may be applied to a program that allows a computer to execute the processing of each part of an electronic device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A video synchronization device according to one embodiment is equipped with: a video feature extraction unit for extracting a video feature from a plurality of input videos; a delay estimation unit for estimating the relative delay interval from the video features extracted from the plurality of input videos; and a delay correction unit for synchronizing the plurality of input images by correcting the delay interval of the plurality of input videos by using said delay interval.

Description

Video synchronization device, video synchronization method, and video synchronization program
 One aspect of the present invention relates to a video synchronization device, a video synchronization method, and a video synchronization program.
 In recent years, video/audio playback equipment has come into use that digitizes video and audio shot and recorded at a certain point, transmits them in real time to a remote location via communication lines such as IP (Internet Protocol) networks, and plays back the video and audio at the remote location. For example, online live streaming and public viewing, which transmit in real time the video and audio of live music events held at music venues or of sports matches held at competition venues to remote locations, have become popular. Such video/audio transmission is not limited to one-to-one one-way transmission. Two-way transmission is also being carried out, in which video and audio are transmitted from the venue where the live music event is being held (hereinafter referred to as the event venue) to multiple remote locations; at each of those remote locations, video and audio such as the cheers of the audience enjoying the live performance are in turn shot and recorded, transmitted to the event venue and to the other remote locations, and output from large video display devices and speakers at each site.
 In such two-way video/audio transmission, when a customer enjoying the video of a live music event at a remote location connects to the event venue and waves a penlight, claps, or dances along with the music and with the other audience members at the event venue or at other remote locations, it is difficult to present that video in sync with the performers and audience at the event venue and with the audience at the other remote locations. The transmission between a remote location and the event venue involves delay time caused by various factors such as communication time, video processing time, and the reaction time of the audience at the remote location. Therefore, it is difficult to synchronize, in real time, videos that include the movements of spectators at remote locations.
 It is therefore conceivable to estimate the delay times of the videos of spectators at multiple remote locations and to correct the playback times of the videos based on the estimated delay times, thereby producing a video in which the spectators' movements appear aligned. Non-Patent Document 1 describes a method of synchronizing videos based on a synchronization signal embedded in a video signal.
 However, the method of Non-Patent Document 1 synchronizes the videos being viewed, and it is difficult to synchronize videos based on the actions within the videos.
 This invention has been made in view of the above circumstances, and its purpose is to provide a technology that can synchronize videos based on the motion contained in the videos.
 In one embodiment of this invention, the video synchronization device includes a video feature extraction unit that extracts video features from a plurality of input videos, a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos, and a delay correction unit that synchronizes the plurality of input videos by correcting the delay times of the plurality of input videos using the delay time.
 According to one aspect of this invention, videos can be synchronized based on motion included in the videos.
FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
FIG. 2 is a block diagram showing an example of the software configuration of a server constituting the video synchronization system according to the first embodiment.
FIG. 3 is a diagram showing an example of video of spectators at remote locations according to the first embodiment.
FIG. 4 is a diagram showing an example of video at an event venue according to the first embodiment.
FIG. 5 is a conceptual diagram showing video feature extraction by the server according to the first embodiment.
FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the first embodiment.
FIG. 7 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 8 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 9 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 10 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 11 is a conceptual diagram showing video synchronization processing of the server according to the first embodiment.
FIG. 12 is a block diagram showing an example of the software configuration of a server constituting the video synchronization system according to the second embodiment.
FIG. 13 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the second embodiment.
FIG. 14 is a diagram showing an example of a method of capturing video at an event venue according to the first and second embodiments.
FIG. 15 is a diagram explaining an application example of the video synchronization system according to the first and second embodiments.
FIG. 16 is a diagram explaining a processing example of the server according to the first and second embodiments.
FIG. 17 is a diagram explaining an example of a DNN structure implemented in the video feature extraction unit according to the first and second embodiments.
FIG. 18 is a diagram explaining an example of phase-based learning performed by the learning unit according to the first and second embodiments.
FIG. 19 is a diagram explaining an example of phase-difference-based learning performed by the learning unit according to the first and second embodiments.
FIG. 20 is a diagram explaining an example of a time-series search performed by the delay estimation unit according to the first and second embodiments.
Hereinafter, some embodiments of the present invention will be described with reference to the drawings.

Consider a live music performance (hereinafter also referred to as an event) at a music venue or the like, where the playback times of the input videos of multiple spectators watching the performance from remote locations (hereinafter, remote spectators) are to be synchronized based on the characteristics of the movements in those videos.
As the input videos, multiple videos of remote spectators as shown in FIG. 3 are used. FIG. 3 shows videos of multiple remote spectators getting excited while using penlights. In FIG. 3, the videos of the remote spectators are aggregated into a 5×5 matrix; the individual videos are cut out from such an aggregated video and used. Note that video of a crowd at an event venue as shown in FIG. 4 may also be used as an input video. FIG. 4 shows a crowd at an event venue getting excited while using penlights. In this case, either a part of the crowd video or the whole of it may be used as an input video. The input videos are assumed to show spectators holding a distinctive item whose movement is easy to see, such as a penlight, but they may also show spectators clapping or dancing without holding anything.
[First Embodiment]
The first embodiment synchronizes a plurality of videos by using the features of the videos of remote spectators.
(Configuration example)
FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
The video synchronization system S includes a server 1, an audio output device 101, a video output device 102, and a plurality of spectator terminals 2 to 2n. The server 1, the audio output device 101, the video output device 102, and the spectator terminals 2 to 2n can communicate with one another via an IP network.
The server 1 is an electronic device that collects data and processes the collected data. Electronic devices include computers.

The audio output device 101 is a device that includes a speaker for reproducing and outputting sound, for example a device that outputs sound at the event venue.

The video output device 102 is a device that includes a display, for example a liquid crystal display, for playing back and displaying video, for example a device that plays back and displays video at the event venue.

Each of the spectator terminals 2 to 2n is a terminal used by one of the remote spectators. Each is an electronic device having an input function, a display function, and a communication function, such as a tablet terminal, a smartphone, or a PC (Personal Computer), but is not limited to these. The spectator terminal 2 is an example of a terminal.
A configuration example of the server 1 will now be described.
The server 1 includes a control unit 11, a program storage unit 12, a data storage unit 13, a communication interface 14, and an input/output interface 15. These elements are connected to one another via a bus.
The control unit 11 corresponds to the central part of the server 1. The control unit 11 includes a processor such as a central processing unit (CPU), a ROM (Read Only Memory) as a nonvolatile memory area, and a RAM (Random Access Memory) as a volatile memory area. The processor loads a program stored in the ROM or in the program storage unit 12 into the RAM and executes it, whereby the control unit 11 realizes the functional units described later. The control unit 11 constitutes a computer.

The program storage unit 12 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), as a storage medium. It stores the programs necessary to execute various control processes, for example programs that cause the server 1 to execute the processing of the functional units implemented in the control unit 11. The program storage unit 12 is an example of storage.

The data storage unit 13 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD or an SSD, as a storage medium. The data storage unit 13 is an example of storage, or of a storage unit.

The communication interface 14 includes various interfaces that communicably connect the server 1 to other electronic devices using the communication protocols defined by the IP network.

The input/output interface 15 is an interface that enables communication between the server 1 and each of the audio output device 101 and the video output device 102. It may include a wired communication interface or a wireless communication interface.

Note that the hardware configuration of the server 1 is not limited to the above. Components may be omitted, changed, or added as appropriate.
FIG. 2 is a block diagram showing an example of the software configuration of the server 1 constituting the video synchronization system according to the first embodiment.

The server 1 includes a video feature extraction unit 110, a delay estimation unit 111, a delay correction unit 112, and a learning unit 114. Each functional unit is realized by the control unit 11 executing a program; each can also be said to be included in, and may be read as, the control unit 11 or the processor. Although three video feature extraction units 110 are illustrated in FIG. 2, the number of video feature extraction units 110 is not limited to this. In the following description, each of the plurality of input videos is assumed to be processed by a different video feature extraction unit 110, but they may be processed by a single video feature extraction unit 110.

The video feature extraction unit 110 extracts video features from an input video. The input videos include, for example, videos of multiple remote spectators, such as individual videos cut out from the 5×5 matrix video shown in FIG. 3, and may also include video of a crowd at an event venue as shown in FIG. 4. A video feature is a feature observed in the input video, for example a person's movement, an object, or a facial expression contained in the video. When the input video shows spectators, the video features include human movements such as waving a penlight, lifting a towel, raising a hand, or waving a hand from side to side; they may also include objects such as penlights and towels, or facial expressions such as smiling or crying. Video features include features that represent actions or movements in the video.
The video feature extraction unit 110 performs feature extraction while shifting a window over the input video, for example as shown in FIG. 5, which is a conceptual diagram of the video feature extraction performed by the server 1 according to the first embodiment. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video based on a video clipping window width and determines the starting point of each window based on a clipping interval. After extracting features from the input video within one clipping window, it shifts the window by the clipping interval and extracts features from the input video within the next window.
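The windowing just described can be illustrated with a short Python sketch. It is only a minimal illustration, assuming a frame array and a placeholder extract_features function that stands in for the learned extractor; neither name comes from the specification.

    import numpy as np

    def extract_features(clip):
        # Placeholder for the learned video feature extractor: average the
        # frames and flatten, purely to keep the sketch runnable.
        return clip.mean(axis=0).ravel()

    def sliding_window_features(frames, window, interval):
        # Cut the video into clips of `window` frames, advancing the start
        # point by `interval` frames, and extract one feature per clip.
        feats = []
        for start in range(0, len(frames) - window + 1, interval):
            feats.append(extract_features(frames[start:start + window]))
        return np.stack(feats)      # shape: (num_windows, feature_dim)

    # Toy input: 120 frames of 8x8 grayscale video.
    video = np.random.rand(120, 8, 8)
    features = sliding_window_features(video, window=16, interval=4)
    print(features.shape)           # (27, 64)

Here `window` plays the role of the video clipping window width and `interval` the role of the clipping interval.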
The video feature extraction unit 110 may use machine learning for video feature extraction, for example the known methods described in Non-Patent Document 2 or Non-Patent Document 3. It may also use a video feature extraction method trained in advance by associating rhythmic sounds with video; in that case, the extracted video features are more closely related to rhythm.
Non-Patent Document 2: Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi and Noboru Harada, "Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels," Proc. Interspeech 2020, pp. 1446-1450, 2020.
Non-Patent Document 3: Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, and Noboru Harada, "Cross-modal sound retrieval based on specific co-occurrence relations indicated by weak labels," Proceedings of the Acoustical Society of Japan Research Conference, Autumn 2020, ROMBUNNO.2-1-2.
The video feature extraction unit 110 may also extract video features using the feature extraction obtained through the learning performed by the learning unit 114, described later. The feature extraction obtained through this learning is a video feature extraction method, which can also be regarded as a trained model for extracting video features. The video feature extraction unit 110 can extract, as a video feature, a feature vector representing an action or movement in the video.

Note that the video feature extraction unit 110 does not have to use the same feature extraction interval or density at all times. It may provide at least two kinds of feature extraction intervals or densities; for example, for one of two videos it may use a narrower interval or higher density, and for the other a wider interval or lower density.
The delay estimation unit 111 estimates relative delay times from the video features extracted from the plurality of input videos. It compares multiple sequences of video features in time series and, based on the distance or the similarity between video features, determines for each time which time in the other sequence has the closest video feature. This search procedure, which compares sequences of video features in time series and finds, for each time, the time whose video feature is closest, is referred to here as a time-series search.
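A minimal sketch of such a time-series search, under the assumption that each video has already been reduced to a time series of feature vectors (for example by the windowing sketch above), might look as follows; the nearest neighbor is taken by Euclidean distance, one of the possibilities mentioned in the text.

    import numpy as np

    def time_series_search(feats_a, feats_b):
        # For each time index i in feats_a, find the time index j in feats_b
        # whose feature vector is closest, and return the matched (i, j) pairs.
        pairs = []
        for i, fa in enumerate(feats_a):
            dists = np.linalg.norm(feats_b - fa, axis=1)
            pairs.append((i, int(np.argmin(dists))))
        return pairs

    # Toy example: feats_b is feats_a shifted by 3 time steps.
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(30, 16))
    feats_b = np.roll(feats_a, 3, axis=0)
    print(time_series_search(feats_a, feats_b)[:5])   # pairs follow j = i + 3

The matched time pairs are the input to the delay estimation described next.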
The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting. It may estimate the timing offset between the movements of multiple remote spectators, that is, the delay time, by finding for each time the time whose feature is closest and casting a vote for that time. The voting-based estimation may be performed using the Hough transform.

The Hough transform draws, for each point (x_i, y_i) on the x-y plane shown in FIG. 7, the corresponding straight line on the a-b plane shown in FIG. 8, and determines the slope and intercept of the underlying line by voting. The delay estimation unit 111 finds the intersection (a_0, b_0) of the lines drawn on the a-b plane by voting into cells of a grid. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video at regular intervals and converts it into a time series of feature vectors as time-series video features. As shown in FIG. 9, the delay estimation unit 111 measures the distances between the feature vectors of the two persons at each time and plots the pairs of times that are nearest neighbors. If the paired videos show the same movement, these points ideally lie on a straight line offset in time by the delay. The delay estimation unit 111 obtains the slope a_0 and intercept b_0 of this line by the Hough transform.

For example, consider delay estimation for videos of remote spectators. Since it is unlikely that the individual remote spectators are moving in response to videos played back at different speeds, the delay estimation unit 111 may simplify the voting by fixing the slope to a_0 = 1, as shown in FIG. 10. For example, when multiple spectators at remote locations are watching the live performance, each spectator can be assumed to be watching video played back at the same speed, so a_0 may be fixed to 1. This is efficient because only one vote needs to be cast per point. The delay estimation unit 111 takes the intercept b_0 that received the most votes as the estimated delay time. This example assumes estimation of the delay time between two videos; when the delay times of more than two videos are required, multiple pairs may be extracted from the set and the delay time determined for each pair. This search procedure, which compares sequences of video features in time series and finds, for each time, the closest video feature based on the similarity between features, is referred to here as a voting-based search.
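With the slope fixed to a_0 = 1, the voting reduces to a histogram over intercepts: each matched time pair (i, j) casts one vote for b = j - i, and the most voted intercept is taken as the delay. The following sketch assumes the matched pairs come from a search like the one above and that the feature extraction interval is known; it illustrates the voting idea rather than the patented procedure itself.

    from collections import Counter

    def estimate_delay_by_voting(pairs, interval_sec):
        # With the slope fixed to a_0 = 1, each matched pair (i, j) votes for
        # the intercept b = j - i; the most voted intercept, converted to
        # seconds via the extraction interval, is the estimated delay.
        votes = Counter(j - i for i, j in pairs)
        b0, _ = votes.most_common(1)[0]
        return b0 * interval_sec

    # Matched pairs from a time-series search, with one outlier included.
    pairs = [(0, 3), (1, 4), (2, 5), (3, 6), (4, 9), (5, 8)]
    print(estimate_delay_by_voting(pairs, interval_sec=0.1))   # 0.3 (b0 = 3)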
The delay estimation unit 111 may perform the matching between feature vectors by the method of Non-Patent Document 2 or 3. In that case, it may estimate the relative delay time, the relative speed, or both, using a distance measure such as the Euclidean distance. This search procedure, which compares sequences of video features in time series and finds, for each time, the closest video feature based on the distance between features, is referred to here as a distance-based search.

When the video feature extraction unit 110 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities. For example, when estimating the delay between two videos, it may use a narrower interval or higher density of feature extraction on one video and a wider interval or lower density on the other.
The delay correction unit 112 synchronizes the plurality of videos by correcting their delay times using the estimated delay times. It corrects the playback times of the videos based on the estimated delay times; for example, it aligns the videos with the video of the remote spectator with the largest delay by inserting the delay difference into the playback times of the videos of remote spectators with smaller delays, producing synchronized video.

The delay correction unit 112 may, for example, perform delay correction so that all the videos are aligned with the video of one of the spectators shown in FIG. 3, such as the remote spectator at the upper left. To obtain delay times at low cost, it may extract a small number of representative samples from a grouped set of remote spectators, treat the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of the videos of all remote spectators in the group. By correcting the playback times based on the delay times, the delay correction unit 112 can produce, from the videos of multiple spectators shown in the left part of FIG. 11, video in which the spectators' movements are aligned, as shown in the right part. In the following description, "playback" may be read as "output" or "transmission".
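One way to realize this correction is to hold back the playback of the less delayed videos by their difference from the most delayed one. The sketch below assumes that a delay (in seconds) has already been estimated for each video and uses hypothetical video identifiers; it only illustrates the offset computation.

    def compute_playback_offsets(delays):
        # Align every video with the one that has the largest estimated delay
        # by inserting the delay difference into the playback time of the rest.
        max_delay = max(delays.values())
        return {video_id: max_delay - d for video_id, d in delays.items()}

    # Estimated delays (seconds) of three remote-spectator videos.
    delays = {"spectator_A": 0.12, "spectator_B": 0.45, "spectator_C": 0.30}
    offsets = compute_playback_offsets(delays)
    print({v: round(o, 2) for v, o in offsets.items()})
    # {'spectator_A': 0.33, 'spectator_B': 0.0, 'spectator_C': 0.15}

spectator_B, the most delayed video, plays immediately, while the others are held back so that the movements line up.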
The learning unit 114 performs learning based on the phase of training videos or learning based on phase differences; both are described later. The learning unit 114 learns from training data containing multiple training videos and a phase associated with each of them. The phase corresponds to a part of the training video, for example a penlight appearing in it. Since the penlight is swung by a person, the phase corresponds to the position of the penlight: when the penlight is swung to the left as seen facing the person in the video, the phase may be 0 [rad]; when it is swung to the right, π [rad]; and when it is swung to the front, π/2 [rad]. The part of the training video to which the phase relates is not limited to a penlight and may be any of various movements, objects, or facial expressions of people in the video. The training data may be stored in the data storage unit 13 or in an electronic device other than the server 1. By performing the learning, the learning unit 114 obtains a video feature extraction method.
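The specification gives only the example mapping from penlight position to phase; the following sketch, in which the position names and the angular-difference helper are illustrative assumptions rather than the patented training procedure, shows how such phase labels could be attached to training clips and how a phase-difference-based criterion could compare them.

    import math

    # Illustrative phase labels for the penlight position in a training clip,
    # following the example mapping given in the text.
    PHASE_BY_POSITION = {"left": 0.0, "front": math.pi / 2, "right": math.pi}

    def phase_difference(phase_a, phase_b):
        # Smallest angular difference between two phase labels; a
        # phase-difference-based criterion could penalize this during training.
        d = abs(phase_a - phase_b) % (2 * math.pi)
        return min(d, 2 * math.pi - d)

    print(phase_difference(PHASE_BY_POSITION["left"], PHASE_BY_POSITION["right"]))  # pi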
(Operation example)
The procedure of processing by the server 1 will now be described.
In the following description, in which the server 1 is the subject, the server 1 may be read as the control unit 11.
Note that the processing procedure described below is only an example, and each process may be changed where possible. Steps may be omitted, replaced, or added as appropriate depending on the embodiment.

FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server 1 according to the first embodiment.

In the following processing, the camera videos of multiple remote spectators and the videos used for synchronization are the inputs, and video in which the videos of the multiple remote spectators are synchronized is the output. The remote spectators' camera videos and the videos used for synchronization are examples of input videos. The input videos are assumed to be remote-spectator videos obtained from the spectator terminals 2 to 2n. The synchronized video is output at the event venue via the video output device 102, and may also be output to the spectator terminals 2 to 2n.
The control unit 11 acquires the input videos obtained from the spectator terminals 2 to 2n (step S1).

The video feature extraction unit 110 waits for an input video (step S2).

The video feature extraction unit 110 extracts video features from the input video (step S3). In step S3, the video feature extraction unit 110 acquires the input video and, as shown in FIG. 5, extracts video features while shifting the clipping window. It may extract the video features using machine learning, for example the known methods described in Non-Patent Document 2 or Non-Patent Document 3, or a video feature extraction method trained in advance by associating rhythmic sounds with video.

The video feature extraction unit 110 may perform feature extraction on the plurality of input videos at different intervals or densities. For example, it may provide at least two kinds of feature extraction intervals or densities: when extracting features from two videos, it may use a narrower interval (or higher density) for one video and a wider interval (or lower density) for the other.

The video feature extraction unit 110 determines whether feature extraction has been performed for all the input videos (step S4). If so (step S4: YES), the process proceeds from step S4 to step S5; if not (step S4: NO), the process returns from step S4 to step S2.
The delay estimation unit 111 estimates relative delay times from the video features extracted from the plurality of input videos (step S5). In step S5, for example, the delay estimation unit 111 compares the sequences of video features and, based on the distance or the similarity between video features, determines for each time which time has the closest video feature. It estimates the delay time between two videos based on the time of the video feature that is closest, in distance or similarity, to the video feature of a given video. It may also extract multiple pairs from the set of input videos and estimate a delay time for each pair.

The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting: it finds, for each time, the time whose feature is closest, casts a vote for that time, and estimates the delay time between two videos based on the voted times. The voting-based estimation may use the Hough transform, and the relative speed may be fixed to 1.

When the video feature extraction unit 110 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities.
The delay correction unit 112 waits for an input video (step S6).

The delay correction unit 112 synchronizes the plurality of input videos by correcting their delay times using the delay times estimated by the delay estimation unit 111 (step S7). In step S7, for example, the delay correction unit 112 acquires the input videos and performs delay correction with the plurality of input videos as the reference, that is, based on times determined from the plurality of input videos, correcting the playback times of the input videos based on the estimated delay times. For example, it may align the videos with the video of the remote spectator with the largest delay by inserting the delay difference into the playback times of the videos with smaller delays, or it may perform delay correction so that all other videos are aligned with a predetermined video. It may also perform delay correction based on delay times calculated by grouping multiple videos: it extracts a small number of samples from a group, treats the average delay time of the samples as the delay time of all videos in the group, and corrects the playback times of all videos in the group (as sketched below).
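The grouped correction can be sketched as follows, assuming the per-video delays have already been estimated; the sample size and the identifiers are illustrative assumptions.

    import random

    def group_delay(delays, group, num_samples=3):
        # Estimate one delay for a whole group of remote-spectator videos by
        # averaging the delays of a small random sample from that group.
        sample = random.sample(group, min(num_samples, len(group)))
        return sum(delays[v] for v in sample) / len(sample)

    delays = {"spectator_%d" % i: 0.1 * (i % 4) for i in range(12)}
    group = list(delays)[:6]
    print(round(group_delay(delays, group), 3))   # applied to all six videos in the group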
The delay correction unit 112 determines whether delay correction has been performed for all the input videos (step S8). If so (step S8: YES), the processing ends; if not (step S8: NO), the process returns from step S8 to step S6.

The control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15, and the video output device 102 outputs it. The control unit 11 may also output the delay-corrected video to the spectator terminals 2 to 2n via the IP network, and the spectator terminals 2 to 2n output it.
(Effects)
In the embodiment described above, the server 1 extracts video features from a plurality of input videos, estimates relative delay times from those video features, and synchronizes the plurality of input videos by correcting their delay times using the estimated delay times. The server 1 thus estimates the delay times between multiple videos based on features such as the movements contained in the input videos and corrects the playback times of the videos based on those delay times, so that the movements contained in the videos are played back in alignment. In this way, the server 1 can synchronize videos based on the motion contained in them.
[Second Embodiment]
The second embodiment extracts sound features from an input sound and synchronizes a plurality of videos with the input sound as the reference.
(Configuration example)
In the second embodiment, the same components as in the first embodiment are given the same reference numerals and their description is omitted; the description below focuses mainly on the parts that differ from the first embodiment.
FIG. 12 is a block diagram showing an example of the software configuration of the server 1 constituting the video synchronization system according to the second embodiment.
The server 1 includes a video feature extraction unit 110, a delay estimation unit 111, a delay correction unit 112, a sound feature extraction unit 113, and a learning unit 114. Each functional unit is realized by the control unit 11 executing a program; each can also be said to be included in, and may be read as, the control unit 11 or the processor. Although three video feature extraction units 110 are illustrated in FIG. 12, the number of video feature extraction units 110 is not limited to this. In the following description, each of the plurality of input videos is assumed to be processed by a different video feature extraction unit 110, but they may be processed by a single video feature extraction unit 110.
The sound feature extraction unit 113 extracts sound features from an input sound. For a live music performance, the input sound is, for example, the sound played back at the venue; it serves as the reference for the sound features. The sound feature extraction unit 113 may perform sound feature extraction by the method of Non-Patent Document 2 or 3, and may match feature vectors of different modalities in a common feature space. It may provide at least two kinds of sound feature extraction intervals or densities; for example, for one of two videos it may use a narrower interval or higher density of feature extraction, and for the other a wider interval or lower density. The sound feature extraction unit 113 may also extract sound features from an input video.
The delay estimation unit 111 estimates the delay of each video relative to the sound by matching the video features of the plurality of videos against the sound features. It estimates the relative delay times between the plurality of input videos and the sound from the sound features, and adjusts the playback times of the videos to the input sound.

When the sound feature extraction unit 113 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities; for example, when estimating the delay of two videos, it may use a narrower interval or higher density of feature extraction for one sound and a wider interval or lower density for the other.

The delay correction unit 112 synchronizes the plurality of videos by correcting their delay times using the estimated delay times. It corrects the delay times of the videos with the sound as the reference, adjusting the playback times of the videos to the sound.
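As a rough sketch of the audio-referenced estimation, the function below matches each video feature to its nearest sound feature, assuming both already live in a common cross-modal feature space as in the methods of Non-Patent Documents 2 and 3, and votes on the offset with the slope fixed to 1. The toy data and the sign convention (positive when the video lags the sound) are assumptions for illustration only.

    import numpy as np

    def delay_to_audio(video_feats, audio_feats, interval_sec):
        # Match each video feature to its nearest sound feature in the common
        # feature space, vote on the time offset with the slope fixed to 1,
        # and convert the winning offset to seconds.
        offsets = []
        for i, fv in enumerate(video_feats):
            j = int(np.argmin(np.linalg.norm(audio_feats - fv, axis=1)))
            offsets.append(i - j)          # positive when the video lags the sound
        values, counts = np.unique(offsets, return_counts=True)
        return float(values[np.argmax(counts)]) * interval_sec

    # Toy common-space features: the video runs 5 windows behind the sound.
    rng = np.random.default_rng(1)
    audio_feats = rng.normal(size=(60, 8))
    video_feats = np.roll(audio_feats, 5, axis=0)
    print(delay_to_audio(video_feats, audio_feats, interval_sec=0.1))   # 0.5

The resulting per-video delays can then be fed to the same playback-offset correction sketched for the first embodiment.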
(Operation example)
FIG. 13 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the second embodiment.
In the following processing, the camera videos of multiple remote spectators and the input sound are the inputs, and video in which the videos of the multiple remote spectators are synchronized is the output. The remote spectators' camera videos are examples of input videos, and are assumed to be obtained from the spectator terminals 2 to 2n. The synchronized video is output at the event venue via the video output device 102, and may also be output to the spectator terminals 2 to 2n. The input sound is, for example, sound obtained from the audio output device 101.
The control unit 11 acquires the input sound (step S101). The input sound is, for example, the sound played back at the event venue.

The sound feature extraction unit 113 waits for the input sound (step S102).

The sound feature extraction unit 113 extracts sound features from the input sound (step S103). In step S103, for example, the sound feature extraction unit 113 performs sound feature extraction by a known method. It may perform feature extraction on the input sound at different intervals or densities; for example, it may provide at least two kinds of feature extraction intervals or densities, such as wide or narrow intervals and high or low densities.
The control unit 11 acquires the input videos obtained from the spectator terminals 2 to 2n (step S104).

The video feature extraction unit 110 waits for an input video (step S105).

The video feature extraction unit 110 extracts video features from the input video, as in step S3 (step S106).

The video feature extraction unit 110 determines whether feature extraction has been performed for all the input videos (step S107). If so (step S107: YES), the process proceeds from step S107 to step S108; if not (step S107: NO), the process returns from step S107 to step S105.
The delay estimation unit 111 estimates the relative delay times between the plurality of input videos and the sound from the sound features (step S108). In step S108, for example, the delay estimation unit 111 matches the video features of the input videos against the sound features and estimates the delay of each video from the sound based on the result. For example, it compares the video features and the sound features and, based on the distance or the similarity between them, determines for each time which time has the closest sound feature; it then estimates the delay of the video from the sound based on the time of the sound feature that is closest, in distance or similarity, to the video feature of the video.

The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting: it determines the timing offset between the movements of the remote spectators, that is, the delay time, by finding for each time the time whose sound feature is closest to the video feature, casting a vote for that time, and estimating the delay of the video from the sound based on the voted times. The voting-based estimation may use the Hough transform, and the relative speed may be fixed to 1.

When the sound feature extraction unit 113 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities.
The delay correction unit 112 waits for an input video (step S109).

The delay correction unit 112 synchronizes the plurality of input videos by correcting their delay times using the delay times estimated by the delay estimation unit 111 (step S110). In step S110, for example, the delay correction unit 112 acquires the input videos and performs delay correction with the input sound as the reference, that is, based on the delays of the input videos from the sound, correcting the playback times of the input videos based on those delay times.

The delay correction unit 112 determines whether delay correction has been performed for all the input videos (step S111). If so (step S111: YES), the processing ends; if not (step S111: NO), the process returns from step S111 to step S109.

The control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15, and the video output device 102 outputs it. The control unit 11 may also output the delay-corrected video to the spectator terminals 2 to 2n via the IP network, and the spectator terminals 2 to 2n output it.
(Effects)
In the embodiment described above, the server 1 extracts video features from a plurality of input videos, estimates relative delay times from those video features, and synchronizes the plurality of input videos by correcting their delay times using the estimated delay times. In addition, the server 1 extracts sound features from the input sound and estimates the relative delay times between the plurality of videos and the sound from those sound features. The server 1 can therefore estimate the delays of the videos from the input sound based on features such as the movements contained in the input videos and the sound features of the input sound, and correct the playback times of the videos based on those delays. In this way, the server 1 can adjust the playback times of the videos to the input sound and synchronize the videos with the sound.
Matters common to the first and second embodiments are described below as supplements.
First, a method of capturing video of a crowd at the event venue is described.
FIG. 14 is a diagram showing an example of a method of capturing video at an event venue according to the first and second embodiments.
As shown in FIG. 14, a camera installed in the event venue captures the crowd in the venue, producing crowd video such as that shown in FIG. 4. For example, the camera is installed on the stage side of the venue and aimed at the audience seats. The number of cameras in the venue is not limited to one; multiple cameras may be installed, and the crowd video may be selected from the videos captured by at least one camera.
 An application example of the video synchronization system S will be explained.
 FIG. 15 is a diagram illustrating an application example of the video synchronization system S according to the first and second embodiments.
 FIG. 15 shows an example of a system in which remote audience members participate in a live streaming event. A remote audience member participates in the event using a PC that allows the member to watch the live stream while being filmed by a camera. The roles of the devices and functions (1) to (5) in FIG. 15 are as follows.
 (1) Each remote audience member continues to transmit camera video of their face and upper body during the event. (1) is a function of the plurality of audience terminals 2 to 2n.
 (2) The videos of the remote audience members are arranged in a grid, horizontally, vertically, or both, aggregated, and processed into an easy-to-use form (a minimal tiling sketch is given after this list). (2) is a function of the server 1.
 (3) The video created in (2) is displayed at the event venue. (3) is a function of the server 1.
 (4) Video of the event venue in which the remote audience members' videos are reflected is captured and delivered to the remote audience members' PCs as the live distribution video. (4) is a function of the server 1.
 (5) The live distribution video is displayed on the screens of the remote audience members' PCs. (5) is a function of the plurality of audience terminals 2 to 2n.
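 The grid aggregation in (2) can be sketched as follows; representing the frames as equal-sized NumPy arrays and the function name tile_frames are assumptions for illustration only, not the embodiment's implementation.

```python
import numpy as np

def tile_frames(frames, cols):
    """Arrange equal-sized H x W x C frames into a grid with `cols` columns."""
    rows = -(-len(frames) // cols)                          # ceiling division
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```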
 Using video showing the remote audience in this way not only allows the audience and performers at the venue to see how excited the remote audience is, but can also be used for interaction between the performers and audience at the venue and the remote audience, which increases the utility value of two-way video distribution. In addition, at live music events and sports events, spectators often enjoy the event by synchronizing their movements with one another using penlights and cheering goods. Matching this kind of enjoyment between the remote audience's videos and the venue audience is difficult because of the delays that occur between the live venue and the remote environment.
 To create harmonious, synchronized video from the camera videos of a plurality of remote audience members watching a live streaming event without impairing the sense of rhythm, and to use it in the venue production, the first and second embodiments adjust the delays caused by various factors such as the communication time between the remote audience and the venue, the video processing time, and the reaction time of the remote audience. As shown in FIG. 11, the first and second embodiments generate harmonious video in which the movements of a plurality of audience members, such as clapping or waving penlights, are synchronized in the time direction.
 An example of processing by the server 1 will be explained.
 FIG. 16 is a diagram illustrating an example of processing by the server 1 according to the first and second embodiments.
 To synchronize the videos so that the movements of the audience appear aligned, the server 1 needs to focus on and align the targets that serve as the reference for synchronizing people, that is, the partial movements in the video such as clapping or waving a penlight. To capture such movements, the server 1 uses a 3-Dimensional Convolutional Neural Network (3D-CNN) to extract spatial and temporal features and performs a search based on the obtained features.
 The video feature extraction unit 110 extracts a video feature X from an input video X based on feature extraction trained by phase-based learning or phase-difference-based learning, and likewise extracts a video feature Y from an input video Y. The delay estimation unit 111 estimates a relative delay time from the video feature X and the video feature Y. The delay correction unit 112 uses the estimated delay time to synchronize a playback video X based on the input video X and a playback video Y based on the input video Y. For example, the delay correction unit 112 corrects the playback time of the playback video X by inserting, into the playback video X, the estimated delay time of the playback video Y with respect to the playback video X, thereby producing video in which the playback video X and the playback video Y are synchronized.
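 One simple way to realize the insertion of the estimated delay into the playback video X is sketched below; padding with copies of the first frame and the function name are assumptions used only for illustration.

```python
def insert_delay(frames_x, t0_frames):
    """Delay playback of video X by t0_frames by prepending copies of its first frame."""
    # After padding, the content of X starts t0_frames later, so X and Y line up.
    return [frames_x[0]] * t0_frames + list(frames_x)
```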
 An implementation example of the video feature extraction unit 110 will be described.
 FIG. 17 is a diagram illustrating an example of a Deep Neural Network (DNN) structure implemented in the video feature extraction unit 110 according to the first and second embodiments.
 S-avgpool represents average pooling over the spatial dimensions, and T-avgpool represents average pooling over the time dimension.
 The video feature extraction unit 110 uses a 3D-CNN DNN structure to extract video features based on feature extraction trained by phase-based learning or phase-difference-based learning. As the 3D-CNN DNN structure, the top three layers of Resnet18-3D (R3D-18) described in Non-Patent Document 5, trained with Kinetics-400 described in Non-Patent Document 4 and used for human action classification tasks, are used. The video features extracted by R3D-18 are encoded into a G-dimensional latent space through a fully connected layer and pooling layers.
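 A sketch of such an encoder, assuming PyTorch and torchvision (0.13 or later), is shown below. Which three layers of R3D-18 are reused, the latent dimension G, and all identifiers are assumptions made for illustration; the embodiment's actual structure is the one shown in FIG. 17.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")      # Kinetics-400 pretrained R3D-18
        # Assumption: "top three layers" is taken here as the stem plus the first
        # two residual stages of R3D-18.
        self.features = nn.Sequential(backbone.stem, backbone.layer1, backbone.layer2)
        self.pool = nn.AdaptiveAvgPool3d(1)              # spatial and temporal average pooling
        self.fc = nn.Linear(128, latent_dim)             # layer2 of R3D-18 outputs 128 channels

    def forward(self, x):                                # x: (batch, 3, P frames, H, W)
        h = self.pool(self.features(x)).flatten(1)       # (batch, 128)
        return self.fc(h)                                # G-dimensional latent vector

# Example: encode a 16-frame clip at 112 x 112 resolution into the latent space.
z = VideoEncoder()(torch.randn(1, 3, 16, 112, 112))      # z.shape == (1, 128)
```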
 The Video Encoder f is used to encode an input video x, a clip with height H, width W, and P frames, into the latent space as a G-dimensional vector f(x). Here, H, W, and P are the height, width, and number of frames of the input video. The loss function used for training the DNN is the triplet loss
   L_triplet = max(d_p - d_n + α, 0)
 where d_p and d_n are the Anchor-Positive and Anchor-Negative distances defined below and α is a positive constant called the margin parameter. In learning based on the triplet loss, training is performed on sets of three samples: a reference Anchor x_a, a Positive x_p in the same category as the Anchor, and a Negative x_n in a different category from the Anchor.
 Each input is output as a vector in the embedding space by the CNN (the transform f). In the embedding space, the distance d_p between the Anchor and the Positive and the distance d_n between the Anchor and the Negative are measured with a distance function d. The Euclidean distance is used as the distance function d.
 Here, to make training more efficient, triplets are constructed by selecting Positive and Negative samples only within each mini-batch, using the method described in Non-Patent Document 6. In particular, Negative samples are selected according to the following semi-hard negative condition (Equation (1)).
   d_p < d_n < d_p + α    (1)
Non-Patent Document 4: W. Kay et al. "The Kinetics Human Action Video Dataset". Computing Research Repository, abs/1705.06950, 2017.
Non-Patent Document 5: D. Tran et al. "A Closer Look at Spatiotemporal Convolutions for Action Recognition". Proc. of IEEE International Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
Non-Patent Document 6: F. Faghri et al. "VSE++: Improving visual-semantic embeddings with hard negatives". Proc. of the British Machine Vision Conf. (BMVC), 2018.
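 A PyTorch sketch of the triplet loss with semi-hard negative selection inside a mini-batch (Equation (1)) follows; the batch layout, the margin value, and the function name are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(anchors, positives, candidates, alpha=0.2):
    """anchors, positives: (B, G) embeddings; candidates: (B, N, G) candidate negatives."""
    d_p = F.pairwise_distance(anchors, positives)                       # (B,)
    d_n_all = torch.cdist(anchors.unsqueeze(1), candidates).squeeze(1)  # (B, N)
    # Semi-hard condition from Equation (1): d_p < d_n < d_p + alpha
    mask = (d_n_all > d_p.unsqueeze(1)) & (d_n_all < d_p.unsqueeze(1) + alpha)
    inf = torch.full_like(d_n_all, float("inf"))
    d_n = torch.where(mask, d_n_all, inf).min(dim=1).values
    # Fall back to the hardest negative if no candidate satisfies the condition.
    d_n = torch.where(torch.isinf(d_n), d_n_all.min(dim=1).values, d_n)
    return F.relu(d_p - d_n + alpha).mean()

# Example with random embeddings: batch of 8 anchors, each with 16 candidate negatives.
loss = semi_hard_triplet_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 16, 64))
```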
 The method of determining Positive and Negative is explained here.
 FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit 114 according to the first and second embodiments. The upper half of FIG. 18 shows the left-right swing of a penlight viewed from the front, and the lower half shows the corresponding phase expressed in radians. In this example, one round trip of the penlight from side to side is regarded as one period and associated with the phase; alternatively, one back-and-forth swing of the penlight or one clap of the hands may be associated with one period. More generally, for any periodic motion, such as dancing, hand movements, or the movement of an object, each position can be associated with a phase in the same way as long as one period can be defined.
 Positive or Negative must be determined for each pair. The phase is divided in advance into four sections (corresponding to the quadrants), and a pair is judged Positive when both samples fall in the same section and Negative when they fall in different sections. This criterion is referred to here as phase-based learning.
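 A minimal sketch of the phase-based (quadrant) criterion follows; representing phases as radians in [0, 2π) and the function names are assumptions for illustration.

```python
import math

def quadrant(phase: float) -> int:
    """Map a phase in radians to one of four sections (quadrants) 0..3."""
    return int((phase % (2 * math.pi)) // (math.pi / 2))

def is_positive_pair(phase_a: float, phase_b: float) -> bool:
    """Phase-based criterion: Positive if both phases fall in the same quadrant."""
    return quadrant(phase_a) == quadrant(phase_b)
```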
 FIG. 19 is a diagram illustrating an example of phase-difference-based learning performed by the learning unit 114 according to the first and second embodiments.
 Under this criterion, a pair is judged Positive when the phase difference between the two samples is less than π/2 and Negative when it is π/2 or more. This criterion is referred to here as phase-difference-based learning.
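 The phase-difference criterion can be sketched in the same way; the wrap-around handling of the phase difference is an assumption for illustration.

```python
import math

def is_positive_pair_by_diff(phase_a: float, phase_b: float) -> bool:
    diff = abs(phase_a - phase_b) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)      # shortest angular distance
    return diff < math.pi / 2                 # Positive if less than pi/2, else Negative
```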
 An example of the time-series search performed by the delay estimation unit 111 will be described.
 FIG. 20 is a diagram illustrating an example of the time-series search performed by the delay estimation unit 111 according to the first and second embodiments.
 The delay estimation unit 111 compares, in time series, the latent-space feature vectors obtained as video features by the processing of the video feature extraction unit 110. Here, the delay estimation unit 111 cuts out feature vectors in time series for a person X and a person Y; for the person X, the feature vectors are cut out starting from the time offset by the delay time t_0. The delay estimation unit 111 compares the feature vector F_X(t + t_0) of the person X with the feature vector F_Y(t) of the person Y while shifting the delay time t_0, and finds the delay time t_0 that minimizes the distance D(F_X, F_Y, t_0). The delay correction unit 112 adjusts the playback time of the playback video X of the person X by inserting the delay time t_0 obtained by the delay estimation unit 111 into the playback time, and can thereby produce video in which the playback video X of the person X and the playback video Y of the person Y appear synchronized. In the distance-based search, the delay estimation unit 111 can use the Euclidean distance as the distance function D.
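 A NumPy sketch of this distance-based time-series search follows; the array layout (time × feature dimension), the window handling, and all names are assumptions for illustration only.

```python
import numpy as np

def estimate_delay(feat_x, feat_y, max_delay, window):
    """feat_x, feat_y: (T, G) feature sequences. Returns the delay t0 (in frames)
    minimizing the Euclidean distance between F_X(t + t0) and F_Y(t) over a window."""
    best_t0, best_dist = 0, float("inf")
    for t0 in range(max_delay + 1):
        dist = np.linalg.norm(feat_x[t0:t0 + window] - feat_y[:window])
        if dist < best_dist:
            best_t0, best_dist = t0, dist
    return best_t0

# Toy check: X contains Y's feature pattern shifted by 7 frames.
rng = np.random.default_rng(0)
feat_y = rng.normal(size=(200, 16))
feat_x = np.vstack([rng.normal(size=(7, 16)), feat_y])
assert estimate_delay(feat_x, feat_y, max_delay=20, window=100) == 7
```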
 [Other Embodiments]
 The video synchronization device may be realized by a single device, as explained in the examples above, or by a plurality of devices among which the functions are distributed.
 The program may be transferred while stored in an electronic device, or may be transferred without being stored in an electronic device. In the latter case, the program may be transferred via a network or in a state recorded on a recording medium. The recording medium is a non-transitory tangible medium and a computer-readable medium. The recording medium may be of any form as long as it can store the program and can be read by a computer, such as a CD-ROM or a memory card.
 Although embodiments of the present invention have been described in detail above, the foregoing description is merely illustrative of the present invention in all respects. Needless to say, various improvements and modifications can be made without departing from the scope of the invention. In other words, in carrying out the present invention, specific configurations according to the embodiments may be adopted as appropriate.
 In short, the present invention is not limited to the above embodiments as they are, and at the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiments, and constituent elements from different embodiments may be combined as appropriate.
 The above embodiments may be applied not only to an electronic device but also to a method executed by the electronic device, and to a program that causes a computer to execute the processing of each unit of the electronic device.
 1   Server
 2 to 2n   Audience terminal
 11   Control unit
 12   Program storage unit
 13   Data storage unit
 14   Communication interface
 15   Input/output interface
 101   Audio output device
 102   Video output device
 110   Video feature extraction unit
 111   Delay estimation unit
 112   Delay correction unit
 113   Sound feature extraction unit
 114   Learning unit
 S   Video synchronization system

Claims (8)

  1.  A video synchronization device comprising:
     a video feature extraction unit that extracts video features from a plurality of input videos;
     a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos; and
     a delay correction unit that uses the delay time to correct the delays of the plurality of input videos and synchronize the plurality of input videos.
  2.  The video synchronization device according to claim 1, further comprising a sound feature extraction unit that extracts a sound feature from an input sound,
     wherein the delay estimation unit estimates relative delay times between the plurality of input videos and the input sound from the sound feature.
  3.  The video synchronization device according to claim 1, wherein
     the video feature extraction unit performs feature extraction at two kinds of intervals or two kinds of densities, and
     the delay estimation unit estimates a relative delay time by pairing the two kinds of intervals or the two kinds of densities.
  4.  The video synchronization device according to claim 2, wherein
     the sound feature extraction unit performs feature extraction at two kinds of intervals or two kinds of densities, and
     the delay estimation unit estimates a relative delay time by pairing the two kinds of intervals or the two kinds of densities.
  5.  The video synchronization device according to claim 1, wherein the delay estimation unit estimates at least one of a relative delay time or a relative speed using a distance or voting.
  6.  The video synchronization device according to claim 1, further comprising a learning unit that executes learning based on a phase of a training video or learning based on a phase difference,
     wherein the video feature extraction unit extracts the video features based on feature extraction by the learning.
  7.  A video synchronization method comprising:
     a video feature extraction step of extracting video features from a plurality of input videos;
     a delay estimation step of estimating a relative delay time from the video features extracted from the plurality of input videos; and
     a delay correction step of using the delay time to correct the delays of the plurality of input videos and synchronize the plurality of input videos.
  8.  A video synchronization program that causes a computer to execute processing by each unit included in the video synchronization device according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033307 WO2024052964A1 (en) 2022-09-05 2022-09-05 Video synchronization device, video synchronization method, and video synchronization program

Publications (1)

Publication Number Publication Date
WO2024052964A1 2024-03-14

Family

ID=90192382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/033307 WO2024052964A1 (en) 2022-09-05 2022-09-05 Video synchronization device, video synchronization method, and video synchronization program

Country Status (1)

Country Link
WO (1) WO2024052964A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008199557A (en) * 2007-02-16 2008-08-28 Nec Corp Stream synchronization reproducing system, stream synchronization reproducing apparatus, synchronous reproduction method, and program for synchronous reproduction
JP2015097319A (en) * 2013-11-15 2015-05-21 キヤノン株式会社 Synchronization system
WO2019022256A1 (en) * 2017-07-28 2019-01-31 国立研究開発法人産業技術総合研究所 Music linking control platform and method for controlling same
JP2019192178A (en) * 2018-04-27 2019-10-31 株式会社コロプラ Program, information processing device, and method
JP2020004388A (en) * 2019-04-11 2020-01-09 株式会社コロプラ System, program, method, and information processing device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22958045

Country of ref document: EP

Kind code of ref document: A1