WO2024052964A1 - Video synchronization device, video synchronization method, and video synchronization program - Google Patents

Video synchronization device, video synchronization method, and video synchronization program

Info

Publication number
WO2024052964A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
delay
feature extraction
input
unit
Prior art date
Application number
PCT/JP2022/033307
Other languages
French (fr)
Japanese (ja)
Inventor
隆行 黒住
優花 芹澤
馨亮 長谷川
真二 深津
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2022/033307 priority Critical patent/WO2024052964A1/en
Publication of WO2024052964A1 publication Critical patent/WO2024052964A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242Synchronization processes, e.g. processing of PCR [Program Clock References]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • One aspect of the present invention relates to a video synchronization device, a video synchronization method, and a video synchronization program.
  • In recent years, video/audio playback equipment has come into use that digitizes video and audio shot and recorded at a certain point, transmits them in real time to a remote location via communication lines such as IP (Internet Protocol) networks, and plays back the video and audio at the remote location.
  • For example, online live streaming and public viewing, which transmit in real time the video and audio of live music events held at music venues or of sports matches held at competition venues to remote locations, have become popular.
  • Such video/audio transmission is not limited to one-to-one one-way transmission.
  • Two-way transmission is also being carried out, in which video and audio are transmitted from the venue where the live music event is being held (hereinafter referred to as the event venue) to multiple remote locations; at each of those remote locations, video and audio such as the cheers of the audience enjoying the live performance are in turn shot and recorded, transmitted to the event venue and to the other remote locations, and output from large video display devices and speakers at each site.
  • In such two-way video/audio transmission, when a customer enjoying the video of a live music event at a remote location connects to the event venue and waves a penlight, claps, or dances along with the music and with the other audience members at the event venue or at other remote locations, it is difficult to present that video in sync with the performers and audience at the event venue and with the audience at the other remote locations.
  • The transmission between a remote location and the event venue involves delay time caused by various factors such as communication time, video processing time, and the reaction time of the audience at the remote location. Therefore, it is difficult to synchronize, in real time, videos that include the movements of spectators at remote locations.
  • Non-Patent Document 1 describes a method of synchronizing videos based on a synchronization signal embedded in a video signal.
  • However, the method of Non-Patent Document 1 synchronizes the videos being viewed, and it is difficult to synchronize videos based on the actions within the videos.
  • This invention has been made in view of the above circumstances, and its purpose is to provide a technology that can synchronize videos based on the motion contained in the videos.
  • In one embodiment of this invention, the video synchronization device includes a video feature extraction unit that extracts video features from a plurality of input videos, a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos, and a delay correction unit that synchronizes the plurality of input videos by correcting the delay times of the plurality of input videos using the delay time.
  • videos can be synchronized based on motion included in the videos.
  • FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
  • FIG. 2 is a block diagram illustrating an example of the software configuration of a server that constitutes the video synchronization system according to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of an image of an audience at a remote location according to the first embodiment.
  • FIG. 4 is a diagram showing an example of a video at an event venue according to the first embodiment.
  • FIG. 5 is a conceptual diagram showing video feature extraction by the server according to the first embodiment.
  • FIG. 6 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the first embodiment.
  • FIG. 7 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 8 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 9 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 10 is a diagram illustrating a server delay estimation method according to the first embodiment.
  • FIG. 11 is a conceptual diagram showing video synchronization processing of the server according to the first embodiment.
  • FIG. 12 is a block diagram illustrating an example of the software configuration of a server configuring the video synchronization system according to the second embodiment.
  • FIG. 13 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the second embodiment.
  • FIG. 14 is a diagram illustrating an example of a method of photographing a video at an event venue according to the first and second embodiments.
  • FIG. 15 is a diagram illustrating an application example of the video synchronization system according to the first and second embodiments.
  • FIG. 16 is a diagram illustrating an example of server processing according to the first and second embodiments.
  • FIG. 17 is a diagram illustrating an example of a DNN structure implemented in the video feature extraction unit according to the first and second embodiments.
  • FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit according to the first and second embodiments.
  • FIG. 19 is a diagram illustrating an example of learning based on a phase difference performed by the learning unit according to the first and second embodiments.
  • FIG. 20 is a diagram illustrating an example of a time series search by the delay estimator according to the first and second embodiments.
  • In the following, it is assumed that, for a live music event (hereinafter also referred to as an event) held at a venue such as a live music hall, the input videos of multiple audience members watching the live performance from remote locations (hereinafter referred to as remote audience) are synchronized based on the characteristics of the movements in the videos.
  • FIG. 3 shows images of multiple remote spectators.
  • FIG. 3 shows a situation in which multiple remote spectators are excited using penlights.
  • In FIG. 3, the videos of a plurality of remote audience members are aggregated in a 5 × 5 matrix; each individual video is cut out from such an aggregated video and used.
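  • As an illustration of how individual videos might be cut out of such an aggregated frame, the following is a minimal Python sketch (not part of the publication); the tile counts and the frame layout are assumptions for illustration only.

```python
import numpy as np

def split_grid(frame, rows=5, cols=5):
    """Split an aggregated frame of shape (H, W, 3) into rows x cols
    equally sized audience tiles, returned in row-major order."""
    h, w = frame.shape[0] // rows, frame.shape[1] // cols
    return [frame[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

# e.g. tiles = split_grid(aggregated_frame)  # 25 individual remote-audience tiles per frame
```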
  • an image of a crowd at an event venue as shown in FIG. 4 may be used as the input image.
  • FIG. 4 shows a crowd at an event venue being excited using penlights.
  • a part of the video of the crowd at the event venue may be cut out and used as the input video, or the entire video may be used as the input video.
  • The input video is assumed to be a video of audience members holding a distinctive item such as a penlight whose movement is highly visible, but it may also be a video of audience members clapping or dancing without holding anything.
  • the first embodiment is an embodiment in which a plurality of videos are synchronized by using characteristics of videos of a remote audience.
  • FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
  • the video synchronization system S includes a server 1, an audio output device 101, a video output device 102, and a plurality of audience terminals 2 to 2n.
  • the server 1, the audio output device 101, the video output device 102, and the plurality of audience terminals 2 to 2n can communicate with each other via an IP network.
  • the server 1 is an electronic device that collects data and processes the collected data.
  • Electronic devices include computers.
  • the audio output device 101 is a device that includes a speaker that reproduces and outputs audio.
  • the audio output device 101 is, for example, a device that outputs audio at an event venue.
  • the video output device 102 is a device that includes a display that plays and displays video.
  • the display is a liquid crystal display.
  • the video output device 102 is, for example, a device that plays and displays video at an event venue.
  • Each of the spectator terminals 2 to 2n is a terminal used by each of a plurality of remote spectators.
  • Each of the spectator terminals 2 to 2n is an electronic device having an input function, a display function, and a communication function.
  • each of the audience terminals 2 to 2n is a tablet terminal, a smartphone, a PC (Personal Computer), or the like, but is not limited to these.
  • the spectator terminal 2 is an example of a terminal.
  • the server 1 includes a control section 11, a program storage section 12, a data storage section 13, a communication interface 14, and an input/output interface 15. Each element included in the server 1 is connected to each other via a bus.
  • the control unit 11 corresponds to the central part of the server 1.
  • the control unit 11 includes a processor such as a central processing unit (CPU).
  • the control unit 11 includes a ROM (Read Only Memory) as a nonvolatile memory area.
  • the control unit 11 includes a RAM (Random Access Memory) as a volatile memory area.
  • the processor expands the program stored in the ROM or the program storage unit 12 into the RAM.
  • the control unit 11 realizes each functional unit described below by the processor executing the program loaded in the RAM.
  • the control unit 11 constitutes a computer.
  • the program storage unit 12 is configured of a non-volatile memory that can be written to and read from at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), as a storage medium.
  • the program storage unit 12 stores programs necessary to execute various control processes.
  • the program storage unit 12 stores a program that causes the server 1 to execute processing by each functional unit implemented in the control unit 11, which will be described later.
  • the program storage unit 12 is an example of storage.
  • the data storage unit 13 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD or an SSD, as a storage medium.
  • the data storage unit 13 is an example of a storage or a storage unit.
  • the communication interface 14 includes various interfaces that communicatively connect the server 1 to other electronic devices using communication protocols defined by IP networks.
  • the input/output interface 15 is an interface that enables communication between the server 1 and each of the audio output device 101 and the video output device 102.
  • the input/output interface 15 may include a wired communication interface or a wireless communication interface.
  • the hardware configuration of the server 1 is not limited to the above-mentioned configuration.
  • the server 1 allows the above-mentioned components to be omitted and changed, and new components to be added as appropriate.
  • FIG. 2 is a block diagram showing an example of the software configuration of the server 1 that constitutes the video synchronization system according to the first embodiment.
  • the server 1 includes a video feature extraction section 110, a delay estimation section 111, a delay correction section 112, and a learning section 114.
  • Each functional unit is realized by execution of a program by the control unit 11. It can also be said that each functional unit is included in the control unit 11 or the processor. Each functional unit can be read as the control unit 11 or a processor.
  • Although three video feature extraction units 110 are illustrated in FIG. 2, the number of video feature extraction units 110 is not limited to this. In the following description, it is assumed that each of a plurality of input videos is processed by a different video feature extraction unit 110, but they may all be processed by a single video feature extraction unit 110.
  • the video feature extraction unit 110 extracts video features from the input video.
  • the input video includes, for example, multiple remote audience videos.
  • The input video includes, for example, a video obtained by cutting out an individual video from a 5 × 5 matrix video as shown in FIG. 3.
  • The input video may also include a video of a crowd at an event venue, as shown in FIG. 4.
  • Video features are features seen in the input video.
  • the video features include, for example, human movements, objects, human facial expressions, etc. included in the input video.
  • For example, the video features include human movements such as waving a penlight, lifting a towel, raising a hand, and waving a hand from side to side.
  • Video features may include objects such as penlights, towels, etc.
  • the video features may include human facial expressions such as smiling faces and crying faces.
  • Video features include features that indicate action or movement in the video.
  • The video feature extraction unit 110 performs feature extraction while shifting the input video, for example, as shown in FIG. 5. FIG. 5 is a conceptual diagram showing video feature extraction by the server 1 according to the first embodiment. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video based on the video clipping window width. The video feature extraction unit 110 determines the starting point of the video clipping window width based on the clipping interval. The video feature extraction unit 110 extracts features from the input video within a certain video clipping window width, shifts the window by the clipping interval, and then extracts features from the input video within the next video clipping window width.
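  • The clipping behaviour described above can be pictured with the following Python sketch (an illustrative assumption, not the publication's code); `extractor` stands in for whatever feature extraction method is used, and the default window width and interval are arbitrary.

```python
import numpy as np

def clip_windows(frames, window_width, interval):
    """Yield clips of `window_width` consecutive frames, with the start
    point of each clip advanced by the clipping `interval`."""
    start = 0
    while start + window_width <= len(frames):
        yield frames[start:start + window_width]
        start += interval

def extract_features(frames, extractor, window_width=16, interval=4):
    # One feature vector per shifted window of the input video.
    return [extractor(clip) for clip in clip_windows(np.asarray(frames), window_width, interval)]
```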
  • the video feature extraction unit 110 may use machine learning, for example, to extract video features.
  • the video feature extraction unit 110 may perform feature extraction using a known method described in Non-Patent Document 2 or Non-Patent Document 3.
  • the video feature extraction unit 110 may use a video feature extraction method learned in advance by associating rhythmic sounds with video. In this case, the video feature extraction unit 110 can extract video features that are more related to rhythm.
  • Non-patent document 2 Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi and Noboru Harada. Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels. Proc. Interspeech 2020, pp. 1446-1450, 2020.
  • Non-patent document 3 Masahiro Yasuda, Yasutoshi Oishi, Yuma Koizumi, Noboru Harada "Cross-modal sound search based on specific co-occurrence relationships indicated by weak labels" Proceedings of the Acoustical Society of Japan Research Conference, Autumn 2020 ROMBUNNO. 2-1-2
  • the video feature extraction unit 110 may extract video features based on feature extraction by learning performed by the learning unit 114, which will be described later.
  • the feature extraction by learning performed by the learning unit 114 is a video feature extraction method obtained by learning.
  • the video feature extraction method can also be called a trained model for extracting video features.
  • the video feature extraction unit 110 can extract a feature vector indicating an action or movement in the video as the video feature.
  • the video feature extraction unit 110 does not have to always perform feature extraction at the same feature extraction interval or density.
  • the video feature extraction unit 110 may provide at least two types of feature extraction intervals or densities. For example, the video feature extraction unit 110 may narrow the interval or increase the density of feature extraction for one of the two videos, and widen the interval or lower the density of feature extraction for the other video.
  • the delay estimation unit 111 estimates a relative delay time from video features extracted from a plurality of input videos.
  • The delay estimation unit 111 compares a plurality of video features in time series and, based on the distance between the video features or the similarity between the video features, determines which video feature at each time is closest to the video feature at which time.
  • a search method that compares a plurality of video features in chronological order and determines which video feature at each time is closest to the video feature at which time is referred to as a time-series search.
  • the delay estimation unit 111 may estimate at least one of the relative delay time and speed using voting.
  • The delay estimating unit 111 may estimate the timing deviation of the movements of the plurality of remote spectators, that is, the delay time, by determining which time's features the features at each time are closest to, and voting for that time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transformation.
  • The Hough transform is a method of drawing, on the a-b plane shown in FIG. 8, a straight line corresponding to each point (x_i, y_i) on the x-y plane shown in FIG. 7, and determining the slope and intercept of the line by voting.
  • The delay estimation unit 111 determines the intersection point (a_0, b_0) of the straight lines drawn on the a-b plane by voting on cells divided into a grid.
  • the video feature extraction unit 110 cuts out the input video at regular intervals and converts the input video into time-series feature vectors as time-series video features.
  • the delay estimation unit 111 measures the distance between the feature vectors of each person at each time, and plots the pair of closest times.
  • The delay estimation unit 111 obtains the slope a_0 and intercept b_0 of this straight line by the Hough transform.
  • The delay estimation unit 111 determines the intercept b_0 that received the largest number of votes as the estimated delay time.
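  • The voting described with reference to FIGS. 7 to 10 can be sketched as follows (a simplified Python illustration under assumed grid resolutions; the quantization actually used by the device is not specified here). Each matched time pair (x_i, y_i) votes for the cells (a, b) satisfying b = y_i - a·x_i, and the intercept b_0 of the most-voted cell is taken as the estimated delay.

```python
import numpy as np

def hough_delay(pairs, a_grid=np.linspace(0.5, 2.0, 31), b_grid=np.arange(-5.0, 5.0, 0.1)):
    """pairs: iterable of matched times (x_i, y_i) between two feature sequences.
    Returns (a0, b0): the slope (relative speed) and intercept (delay) with the most votes."""
    votes = np.zeros((len(a_grid), len(b_grid)), dtype=int)
    for x, y in pairs:
        for ai, a in enumerate(a_grid):
            b = y - a * x                            # line in (a, b) space for this point
            bi = int(np.argmin(np.abs(b_grid - b)))  # nearest grid cell along the b axis
            votes[ai, bi] += 1
    ai, bi = np.unravel_index(np.argmax(votes), votes.shape)
    return a_grid[ai], b_grid[bi]
```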
  • In the above description, the delay estimator 111 estimates the delay time between two videos; when determining the delay times of multiple videos, multiple pairs may be extracted from the set and the delay time may be determined for each pair.
  • the delay estimation unit 111 may perform matching between feature vectors using the method described in Non-Patent Document 2 or Non-Patent Document 3. In that case, the delay estimation unit 111 may estimate at least one of the relative delay time and speed using the distance.
  • the delay estimation unit 111 may use a distance measure such as Euclidean distance.
  • Here, a search method that compares multiple video features in chronological order and determines, based on a distance between the video features such as the Euclidean distance, which video feature at each time is closest to the video feature at which time is referred to as a distance-based search.
  • the delay estimation unit 111 pairs the two types of intervals or two types of densities to estimate the relative delay time.
  • For example, when estimating the delay between two videos, the delay estimation unit 111 may estimate the delay time by narrowing the feature extraction interval or increasing the feature extraction density for one video, and widening the feature extraction interval or lowering the feature extraction density for the other video.
  • the delay correction unit 112 uses the delay time to correct the delay times of the plurality of videos, and synchronizes the plurality of videos.
  • The delay correction unit 112 corrects the video playback times based on the estimated delay times of the plurality of videos. For example, the delay correction unit 112 creates a synchronized video by inserting the delay-time difference into the playback time of the video of the remote audience member with the shortest delay time so that it matches the video of the remote audience member with the longest delay time.
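  • A minimal sketch of this playback-time correction (illustrative only; the variable names are assumptions) is shown below: the least-delayed streams receive extra playback delay so that all streams line up with the most-delayed one.

```python
def playback_offsets(estimated_delays):
    """estimated_delays: dict mapping a video id to its estimated delay time (seconds).
    Returns the extra delay to insert into each video's playback time so that
    every video lines up with the most-delayed one."""
    longest = max(estimated_delays.values())
    return {vid: longest - d for vid, d in estimated_delays.items()}

# e.g. playback_offsets({'remote_A': 0.0, 'remote_B': 0.3})
#      -> {'remote_A': 0.3, 'remote_B': 0.0}  (insert 0.3 s into remote_A's playback)
```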
  • the delay correction unit 112 may perform delay correction so that all the images match the images of one of the spectators shown in FIG. 3, for example, the remote audience in the upper left.
  • Alternatively, the delay correction unit 112 may extract a small number of representative samples from a grouped set of remote audience members, treat the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of the videos of all remote spectators in the group.
  • By correcting the playback times of the videos based on the delay times, the delay correction unit 112 can create, from the videos of the multiple spectators shown in the left diagram of FIG. 11, a video in which the movements of the multiple audience members are aligned, as shown in the right diagram.
  • "reproduction” may be read as "output” or "transmission”.
  • the learning unit 114 performs learning based on the phase of the learning video or learning based on the phase difference. Learning based on phase and learning based on phase difference will be described later.
  • the learning unit 114 executes learning of learning data including a plurality of learning videos and phases associated with each of the plurality of learning videos.
  • the phase is a phase corresponding to a part of the learning video.
  • For example, part of the learning video is a penlight included in the learning video. Since the penlight is swung by a person, the phase corresponds to the position of the penlight. When the penlight is swung to the left of the person in the video, the phase may be set to 0 [rad]; when the penlight is swung to the right of the person in the video, the phase may be set to π [rad]; and when the penlight is swung in front of the person in the video, the phase may be set to π/2 [rad]. Note that the part of the learning video associated with a phase is not limited to a penlight, and may be various human movements, objects, human facial expressions, etc. in the video.
  • the learning data may be stored in the data storage unit 13 or may be stored in an electronic device different from the server 1.
  • the learning unit 114 can obtain a video feature extraction method by performing learning.
  • The processing procedure described below is only an example, and each process may be changed where possible. Further, steps may be omitted, replaced, or added as appropriate depending on the embodiment.
  • FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server 1 according to the first embodiment.
  • images from cameras of multiple remote spectators and images used for video synchronization are input, and a video in which the images of multiple remote spectators are synchronized is output.
  • Images from remote audience cameras and images used for video synchronization are examples of input images. It is assumed that the input video is a remote audience video obtained from the spectator terminals 2 to 2n.
  • the synchronized video is output via the video output device 102 at the event venue. The synchronized video may be output to the audience terminals 2 to 2n.
  • the control unit 11 obtains input images obtained from the audience terminals 2 to 2n (step S1).
  • the video feature extraction unit 110 waits for an input video (step S2).
  • the video feature extraction unit 110 extracts video features from the input video (step S3).
  • In step S3, for example, the video feature extraction unit 110 obtains an input video.
  • the video feature extraction unit 110 extracts video features while shifting the input video.
  • the video feature extraction unit 110 may extract video features using machine learning, for example.
  • the video feature extraction unit 110 may extract video features using a known method described in Non-Patent Document 2 or Non-Patent Document 3.
  • the video feature extraction unit 110 may use a video feature extraction method learned in advance by associating rhythmic sounds with video.
  • the video feature extraction unit 110 may perform feature extraction on a plurality of input videos at different intervals or densities.
  • the video feature extraction unit 110 may extract video features by providing at least two types of feature extraction intervals or densities.
  • the video feature extraction unit 110 may narrow the feature extraction interval for one video and widen the feature extraction interval for the other video.
  • the video feature extraction unit 110 may increase the density of feature extraction for one video and lower the density of feature extraction for the other video.
  • the video feature extraction unit 110 determines whether feature extraction has been performed for all input videos (step S4). If the video feature extraction unit 110 determines that feature extraction has been performed for all input videos (step S4: YES), the process transitions from step S4 to step S5. If the video feature extraction unit 110 determines that feature extraction has not been performed for all input videos (step S4: NO), the process transitions from step S4 to step S2.
  • the delay estimation unit 111 estimates a relative delay time from video features extracted from a plurality of input videos (step S5).
  • In step S5, for example, the delay estimation unit 111 collates the plurality of video features and, based on the distance between the video features or the similarity between the video features, determines which video feature at each time is closest to the video feature at which time.
  • The delay estimation unit 111 estimates the delay time between two videos based on the time of the video feature that is closest, in distance or in similarity, to the video feature of a certain video.
  • the delay estimation unit 111 may extract a plurality of pairs from a set of a plurality of input videos and estimate the delay time for each pair.
  • the delay estimating unit 111 may estimate at least one of the relative delay time and speed using voting, for example.
  • the delay estimating unit 111 determines the time difference between the timings of the movements of the plurality of remote spectators, that is, the delay time, by determining which time the feature of each time is closest to, and voting for that time.
  • the delay estimation unit 111 estimates the delay time between two videos based on the voted time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transform.
  • the delay estimation unit 111 may perform delay estimation by setting the relative speed as 1.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of feature extraction intervals or the two types of feature extraction densities to estimate the relative delay time.
  • the delay correction unit 112 waits for input video (step S6).
  • the delay correction unit 112 uses the delay time estimated by the delay estimation unit 111 to correct the delay times of the plurality of input videos, and synchronizes the plurality of input videos (step S7).
  • In step S7, for example, the delay correction unit 112 obtains an input video.
  • the delay correction unit 112 performs delay correction based on a plurality of input videos.
  • the delay correction unit 112 performs delay correction based on the time determined from a plurality of input videos.
  • the delay correction unit 112 corrects the playback times of the plurality of input videos based on the delay times estimated for the plurality of input videos.
  • For example, the delay correction unit 112 may create a synchronized video by inserting the delay-time difference into the playback time of the video of the remote audience member with the shortest delay time so that it matches the video of the remote audience member with the longest delay time.
  • the delay correction unit 112 may perform delay correction to match all other videos to a predetermined video.
  • The delay correction unit 112 may perform delay correction based on delay times calculated by grouping a plurality of videos. In this case, the delay correction unit 112 may extract a small number of samples from the grouped set, set the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of all videos in the group accordingly.
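  • One possible reading of this group-based correction, sketched in Python under the assumption that a fixed number of representative samples is drawn at random:

```python
import random

def group_delay(delays_in_group, num_samples=3):
    """Estimate a single delay time for a whole group of remote-audience videos
    as the average delay of a few representative samples."""
    samples = random.sample(list(delays_in_group), min(num_samples, len(delays_in_group)))
    return sum(samples) / len(samples)   # applied to every video in the group
```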
  • the delay correction unit 112 determines whether delay correction has been performed for all input videos (step S8). If the delay correction unit 112 determines that delay correction has been performed on all input videos (step S8: YES), the process ends. If the delay correction unit 112 determines that delay correction has not been performed on all input videos (step S8: NO), the process transitions from step S8 to step S6.
  • the control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15.
  • the video output device 102 outputs video that has undergone delay correction.
  • The control unit 11 may output the delay-corrected video to the audience terminals 2 to 2n via the IP network.
  • the audience terminals 2 to 2n output video images on which delay correction has been performed.
  • As described above, the server 1 extracts video features from a plurality of input videos, estimates a relative delay time from the video features extracted from the plurality of input videos, and synchronizes the plurality of input videos by correcting their delay times using the delay time. Therefore, the server 1 can estimate the delay times between multiple videos based on features such as motion included in the input videos, correct the playback times of the multiple videos based on the delay times, and play back a video in which the audience members' actions appear aligned. Thereby, the server 1 can synchronize the videos based on the motion included in the videos.
  • the second embodiment is an embodiment in which a sound feature is extracted from an input sound and a plurality of videos are synchronized based on the input sound.
  • FIG. 12 is a block diagram showing an example of the software configuration of the server 1 configuring the video synchronization system according to the second embodiment.
  • the server 1 includes a video feature extraction section 110, a delay estimation section 111, a delay correction section 112, a sound feature extraction section 113, and a learning section 114.
  • Each functional unit is realized by execution of a program by the control unit 11. It can also be said that each functional unit is included in the control unit 11 or the processor. Each functional unit can be read as the control unit 11 or a processor.
  • Although three video feature extraction units 110 are illustrated in FIG. 12, the number of video feature extraction units 110 is not limited to this. In the following description, it is assumed that each of a plurality of input videos is processed by a different video feature extraction unit 110, but they may all be processed by a single video feature extraction unit 110.
  • the sound feature extraction unit 113 extracts sound features from the input sound.
  • the input sound is the sound played at the venue.
  • the input sound is a sound that serves as a reference for sound characteristics.
  • the sound feature extraction unit 113 may perform sound feature extraction using the method described in Non-Patent Document 2 or 3.
  • the sound feature extraction unit 113 may perform matching of feature vectors of different modals on a common feature space.
  • the sound feature extraction unit 113 may provide at least two types of sound feature extraction intervals or densities. For example, the sound feature extraction unit 113 may narrow the interval or increase the density of feature extraction for one of the two videos, and widen the interval or lower the density of feature extraction for the other video. Note that the sound feature extraction unit 113 may extract sound features from the input video.
  • the delay estimation unit 111 estimates the delay time from the sound of the video by comparing the video features and sound features of a plurality of videos.
  • the delay estimating unit 111 estimates relative delay times between a plurality of input videos and sounds from sound features.
  • the delay estimation unit 111 adjusts the playback times of the plurality of videos according to the input sound.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of intervals or the two types of densities to estimate the relative delay time. For example, when estimating the delay between two videos, the delay estimating unit 111 may estimate the delay time by narrowing the interval or increasing the density of feature extraction for one sound, and widening the interval or lowering the density of feature extraction for the other sound.
  • the delay correction unit 112 uses the delay time to correct the delay times of the plurality of videos, and synchronizes the plurality of videos.
  • The delay correction unit 112 corrects the delay times of the plurality of videos based on the sound.
  • the delay correction unit 112 adjusts the playback times of the plurality of videos in accordance with the sound.
  • FIG. 13 is a flowchart illustrating an example of a video synchronization procedure and processing contents of the server according to the second embodiment.
  • images from cameras of multiple remote audience members and input sounds are input, and a video in which the images of multiple remote audience members are synchronized is output.
  • the image from the remote spectator's camera is an example of the input image.
  • the input video is a remote audience video obtained from the spectator terminals 2 to 2n.
  • the synchronized video is output via the video output device 102 at the event venue.
  • the synchronized video may be output to the audience terminals 2 to 2n.
  • the input sound is, for example, sound obtained from the audio output device 101.
  • the control unit 11 acquires input sound (step S101).
  • the input sound is, for example, reproduced sound played at an event venue.
  • the sound feature extraction unit 113 waits for input sound (step S102).
  • the sound feature extraction unit 113 extracts sound features from the input sound (step S103).
  • the sound feature extraction unit 113 extracts sound features using a known method.
  • the sound feature extraction unit 113 may perform feature extraction on the input sound at different intervals or densities.
  • the sound feature extraction unit 113 may extract sound features by providing at least two types of feature extraction intervals or densities. Types of spacing or density include wide spacing, narrow spacing, high density, low density, and the like.
  • the control unit 11 obtains input images obtained from the audience terminals 2 to 2n (step S104).
  • the video feature extraction unit 110 waits for an input video (step S105).
  • the video feature extraction unit 110 extracts video features from the input video similarly to step S3 (step S106).
  • the video feature extraction unit 110 determines whether feature extraction has been performed for all input videos (step S107). If the video feature extraction unit 110 determines that feature extraction has been performed for all input videos (step S107: YES), the process transitions from step S107 to step S108. If the video feature extraction unit 110 determines that feature extraction has not been performed for all input videos (step S107: NO), the process transitions from step S107 to step S105.
  • the delay estimation unit 111 estimates the relative delay times of the plurality of input videos and sounds from the sound features (step S108).
  • In step S108, for example, the delay estimation unit 111 compares the video features of the input videos with the sound features.
  • The delay estimation unit 111 estimates the delay times of the plurality of videos relative to the sound based on the result of the comparison. For example, the delay estimating unit 111 collates the plurality of video features with the sound features and, based on the distance between a video feature and a sound feature or the similarity between them, determines which sound feature at which time each video feature at each time is closest to.
  • The delay estimating unit 111 estimates the delay time of a video relative to the sound based on the time of the sound feature that is closest, in distance or in similarity, to the video feature of that video.
  • the delay estimating unit 111 may estimate at least one of the relative delay time and speed using voting, for example.
  • the delay estimating unit 111 determines the timing deviation of the movements of the plurality of remote spectators, that is, the delay time, by determining which time the video feature at each time is closest to the sound feature at which time, and voting for that time.
  • the delay estimation unit 111 estimates the delay time from the sound of the video based on the voted time.
  • the delay estimation unit 111 may perform estimation by voting using Hough transform.
  • the delay estimation unit 111 may perform delay estimation by setting the relative speed as 1.
  • When performing feature extraction with two types of intervals or two types of densities, the delay estimation unit 111 may pair the two types of feature extraction intervals or the two types of feature extraction densities to estimate the relative delay time.
  • the delay correction unit 112 waits for input video (step S109).
  • the delay correction unit 112 uses the delay time estimated by the delay estimation unit 111 to correct the delay time of the plurality of input videos, and synchronizes the plurality of input videos (step S110).
  • In step S110, for example, the delay correction unit 112 obtains an input video.
  • the delay correction unit 112 performs delay correction based on the input sound.
  • the delay correction unit 112 performs delay correction based on the delay time from the sound of a plurality of input videos.
  • the delay correction unit 112 corrects the playback times of a plurality of input videos based on the delay time.
  • the delay correction unit 112 determines whether delay correction has been performed for all input videos (step S111). If the delay correction unit 112 determines that delay correction has been performed on all input videos (step S111: YES), the process ends. If the delay correction unit 112 determines that delay correction has not been performed on all input videos (step S111: NO), the process transitions from step S111 to step S109.
  • the control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15.
  • the video output device 102 outputs video that has undergone delay correction.
  • The control unit 11 may output the delay-corrected video to the audience terminals 2 to 2n via the IP network.
  • the audience terminals 2 to 2n output video images on which delay correction has been performed.
  • As described above, the server 1 extracts video features from a plurality of input videos, estimates a relative delay time from the video features extracted from the plurality of input videos, and synchronizes the plurality of input videos by correcting their delay times using the delay time. Furthermore, the server 1 can extract sound features from the input sound and estimate the relative delay times between the plurality of videos and the sound from the sound features. Therefore, the server 1 can estimate the delay times of the multiple videos relative to the input sound based on features such as motion included in the input videos and the sound features of the input sound, and correct the playback times of the multiple videos based on the delay times. Thereby, the server 1 can adjust the playback times of the videos in accordance with the input sound, and can synchronize the videos with the sound.
  • FIG. 14 is a diagram illustrating an example of a method of photographing a video at an event venue according to the first and second embodiments.
  • a camera installed inside the event venue photographs the crowd inside the venue.
  • an image of the crowd as shown in FIG. 4 is captured by a camera in the venue.
  • cameras in the venue are installed on the stage side of the venue, and are installed to take pictures of the audience seats.
  • the number of cameras in the venue is not limited to one, and a plurality of cameras may be installed.
  • the crowd image may be an image selected from images captured by at least one camera.
  • FIG. 15 is a diagram illustrating an application example of the video synchronization system S according to the first and second embodiments.
  • FIG. 15 shows an example of a system where remote audience members participate in a live streaming event. Remote spectators participate in the event using a PC that allows them to watch the live broadcast while filming themselves with a camera.
  • the roles of the devices and functions in (1) to (5) in FIG. 15 are as follows.
  • (1) Remote spectators will continue to transmit camera images of their faces and upper bodies during the event.
  • (1) is a function performed by a plurality of audience terminals 2 to 2n.
  • (2) Images from remote audience members are arranged horizontally, vertically, or in a grid, aggregated, and processed into an easy-to-use format.
  • (2) is a function provided by the server 1.
  • (3) Display the video created in (2) at the event venue.
  • (3) is a function provided by the server 1.
  • (4) Deliver the captured video of the event venue in which the remote audience's video is reflected to the remote audience's PC as a live distribution video.
  • (4) is a function provided by the server 1.
  • (5) The live streaming video is displayed on the remote audience member's PC screen.
  • (5) is a function performed by a plurality of audience terminals 2 to 2n.
  • Using the video of the remote audience in this way not only allows the audience and performers at the venue to see how excited the remote audience is, but can also be used for interaction between the performers and audience at the venue and the remote audience, which is thought to increase the utility value of interactive video distribution. Furthermore, at live music events and sports events, it is common for spectators to enjoy the event by using penlights and cheering goods to synchronize their movements with one another. However, it is difficult to match the remote audience's videos with the on-site experience because of the delay that occurs between the live venue and the remote environment.
  • The first and second embodiments adjust the delay time caused by various factors such as communication time, video processing time, and the reaction time of the remote audience, in order to create harmonious, synchronized videos from the camera images of multiple remote audience members viewing a live-streamed event, without sacrificing the sense of rhythm, and to use them in venue production.
  • The movements of a plurality of audience members, such as clapping hands or waving a penlight, are aligned in time to produce a harmonious video.
  • FIG. 16 is a diagram illustrating a processing example of the server 1 according to the first and second embodiments.
  • In order to synchronize the videos so that the movements of the audience are aligned, the server 1 needs to focus on and align the objects that serve as the reference for synchronizing people, that is, the partial movements in the videos of people clapping their hands or waving their penlights.
  • the server 1 uses a 3-Dimensional Convolutional Neural Network (3D-CNN) to extract spatial and temporal features, and performs a search based on the obtained features.
  • the video feature extraction unit 110 extracts video features X from the input video X based on feature extraction by phase-based learning or phase difference-based learning.
  • the video feature extraction unit 110 extracts the video feature Y from the input video Y based on feature extraction by phase-based learning or phase difference-based learning.
  • the delay estimation unit 111 estimates a relative delay time from the video feature X and the video feature Y.
  • the delay correction unit 112 synchronizes the reproduced video X based on the input video X and the reproduced video Y based on the input video Y using the estimated delay time.
  • For example, the delay correction unit 112 corrects the playback time of the playback video X by inserting into it the delay time of the playback video Y estimated relative to the playback video X, and creates a video in which the playback video X and the playback video Y appear synchronized.
  • FIG. 17 is a diagram illustrating an example of a Deep Neural Network (DNN) structure implemented in the video feature extraction unit 110 according to the first and second embodiments.
  • the video feature extraction unit 110 uses the DNN structure of the 3D-CNN in order to extract video features based on phase-based learning or phase difference-based learning.
  • The DNN structure of the 3D-CNN is ResNet18-3D (R3D-18) described in Non-Patent Document 5, trained using Kinetics-400 described in Non-Patent Document 4, which is used for human action classification tasks; the top three layers are used.
  • the image features extracted from R3D-18 are encoded into a G-dimensional latent space through a fully connected layer and a pooling layer.
  • The video encoder f encodes an input video x, a clip of height H, width W, and P frames, into the latent space as a G-dimensional feature vector f(x).
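  • One way such an encoder could be assembled, sketched with PyTorch/torchvision under the assumption that the pretrained R3D-18 backbone is reused up to its pooled 512-dimensional feature and followed by a fully connected projection into the G-dimensional latent space (the exact layers retained by the publication may differ):

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoEncoder(nn.Module):
    def __init__(self, latent_dim=128):                 # latent_dim plays the role of G
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")      # Kinetics-400 pretrained R3D-18 (torchvision >= 0.13)
        backbone.fc = nn.Identity()                      # drop the 400-class action head
        self.backbone = backbone                         # yields a pooled 512-d feature per clip
        self.proj = nn.Linear(512, latent_dim)           # fully connected layer into the latent space

    def forward(self, clip):                             # clip: (batch, 3, P, H, W)
        return self.proj(self.backbone(clip))            # (batch, G)
```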
  • The loss function used for DNN training is the triplet loss L = max(d_p - d_n + m, 0), where m is a positive constant called the margin parameter. In learning based on the triplet loss, learning is performed using a set consisting of an Anchor x_a serving as the reference, a Positive x_p in the same category as the Anchor, and a Negative x_n in a different category from the Anchor.
  • Each input is output as a vector in the embedding space by CNN (transform f).
  • The distance d_p between the Anchor and the Positive and the distance d_n between the Anchor and the Negative are measured using a distance function d. Note that the Euclidean distance is used as the distance function d.
  • a triplet is constructed by selecting positive samples and negative samples only in the mini-batch using the method described in Non-Patent Document 6.
  • negative samples are selected according to the following semi-hard negative condition (formula (1)).
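  • Formula (1) is not reproduced in this text; a commonly used semi-hard condition is d_p < d_n < d_p + m, and the following NumPy sketch (an illustrative assumption, with Euclidean distance as stated above) shows the triplet loss together with the selection of such a negative within a mini-batch.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """max(d_p - d_n + m, 0) with Euclidean distance d."""
    d_p = np.linalg.norm(f_a - f_p)
    d_n = np.linalg.norm(f_a - f_n)
    return max(d_p - d_n + margin, 0.0)

def semi_hard_negative(f_a, f_p, batch_negatives, margin=0.2):
    """Return a negative from the mini-batch satisfying d_p < d_n < d_p + margin, if any."""
    d_p = np.linalg.norm(f_a - f_p)
    for f_n in batch_negatives:
        d_n = np.linalg.norm(f_a - f_n)
        if d_p < d_n < d_p + margin:
            return f_n
    return None
```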
  • Non-patent document 4 W. Kay et al. “The Kinetics Human Action Video Dataset”. Computing Research Repository, abs/1705.06950, 2017.
  • Non-patent document 5 D. Tran et al. “A Closer Look at Spatiotemporal Convolutions for Action Recognition”. Proc. of IEEE International Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Non-patent document 6 F. Faghri et al. “VSE++: Improving visual-semantic embeddings with hard negatives”. Proc. of the British Machine Vision Conf. (BMVC), 2018
  • FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit 114 according to the first and second embodiments.
  • The upper half of FIG. 18 shows the left-and-right swing of the penlight viewed from the front, and the lower half shows the corresponding phase expressed in radians.
  • One back-and-forth swing of the penlight from side to side is treated as one cycle, and each position is associated with a phase.
  • As long as one period can be defined for a periodic movement such as dancing, hand movements, or object movements, each position can similarly be associated with a phase.
  • In learning, it is necessary to determine whether each pair is Positive or Negative. Here, the phase range is divided into four divisions (corresponding to quadrants) in advance; a pair is determined to be Positive if both fall in the same division and Negative if they fall in different divisions. This criterion is referred to here as phase-based learning.
  • FIG. 19 is a diagram illustrating an example of learning based on phase differences performed by the learning unit 114 according to the first and second embodiments. This is a criterion in which a pair is Positive when the phase difference between the two is less than π/2, and Negative when it is π/2 or more. This is referred to here as learning based on phase differences.
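  • The two labelling rules can be expressed compactly as follows (a Python sketch; the division boundaries are assumptions consistent with the quadrant and π/2 descriptions above).

```python
import math

def positive_by_phase(phase_a, phase_b):
    """Phase-based rule: Positive when both phases fall in the same of the
    four divisions (quadrants) of [0, 2*pi)."""
    quadrant = lambda p: int((p % (2 * math.pi)) // (math.pi / 2))
    return quadrant(phase_a) == quadrant(phase_b)

def positive_by_phase_difference(phase_a, phase_b):
    """Phase-difference rule: Positive when the wrapped difference is below pi/2."""
    diff = abs(phase_a - phase_b) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)
    return diff < math.pi / 2
```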
  • FIG. 20 is a diagram illustrating an example of a time-series search by the delay estimation unit 111 according to the first and second embodiments.
  • the delay estimating unit 111 compares the feature vectors of the latent space obtained as video features through the processing of the video feature extracting unit 110 in time series.
  • the delay estimation unit 111 extracts feature vectors for person X and person Y in time series.
  • The delay estimating unit 111 extracts the feature vectors in time series, starting from a time offset by the delay time t_0.
  • The delay estimating unit 111 compares the feature vector F_x(t + t_0) of person X with the feature vector F_Y(t) of person Y while shifting the delay time t_0.
  • The delay estimation unit 111 determines the delay time t_0 at which the distance D(F_x, F_Y, t_0) is smallest.
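  • The time-series search can be sketched as follows (illustrative only; a non-negative integer delay in frames, equal feature dimensions, and a mean-distance definition of D are assumed).

```python
import numpy as np

def estimate_delay(feat_x, feat_y, max_shift):
    """feat_x, feat_y: arrays of shape (T, G), time series of latent feature vectors.
    Returns the shift t0 minimising D(F_x, F_Y, t0) = mean_t ||F_x(t + t0) - F_Y(t)||."""
    best_t0, best_d = 0, float("inf")
    for t0 in range(max_shift + 1):
        n = min(len(feat_x) - t0, len(feat_y))
        if n <= 0:
            break
        d = np.mean(np.linalg.norm(feat_x[t0:t0 + n] - feat_y[:n], axis=1))
        if d < best_d:
            best_t0, best_d = t0, d
    return best_t0
```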
  • The delay correction unit 112 adjusts the playback time of the playback video X of person X by inserting the delay time t_0 obtained by the processing of the delay estimation unit 111 into the playback time.
  • the delay correction unit 112 can create an image in which the reproduced image X of the person X and the reproduced image Y of the person Y appear to be synchronized. Note that in the distance-based search, the delay estimation unit 111 can use Euclidean distance as the distance function D.
  • the video synchronization device may be realized by one device as explained in the above example, or may be realized by a plurality of devices with distributed functions.
  • the program may be transferred while being stored in the electronic device, or may be transferred without being stored in the electronic device. In the latter case, the program may be transferred via a network or may be transferred while being recorded on a recording medium.
  • the recording medium is a non-transitory tangible medium.
  • the recording medium is a computer readable medium.
  • the recording medium may be any medium capable of storing a program and readable by a computer, such as a CD-ROM or a memory card, and its form is not limited.
  • the present invention is not limited to the above-described embodiments as they are, but can be embodied by modifying the constituent elements at the implementation stage without departing from the spirit of the invention.
  • various inventions can be formed by appropriately combining the plurality of components disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiments. Furthermore, components from different embodiments may be combined as appropriate.
  • the embodiments described above may be applied not only to electronic devices but also to methods performed by electronic devices.
  • the above-described embodiments may be applied to a program that allows a computer to execute the processing of each part of an electronic device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A video synchronization device according to one embodiment is equipped with: a video feature extraction unit for extracting a video feature from a plurality of input videos; a delay estimation unit for estimating the relative delay interval from the video features extracted from the plurality of input videos; and a delay correction unit for synchronizing the plurality of input images by correcting the delay interval of the plurality of input videos by using said delay interval.

Description

Video synchronization device, video synchronization method, and video synchronization program
 One aspect of the present invention relates to a video synchronization device, a video synchronization method, and a video synchronization program.
 In recent years, video/audio playback equipment has come into use that digitizes video and audio shot and recorded at a certain point, transmits them in real time to a remote location via communication lines such as IP (Internet Protocol) networks, and plays back the video and audio at the remote location. For example, online live streaming and public viewing, which transmit in real time the video and audio of live music events held at music venues or of sports matches held at competition venues to remote locations, have become popular. Such video/audio transmission is not limited to one-to-one one-way transmission. Two-way transmission is also being carried out, in which video and audio are transmitted from the venue where the live music event is being held (hereinafter referred to as the event venue) to multiple remote locations; at each of those remote locations, video and audio such as the cheers of the audience enjoying the live performance are in turn shot and recorded, transmitted to the event venue and to the other remote locations, and output from large video display devices and speakers at each site.
 In such two-way video/audio transmission, when a customer enjoying the video of a live music event at a remote location connects to the event venue and waves a penlight, claps, or dances along with the music and with the other audience members at the event venue or at other remote locations, it is difficult to present that video in sync with the performers and audience at the event venue and with the audience at the other remote locations. The transmission between a remote location and the event venue involves delay time caused by various factors such as communication time, video processing time, and the reaction time of the audience at the remote location. Therefore, it is difficult to synchronize, in real time, videos that include the movements of spectators at remote locations.
 It is therefore conceivable to estimate the delay times of the videos of spectators at multiple remote locations and to correct the playback times of the videos based on the estimated delay times, thereby producing a video in which the spectators' movements appear aligned. Non-Patent Document 1 describes a method of synchronizing videos based on a synchronization signal embedded in a video signal.
 However, the method of Non-Patent Document 1 synchronizes the videos being viewed, and it is difficult to synchronize videos based on the actions within the videos.
 This invention has been made in view of the above circumstances, and its purpose is to provide a technology that can synchronize videos based on the motion contained in the videos.
 In one embodiment of this invention, the video synchronization device includes a video feature extraction unit that extracts video features from a plurality of input videos, a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos, and a delay correction unit that synchronizes the plurality of input videos by correcting the delay times of the plurality of input videos using the delay time.
 According to one aspect of this invention, videos can be synchronized based on motion included in the videos.
FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
FIG. 2 is a block diagram showing an example of the software configuration of a server constituting the video synchronization system according to the first embodiment.
FIG. 3 is a diagram showing an example of video of spectators at remote locations according to the first embodiment.
FIG. 4 is a diagram showing an example of video at an event venue according to the first embodiment.
FIG. 5 is a conceptual diagram showing video feature extraction by the server according to the first embodiment.
FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the first embodiment.
FIG. 7 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 8 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 9 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 10 is a diagram explaining the delay estimation method of the server according to the first embodiment.
FIG. 11 is a conceptual diagram showing video synchronization processing of the server according to the first embodiment.
FIG. 12 is a block diagram showing an example of the software configuration of a server constituting the video synchronization system according to the second embodiment.
FIG. 13 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the second embodiment.
FIG. 14 is a diagram showing an example of a method of capturing video at an event venue according to the first and second embodiments.
FIG. 15 is a diagram explaining an application example of the video synchronization system according to the first and second embodiments.
FIG. 16 is a diagram explaining a processing example of the server according to the first and second embodiments.
FIG. 17 is a diagram explaining an example of a DNN structure implemented in the video feature extraction unit according to the first and second embodiments.
FIG. 18 is a diagram explaining an example of phase-based learning performed by the learning unit according to the first and second embodiments.
FIG. 19 is a diagram explaining an example of phase-difference-based learning performed by the learning unit according to the first and second embodiments.
FIG. 20 is a diagram explaining an example of a time-series search performed by the delay estimation unit according to the first and second embodiments.
Hereinafter, some embodiments of the present invention will be described with reference to the drawings.

Consider a live music performance (hereinafter also referred to as an event) at a music venue or the like, where the playback times of the input videos of multiple spectators watching the performance from remote locations (hereinafter, remote spectators) are to be synchronized based on the characteristics of the movements in those videos.
As the input videos, multiple videos of remote spectators as shown in FIG. 3 are used. FIG. 3 shows videos of multiple remote spectators getting excited while using penlights. In FIG. 3, the videos of the remote spectators are aggregated into a 5×5 matrix; the individual videos are cut out from such an aggregated video and used. Note that video of a crowd at an event venue as shown in FIG. 4 may also be used as an input video. FIG. 4 shows a crowd at an event venue getting excited while using penlights. In this case, either a part of the crowd video or the whole of it may be used as an input video. The input videos are assumed to show spectators holding a distinctive item whose movement is easy to see, such as a penlight, but they may also show spectators clapping or dancing without holding anything.
[First Embodiment]
The first embodiment synchronizes a plurality of videos by using the features of the videos of remote spectators.
(Configuration example)
FIG. 1 is a block diagram showing an example of the hardware configuration of each electronic device included in the video synchronization system according to the first embodiment.
The video synchronization system S includes a server 1, an audio output device 101, a video output device 102, and a plurality of spectator terminals 2 to 2n. The server 1, the audio output device 101, the video output device 102, and the spectator terminals 2 to 2n can communicate with one another via an IP network.
The server 1 is an electronic device that collects data and processes the collected data. Electronic devices include computers.

The audio output device 101 is a device that includes a speaker for reproducing and outputting sound, for example a device that outputs sound at the event venue.

The video output device 102 is a device that includes a display, for example a liquid crystal display, for playing back and displaying video, for example a device that plays back and displays video at the event venue.

Each of the spectator terminals 2 to 2n is a terminal used by one of the remote spectators. Each is an electronic device having an input function, a display function, and a communication function, such as a tablet terminal, a smartphone, or a PC (Personal Computer), but is not limited to these. The spectator terminal 2 is an example of a terminal.
A configuration example of the server 1 will now be described.
The server 1 includes a control unit 11, a program storage unit 12, a data storage unit 13, a communication interface 14, and an input/output interface 15. These elements are connected to one another via a bus.
The control unit 11 corresponds to the central part of the server 1. The control unit 11 includes a processor such as a central processing unit (CPU), a ROM (Read Only Memory) as a nonvolatile memory area, and a RAM (Random Access Memory) as a volatile memory area. The processor loads a program stored in the ROM or in the program storage unit 12 into the RAM and executes it, whereby the control unit 11 realizes the functional units described later. The control unit 11 constitutes a computer.

The program storage unit 12 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), as a storage medium. It stores the programs necessary to execute various control processes, for example programs that cause the server 1 to execute the processing of the functional units implemented in the control unit 11. The program storage unit 12 is an example of storage.

The data storage unit 13 is composed of a nonvolatile memory that can be written to and read from at any time, such as an HDD or an SSD, as a storage medium. The data storage unit 13 is an example of storage, or of a storage unit.

The communication interface 14 includes various interfaces that communicably connect the server 1 to other electronic devices using the communication protocols defined by the IP network.

The input/output interface 15 is an interface that enables communication between the server 1 and each of the audio output device 101 and the video output device 102. It may include a wired communication interface or a wireless communication interface.

Note that the hardware configuration of the server 1 is not limited to the above. Components may be omitted, changed, or added as appropriate.
FIG. 2 is a block diagram showing an example of the software configuration of the server 1 constituting the video synchronization system according to the first embodiment.

The server 1 includes a video feature extraction unit 110, a delay estimation unit 111, a delay correction unit 112, and a learning unit 114. Each functional unit is realized by the control unit 11 executing a program; each can also be said to be included in, and may be read as, the control unit 11 or the processor. Although three video feature extraction units 110 are illustrated in FIG. 2, the number of video feature extraction units 110 is not limited to this. In the following description, each of the plurality of input videos is assumed to be processed by a different video feature extraction unit 110, but they may be processed by a single video feature extraction unit 110.

The video feature extraction unit 110 extracts video features from an input video. The input videos include, for example, videos of multiple remote spectators, such as individual videos cut out from the 5×5 matrix video shown in FIG. 3, and may also include video of a crowd at an event venue as shown in FIG. 4. A video feature is a feature observed in the input video, for example a person's movement, an object, or a facial expression contained in the video. When the input video shows spectators, the video features include human movements such as waving a penlight, lifting a towel, raising a hand, or waving a hand from side to side; they may also include objects such as penlights and towels, or facial expressions such as smiling or crying. Video features include features that represent actions or movements in the video.
The video feature extraction unit 110 performs feature extraction while shifting a window over the input video, for example as shown in FIG. 5, which is a conceptual diagram of the video feature extraction performed by the server 1 according to the first embodiment. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video based on a video clipping window width and determines the starting point of each window based on a clipping interval. After extracting features from the input video within one clipping window, it shifts the window by the clipping interval and extracts features from the input video within the next window.
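The windowing just described can be illustrated with a short Python sketch. It is only a minimal illustration, assuming a frame array and a placeholder extract_features function that stands in for the learned extractor; neither name comes from the specification.

    import numpy as np

    def extract_features(clip):
        # Placeholder for the learned video feature extractor: average the
        # frames and flatten, purely to keep the sketch runnable.
        return clip.mean(axis=0).ravel()

    def sliding_window_features(frames, window, interval):
        # Cut the video into clips of `window` frames, advancing the start
        # point by `interval` frames, and extract one feature per clip.
        feats = []
        for start in range(0, len(frames) - window + 1, interval):
            feats.append(extract_features(frames[start:start + window]))
        return np.stack(feats)      # shape: (num_windows, feature_dim)

    # Toy input: 120 frames of 8x8 grayscale video.
    video = np.random.rand(120, 8, 8)
    features = sliding_window_features(video, window=16, interval=4)
    print(features.shape)           # (27, 64)

Here `window` plays the role of the video clipping window width and `interval` the role of the clipping interval.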
The video feature extraction unit 110 may use machine learning for video feature extraction, for example the known methods described in Non-Patent Document 2 or Non-Patent Document 3. It may also use a video feature extraction method trained in advance by associating rhythmic sounds with video; in that case, the extracted video features are more closely related to rhythm.
Non-Patent Document 2: Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi and Noboru Harada, "Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels," Proc. Interspeech 2020, pp. 1446-1450, 2020.
Non-Patent Document 3: Masahiro Yasuda, Yasunori Ohishi, Yuma Koizumi, and Noboru Harada, "Cross-modal sound retrieval based on specific co-occurrence relations indicated by weak labels," Proceedings of the Acoustical Society of Japan Research Conference, Autumn 2020, ROMBUNNO.2-1-2.
The video feature extraction unit 110 may also extract video features using the feature extraction obtained through the learning performed by the learning unit 114, described later. The feature extraction obtained through this learning is a video feature extraction method, which can also be regarded as a trained model for extracting video features. The video feature extraction unit 110 can extract, as a video feature, a feature vector representing an action or movement in the video.

Note that the video feature extraction unit 110 does not have to use the same feature extraction interval or density at all times. It may provide at least two kinds of feature extraction intervals or densities; for example, for one of two videos it may use a narrower interval or higher density, and for the other a wider interval or lower density.
The delay estimation unit 111 estimates relative delay times from the video features extracted from the plurality of input videos. It compares multiple sequences of video features in time series and, based on the distance or the similarity between video features, determines for each time which time in the other sequence has the closest video feature. This search procedure, which compares sequences of video features in time series and finds, for each time, the time whose video feature is closest, is referred to here as a time-series search.
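A minimal sketch of such a time-series search, under the assumption that each video has already been reduced to a time series of feature vectors (for example by the windowing sketch above), might look as follows; the nearest neighbor is taken by Euclidean distance, one of the possibilities mentioned in the text.

    import numpy as np

    def time_series_search(feats_a, feats_b):
        # For each time index i in feats_a, find the time index j in feats_b
        # whose feature vector is closest, and return the matched (i, j) pairs.
        pairs = []
        for i, fa in enumerate(feats_a):
            dists = np.linalg.norm(feats_b - fa, axis=1)
            pairs.append((i, int(np.argmin(dists))))
        return pairs

    # Toy example: feats_b is feats_a shifted by 3 time steps.
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(30, 16))
    feats_b = np.roll(feats_a, 3, axis=0)
    print(time_series_search(feats_a, feats_b)[:5])   # pairs follow j = i + 3

The matched time pairs are the input to the delay estimation described next.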
The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting. It may estimate the timing offset between the movements of multiple remote spectators, that is, the delay time, by finding for each time the time whose feature is closest and casting a vote for that time. The voting-based estimation may be performed using the Hough transform.

The Hough transform draws, for each point (x_i, y_i) on the x-y plane shown in FIG. 7, the corresponding straight line on the a-b plane shown in FIG. 8, and determines the slope and intercept of the underlying line by voting. The delay estimation unit 111 finds the intersection (a_0, b_0) of the lines drawn on the a-b plane by voting into cells of a grid. As shown in FIG. 5, the video feature extraction unit 110 cuts out the input video at regular intervals and converts it into a time series of feature vectors as time-series video features. As shown in FIG. 9, the delay estimation unit 111 measures the distances between the feature vectors of the two persons at each time and plots the pairs of times that are nearest neighbors. If the paired videos show the same movement, these points ideally lie on a straight line offset in time by the delay. The delay estimation unit 111 obtains the slope a_0 and intercept b_0 of this line by the Hough transform.

For example, consider delay estimation for videos of remote spectators. Since it is unlikely that the individual remote spectators are moving in response to videos played back at different speeds, the delay estimation unit 111 may simplify the voting by fixing the slope to a_0 = 1, as shown in FIG. 10. For example, when multiple spectators at remote locations are watching the live performance, each spectator can be assumed to be watching video played back at the same speed, so a_0 may be fixed to 1. This is efficient because only one vote needs to be cast per point. The delay estimation unit 111 takes the intercept b_0 that received the most votes as the estimated delay time. This example assumes estimation of the delay time between two videos; when the delay times of more than two videos are required, multiple pairs may be extracted from the set and the delay time determined for each pair. This search procedure, which compares sequences of video features in time series and finds, for each time, the closest video feature based on the similarity between features, is referred to here as a voting-based search.
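With the slope fixed to a_0 = 1, the voting reduces to a histogram over intercepts: each matched time pair (i, j) casts one vote for b = j - i, and the most voted intercept is taken as the delay. The following sketch assumes the matched pairs come from a search like the one above and that the feature extraction interval is known; it illustrates the voting idea rather than the patented procedure itself.

    from collections import Counter

    def estimate_delay_by_voting(pairs, interval_sec):
        # With the slope fixed to a_0 = 1, each matched pair (i, j) votes for
        # the intercept b = j - i; the most voted intercept, converted to
        # seconds via the extraction interval, is the estimated delay.
        votes = Counter(j - i for i, j in pairs)
        b0, _ = votes.most_common(1)[0]
        return b0 * interval_sec

    # Matched pairs from a time-series search, with one outlier included.
    pairs = [(0, 3), (1, 4), (2, 5), (3, 6), (4, 9), (5, 8)]
    print(estimate_delay_by_voting(pairs, interval_sec=0.1))   # 0.3 (b0 = 3)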
The delay estimation unit 111 may perform the matching between feature vectors by the method of Non-Patent Document 2 or 3. In that case, it may estimate the relative delay time, the relative speed, or both, using a distance measure such as the Euclidean distance. This search procedure, which compares sequences of video features in time series and finds, for each time, the closest video feature based on the distance between features, is referred to here as a distance-based search.

When the video feature extraction unit 110 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities. For example, when estimating the delay between two videos, it may use a narrower interval or higher density of feature extraction on one video and a wider interval or lower density on the other.
The delay correction unit 112 synchronizes the plurality of videos by correcting their delay times using the estimated delay times. It corrects the playback times of the videos based on the estimated delay times; for example, it aligns the videos with the video of the remote spectator with the largest delay by inserting the delay difference into the playback times of the videos of remote spectators with smaller delays, producing synchronized video.

The delay correction unit 112 may, for example, perform delay correction so that all the videos are aligned with the video of one of the spectators shown in FIG. 3, such as the remote spectator at the upper left. To obtain delay times at low cost, it may extract a small number of representative samples from a grouped set of remote spectators, treat the average delay time of the samples as the delay time of all videos in the group, and correct the playback times of the videos of all remote spectators in the group. By correcting the playback times based on the delay times, the delay correction unit 112 can produce, from the videos of multiple spectators shown in the left part of FIG. 11, video in which the spectators' movements are aligned, as shown in the right part. In the following description, "playback" may be read as "output" or "transmission".
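One way to realize this correction is to hold back the playback of the less delayed videos by their difference from the most delayed one. The sketch below assumes that a delay (in seconds) has already been estimated for each video and uses hypothetical video identifiers; it only illustrates the offset computation.

    def compute_playback_offsets(delays):
        # Align every video with the one that has the largest estimated delay
        # by inserting the delay difference into the playback time of the rest.
        max_delay = max(delays.values())
        return {video_id: max_delay - d for video_id, d in delays.items()}

    # Estimated delays (seconds) of three remote-spectator videos.
    delays = {"spectator_A": 0.12, "spectator_B": 0.45, "spectator_C": 0.30}
    offsets = compute_playback_offsets(delays)
    print({v: round(o, 2) for v, o in offsets.items()})
    # {'spectator_A': 0.33, 'spectator_B': 0.0, 'spectator_C': 0.15}

spectator_B, the most delayed video, plays immediately, while the others are held back so that the movements line up.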
The learning unit 114 performs learning based on the phase of training videos or learning based on phase differences; both are described later. The learning unit 114 learns from training data containing multiple training videos and a phase associated with each of them. The phase corresponds to a part of the training video, for example a penlight appearing in it. Since the penlight is swung by a person, the phase corresponds to the position of the penlight: when the penlight is swung to the left as seen facing the person in the video, the phase may be 0 [rad]; when it is swung to the right, π [rad]; and when it is swung to the front, π/2 [rad]. The part of the training video to which the phase relates is not limited to a penlight and may be any of various movements, objects, or facial expressions of people in the video. The training data may be stored in the data storage unit 13 or in an electronic device other than the server 1. By performing the learning, the learning unit 114 obtains a video feature extraction method.
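The specification gives only the example mapping from penlight position to phase; the following sketch, in which the position names and the angular-difference helper are illustrative assumptions rather than the patented training procedure, shows how such phase labels could be attached to training clips and how a phase-difference-based criterion could compare them.

    import math

    # Illustrative phase labels for the penlight position in a training clip,
    # following the example mapping given in the text.
    PHASE_BY_POSITION = {"left": 0.0, "front": math.pi / 2, "right": math.pi}

    def phase_difference(phase_a, phase_b):
        # Smallest angular difference between two phase labels; a
        # phase-difference-based criterion could penalize this during training.
        d = abs(phase_a - phase_b) % (2 * math.pi)
        return min(d, 2 * math.pi - d)

    print(phase_difference(PHASE_BY_POSITION["left"], PHASE_BY_POSITION["right"]))  # pi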
(Operation example)
The procedure of processing by the server 1 will now be described.
In the following description, in which the server 1 is the subject, the server 1 may be read as the control unit 11.
Note that the processing procedure described below is only an example, and each process may be changed where possible. Steps may be omitted, replaced, or added as appropriate depending on the embodiment.

FIG. 6 is a flowchart showing an example of the video synchronization procedure and processing contents of the server 1 according to the first embodiment.

In the following processing, the camera videos of multiple remote spectators and the videos used for synchronization are the inputs, and video in which the videos of the multiple remote spectators are synchronized is the output. The remote spectators' camera videos and the videos used for synchronization are examples of input videos. The input videos are assumed to be remote-spectator videos obtained from the spectator terminals 2 to 2n. The synchronized video is output at the event venue via the video output device 102, and may also be output to the spectator terminals 2 to 2n.
The control unit 11 acquires the input videos obtained from the spectator terminals 2 to 2n (step S1).

The video feature extraction unit 110 waits for an input video (step S2).

The video feature extraction unit 110 extracts video features from the input video (step S3). In step S3, the video feature extraction unit 110 acquires the input video and, as shown in FIG. 5, extracts video features while shifting the clipping window. It may extract the video features using machine learning, for example the known methods described in Non-Patent Document 2 or Non-Patent Document 3, or a video feature extraction method trained in advance by associating rhythmic sounds with video.

The video feature extraction unit 110 may perform feature extraction on the plurality of input videos at different intervals or densities. For example, it may provide at least two kinds of feature extraction intervals or densities: when extracting features from two videos, it may use a narrower interval (or higher density) for one video and a wider interval (or lower density) for the other.

The video feature extraction unit 110 determines whether feature extraction has been performed for all the input videos (step S4). If so (step S4: YES), the process proceeds from step S4 to step S5; if not (step S4: NO), the process returns from step S4 to step S2.
The delay estimation unit 111 estimates relative delay times from the video features extracted from the plurality of input videos (step S5). In step S5, for example, the delay estimation unit 111 compares the sequences of video features and, based on the distance or the similarity between video features, determines for each time which time has the closest video feature. It estimates the delay time between two videos based on the time of the video feature that is closest, in distance or similarity, to the video feature of a given video. It may also extract multiple pairs from the set of input videos and estimate a delay time for each pair.

The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting: it finds, for each time, the time whose feature is closest, casts a vote for that time, and estimates the delay time between two videos based on the voted times. The voting-based estimation may use the Hough transform, and the relative speed may be fixed to 1.

When the video feature extraction unit 110 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities.
The delay correction unit 112 waits for an input video (step S6).

The delay correction unit 112 synchronizes the plurality of input videos by correcting their delay times using the delay times estimated by the delay estimation unit 111 (step S7). In step S7, for example, the delay correction unit 112 acquires the input videos and performs delay correction with the plurality of input videos as the reference, that is, based on times determined from the plurality of input videos, correcting the playback times of the input videos based on the estimated delay times. For example, it may align the videos with the video of the remote spectator with the largest delay by inserting the delay difference into the playback times of the videos with smaller delays, or it may perform delay correction so that all other videos are aligned with a predetermined video. It may also perform delay correction based on delay times calculated by grouping multiple videos: it extracts a small number of samples from a group, treats the average delay time of the samples as the delay time of all videos in the group, and corrects the playback times of all videos in the group (as sketched below).
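The grouped correction can be sketched as follows, assuming the per-video delays have already been estimated; the sample size and the identifiers are illustrative assumptions.

    import random

    def group_delay(delays, group, num_samples=3):
        # Estimate one delay for a whole group of remote-spectator videos by
        # averaging the delays of a small random sample from that group.
        sample = random.sample(group, min(num_samples, len(group)))
        return sum(delays[v] for v in sample) / len(sample)

    delays = {"spectator_%d" % i: 0.1 * (i % 4) for i in range(12)}
    group = list(delays)[:6]
    print(round(group_delay(delays, group), 3))   # applied to all six videos in the group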
The delay correction unit 112 determines whether delay correction has been performed for all the input videos (step S8). If so (step S8: YES), the processing ends; if not (step S8: NO), the process returns from step S8 to step S6.

The control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15, and the video output device 102 outputs it. The control unit 11 may also output the delay-corrected video to the spectator terminals 2 to 2n via the IP network, and the spectator terminals 2 to 2n output it.
(Effects)
In the embodiment described above, the server 1 extracts video features from a plurality of input videos, estimates relative delay times from those video features, and synchronizes the plurality of input videos by correcting their delay times using the estimated delay times. The server 1 thus estimates the delay times between multiple videos based on features such as the movements contained in the input videos and corrects the playback times of the videos based on those delay times, so that the movements contained in the videos are played back in alignment. In this way, the server 1 can synchronize videos based on the motion contained in them.
[Second Embodiment]
The second embodiment extracts sound features from an input sound and synchronizes a plurality of videos with the input sound as the reference.
(Configuration example)
In the second embodiment, the same components as in the first embodiment are given the same reference numerals and their description is omitted; the description below focuses mainly on the parts that differ from the first embodiment.
FIG. 12 is a block diagram showing an example of the software configuration of the server 1 constituting the video synchronization system according to the second embodiment.
The server 1 includes a video feature extraction unit 110, a delay estimation unit 111, a delay correction unit 112, a sound feature extraction unit 113, and a learning unit 114. Each functional unit is realized by the control unit 11 executing a program; each can also be said to be included in, and may be read as, the control unit 11 or the processor. Although three video feature extraction units 110 are illustrated in FIG. 12, the number of video feature extraction units 110 is not limited to this. In the following description, each of the plurality of input videos is assumed to be processed by a different video feature extraction unit 110, but they may be processed by a single video feature extraction unit 110.
The sound feature extraction unit 113 extracts sound features from an input sound. For a live music performance, the input sound is, for example, the sound played back at the venue; it serves as the reference for the sound features. The sound feature extraction unit 113 may perform sound feature extraction by the method of Non-Patent Document 2 or 3, and may match feature vectors of different modalities in a common feature space. It may provide at least two kinds of sound feature extraction intervals or densities; for example, for one of two videos it may use a narrower interval or higher density of feature extraction, and for the other a wider interval or lower density. The sound feature extraction unit 113 may also extract sound features from an input video.
The delay estimation unit 111 estimates the delay of each video relative to the sound by matching the video features of the plurality of videos against the sound features. It estimates the relative delay times between the plurality of input videos and the sound from the sound features, and adjusts the playback times of the videos to the input sound.

When the sound feature extraction unit 113 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities; for example, when estimating the delay of two videos, it may use a narrower interval or higher density of feature extraction for one sound and a wider interval or lower density for the other.

The delay correction unit 112 synchronizes the plurality of videos by correcting their delay times using the estimated delay times. It corrects the delay times of the videos with the sound as the reference, adjusting the playback times of the videos to the sound.
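As a rough sketch of the audio-referenced estimation, the function below matches each video feature to its nearest sound feature, assuming both already live in a common cross-modal feature space as in the methods of Non-Patent Documents 2 and 3, and votes on the offset with the slope fixed to 1. The toy data and the sign convention (positive when the video lags the sound) are assumptions for illustration only.

    import numpy as np

    def delay_to_audio(video_feats, audio_feats, interval_sec):
        # Match each video feature to its nearest sound feature in the common
        # feature space, vote on the time offset with the slope fixed to 1,
        # and convert the winning offset to seconds.
        offsets = []
        for i, fv in enumerate(video_feats):
            j = int(np.argmin(np.linalg.norm(audio_feats - fv, axis=1)))
            offsets.append(i - j)          # positive when the video lags the sound
        values, counts = np.unique(offsets, return_counts=True)
        return float(values[np.argmax(counts)]) * interval_sec

    # Toy common-space features: the video runs 5 windows behind the sound.
    rng = np.random.default_rng(1)
    audio_feats = rng.normal(size=(60, 8))
    video_feats = np.roll(audio_feats, 5, axis=0)
    print(delay_to_audio(video_feats, audio_feats, interval_sec=0.1))   # 0.5

The resulting per-video delays can then be fed to the same playback-offset correction sketched for the first embodiment.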
(Operation example)
FIG. 13 is a flowchart showing an example of the video synchronization procedure and processing contents of the server according to the second embodiment.
In the following processing, the camera videos of multiple remote spectators and the input sound are the inputs, and video in which the videos of the multiple remote spectators are synchronized is the output. The remote spectators' camera videos are examples of input videos, and are assumed to be obtained from the spectator terminals 2 to 2n. The synchronized video is output at the event venue via the video output device 102, and may also be output to the spectator terminals 2 to 2n. The input sound is, for example, sound obtained from the audio output device 101.
The control unit 11 acquires the input sound (step S101). The input sound is, for example, the sound played back at the event venue.

The sound feature extraction unit 113 waits for the input sound (step S102).

The sound feature extraction unit 113 extracts sound features from the input sound (step S103). In step S103, for example, the sound feature extraction unit 113 performs sound feature extraction by a known method. It may perform feature extraction on the input sound at different intervals or densities; for example, it may provide at least two kinds of feature extraction intervals or densities, such as wide or narrow intervals and high or low densities.
The control unit 11 acquires the input videos obtained from the spectator terminals 2 to 2n (step S104).

The video feature extraction unit 110 waits for an input video (step S105).

The video feature extraction unit 110 extracts video features from the input video, as in step S3 (step S106).

The video feature extraction unit 110 determines whether feature extraction has been performed for all the input videos (step S107). If so (step S107: YES), the process proceeds from step S107 to step S108; if not (step S107: NO), the process returns from step S107 to step S105.
The delay estimation unit 111 estimates the relative delay times between the plurality of input videos and the sound from the sound features (step S108). In step S108, for example, the delay estimation unit 111 matches the video features of the input videos against the sound features and estimates the delay of each video from the sound based on the result. For example, it compares the video features and the sound features and, based on the distance or the similarity between them, determines for each time which time has the closest sound feature; it then estimates the delay of the video from the sound based on the time of the sound feature that is closest, in distance or similarity, to the video feature of the video.

The delay estimation unit 111 may estimate the relative delay time, the relative speed, or both, using voting: it determines the timing offset between the movements of the remote spectators, that is, the delay time, by finding for each time the time whose sound feature is closest to the video feature, casting a vote for that time, and estimating the delay of the video from the sound based on the voted times. The voting-based estimation may use the Hough transform, and the relative speed may be fixed to 1.

When the sound feature extraction unit 113 provides two kinds of feature extraction intervals or densities, the delay estimation unit 111 may estimate the relative delay time by pairing the two kinds of intervals or the two kinds of densities.
The delay correction unit 112 waits for an input video (step S109).

The delay correction unit 112 synchronizes the plurality of input videos by correcting their delay times using the delay times estimated by the delay estimation unit 111 (step S110). In step S110, for example, the delay correction unit 112 acquires the input videos and performs delay correction with the input sound as the reference, that is, based on the delays of the input videos from the sound, correcting the playback times of the input videos based on those delay times.

The delay correction unit 112 determines whether delay correction has been performed for all the input videos (step S111). If so (step S111: YES), the processing ends; if not (step S111: NO), the process returns from step S111 to step S109.

The control unit 11 outputs the delay-corrected video to the video output device 102 via the input/output interface 15, and the video output device 102 outputs it. The control unit 11 may also output the delay-corrected video to the spectator terminals 2 to 2n via the IP network, and the spectator terminals 2 to 2n output it.
(Effects)
In the embodiment described above, the server 1 extracts video features from a plurality of input videos, estimates relative delay times from those video features, and synchronizes the plurality of input videos by correcting their delay times using the estimated delay times. In addition, the server 1 extracts sound features from the input sound and estimates the relative delay times between the plurality of videos and the sound from those sound features. The server 1 can therefore estimate the delays of the videos from the input sound based on features such as the movements contained in the input videos and the sound features of the input sound, and correct the playback times of the videos based on those delays. In this way, the server 1 can adjust the playback times of the videos to the input sound and synchronize the videos with the sound.
Matters common to the first and second embodiments are described below as supplements.
First, a method of capturing video of a crowd at the event venue is described.
FIG. 14 is a diagram showing an example of a method of capturing video at an event venue according to the first and second embodiments.
As shown in FIG. 14, a camera installed in the event venue captures the crowd in the venue, producing crowd video such as that shown in FIG. 4. For example, the camera is installed on the stage side of the venue and aimed at the audience seats. The number of cameras in the venue is not limited to one; multiple cameras may be installed, and the crowd video may be selected from the videos captured by at least one camera.
 An application example of the video synchronization system S will be explained.
 FIG. 15 is a diagram illustrating an application example of the video synchronization system S according to the first and second embodiments.
 FIG. 15 shows an example of a system in which remote audience members participate in a live streaming event. A remote audience member participates in the event using a PC that allows the member to watch the live stream while being filmed by a camera. The roles of the devices and functions (1) to (5) in FIG. 15 are as follows.
 (1) Each remote audience member continues to transmit camera video of their face and upper body during the event. (1) is a function of the plurality of audience terminals 2 to 2n.
 (2) The videos of the remote audience members are arranged in a grid, horizontally, vertically, or both, aggregated, and processed into an easy-to-use form (a minimal tiling sketch is given after this list). (2) is a function of the server 1.
 (3) The video created in (2) is displayed at the event venue. (3) is a function of the server 1.
 (4) Video of the event venue in which the remote audience members' videos are reflected is captured and delivered to the remote audience members' PCs as the live distribution video. (4) is a function of the server 1.
 (5) The live distribution video is displayed on the screens of the remote audience members' PCs. (5) is a function of the plurality of audience terminals 2 to 2n.
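 The grid aggregation in (2) can be sketched as follows; representing the frames as equal-sized NumPy arrays and the function name tile_frames are assumptions for illustration only, not the embodiment's implementation.

```python
import numpy as np

def tile_frames(frames, cols):
    """Arrange equal-sized H x W x C frames into a grid with `cols` columns."""
    rows = -(-len(frames) // cols)                          # ceiling division
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```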
 Using video showing the remote audience in this way not only allows the audience and performers at the venue to see how excited the remote audience is, but can also be used for interaction between the performers and audience at the venue and the remote audience, which increases the utility value of two-way video distribution. In addition, at live music events and sports events, spectators often enjoy the event by synchronizing their movements with one another using penlights and cheering goods. Matching this kind of enjoyment between the remote audience's videos and the venue audience is difficult because of the delays that occur between the live venue and the remote environment.
 To create harmonious, synchronized video from the camera videos of a plurality of remote audience members watching a live streaming event without impairing the sense of rhythm, and to use it in the venue production, the first and second embodiments adjust the delays caused by various factors such as the communication time between the remote audience and the venue, the video processing time, and the reaction time of the remote audience. As shown in FIG. 11, the first and second embodiments generate harmonious video in which the movements of a plurality of audience members, such as clapping or waving penlights, are synchronized in the time direction.
 An example of processing by the server 1 will be explained.
 FIG. 16 is a diagram illustrating an example of processing by the server 1 according to the first and second embodiments.
 To synchronize the videos so that the movements of the audience appear aligned, the server 1 needs to focus on and align the targets that serve as the reference for synchronizing people, that is, the partial movements in the video such as clapping or waving a penlight. To capture such movements, the server 1 uses a 3-Dimensional Convolutional Neural Network (3D-CNN) to extract spatial and temporal features and performs a search based on the obtained features.
 The video feature extraction unit 110 extracts a video feature X from an input video X based on feature extraction trained by phase-based learning or phase-difference-based learning, and likewise extracts a video feature Y from an input video Y. The delay estimation unit 111 estimates a relative delay time from the video feature X and the video feature Y. The delay correction unit 112 uses the estimated delay time to synchronize a playback video X based on the input video X and a playback video Y based on the input video Y. For example, the delay correction unit 112 corrects the playback time of the playback video X by inserting, into the playback video X, the estimated delay time of the playback video Y with respect to the playback video X, thereby producing video in which the playback video X and the playback video Y are synchronized.
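 One simple way to realize the insertion of the estimated delay into the playback video X is sketched below; padding with copies of the first frame and the function name are assumptions used only for illustration.

```python
def insert_delay(frames_x, t0_frames):
    """Delay playback of video X by t0_frames by prepending copies of its first frame."""
    # After padding, the content of X starts t0_frames later, so X and Y line up.
    return [frames_x[0]] * t0_frames + list(frames_x)
```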
 An implementation example of the video feature extraction unit 110 will be described.
 FIG. 17 is a diagram illustrating an example of a Deep Neural Network (DNN) structure implemented in the video feature extraction unit 110 according to the first and second embodiments.
 S-avgpool represents average pooling over the spatial dimensions, and T-avgpool represents average pooling over the time dimension.
 The video feature extraction unit 110 uses a 3D-CNN DNN structure to extract video features based on feature extraction trained by phase-based learning or phase-difference-based learning. As the 3D-CNN DNN structure, the top three layers of Resnet18-3D (R3D-18) described in Non-Patent Document 5, trained with Kinetics-400 described in Non-Patent Document 4 and used for human action classification tasks, are used. The video features extracted by R3D-18 are encoded into a G-dimensional latent space through a fully connected layer and pooling layers.
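 A sketch of such an encoder, assuming PyTorch and torchvision (0.13 or later), is shown below. Which three layers of R3D-18 are reused, the latent dimension G, and all identifiers are assumptions made for illustration; the embodiment's actual structure is the one shown in FIG. 17.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        backbone = r3d_18(weights="KINETICS400_V1")      # Kinetics-400 pretrained R3D-18
        # Assumption: "top three layers" is taken here as the stem plus the first
        # two residual stages of R3D-18.
        self.features = nn.Sequential(backbone.stem, backbone.layer1, backbone.layer2)
        self.pool = nn.AdaptiveAvgPool3d(1)              # spatial and temporal average pooling
        self.fc = nn.Linear(128, latent_dim)             # layer2 of R3D-18 outputs 128 channels

    def forward(self, x):                                # x: (batch, 3, P frames, H, W)
        h = self.pool(self.features(x)).flatten(1)       # (batch, 128)
        return self.fc(h)                                # G-dimensional latent vector

# Example: encode a 16-frame clip at 112 x 112 resolution into the latent space.
z = VideoEncoder()(torch.randn(1, 3, 16, 112, 112))      # z.shape == (1, 128)
```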
 The Video Encoder f is used to encode an input video x, a clip with height H, width W, and P frames, into the latent space as a G-dimensional vector f(x). Here, H, W, and P are the height, width, and number of frames of the input video. The loss function used for training the DNN is the triplet loss
   L_triplet = max(d_p - d_n + α, 0)
 where d_p and d_n are the Anchor-Positive and Anchor-Negative distances defined below and α is a positive constant called the margin parameter. In learning based on the triplet loss, training is performed on sets of three samples: a reference Anchor x_a, a Positive x_p in the same category as the Anchor, and a Negative x_n in a different category from the Anchor.
 Each input is output as a vector in the embedding space by the CNN (the transform f). In the embedding space, the distance d_p between the Anchor and the Positive and the distance d_n between the Anchor and the Negative are measured with a distance function d. The Euclidean distance is used as the distance function d.
 Here, to make training more efficient, triplets are constructed by selecting Positive and Negative samples only within each mini-batch, using the method described in Non-Patent Document 6. In particular, Negative samples are selected according to the following semi-hard negative condition (Equation (1)).
   d_p < d_n < d_p + α    (1)
Non-Patent Document 4: W. Kay et al. "The Kinetics Human Action Video Dataset". Computing Research Repository, abs/1705.06950, 2017.
Non-Patent Document 5: D. Tran et al. "A Closer Look at Spatiotemporal Convolutions for Action Recognition". Proc. of IEEE International Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
Non-Patent Document 6: F. Faghri et al. "VSE++: Improving visual-semantic embeddings with hard negatives". Proc. of the British Machine Vision Conf. (BMVC), 2018.
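 A PyTorch sketch of the triplet loss with semi-hard negative selection inside a mini-batch (Equation (1)) follows; the batch layout, the margin value, and the function name are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(anchors, positives, candidates, alpha=0.2):
    """anchors, positives: (B, G) embeddings; candidates: (B, N, G) candidate negatives."""
    d_p = F.pairwise_distance(anchors, positives)                       # (B,)
    d_n_all = torch.cdist(anchors.unsqueeze(1), candidates).squeeze(1)  # (B, N)
    # Semi-hard condition from Equation (1): d_p < d_n < d_p + alpha
    mask = (d_n_all > d_p.unsqueeze(1)) & (d_n_all < d_p.unsqueeze(1) + alpha)
    inf = torch.full_like(d_n_all, float("inf"))
    d_n = torch.where(mask, d_n_all, inf).min(dim=1).values
    # Fall back to the hardest negative if no candidate satisfies the condition.
    d_n = torch.where(torch.isinf(d_n), d_n_all.min(dim=1).values, d_n)
    return F.relu(d_p - d_n + alpha).mean()

# Example with random embeddings: batch of 8 anchors, each with 16 candidate negatives.
loss = semi_hard_triplet_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 16, 64))
```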
 The method of determining Positive and Negative is explained here.
 FIG. 18 is a diagram illustrating an example of phase-based learning performed by the learning unit 114 according to the first and second embodiments. The upper half of FIG. 18 shows the left-right swing of a penlight viewed from the front, and the lower half shows the corresponding phase expressed in radians. In this example, one round trip of the penlight from side to side is regarded as one period and associated with the phase; alternatively, one back-and-forth swing of the penlight or one clap of the hands may be associated with one period. More generally, for any periodic motion, such as dancing, hand movements, or the movement of an object, each position can be associated with a phase in the same way as long as one period can be defined.
 Positive or Negative must be determined for each pair. The phase is divided in advance into four sections (corresponding to the quadrants), and a pair is judged Positive when both samples fall in the same section and Negative when they fall in different sections. This criterion is referred to here as phase-based learning.
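 A minimal sketch of the phase-based (quadrant) criterion follows; representing phases as radians in [0, 2π) and the function names are assumptions for illustration.

```python
import math

def quadrant(phase: float) -> int:
    """Map a phase in radians to one of four sections (quadrants) 0..3."""
    return int((phase % (2 * math.pi)) // (math.pi / 2))

def is_positive_pair(phase_a: float, phase_b: float) -> bool:
    """Phase-based criterion: Positive if both phases fall in the same quadrant."""
    return quadrant(phase_a) == quadrant(phase_b)
```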
 FIG. 19 is a diagram illustrating an example of phase-difference-based learning performed by the learning unit 114 according to the first and second embodiments.
 Under this criterion, a pair is judged Positive when the phase difference between the two samples is less than π/2 and Negative when it is π/2 or more. This criterion is referred to here as phase-difference-based learning.
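 The phase-difference criterion can be sketched in the same way; the wrap-around handling of the phase difference is an assumption for illustration.

```python
import math

def is_positive_pair_by_diff(phase_a: float, phase_b: float) -> bool:
    diff = abs(phase_a - phase_b) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)      # shortest angular distance
    return diff < math.pi / 2                 # Positive if less than pi/2, else Negative
```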
 An example of the time-series search performed by the delay estimation unit 111 will be described.
 FIG. 20 is a diagram illustrating an example of the time-series search performed by the delay estimation unit 111 according to the first and second embodiments.
 The delay estimation unit 111 compares, in time series, the latent-space feature vectors obtained as video features by the processing of the video feature extraction unit 110. Here, the delay estimation unit 111 cuts out feature vectors in time series for a person X and a person Y; for the person X, the feature vectors are cut out starting from the time offset by the delay time t_0. The delay estimation unit 111 compares the feature vector F_X(t + t_0) of the person X with the feature vector F_Y(t) of the person Y while shifting the delay time t_0, and finds the delay time t_0 that minimizes the distance D(F_X, F_Y, t_0). The delay correction unit 112 adjusts the playback time of the playback video X of the person X by inserting the delay time t_0 obtained by the delay estimation unit 111 into the playback time, and can thereby produce video in which the playback video X of the person X and the playback video Y of the person Y appear synchronized. In the distance-based search, the delay estimation unit 111 can use the Euclidean distance as the distance function D.
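 A NumPy sketch of this distance-based time-series search follows; the array layout (time × feature dimension), the window handling, and all names are assumptions for illustration only.

```python
import numpy as np

def estimate_delay(feat_x, feat_y, max_delay, window):
    """feat_x, feat_y: (T, G) feature sequences. Returns the delay t0 (in frames)
    minimizing the Euclidean distance between F_X(t + t0) and F_Y(t) over a window."""
    best_t0, best_dist = 0, float("inf")
    for t0 in range(max_delay + 1):
        dist = np.linalg.norm(feat_x[t0:t0 + window] - feat_y[:window])
        if dist < best_dist:
            best_t0, best_dist = t0, dist
    return best_t0

# Toy check: X contains Y's feature pattern shifted by 7 frames.
rng = np.random.default_rng(0)
feat_y = rng.normal(size=(200, 16))
feat_x = np.vstack([rng.normal(size=(7, 16)), feat_y])
assert estimate_delay(feat_x, feat_y, max_delay=20, window=100) == 7
```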
 [Other Embodiments]
 The video synchronization device may be realized by a single device, as explained in the examples above, or by a plurality of devices among which the functions are distributed.
 The program may be transferred while stored in an electronic device, or may be transferred without being stored in an electronic device. In the latter case, the program may be transferred via a network or in a state recorded on a recording medium. The recording medium is a non-transitory tangible medium and a computer-readable medium. The recording medium may be of any form as long as it can store the program and can be read by a computer, such as a CD-ROM or a memory card.
 Although embodiments of the present invention have been described in detail above, the foregoing description is merely illustrative of the present invention in all respects. Needless to say, various improvements and modifications can be made without departing from the scope of the invention. In other words, in carrying out the present invention, specific configurations according to the embodiments may be adopted as appropriate.
 In short, the present invention is not limited to the above embodiments as they are, and at the implementation stage the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiments, and constituent elements from different embodiments may be combined as appropriate.
 The above embodiments may be applied not only to an electronic device but also to a method executed by the electronic device, and to a program that causes a computer to execute the processing of each unit of the electronic device.
 1   Server
 2 to 2n   Audience terminal
 11   Control unit
 12   Program storage unit
 13   Data storage unit
 14   Communication interface
 15   Input/output interface
 101   Audio output device
 102   Video output device
 110   Video feature extraction unit
 111   Delay estimation unit
 112   Delay correction unit
 113   Sound feature extraction unit
 114   Learning unit
 S   Video synchronization system

Claims (8)

  1.  A video synchronization device comprising:
     a video feature extraction unit that extracts video features from a plurality of input videos;
     a delay estimation unit that estimates a relative delay time from the video features extracted from the plurality of input videos; and
     a delay correction unit that uses the delay time to correct the delays of the plurality of input videos and synchronize the plurality of input videos.
  2.  The video synchronization device according to claim 1, further comprising a sound feature extraction unit that extracts a sound feature from an input sound,
     wherein the delay estimation unit estimates relative delay times between the plurality of input videos and the input sound from the sound feature.
  3.  The video synchronization device according to claim 1, wherein
     the video feature extraction unit performs feature extraction at two kinds of intervals or two kinds of densities, and
     the delay estimation unit estimates a relative delay time by pairing the two kinds of intervals or the two kinds of densities.
  4.  The video synchronization device according to claim 2, wherein
     the sound feature extraction unit performs feature extraction at two kinds of intervals or two kinds of densities, and
     the delay estimation unit estimates a relative delay time by pairing the two kinds of intervals or the two kinds of densities.
  5.  The video synchronization device according to claim 1, wherein the delay estimation unit estimates at least one of a relative delay time or a relative speed using a distance or voting.
  6.  The video synchronization device according to claim 1, further comprising a learning unit that executes learning based on a phase of a training video or learning based on a phase difference,
     wherein the video feature extraction unit extracts the video features based on feature extraction by the learning.
  7.  A video synchronization method comprising:
     a video feature extraction step of extracting video features from a plurality of input videos;
     a delay estimation step of estimating a relative delay time from the video features extracted from the plurality of input videos; and
     a delay correction step of using the delay time to correct the delays of the plurality of input videos and synchronize the plurality of input videos.
  8.  A video synchronization program that causes a computer to execute processing by each unit included in the video synchronization device according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033307 WO2024052964A1 (en) 2022-09-05 2022-09-05 Video synchronization device, video synchronization method, and video synchronization program

Publications (1)

Publication Number Publication Date
WO2024052964A1 2024-03-14

Family

ID=90192382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/033307 WO2024052964A1 (en) 2022-09-05 2022-09-05 Video synchronization device, video synchronization method, and video synchronization program

Country Status (1)

Country Link
WO (1) WO2024052964A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008199557A (en) * 2007-02-16 2008-08-28 Nec Corp Stream synchronization reproducing system, stream synchronization reproducing apparatus, synchronous reproduction method, and program for synchronous reproduction
JP2015097319A (en) * 2013-11-15 2015-05-21 キヤノン株式会社 Synchronization system
WO2019022256A1 (en) * 2017-07-28 2019-01-31 国立研究開発法人産業技術総合研究所 Music linking control platform and method for controlling same
JP2019192178A (en) * 2018-04-27 2019-10-31 株式会社コロプラ Program, information processing device, and method
JP2020004388A (en) * 2019-04-11 2020-01-09 株式会社コロプラ System, program, method, and information processing device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22958045

Country of ref document: EP

Kind code of ref document: A1