WO2022137325A1 - Device, method, and program for synthesizing video signals - Google Patents

Device, method, and program for synthesizing video signals

Info

Publication number
WO2022137325A1
Authority
WO
WIPO (PCT)
Prior art keywords
video signals
input
video
time
output
Prior art date
Application number
PCT/JP2020/047864
Other languages
French (fr)
Japanese (ja)
Inventor
稔久 藤原
央也 小野
達也 福井
智彦 池田
亮太 椎名
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2022570805A priority Critical patent/JPWO2022137325A1/ja
Priority to PCT/JP2020/047864 priority patent/WO2022137325A1/en
Publication of WO2022137325A1 publication Critical patent/WO2022137325A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Definitions

  • Normally, the timings of the video signals are not synchronized, and the video signals to be combined arrive with different timing; the signals are therefore temporarily buffered in memory or the like before being combined. As a result, the output of the combined screen is delayed.
  • The delay involved in this composition greatly impairs the feasibility of timing-critical uses such as a remote ensemble.
  • For example, for a piece at 120 BPM (beats per minute), one beat lasts 500 ms, and matching it to 5% accuracy requires keeping the camera-to-display delay below 25 ms.
  • The purpose of the present disclosure is to shorten the delay time until the combined video is output.
  • A compositing process is performed in which the screens are arranged from the top of the output in order of earliest input timing among the plurality of input video signals.
  • The video compositing device of the present disclosure detects the input timing of each input frame constituting a plurality of video signals; when a set number of the plurality of video signals have been input, it sequentially starts compositing processing of the set number of video signals; and it generates an output frame in which the plurality of video signals are combined into one video signal.
  • In the video compositing method of the present disclosure, the video compositing device detects the input timing of each input frame constituting a plurality of video signals; when a set number of the plurality of video signals have been input, it sequentially starts compositing processing of the set number of video signals; and it generates an output frame in which the plurality of video signals are combined into one video signal.
  • The program of the present disclosure is a program for causing a computer to function as each functional unit of the device according to the present disclosure, and for causing a computer to execute each step of the method executed by that device.
  • This disclosure can shorten the delay time until the output of the composite video.
  • An example of screen information included in a video signal is shown.
  • An example of screen composition is shown.
  • An example of a video composition method related to the present disclosure is shown.
  • An example of the video synthesis method of the present disclosure is shown.
  • An example of the video synthesis method of the present disclosure is shown.
  • An example of the video synthesis method of the present disclosure is shown.
  • An example of the video synthesis method of the present disclosure is shown.
  • A configuration example of the video compositing device according to this embodiment is shown.
  • FIG. 1 shows an example of screen information included in a video signal.
  • the information on the screen is transmitted by scanning the screen in the horizontal direction for each scanning line 21 and sequentially scanning the lower scanning line 21.
  • This scan includes scanning of overhead information / signals such as the blanking portion 22 and the border portion 23 in addition to the display screen 24.
  • the blanking portion 22 may include information other than video information, such as control information and audio information. (See, for example, Non-Patent Document 1, Chapter 3.)
  • FIG. 2 shows an example of synthesizing video signals.
  • four video signals are input to the video synthesizer, and the video synthesizer synthesizes and outputs one video signal.
  • One screen is transmitted using a time equal to the reciprocal of the frame rate.
  • For example, a 60-frame-per-second signal (hereinafter, 60 fps) transmits one screen in 1/60 s, that is, about 16.7 ms.
  • The one-screen information at each point in time in a video signal is called a "frame"; the one-screen information of each video signal input to the video compositing device is called an "input frame", and the combined one-screen information output from the device is called an "output frame".
  • For example, consider the case shown in FIG. 3, in which the video compositing device reads all the input frames and only then combines them into one output frame and outputs it.
  • Let the frame time of each input frame be T_f and the compositing time be T_p.
  • The output of the output frame then lags the start of the first input, input 1, by up to 2T_f + T_p. At a frame rate of 60 fps, this is a delay of at least 33.3 ms.
  • The device and method of the present disclosure form a system that receives a plurality of asynchronous videos and combines their images, characterized in that the compositing processes are started in order of earliest input timing, with earlier inputs placed toward the top of the screen.
  • FIG. 4 shows a first compositing example of the present disclosure.
  • When the input of input 2 is complete, compositing process (1) is started and output to the upper part of the display screen 24 begins.
  • When the input of input 4 is complete, compositing process (2) is started and output to the lower part of the display screen 24 begins.
  • In this case, the maximum delay from the start of input 1 to the start of output of the upper display screen 24 is (5/4)T_f + T_p.
  • This shortens the output delay by (3/4)T_f compared with the example shown in FIG. 3. For example, at a frame rate of 60 fps, the delay is about 21 ms + T_p.
  • Next, consider the case where the inputs are not each offset by exactly (1/4)T_f.
  • As shown in FIG. 5, when the time difference T_in2toin4 between the ends of the input frames of input 2 and input 4 is longer than T_f/2, the start of compositing process (1) is delayed by at least T_in2toin4 - T_f/2 after input 2 completes, so that the output of the lower display screen 24 can follow the output of the upper display screen 24 (which combines inputs 1 and 2) in time.
  • Alternatively, compositing process (1) may be performed first and the output of the upper output frame delayed by T_in2toin4 - T_f/2.
  • In the examples of FIGS. 5 and 6, the end-of-frame time difference used is that between inputs 2 and 4, and it is compared against T_f/2.
  • The end-of-frame time difference and its comparison value can, however, be any values determined by the number of video signals to be combined and the screen layout. For example, when six video signals are combined into one video signal with two screens at the top, two in the middle, and three at the bottom, the time difference between the ends of inputs 4 and 6 would be compared against T_f/3.
  • Since an actual video signal contains overhead such as the blanking and border portions, the comparison values T_f/2 and T_f/3 are figures for the display screen 24 portion of the signal and must be corrected according to the overhead portion.
  • Let the pipelined compositing time be T_pp (Time of Pipelined Processing).
  • The pipelined compositing time denotes only the initial pipeline overhead (the time required for all processing, including data reading, before data is handed to the next stage); the compositing itself runs continuously in step with the input or output.
  • The effective time of the pipelined compositing is the time to process one unit of data in the pipeline before the subsequent output stage. In this case, processing can be started so that the output of the output frame completes at the end time of the video signal input plus T_pp.
  • Compositing process (1) is started so that the time T_pp after the input completion time T_2E of input 2 coincides with the output completion time T_UE of the upper display screen 24, and output to the upper part of the display screen 24 begins.
  • Compositing process (2) is started so that the time T_pp after the input completion time T_4E of input 4 coincides with the output completion time T_DE of the lower display screen 24, and output to the lower part of the display screen 24 begins.
  • In this case, the maximum delay from the start of input 1 to the start of output of the upper display screen 24 is (3/4)T_f + T_pp.
  • This shortens the output delay compared with the example shown in FIG. 3. For example, at a frame rate of 60 fps, the delay is 12.5 ms + T_pp.
  • When the time difference T_in2toin4 is shorter than T_f/2, the start of compositing process (2) is delayed by T_f/2 - T_in2toin4 after input 4 completes, so that the output of the lower display screen 24 follows the output of the upper display screen 24 in time.
  • Alternatively, compositing process (2) may be performed first and the output of the output frame delayed by T_f/2 - T_in2toin4.
  • FIG. 8 shows an example of the system configuration according to this embodiment.
  • the video compositing device 10 according to the present embodiment includes a detection unit 101, a crossbar switch 102, an up / down converter 103, a buffer 104, and a pixel compositing unit 105.
  • The figure shows four inputs and one output, but any number of inputs and outputs may be used.
  • Reference numeral 101 is a functional unit that detects the input order within the frame time for N input frames. For example, the input timings of the inputs 1, 2, 3, and 4 shown in FIGS. 4 and 5 are detected, and the order of the inputs 1, 2, 3, and 4 is determined using the input timings.
  • Reference numeral 102 is a crossbar switch, a functional unit that reorders the inputs according to the input-order detection result from 101 and outputs them. For example, it arranges the input frames in the order of inputs 1, 2, 3, and 4 shown in FIGS. 4 and 5.
  • Reference numeral 103 is an up / down converter that enlarges / reduces the number of pixels to an arbitrary size.
  • For example, the pixel count of input 1 is enlarged or reduced to match the screen size shown in FIG. 2. Note that 102 and 103 may be connected in the reverse order with respect to the inputs (a, b, c, d): the inputs a, b, c, and d may first be enlarged or reduced at 103 and then reordered into inputs 1, 2, 3, and 4 and output at 102.
  • Reference numeral 104 is a buffer for storing each input frame. The inputs of 103 or 102 can be buffered and output in any order.
  • Reference numeral 105 is a pixel synthesizing unit.
  • the pixel synthesizing unit 105 reads pixel data from 104 in the order of output from the entire output screen, synthesizes them, generates an output frame, and outputs the data. As a result, a video in which the four video signals are combined is displayed on the screen as shown in FIG. This timing is as described above.
  • Unit 105 may add an arbitrary control signal to the blanking portion 22 of the screen.
  • the device of the present disclosure can also be realized by a computer and a program, and the program can be recorded on a recording medium or provided through a network.
  • This disclosure can be applied to the movie, advertising, and game industries related to video production, as well as the information and communication industry that distributes video content and game content.
  • 10: Video compositing device, 21: Scanning line, 22: Blanking portion, 23: Border portion, 24: Display screen, 101: Detection unit, 102: Crossbar switch, 103: Up/down converter, 104: Buffer, 105: Pixel compositing unit
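As an illustrative sketch only (the function and input names below are ours, not from the disclosure), the ordering role of the detection unit 101 and crossbar switch 102 described above can be expressed in a few lines: the inputs are ranked by arrival time, and the pixel compositing unit 105 then places the earliest-arriving inputs toward the top of the output frame.

```python
# Illustrative sketch only: names are ours, not from the patent. It mimics
# the role of the detection unit (101) and crossbar switch (102): rank the
# inputs by arrival time so that the pixel compositing unit (105) can place
# the earliest inputs at the top of the output frame.
def composite_order(arrival_times: dict) -> list:
    # 101 detects the input order within the frame time; 102 reorders.
    return sorted(arrival_times, key=arrival_times.get)

# Earliest arrivals come first, i.e. end up at the top of the screen:
print(composite_order({"a": 3.2, "b": 0.5, "c": 1.1, "d": 2.0}))
# ['b', 'c', 'd', 'a']
```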

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Transforming Electric Information Into Light Information (AREA)

Abstract

One purpose of this disclosure is to reduce the delay time before the combined video is output. According to this disclosure, a video compositing device detects the input timing of each of the input frames constituting multiple video signals; when a set number of the multiple video signals have been input, it sequentially starts a process of combining the set number of video signals; and it generates an output frame in which the multiple video signals are combined into one video signal.

Description

Device, method, and program for synthesizing video signals
 The present disclosure relates to a video composition system that combines the screens of a plurality of input video signals into one and outputs the result.
 In recent years, many video devices have come into use, with a wide variety of pixel counts (resolutions), frame rates, and so on. Although the video signals of these devices differ in physical signaling, control signals, and the like depending on the standard, each transmits one screen using a time equal to the reciprocal of its frame rate.
 One way these videos are used is in forms such as video conferencing, where multiple cameras are displayed on fewer monitors than there are cameras. In such cases, screen composition is performed, for example by tiling a plurality of videos on one screen, or by shrinking other video screens and inserting them into a given video screen.
 Normally, the timings of the video signals are not synchronized, and the video signals to be combined arrive with different timing; the signals are therefore temporarily buffered in memory or the like before being combined. As a result, the output of the combined screen is delayed.
 If an ensemble performance between remote locations is to be held over a video conference that performs such screen composition, the delay involved in the composition greatly impairs its feasibility. For example, for a piece at 120 beats per minute (hereinafter, 120 BPM), one beat lasts 60/120 s = 500 ms. If this must be matched to an accuracy of 5%, the delay from camera capture to display must be kept below 500 × 0.05 = 25 ms.
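The delay budget above is simple arithmetic; a small sketch (the function name is ours, not from the disclosure) makes it checkable:

```python
# Worked example of the delay budget above. At 120 BPM one beat lasts
# 60/120 s = 500 ms, and a 5% timing accuracy then allows at most
# 500 * 0.05 = 25 ms from camera capture to display.
def delay_budget_ms(bpm: float, accuracy: float) -> float:
    beat_ms = 60_000.0 / bpm   # duration of one beat in milliseconds
    return beat_ms * accuracy  # tolerable capture-to-display delay

print(round(delay_budget_ms(120, 0.05), 3))  # 25.0
```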
 In practice, the time from camera capture to display must also include delays other than the compositing process, such as image processing in the camera, display time on the monitor, and transmission time. As a result, with the prior art it has been difficult to collaborate in applications where timing is critical, such as playing an ensemble while watching each other's video from remote locations.
 What is needed, therefore, is a system that combines video signals from multiple sites and reduces the time from the input of the asynchronous video signals to the output of the combined video signal, for collaborative work with strict low-delay requirements.
 The purpose of the present disclosure is to shorten the delay time until the combined video is output.
 In the present disclosure, a device that combines and displays a plurality of asynchronous videos performs a compositing process in which the screens are arranged from the top of the output in order of earliest input timing among the plurality of input video signals.
 The video compositing device of the present disclosure:
 detects the input timing of each input frame constituting a plurality of video signals;
 when a set number of the plurality of video signals have been input, sequentially starts compositing processing of the set number of video signals; and
 generates an output frame in which the plurality of video signals are combined into one video signal.
 In the video compositing method of the present disclosure, the video compositing device:
 detects the input timing of each input frame constituting a plurality of video signals;
 when a set number of the plurality of video signals have been input, sequentially starts compositing processing of the set number of video signals; and
 generates an output frame in which the plurality of video signals are combined into one video signal.
 The program of the present disclosure is a program for causing a computer to function as each functional unit of the device according to the present disclosure, and for causing a computer to execute each step of the method executed by the device according to the present disclosure.
 The present disclosure can shorten the delay time until the combined video is output.
 FIG. 1 shows an example of the screen information contained in a video signal. FIG. 2 shows an example of screen composition. FIG. 3 shows an example of a video composition method related to the present disclosure. FIGS. 4 to 7 show examples of the video composition method of the present disclosure. FIG. 8 shows a configuration example of the video compositing device according to the present embodiment.
 Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings. The present disclosure is not limited to the embodiments shown below; these examples are merely illustrative, and the present disclosure can be implemented in variously modified and improved forms based on the knowledge of those skilled in the art. In the present specification and drawings, components with the same reference numerals denote the same components.
 FIG. 1 shows an example of the screen information contained in a video signal. The screen information is transmitted by scanning the screen horizontally along one scanning line 21 at a time, proceeding sequentially to the scanning line 21 below. In addition to the display screen 24, this scan covers overhead information/signals such as the blanking portion 22 and the border portion 23. The blanking portion 22 may also carry information other than video, such as control information and audio information (see, for example, Non-Patent Document 1, Chapter 3).
 FIG. 2 shows an example of combining video signals. In the present disclosure, as an example, four video signals are input to the video compositing device, which combines them into one video signal and outputs it. A video signal transmits one screen using a time equal to the reciprocal of its frame rate. For example, a signal with 60 frames per second (hereinafter, 60 fps) takes 1/60 s, about 16.7 ms, to transmit one screen. The one-screen information at each point in time in a video signal is called a "frame"; the one-screen information of each video signal input to the video compositing device is called an "input frame", and the combined one-screen information output from the device is called an "output frame".
 Consider, for example, the case shown in FIG. 3, in which the video compositing device reads all input frames and only then combines them into one output frame and outputs it. Let the frame time of each input frame be T_f and the compositing time be T_p; the output of the output frame then lags the start of the first input, input 1, by up to 2T_f + T_p. At a frame rate of 60 fps, this is a delay of at least 33.3 ms.
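The conventional worst-case figure follows directly from the formula 2T_f + T_p; as a hedged sketch (the function names are ours, not from the disclosure):

```python
# Sketch of the conventional worst-case delay: all input frames are
# buffered before compositing, so the output frame lags the start of the
# first input by up to 2*T_f + T_p.
def conventional_delay_ms(fps: float, t_p_ms: float) -> float:
    t_f = 1000.0 / fps         # frame time T_f in milliseconds
    return 2.0 * t_f + t_p_ms  # maximum delay 2*T_f + T_p

# At 60 fps the buffering alone contributes about 33.3 ms:
print(round(conventional_delay_ms(60, 0.0), 1))  # 33.3
```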
 The device and method of the present disclosure form a system that receives a plurality of asynchronous videos and combines their images, characterized in that the compositing processes are started in order of earliest input timing, with earlier inputs placed toward the top of the screen.
 This embodiment describes the case of four inputs combined into one four-way-split output screen, with the inputs offset from one another by (1/4)T_f. The inputs are numbered 1, 2, 3, and 4 in order of input timing. In this case, the video signals of inputs 1 and 2 are displayed in the upper part of the display screen 24, and those of inputs 3 and 4 in the lower part. In this embodiment, therefore, when two of the four video signals have been input, compositing of those two signals is started. For simplicity, the blanking portion 22 and the border portion 23 of the video signal are ignored and only the display screen 24 portion of the signal is described.
 FIG. 4 shows a first compositing example of the present disclosure. When the input of input 2 is complete, compositing process (1) is started and output to the upper part of the display screen 24 begins.
 Next, when the input of input 4 is complete, compositing process (2) is started and output to the lower part of the display screen 24 begins.
 In this case, the maximum delay from the start of input 1 to the start of output of the upper display screen 24 is (5/4)T_f + T_p. This shortens the output delay by (3/4)T_f compared with the example shown in FIG. 3. For example, at a frame rate of 60 fps, the delay is about 21 ms + T_p.
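The improvement over the conventional method can be sketched in the same way (function names are ours, not from the disclosure): the worst case drops from 2T_f + T_p to (5/4)T_f + T_p, a saving of (3/4)T_f.

```python
# Sketch of the first example's worst-case delay, (5/4)*T_f + T_p, and of
# its saving of (3/4)*T_f over the conventional 2*T_f + T_p method.
def proposed_delay_ms(fps: float, t_p_ms: float) -> float:
    t_f = 1000.0 / fps
    return 1.25 * t_f + t_p_ms  # (5/4)*T_f + T_p

def saving_ms(fps: float) -> float:
    t_f = 1000.0 / fps
    return 0.75 * t_f           # (3/4)*T_f improvement over FIG. 3

print(round(proposed_delay_ms(60, 0.0), 1))  # 20.8 (the text's "about 21 ms")
print(round(saving_ms(60), 1))               # 12.5
```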
 Next, consider the case where the inputs are not each offset by exactly (1/4)T_f.
 As shown in FIG. 5, when the time difference T_in2toin4 between the ends of the input frames of input 2 and input 4 is longer than T_f/2, the start of compositing process (1) is delayed by at least T_in2toin4 - T_f/2 after input 2 completes, so that the output of the lower display screen 24 can follow the output of the upper display screen 24 (which combines inputs 1 and 2) in time. Alternatively, compositing process (1) may be performed first and the output of the upper output frame delayed by T_in2toin4 - T_f/2.
 As shown in FIG. 6, when the time difference T_in2toin4 is shorter than T_f/2, the start of compositing process (2) is delayed by T_f/2 - T_in2toin4 after input 4 completes, so that the output of the lower display screen 24 follows the output of the upper display screen 24 in time. Alternatively, compositing process (2) may be performed first and the output of the lower output frame delayed by T_f/2 - T_in2toin4.
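The two waiting rules of FIGS. 5 and 6 can be summarized in one hypothetical helper (the names and the use of millisecond units are our assumptions, not from the disclosure): the gap T_in2toin4 between the ends of inputs 2 and 4 is compared with T_f/2, and whichever half would otherwise run ahead is delayed.

```python
# Hypothetical helper combining the two waiting rules for the 2x2 layout:
# compare the gap T_in2toin4 between the ends of inputs 2 and 4 with T_f/2.
def start_waits(t_f: float, t_in2_to_in4: float) -> tuple:
    """Return (wait before process (1), wait before process (2))."""
    half = t_f / 2.0
    if t_in2_to_in4 > half:
        # FIG. 5 case: delay the upper half so the lower half can follow.
        return (t_in2_to_in4 - half, 0.0)
    # FIG. 6 case: delay the lower half until the upper half has finished.
    return (0.0, half - t_in2_to_in4)

print(start_waits(16.0, 10.0))  # (2.0, 0.0)
print(start_waits(16.0, 5.0))   # (0.0, 3.0)
```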
 Note that this embodiment shows an example in which four video signals, as in FIG. 2, are combined into one video signal with two screens above and two below. In the examples of FIGS. 5 and 6, the end-of-frame time difference used is therefore that between inputs 2 and 4, and it is compared against T_f/2. The end-of-frame time difference and its comparison value can, however, be any values determined by the number of video signals to be combined and the screen layout. For example, when six video signals are combined into one video signal with two screens at the top, two in the middle, and three at the bottom, the time difference between the ends of inputs 4 and 6 would be compared against T_f/3.
 Since an actual video signal contains overhead portions such as the blanking and border portions described above, the comparison values T_f/2 and T_f/3 are figures for the display screen 24 portion of the signal and must be corrected according to the overhead portion.
 With reference to FIG. 7, the case where the above method is pipelined is described. Let the pipelined compositing time be T_pp (Time of Pipelined Processing). Here, the pipelined compositing time denotes only the initial pipeline overhead (the time required for all processing, including data reading, before data is handed to the next stage); the compositing itself runs continuously in step with the input or output. The effective time of the pipelined compositing is the time to process one unit of data in the pipeline before the subsequent output stage. In this case, processing can be started so that the output of the output frame completes at the end time of the video signal input plus T_pp.
 Consider four inputs and a four-way-split single-screen output, with the input frames offset from one another by (1/4)T_f. The inputs are numbered 1, 2, 3, and 4 in order of input timing. For simplicity, the blanking portion 22 and the border portion 23 of the video signal are ignored and only the display screen 24 signal is described.
 Synthesis process (1) is started, and output to the upper part of the display screen 24 begins, so that the time T_pp after the input completion time T_2E of input 2 coincides with the output completion time T_UE of the upper part of the display screen 24.
 Synthesis process (2) is started, and output to the lower part of the display screen 24 begins, so that the time T_pp after the input completion time T_4E of input 4 coincides with the output completion time T_DE of the lower part of the display screen 24.
 In this case, the maximum delay from the input start time of input 1 to the output start time of the upper part of the display screen 24 is (3/4 T_f + T_pp). This shortens the output delay compared with the example shown in FIG. 3. For example, at a frame rate of 60 fps, the delay is 12.5 milliseconds + T_pp.
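As an informal check of the numbers above (not part of the patent text; the function and variable names are ours), the maximum delay 3/4 T_f + T_pp can be computed directly from the frame rate:

```python
def max_delay_ms(fps: float, t_pp_ms: float, stagger_fraction: float = 3 / 4) -> float:
    """Maximum delay from the start of the earliest input to the start of
    the upper-screen output: stagger_fraction * T_f + T_pp, where the
    frame time T_f is 1/fps (here in milliseconds)."""
    t_f_ms = 1000.0 / fps
    return stagger_fraction * t_f_ms + t_pp_ms

# 60 fps, ignoring pipeline overhead: 3/4 of 16.67 ms = 12.5 ms
print(round(max_delay_ms(60, 0.0), 1))  # -> 12.5
```

The 12.5 ms figure in the text is exactly 3/4 of the 60 fps frame time, before the pipeline overhead T_pp is added.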
 Next, the case where the inputs are not offset by exactly 1/4 T_f each is described.
 When the time difference T_in2toin4 between the ends of the input frames of input 2 and input 4 is longer than T_f/2, the start of synthesis process (1) is delayed for at least T_in2toin4 - T_f/2 after the input of input 2 is completed, so that the output of the lower part of the display screen 24 can follow the output of the upper part of the display screen 24 (which combines inputs 1 and 2) in time. Alternatively, synthesis process (1) may be performed first, and the output of the output frame may then be delayed for T_in2toin4 - T_f/2.
 When the time difference T_in2toin4 between the ends of the input frames of input 2 and input 4 is shorter than T_f/2, the start of synthesis process (2) is delayed for T_f/2 - T_in2toin4 after the input of input 4 is completed, so that the output of the lower part of the display screen 24 follows the output of the upper part in time. Alternatively, synthesis process (2) may be performed first, and the output of the output frame may then be delayed for T_f/2 - T_in2toin4.
 Note that this embodiment shows an example in which four video signals, as in FIG. 2, are combined into one video signal with two screens on top and two on the bottom. In the example above, therefore, the end-of-frame time difference is taken between input 2 and input 4, and it is compared against T_f/2. However, both the pair of inputs whose end-time difference is taken and the comparison threshold can be any values determined by the number of video signals to be combined and the screen layout. For example, when six video signals are combined into one video signal with two screens on top, two in the middle, and three on the bottom, the time difference between the ends of the input frames of input 4 and input 6 is used, and the threshold is T_f/3.
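The two-sided wait rule above can be sketched as follows (illustrative only; names such as `t_gap` and `rows` are not from the patent). Given the gap between the end of the last input of the upper row and the end of the last input overall, the extra wait falls on whichever synthesis process would otherwise run ahead:

```python
def synthesis_waits(t_gap: float, t_f: float, rows: int = 2):
    """Return (wait before synthesis (1), wait before synthesis (2)) for
    an end-of-frame gap t_gap and frame time t_f.  The threshold
    t_f / rows generalizes T_f/2 (two rows) to T_f/3 (three rows), etc."""
    threshold = t_f / rows
    if t_gap > threshold:
        # Lower row finishes late: hold back synthesis (1) so the upper
        # output does not finish too far ahead of the lower output.
        return (t_gap - threshold, 0.0)
    else:
        # Lower row finishes early: hold back synthesis (2) instead.
        return (0.0, threshold - t_gap)

# T_f = 16 ms, gap of 10 ms between input 2 and input 4 (> T_f/2 = 8 ms)
print(synthesis_waits(10.0, 16.0))  # -> (2.0, 0.0)
```

Equivalently, as the text notes, the wait may be applied after the synthesis process instead of before it; only the total delay matters.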
 Note that an actual video signal contains overhead such as the blanking portion and the border portion described above, so the comparison thresholds T_f/2 and T_f/3 are values relative to the display screen 24 portion of the signal and must be corrected according to that overhead.
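The patent does not specify the correction formula; one plausible form (our assumption) scales the ideal threshold by the fraction of the frame occupied by the active display area, e.g. active lines over total raster lines:

```python
def corrected_threshold(t_f: float, divisor: int,
                        active_lines: int, total_lines: int) -> float:
    """Scale the ideal threshold T_f/divisor by the active-area fraction
    of the frame, discounting blanking/border overhead.  This particular
    scaling rule is an illustrative assumption, not taken from the patent."""
    return (t_f / divisor) * (active_lines / total_lines)

# e.g. a 1080-line active picture in an 1125-line total raster, T_f = 16.67 ms
print(round(corrected_threshold(16.67, 2, 1080, 1125), 3))  # -> 8.002
```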
 FIG. 8 shows an example of the system configuration according to this embodiment. The video compositing device 10 according to this embodiment includes a detection unit 101, a crossbar switch 102, an up/down converter 103, a buffer 104, and a pixel compositing unit 105. The figure shows four inputs and one output, but any number of inputs and outputs may be used.
 The detection unit 101 detects, for N input frames, the order of arrival within the frame time. For example, it detects the input timings of the input frames of inputs 1, 2, 3, and 4 shown in FIGS. 4 and 5, and determines the order of inputs 1, 2, 3, and 4 from those timings.
 The crossbar switch 102 reorders its inputs according to the detection result from 101 and outputs them. For example, it arranges the input frames in the order of inputs 1, 2, 3, and 4 shown in FIGS. 4 and 5.
 The up/down converter 103 scales the number of pixels to an arbitrary size. For example, it enlarges or reduces the number of pixels of input 1 to match the screen size shown in FIG. 2.
 102 and 103 may be connected in the reverse order with respect to the inputs (a, b, c, d, ...). That is, inputs a, b, c, and d may first be scaled at 103 and then reordered into the order of inputs 1, 2, 3, and 4 at 102.
 The buffer 104 stores each input frame. It buffers the output of 103 or 102 and can release frames in any order.
 The pixel compositing unit 105 reads pixel data from 104 in output order over the entire output screen, combines it to generate an output frame, and outputs the frame. As a result, a video in which the four video signals are combined is displayed on the screen as shown in FIG. 2. The timing is as described above. 105 may also add an arbitrary control signal to the blanking portion 22 of the screen.
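The data path 101 → 102 → 103 → 104 → 105 can be caricatured as follows (a minimal sketch assuming simple per-frame processing and ignoring real-time constraints; all names are ours, not the patent's):

```python
from dataclasses import dataclass

@dataclass
class InputFrame:
    source: str          # e.g. "a", "b", "c", "d"
    arrival_time: float  # arrival time within the frame time
    pixels: list         # placeholder for pixel data

def composite(frames, scale):
    """Combine asynchronous input frames into one output frame."""
    # 101: detect arrival order; 102: reorder by that detection result
    ordered = sorted(frames, key=lambda f: f.arrival_time)
    # 103: up/down convert each frame to its slot size
    scaled = [scale(f) for f in ordered]
    # 104: buffer the scaled frames
    buffer = list(scaled)
    # 105: read out in output order and concatenate into one output frame
    return [p for f in buffer for p in f.pixels]

frames = [InputFrame("b", 0.25, [2]), InputFrame("a", 0.0, [1]),
          InputFrame("d", 0.75, [4]), InputFrame("c", 0.5, [3])]
print(composite(frames, scale=lambda f: f))  # -> [1, 2, 3, 4]
```

The earliest-arriving input ends up in the first (upper) slot of the output, which is what allows the output of the upper part of the screen to begin before the later inputs have finished arriving.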
 The device of the present disclosure can also be realized by a computer and a program, and the program can be recorded on a recording medium or provided through a network.
 The above embodiment shows an example of four inputs and a four-way split single-screen output, but the present disclosure is not limited to this and can be applied to any number of inputs. Likewise, the above embodiment mainly assumes that inputs 1 to 4 share the same frame time T_f, but the present disclosure is also applicable when inputs 1 to 4 have different frame times T_f.
(Effects of the present disclosure)
 By compositing and arranging the screens from the top in order of the input timing of the asynchronous video input signals and outputting the result, the delay until the composited output can be shortened. This enables collaborative work with strict low-latency requirements in systems that composite multiple screens from multiple sites.
 The present disclosure is applicable to the information and communication industry, which distributes video and game content, as well as to the movie, advertising, and game industries involved in video production.
10: Video compositing device
21: Scanning line
22: Blanking portion
23: Border portion
24: Display screen
101: Detection unit
102: Crossbar switch
103: Up/down converter
104: Buffer
105: Pixel compositing unit

Claims (5)

  1.  A video compositing device that:
      detects the input timing of each input frame constituting a plurality of video signals;
      when a set number of the plurality of video signals have been input, sequentially starts compositing processing of the set number of video signals; and
      generates an output frame in which the plurality of video signals are combined into one video signal.
  2.  The video compositing device according to claim 1, which:
      compares the time difference between the input completion time of the set number of the plurality of video signals and the input completion time of the last of the plurality of video signals against a time determined by the number of the plurality of video signals, the layout of the composite screen, or both;
      when the time difference is longer than that time, adjusts the timing of the compositing processing of the set number of video signals; and
      when the time difference is shorter than that time, adjusts the timing of the compositing processing of the remaining video signals among the plurality of video signals.
  3.  The video compositing device according to claim 1, which starts the compositing processing of the plurality of video signals so that the output of the output frame is completed at the point when the input of all of the plurality of video signals is complete and the compositing processing of the plurality of video signals is complete.
  4.  A video compositing method in which a video compositing device:
      detects the input timing of each input frame constituting a plurality of video signals;
      when a set number of the plurality of video signals have been input, sequentially starts compositing processing of the set number of video signals; and
      generates an output frame in which the plurality of video signals are combined into one video signal.
  5.  A program for causing a computer to realize each functional unit provided in the video compositing device according to any one of claims 1 to 3.
PCT/JP2020/047864 2020-12-22 2020-12-22 Device, method, and program for synthesizing video signals WO2022137325A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022570805A JPWO2022137325A1 (en) 2020-12-22 2020-12-22
PCT/JP2020/047864 WO2022137325A1 (en) 2020-12-22 2020-12-22 Device, method, and program for synthesizing video signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/047864 WO2022137325A1 (en) 2020-12-22 2020-12-22 Device, method, and program for synthesizing video signals

Publications (1)

Publication Number Publication Date
WO2022137325A1 true WO2022137325A1 (en) 2022-06-30

Family

ID=82159167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/047864 WO2022137325A1 (en) 2020-12-22 2020-12-22 Device, method, and program for synthesizing video signals

Country Status (2)

Country Link
JP (1) JPWO2022137325A1 (en)
WO (1) WO2022137325A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009265319A (en) * 2008-04-24 2009-11-12 Mitsubishi Electric Corp Video composition device
JP2015165628A (en) * 2014-03-03 2015-09-17 Smk株式会社 image processing system

Also Published As

Publication number Publication date
JPWO2022137325A1 (en) 2022-06-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966818

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022570805

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966818

Country of ref document: EP

Kind code of ref document: A1