WO2019188406A1

WO2019188406A1 - Subtitle generation device and subtitle generation program

Info

Publication number: WO2019188406A1
Application number: PCT/JP2019/010807
Authority: WO
Inventors: 英樹竹原
Original assignee: 株式会社Ｊｖｃケンウッド
Priority date: 2018-03-26
Filing date: 2019-03-15
Publication date: 2019-10-03
Also published as: JP2019169928A

Abstract

A subtitle generation device equipped with an image acquisition unit (summarized subtitle generation units 41-4n) and a subtitle summarization unit (403). The image acquisition unit acquires an image. The subtitle summarization unit (403) generates summarized subtitles which summarize the text data pertaining to the image, according to the number of images displayed on a display unit (5) or the display image size expressing the size of the images displayed on the display unit (5).

Description

Subtitle generating apparatus and subtitle generating program

This disclosure relates to a caption generation device and a caption generation program.

In recent years, the number of video channels that can be viewed by users, such as terrestrial or satellite television broadcasting, Internet broadcasting, and video distribution websites, has increased.

JP 2002-223399 A JP 2010-81149 A JP 2013-183217 A

When displaying video on the display unit, audio transmitted along with the video may be displayed on the display unit as subtitles. In addition to the video, characters for supplementing the video may be displayed as subtitles on the display unit. It is required to display subtitles in a manner corresponding to the display state of the video of one or more channels displayed on the display unit.

Embodiments are intended to provide a caption generation device and a caption generation program capable of generating captions in a manner corresponding to the display state of video of one or more channels displayed on a display unit.

According to the first aspect of the embodiment, the text data related to the video is obtained according to the number of videos displayed on the display unit or the display video size indicating the size of the video displayed on the display unit. Provided is a caption generation device including a caption summary unit that generates a summary caption.

According to the second aspect of the embodiment, the map scale setting unit for setting the scale of the map displayed on the display unit, and the text data related to the video displayed on the display unit according to the scale of the map A caption generation device is provided that includes a caption summary unit that generates a summary caption.

According to the third aspect of the embodiment, a summary subtitle summarizing text data related to the video is generated according to the number of displayed videos or the display video size indicating the size of the displayed video A caption generation program for causing a computer to execute a caption summarization step is provided.

According to the fourth aspect of the embodiment, the map scale setting step for setting the scale of the displayed map, and the caption summary for generating the summary caption that summarizes the text data related to the video according to the map scale. There is provided a caption generation program for causing a computer to execute the steps.

According to the caption generation device and the caption generation program of the embodiment, captions can be generated in a manner corresponding to the number of channels of one or a plurality of videos displayed on the display unit.

FIG. 1 is a block diagram showing a video display device configured to include the caption generation device of the first embodiment. FIG. 2 is a block diagram illustrating a specific configuration example of the summary caption generation unit in FIG. FIG. 3A is a diagram illustrating an example of a one-screen mode in which an image is displayed on the display unit. FIG. 3B is a diagram illustrating an example of a two-screen mode in which an image is displayed on the display unit. FIG. 3C is a diagram illustrating an example of a four-screen mode in which an image is displayed on the display unit. FIG. 4A is a diagram showing the channel importance level and the summary level in the single screen mode in a tabular format. FIG. 4B is a diagram showing the channel importance level and the summary level in the two-screen mode in a tabular format. FIG. 4C is a diagram showing the channel importance level and the summary level in the 4-screen mode in a tabular format. FIG. 5A is a diagram illustrating an example of a picture-in-picture mode in which video is displayed on the display unit. FIG. 5B is a diagram illustrating an example of a picture-out-picture mode in which video is displayed on the display unit. FIG. 6A is a diagram showing, in a tabular form, channel importance levels and summarization degrees in the picture-in-picture mode. FIG. 6B is a diagram showing in a tabular form the channel importance and the summarization degree in the picture-out-picture mode. FIG. 7 is a flowchart illustrating the operation of the caption generation device according to the first embodiment and the process that the caption generation program according to the first embodiment causes the computer to execute. FIG. 8 is a block diagram illustrating a video transmission / reception system including a map display device configured to include the caption generation device of the second embodiment. FIG. 9 is a block diagram illustrating a specific configuration example of a map display device configured to include the caption generation device of the second embodiment. FIG. 10A is a diagram illustrating an example of display state transition when the map scale is changed from 1/10000 to 1 / 50,000. FIG. 10B is a diagram illustrating an example of display state transition when the scale of the map is changed from 1 / 50,000 to 1 / 100,000. FIG. 10C is a diagram illustrating an example of transition of the display state when the scale of the map is changed from 1/10000 to 1 / 200,000. FIG. 11 is a diagram illustrating another display method of captions in the caption generation device of the second embodiment. FIG. 12A is a diagram illustrating another example of display state transition when the map scale is changed from 1/10000 to 1 / 50,000. FIG. 12B is a diagram illustrating another example of the transition of the display state when the scale of the map is changed from 1 / 50,000 to 1 / 100,000. FIG. 12C is a diagram illustrating another example of the transition of the display state when the scale of the map is changed from 1/10000 to 1 / 200,000. FIG. 13 is a diagram showing an example in a tabular form in which the summarization degree is set according to the number of display channels that is the number of camera videos displayed in the map. FIG. 14 is a diagram showing an example of setting a summary level according to a map scale in a table format. FIG. 15 is a flowchart illustrating the operation of the caption generation device according to the second embodiment and the process that the caption generation program according to the second embodiment causes the computer to execute. FIG. 16 is a block diagram showing a posted moving image distribution system including the caption generation device according to the third embodiment. FIG. 17A is a diagram illustrating a first display state of a display unit included in a computer that receives a moving image or the like distributed from a content server of the posted moving image distribution system. FIG. 17B is a diagram illustrating a second display state of the display unit included in the computer that receives a moving image or the like distributed from the content server of the posted moving image distribution system.

Hereinafter, the caption generation device and the caption generation program of each embodiment will be described with reference to the accompanying drawings.

<First Embodiment>
FIG. 1 shows a video display device 10 configured to include the caption generation device of the first embodiment. In FIG. 1, multimedia streams of channels 1 to n are input to input terminals 41t to 4nt of summary caption generation units 41 to 4n, respectively. The multimedia stream includes a video stream and an audio stream. The video stream includes video data, and the audio stream includes audio data.

Any summary subtitle generation unit of summary subtitle generation units 41 to 4n is referred to as summary subtitle generation unit 4, and any input terminal of input terminals 41t to 4nt is referred to as input terminal 4t. n is an integer of 2 or more. The multimedia stream input to the input terminals 41t to 4nt is distributed from an arbitrary content distribution source such as a terrestrial or satellite television broadcast, an Internet broadcast, and a moving image distribution website. In addition, a multimedia stream in which a video shot with a smartphone or a video camera is edited as necessary with a personal computer, a smartphone, a video camera, or the like may be distributed.

Note that since the multimedia stream including the video stream and the audio stream is input to the summary subtitle generation unit 4, the summary subtitle generation unit 4 is a video acquisition unit that acquires video or audio that acquires audio related to video. Functions as an acquisition unit.

The channel number setting unit 1 sets the number of channels for displaying video based on the video data of the multimedia stream on the display unit 5. When the user operates the operation unit 6, the channel number setting unit 1 may set the number of channels. The number of channels is the number of channels displayed on the display unit 5 among the images of channels 1 to n, and is the number of channels set from 1 to the maximum number of channels. As an example, it is assumed here that the maximum number of channels is set to four. The number of channels may be fixed.

The channel importance setting unit 2 sets the importance of each channel. In the channel importance setting unit 2, the same importance may be set in advance for all channels. The channel importance setting unit 2 may set the importance higher as the channel number is smaller. When the user operates the operation unit 6, the channel importance level setting unit 2 may set the importance level of each channel. As will be described later, the channel importance setting unit 2 may automatically set the importance of each channel in accordance with the display mode selected by the user by operating the operation unit 6.

The summarization degree setting unit 3 displays the video of each channel on the display unit 5 according to the number of channels set by the channel number setting unit 1 and the importance of each channel set by the channel importance setting unit 2. Set the subtitle summary level. The subtitle summarization degree is an index indicating the degree of summarization of the number of characters of text data to be displayed as subtitles on the display unit 5.

要約 Define the degree of summarization with equation (1). In the first embodiment and the second embodiment to be described later, in order to facilitate understanding, as can be seen from equation (1), the summarization degree in a state where the number of characters of text data is not reduced (not summarized) is set to 100, and the number of characters The amount of summarization is reduced as the amount of reduction increases. That is, the summarization degree here indicates the remaining rate of text data to be displayed as subtitles.

Summarization degree = (number of characters in summarized text data / number of characters in text data) × 100 (1)

The summary degree setting signal indicating the summary degree of each channel set by the summary degree setting unit 3 is supplied to each summary caption generation unit 4. Here, since the number of channels is set to 4, the summary level setting unit 3 may supply the summary level setting signal to the summary caption generation units 41 to 44.

Each summary caption generation unit 4 generates text data (caption data) that is a caption displayed on the display unit 5 together with the video of each channel, based on the audio data accompanying the video data of the multimedia stream. Each summary caption generation unit 4 summarizes the text data according to the input summary degree setting signal and generates summary caption data. Both subtitle data that does not reduce the number of characters of text data and subtitle data that reduces the number of characters of text data may be referred to as summary subtitle data.

Each summary caption generation unit 4 supplies video data and audio data included in the multimedia stream and summary caption data to the display unit 5. The display unit 5 includes a drawing unit 51, a display panel 52, an audio processing circuit 53, and a speaker 54. The audio processing circuit 53 and the speaker 54 may be provided outside the display unit 5.

Here, each summary caption generation unit 4 stores the input multimedia stream and the generated summary caption data, and stores them in response to a request from the display unit 5 such as VOD (Video on demand). It is also possible to supply the multimedia stream and summary caption data that have been used to the display unit 5.

The drawing unit 51 draws video data and summary caption data of each channel. The display panel 52 displays a video based on the video data and summary caption data drawn on the drawing unit 51. The video data of each channel and the summary caption data may be reduced.

The audio processing circuit 53 performs D / A conversion on the selected audio data among the audio data of the channels 1 to 4 and supplies an analog audio signal to the speaker 54. The speaker 54 outputs sound based on the input analog sound signal. The sound data output as sound by the speaker 54 may be fixed as the sound data of the channel 1, or the user may be configured to select the sound data of any channel by the operation unit 6.

When the user operates the operation unit 6, the display unit 5 displays the video data selected from the video data and the summary subtitle data supplied from the summary subtitle generation units 41 to 44 and the video based on the summary subtitle data. The display mode can be switched. For example, the display unit 5 displays a display mode in which only video data and summary subtitle data from one summary subtitle generation unit 4 are displayed, and a plurality of video data and summary subtitle data from a plurality of summary subtitle generation units 4. It is possible to switch between display modes for simultaneously displaying video. Details of the display mode will be described later.

A specific configuration example of the summary caption generation unit 4 will be described with reference to FIG. As shown in FIG. 2, the summary caption generation unit 4 includes an audio stream acquisition unit 401, an audio recognition unit 402, a caption summary unit 403, and a multiplexing unit 404.

The audio stream acquisition unit 401 acquires an audio stream from the multimedia stream input to the input terminal 4t. The audio stream acquisition unit 401 supplies the input multimedia stream to the multiplexing unit 404 and supplies the audio stream to the audio recognition unit 402.

The voice recognition unit 402 recognizes voice data included in the voice stream, generates text data, and supplies the text data to the caption summarization unit 403. The caption summarization unit 403 receives a summary degree setting signal. The caption summarizing section 403 summarizes the text data according to the summarization degree indicated by the summarization degree setting signal, and generates summary text data. The caption summarizing unit 403 supplies both the text data before summarization and the summary text data to the multiplexing unit 404 as summary caption data.

The subtitle summarization section 403 generates summary subtitle data using a representative extraction type summary as a summary technique. For example, the caption summary unit 403 extracts words with high appearance frequency included in the text data as important words, and generates summary caption data. The caption summary unit 403 may use a generation summary instead of the extraction summary. In the generation type summary, for example, summary caption data is generated using an expression different from the text data, such as paraphrasing, generalizing, or rearranging the text based on the content of the text data.

The multiplexing unit 404 multiplexes the video data and audio data included in the multimedia stream supplied from the audio stream acquisition unit 401 and the summary caption data supplied from the caption summary unit 403 in synchronization. The multiplexing unit 404 supplies the multiplexed data to the display unit 5.

An example of a display mode for displaying the video images of channels 1 to 4 on the display unit 5 will be described with reference to FIGS. 3A to 3C. 4A to 4C show examples of channel importance, summarization degree, and display video size of each channel in the display modes of FIGS. 3A to 3C. It is assumed that the video data of each channel is full HD and has 1920 pixels in the horizontal direction and 1080 pixels in the vertical direction, and the display panel 52 is a full HD panel. The display video size indicates the size of the video data displayed on the display unit 5.

FIG. 3A shows a display mode in which only the video V1 of channel 1 is displayed on the display panel 52 in full screen. The display mode shown in FIG. 3A will be referred to as a single screen mode. A caption ST1 based on the summary caption data of channel 1 is displayed near the lower end of the video V1. Subtitle ST1 is displayed in one or more lines. The number of characters and the number of lines in one line of the subtitle ST1 and subtitles ST2 to ST4 described later are set to the number of characters and the number of lines that allow the user to recognize the subtitle according to the size and resolution of the display panel 52.

At this time, as shown in FIG. 4A, the channel importance level setting unit 2 sets the importance level of the channel 1 to 100, and the channels 2 to 4 are not displayed, so the importance levels of the channels 2 to 4 are set to 0. . The summarization degree setting unit 3 sets the summarization degree of subtitle data of channel 1 to 100, and sets the summarization degree of subtitle data of channels 2 to 4 to 0.

In the single screen mode, the display video size of the video of channel 1 is 1920 pixels in the horizontal direction and 1080 pixels in the vertical direction. In the one-screen mode, the text data generated by the speech recognition unit 402 is not reduced and is displayed as the caption ST1.

FIG. 3B shows a display mode in which the display video size of the video V1 of channel 1 and the video V2 of channel 2 is reduced and displayed on the display panel 52 side by side. The display mode shown in FIG. 3B is referred to as a two-screen mode. A caption ST1 based on the summary caption data of channel 1 is displayed near the lower end of the video V1, and a caption ST2 based on the summary caption data of channel 2 is displayed near the lower end of the video V2.

At this time, as shown in FIG. 4B, the channel importance setting unit 2 sets the importance of the

channels

1 and 2 to 100, and the

channels

3 and 4 are not displayed, so the importance of the

channels

3 and 4 is set to 0. Set. The summarization degree setting unit 3 sets the summarization degree of the caption data of

channels

1 and 2 to 70, and sets the summarization degree of the caption data of

channels

3 and 4 to 0.

In the 2-screen mode, the display video size of the video of

channels

1 and 2 is 960 pixels in the horizontal direction and 540 pixels in the vertical direction. In the two-screen mode, subtitles ST1 and ST2 based on summary subtitle data obtained by reducing the text data generated by the speech recognition unit 402 by 30% are displayed.

In the 2-screen mode, there are more video channels displayed on the display panel 52 than in the 1-screen mode. Accordingly, the area where the subtitles ST1 and ST2 are displayed in the two-screen mode is narrower than the area where the subtitle ST1 is displayed in the one-screen mode. The subtitle ST1 in the two-screen mode is a subtitle in which text data is reduced compared to the subtitle ST1 in the one-screen mode.

FIG. 3C shows a display mode in which the display image size of the images V1 to V4 of the channels 1 to 4 is reduced and displayed side by side on the display panel 52. The display mode shown in FIG. 3C will be referred to as a 4-screen mode. Subtitles ST1 to ST4 based on the summary caption data of channels 1 to 4 are displayed near the lower ends of the videos V1 to V4.

At this time, as shown in FIG. 4C, the channel importance setting unit 2 sets the importance of the channels 1 to 4 to 100. The summarization degree setting unit 3 sets the summarization degree of subtitle data of channels 1 to 4 to 25.

In the 4-screen mode, the display video size of the video of channels 1 to 4 is the same as the display video size of the video of

channels

1 and 2 in the 2-screen mode, which is 960 pixels in the horizontal direction and 540 pixels in the vertical direction. In the 4-screen mode, subtitles ST1 to ST4 based on summary subtitle data obtained by reducing the text data generated by the speech recognition unit 402 by 75% are displayed.

In the 4-screen mode, there are more video channels displayed on the display panel 52 than in the 2-screen mode. Accordingly, the area where the subtitles ST1 to ST4 are displayed in the 4-screen mode is narrower than the area where the subtitles ST1 and ST2 are displayed in the 2-screen mode. The subtitles ST1 and ST2 in the 4-screen mode are subtitles with text data reduced compared to the subtitles ST1 and ST2 in the 2-screen mode.

5A and 5B, another example of the display mode for displaying the video images of channels 1 to 4 on the display unit 5 will be described. 6A and 6B show examples of channel importance, summarization degree, and display video size of each channel in the display modes of FIGS. 5A and 5B.

FIG. 5A shows a display mode in which the channel 1 video V1 is displayed on the display panel 52 in a full screen, and the display video size of the channel 2 video V2 is reduced and superimposed on the video V1. The display mode shown in FIG. 5A is referred to as a picture-in-picture mode (hereinafter referred to as PIP mode). A caption ST1 based on summary caption data of channel 1 is displayed near the lower end of the video V1, and a caption ST2 based on summary caption data of channel 2 is displayed near the lower end of the video V2.

At this time, as shown in FIG. 6A, the channel importance level setting unit 2 sets the importance level of the channel 1 to 100, sets the importance level of the channel 2 to 25, and the

channels

3 and 4 are not displayed. Set the importance of 3 and 4 to 0. Summarization degree setting unit 3 sets the summarization degree of subtitle data of channel 1 to 100, sets the summarization degree of subtitle data of channel 2 to 25, and sets the summarization degree of subtitle data of

channels

3 and 4 to 0. .

In the PIP mode shown in FIG. 5A, the display image size of the channel 1 image is 1920 pixels in the horizontal direction and 1080 pixels in the vertical direction, and the display image size of the image in channel 2 is 960 pixels in the horizontal direction and 540 pixels in the vertical direction. However, since the channel 2 image is superimposed on the channel 1 image, the region of 960 pixels in the horizontal direction and 540 pixels in the vertical direction of the channel 1 image is not displayed.

In the PIP mode, the text data of the channel 1 generated by the speech recognition unit 402 is displayed as subtitle ST1 without being reduced, and the subtitle ST2 based on the summary subtitle data in which the text data of the channel 2 generated by the speech recognition unit 402 is reduced by 75%. Is displayed.

In the PIP mode, the display video size of the video of channel 2 is smaller than that of channel 1. Accordingly, the area in which the subtitle ST2 is displayed in the PIP mode is narrower than the area in which the subtitle ST1 is displayed. The subtitle ST2 in the PIP mode is a subtitle in which text data is reduced compared to the subtitle ST1.

FIG. 5B shows a display mode in which the display video size of the video V1 of the channel 1 is reduced to the video V1, and the display video size of the videos V2 to V4 of the channels 2 to 4 is reduced and displayed outside the video V1. The display mode shown in FIG. 5B is referred to as a picture-out-picture mode (hereinafter referred to as POP mode). Subtitle ST1 based on summary caption data of channel 1 is displayed near the lower end of video V1, and subtitles ST2 to ST4 based on summary subtitle data of channels 2 to 4 are displayed near the lower end of video V2 to V4. ing.

At this time, as shown in FIG. 6B, the channel importance level setting unit 2 sets the importance level of the channel 1 to 100, and sets the importance levels of the channels 2 to 4 to 11. Summarization degree setting unit 3 sets the summarization degree of caption data of channel 1 to 56, and sets the summarization degree of caption data of channels 2 to 4 to 6.

In the POP mode shown in FIG. 6B, the display video size of the video of channel 1 is 1440 pixels in the horizontal direction and 810 pixels in the vertical direction, and the display video size of the video of channels 2 to 4 is 480 pixels in the horizontal direction and 270 pixels in the vertical direction. is there.

In the POP mode, the subtitle ST1 based on the summary subtitle data generated by the speech recognition unit 402 and reduced by 44% in the channel 1 text data is displayed, and the subtitle based on the summary subtitle data obtained by reducing the text data in the channels 2 to 4 by 94%. ST2 to ST4 are displayed.

In the POP mode, the display video size of the video of channel 1 is smaller than that of channel 1 in the single screen mode of FIG. 3A or the PIP mode of FIG. 5A. Accordingly, the area in which the subtitle ST1 is displayed in the POP mode is narrower than that in the single-screen mode or the PIP mode. The subtitle ST1 in the POP mode is a subtitle in which text data is reduced as compared with the subtitle ST1 in the one-screen mode or the PIP mode of FIG. 5A.

In the POP mode, the display video size of the video of channels 2 to 4 is smaller than that of channel 1. Accordingly, the area where the subtitles ST2 to ST4 are displayed in the POP mode is narrower than the area where the subtitle ST1 is displayed. Subtitles ST2 to ST4 in the POP mode are subtitles in which text data is reduced compared to the subtitle ST1.

The operation of the caption generation device of the first embodiment will be described using the flowchart shown in FIG. In FIG. 7, when the process is started, the channel number setting unit 1 sets the number of channels in step S101. Here, it is assumed that the number of channels is fixed. In step S102, the channel importance level setting unit 2 sets the importance level of each channel. In step S103, the summary level setting unit 3 sets the summary level.

In step S104, the audio stream acquisition unit 401 separates the video stream and the audio stream and acquires the audio stream. Since the audio stream acquisition unit 401 acquires a video stream in step S104, step S104 is a video acquisition step. In step S105, the speech recognition unit 402 recognizes speech data included in the speech stream and generates text data. In step S106, the caption summary unit 403 generates summary caption data with the set summary degree. Step S106 is a caption summarization step.

In step S107, the multiplexing unit 404 multiplexes the video data, audio data, and summary caption data included in the video stream. In step S108, the display unit 5 displays video and outputs audio based on the multiplexed data.

The channel importance level setting unit 2 and the summary level setting unit 3 determine whether or not an instruction to change the display mode has been given in step S109. If an instruction to change the display mode is given (YES), the processing of steps S102 to S109 is repeated. If an instruction to change the display mode is not given (NO), the summary subtitle generating unit 4 determines whether or not the multimedia stream is continuously input in step S110.

If the multimedia stream is continuously input (YES), the processes in steps S104 to S110 are repeated. If the multimedia stream is not continuously input (NO), the summary subtitle generating unit 4 ends the process.

As described above, the caption generation device according to the first embodiment includes the summary degree setting unit 3 and the caption summary unit 403. The summarization degree setting unit 3 is related to the video of each channel according to the number of channels of one or more channels displayed on the display unit 5 (display panel 52) or the display video size of the video of each channel. The summarization degree of subtitles of each channel displayed on the display unit 5 is set. The subtitle summarizing section 403 summarizes the subtitles of each channel according to the subtitle summarization levels set by the summarization degree setting section 3 and generates summary subtitles.

According to the caption generation device of the first embodiment, captions can be generated in a manner corresponding to the display state of the video of one or a plurality of channels displayed on the display unit 5. The video display device 10 that includes the caption generation device of the first embodiment and displays the video of each channel in any one of a plurality of display modes corresponds to the display state of the video of one or more channels A mode subtitle can be displayed.

Each unit of the video display device 10 shown in FIG. 1 and each unit of the summary subtitle generating unit 4 shown in FIG. 2 may be configured by hardware such as an integrated circuit, or may be configured by software (computer program). Good. Use of hardware and software is optional. The flowchart shown in FIG. 7 may be processing that the subtitle generation program of the first embodiment causes the computer to execute.

The caption generation program may be transmitted to the video display device 10 via a network such as the Internet, or may be stored in a non-temporary storage medium and provided to the video display device 10.

In FIG. 1, the following configuration can be additionally provided. For example, the number of channels is 16, and the display unit 5 displays a 16-channel reduced video and corresponding subtitles. The line-of-sight detection device selects four channels that are determined to be viewed with interest by the user by detecting the line of sight of the user. As shown in FIG. 3C, the display unit 5 displays the selected 4-channel reduced video and subtitles corresponding thereto.

The line-of-sight detection device selects one channel that is determined to be viewed with interest by the user by detecting the line of sight of the user. As shown in FIG. 3A, the display unit 5 displays the selected one-channel video and subtitles corresponding thereto.

Second Embodiment
FIG. 8 shows a video transmission / reception system including a map display device 20 configured to include the caption generation device of the second embodiment. The map display device 20 is connected to a network 50 such as the Internet. The video cameras 301 to 30n and the map providing server 40 are also connected to the network 50. Here, n is an integer of 2 or more. Any video camera among the video cameras 301 to 30n is referred to as a video camera 30.

The video shot by one or more video cameras 30 is video of one or more channels.

A specific configuration example and operation of the map display device 20 will be described with reference to FIG. In FIG. 9, the network interface 201 of the map display device 20 is connected to the network 50. The map display device 20 receives the map data provided from the map providing server 40, the video data transmitted by each video camera 30, and the metadata. Specifically, the map display device 20 receives from each video camera 30 video data captured by each video camera 30 and metadata which is data describing information related to the video data.

Here, it is assumed that the map display device 20 acquires video data and metadata by using WEB API (WEB Application Programming INTERFACE) provided by each video camera 30. The WEB API is an interface that the program of one device calls and uses the function provided by the program of the other device via a network. Each metadata is generated and recorded by each video camera 30.

In the metadata, various types of information related to video data are described as text data. The metadata includes, for example, position information of the video camera 30 (that is, shooting location), shooting date, photographer information, producer information, camera number, camera priority, shooting purpose, title, shooting overview, caster name, Arbitrarily entered sentences such as the names of characters are included. The metadata includes at least position information of the video camera 30. The position information of the video camera 30 may include the name of the shooting location such as Tokyo Station or Tokyo International Airport, in addition to the latitude and longitude of the shooting location.

Specifically, the shooting date is information indicating the date and time when the video data was shot. The photographer information is, for example, a photographer's name or an ID that identifies the photographer. The producer information is, for example, a broadcast station name or an ID for identifying the broadcast station. The camera number is a number assigned to each video camera 30. The camera number may be a serial code, for example. The camera priority is information indicating the priority of video data to be displayed. The shooting purpose is, for example, program shooting, landscape shooting, interview, or the like. The title is, for example, a program name of video data.

The shooting summary includes position information (ie, shooting location), shooting date, photographer information, producer information, camera number, camera priority, camera serial code, shooting purpose, title, caster name, and character name. Assume that it is summarized and described with other information. In the caster name and the character name, a person name or the like suitable for each purpose is described. Although the metadata includes the position information of the video camera 30, the metadata does not include the position information of the video camera 30, and the position information of the video camera 30 and the metadata including other contents are separately provided. It is good also as data.

The camera video acquisition unit 202 acquires video data transmitted from each video camera 30. The metadata extraction unit 203 acquires metadata. As described above, the camera image acquisition unit 202 and the metadata extraction unit 203 acquire the image data and metadata transmitted from each video camera 30 using the WEB API provided by each video camera 30. May be. The map data acquisition unit 209 acquires map data. The user can change the center position of the map displayed on the display panel 2131 of the display unit 213 by operating the operation unit 214, and can also change the scale of the map.

The map center position setting unit 216 sets the center position of the map displayed on the display panel 2131 according to the operation of the operation unit 214 by the user, and the map scale setting unit 217 is set according to the operation of the operation unit 214 by the user. The scale of the map displayed on the display panel 2131 is set. The map data acquisition unit 209 acquires a map of the set center position and scale from the map providing server 40.

The metadata of each video camera 30 acquired by the metadata extraction unit 203 is supplied to the camera position acquisition unit 204 and the caption information acquisition unit 205. The camera position acquisition unit 204 acquires the latitude / longitude of the shooting location included in the position information of the metadata. In other words, the camera position acquisition unit 204 is a position acquisition unit that acquires position information indicating a position where a video is taken. The subtitle information acquisition unit 205 acquires text data displayed as subtitles on the display panel 2131 among the text data described in the metadata, and supplies the text data to the subtitle summary unit 208.

It should be noted that the position information of the video camera 30 can be acquired through a route different from the metadata. For example, if a video camera 30 that does not have a GPS acquisition function is connected to a router that has a GPS acquisition function, the router may notify the map display device 20 of location information via a route different from the metadata. Good. In this case, the metadata extraction unit 203 acquires the position information using the WEB API provided by the router. Here, the connection status between the video camera 30 and the router is provided by WEB API information provided by the router, the video display unit 20 associates the video camera 30 with the router in the map display device 20, and the router location information is obtained from the video camera 30. It is position information.

The text data displayed as subtitles may be the name of the shooting location, the shooting purpose, or an arbitrarily entered sentence, and is arbitrary. Text data to be displayed as subtitles may be determined in advance, or may be configured so that the user can select by operating the operation unit 214. When the metadata includes only position information of the video camera 30, the shooting location is displayed as a caption.

The display area inside / outside determination unit 206 receives information indicating the map center position set by the map center position setting unit 216 and the map scale set by the map scale setting unit 217. The display area inside / outside determination unit 206 has information on the screen size of the display panel 2131. The display area inside / outside determination unit 206 determines whether each video camera 30 is located within the display area of the map displayed on the display panel 2131 based on the input information indicating the center position of the map and the scale of the map. It is determined whether it is located outside the display area. The display area inside / outside determination unit 206 functions as a determination unit that determines the number of videos to be displayed on the display unit 213 from the scale of the map and the position information indicating the position where the video was shot.

The display area inside / outside determination unit 206 supplies information on the number of video cameras 30 located within the display area of the map displayed on the display panel 2131 among the video cameras 301 to 30n to the summary degree setting unit 207. . The summarization degree setting unit 207 sets the summarization degree according to the number of video cameras 30 located within the map display area.

The subtitle summarizing section 208 summarizes the text data supplied from the subtitle information acquiring section 205 according to the summarization degree set by the summarization degree setting section 207, and generates summary subtitle data.

In the first embodiment, the caption summary unit 403 creates summary caption data using an extraction summary or a generation summary according to the degree of summary. In the second embodiment, the caption summarizing unit 208 is configured to display position information, shooting date, photographer information, producer information, camera number, camera priority, shooting purpose, title, shooting outline, caster according to the degree of summarization. Summarize by selecting one or more items such as names and characters.

For example, when the summarization degree is 100, the caption summarization unit 208 selects 10 items from the above-described items. The caption summary unit 208 generates summary caption data including information on the selected 10 items. Moreover, when the summarization degree is 20, the caption summarizing unit 208 selects two items from the above-described items. The caption summary unit 208 generates summary caption data including information on the two selected items.

Furthermore, in the caption summary unit 208 or other parts, priorities may be set in advance for each item included in the metadata or the like. The caption summary unit 208 may be configured to select an item with a higher priority based on the priority.

Also in the second embodiment, similar to the first embodiment, the caption summarizing unit 208 may be configured to create summary caption data using an extraction summary or a generation summary according to the degree of summarization. In particular, when the shooting summary is included in the metadata, the caption summary unit 208 may create summary caption data from the shooting summary using the extraction type summary or the generation type summary according to the degree of summary.

Note that the number of characters in the text data may not be reduced as in the first embodiment. The summary caption data based on the text data supplied from each video camera 30 is supplied to the image composition unit 212.

The video reduction unit 210 reduces the video data supplied from each video camera 30 to generate reduced video data (thumbnail image). The video reduction rate instruction unit 211 instructs the video reduction unit 210 to reduce the video data. The reduction ratio may be fixed. When the number of pixels of one frame of video data from each video camera 30 is sufficiently smaller than the number of pixels of the display panel 2131, the video data may not be reduced.

The video reduction rate instruction unit 211 may be supplied with information on the number of video cameras 30 located in the display area of the map displayed on the display panel 2131 from the display area inside / outside determination unit 206. The video reduction rate instructing unit 211 may change the video data reduction rate in accordance with the input information on the number of video cameras 30.

The image composition unit 212 synthesizes the map data supplied from the map data acquisition unit 209, the reduced video data supplied from the video reduction unit 210, and the summary caption data supplied from the caption summary unit 208. Here, the reduced video data is synthesized based on the position information of the video camera 30 so as to be arranged at the position indicated by the position information of the video camera 30 on the map data. The image composition unit 212 supplies the composite image data to the display unit 213.

With the above configuration and operation, the video camera 30 located in the display area transmits to the display panel 2131 the map at a predetermined scale of the predetermined display area that the user has operated to display on the display panel 2131. The reduced video data and the summary caption related to the video are superimposed and displayed. This makes it possible to visualize who is shooting what video for what purpose at which position on the map.

In FIG. 9, the illustration of the configuration relating to reception of audio data and output of audio is omitted. The map display device 20 may be configured to receive audio data of sound collected at the time of shooting by each video camera 30, reproduce the audio data of the selected video camera 30, and output from the speaker.

In this case, the metadata extraction unit 203 functions as a voice acquisition unit that acquires voice data and a voice recognition unit that generates text data related to the video based on the acquired voice data. That is, each video camera 30 records the sound collected at the time of shooting the video data as sound data. Each video camera 30 transmits video data and audio data to the map display device 20 via the network 50. The metadata extraction unit 203 acquires audio data transmitted from each video camera 30. The metadata extraction unit 203 recognizes the acquired voice data to generate text data, and supplies the text data to the camera position acquisition unit 204 and the caption information acquisition unit 205 as metadata.

By adopting such a configuration, metadata can be created from sound collected at the time of shooting video data, and it is not necessary to prepare metadata in advance, so that the burden on the photographer is reduced.

In addition, each video camera 30 is not limited to recording the sound collected when shooting the video data as the sound data. Each video camera 30 may record, as audio data, sound collected when the video data is not captured, in addition to the sound collected when the video data is captured. For example, the video camera 30 collects sound including information such as an outline of shooting other than during shooting and records it as audio data. The metadata extraction unit 203 may acquire the voice data and generate text data based on the voice data. In other words, the metadata extraction unit 203 may generate text data based on audio data associated with video data.

The display order setting unit 215 sets the display order when the reduced video data in the plurality of video cameras 30 overlap. When the priority of the camera of the video camera 30 is included in the metadata, the display order of the reduced video data captured by the camera with the higher priority may be increased based on the priority of the camera. When the user operates the operation unit 214, the display order setting unit 215 may change the display order.

An example of display state transition when the map scale is changed will be described with reference to FIGS. 10A to 10C. In FIG. 10A, only the video camera 30 with the camera number 01 is located in the map M1 in a state where the map M1 with a scale of 1/10000 is displayed on the display panel 2131. On the map M1, the camera video Ci1 and the subtitle CST1 of the camera video Ci1 are superimposed and displayed. The subtitle CST1 is displayed outside the camera video Ci1.

When the map M1 with a scale of 1/10000 is displayed on the display panel 2131, for example, the summarization degree is set to 100. Based on the degree of summarization, position information, shooting date, photographer information, producer information, camera number, camera priority, shooting purpose, title, caster name, character name, and summary caption data including 10 items Is generated and the summary caption data is displayed as captions. The camera video indicates the reduced video data described above, and the same applies to the following.

When the scale of the map is changed to 1 / 50,000, the video camera 30 with the camera numbers 01 to 03 is positioned in the map M5 in a state where the map M5 of 1 / 50,000 scale is displayed on the display panel 2131. . Here, the center position of the map M1 is displaced to the right of the map M5. On the map M5, camera video images Ci1 to Ci3 and captions CST1 to CST3 of the camera video images Ci1 to Ci3 are displayed in a superimposed manner.

The number of videos of the video camera 30 displayed on the map M5 is larger than that displayed on the map M1. Accordingly, the area where the subtitles CST1 to CST3 are displayed on the map M5 is narrower than the area where the subtitle CST1 is displayed on the map M1. The subtitle CST1 in the map M5 is a subtitle in which text data is reduced compared to the subtitle CST1 in the map M1.

When the map M5 with a scale of 1 / 50,000 is displayed on the display panel 2131, for example, the summarization degree is set to 70, position information, shooting date, photographer information, producer information, camera number, shooting purpose, title Summary subtitle data including the following seven items is displayed as subtitles.

In FIG. 10B, when the user further changes the scale of the map to 1 / 100,000, the video camera 30 with the camera numbers 01 to 06 is displayed with the map M10 having a scale of 1 / 100,000 displayed on the display panel 2131. Located in the map M10. Here, the center position of the map M5 is displaced to the right of the map M10. On the map M10, camera images Ci1 to Ci6 and captions CST1 to CST6 of the camera images Ci1 to Ci6 are displayed in a superimposed manner.

The number of videos of the video camera 30 displayed on the map M10 is larger than that displayed on the map M5. Accordingly, the area where the subtitles CST1 to CST6 are displayed on the map M10 is narrower than the area where the subtitles CST1 to CST3 are displayed on the map M5. The subtitles CST1 to CST3 in the map M10 are subtitles with text data reduced from the subtitles CST1 to CST3 in the map M5.

When the map M10 with a scale of 1: 100,000 is displayed on the display panel 2131, for example, the summary degree is set to 20, and summary caption data including two items of photographer information and title is displayed as captions.

When the images of the plurality of video cameras 30 are overlapped as in the map M10, the display order setting unit 215 positions the image located on the lower side by clicking the image located on the lower side or the like. be able to.

In FIG. 10C, when the user further changes the scale of the map to 1 / 200,000, the video camera 30 with the camera numbers 01 to 10 is displayed with the map M20 with a scale of 1 / 200,000 displayed on the display panel 2131. Located in the map M10. Here, the center position of the map M10 is displaced to the left of the map M20. On the map M20, camera videos Ci1 to Ci10 and captions CST1 to CST10 of the camera videos Ci1 to Ci10 are displayed in a superimposed manner.

The number of videos of the video camera 30 displayed on the map M20 is larger than that displayed on the map M10. Accordingly, the area where the subtitles CST1 to CST10 are displayed on the map M20 is narrower than the area where the subtitles CST1 to CST6 are displayed on the map M10. The subtitles CST1 to CST6 in the map M20 are subtitles with text data reduced compared to the subtitles CST1 to CST6 in the map M10.

When the map M20 with a scale of 1: 200,000 is displayed on the display panel 2131, for example, the summarization degree is set to 10, and the summary caption data including one item as the title is displayed as the caption.

In the map M20, icons smaller than the camera images Ci1 to Ci10 indicating the respective video cameras 30 may be displayed instead of the camera images Ci1 to Ci10. Here, the video camera 30 can be identified by assigning a camera number to the icon. Here, when icons are displayed instead of the camera videos Ci1 to Ci10, each subtitle may be displayed larger than the icon so that each subtitle can be easily recognized.

Further, when the camera video is not displayed, even if the map scale is the same (for example, the map scale is 1 / 100,000), the content of the camera video is supplemented as compared with the case where the camera video is displayed. Need to explain. Therefore, when the camera video is not displayed, the number of metadata items included in the summary subtitle data may be increased or the number of characters of the summary subtitle data may be increased as compared with the case where the camera video is displayed. .

That is, when the camera video is not displayed, the summarization degree is set larger than when the camera video is displayed even if the map scale is the same. For example, in the case of displaying a camera image, when the map scale is set to 1 / 100,000, the summarization degree is set to 20, and summary caption data including two items of photographer information and title is displayed as captions. Is done. On the other hand, when the camera image is not displayed and the map scale is set to 1 / 100,000, the summarization degree is set to 30, and the summary caption including the three items of photographer information, producer information, and title Data is displayed as subtitles.

10A to 10C, the subtitles CST1 to CST10 are displayed so as to be adjacent to the camera videos Ci1 to Ci10. As shown in FIG. 11, an area for displaying subtitles CST1 to CST3 may be set, for example, at the right end of the maps M1 and M5, and the subtitles CST1 to CST3 may be displayed separately from the camera videos Ci1 to Ci3. The same applies to the maps M10 and M20.

Further, as shown in FIGS. 10A to 10C, the captions CST1 to CST10 are displayed on the outside so as to be adjacent to the camera videos Ci1 to Ci10. A region for displaying CST3 may be set, and subtitles CST1 to CST3 may be displayed in a state of being separated from the camera videos Ci1 to Ci3. In this case, the summarization degree of the caption summary data displayed outside so as to be adjacent to the camera video and the summarization degree of the caption summary data displayed in a state separated from the camera video may be changed.

As described above, in the second embodiment, as shown in FIGS. 10A to 10B, the size of the subtitle display area is changed without changing the size of the camera video display area according to the scale of the map. The type of subtitles to be displayed is reduced step by step so as not to be displayed step by step from the lowest priority. Thus, according to the second embodiment, it is possible to conceptualize and visualize step by step who is taking what video for what purpose at which position on the map.

In the second embodiment, as shown in FIG. 10C, when a predetermined scale is reached, a simple icon is displayed instead of the camera image, and the size of the subtitle display area is displayed to display the subtitle at the previous scale. It is larger than the size of the area. As a result, according to the second embodiment, even when the number of videos of the video camera 30 increases, the subtitle information interpolates who is shooting what video at what position on the map for what purpose. Can be visualized.

12A to 12C, another example of the display state transition when the map scale is changed will be described. 12A to 12C, in all of the maps M1 to M20, the size of the camera video is varied according to the number of videos of the video camera 30 displayed on each map. In the example shown in FIGS. 12A to 12C, each caption is displayed inside each camera video except for the camera videos Ci1 to Ci10 displayed on the map M20.

12A to 12C, as in FIGS. 10A to 10C, subtitles may be displayed outside each camera video adjacent to each camera video, and each subtitle is displayed in the same manner as in FIG. And each subtitle may be displayed in a state separated from each camera video.

FIG. 13 shows an example in which the summarization degree is set according to the number of display channels, which is the number of camera videos displayed in the map. As shown in FIG. 13, the number of display channels is 1 to 2, 3 to 5, 6 to 10, 11 to 20, 21 or more, and the summarization degrees are 100, 80, 40, 10, 5 respectively. The number of display channels may be a predetermined number or more, the summarization degree may be set to 0, and captions may not be displayed. The number of display channels may be 21 or more, and the summarization degree may be 0.

FIG. 14 shows an example of setting the summarization degree according to the scale of the map. As shown in Fig. 14, the scale of the map is less than 1 / 10,000, less than 1 / 10,000, less than 1 / 50,000, more than 1 / 50,000, less than 1 / 100,000 and more than 1 / 100,000. Are 100, 70, 20, 10 respectively. Similarly, if the scale of the map exceeds a predetermined scale, the summarization degree may be set to 0 so that no caption is displayed.

The operation of the caption generation device of the second embodiment will be described using the flowchart shown in FIG. In FIG. 15, when the process is started, the map data acquisition unit 209 acquires map data in step S201. In parallel with this, the camera image acquisition unit 202 and the metadata extraction unit 203 acquire the image data and metadata transmitted from each video camera 30 in step S202. Step S202 is a video acquisition step.

The display area inside / outside determination unit 206 detects the video camera 30 located in the map displayed on the display unit 213 (display panel 2131) in step S203. In step S204, the summarization degree setting unit 207 sets the summarization degree according to the number of video cameras 30 located in the map displayed on the display unit 213. In step S205, the caption summarizing section 208 summarizes the text data described in the metadata according to the degree of summarization to generate summarized caption data. Step S205 is a caption summarization step.

In parallel with steps S204 and S205, the video reduction unit 210 reduces the video data transmitted from the video camera 30 located in the map displayed on the display unit 213 in step S206.

In step S207, the image composition unit 212 synthesizes the map data, the reduced video data, and the summary caption data. In step S208, the display unit 213 displays the composite image.

The map center position setting unit 216 and the map scale setting unit 217 determine whether or not the map center position or scale has been changed in step S209. The process of setting the map scale by the map scale setting unit 217 is a map scale setting step. If the center position or scale of the map is changed (YES), the processes in steps S201 to S209 are repeated. If the map center position or scale is not changed (NO), in step S210, the map display device 20 determines whether or not an instruction to end map display has been given by the operation unit 214.

If the instruction to end the map display is not given (NO), the processing of steps S208 to S210 is repeated. If an instruction to end the map display is given (YES), the map display device 20 ends the process.

In the video transmission / reception system shown in FIG. 8, the case where the video camera 30 transmits a moving image to the map display device 20 has been described. However, the video camera 30 may transmit still images to the map display device 20 at predetermined intervals. . In this case, the video camera 30 may capture a still image and transmit it to the map display device 20, or a still camera may capture a still image and transmit it to the map display device 20 instead of the video camera 30. . That is, it is only necessary that an image obtained by photographing the subject by the camera is transmitted to the map display device 20. Here, it is assumed that the predetermined interval is, for example, 3 seconds.

In the map display device 20, when a plurality of camera images are superimposed on each map, the plurality of camera images may be sequentially displayed at predetermined time intervals so that the plurality of camera images are not displayed simultaneously. Moreover, in the map display apparatus 20, as shown in FIG. 11, you may set the area | region which displays a camera image | video with a caption at the right end part.

Each part of the map display device 20 shown in FIG. 9 may be configured by hardware such as an integrated circuit, or may be configured by software (computer program). Use of hardware and software is optional. The flowchart shown in FIG. 15 may be processing that the subtitle generation program of the second embodiment causes the computer to execute. The map display device 20 shown in FIG. 9 can be configured by a browser which is software for viewing a map.

Similarly, the caption generation program may be transmitted to the map display device 20 via the network 50, or may be stored in a non-temporary storage medium and provided to the map display device 20.

<Third Embodiment>
FIG. 16 shows a posted moving image distribution system including the caption generation device according to the third embodiment. A content server 60 and a computer 70 are connected to the network 50. The content server 60 includes a moving image storage unit 601 that stores posted moving images, a thumbnail image generation unit 602 that generates thumbnail images of posted moving images, and a text data storage unit 603 that stores text data associated with each posted moving image. Prepare.

Also, the content server 60 includes a summary degree setting unit 604 and a caption summary unit 605.

The text data may be a character describing the outline of the content of the posted moving image, may be a character supplementing the content of the posted moving image, or may be a comment related to the posted moving image.

The posted moving image distributed by the content server 60 or the thumbnail image of the posted moving image is a video of one or a plurality of channels. The computer 70 receives a posted moving image or a thumbnail image distributed by the content server 60.

The storage unit 701 provided in the computer 70 stores a browser 702 that is software for viewing a posted moving image provided by the content server 60. The computer 70 can display a thumbnail image for selecting a posted moving image on the display unit 703 by executing the browser 702, or can display a posted moving image by selecting a thumbnail image.

FIG. 17A shows an example of the display state of the display unit 703 when the computer 70 instructs the content server 60 to display a large thumbnail image. The subtitle TST1 of the thumbnail image Ti1 is displayed adjacent to the thumbnail image Ti1 of the posted moving image 001, and the subtitle TST2 of the thumbnail image Ti2 is displayed adjacent to the thumbnail image Ti2 of the posted moving image 002.

The summary degree setting unit 604 sets a summary degree that does not reduce the number of characters in the text data or does not reduce the number of characters in response to an instruction to display a large thumbnail image. The subtitle summarizing section 605 summarizes the text data with the summarization degree set by the summarization degree setting section 604 to generate the subtitle data of the subtitles TST1 and TST2. The content server 60 distributes the video data of the large thumbnail images Ti1 and Ti2 and the summary caption data of the captions TST1 and TST2 to the computer 70.

FIG. 17B shows an example of the display state of the display unit 703 when the computer 70 instructs the content server 60 to display a thumbnail image having a small size. Subtitles TST1 to TST9 are displayed adjacent to thumbnail images Ti1 to Ti9 of posted moving images 001 to 009.

The summary degree setting unit 604 sets a summary degree that reduces the number of characters in the text data in response to an instruction to display a small thumbnail image. The caption summarizing section 605 summarizes the text data with the summarization degree set by the summarization degree setting section 604, and generates the summarization caption data of the captions TST1 to TST9. The content server 60 distributes video data of small thumbnail images Ti1 to Ti9 and summary caption data of captions TST1 to TST9 to the computer 70.

In FIG. 17A or FIG. 17B, when the reproduction of the posted moving image in which any thumbnail image is selected is instructed, the content server 60 distributes the moving image data of the posted moving image to the computer 70.

In the posted moving image distribution system shown in FIG. 16, the summary degree setting unit 604 and the caption summary unit 605 are provided in the content server 60, but the browser 702 has the same function, and the display of FIGS. 17A and 17B. It is also possible to realize the state.

As described above, according to the caption generation device and the caption generation program of the first to third embodiments, captions are generated in a mode corresponding to the number of channels of one or a plurality of videos displayed on the

display units

5, 213, and 703. Can be generated. According to the video display device 10, the map display device 20, and the posted moving image distribution system (computer 70) that include the caption generation device of the first to third embodiments or execute the caption generation program of the first to third embodiments. For example, even if the number of video channels increases, the user can comprehensively grasp the subtitles of each channel.

The present invention is not limited to the first to third embodiments described above, and various modifications can be made without departing from the scope of the present invention.

The disclosure of the present application is related to the subject matter described in Japanese Patent Application No. 2018-05859 filed on March 26, 2018, the entire disclosure content of which is incorporated herein by reference.

Claims

A subtitle summarizing section that generates a summary subtitle summarizing text data related to the video according to the number of videos displayed on the display unit or a display video size indicating a size of the video displayed on the display unit. A caption generation device.
A position acquisition unit that acquires position information indicating a position at which the video was shot;
A map scale setting unit for setting a scale of a map displayed on the display unit;
A determination unit that determines the number of images to be displayed on the display unit from the scale of the map and the position information;
The caption generation device according to claim 1, comprising:
A map scale setting unit for setting the scale of the map displayed on the display unit;
According to the scale of the map, a caption summary unit that generates summary captions that summarize text data related to video displayed on the display unit;
A caption generation device comprising:
A voice recognition unit that generates text data related to the video based on voice data;
The caption generation device according to any one of claims 1 to 3, further comprising:
The caption generation device according to any one of claims 1 to 3, wherein the text data is metadata in which various information related to the video is described.
Subtitle generation that causes a computer to execute a subtitle summarization step that generates summary subtitles summarizing text data related to the video according to the number of videos to be displayed or a display video size indicating the size of the video to be displayed program.
A map scale setting step for setting the scale of the displayed map;
Subtitle summarizing step for generating summary subtitles summarizing text data related to video according to the scale of the map;
A subtitle generation program that causes a computer to execute.