WO2020091431A1

WO2020091431A1 - Subtitle generation system using graphic object

Info

Publication number: WO2020091431A1
Application number: PCT/KR2019/014501
Authority: WO
Inventors: 한승룡
Original assignee: 주식회사 모두앤모두
Priority date: 2018-11-02
Filing date: 2019-10-30
Publication date: 2020-05-07
Also published as: KR20200050707A; KR102136059B1

Abstract

The present invention provides a subtitle generation system using a graphic object. The system comprises a subtitle generator for receiving a multimedia content including audio data and video data from a content management server, and generating extension subtitle data by using the multimedia content, wherein the subtitle generator includes a basic subtitle data generator, a voice recognition processor, a facial recognition processor, a speaker information generator, an emotion information generator, an extension subtitle data generator, and a database.

Description

Subtitle generation system using graphic objects

The present invention relates to a system for generating captions using graphic objects.

Conventional subtitle broadcasting for the hearing impaired relies on stenographers who generate subtitles for live or recorded broadcasts. Because the stenographer must keep pace with the speaker in the video, typos are frequent and a time lag arises between the subtitles and the actual picture. In addition, the quality of stenographed subtitles varies with the stenographer's skill. Moreover, subtitles that are simply displayed at the bottom of the video make it difficult to understand the feelings and emotions of the actor. For example, stenography has no way of expressing an actor's angry voice, and because hearing-impaired viewers cannot perceive the emotion through the actor's voice, they enjoy only half the pleasure of watching the video.

The problem to be solved by the present invention is to provide a caption generation system using a graphic object.

The problems to be solved by the present invention are not limited to the problems mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A system for generating captions using a graphic object according to an aspect of the present invention for solving the above-described problem includes a subtitle generator that receives multimedia content including audio data and video data from a content management server and generates extended subtitle data using the multimedia content. The subtitle generator includes: a basic subtitle data generator that receives the audio data and generates basic subtitle data using the audio data; a speech recognition processor that receives the audio data and generates a speaker character identifier and emotion classification information by speech recognition using the audio data; a facial recognition processor that receives the video data and generates character information and emotion classification information by facial recognition using the video data; a speaker information generator that receives the speaker character identifier and the character information and generates speaker information; an emotion information generator that receives the emotion classification information by speech recognition, the emotion classification information by facial recognition, and the speaker information, and generates emotion information; an extended subtitle data generator that receives the basic subtitle data, the speaker information, and the emotion information, and generates the extended subtitle data; and a database that stores and manages voice information and emotion information of the speaker character and facial information and emotion information of the appearing character. The basic subtitle data includes a character string detected from the audio data and information on a start time point. The speaker information is the speaker character identifier and the location when the speaker is among the characters appearing in the scene, and is the speaker character identifier when the speaker does not appear in the scene. The emotion information is information that combines the results of the speech recognition and the facial recognition when the speaker is among the characters appearing in the scene, and is information that reflects the result of the speech recognition when the speaker does not appear in the scene. The extended subtitle data includes a subtitle graphic object corresponding to the character string, the start time point, the speaker character identifier, the location, and the emotion information, and is combined with the multimedia content.

In some embodiments, the subtitle graphic object is output with a font type, a font size, a font color, a font thickness, a graphic object shape, a graphic object size, and a graphic object background color corresponding to the emotion information.

In some embodiments, the speech recognition processor generates the speaker character identifier and the emotion classification information in cooperation with the database.

In some embodiments, the facial recognition processor generates the character information and the emotion classification information in cooperation with the database.

In some embodiments, the character information includes the number of characters, the character identifier, and the location.

In some embodiments, a subtitle synthesizer is further included, and the subtitle synthesizer receives the extended subtitle data from the subtitle generator and synthesizes the multimedia content and the extended subtitle data.

A system for generating captions using a graphic object according to another aspect of the present invention for solving the above-described problem includes a subtitle generator that receives, from a content management server, multimedia content including audio data and video data together with basic subtitle data, and generates extended subtitle data using the multimedia content and the basic subtitle data. The subtitle generator includes: a speech recognition processor that receives the audio data and generates a speaker character identifier and emotion classification information by speech recognition using the audio data; a facial recognition processor that receives the video data and generates character information and emotion classification information by facial recognition using the video data; a speaker information generator that receives the speaker character identifier and the character information and generates speaker information; an emotion information generator that receives the emotion classification information by speech recognition, the emotion classification information by facial recognition, and the speaker information, and generates emotion information; an extended subtitle data generator that receives the basic subtitle data, the speaker information, and the emotion information, and generates the extended subtitle data; and a database that stores and manages voice information and emotion information of the speaker character and facial information and emotion information of the appearing character. The basic subtitle data includes a character string and information on a start time point. The speaker information is the speaker character identifier and the location when the speaker is among the characters appearing in the scene, and is the speaker character identifier when the speaker does not appear in the scene. The emotion information is information that combines the results of the speech recognition and the facial recognition when the speaker is among the characters appearing in the scene, and is information that reflects the result of the speech recognition when the speaker does not appear in the scene. The extended subtitle data includes a subtitle graphic object corresponding to the character string, the start time point, the speaker character identifier, the location, and the emotion information, and is combined with the multimedia content.

Other specific matters of the present invention are included in the detailed description and drawings.

According to the present invention, the subtitle generation system using a graphic object can generate emotion information of the speaker character and the appearing character by voice recognition and facial recognition, and can generate extended subtitle data corresponding to the emotion information.

In addition, the caption generation system using the graphic object makes it possible to visually transmit the emotion information to the hearing impaired by making the graphic object displaying the extended caption correspond to the emotion information.

In addition, when the speaker is not present in the video, the subtitle generation system using a graphic object adds a speaker character identifier to the video so that a hearing-impaired person can visually recognize the speaker.

In addition, when a single video has multiple speakers, the subtitle generation system using a graphic object uses graphic objects to associate each subtitle with its speaker, so that a hearing-impaired person can visually recognize who among the plurality of speakers is speaking.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is an input/output diagram of a caption generator according to an embodiment of the present invention.

FIG. 2 is a block diagram of a caption generator according to an embodiment of the present invention.

FIG. 3 is an input/output diagram of a basic caption data generator and a speech recognition processor according to an embodiment of the present invention.

FIG. 4 is an input/output diagram of a facial recognition processor according to an embodiment of the present invention.

FIG. 5 is an input/output diagram of a speaker information generator according to an embodiment of the present invention.

FIG. 6 is an input/output diagram of an emotion information generator according to an embodiment of the present invention.

FIG. 7 is an input/output diagram of an extended caption data generator according to an embodiment of the present invention.

FIG. 8 is a flowchart of a subtitle generation method using a graphic object according to an embodiment of the present invention.

FIG. 9 is an input/output diagram of a caption synthesizer according to an embodiment of the present invention.

FIG. 10 is an exemplary diagram of multimedia content synthesized with extended caption data according to an embodiment of the present invention.

FIG. 11 is an input/output diagram of a caption generator and a caption synthesizer according to an embodiment of the present invention.

FIG. 12 is an input/output diagram of a caption generator according to an embodiment of the present invention.

FIG. 13 is a block diagram of a caption generator according to an embodiment of the present invention.

Advantages and features of the present invention, and methods for achieving them, will become clear with reference to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the technical field to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless otherwise specified in the phrase. As used herein, “comprises” and / or “comprising” does not exclude the presence or addition of one or more other components other than the components mentioned. Throughout the specification, the same reference numerals refer to the same components, and “and / or” includes each and every combination of one or more of the mentioned components. Although "first", "second", etc. are used to describe various components, it goes without saying that these components are not limited by these terms. These terms are only used to distinguish one component from another component. Therefore, it goes without saying that the first component mentioned below may be the second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in the present specification have the meanings commonly understood by those skilled in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly defined otherwise.

FIGS. 1 to 11 describe a system that receives multimedia content and generates basic subtitle data and extended subtitle data in real time, and FIGS. 12 and 13 describe a system that receives multimedia content together with basic subtitle data already produced by a content production company and generates extended subtitle data.

Multimedia content is digitized information that is produced, distributed, and consumed by information equipment, and includes drama, movies, news, animation, educational programs, and games, and is composed of audio data and video data.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an input/output diagram of the caption generator 1000 according to an embodiment of the present invention.

Referring to FIG. 1, the caption generator 1000 receives multimedia content and generates extended caption data.

The subtitle generator 1000 receives multimedia content including audio data and video data from the content management server, and generates extended subtitle data using the multimedia content.

The content management server is operated by a multimedia content production company, and manages and stores multimedia content and basic subtitle data. The content management server may transmit only the multimedia content to the subtitle generator 1000, or may transmit the multimedia content together with the basic subtitle data produced by the content producer to the subtitle generator 1000.

The caption generator 1000 may process multimedia data according to a predetermined rule to generate extended caption data reflecting the speaker's emotion information.

FIG. 2 is a block diagram of the caption generator 1000 according to an embodiment of the present invention.

Referring to FIG. 2, the subtitle generator 1000 includes a basic subtitle data generator 1100, a speech recognition processor 1200, a facial recognition processor 1300, a speaker information generator 1400, an emotion information generator 1500, an extended subtitle data generator 1600, and a database 1700.

The basic caption data generator 1100 receives audio data and generates basic caption data using the audio data. The basic subtitle data generator 1100 receives audio data among multimedia contents, and processes the audio data according to a predetermined rule to generate basic subtitle data.

The speech recognition processor 1200 receives audio data and generates speaker character identifiers and emotion classification information by speech recognition using the audio data. The speech recognition processor 1200 receives audio data among multimedia contents, and processes the audio data according to predetermined rules to generate speaker character identifiers and emotion classification information.

The facial recognition processor 1300 receives video data, and generates character information and emotion classification information by facial recognition using the video data. The facial recognition processor 1300 receives video data among multimedia contents, and processes the video data according to predetermined rules to generate character information and emotion classification information.

The speaker information generator 1400 receives the speaker character identifier and the character information, and generates speaker information. The speaker information generator 1400 receives the speaker character identifier from the speech recognition processor 1200, receives the character information from the facial recognition processor 1300, and processes the speaker character identifier and the character information according to predetermined rules to generate speaker information.

The emotion information generator 1500 receives emotion classification information by voice recognition, emotion classification information by face recognition, and speaker information, and generates emotion information. The emotion information generator 1500 receives emotion classification information by voice recognition from the speech recognition processor 1200, receives emotion classification information by face recognition from the facial recognition processor 1300, and processes the emotion classification information by voice recognition and by facial recognition according to a predetermined rule to generate emotion information.

The extended caption data generator 1600 receives basic caption data, speaker information, and emotion information, and generates extended caption data. The extended caption data generator 1600 receives basic caption data from the basic caption data generator 1100, receives speaker information from the speaker information generator 1400, and receives emotion information from the emotion information generator 1500, and generates extended caption data by processing the basic caption data, the speaker information, and the emotion information according to predetermined rules.

The database 1700 stores and manages voice information and emotion information of the speaker character and facial information and emotion information of the appearing character. The database 1700 can store and manage this information in advance, before the multimedia content is broadcast, and can also store and manage it in real time while the multimedia content is being broadcast.

When a specific speaker appears repeatedly or periodically in multimedia content over a certain period of time, the database 1700 automatically receives the voice information and emotion information of the speaker character and the facial information and emotion information of the appearing character from the subtitle generator 1000, and stores and manages them.

FIG. 3 is an input/output diagram of the basic caption data generator 1100 and the speech recognition processor 1200 according to an embodiment of the present invention.

Referring to FIG. 3, the basic caption data generator 1100 receives audio data and generates basic caption data, and the speech recognition processor 1200 receives audio data and generates a speaker character identifier.

The basic caption data generator 1100 receives audio data and generates basic caption data using the audio data. The basic subtitle data includes a character string detected from the audio data and information on its start time point.

The speech recognition processor 1200 receives audio data and generates speaker character identifiers and emotion classification information by speech recognition using the audio data. The speaker character identifier may be a specific symbol, icon, or image that identifies a person, character, or the like speaking in the audio data. The emotion classification information is information that classifies the emotions of the speaker, and may be joy, sadness, anger, and the like.

The speech recognition processor 1200 recognizes voice information in the audio data and generates the speaker character identifier and the emotion classification information in cooperation with the database 1700. When the voice information recognized from the audio data of the multimedia content is similar to voice information in the database 1700 at or above a predetermined criterion, the speech recognition processor 1200 recognizes the speaker character corresponding to that voice information in the database 1700 and generates the speaker character identifier. Likewise, when the voice information recognized from the audio data of the multimedia content is similar to voice information in the database 1700 at or above a predetermined criterion, the speech recognition processor 1200 recognizes the emotion information corresponding to that voice information in the database 1700 and generates the emotion classification information.
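The database lookup described above can be sketched as a nearest-match search with a similarity threshold. This is only a minimal illustration: the specification does not state how voice information is represented or compared, so the feature vectors, the use of cosine similarity, the `VoiceRecord` type, the `match_voice` function, and the threshold value of 0.8 are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class VoiceRecord:
    """Hypothetical database row: a stored voice sample with its labels."""
    character_id: str
    emotion: str
    features: Tuple[float, ...]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_voice(
    features: Sequence[float],
    database: Sequence[VoiceRecord],
    threshold: float = 0.8,  # assumed stand-in for the "predetermined criterion"
) -> Optional[VoiceRecord]:
    """Return the most similar stored record at or above the threshold, else None."""
    best: Optional[VoiceRecord] = None
    best_score = threshold
    for record in database:
        score = cosine_similarity(features, record.features)
        if score >= best_score:
            best, best_score = record, score
    return best
```

A matched record would supply both the speaker character identifier and the emotion classification information; a result of `None` corresponds to voice information that matches nothing in the database closely enough.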

FIG. 4 is an input/output diagram of the facial recognition processor 1300 according to an embodiment of the present invention.

Referring to FIG. 4, the facial recognition processor 1300 receives video data and generates character information and emotion classification information.

The facial recognition processor 1300 receives video data, and generates character information and emotion classification information by facial recognition using the video data. The character information includes the number of characters, the character identifier, and the location. Also, the emotion classification information is information that classifies the emotions of the appearing character, and may be joy, sadness, anger, and the like.

The facial recognition processor 1300 recognizes facial information in the video data and generates the character information and the emotion classification information in cooperation with the database 1700. When the facial information recognized from the video data of the multimedia content is similar to facial information in the database 1700 at or above a predetermined criterion, the facial recognition processor 1300 recognizes the appearing character corresponding to that facial information in the database 1700 and generates the number of characters, the character identifier, and the location. Likewise, when the facial information recognized from the video data of the multimedia content is similar to facial information in the database 1700 at or above a predetermined criterion, the facial recognition processor 1300 recognizes the emotion information corresponding to that facial information in the database 1700 and generates the emotion classification information.

FIG. 5 is an input/output diagram of the speaker information generator 1400 according to an embodiment of the present invention.

Referring to FIG. 5, the speaker information generator 1400 receives speaker character identifiers and character information, and generates speaker information.

The speaker information generator 1400 receives the speaker character identifier and the character information, and generates speaker information.

The speaker information is the speaker character identifier and the location when the speaker is among the characters appearing in the scene, and is the speaker character identifier alone when the speaker does not appear in the scene. When the similarity between the speaker character identifier and a character identifier is at or above a predetermined criterion, the speaker information generator 1400 determines that the speaker is among the appearing characters and generates speaker information consisting of the speaker character identifier and the location. When the similarity between the speaker character identifier and the character identifier is below the predetermined criterion, the speaker information generator 1400 determines that the speaker does not appear in the scene and generates speaker information consisting of the speaker character identifier.
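The two cases of speaker information can be illustrated with a short sketch. It simplifies the "predetermined criterion" to exact identifier equality, which is an assumption; the `AppearingCharacter` and `SpeakerInfo` types and the `generate_speaker_info` function are hypothetical names, and the location is assumed to be a pixel coordinate pair.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Location = Tuple[int, int]  # assumed (x, y) pixel position in the frame


@dataclass(frozen=True)
class AppearingCharacter:
    """One entry of the character information from facial recognition."""
    character_id: str
    location: Location


@dataclass(frozen=True)
class SpeakerInfo:
    """Speaker identifier plus location; location is None when off-screen."""
    speaker_id: str
    location: Optional[Location]


def generate_speaker_info(
    speaker_id: str, characters: Sequence[AppearingCharacter]
) -> SpeakerInfo:
    """Attach the on-screen location if the speaker is among the characters."""
    for character in characters:
        # Simplification: identifiers match exactly instead of by similarity.
        if character.character_id == speaker_id:
            return SpeakerInfo(speaker_id, character.location)
    # Speaker does not appear in the scene: identifier only.
    return SpeakerInfo(speaker_id, None)
```

Representing the off-screen case as a missing location keeps a single output type for both branches, so later stages only need to check whether a location is present.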

FIG. 6 is an input/output diagram of the emotion information generator 1500 according to an embodiment of the present invention.

Referring to FIG. 6, the emotion information generator 1500 receives emotion classification information by voice recognition, emotion classification information by face recognition, and speaker information, and generates emotion information.

The emotion information is information that combines the results of speech recognition and facial recognition when the speaker is among the characters appearing in the scene, and is information that reflects the result of speech recognition when the speaker does not appear in the scene. When the similarity between the speaker character identifier and a character identifier is at or above a predetermined criterion, the emotion information generator 1500 determines that the speaker is among the appearing characters and generates emotion information by combining the emotion classification information by voice recognition with the emotion classification information by facial recognition. When the similarity between the speaker character identifier and the character identifier is below the predetermined criterion, the emotion information generator 1500 determines that the speaker does not appear in the scene and generates emotion information using only the emotion classification information by voice recognition.
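One way to picture the combination step is shown in the sketch below. The specification does not say how the two classification results are combined, so this example assumes that each recognizer yields a score per emotion label and that the scores are averaged; the `generate_emotion_info` function and the score dictionaries are hypothetical.

```python
from typing import Dict, Optional


def generate_emotion_info(
    voice_scores: Dict[str, float],
    face_scores: Optional[Dict[str, float]],
    speaker_on_screen: bool,
) -> str:
    """Pick one emotion label from voice and, when available, face scores.

    voice_scores must contain at least one label.
    """
    if not voice_scores:
        raise ValueError("voice_scores must not be empty")
    if speaker_on_screen and face_scores is not None:
        # Assumed combination rule: average the two scores for every label.
        labels = set(voice_scores) | set(face_scores)
        combined = {
            label: (voice_scores.get(label, 0.0) + face_scores.get(label, 0.0)) / 2
            for label in labels
        }
    else:
        # Speaker is off-screen: only the voice result is reflected.
        combined = dict(voice_scores)
    return max(combined, key=combined.get)
```

For instance, a voice result that leans toward anger can be outweighed by a strongly joyful face when the speaker is on screen, whereas the same voice result alone yields anger when the speaker is off-screen.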

FIG. 7 is an input/output diagram of the extended caption data generator 1600 according to an embodiment of the present invention.

Referring to FIG. 7, the extended caption data generator 1600 receives basic caption data, speaker information, and emotion information, and generates extended caption data.

The extended caption data generator 1600 receives basic caption data, speaker information, and emotion information, and generates extended caption data. The extended subtitle data includes a subtitle graphic object corresponding to a character string, a starting point, a speaker character identifier, location and emotion information, and is combined with multimedia content.

For each start time point at which a subtitle of the multimedia content is output, the extended subtitle data may differ in the character string, the speaker character identifier, the location of the appearing character, the font type, the font size, the font color, the shape of the graphic object, the size of the graphic object, and the background color of the graphic object. Thus, the extended subtitle data can be configured to reflect the speaker's emotions rather than simply outputting a character string.

The subtitle graphic object is a tool for displaying a character string on the screen, and is output with a font type, a font size, a font color, a graphic object shape, a graphic object size, and a graphic object background color corresponding to the emotion information. The subtitle graphic object is described in detail with reference to FIG. 10.

Referring to FIG. 8, a subtitle generation method using a graphic object includes a basic subtitle data generation step, a voice recognition processing step, a face recognition processing step, a speaker information generation step, an emotion information generation step, and an extended subtitle data generation step.

In step S5100, audio data among multimedia contents is received from the content management server, and basic subtitle data is generated using the audio data.

In step S5200, audio data among multimedia contents is received from the content management server, and speaker character identifiers and emotion classification information by voice recognition are generated using the audio data.

In step S5300, video data among multimedia contents is received from the content management server, and character information and emotion classification information by facial recognition are generated using the video data.

In step S5400, the speaker character identifier and the character information are received, and the speaker information is generated.

In step S5500, emotion classification information by voice recognition, emotion classification information by face recognition, and speaker information are received, and emotion information is generated.

In step S5600, basic subtitle data, speaker information, and emotion information are received, and extended subtitle data is generated.
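The data flow through steps S5100 to S5600 can be sketched as a single function that wires the stages together. The recognizers themselves are passed in as callables because the specification does not define them; the `ExtendedCaption` type, the parameter names, and the tuple shapes returned by the callables are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Location = Tuple[int, int]
Face = Tuple[str, Location, str]  # assumed (character_id, location, emotion)


@dataclass(frozen=True)
class ExtendedCaption:
    text: str
    start_time: float  # seconds from the start of the content
    speaker_id: str
    location: Optional[Location]  # None when the speaker is off-screen
    emotion: str


def generate_extended_caption(
    audio: bytes,
    video: bytes,
    *,
    transcribe: Callable[[bytes], Tuple[str, float]],
    recognize_voice: Callable[[bytes], Tuple[str, str]],
    recognize_faces: Callable[[bytes], Sequence[Face]],
    combine: Callable[[str, str], str],
) -> ExtendedCaption:
    text, start_time = transcribe(audio)                # S5100: basic subtitle data
    speaker_id, voice_emotion = recognize_voice(audio)  # S5200: voice recognition
    faces = recognize_faces(video)                      # S5300: facial recognition
    # S5400: the speaker has a location only if they appear in the scene.
    on_screen = next((f for f in faces if f[0] == speaker_id), None)
    location = on_screen[1] if on_screen else None
    # S5500: combine both results on screen, voice result only off screen.
    emotion = combine(voice_emotion, on_screen[2]) if on_screen else voice_emotion
    # S5600: assemble the extended subtitle data.
    return ExtendedCaption(text, start_time, speaker_id, location, emotion)
```

Injecting the recognizers keeps the sketch runnable with simple stand-ins and makes explicit which steps depend on audio, which on video, and which only on earlier results.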

FIG. 9 is an input/output diagram of the caption synthesizer 2000 according to an embodiment of the present invention.

Referring to FIG. 9, the subtitle synthesizer 2000 receives multimedia content and extended subtitle data, and generates multimedia content synthesized with the extended subtitle data.

The subtitle synthesizer 2000 receives the extended subtitle data from the subtitle generator 1000 and synthesizes the multimedia content and the extended subtitle data. The multimedia content in which the extended subtitle data is synthesized will be described in detail in FIG. 10.

In order to determine whether the multimedia content and the extended subtitle data correspond to each other, the subtitle synthesizer 2000 compares the identifier of the multimedia content and the identifier of the extended subtitle data and synthesizes the multimedia content and the extended subtitle data when they correspond to each other.
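The identifier check performed before synthesis might look like the following sketch. The specification does not describe the form of the identifiers or of the synthesized result, so the `MultimediaContent`, `ExtendedCaptionTrack`, and `SynthesizedContent` types, the string identifiers, and the choice to raise an error on a mismatch are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Caption = Tuple[float, str]  # assumed (start_time_seconds, text)


@dataclass(frozen=True)
class MultimediaContent:
    content_id: str
    media_uri: str


@dataclass(frozen=True)
class ExtendedCaptionTrack:
    content_id: str  # identifier of the content the captions belong to
    captions: Tuple[Caption, ...]


@dataclass(frozen=True)
class SynthesizedContent:
    content: MultimediaContent
    captions: Tuple[Caption, ...]


def synthesize(
    content: MultimediaContent, track: ExtendedCaptionTrack
) -> SynthesizedContent:
    """Attach the caption track to the content only if the identifiers match."""
    if content.content_id != track.content_id:
        raise ValueError(
            f"caption track {track.content_id!r} does not belong to "
            f"content {content.content_id!r}"
        )
    # Order captions by start time so they can be rendered sequentially.
    ordered = tuple(sorted(track.captions, key=lambda caption: caption[0]))
    return SynthesizedContent(content, ordered)
```

Rejecting a mismatched pair outright is one possible policy; an implementation could equally skip the track and pass the content through without captions.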

Referring to (a) of FIG. 10, an example is shown of multimedia content with which extended subtitle data is synthesized when the speaker is not present in the video.

When the speaker does not appear in the multimedia content, simply displaying the subtitle at the bottom of the video makes it difficult for hearing-impaired viewers to determine which character is speaking. Accordingly, the present invention outputs the speaker character identifier on the image and outputs the extended caption data at the location of the speaker character identifier. The subtitle graphic object corresponding to the emotion information of the extended subtitle data may be output on the image in the form of a speech bubble, and may be output with a text type, text size, text color, text thickness, graphic object shape, graphic object size, and graphic object background color corresponding to the speaker's emotion information. For example, when the speaker's emotion information is anger, the font size can be increased to a predetermined standard or more, the font color can be red, the font can be bold, and the subtitle graphic object can have a sharp shape.

The text type, text size, text color, text thickness, graphic object shape, graphic object size, and graphic object background color corresponding to the speaker's emotion information can be set by the administrator of the subtitle generation system, and can also be set differently according to the preferences of the hearing-impaired user.
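A configurable mapping from emotion information to graphic object attributes can be sketched as a lookup table with per-user overrides. Only the anger example (larger, red, bold text in a sharp-edged shape) comes from the description; the `CaptionStyle` fields, the concrete values, the default style, and the `style_for` function are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class CaptionStyle:
    font_family: str = "sans-serif"
    font_size: int = 24  # assumed baseline size in points
    font_color: str = "white"
    font_weight: str = "normal"
    shape: str = "rounded"  # outline of the speech bubble
    background_color: str = "black"


DEFAULT_STYLE = CaptionStyle()

# Administrator-level defaults; only "anger" is taken from the description.
EMOTION_STYLES: Dict[str, CaptionStyle] = {
    "anger": replace(
        DEFAULT_STYLE,
        font_size=36,
        font_color="red",
        font_weight="bold",
        shape="spiky",
    ),
}


def style_for(
    emotion: str, overrides: Optional[Dict[str, CaptionStyle]] = None
) -> CaptionStyle:
    """Return the style for an emotion, letting user overrides win."""
    table = {**EMOTION_STYLES, **(overrides or {})}
    return table.get(emotion, DEFAULT_STYLE)
```

Keeping the administrator table and the user overrides separate means a viewer's preferences never modify the shared defaults, and unknown emotions fall back to a plain style.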

Referring to (b) of FIG. 10, an example is shown of multimedia content with which extended subtitle data is synthesized when a plurality of speakers are present in an image.

When a plurality of speakers are present in the multimedia content, simply displaying subtitles at the bottom of the video makes it difficult for hearing-impaired viewers to determine which character is speaking. Accordingly, the present invention can determine the appearing character corresponding to the extended caption data and output the extended caption data at the position of that character.
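Placing a caption at a character's position raises a practical layout question that the description leaves open: the speech bubble must stay inside the frame. The sketch below shows one possible approach, assuming pixel coordinates with the origin at the top left; the `place_bubble` function, the choice to centre the bubble above the character, and the margin value are all hypothetical.

```python
from typing import Tuple


def place_bubble(
    anchor: Tuple[int, int],
    bubble_size: Tuple[int, int],
    frame_size: Tuple[int, int],
    margin: int = 10,
) -> Tuple[int, int]:
    """Return the top-left corner of a bubble centred above the anchor.

    anchor is the character's location, bubble_size and frame_size are
    (width, height); the result is clamped so the bubble stays in the frame.
    """
    anchor_x, anchor_y = anchor
    bubble_w, bubble_h = bubble_size
    frame_w, frame_h = frame_size
    # Centre horizontally on the character and sit just above them.
    x = anchor_x - bubble_w // 2
    y = anchor_y - bubble_h - margin
    # Clamp to the visible frame.
    x = max(0, min(x, frame_w - bubble_w))
    y = max(0, min(y, frame_h - bubble_h))
    return x, y
```

With several speakers, calling this once per character yields one bubble position per caption; resolving overlaps between bubbles would need additional logic that is not covered here.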

FIG. 11 is an input/output diagram of the caption generator 1000 and the caption synthesizer 2000 according to an embodiment of the present invention.

Referring to FIG. 11, the subtitle generator 1000 receives multimedia content and generates extended subtitle data, and the subtitle synthesizer 2000 receives the multimedia content and the extended subtitle data and generates multimedia content synthesized with the extended subtitle data.

The subtitle generator 1000, the subtitle synthesizer 2000, the multimedia content, the extended subtitle data, and the multimedia content with which the extended subtitle data is synthesized have been described with reference to FIGS. 1 to 10, and their description is therefore omitted for FIG. 11.

The subtitle generator 3000 of FIGS. 12 and 13 is a system that generates extended subtitle data by receiving multimedia content and basic subtitle data already produced by a content production company. It differs from the subtitle generator 1000 of FIGS. 1 to 10 in that the basic subtitle data is received from the content producer. Since the terms are otherwise the same, it is described only briefly below.

FIG. 12 is an input and output diagram of the subtitle generator 3000 according to an embodiment of the present invention.

Referring to FIG. 12, the subtitle generator 3000 receives multimedia content and basic subtitle data, and generates extended subtitle data.

The subtitle generator 3000 receives multimedia content, which includes audio data and video data, and basic subtitle data from the content management server, and generates extended subtitle data using the multimedia content and the basic subtitle data.

The basic subtitle data includes a character string and information on a start time point.

The subtitle synthesizer receives extended subtitle data from the subtitle generator 3000 and synthesizes multimedia content and extended subtitle data.

FIG. 13 is a block diagram of the subtitle generator 3000 according to an embodiment of the present invention.

Referring to FIG. 13, the subtitle generator 3000 includes a voice recognition processor 3100, a facial recognition processor 3200, a speaker information generator 3300, an emotion information generator 3400, an extended subtitle data generator 3500, and a database 3600.

The voice recognition processor 3100 receives audio data, and generates speaker character identifiers and emotion classification information by voice recognition using the audio data.

The voice recognition processor 3100 generates the speaker character identifier and the emotion classification information in cooperation with the database 3600.

The facial recognition processor 3200 receives video data, and generates character information and emotion classification information by facial recognition using the video data.

The facial recognition processor 3200 generates the character information and the emotion classification information in cooperation with the database 3600.

The character information includes the number of characters, the character identifier, and the location.

The speaker information generator 3300 receives the speaker character identifier and the character information, and generates speaker information.

The speaker information consists of the speaker character identifier and location when the speaker is among the characters appearing in the scene, and of the speaker character identifier alone when the speaker does not appear in the scene.
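This rule can be sketched in Python as follows. The class and function names are hypothetical, and the location is assumed to be an `(x, y)` pixel coordinate, which the description does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Character:
    """One entry of the character information from facial recognition."""
    identifier: str
    location: Tuple[int, int]  # assumed (x, y) position in the frame


@dataclass
class SpeakerInfo:
    identifier: str
    location: Optional[Tuple[int, int]] = None  # None: speaker is off screen


def make_speaker_info(speaker_id: str, characters: List[Character]) -> SpeakerInfo:
    """Attach a location only when the speaker is among the on-screen characters."""
    for character in characters:
        if character.identifier == speaker_id:
            return SpeakerInfo(speaker_id, character.location)
    # Speaker was identified by voice but does not appear in the scene.
    return SpeakerInfo(speaker_id)
```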

The emotion information generator 3400 receives the emotion classification information by voice recognition, the emotion classification information by facial recognition, and the speaker information, and generates emotion information.

The emotion information combines the results of voice recognition and facial recognition when the speaker is among the characters appearing in the scene, and reflects the result of voice recognition when the speaker does not appear in the scene.
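The description does not state how the two results are combined. The Python sketch below assumes that each recognizer returns a score per emotion label and combines them by a weighted average, which is only one possible choice; the function name, the score format, and the `voice_weight` parameter are all hypothetical.

```python
def make_emotion_info(voice_scores, face_scores=None, voice_weight=0.5):
    """Return the most likely emotion label for the speaker.

    voice_scores: emotion scores from voice recognition, e.g. {"angry": 0.6}.
    face_scores: scores for the speaker's face, or None when the speaker
                 does not appear in the scene.
    """
    if face_scores is None:
        # Off-screen speaker: only the voice recognition result is available.
        combined = dict(voice_scores)
    else:
        # Assumed fusion rule: weighted average over the union of labels.
        labels = set(voice_scores) | set(face_scores)
        combined = {
            label: voice_weight * voice_scores.get(label, 0.0)
            + (1.0 - voice_weight) * face_scores.get(label, 0.0)
            for label in labels
        }
    return max(combined, key=combined.get)
```

The caller is expected to pass the face scores of the character identified as the speaker, not those of every character on screen.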

The extended subtitle data generator 3500 receives the basic subtitle data, the speaker information, and the emotion information, and generates extended subtitle data.

The extended subtitle data includes a subtitle graphic object corresponding to the character string, the start time point, the speaker character identifier, the location, and the emotion information, and is combined with the multimedia content.
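One way to picture the resulting record is the Python sketch below, which merges a basic subtitle entry with speaker and emotion information. The field names, the use of seconds for the start time point, and the `(x, y)` location format are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BasicSubtitle:
    text: str     # character string
    start: float  # start time point, assumed to be in seconds


@dataclass
class ExtendedSubtitle:
    text: str
    start: float
    speaker_id: str
    location: Optional[Tuple[int, int]]  # None when the speaker is off screen
    emotion: str


def make_extended_subtitle(basic, speaker_id, location, emotion):
    """Combine basic subtitle data with speaker and emotion information."""
    return ExtendedSubtitle(basic.text, basic.start, speaker_id, location, emotion)
```

A synthesizer could then read `location` and `emotion` from each record to decide where and in what style the subtitle graphic object is drawn.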

The subtitle graphic object is output with the text type, text size, text color, graphic object shape, graphic object size, and graphic object background color corresponding to the emotion information.

The database 3600 stores and manages voice information and emotion information of the speaker character, and facial information and emotion information of the character.

The subtitle generator and subtitle synthesizer of FIGS. 1 to 11 and the subtitle generator and subtitle synthesizer of FIGS. 12 to 13 are applicable to a video call environment.

In some embodiments, an application on a smartphone transmits video call multimedia content to a video call multimedia content management server, and the subtitle generator and subtitle synthesizer of that server can generate multimedia content synthesized with extended subtitle data.

In some other embodiments, an application on a smartphone includes programs corresponding to the subtitle generator and the subtitle synthesizer, and these programs can generate multimedia content synthesized with extended subtitle data.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination thereof. The software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other type of computer-readable recording medium well known in the art.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but a person skilled in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, it should be understood that the above-described embodiments are illustrative in all respects and not restrictive.

Claims

A subtitle generation system using graphic objects, comprising a subtitle generator that receives multimedia content including audio data and video data from a content management server and generates extended subtitle data using the multimedia content,

wherein the subtitle generator comprises:

a basic subtitle data generator that receives the audio data and generates basic subtitle data using the audio data;

a voice recognition processor that receives the audio data and generates a speaker character identifier and emotion classification information by voice recognition using the audio data;

a facial recognition processor that receives the video data and generates character information and emotion classification information by facial recognition using the video data;

a speaker information generator that receives the speaker character identifier and the character information and generates speaker information;

an emotion information generator that receives the emotion classification information by the voice recognition, the emotion classification information by the facial recognition, and the speaker information, and generates emotion information;

an extended subtitle data generator that receives the basic subtitle data, the speaker information, and the emotion information, and generates the extended subtitle data; and

a database that stores and manages voice information and emotion information of the speaker character, and facial information and emotion information of the appearing character,

wherein the basic subtitle data includes a character string detected from the audio data and information on a start time point,

wherein the speaker information is the speaker character identifier and location when the speaker is among the appearing characters, and is the speaker character identifier when the speaker does not appear in the scene,

wherein the emotion information combines the voice recognition result and the facial recognition result when the speaker is among the appearing characters, and reflects the voice recognition result when the speaker does not appear in the scene, and

wherein the extended subtitle data includes a subtitle graphic object corresponding to the character string, the start time point, the speaker character identifier, the location, and the emotion information, and is combined with the multimedia content.

The subtitle generation system using graphic objects according to claim 1,

wherein the subtitle graphic object is output with a text type, text size, text color, text thickness, graphic object shape, graphic object size, and graphic object background color corresponding to the emotion information.

The subtitle generation system using graphic objects according to claim 1,

wherein the voice recognition processor generates the speaker character identifier and the emotion classification information in cooperation with the database.

The subtitle generation system using graphic objects according to claim 1,

wherein the facial recognition processor generates the character information and the emotion classification information in cooperation with the database.

The subtitle generation system using graphic objects according to claim 1,

wherein the character information includes the number of characters, a character identifier, and a location.

The subtitle generation system using graphic objects according to claim 1,

further comprising a subtitle synthesizer,

wherein the subtitle synthesizer receives the extended subtitle data from the subtitle generator and synthesizes the multimedia content and the extended subtitle data.

A subtitle generation system using graphic objects, comprising a subtitle generator that receives multimedia content, which includes audio data and video data, and basic subtitle data from a content management server, and generates extended subtitle data using the multimedia content and the basic subtitle data,

wherein the subtitle generator comprises:

a voice recognition processor that receives the audio data and generates a speaker character identifier and emotion classification information by voice recognition using the audio data;

a facial recognition processor that receives the video data and generates character information and emotion classification information by facial recognition using the video data;

a speaker information generator that receives the speaker character identifier and the character information and generates speaker information;

an emotion information generator that receives the emotion classification information by the voice recognition, the emotion classification information by the facial recognition, and the speaker information, and generates emotion information;

an extended subtitle data generator that receives the basic subtitle data, the speaker information, and the emotion information, and generates the extended subtitle data; and

a database that stores and manages voice information and emotion information of the speaker character, and facial information and emotion information of the appearing character,

wherein the basic subtitle data includes a character string and information on a start time point,

wherein the speaker information is the speaker character identifier and location when the speaker is among the appearing characters, and is the speaker character identifier when the speaker does not appear in the scene,

wherein the emotion information combines the voice recognition result and the facial recognition result when the speaker is among the appearing characters, and reflects the voice recognition result when the speaker does not appear in the scene, and

wherein the extended subtitle data includes a subtitle graphic object corresponding to the character string, the start time point, the speaker character identifier, the location, and the emotion information, and is combined with the multimedia content.

The subtitle generation system using graphic objects according to claim 7,

wherein the subtitle graphic object is output with a text type, text size, text color, text thickness, graphic object shape, graphic object size, and graphic object background color corresponding to the emotion information.

The subtitle generation system using graphic objects according to claim 7,

wherein the voice recognition processor generates the speaker character identifier and the emotion classification information in cooperation with the database.

The subtitle generation system using graphic objects according to claim 7,

wherein the facial recognition processor generates the character information and the emotion classification information in cooperation with the database.

The subtitle generation system using graphic objects according to claim 7,

wherein the character information includes the number of characters, a character identifier, and a location.

The subtitle generation system using graphic objects according to claim 7,

further comprising a subtitle synthesizer,

wherein the subtitle synthesizer receives the extended subtitle data from the subtitle generator and synthesizes the multimedia content and the extended subtitle data.