KR20170052084A - Apparatus and method for learning foreign language speaking - Google Patents

Apparatus and method for learning foreign language speaking

Info

Publication number
KR20170052084A
Authority
KR
South Korea
Prior art keywords
unit
data
speakers
audio data
matching
Prior art date
Application number
KR1020150154051A
Other languages
Korean (ko)
Inventor
김현수
Original Assignee
주식회사 셀바스에이아이
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 셀바스에이아이
Priority to KR1020150154051A priority Critical patent/KR20170052084A/en
Publication of KR20170052084A publication Critical patent/KR20170052084A/en

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 Teaching not covered by other main groups of this subclass
    • G09B19/06 Foreign languages
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied

Abstract

According to an embodiment of the present invention, there is provided a foreign language speaking learning method comprising the steps of: acquiring audio data and caption data for a moving picture; extracting at least two speakers who pronounce the audio data; extracting a plurality of unit audio data constituting the audio data based on the at least two speakers; extracting a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data; matching the plurality of unit audio data with the at least two speakers; matching the plurality of unit audio data with the plurality of unit subtitle data; and matching the at least two speakers with the plurality of unit subtitle data using the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data.

Description

APPARATUS AND METHOD FOR LEARNING FOREIGN LANGUAGE SPEAKING

The present invention relates to an apparatus and method for learning foreign language speaking, and more particularly, to an apparatus and method for learning foreign language speaking that can easily acquire subtitles for a specific character in a moving picture without a separate extraction process.

As Korean companies become increasingly international and multinational, foreign language learning has grown in importance. In particular, as the need for global networking is emphasized, learning to speak a foreign language for direct communication with foreigners has become ever more important.

In response to this trend, various methods have been introduced for learning to speak a foreign language effectively. Among them, a method in which the learner pronounces the foreign language subtitles provided with a video while listening to the pronunciation of the speakers appearing in the video is gaining popularity. With such a method, the learner can practice speaking a foreign language while watching a foreign language video such as a movie or an animation, and can therefore learn while having fun.

While using the above method, a growing number of learners want to role-play the video as if they were a character in it. However, the foreign language subtitles provided with the video show the utterances of all speakers in the video, not the lines of a single character.

In some cases, a producer of a language education program extracts the subtitles for a single character from a foreign language video and provides them separately. In this method, however, the subtitles for the single character must be individually extracted from the foreign language subtitles, which requires a separate effort to create them.

Accordingly, there is a need for a new technique capable of providing subtitles for a specific character in a moving picture without processing such as extracting individual subtitles.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a foreign language speaking learning apparatus and method capable of acquiring subtitles for a specific character in a moving picture without processing such as extracting individual captions.

A problem to be solved by the present invention is to provide an apparatus and method for learning foreign language speaking that help a learner role-play a video during foreign language speaking learning.

A problem to be solved by the present invention is to provide a foreign language speaking learning apparatus and method that enable a learner to learn to speak a foreign language more effectively.

The problems of the present invention are not limited to the above-mentioned problems, and other problems not mentioned can be clearly understood by those skilled in the art from the following description.

According to an embodiment of the present invention, there is provided a foreign language speaking learning method comprising: acquiring audio data and caption data for a moving picture; extracting at least two speakers who pronounce the audio data; extracting a plurality of unit audio data constituting the audio data based on the at least two speakers; extracting a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data; matching the plurality of unit audio data with the at least two speakers; matching the plurality of unit audio data with the plurality of unit subtitle data; and matching the at least two speakers with the plurality of unit subtitle data using the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data.

According to another aspect of the present invention, the extracting of the plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data may be performed using the sync information of each of the plurality of unit audio data.

According to another aspect of the present invention, extracting at least two speakers and matching the plurality of unit audio data with at least two speakers may be performed using a speaker recognition technique.

According to another aspect of the present invention, the method may further include storing matching information between a plurality of unit subtitle data and at least two speakers.

According to still another aspect of the present invention, the method may further comprise reproducing the moving picture together with at least a part of the plurality of unit subtitle data, and a learning interface may be provided whenever unit subtitle data matched with one or more predetermined speakers among the at least two speakers appears during the reproduction.

According to still another aspect of the present invention, in the step of reproducing the moving picture, only the unit subtitle data matching the predetermined one or more speakers among the plurality of unit subtitle data may be provided.

According to still another aspect of the present invention, the method may further comprise receiving a selection of one or more speakers from among the at least two speakers, and the selected one or more speakers may be the predetermined one or more speakers.

According to an aspect of the present invention, there is provided a foreign language speaking learning apparatus comprising: an acquisition unit for acquiring audio data and caption data for a moving picture; an extraction unit for extracting at least two speakers who pronounce the audio data, extracting a plurality of unit audio data constituting the audio data based on the at least two speakers, and extracting a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data; and a matching unit for matching the plurality of unit audio data with the at least two speakers, matching the plurality of unit audio data with the plurality of unit subtitle data, and matching the at least two speakers with the plurality of unit subtitle data using the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data.

In order to solve the above problems, a computer-readable medium storing sets of instructions according to an embodiment of the present invention causes a computing device, when the instructions are executed, to acquire audio data and caption data for a moving picture, extract at least two speakers who pronounce the audio data, extract a plurality of unit audio data constituting the audio data based on the at least two speakers, extract a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data, match the plurality of unit audio data with the at least two speakers, match the plurality of unit audio data with the plurality of unit subtitle data, and match the at least two speakers with the plurality of unit subtitle data using the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data.

The details of other embodiments are included in the detailed description and drawings.

INDUSTRIAL APPLICABILITY According to the present invention, it is possible to acquire subtitles related to a specific character in a moving picture without processing such as extracting the subtitles one by one.

The present invention has the effect of helping a learner role-play a moving picture while learning to speak a foreign language.

The present invention has the effect of allowing a learner to learn to speak a foreign language more effectively.

The effects according to the present invention are not limited by the contents exemplified above, and more various effects are included in the specification.

FIG. 1 is a schematic block diagram of a foreign language speaking learning apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a foreign language speaking learning method according to an embodiment of the present invention.
FIGS. 3A to 3F are illustrations of examples in which the present invention may be advantageously utilized.

BRIEF DESCRIPTION OF THE DRAWINGS The advantages and features of the present invention, and the manner of achieving them, will become apparent with reference to the embodiments described below in conjunction with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in many different forms. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art, and the invention is defined only by the scope of the claims.

Although the terms first, second, and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Therefore, a first component mentioned below may also be a second component within the technical spirit of the present invention.

Like reference numerals refer to like elements throughout the specification.

Each feature of the various embodiments of the present invention may be partially or entirely combined with the others, and, as those skilled in the art will fully appreciate, various technical interlocking and operation are possible; the embodiments may be practiced independently of one another or in association with one another.

Various embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a foreign language speaking learning apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a foreign language speaking learning apparatus 100 according to an exemplary embodiment of the present invention includes an acquisition unit 110, an extraction unit 120, a matching unit 130, a control unit 140, a communication unit 150, a memory 160, and a display 170.

The acquisition unit 110 acquires audio data and caption data for a specific moving picture. To acquire the audio data for the specific moving picture, the acquisition unit 110 may analyze the file of the specific moving picture. To acquire the caption data for the specific moving picture, the acquisition unit 110 may receive a file containing the caption data from the outside.

Here, a moving picture refers to an image that is produced so that the objects on the display appear to move continuously and that is provided together with audio. The moving picture may be any video with which foreign language learning can be performed, for example, a movie, a drama, a news program, or an animation in which foreign language speakers appear.

The audio data is data about the audio emitted from the specific moving picture and may also be referred to as voice data. The caption data is displayed on a display in the form of text in a specific language so that the viewer can read the speech uttered by the speakers in the specific moving picture.

The extraction unit 120 extracts at least two speakers who pronounce the audio data in the specific moving picture. The extraction unit 120 also extracts a plurality of unit audio data constituting the audio data based on the at least two speakers. The extraction unit 120 may use a known speaker recognition technique to extract the at least two speakers and the plurality of unit audio data.

Here, a speaker refers to an object that utters speech in a specific language in the specific moving picture. Objects that produce sounds that cannot be defined in a particular language, such as wind or waves, are therefore not considered speakers. In most videos the speakers will be people, but in an animation an animal or another object may also be a speaker.

A plurality of unit audio data together constitute one piece of audio data for the specific moving picture. Sync information may be assigned to each of the plurality of unit audio data. Here, the sync information is information about the time at which each of the plurality of unit audio data is pronounced in the specific moving picture. For example, the first unit audio data may be assigned sync information indicating that it is pronounced from 0 to 20 seconds of the specific moving picture, and the second unit audio data may be assigned sync information indicating that it is pronounced from 20 to 40 seconds.

The extracting unit 120 also extracts a plurality of unit subtitle data from the subtitle data. The extracting unit 120 may use a plurality of unit audio data to extract a plurality of unit subtitle data.

A plurality of unit subtitle data together constitute one piece of subtitle data for the specific moving picture. Sync information can also be assigned to each of the plurality of unit subtitle data. For example, the first unit subtitle data may be assigned sync information indicating that it is output from 0 to 20 seconds of the specific moving picture, and the second unit subtitle data may be assigned sync information indicating that it is output from 20 to 40 seconds.
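For illustration only, the sketch below models unit audio data and unit subtitle data as simple Python records carrying the sync information described above; the class and field names are assumptions of this sketch, not part of the disclosed embodiment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitAudio:
    """One piece of unit audio data and its sync (time) information."""
    index: int                     # position within the audio data
    start_sec: float               # sync info: start time in the moving picture
    end_sec: float                 # sync info: end time in the moving picture
    speaker: Optional[str] = None  # filled in once matched to a speaker

@dataclass
class UnitSubtitle:
    """One piece of unit subtitle data and its sync (time) information."""
    index: int
    start_sec: float
    end_sec: float
    text: str

# Example from the description: the first unit spans 0-20 s of the moving
# picture, the second spans 20-40 s.
audio_units = [UnitAudio(0, 0.0, 20.0), UnitAudio(1, 20.0, 40.0)]
subtitle_units = [UnitSubtitle(0, 0.0, 20.0, "..."), UnitSubtitle(1, 20.0, 40.0, "...")]
```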

The matching unit 130 matches the plurality of unit audio data with the at least two speakers, matches the plurality of unit audio data with the plurality of unit subtitle data, and matches the at least two speakers with the plurality of unit subtitle data.

In order to match the plurality of unit subtitle data with the at least two speakers, the matching unit 130 may use the matching information that has already been generated, that is, the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data.

The control unit 140 controls the flow of data between the acquisition unit 110, the extraction unit 120, the matching unit 130, the communication unit 150, the memory 160, and the display 170. In other words, the control unit 140 controls the flow of data between the components of the foreign language speaking learning apparatus 100, or between the apparatus and the outside, so that the acquisition unit 110, the extraction unit 120, the matching unit 130, the communication unit 150, the memory 160, and the display 170 each perform their own functions.

The communication unit 150 enables the foreign language speaking learning apparatus 100 to communicate with external devices.

The memory 160 can store matching information between the plurality of unit audio data and the at least two speakers, matching information between the plurality of unit audio data and the plurality of unit subtitle data, matching information between the at least two speakers and the plurality of unit subtitle data, and the like. The memory 160 may include, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a magnetic disk device, an optical disk device, or a flash memory.

The display 170 functions to reproduce a specific moving image when a user's playback request is received. The display 170 may provide the user with caption data and a learning interface while reproducing a specific moving picture. Any device capable of playing a specific moving picture together with caption data such as a monitor or a touch screen display panel can be employed as the display 170 of the present invention.

Meanwhile, although the above components have been shown and described as separate components for convenience of explanation, the components may be merged with one another, or a single component may be divided into separate components.

FIG. 2 is a flowchart of a foreign language speaking learning method according to an embodiment of the present invention.

First, audio data and caption data for a specific moving picture are acquired (S210).

The method of acquiring the audio data for the specific moving picture is not particularly limited; the audio data can be obtained by a method commonly known in the technical field, for example, by analyzing the specific moving picture file and separating the video data from the audio data.

The method of acquiring the caption data is not particularly limited either; caption data created by the producer of the specific video or by viewers, so that the audio data of the specific video can be recognized visually, may be used. Depending on the implementation, the caption data may also be obtained by using a speech-to-text technique that automatically converts speech into text.
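As an illustration of one common way to separate the audio data from a moving picture file, the sketch below calls ffmpeg from Python; it assumes ffmpeg is installed, and the file names are placeholders.

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Separate the audio track of a moving picture file into a WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                       # drop the video stream
         "-acodec", "pcm_s16le",      # uncompressed 16-bit PCM
         "-ar", "16000", "-ac", "1",  # 16 kHz mono, convenient for speech tools
         audio_path],
        check=True,
    )

extract_audio("movie.mp4", "movie.wav")
```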

Next, at least two speakers for pronouncing the speech data are extracted (S220).

A speaker recognition technique may be used to extract the at least two speakers who pronounce the audio data. Speaker recognition is a technique that extracts information corresponding to the speech characteristics of arbitrary speakers from the voices pronounced in a moving picture, generates a plurality of speaker models using this information as training data, and then identifies and distinguishes which speaker model each pronounced voice belongs to.
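A minimal sketch of this step, assuming a separate speaker-embedding model has already turned each speech segment into a fixed-length vector; clustering those vectors into speakers with scikit-learn is an implementation choice of this sketch, not a requirement of the disclosure.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def identify_speakers(segment_embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Assign a speaker label to each speech segment.

    segment_embeddings: (n_segments, dim) array of speaker embeddings produced
    by an upstream speaker-embedding model (assumed to exist).
    Returns one integer speaker label per segment.
    """
    clustering = AgglomerativeClustering(n_clusters=n_speakers)
    return clustering.fit_predict(segment_embeddings)

# Toy example: 4 random segment embeddings clustered into 2 speakers
labels = identify_speakers(np.random.rand(4, 192), n_speakers=2)
print(labels)
```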

In step S220, all speakers who pronounce the audio data are preferably extracted. The number of speakers to be extracted from the audio data is therefore not particularly limited; it depends on how many speakers appear in the specific moving picture.

Next, a plurality of unit audio data may be extracted from the audio data based on the at least two speakers (S230).

There are various ways in which one piece of audio data can be divided into unit audio data. For example, the audio data "computer input device" may be divided into the two unit audio data "computer input" and "device". A criterion for dividing the audio data into unit audio data is therefore needed. In the present invention, the audio data is divided into unit audio data based on the at least two speakers. For example, when speaker A pronounces "computer input" and speaker B pronounces "device", the audio data "computer input device" is divided into the two unit audio data "computer input" and "device".
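As a sketch of this division rule (one piece of unit audio data per speaker turn), the function below merges consecutive same-speaker segments; the tuple layout is an assumption of this example.

```python
def split_by_speaker(segments):
    """Merge consecutive segments of the same speaker into unit audio data.

    segments: list of (start_sec, end_sec, speaker) tuples in time order.
    Returns a list of (start_sec, end_sec, speaker) entries, one per speaker turn.
    """
    units = []
    for start, end, speaker in segments:
        if units and units[-1][2] == speaker:
            units[-1] = (units[-1][0], end, speaker)   # extend the current turn
        else:
            units.append((start, end, speaker))        # a new speaker turn begins
    return units

# "computer input" by speaker A followed by "device" by speaker B
# -> two unit audio entries: [(0.0, 1.2, 'A'), (1.2, 2.0, 'B')]
print(split_by_speaker([(0.0, 0.6, "A"), (0.6, 1.2, "A"), (1.2, 2.0, "B")]))
```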

Next, a plurality of unit audio data constituting the audio data and at least two speakers may be matched (S240).

Each of the plurality of unit audio data is associated with the one of the at least two speakers who pronounces it. For example, when the first unit audio data is pronounced by speaker A, speaker A is matched with the first unit audio data, and when the second unit audio data is pronounced by speaker B, speaker B is matched with the second unit audio data.

The plurality of unit audio data and the at least two speakers are preferably matched in a one-to-many manner. For example, when the audio data is composed of first, second, third, and fourth unit audio data and the speakers extracted in step S220 are A and B, speaker A may be matched with the first and third unit audio data, and speaker B may be matched with the second and fourth unit audio data.
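The one-to-many matching can be kept in a simple dictionary from speaker to unit audio indices, as in the sketch below; the data structure is chosen for illustration only.

```python
from collections import defaultdict

def match_speakers_to_units(unit_speakers):
    """Build the one-to-many mapping speaker -> indices of unit audio data.

    unit_speakers: list where unit_speakers[i] is the speaker of the i-th
    unit audio data (each unit is pronounced by exactly one speaker).
    """
    mapping = defaultdict(list)
    for unit_index, speaker in enumerate(unit_speakers):
        mapping[speaker].append(unit_index)
    return dict(mapping)

# Four unit audio entries spoken alternately by A and B
# -> {'A': [0, 2], 'B': [1, 3]}
print(match_speakers_to_units(["A", "B", "A", "B"]))
```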

In step S240, the speaker recognition technique described above may be used to match the plurality of unit audio data with the at least two speakers.

Next, a plurality of unit subtitle data constituting subtitle data is extracted based on a plurality of unit audio data (S250).

The criterion for dividing the caption data into a plurality of unit subtitle data is whether each part of the caption data is associated with one of the plurality of unit audio data. For example, when the unit audio data "Hello" is pronounced by a specific speaker in the specific moving picture, the caption "Hello" output in the moving picture in association with that unit audio data is extracted as unit subtitle data.

At this time, in order to extract the unit subtitle data from the caption data more accurately, the sync information of each of the plurality of unit audio data can be utilized. For example, when the first unit audio data having sync information of 0 to 20 seconds and the second unit audio data having sync information of 20 to 40 seconds are extracted in step S230, the first unit subtitle data having sync information of 0 to 20 seconds and the second unit subtitle data having sync information of 20 to 40 seconds can be extracted from the caption data.
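A sketch of how the sync information can drive this extraction: each caption line is assigned to the unit audio interval it overlaps in time. The data layout is assumed for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the temporal overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def extract_unit_subtitles(audio_units, caption_lines):
    """Group caption lines into unit subtitle data using the audio sync info.

    audio_units: list of (start_sec, end_sec) per unit audio data.
    caption_lines: list of (start_sec, end_sec, text) from the caption data.
    Returns one (interval, text) pair per unit audio interval.
    """
    units = []
    for a_start, a_end in audio_units:
        texts = [t for s, e, t in caption_lines if overlap(a_start, a_end, s, e) > 0]
        units.append(((a_start, a_end), " ".join(texts)))
    return units

# Captions in 0-20 s go to the first unit, captions in 20-40 s to the second
print(extract_unit_subtitles([(0, 20), (20, 40)],
                             [(1, 5, "Hello"), (22, 26, "I am fine.")]))
```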

Next, the plurality of unit audio data and the plurality of unit subtitle data are matched (S260).

Each of the plurality of unit audio data is matched with the unit subtitle data associated with it. For example, the unit audio data pronounced as "Hello" in the specific video is matched with the unit subtitle data output as "Hello" in the video. As in step S250, the sync information of each of the plurality of unit audio data and the sync information of each of the plurality of unit subtitle data may be utilized in step S260.

Unlike the matching between a plurality of unit audio data and at least two speakers, a plurality of unit audio data and a plurality of unit subtitle data can be matched one to one with each other.

Next, the at least two speakers and the plurality of unit subtitle data are matched using the matching information between the plurality of unit audio data and the at least two speakers and the matching information between the plurality of unit audio data and the plurality of unit subtitle data (S270).

In step S270, the at least two speakers and the plurality of unit subtitle data are matched with each other via the plurality of unit audio data. For example, if the first unit audio data was matched with speaker A in step S240 and the first unit audio data was matched with the first unit subtitle data in step S260, then in step S270 speaker A is matched with the first unit subtitle data. Similarly, if the second unit audio data was matched with speaker B in step S240 and the second unit audio data was matched with the second unit subtitle data in step S260, then in step S270 speaker B is matched with the second unit subtitle data.
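This step is a simple composition of the two matchings already produced; a sketch with assumed dictionary inputs:

```python
def match_speakers_to_subtitles(audio_to_speaker, audio_to_subtitle):
    """Match speakers with unit subtitle data via the unit audio data.

    audio_to_speaker:  unit audio index -> speaker label       (result of S240)
    audio_to_subtitle: unit audio index -> unit subtitle index  (result of S260)
    Returns: speaker label -> list of unit subtitle indices (one-to-many).
    """
    speaker_to_subtitles = {}
    for audio_index, speaker in audio_to_speaker.items():
        subtitle_index = audio_to_subtitle[audio_index]
        speaker_to_subtitles.setdefault(speaker, []).append(subtitle_index)
    return speaker_to_subtitles

# Unit audio 0 -> speaker A / subtitle 0, unit audio 1 -> speaker B / subtitle 1
# => {'A': [0], 'B': [1]}
print(match_speakers_to_subtitles({0: "A", 1: "B"}, {0: 0, 1: 1}))
```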

The method of matching a plurality of unit caption data and at least two speakers may also be a one-to-many method.

Thereafter, optionally, matching information between a plurality of unit subtitle data and at least two speakers can be stored.

The matching information between the plurality of unit subtitle data and the at least two speakers appearing in the specific moving picture can be stored in the form of a file. When the matching information is stored as a file, foreign language speaking learning such as video role playing can later be performed while omitting steps S210 to S260, so that efficient foreign language speaking learning can be carried out for different learners and at different times.
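For example, the matching information could be serialized to a JSON file so that later learning sessions skip the extraction and matching steps entirely; the file name and layout below are assumptions of this sketch.

```python
import json

def save_matching_info(path, speaker_to_subtitles):
    """Store the speaker / unit subtitle matching information as a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(speaker_to_subtitles, f, ensure_ascii=False, indent=2)

def load_matching_info(path):
    """Reload previously stored matching information for a new session."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

save_matching_info("movie_matching.json", {"A": [0, 2], "B": [1]})
print(load_matching_info("movie_matching.json"))
```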

In addition, if there is a request from the user thereafter, the moving picture can be reproduced together with at least a part of the plurality of unit subtitle data.

At this time, a selection can also be made of one or more speakers from among the at least two speakers. For example, the at least two speakers extracted in step S220 may be presented to the user in the form of a list, and the user may select one or more of them. The selected one or more speakers become the predetermined one or more speakers.

As the moving picture is reproduced, a learning interface can be provided whenever unit subtitle data matched with the predetermined one or more speakers appears. The learning interface may include, for example, displaying the unit subtitle data matched with the predetermined one or more speakers in a specific color, providing an interface for recording the learner's pronunciation of that unit subtitle data, and recording and evaluating the pronounced unit subtitle data, but is not necessarily limited thereto.
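A sketch of the playback-time decision: using the stored matching, each unit subtitle either triggers the learning interface (when it belongs to a selected speaker) or is shown as an ordinary subtitle. Function and variable names are illustrative assumptions.

```python
def subtitles_for_playback(subtitle_units, speaker_to_subtitles, selected_speakers):
    """Decide, per unit subtitle, whether the learning interface should appear.

    subtitle_units: list of (start_sec, end_sec, text) in playback order.
    speaker_to_subtitles: speaker label -> list of unit subtitle indices.
    selected_speakers: the speakers the learner chose to role-play.
    Yields (text, show_learning_interface) pairs.
    """
    selected = {i for s in selected_speakers for i in speaker_to_subtitles.get(s, [])}
    for index, (_, _, text) in enumerate(subtitle_units):
        yield text, index in selected

# Speaker A was selected: A's lines trigger the learning interface, B's do not.
units = [(0.0, 1.1, "Hello"), (1.1, 3.5, "I am fine. And you"), (3.5, 5.0, "I am good")]
for text, show in subtitles_for_playback(units, {"A": [0, 2], "B": [1]}, ["A"]):
    print(text, "->", "learning interface" if show else "subtitle only")
```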

It can also be set whether all of the plurality of unit subtitle data are provided with the moving picture or only the unit subtitle data matched with the predetermined one or more speakers are provided. Such a setting may be made by the user, for example by checking either an all-subtitle view or a selected-subtitle view in a table containing check boxes.

FIGS. 3A to 3F are illustrations of examples in which the present invention may be advantageously utilized.

Referring to FIG. 3A, audio data composed of first unit audio data 312, second unit audio data 314, and third unit audio data 316 is shown.

In this example, the first unit audio data 312 is pronounced as "Hello", the second unit audio data 314 is pronounced as "I am fine. And you", and the third unit audio data 316 is pronounced as "I am good". It is assumed that, in the moving picture, the first unit audio data 312 and the third unit audio data 316 are pronounced by speaker A and the second unit audio data 314 is pronounced by speaker B.

The first unit audio data 312, the second unit audio data 314, and the third unit audio data 316 are extracted using the speaker recognition technique.

Referring to FIG. 3B, first unit subtitle data 322, second unit subtitle data 324, and third unit subtitle data 326 corresponding to the first unit audio data 312, the second unit audio data 314, and the third unit audio data 316, respectively, are extracted.

In order to extract the first unit subtitle data 322, the second unit subtitle data 324, and the third unit subtitle data 326 from the caption data, the sync information of the first unit audio data 312, the second unit audio data 314, and the third unit audio data 316 (i.e., 0.0 to 1.1 seconds, 1.1 to 3.5 seconds, and 3.5 to 5.0 seconds, respectively) is used.

Referring to FIG. 3C, the first unit audio data 312, the second unit audio data 314, and the third unit audio data 316 are matched with the speakers appearing in the moving picture.

The speakers appearing in the video were extracted in advance as A and B using the speaker recognition technique. As shown in FIG. 3C, the first unit audio data 312 and the third unit audio data 316 are matched with speaker A, and the second unit audio data 314 is matched with speaker B, using the speaker recognition technique.

Referring to FIG. 3D, the first unit subtitle data 322, the second unit subtitle data 324, and the third unit subtitle data 326 are matched with the speakers appearing in the moving picture.

Using the matching information between the unit audio data 312, 314, and 316 and the unit subtitle data 322, 324, and 326, together with the matching information between the unit audio data and the speakers, the first unit subtitle data 322 and the third unit subtitle data 326 are matched with speaker A, and the second unit subtitle data 324 is matched with speaker B.

Referring to FIG. 3E, when the user plays the moving picture 330, the user is provided with an option to select a speaker. As shown in FIG. 3E, the user selected speaker A and then continued playing the moving picture 330.

Referring to FIG. 3F, when the first unit subtitle data 322 or the third unit subtitle data 326 matched with speaker A, who was selected by the user, appears in the moving picture 330, a learning interface 340 is provided; when the second unit subtitle data 324 matched with speaker B appears, the learning interface 340 is not provided.

Each block of the block diagrams and each combination of steps of the flowcharts attached hereto may be performed by computer program instructions. These computer program instructions may be loaded into a processor of a general purpose computer, a special purpose computer, or other programmable data processing apparatus, so that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer usable or computer readable memory capable of directing a computer or other programmable data processing apparatus to implement the functions in a particular manner, so that the instructions stored in the computer usable or computer readable memory produce an article of manufacture including instruction means for performing the functions described in each block of the block diagram or each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, so that a series of operational steps are performed on the computer or other programmable data processing apparatus to produce a computer-implemented process, and the instructions executed on the computer or other programmable data processing apparatus provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In this specification, each block may represent a portion of a module, segment, or code that includes one or more executable instructions for executing the specified logical function (s). It should also be noted that in some alternative implementations, the functions mentioned in the blocks may occur out of order. For example, two blocks shown in succession may actually be executed substantially concurrently, or the blocks may sometimes be performed in reverse order according to the corresponding function.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, which is capable of reading information from, and writing information to, the storage medium. Alternatively, the storage medium may be integral with the processor. The processor and the storage medium may reside within an application specific integrated circuit (ASIC). The ASIC may reside within the user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the present invention is not limited to the disclosed embodiments and that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the embodiments disclosed herein are intended not to limit but to describe the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

100: Foreign language speaking learning apparatus
110: Acquisition unit
120: Extraction unit
130: Matching unit
140: Control unit
150: Communication unit
160: Memory
170: Display
312: first unit audio data
314: second unit audio data
316: third unit audio data
322: first unit subtitle data
324: second unit subtitle data
326: third unit subtitle data
330: Video
340: Learning interface

Claims (9)

Obtaining audio data and caption data for a moving picture;
Extracting at least two speakers for pronouncing the speech data;
Extracting a plurality of unit speech data constituting the speech data based on the at least two speakers;
Extracting a plurality of unit subtitle data constituting the subtitle data based on the plurality of unit audio data;
Matching the plurality of unit sound data and the at least two speakers;
Matching the plurality of unit audio data with the plurality of unit subtitle data; And
Matching the at least two speakers with the plurality of unit subtitle data using matching information between the plurality of unit audio data and the at least two speakers and matching information between the plurality of unit audio data and the plurality of unit subtitle data.
The method according to claim 1,
Wherein the extracting of the plurality of unit subtitle data constituting the subtitle data based on the plurality of unit audio data is performed using the sync information of each of the plurality of unit audio data.
The method according to claim 1,
Wherein the extracting of the at least two speakers and the matching of the plurality of unit audio data with the at least two speakers are performed using a speaker recognition technique.
The method according to claim 1,
Further comprising the step of storing matching information between the at least two speakers and the plurality of unit subtitle data.
The method according to claim 1,
Further comprising reproducing the moving picture together with at least a part of the plurality of unit subtitle data,
Wherein a learning interface is provided whenever unit subtitle data matched with a predetermined one or more speakers among the at least two speakers appears during the reproducing of the moving picture.
The method according to claim 1,
Wherein, in the step of playing back the moving picture, only the unit subtitle data matched with the predetermined one or more speakers among the plurality of unit subtitle data is provided.
The method according to claim 1,
Further comprising selecting one or more speakers of the at least two speakers,
Wherein the one or more speakers that have been selected become the predetermined one or more speakers.
A foreign language speaking learning apparatus comprising:
An acquisition unit for acquiring audio data and caption data for a moving picture;
An extraction unit for extracting at least two speakers for pronouncing the audio data, extracting a plurality of unit audio data constituting the audio data based on the at least two speakers, and extracting a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data; And
A matching unit for matching the plurality of unit audio data with the at least two speakers, matching the plurality of unit audio data with the plurality of unit subtitle data, and matching the at least two speakers with the plurality of unit subtitle data using matching information between the plurality of unit audio data and the at least two speakers and matching information between the plurality of unit audio data and the plurality of unit subtitle data.
A computer-readable medium storing sets of instructions,
Wherein the sets of instructions, when executed by a computing device, cause the computing device to:
Acquire audio data and caption data for a moving picture,
Extract at least two speakers for pronouncing the audio data,
Extract a plurality of unit audio data constituting the audio data based on the at least two speakers,
Extract a plurality of unit subtitle data constituting the caption data based on the plurality of unit audio data,
Match the plurality of unit audio data with the at least two speakers,
Match the plurality of unit audio data with the plurality of unit subtitle data, and
Match the at least two speakers with the plurality of unit subtitle data using matching information between the plurality of unit audio data and the at least two speakers and matching information between the plurality of unit audio data and the plurality of unit subtitle data.
KR1020150154051A 2015-11-03 2015-11-03 Apparatus and method for learning foreign language speaking KR20170052084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150154051A KR20170052084A (en) 2015-11-03 2015-11-03 Apparatus and method for learning foreign language speaking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150154051A KR20170052084A (en) 2015-11-03 2015-11-03 Apparatus and method for learning foreign language speaking

Publications (1)

Publication Number Publication Date
KR20170052084A true KR20170052084A (en) 2017-05-12

Family

ID=58740453

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150154051A KR20170052084A (en) 2015-11-03 2015-11-03 Apparatus and method for learning foreign language speaking

Country Status (1)

Country Link
KR (1) KR20170052084A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019139212A1 (en) * 2018-01-15 2019-07-18 주식회사 젠리코 Linguistic learning content providing device
KR20190093777A (en) * 2018-01-15 2019-08-12 주식회사 젠리코 System of providing educational contents for foreign languages

Similar Documents

Publication Publication Date Title
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
US20130196292A1 (en) Method and system for multimedia-based language-learning, and computer program therefor
US20080195386A1 (en) Method and a Device For Performing an Automatic Dubbing on a Multimedia Signal
KR20160111275A (en) Foreign language learning system and foreign language learning method
CN107403011B (en) Virtual reality environment language learning implementation method and automatic recording control method
KR102093938B1 (en) System for learning languages using the video selected by the learners and learning contents production method thereof
WO2018043112A1 (en) Information presentation apparatus and information presentation method
KR20110100649A (en) Method and apparatus for synthesizing speech
US8553855B2 (en) Conference support apparatus and conference support method
KR101066651B1 (en) Language learning method
CN117636897A (en) Digital human audio and video generation system
JP6641045B1 (en) Content generation system and content generation method
KR102136059B1 (en) System for generating subtitle using graphic objects
KR102396263B1 (en) A System for Smart Language Learning Services using Scripts
KR20170052084A (en) Apparatus and method for learning foreign language speaking
KR101920653B1 (en) Method and program for edcating language by making comparison sound
JP2005128177A (en) Pronunciation learning support method, learner's terminal, processing program, and recording medium with the program stored thereto
CN111160051B (en) Data processing method, device, electronic equipment and storage medium
CN113259778A (en) Method, system and storage medium for using virtual character for automatic video production
KR102011595B1 (en) Device and method for communication for the deaf person
JP2016224283A (en) Conversation training system for foreign language
CN110362675A (en) A kind of foreign language teaching content displaying method and system
KR20150055921A (en) Method and apparatus for controlling playing video
KR20170119321A (en) Device and method for providing moving picture, and computer program for executing the method
Sepielak et al. Voice-over in multilingual fiction movies in Poland