WO2019043871A1 - Display timing determination device, display timing determination method, and program - Google Patents
Display timing determination device, display timing determination method, and program
- Publication number
- WO2019043871A1 (PCT/JP2017/031368)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- order
- character information
- voice
- ratio
- timing
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/04—Synchronising
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/242—Synchronization processes, e.g. processing of PCR [Program Clock References]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/44—Receiver circuitry for the reception of television signals according to analogue transmission standards
- H04N5/60—Receiver circuitry for the reception of television signals according to analogue transmission standards for the sound signals
Definitions
- the present invention relates to a display timing determination device, a display timing determination method, and a program.
- Conventionally, there is known a technique for sequentially displaying character information (for example, subtitles) indicating the content of each voice during reproduction of voice storage data (for example, moving image data).
- Patent Document 1 describes a system for creating text information indicating the voice of a performer in a live broadcast television program and providing the created text information to the viewer.
- In the system of Patent Document 1, a television staff member who listens to the live audio creates the character information by manual input. For this reason, even if the interval between the output timings of the voices and the interval between the display timings of the pieces of character information substantially match, the display timing of the character information is delayed as a whole relative to the output timing of the voices by the time required for the manual input.
- In Patent Document 1, the delay time is estimated based on the genre code of the television program, and the display timing of the character information recorded at the time of broadcast is advanced as a whole by the delay time corresponding to the genre code.
- However, when the voice storage data is changed, the output timing of the voices changes as well, so timings that happen to be close to each other do not necessarily correspond to each other. That is, since the output timing of the voices is influenced by changes to the voice storage data, it is difficult to specify the correspondence between the voices and the character information merely by comparing the output timing of the voices with the display timing of the character information, and consequently the output timing of the voices and the display timing of the character information cannot be matched.
- The present invention has been made in view of the above problems, and an object thereof is to accurately specify the correspondence between voices and character information without being affected by changes to the voice storage data, and to match the output timing of the voices with the display timing of the character information.
- A display timing determination device according to the present invention includes: output timing acquisition means for acquiring the output timing of each of a plurality of voices that are sequentially output; first ratio acquisition means for acquiring, for each voice, a first ratio of the output timing interval for voices whose output order differs by a second order to the output timing interval for voices whose output order differs by a first order; temporary display timing acquisition means for acquiring the temporary display timing of each piece of character information that is sequentially displayed during reproduction of the plurality of voices and that indicates the content of each voice; second ratio acquisition means for acquiring, for each piece of character information, a second ratio of the temporary display timing interval for character information whose display order differs by the second order or a fourth order corresponding to the second order to the temporary display timing interval for character information whose display order differs by the first order or a third order corresponding to the first order; specifying means for specifying the correspondence between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and display timing determination means for determining the final display timing of each piece of character information based on the correspondence.
- A display timing determination method according to the present invention includes: an output timing acquisition step of acquiring the output timing of each of a plurality of voices that are sequentially output; a first ratio acquisition step of acquiring, for each voice, a first ratio of the output timing interval for voices whose output order differs by a second order to the output timing interval for voices whose output order differs by a first order; a temporary display timing acquisition step of acquiring the temporary display timing of each piece of character information that is sequentially displayed during reproduction of the plurality of voices and that indicates the content of each voice; a second ratio acquisition step of acquiring, for each piece of character information, a second ratio of the temporary display timing interval for character information whose display order differs by the second order or a fourth order corresponding to the second order to the temporary display timing interval for character information whose display order differs by the first order or a third order corresponding to the first order; a specifying step of specifying the correspondence between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and a display timing determination step of determining the final display timing of each piece of character information based on the correspondence.
- A program according to the present invention causes a computer to function as: output timing acquisition means for acquiring the output timing of each of a plurality of voices that are sequentially output; first ratio acquisition means for acquiring, for each voice, a first ratio of the output timing interval for voices whose output order differs by a second order to the output timing interval for voices whose output order differs by a first order; temporary display timing acquisition means for acquiring the temporary display timing of each piece of character information that is sequentially displayed during reproduction of the plurality of voices and that indicates the content of each voice; second ratio acquisition means for acquiring, for each piece of character information, a second ratio of the temporary display timing interval for character information whose display order differs by the second order or a fourth order corresponding to the second order to the temporary display timing interval for character information whose display order differs by the first order or a third order corresponding to the first order; specifying means for specifying the correspondence between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and display timing determination means for determining the final display timing of each piece of character information based on the correspondence.
- an information storage medium is a computer readable information storage medium storing the above program.
- In one aspect of the present invention, the specifying means specifies the correspondence based on the smallness of the difference between the first ratio of each voice and the second ratio of each piece of character information.
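As a sketch of this aspect (the function name, variable names, and tolerance value are illustrative, not from the patent), the specifying means can be approximated by listing, for each voice, the pieces of character information whose second ratio lies within a tolerance of that voice's first ratio:

```python
def candidate_matches(voice_ratios, caption_ratios, tolerance=0.05):
    """For each voice m, list the caption indices whose second ratio is
    within `tolerance` of the voice's first ratio; a small difference
    suggests that the voice and the caption correspond to each other."""
    return {m: [i for i, sr in enumerate(caption_ratios)
                if abs(fr - sr) <= tolerance]
            for m, fr in enumerate(voice_ratios)}
```

Because the ratios are unaffected by an overall shift or scaling of the timings, this comparison remains valid even after the voice storage data has been edited.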
- In one aspect of the present invention, the specifying means acquires candidates for the correspondence based on the smallness of the difference between the first ratio of each voice and the second ratio of each piece of character information; acquires, for each candidate, an overall change amount of the temporary display timing in the case where the correspondence indicated by the candidate is correct; acquires, for each candidate, the degree of deviation from the output timing of each voice in the case where the temporary display timing of each piece of character information is changed based on the change amount corresponding to the candidate; and specifies the correspondence from among the candidates based on the degree of deviation of each candidate.
- In one aspect of the present invention, the specifying means specifies the correspondence from among the candidates based on an algorithm using dynamic programming.
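The patent does not give the dynamic-programming algorithm itself; the following is one plausible sketch (all names are illustrative) that aligns the voice ratio sequence with the caption ratio sequence using an edit-distance-style DP, charging the absolute ratio difference for a match and a fixed cost for skipping an element:

```python
def align_by_ratio(voice_ratios, caption_ratios, gap_cost=1.0):
    """Edit-distance-style DP over two ratio sequences: matching voice i
    with caption j costs |ratio difference|, skipping an element costs
    gap_cost. Returns (total cost, list of matched index pairs)."""
    n, m = len(voice_ratios), len(caption_ratios)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # match voice i with caption j
                c = cost[i][j] + abs(voice_ratios[i] - caption_ratios[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[(i + 1, j + 1)] = (i, j, True)
            if i < n:            # skip voice i
                c = cost[i][j] + gap_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j] = c
                    back[(i + 1, j)] = (i, j, False)
            if j < m:            # skip caption j
                c = cost[i][j] + gap_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1] = c
                    back[(i, j + 1)] = (i, j, False)
    pairs, ij = [], (n, m)
    while ij != (0, 0):          # trace back the matched pairs
        pi, pj, matched = back[ij]
        if matched:
            pairs.append((pi, pj))
        ij = (pi, pj)
    return cost[n][m], list(reversed(pairs))
```

The gap moves let the alignment tolerate captions with no corresponding voice (titles, annotations) and voices with no caption.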
- In one aspect of the present invention, the display timing determination means randomly extracts samples from the combinations of voice and character information indicated by the correspondence, acquires an overall change amount of the temporary display timing based on the samples, acquires the degree of deviation from the output timing of each voice in the case where the temporary display timing of each piece of character information is changed based on the change amount, and repeats the extraction and the acquisition of the change amount until the degree of deviation becomes less than a threshold.
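This sample-and-evaluate loop resembles a RANSAC-style estimator. A minimal sketch, assuming a pure-shift change model (all names, the threshold, and the sample size are illustrative):

```python
import random

def estimate_shift(pairs, voice_times, caption_times,
                   threshold=0.5, max_iters=100, sample_size=3):
    """Repeat: randomly extract a sample of matched (voice, caption) pairs,
    take the mean shift over the sample as the overall change amount of the
    provisional display timings, and measure the mean deviation from the
    voice output timings after applying it; stop once the deviation drops
    below `threshold`. Returns the best shift found."""
    best_shift, best_dev = None, float("inf")
    for _ in range(max_iters):
        sample = random.sample(pairs, min(sample_size, len(pairs)))
        shift = sum(voice_times[m] - caption_times[i]
                    for m, i in sample) / len(sample)
        dev = sum(abs(voice_times[m] - (caption_times[i] + shift))
                  for m, i in pairs) / len(pairs)
        if dev < best_dev:
            best_shift, best_dev = shift, dev
        if best_dev < threshold:
            break
    return best_shift
```

Random sampling keeps a few badly matched pairs from dragging the overall change amount away from the value that fits most of the data.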
- In one aspect of the present invention, the first ratio acquisition unit acquires, for each voice, a plurality of first ratios based on a plurality of combinations of the first order and the second order; the second ratio acquisition unit acquires, for each piece of character information, a plurality of second ratios based on a plurality of combinations of the first order or the third order and the second order or the fourth order; and the specifying means specifies the correspondence based on the plurality of first ratios of each voice and the plurality of second ratios of each piece of character information.
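Using several (a, b, c, d) combinations gives each voice a small vector of ratios rather than a single value, which makes the matching more discriminative. A sketch (the combinations and helper names are illustrative):

```python
def ratio_vector(y, m, combos=((1, 0, 0, -1), (2, 0, 0, -2))):
    """Several first ratios for voice m, one per (a, b, c, d) combination,
    each computed as (y[m+a] - y[m+b]) / (y[m+c] - y[m+d])."""
    return [(y[m + a] - y[m + b]) / (y[m + c] - y[m + d])
            for a, b, c, d in combos]

def vector_distance(v1, v2):
    """Smallness of difference over all ratios: sum of absolute differences."""
    return sum(abs(p - q) for p, q in zip(v1, v2))
```

The same `ratio_vector` applied to the temporary display timings yields the second-ratio vectors, and `vector_distance` then plays the role of the "small difference" test over all ratios at once.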
- In one aspect of the present invention, the first order and the second order are the same predetermined number; the first ratio acquisition unit acquires, for each voice, the first ratio of the output timing interval for the voice whose output order is the predetermined number later to the output timing interval for the voice whose output order is the predetermined number earlier; and the second ratio acquisition unit acquires, for each piece of character information, the second ratio of the temporary display timing interval for the character information whose display order is the predetermined number later to the temporary display timing interval for the character information whose display order is the predetermined number earlier.
- In one aspect of the present invention, the predetermined number is 1; the first ratio acquisition unit acquires, for each voice, the first ratio of the output timing interval for the voice whose output order is one later to the output timing interval for the voice whose output order is one earlier; and the second ratio acquisition unit acquires, for each piece of character information, the second ratio of the temporary display timing interval for the character information whose display order is one later to the temporary display timing interval for the character information whose display order is one earlier.
- According to the present invention, it is possible to accurately specify the correspondence between voices and character information without being affected by changes to the voice storage data, and to match the output timing of the voices with the display timing of the character information.
- FIG. 1 is a diagram showing an overall configuration of a display timing determination system.
- the display timing determination system 1 includes a server 10 and a user device 20. Each of these devices may be communicably connected via a network by wire or wirelessly.
- the server 10 is a server computer, and includes, for example, a control unit 11, a storage unit 12, and a communication unit 13.
- the control unit 11 includes at least one processor.
- the control unit 11 executes processing in accordance with programs and data stored in the storage unit 12.
- the storage unit 12 includes a main storage unit and an auxiliary storage unit.
- the main storage unit is a volatile memory such as a RAM
- the auxiliary storage unit is a non-volatile memory such as a hard disk or a flash memory.
- the communication unit 13 includes a communication interface for wired communication or wireless communication, and performs data communication via, for example, a network.
- the user device 20 is a computer operated by the user, and is, for example, a personal computer, a portable information terminal (including a tablet computer), or a mobile phone (including a smart phone).
- the user device 20 includes a control unit 21, a storage unit 22, a communication unit 23, an operation unit 24, a display unit 25, and an audio output unit 26.
- the hardware configurations of the control unit 21, the storage unit 22, and the communication unit 23 may be the same as those of the control unit 11, the storage unit 12, and the communication unit 13, respectively.
- the operation unit 24 is an input device for the user to operate, and is, for example, a pointing device such as a touch panel or a mouse, a keyboard, or the like.
- the operation unit 24 transmits the operation content of the user to the control unit 21.
- the display unit 25 is, for example, a liquid crystal display unit or an organic EL display unit, and can display various images such as a moving image and a still image.
- the sound output unit 26 is, for example, a speaker, an earphone, a headphone, etc., and can output various sounds.
- the programs and data described as being stored in the storage units 12 and 22 may be supplied to them via a network.
- the hardware configuration of the server 10 and the user device 20 is not limited to the above example, and various computer hardware may be applied.
- The server 10 and the user device 20 may each include a reading unit (for example, an optical disc drive or a memory card slot) for reading a computer-readable information storage medium, and an input/output unit (for example, a USB port or a video input/output terminal) for directly connecting to an external device.
- the program or data stored in the information storage medium may be supplied to the server 10 or the user device 20 via the reading unit or the input / output unit.
- the server 10 manages a plurality of moving pictures. For example, when the user designates a moving image that the user wants to view in the user device 20, the moving image can be viewed by download distribution or streaming distribution. When a moving image is reproduced in the user device 20, subtitles are displayed on the display unit 25 together with the moving image, and the audio output unit 26 outputs the sound of the moving image.
- FIG. 2 is a diagram showing an outline of processing executed when a moving image is reproduced.
- the moving image is denoted by the reference sign Vid,
- the voice of the moving image is denoted by the reference sign Voi, and
- the subtitle is denoted by the reference sign Sub.
- the server 10 separately manages the moving image Vid and the subtitle Sub as data.
- the t axis of the subtitle Sub shown in FIG. 2 is a time axis, and the subtitles displayed on the screen are shown in chronological order.
- The subtitle Sub is not embedded in the moving image Vid, but is managed separately from the moving image Vid. When the moving image Vid is displayed, the moving image Vid and the subtitle Sub are combined as shown in FIG. 2. Thereby, when the audio Voi of the moving image is output, the subtitle Sub corresponding to the audio Voi is displayed.
- For example, the server 10 transmits the data of the moving image Vid and the data of the subtitle Sub separately to the user device 20, and the user device 20 combines the subtitle Sub with the moving image Vid. Alternatively, for example, the server 10 combines the subtitle Sub with the moving image Vid designated by the user, and then transmits the data of the combined moving image Vid to the user device 20.
- The subtitle Sub is created by an arbitrary method at an arbitrary timing before or after the moving image Vid is registered in the server 10.
- For example, the system administrator may manually input the text of the subtitle Sub and the display timing of the subtitle Sub while viewing the moving image Vid, or the text and display timing of the subtitle Sub may be generated using audio analysis.
- FIG. 3 is a diagram showing the relationship between the display timing of subtitles Sub and the output timing of audio.
- the display timing of the subtitle Sub is denoted by "x_i", and
- the output timing of the voice Voi is denoted by "y_m".
- the moving image Vid may include a portion called title credit for displaying the names of the performers, the title of the moving image, and the like.
- the title credit may be edited, and the length of the title credit of the moving image Vid for subtitle creation may be different from the length of the title credit of the moving image Vid for distribution.
- In this case, the output timing of the voice Voi and the display timing of the subtitle Sub may be shifted by the difference in the length of the title credit. In order to correct the deviation caused by this difference, it is necessary to shift the display timing of the subtitle Sub as a whole.
- the frame rate may be changed, and the frame rate of the moving image Vid for subtitle creation may be different from the frame rate of the moving image Vid for distribution.
- In this case, the intervals between the output timings of the voices Voi change, and the output timing of the voices Voi and the display timing of the subtitle Sub may be shifted.
- The subtitle Sub may also undergo a time jump due to its file format.
- the output timing of the voice Voi and the display timing of the subtitle Sub may be shifted by the time jump.
- the output timing of the voice Voi and the display timing of the subtitle Sub may be shifted due to various factors.
- Even in such cases, the timing deviation can be reduced by adjusting the display timing of the subtitle Sub so as to match the output timing of the corresponding voice Voi.
- However, since the output timing of the audio Voi is affected by changes to the moving image Vid, it is difficult to identify the correspondence between the audio Voi and the subtitle Sub even if the output timing of the audio Voi and the display timing of the subtitle Sub are directly compared.
- Therefore, the server 10 specifies the correspondence between each voice and each piece of character information by using information (a ratio, described later) that is not affected by changes to the moving image Vid, and thereby matches the output timing of the voices with the display timing of the character information.
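The reason a ratio is suitable can be checked directly: if every timing is shifted by a constant (a longer or shorter title credit) or multiplied by a constant (a frame-rate change), the ratio between neighboring intervals does not change. A small illustrative check (the timing values below are made up):

```python
def interval_ratio(times, i):
    """Ratio of the interval to the next timing over the interval from the
    previous timing (Equation 1 later in the text, with a=1, b=0, c=0, d=-1)."""
    return (times[i + 1] - times[i]) / (times[i] - times[i - 1])

# Hypothetical output timings (in seconds) of five voices.
y = [10.0, 12.0, 15.0, 21.0, 24.0]

shifted = [t + 30.0 for t in y]   # e.g. an edited title credit
scaled = [t * 1.25 for t in y]    # e.g. a frame-rate change

for i in range(1, len(y) - 1):
    assert abs(interval_ratio(y, i) - interval_ratio(shifted, i)) < 1e-9
    assert abs(interval_ratio(y, i) - interval_ratio(scaled, i)) < 1e-9
```

Absolute timings move under both edits, but the ratios survive, which is why they can anchor the voice-to-caption correspondence.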
- FIG. 4 is a functional block diagram showing an example of functions implemented by the display timing determination system.
- In the server 10, a data storage unit 100, an output timing acquisition unit 101, a first ratio acquisition unit 102, a temporary display timing acquisition unit 103, a second ratio acquisition unit 104, a specification unit 105, and a display timing determination unit 106 are realized.
- the data storage unit 100 is mainly implemented by the storage unit 12.
- the data storage unit 100 stores data for outputting voice and data for displaying character information.
- voice storage data and text information data will be described as an example of data stored in the data storage unit 100.
- the data storage unit 100 stores a combination of these.
- the voice storage data is data in which a plurality of voices which are sequentially output are stored.
- The sound may be a sound actually produced by a person and recorded with a microphone, or an artificial sound synthesized by a computer.
- a period in which the voice storage data is reproduced (a period from the start time point to the end time point of the reproduction) includes a plurality of time periods in which the sound is output.
- Each voice can be said to be a block of voice output within a certain period.
- the speech may include at least one word, may be separated for each sentence, or may be composed of a plurality of sentences.
- the speech may be a cry or a scream that does not contain particularly meaningful words.
- the sound may be an individual line in a movie, a drama, an animation, or the like, an individual utterance of a person photographed in a moving image, or an individual phrase of a song or poem.
- the voice storage data may be any data as long as voice can be output by reproducing it.
- the voice storage data may be moving image data in which one or more images and voice are stored, or voice data not including an image.
- the data format and compression format of the moving image data and audio data may be any of various known formats, for example, the avi format, the mpeg format, or the mp3 format. In the present embodiment, the case where the voice storage data is moving image data will be described.
- the voices stored in the voice storage data are output in a predetermined order and timing according to the passage of time.
- each voice is stored in voice storage data so as to be output in a predetermined order and timing.
- The audio output timing may be any timing within the period during which the audio is output; for example, it may indicate the start timing at which the audio output starts, the end timing at which the audio output ends, or a timing in between.
- the character information data is data relating to character information that is sequentially displayed during reproduction of the voice storage data and indicates the content of each voice.
- The text information indicates the audio content as at least one character, and may be, for example, text called subtitles, captions, or telops.
- the character information may be composed of only one character or may be a character string including a plurality of characters. Also, the character information may include symbols other than characters. Note that the text information does not have to completely correspond to the content of the voice up to a word or phrase, and the content of the voice and the content of the text information may have some difference.
- In the present embodiment, since the voice storage data is moving image data, the case where the character information is a subtitle of the moving image will be described.
- FIG. 5 is a diagram showing an example of data storage of character information data.
- In the character information data, a character information ID uniquely identifying each piece of character information, the display timing of the character information, and the character information itself are stored.
- the character information data may store at least the display timing of the character information.
- the character information itself may be stored in data other than the character information data.
- The display timing stored in the character information data may indicate any timing within the period in which the character information is displayed; for example, it may indicate the start timing at which the display of the character information starts, the end timing at which the display ends, or a timing in between.
- In the present embodiment, the start timing is used as the display timing, but the end timing and the length of the display time may be stored in the character information data instead.
- the length of the display time may differ for each piece of character information, or may be common to all pieces of character information.
- Since the display timing set in advance in the character information data is adjusted by the display timing determination unit 106 described later, in the present embodiment the display timing before adjustment is referred to as the "provisional display timing", and the display timing after adjustment is referred to as the "final display timing".
- the display timing determination unit 106 which will be described later, finds a final display timing that generally matches the audio output timing based on the temporary display timing.
- Each piece of character information is denoted by i (i is an integer from 1 to N_i, where N_i is the total number of pieces of character information), and the set of temporary display timings stored in the character information data is denoted by {x_i}.
- the start timing at which the display of the character information i is started is the temporary display timing x_i of the character information i.
- the text information data may include text information indicating information other than voice.
- Character information indicating information other than voice is character information for which no corresponding voice exists, and is, for example, a description, a title, or an annotation. For example, if the voice storage data is moving image data, the name of the place photographed in the moving image, the title or name of a character, the name of an actor, or the title of a movie, program, animation, or song corresponds to character information indicating information other than voice. When such character information exists, its display timing also needs to be determined, so the character information data also stores the temporary display timing of that character information.
- the output timing acquisition unit 101 is realized mainly by the control unit 11.
- the output timing acquisition unit 101 acquires the output timing of each of a plurality of sounds sequentially output.
- the output timing of each voice may be included in the voice storage data, but in the present embodiment, the output timing acquisition unit 101 analyzes the voice waveform of the voice storage data and acquires the output timing of each voice.
- the output timing acquisition unit 101 generates a spectrogram that indicates the strength of the signal for each frequency in time series based on the voice storage data.
- the spectrogram may be generated using a known sonograph, for example, a band pass filter may be used, or a short time Fourier transform may be used.
- the sonograph is a general term for an instrument equipped with a spectrogram generation algorithm.
- the output timing acquisition unit 101 generates a spectrogram by acquiring, for each frequency, a time-series change in strength (amplitude) indicated by the speech waveform.
- FIG. 6 is a diagram showing an example of a spectrogram.
- the vertical axis represents frequency and the horizontal axis represents time.
- the spectrogram often indicates the strength of the signal of each frequency in color, but here, the strength of the signal of each frequency is schematically shown in monochrome.
- the frequency band with dark dots indicates that the sound is strong
- the frequency band with thin dots indicates that the sound is weak.
- the output timing acquisition unit 101 acquires the output timing of each voice by executing deep learning (a type of machine learning) based on the spectrogram.
- Pattern information indicating the voice characteristics necessary for deep learning may be stored in the data storage unit 100. For example, when the intensity of the frequency band corresponding to voice (for example, about 100 Hz to several thousand Hz) is equal to or higher than a threshold, the output timing acquisition unit 101 determines that a voice is being output, and when the intensity of that frequency band is less than the threshold, it determines that no voice is being output.
- the frequency band corresponding to the voice may be specified in advance, and may be variable according to the input of the system administrator.
- When the state changes from one in which no voice is being output to one in which a voice is being output, the output timing acquisition unit 101 issues a voice ID uniquely identifying that voice and determines that this point is the start timing of the output of the voice identified by the voice ID. Then, when the state changes from one in which the voice is being output to one in which it is not, the output timing acquisition unit 101 determines that this point is the end timing of the output of the voice identified by the voice ID.
- The period from the start timing to the end timing is the period in which the voice identified by the voice ID is output. As described above, in the present embodiment, the case where the start timing of the voice is used as the output timing will be described, but the end timing and the length of the output period may be held instead.
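The threshold decision above can be sketched as a scan over the per-frame energy of the speech band of the spectrogram (the energy values, frame length in seconds, and threshold below are all illustrative):

```python
def voice_segments(band_energy, frame_sec, threshold):
    """Scan per-frame energy in the speech band (e.g. ~100 Hz to a few kHz)
    of a spectrogram and emit (voice_id, start, end) whenever the energy
    crosses the threshold, mirroring the start/end-timing decision above."""
    segments, start = [], None
    for frame, e in enumerate(band_energy):
        if e >= threshold and start is None:
            start = frame * frame_sec            # start timing of a new voice
        elif e < threshold and start is not None:
            segments.append((len(segments) + 1, start, frame * frame_sec))
            start = None
    if start is not None:                        # voice still active at the end
        segments.append((len(segments) + 1, start, len(band_energy) * frame_sec))
    return segments
```

The start value of each tuple plays the role of the output timing y_m; the tuple index plays the role of the voice ID.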
- Each voice is denoted by m (m is an integer from 1 to N_m, where N_m is the total number of voices), and the set of output timings is denoted by {y_m}.
- The numerical value of m serves as the voice ID and indicates the output order of the voices.
- Since the start timing is used as the output timing, the timing at which the output of the voice m starts is the output timing y_m of the voice m.
- In the present embodiment, the case where the output timing acquisition unit 101 acquires the voice storage data from the data storage unit 100 will be described. However, when the voice storage data is stored in a computer other than the server 10, the output timing acquisition unit 101 may acquire the voice storage data from that computer.
- the first ratio acquisition unit 102 is realized mainly by the control unit 11.
- For each voice, the first ratio acquisition unit 102 acquires a first ratio of the output timing interval for voices whose output order differs by a second order to the output timing interval for voices whose output order differs by a first order.
- the output order being different means that the output order is before or after.
- the first ratio acquisition unit 102 acquires the first ratio for each voice based on the output timing of each voice acquired by the output timing acquisition unit 101.
- the first order and the second order may be the same number or different numbers.
- The first ratio may be calculated based on a plurality of intervals; for example, it may be calculated by dividing one interval by another, or by substituting three or more intervals into a predetermined equation. For example, when three intervals are used, the ratio of the sum of the first and second intervals to the third interval may be used as the first ratio. For example, when four intervals are used, the ratio of the sum of the first and second intervals to the sum of the third and fourth intervals may be used as the first ratio.
- In other words, the first ratio may be calculated by substituting each interval into a predetermined equation.
- The first ratio of each voice may be calculated using intervals both before and after the output timing of that voice, using only intervals before or only intervals after it, or using an interval that includes the output timing of that voice.
- In the present embodiment, the first ratio acquisition unit 102 calculates a first ratio for each voice based on Equation 1 below. Assuming that the number of voices included in the voice storage data is N_m, the first ratio acquisition unit 102 calculates N_m first ratios based on Equation 1.
- Equation 1: F(m) = (y_(m+a) − y_(m+b)) / (y_(m+c) − y_(m+d)). The left side of Equation 1 is the first ratio of the voice m.
- Each of a, b, c, and d on the right side of Formula 1 is an arbitrary integer, and is a positive integer, a negative integer, or 0.
- the integers a or b correspond to the first order according to the invention, and the integers c or d correspond to the second order according to the invention.
- When the integer a is positive, the voice m+a is the voice whose output order is a positions after the voice m.
- On the other hand, when the integer a is negative, the voice m+a is the voice whose output order is the absolute value of a positions before the voice m.
- When the integer a is 0, the voice m+a means the voice m itself.
- The integers b, c, and d have the same meaning.
- When the integer a is 0, the integer b must be an integer other than 0, since no interval exists if b is also 0; conversely, when the integer b is 0, the integer a must be an integer other than 0. Similarly, when the integer c is 0, the integer d must be an integer other than 0, and when the integer d is 0, the integer c must be an integer other than 0.
- If the combination of the integers a and b were the same as the combination of the integers c and d, the numerator and denominator of Equation 1 would have exactly the same value and the first ratio F(m) would always be 1; therefore, the combination of a and b differs from the combination of c and d.
- The numerator on the right side of Equation 1 indicates the interval between the output timing y_(m+a) of the voice m+a, whose output order differs from the voice m by a, and the output timing y_(m+b) of the voice m+b, whose output order differs from the voice m by b.
- The denominator on the right side of Equation 1 indicates the interval between the output timing y_(m+c) of the voice m+c, whose output order differs from the voice m by c, and the output timing y_(m+d) of the voice m+d, whose output order differs from the voice m by d.
- In the present embodiment, for each voice m, the first ratio acquisition unit 102 acquires, as the first ratio F(m), the ratio of the interval between the output timing y_(m+1) of the voice m+1 and the output timing y_m of the voice m to the interval between the output timing y_(m−1) of the voice m−1 and the output timing y_m of the voice m.
- In other words, for each voice m, the first ratio acquisition unit 102 acquires, as the first ratio F(m), the ratio of the interval to the one voice m+1 after it to the interval to the one voice m−1 before it.
- That is, when the first order and the second order are the same predetermined number, the first ratio acquisition unit 102 acquires the first ratio F(m) of the output timing interval with the voice whose output order is the predetermined number later to the output timing interval with the voice whose output order is the predetermined number earlier.
- When the predetermined number is 1, the first ratio acquisition unit 102 acquires the first ratio F(m) of the output timing interval with the voice one position later in the output order to the output timing interval with the voice one position earlier.
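The ratio computation above can be sketched as follows. This is a minimal illustration, not the patented implementation itself; the function and variable names are hypothetical, and the defaults correspond to Equation 1 with the intervals to the immediately following and immediately preceding voices (a=1, b=0, c=0, d=−1).

```python
def first_ratio(timings, m, a=1, b=0, c=0, d=-1):
    """First ratio F(m) per Equation 1: the interval between
    timings[m+a] and timings[m+b] divided by the interval
    between timings[m+c] and timings[m+d]."""
    return (timings[m + a] - timings[m + b]) / (timings[m + c] - timings[m + d])

# Output timings y_m of four voices (seconds); with the defaults,
# F(m) compares the interval to the next voice with the interval
# from the previous voice.
y = [1.0, 3.0, 4.0, 6.0]
print(first_ratio(y, 1))  # (4.0 - 3.0) / (3.0 - 1.0) = 0.5
print(first_ratio(y, 2))  # (6.0 - 4.0) / (4.0 - 3.0) = 2.0
```

The same computation applies unchanged to the provisional display timings x_i to obtain the second ratio F(i) of Equation 2 described later.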
- the temporary display timing acquisition unit 103 is realized mainly by the control unit 11.
- the provisional display timing acquisition unit 103 acquires provisional display timings of character information that is sequentially displayed during reproduction of a plurality of voices and indicates the content of each voice.
- In the present embodiment, since the provisional display timing of the character information is stored in the character information data, the case where the provisional display timing acquisition unit 103 acquires the character information data from the data storage unit 100 will be described.
- If the character information data is stored in a computer other than the server 10, the character information data may be acquired from that computer.
- the second ratio acquisition unit 104 is realized mainly by the control unit 11.
- For each character information, the second ratio acquisition unit 104 acquires a second ratio of the interval of provisional display timings of character information that differs in display order by the second order or a fourth order corresponding to the second order, with respect to the interval of provisional display timings of character information that differs in display order by the first order or a third order corresponding to the first order.
- the second ratio acquisition unit 104 acquires the second ratio for each character information based on the temporary display timing of each character information acquired by the temporary display timing acquisition unit 103.
- the meaning of the first order and the second order is the same as that described for the first ratio acquisition unit 102.
- the third order is an order different from the first order
- the fourth order is an order different from the second order.
- the difference between the first order and the third order and the difference between the second order and the fourth order may be the same or different.
- For example, the first order and the second order themselves may be used to calculate the second ratio.
- Also, for example, a third order whose absolute value is smaller than the first order and a fourth order whose absolute value is smaller than the second order may be used.
- the first order and the third order may be different by the number of voices in which the corresponding character information does not exist.
- the second order and the fourth order may be different by the number of voices in which the corresponding character information does not exist.
- The second ratio may be calculated based on a plurality of intervals; for example, it may be calculated by dividing one interval by another, or by substituting three or more intervals into a predetermined equation. For example, when three intervals are used, the ratio of the sum of the first and second intervals to the third interval may be used as the second ratio. When four intervals are used, the ratio of the sum of the first and second intervals to the sum of the third and fourth intervals may be used as the second ratio. Similarly, when five or more intervals are used, the second ratio may be calculated by substituting each interval into a predetermined equation.
- The second ratio of each character information may be calculated using intervals both before and after the provisional display timing of that character information, using only an interval before or only an interval after the character information, or using an interval that includes the display timing of the character information itself.
- In the present embodiment, the second ratio acquisition unit 104 calculates a second ratio for each character information based on Equation 2 below.
- Equation 2: F(i) = (x_(i+a) − x_(i+b)) / (x_(i+c) − x_(i+d)). Assuming that the number of pieces of character information included in the character information data is N_i, the second ratio acquisition unit 104 calculates N_i second ratios based on Equation 2.
- The left side of Equation 2 is the second ratio of the character information i.
- The values a, b, c, and d on the right side of Equation 2 may be the same as those described for Equation 1, which calculates the first ratio F(m) of the voice m, or may be different values.
- the integer a or b corresponds to the first order or the third order according to the present invention
- the integer c or d corresponds to the second order or the fourth order according to the present invention.
- When a, b, c, and d of Equation 2 differ from a, b, c, and d of Equation 1, they may differ by a predetermined value, as long as they have a predetermined relationship with a, b, c, and d of Equation 1.
- All of a, b, c, and d of Equation 1 and a, b, c, and d of Equation 2 may differ, or only some of them may differ.
- Taking the integer a as an example, when the integer a is positive, the character information i+a is the character information whose display order is a positions after the character information i. On the other hand, when the integer a is negative, the character information i+a is the character information whose display order is the absolute value of a positions before the character information i. When the integer a is 0, the character information i+a means the character information i itself. The integers b, c, and d have the same meaning.
- The numerator on the right side of Equation 2 indicates the interval between the display timing x_(i+a) of the character information i+a, whose display order differs from the character information i by a, and the display timing x_(i+b) of the character information i+b, whose display order differs from the character information i by b.
- The denominator on the right side of Equation 2 indicates the interval between the display timing x_(i+c) of the character information i+c, whose display order differs from the character information i by c, and the display timing x_(i+d) of the character information i+d, whose display order differs from the character information i by d.
- In the present embodiment, for each character information i, the second ratio acquisition unit 104 acquires, as the second ratio F(i), the ratio of the interval between the display timing x_(i+1) of the character information i+1 and the display timing x_i of the character information i to the interval between the display timing x_(i−1) of the character information i−1 and the display timing x_i of the character information i.
- In other words, for each character information i, the second ratio acquisition unit 104 acquires, as the second ratio F(i), the ratio of the interval to the next character information i+1 to the interval to the previous character information i−1.
- the second ratio acquiring unit 104 proceeds by a predetermined number of display orders after the interval of the temporary display timing according to the character information whose display order is a predetermined number before.
- the second ratio F (i) of the interval of the provisional display timing relating to the character information is acquired.
- the second The ratio acquiring unit 104 calculates a second ratio of the temporary display timing interval according to the character information i + 1 after the display order to the temporary display timing interval according to the character information i-1 before the display order.
- F (i) the predetermined number
- the identifying unit 105 is mainly implemented by the control unit 11.
- the identifying unit 105 identifies the correspondence between each voice and each character information based on the first ratio F (m) of each voice m and the second ratio F (i) of each character information i.
- the correspondence relationship is information indicating which voice and which character information correspond.
- the specifying unit 105 may specify corresponding text information for each voice, or may specify a corresponding voice for each text information.
- For example, the specifying unit 105 specifies the correspondence based on how small the difference between the first ratio F(m) of each voice m and the second ratio F(i) of each character information i is. For example, the specifying unit 105 specifies the correspondence between each voice and each character information such that the difference between them is less than a threshold. In this case, the specifying unit 105 may specify, for each voice m, the character information i whose second ratio F(i) differs from the first ratio F(m) of the voice m by less than the threshold, or may specify, for each character information i, the voice m whose first ratio F(m) differs from the second ratio F(i) of the character information i by less than the threshold.
- A threshold need not be used; the specifying unit 105 may specify, for each voice m, the character information i having the second ratio F(i) with the smallest difference from the first ratio F(m) of the voice m, or may specify, for each character information i, the voice m having the first ratio F(m) with the smallest difference from the second ratio F(i) of the character information i.
- Alternatively, the specifying unit 105 may specify, for each voice m, a plurality of pieces of character information in ascending order of difference from the first ratio F(m) of the voice m and select the corresponding character information from among them, or may specify, for each character information i, a plurality of voices in ascending order of difference from the second ratio F(i) of the character information i and select the corresponding voice from among them.
- In the present embodiment, the specifying unit 105 selects, for each character information, a plurality of candidate voices that may correspond to that character information, and solves a shortest path problem with the candidate voices as nodes, thereby specifying the correspondence between each voice and each character information.
- A common Viterbi algorithm or a hidden Markov model may be used as the dynamic programming method.
- FIG. 7 is a diagram showing a method of specifying the correspondence between voice and text information.
- In FIG. 7, the provisional display timings x_i of the character information i are arranged in the horizontal direction, and in the vertical direction, the output timings of a plurality of voices (three in this case) are arranged in order of closeness of their first ratios to the second ratio F(i) of each character information i.
- The nodes are shown as a graph arranged in a lattice. Note that, contrary to the example of FIG. 7, the output timings y_m of the voices m may be arranged in the horizontal direction and the provisional display timings x_i of the character information i in the vertical direction.
- the identifying unit 105 identifies the correspondence between each voice and each character information by identifying the shortest path from the grid in FIG. 7.
- the cost of moving between nodes is defined, and a path is calculated such that the total cost is minimized.
- the identifying unit 105 identifies the shortest path based on two types of costs, the node cost and the transition cost. For example, the identifying unit 105 calculates the node cost based on Formula 3 below.
- the left side of Equation 3 is the node cost of the node corresponding to the character information i and the speech m.
- the node cost is calculated for each node shown in FIG.
- the node cost is the distance (absolute value of the difference) between the second ratio F (i) of the character information i and the first ratio F (m) of the speech m.
- The lower the node cost, the closer the ratio of the character information i is to the ratio of the voice m, and the higher the probability that the character information i and the voice m correspond to each other.
- the identifying unit 105 calculates the transition cost based on Equation 4 below.
- the left side of Formula 4 is the transition cost of the node corresponding to the character information i and the voice m.
- the transition cost is calculated for each combination of each node of the grid shown in FIG. 7 and the nearest node.
- S on the right side of Equation 4 is the scale assumed when the correspondence between the voice m and the character information i is correct. The transition cost can therefore be said to be information indicating, on the assumption that the correspondence between the voice m and the character information i is correct, the deviation from the output timing y_m of the voice m when the display timing x_i of the character information i is changed.
- The lower the transition cost, the smaller the deviation between the final display timing of the character information i and the output timing of the voice, and the higher the probability that the character information and the voice correspond to each other.
- the identifying unit 105 identifies a path that minimizes the sum of the node cost and the transition cost, and identifies the correspondence between each voice and each character information based on the nodes on the path.
- The shortest path may be identified using a formula used in the Viterbi algorithm instead of a simple sum of costs. Also, although the case of solving the shortest path problem using two costs, the node cost and the transition cost, has been described, a single cost may be used.
- As described above, the specifying unit 105 acquires candidates for the correspondence based on how small the difference between the first ratio F(m) of each voice m and the second ratio F(i) of each character information i is. In the example of FIG. 7, the nodes arranged in the lattice are an example of the candidates. Then, the specifying unit 105 acquires, for each candidate, an overall change amount of the provisional display timings assuming the correspondence indicated by that candidate is correct. For example, the overall change amount may be a shift amount (the coefficient t of Equation 5 described later), but here the scale shown in Equation 4 corresponds to the change amount. Further, the specifying unit 105 acquires, for each candidate, the degree of deviation from the output timing of each voice when the provisional display timing of each character information is changed based on the change amount corresponding to that candidate. This deviation is the left side of Equation 4.
- the identifying unit 105 identifies the correspondence among the candidates based on the degree of deviation of each candidate.
- the identifying unit 105 determines a candidate with a small degree of deviation from among the plurality of candidates as a correspondence.
- the identifying unit 105 identifies the correspondence among the candidates based on an algorithm using dynamic programming.
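The lattice search can be sketched as follows. This is a simplified, hypothetical illustration: the node cost follows Equation 3, |F(i) − F(m)|, but the transition cost here is a stand-in that merely penalizes candidate voices taken out of output order, rather than the scale-based deviation of Equation 4.

```python
def match_by_shortest_path(F_i, candidates, F_m, order_penalty=1.0):
    """Viterbi-style dynamic programming over a lattice: one column
    per character information i, whose nodes are the candidate voices
    for that i.  Returns the chosen voice index for each i."""
    n = len(F_i)
    # node cost per Equation 3: |F(i) - F(m)|
    node = [[abs(F_i[i] - F_m[m]) for m in candidates[i]] for i in range(n)]
    best = [node[0][:]]             # accumulated cost per node
    back = []                       # back-pointers per column
    for i in range(1, n):
        col, ptr = [], []
        for j, m in enumerate(candidates[i]):
            # stand-in transition cost: penalize voices out of order
            costs = [best[i - 1][k]
                     + (0.0 if candidates[i - 1][k] < m else order_penalty)
                     for k in range(len(candidates[i - 1]))]
            k = min(range(len(costs)), key=costs.__getitem__)
            col.append(costs[k] + node[i][j])
            ptr.append(k)
        best.append(col)
        back.append(ptr)
    # trace the minimum-cost path back through the lattice
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]

# Three voices, three pieces of character information, top-2 candidates
# per character information (indices into F_m).
F_m = [0.5, 2.0, 1.0]
F_i = [0.55, 1.9, 1.1]
cands = [[0, 1], [1, 2], [2, 0]]
print(match_by_shortest_path(F_i, cands, F_m))  # [0, 1, 2]
```

The real embodiment would substitute the Equation 4 transition cost and could use Viterbi log-probability formulas instead of a plain cost sum.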
- the display timing determination unit 106 is mainly implemented by the control unit 11.
- the display timing determination unit 106 determines the final display timing of each character information based on the correspondence determined by the identification unit 105.
- The display timing determination unit 106 changes the provisional display timing of each character information so as to reduce the difference between the provisional display timing of each character information and the output timing of the voice corresponding to that character information, and thereby determines the final display timing.
- the display timing determination unit 106 determines the change amount of the temporary display timing of each character information based on the correspondence determined by the identification unit 105.
- For example, the display timing determination unit 106 determines a change amount such that the deviation between the output timing y_m of each voice m and the provisional display timing x_i of each character information i becomes small, and determines the final display timing of each character information based on that change amount.
- For example, the display timing determination unit 106 acquires matching degree information based on the output timing y_m of each voice m and the provisional display timing x_i of each character information i, and changes the provisional display timings x_i so that the matching degree indicated by the matching degree information becomes high, thereby determining the final display timings.
- the matching degree information is an index indicating how much the timing matches. In other words, the matching degree information is an index indicating the degree of timing deviation.
- the matching degree information is indicated by numerical values.
- The matching degree information is calculated based on the time differences between the output timings y_m and the provisional display timings x_i; the sum of these differences may be used as the matching degree information, or a numerical value calculated from an equation that takes them as variables may be used.
- the display timing determination unit 106 acquires a plurality of change amounts of temporary display timing of each character information, and selects a change amount having the highest degree of coincidence indicated by the coincidence degree information from among the plurality of change amounts.
- This change amount is a timing movement amount of the temporary display timing, and indicates how much the time is shifted back and forth.
- The change amount may differ for each character information, or a common change amount may be used for the character information as a whole. In the present embodiment, the case where a common change amount is used for the character information as a whole will be described.
- the amount of change may be indicated by at least one numerical value, and for example, the coefficients of s and t in Equation 5 below may be used as the amount of change.
- Equation 5: T(x_i) = s · x_i + t. The left side of Equation 5 is a candidate for the final display timing.
- The coefficient s on the right side of Equation 5 is the change amount of each interval of the provisional display timings x_i.
- Changing the coefficient s stretches or shrinks the display time of the character information as a whole, so the coefficient s indicates a scale.
- The coefficient t is the amount of movement when the provisional display timings x_i are shifted as a whole. When the coefficient t is changed, the character information moves forward or backward as a whole, so the coefficient t indicates a translation amount.
- the display timing determination unit 106 acquires a plurality of combinations of the coefficients s and t that are the change amounts of the temporary display timing x_i. In each of the plurality of combinations, at least one of the coefficients s and t is different from the other combinations.
- A known sample extraction method may be used to obtain the combinations of the coefficients s and t; for example, they may be extracted based on RANSAC (Random Sample Consensus), or the combinations of the coefficients s and t may be specified in advance by the system administrator. The number of combinations obtained may be arbitrary; for example, several tens to several hundreds of samples may be extracted.
- For example, the display timing determination unit 106 specifies, from among the change amounts, the coefficients s and t such that the total sum of the differences between the changed provisional display timing T(x_i) of each character information i and the output timing y_m of the corresponding voice m becomes smallest, and determines the final display timing based on those coefficients s and t.
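The selection of the change amount can be sketched as follows (hypothetical names; the sketch assumes the affine form T(x_i) = s · x_i + t, consistent with s as scale and t as translation):

```python
def choose_change_amount(pairs, candidates):
    """pairs: list of (x_i, y_m) pairs of provisional display timing
    and matched voice output timing.  candidates: list of (s, t)
    combinations (scale, translation) for T(x_i) = s * x_i + t.
    Returns the (s, t) whose total deviation from the matched
    output timings is smallest."""
    def total_deviation(s, t):
        return sum(abs(s * x + t - y) for x, y in pairs)
    return min(candidates, key=lambda st: total_deviation(*st))

# Provisional timings lag the voice timings by exactly 1 second.
pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
cands = [(1.0, 0.0), (1.0, 1.0), (2.0, -1.0)]
print(choose_change_amount(pairs, cands))  # (1.0, 1.0): zero deviation
```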
- The method of determining the display timing is not restricted to the above example. For example, instead of acquiring an overall change amount, the display timing determination unit 106 may determine, for each character information, the display timing of that character information so as to coincide with the output timing of the corresponding voice. Also, for example, the display timing determination unit 106 may determine, for each character information, the display timing of that character information so that the difference from the output timing of the corresponding voice is less than a threshold. That is, the display timing determination unit 106 may acquire the change amount individually for each character information. In addition, for example, the display timing determination unit 106 may calculate the scale or the like so that the first ratio and the second ratio coincide with each other.
- When changing the start timing of each character information, the display timing determination unit 106 may also change the end timing and display time of each character information accordingly.
- FIG. 8 is a flow chart showing an example of processing executed by the server 10.
- the process shown in FIG. 8 is an example of the process executed by the functional block shown in FIG. 4, and is executed by the control unit 11 operating according to the program stored in the storage unit 12.
- The process shown in FIG. 8 may be executed when a predetermined condition is satisfied, for example, when the voice storage data and the character information data are registered in the server 10, or at an arbitrary timing such as on an instruction from the system administrator.
- the control unit 11 acquires voice storage data stored in the storage unit 12 (S1).
- the control unit 11 generates a spectrogram based on the voice storage data acquired in S1 (S2).
- For example, the control unit 11 performs frequency analysis on the voice storage data using a sonograph, and generates a spectrogram by acquiring the strength of the signal for each frequency in time series.
- the control unit 11 acquires the output timing y_m of each voice m based on the spectrogram generated in S2 (S3).
- In S3, the control unit 11 acquires the start timing at which output of a voice starts by searching, in chronological order from the beginning of the reproduction time of the voice storage data, for a time at which the sound intensity of a predetermined frequency band becomes equal to or greater than a threshold.
- The control unit 11 then acquires the end timing at which output of the voice ends by searching for a time at which the sound intensity of the predetermined frequency band falls below the threshold.
- When the control unit 11 acquires the end timing of the first voice, it may hold the end timing and output time of the first voice in the storage unit 12. Thereafter, the control unit 11 repeats the above processing until the end of the reproduction time of the voice storage data, incrementing the voice ID each time a voice start timing is found and holding the start timing and the like in the storage unit 12.
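The search in S3 can be sketched as follows. This is an illustration with hypothetical names that operates on a precomputed per-frame intensity of the predetermined frequency band rather than a full spectrogram:

```python
def detect_voice_segments(band_intensity, threshold, frame_sec):
    """Scan per-frame sound intensity of the predetermined frequency
    band in chronological order; a segment starts when the intensity
    reaches the threshold and ends when it falls below it.  Returns
    (start, end) timings in seconds, one per voice."""
    segments, start = [], None
    for frame, level in enumerate(band_intensity):
        if start is None and level >= threshold:
            start = frame * frame_sec          # output start timing y_m
        elif start is not None and level < threshold:
            segments.append((start, frame * frame_sec))
            start = None
    if start is not None:                      # last voice runs to the end
        segments.append((start, len(band_intensity) * frame_sec))
    return segments

# Nine frames of 0.5 s each; two stretches exceed the threshold of 5.
levels = [0, 5, 6, 0, 0, 7, 8, 8, 0]
print(detect_voice_segments(levels, 5, 0.5))
# [(0.5, 1.5), (2.5, 4.0)]
```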
- the control unit 11 calculates a first ratio F (m) for each voice based on the output timing y_m of each voice m acquired in S3 (S4). In S4, the control unit 11 calculates the first ratio F (m) of each voice m based on the above-described Equation 1 and stores the first ratio F (m) in the storage unit 12.
- the control unit 11 acquires character information data stored in the storage unit 12 (S5).
- the control unit 11 calculates a second ratio F (i) for each character information i based on the character information data acquired in S5 (S6).
- the control unit 11 calculates the second ratio F (i) of each character information i based on the above-mentioned equation 2, and holds the second ratio F (i) in the storage unit 12.
- The control unit 11 specifies the correspondence between each voice and each character information (S7).
- In S7, based on the dynamic programming algorithm described with reference to FIG. 7, the control unit 11 specifies a predetermined number of voices for each character information in ascending order of ratio difference and sets them as nodes. The node cost and the transition cost are calculated based on Equations 3 and 4 described above, and the shortest path with the smallest total cost is identified. Then, the control unit 11 specifies the correspondence between each voice and each character information based on the nodes on the shortest path.
- the control unit 11 determines the final display timing of each character information based on the correspondence between each voice and each character information (S8), and the process ends.
- In S8, the control unit 11 obtains a plurality of candidates for the coefficients s and t in Equation 5 and calculates candidates for the final display timings of the character information. Then, the control unit 11 determines the final display timings based on the coefficients s and t that minimize the total sum of the deviations from the corresponding voice output timings.
- According to the server 10 described above, the correspondence between the voice and the character information can be accurately specified by using ratios that are not affected by factors such as the overall shift amount and the scale of the timings, and the output timing of the voice and the display timing of the character information can thereby be matched.
- That is, as shown in Equation 1, even if the output timings of the voices are shifted as a whole or their scale is changed, the value of the ratio is not affected; the correspondence between voice and character information can therefore be specified by comparing feature quantities that do not change even if the voice storage data is changed.
- Since the correspondence can be specified without considering factors such as the overall shift amount and scale, the correspondence between voice and character information can be specified by relatively simple processing, which improves the processing speed of the server 10 and reduces its processing load.
- Also, candidates for the correspondence between the voice and the character information can be obtained, and the accuracy of specifying the correspondence can be enhanced.
- Also, the processing speed of the server 10 can be effectively improved and the processing load can be effectively reduced.
- Also, by using a plurality of costs such as the node cost and the transition cost instead of a single cost, the accuracy of specifying the correspondence between voice and character information can be effectively improved.
- Also, when the first order and the second order are the same predetermined number, the first ratio F(m) is acquired for each voice m based on the intervals with the voices the predetermined number before and after it, and the second ratio F(i) is acquired for each character information i based on the intervals with the character information the predetermined number before and after it, which enhances the accuracy of specifying the correspondence between voice and character information.
- Also, since the algorithm can be simplified, the processing speed of the server 10 can be improved and the processing load can be reduced.
- For example, the first ratio F(m) is acquired for each voice m based on the intervals with the one voice before and the one voice after it, and the second ratio F(i) is acquired for each character information i based on the intervals with the one character information before and the one after it.
- The specifying unit 105 may specify the character information i having the second ratio F(i) closest to the first ratio F(m) of each voice m, or the correspondence between each voice and each character information may be specified based on the method described above.
- For example, the display timing determination unit 106 randomly extracts a sample from the combinations of voice and character information indicated by the correspondence specified by the specifying unit 105.
- The sample may be a single combination or a plurality of combinations.
- A method used in RANSAC may be used to extract the sample randomly; for example, samples may be extracted based on random numbers from the set of voice and character information combinations.
- the display timing determination unit 106 acquires an overall change amount of the temporary display timing based on the sample.
- the amount of change is, for example, a combination of coefficients s and t in equation 5.
- the display timing determination unit 106 determines the combination of the coefficients s and t so that the difference between the audio output timing indicated by the sample and the temporary display timing of the character information is reduced.
- the display timing determination unit 106 acquires the degree of deviation from the output timing of each voice when the temporary display timing of each piece of character information is changed based on the change amount.
- the degree of deviation indicates how far the overall timing is off and is, for example, the sum of the differences between the temporary display timing of each piece of character information and the output timing of the corresponding voice.
- the display timing determination unit 106 repeats the sample extraction and the acquisition of the change amount until the degree of deviation falls below a threshold. Once it does, the display timing determination unit 106 stops the sample extraction and the acquisition of the change amount, changes the temporary display timing of each piece of character information based on the current change amount, and thereby determines the final display timings.
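The loop described above can be sketched roughly as follows (an assumption-laden illustration: the least-squares fit for the coefficients s and t and the absolute-difference deviation are my own choices, and Equation 5 itself is not reproduced here):

```python
import random

def fit_affine(pairs):
    """Least-squares fit of v ≈ s * c + t over (voice, caption) pairs."""
    n = len(pairs)
    sv = sum(v for v, c in pairs)
    sc = sum(c for v, c in pairs)
    scc = sum(c * c for v, c in pairs)
    svc = sum(v * c for v, c in pairs)
    denom = n * scc - sc * sc
    if denom == 0:  # degenerate sample: fall back to a pure shift
        return 1.0, (sv - sc) / n
    s = (n * svc - sc * sv) / denom
    t = (sv - s * sc) / n
    return s, t

def deviation(voices, captions, s, t):
    """Sum of |voice output timing - changed display timing|."""
    return sum(abs(v - (s * c + t)) for v, c in zip(voices, captions))

def ransac_align(voices, captions, threshold=0.01, sample_size=2,
                 max_iter=1000, seed=0):
    """Draw random samples, fit (s, t), stop when deviation < threshold."""
    rng = random.Random(seed)
    pairs = list(zip(voices, captions))
    best = (float("inf"), 1.0, 0.0)
    for _ in range(max_iter):
        s, t = fit_affine(rng.sample(pairs, sample_size))
        dev = deviation(voices, captions, s, t)
        if dev < best[0]:
            best = (dev, s, t)
        if dev < threshold:
            break
    _, s, t = best
    return [s * c + t for c in captions], (s, t)

# toy data: display timings are off by scale 2 and shift 0.5
voices = [1.0, 2.0, 4.0, 7.0]
captions = [0.5 + 2.0 * v for v in voices]
aligned, (s, t) = ransac_align(voices, captions)
```

With exactly affine toy data any two-pair sample yields a zero deviation, so the loop terminates on the first iteration; real data would need inlier-tolerant thresholds and the iteration cap shown here.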
- in this way, the correspondence between the voices and the character information can be specified, and the timing shift adjusted, by relatively simple processing, so the processing speed of the server 10 can be improved and its processing load effectively reduced.
- a plurality of first ratios F(m) may be acquired for each voice m, and a plurality of second ratios F(i) may be acquired for each piece of character information i.
- the first ratio acquisition unit 102 of modification (2) acquires, for each voice m, a plurality of first ratios F(m) based on a plurality of combinations of the first order and the second order.
- a plurality of combinations of the first order and the second order means that there are a plurality of combinations of the integers a, b, c, and d in Equation 1.
- the second ratio acquisition unit 104 acquires, for each piece of character information i, a plurality of second ratios based on a plurality of combinations of the first order or the third order and the second order or the fourth order.
- a plurality of combinations of the first order or the third order and the second order or the fourth order means that there are a plurality of combinations of the integers a, b, c, and d in Equation 2.
- the specifying unit 105 specifies the correspondence based on the plurality of first ratios F(m) of each voice m and the plurality of second ratios F(i) of each piece of character information i. For example, the specifying unit 105 may compute a numerical value from the plurality of first ratios F(m) of each voice m and a numerical value from the plurality of second ratios F(i) of each piece of character information i, and specify the correspondence based on the smallness of the difference between the two values. Also, for example, the specifying unit 105 may compute the sum of the differences between each of the plurality of first ratios F(m) of a voice m and each of the plurality of second ratios F(i) of a piece of character information i, and specify the correspondence based on the smallness of that sum.
- according to modification (2), using a plurality of ratios improves the accuracy of specifying the correspondence between the voices and the character information. For example, if there is an error in the temporary display timing of a piece of character information, an error may arise in the correspondence between the voices and the character information; using a plurality of ratios reduces the influence of such an error.
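Modification (2) might be sketched as follows (the names and the particular order combinations are my own choices; the integers a, b, c, d of Equations 1 and 2 are not reproduced here). Each item gets a tuple of ratios from several order combinations, and matching minimizes the sum of absolute differences:

```python
def multi_ratios(timings, combos=((1, 1), (2, 2))):
    """For each valid item k, a tuple of ratios
    (t[k+d] - t[k]) / (t[k] - t[k-b]) over the (b, d) combinations."""
    feats = []
    lo = max(b for b, _ in combos)
    hi = len(timings) - max(d for _, d in combos)
    for k in range(lo, hi):
        feats.append(tuple(
            (timings[k + d] - timings[k]) / (timings[k] - timings[k - b])
            for b, d in combos))
    return feats

def match_by_multi_ratio(voice_timings, caption_timings,
                         combos=((1, 1), (2, 2))):
    """Match on the smallness of the summed absolute ratio differences."""
    fv = multi_ratios(voice_timings, combos)
    fc = multi_ratios(caption_timings, combos)
    return [min(range(len(fc)),
                key=lambda i: sum(abs(a - b) for a, b in zip(fv[m], fc[i])))
            for m in range(len(fv))]

# toy data: captions are a shifted, scaled copy of the voice timings
voices = [1.0, 2.0, 4.0, 5.0, 9.0, 10.0]
captions = [3.0 + 0.5 * t for t in voices]
matches = match_by_multi_ratio(voices, captions)
```

Using several ratio combinations makes a single noisy interval less likely to flip a match, which is the robustness benefit the modification claims.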
- in the embodiment, the start timing of the voice output and the start timing of the character information display are used, but the voice output timings and the character information display timings may be specified by other information. For example, the voice output timings and the character information display timings may be specified by storing the time differences of the voice output timings in a first array and the time differences of the character information display timings in a second array.
- in this case, the display timing determination unit 106 determines the final display timing of each piece of character information by changing the time differences stored in the second array.
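The two-array representation mentioned above might look like this (a sketch; storing the first timing itself as the first array element is my own convention). Timings are converted to successive time differences, the differences can be edited, and timings are rebuilt by a running sum:

```python
def to_diffs(timings):
    """First element is the first timing itself; the rest are the gaps."""
    return [timings[0]] + [b - a for a, b in zip(timings, timings[1:])]

def from_diffs(diffs):
    """Rebuild absolute timings by accumulating the differences."""
    out, acc = [], 0.0
    for d in diffs:
        acc += d
        out.append(acc)
    return out

caption_timings = [0.5, 2.0, 3.5, 7.0]
diffs = to_diffs(caption_timings)        # [0.5, 1.5, 1.5, 3.5]
stretched = [d * 1.2 for d in diffs]     # adjust timing by editing the gaps
adjusted = from_diffs(stretched)
```

Round-tripping through the difference array is lossless, so the determination unit can work entirely on the second array and convert back only when the final display timings are needed.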
- the output timing acquisition unit 101, the first ratio acquisition unit 102, the temporary display timing acquisition unit 103, the second ratio acquisition unit 104, the specifying unit 105, and the display timing determination unit 106 may be realized by the user device 20. In this case, these functions are implemented mainly by the control unit 21, and the user device 20 corresponds to the display timing determination device according to the present invention.
- alternatively, a computer other than the server 10 and the user device 20 may realize the output timing acquisition unit 101, the first ratio acquisition unit 102, the temporary display timing acquisition unit 103, the second ratio acquisition unit 104, the specifying unit 105, and the display timing determination unit 106. In that case, that computer corresponds to the display timing determination device according to the present invention.
Abstract
Description
An example of an embodiment of a display timing determination system including a server, which is an example of the display timing determination device according to the present invention, is described below. FIG. 1 shows the overall configuration of the display timing determination system. As shown in FIG. 1, the display timing determination system 1 includes a server 10 and a user device 20, which may be communicably connected to each other via a network, by wire or wirelessly.

In this embodiment, the server 10 manages a plurality of videos. For example, when the user specifies a video to watch on the user device 20, the video can be viewed via download delivery or streaming delivery. When a video is played on the user device 20, subtitles are displayed on the display unit 25 together with the video, and the audio of the video is output from the audio output unit 26.

FIG. 4 is a functional block diagram showing an example of the functions implemented in the display timing determination system. As shown in FIG. 4, in this embodiment a data storage unit 100, an output timing acquisition unit 101, a first ratio acquisition unit 102, a temporary display timing acquisition unit 103, a second ratio acquisition unit 104, a specifying unit 105, and a display timing determination unit 106 are implemented in the server 10.

The data storage unit 100 is implemented mainly by the storage unit 12. The data storage unit 100 stores data for outputting voices and data for displaying character information. In this embodiment, voice storage data and character information data are described as examples of the data stored in the data storage unit 100. For example, when character information data is prepared for each piece of voice storage data, the data storage unit 100 stores these combinations.

The voice storage data is data in which a plurality of voices to be output in sequence are stored. A voice may be a sound actually uttered by a person and recorded with a microphone, or an artificial sound synthesized by a computer. The period during which the voice storage data is reproduced (from the start to the end of reproduction) includes a plurality of periods in which voices are output, and an individual voice can be regarded as the block of sound output within one such period. A voice may contain at least one word, may be delimited sentence by sentence, or may consist of a plurality of sentences; it may also be a shout or scream containing no particularly meaningful word. For example, a voice may be an individual line in a movie, drama, or animation, an individual remark by a person captured in a video, or an individual phrase of a song or poem.

The character information data is data concerning character information that is displayed in sequence during reproduction of the voice storage data and indicates the content of each voice. Character information expresses the content of a voice as at least one character and may be, for example, text called subtitles, captions, or telops. Character information may consist of a single character or be a character string containing a plurality of characters, and it may include symbols other than characters. The character information need not match the content of the voice word for word; a certain degree of difference between the two is acceptable. In this embodiment, since the voice storage data is described as video data, the case where the character information is the subtitles of a video is described.
The output timing acquisition unit 101 is implemented mainly by the control unit 11. The output timing acquisition unit 101 acquires the output timing of each of a plurality of voices output in sequence. The output timing of each voice may be included in the voice storage data, but in this embodiment the output timing acquisition unit 101 analyzes the waveform of the voice storage data and acquires the output timing of each voice.
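The patent does not detail the waveform analysis; as one common possibility (purely an assumption here, not the embodiment's method), the start of each voice can be estimated with a simple amplitude threshold applied after a run of silence:

```python
def voice_onsets(samples, rate, threshold=0.1, min_gap=0.3):
    """Onset times (seconds) where |amplitude| first exceeds the
    threshold after at least min_gap seconds of silence."""
    onsets = []
    silent, silent_since = True, 0
    for n, x in enumerate(samples):
        if abs(x) > threshold:
            if silent and (n - silent_since) / rate >= min_gap:
                onsets.append(n / rate)
            silent = False
        elif not silent:
            silent, silent_since = True, n
    return onsets

# toy signal at 10 Hz: 0.5 s silence, speech, 0.4 s silence, speech
signal = [0.0] * 5 + [0.5] * 3 + [0.0] * 4 + [0.5] * 2
onsets = voice_onsets(signal, rate=10)
```

A production system would typically use frame energy or a voice activity detector instead of raw sample amplitudes, but the output, a list of per-voice start times, is what the later ratio computations consume.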
The first ratio acquisition unit 102 is implemented mainly by the control unit 11. The first ratio acquisition unit 102 acquires a first ratio of the interval between output timings of voices whose output order differs by a second order to the interval between output timings of voices whose output order differs by a first order. Differing in output order means being earlier or later in that order. The first ratio acquisition unit 102 acquires the first ratio for each voice based on the output timings acquired by the output timing acquisition unit 101.

The temporary display timing acquisition unit 103 is implemented mainly by the control unit 11. The temporary display timing acquisition unit 103 acquires the temporary display timings of pieces of character information which are displayed in sequence during reproduction of the plurality of voices and each of which indicates the content of a voice. In this embodiment, the temporary display timings are stored in the character information data, so the case where the temporary display timing acquisition unit 103 obtains the character information data from the data storage unit 100 is described; when the character information data is stored on a computer other than the server 10, it may be obtained from that computer.

The second ratio acquisition unit 104 is implemented mainly by the control unit 11. The second ratio acquisition unit 104 acquires, for each piece of character information, a second ratio of the interval between temporary display timings of pieces of character information whose display order differs by the second order or a fourth order corresponding to the second order to the interval between temporary display timings of pieces of character information whose display order differs by the first order or a third order corresponding to the first order. The second ratio acquisition unit 104 acquires the second ratio for each piece of character information based on the temporary display timings acquired by the temporary display timing acquisition unit 103.

The specifying unit 105 is implemented mainly by the control unit 11. The specifying unit 105 specifies the correspondence relationship between each voice and each piece of character information based on the first ratio F(m) of each voice m and the second ratio F(i) of each piece of character information i. The correspondence relationship is information indicating which voice corresponds to which character information. For example, the specifying unit 105 may specify, for each voice, the corresponding character information, or may specify, for each piece of character information, the corresponding voice.

The display timing determination unit 106 is implemented mainly by the control unit 11. The display timing determination unit 106 determines the final display timing of each piece of character information based on the correspondence relationship specified by the specifying unit 105. The display timing determination unit 106 changes the temporary display timing of each piece of character information so as to reduce the deviation between that temporary display timing and the output timing of the corresponding voice, and thereby determines the final display timings.
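As a toy illustration of the determination step (deliberately simplified: the embodiment fits both a scale and an offset, while here only a constant shift, taken as the median per-pair offset, is applied, and the pairing is given up front):

```python
import statistics

def final_display_timings(correspondence, voice_timings, caption_timings):
    """correspondence: list of (voice index, caption index) pairs.
    Shift every caption by the median offset to its matched voice."""
    offsets = [voice_timings[m] - caption_timings[i]
               for m, i in correspondence]
    shift = statistics.median(offsets)
    return [t + shift for t in caption_timings]

# toy data: every caption is uniformly 1.0 s early
voices = [1.0, 3.0, 6.0]
captions = [0.0, 2.0, 5.0]
pairs = [(0, 0), (1, 1), (2, 2)]
final = final_display_timings(pairs, voices, captions)
```

The median makes the shift robust to a few mismatched pairs, which matters because the correspondence produced by ratio matching can contain occasional errors.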
FIG. 8 is a flow chart showing an example of the processing executed on the server 10. The processing shown in FIG. 8 is an example of the processing executed by the functional blocks shown in FIG. 4, and is carried out by the control unit 11 operating in accordance with a program stored in the storage unit 12. The processing shown in FIG. 8 may be executed whenever a predetermined condition is satisfied; for example, it may be executed when voice storage data and character information data are registered in the server 10, or at any other timing, such as on an instruction from the system administrator.

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from its spirit.
Claims (10)
- A display timing determination device comprising: output timing acquisition means for acquiring an output timing of each of a plurality of voices output in sequence; first ratio acquisition means for acquiring, for each voice, a first ratio of an interval between output timings of voices whose output order differs by a second order to an interval between output timings of voices whose output order differs by a first order; temporary display timing acquisition means for acquiring temporary display timings of pieces of character information which are displayed in sequence during reproduction of the plurality of voices and each of which indicates content of a voice; second ratio acquisition means for acquiring, for each piece of character information, a second ratio of an interval between temporary display timings of pieces of character information whose display order differs by the second order or a fourth order corresponding to the second order to an interval between temporary display timings of pieces of character information whose display order differs by the first order or a third order corresponding to the first order; specifying means for specifying a correspondence relationship between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and display timing determination means for determining a final display timing of each piece of character information based on the correspondence relationship.
- The display timing determination device according to claim 1, wherein the specifying means specifies the correspondence relationship based on a smallness of a difference between the first ratio of each voice and the second ratio of each piece of character information.
- The display timing determination device according to claim 2, wherein the specifying means acquires candidates for the correspondence relationship based on the smallness of the difference between the first ratio of each voice and the second ratio of each piece of character information, acquires, for each candidate, an overall change amount of the temporary display timings for a case where the correspondence relationship indicated by the candidate is correct, acquires, for each candidate, a degree of deviation from the output timing of each voice when the temporary display timing of each piece of character information is changed based on the change amount corresponding to the candidate, and specifies the correspondence relationship from among the candidates based on the degree of deviation of each candidate.
- The display timing determination device according to claim 3, wherein the specifying means specifies the correspondence relationship from among the candidates based on an algorithm using dynamic programming.
- The display timing determination device according to any one of claims 1 to 4, wherein the display timing determination means randomly extracts a sample from combinations of a voice and character information indicated by the correspondence relationship, acquires an overall change amount of the temporary display timings based on the sample, acquires a degree of deviation from the output timing of each voice when the temporary display timing of each piece of character information is changed based on the change amount, and repeats the extraction of the sample and the acquisition of the change amount until the degree of deviation falls below a threshold.
- The display timing determination device according to any one of claims 1 to 5, wherein the first ratio acquisition means acquires, for each voice, a plurality of the first ratios based on a plurality of combinations of the first order and the second order, the second ratio acquisition means acquires, for each piece of character information, a plurality of the second ratios based on a plurality of combinations of the first order or the third order and the second order or the fourth order, and the specifying means specifies the correspondence relationship based on the plurality of first ratios of each voice and the plurality of second ratios of each piece of character information.
- The display timing determination device according to any one of claims 1 to 6, wherein the first order and the second order are a same predetermined number, the first ratio acquisition means acquires the first ratio of an interval between output timings of voices whose output order is later by the predetermined number to an interval between output timings of voices whose output order is earlier by the predetermined number, and the second ratio acquisition means acquires the second ratio of an interval between temporary display timings of pieces of character information whose display order is later by the predetermined number to an interval between temporary display timings of pieces of character information whose display order is earlier by the predetermined number.
- The display timing determination device according to claim 7, wherein the predetermined number is 1, the first ratio acquisition means acquires the first ratio of an interval between output timings of voices one position later in output order to an interval between output timings of voices one position earlier in output order, and the second ratio acquisition means acquires the second ratio of an interval between temporary display timings of pieces of character information one position later in display order to an interval between temporary display timings of pieces of character information one position earlier in display order.
- A display timing determination method comprising: an output timing acquisition step of acquiring an output timing of each of a plurality of voices output in sequence; a first ratio acquisition step of acquiring, for each voice, a first ratio of an interval between output timings of voices whose output order differs by a second order to an interval between output timings of voices whose output order differs by a first order; a temporary display timing acquisition step of acquiring temporary display timings of pieces of character information which are displayed in sequence during reproduction of the plurality of voices and each of which indicates content of a voice; a second ratio acquisition step of acquiring, for each piece of character information, a second ratio of an interval between temporary display timings of pieces of character information whose display order differs by the second order or a fourth order corresponding to the second order to an interval between temporary display timings of pieces of character information whose display order differs by the first order or a third order corresponding to the first order; a specifying step of specifying a correspondence relationship between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and a display timing determination step of determining a final display timing of each piece of character information based on the correspondence relationship.
- A program for causing a computer to function as: output timing acquisition means for acquiring an output timing of each of a plurality of voices output in sequence; first ratio acquisition means for acquiring, for each voice, a first ratio of an interval between output timings of voices whose output order differs by a second order to an interval between output timings of voices whose output order differs by a first order; temporary display timing acquisition means for acquiring temporary display timings of pieces of character information which are displayed in sequence during reproduction of the plurality of voices and each of which indicates content of a voice; second ratio acquisition means for acquiring, for each piece of character information, a second ratio of an interval between temporary display timings of pieces of character information whose display order differs by the second order or a fourth order corresponding to the second order to an interval between temporary display timings of pieces of character information whose display order differs by the first order or a third order corresponding to the first order; specifying means for specifying a correspondence relationship between each voice and each piece of character information based on the first ratio of each voice and the second ratio of each piece of character information; and display timing determination means for determining a final display timing of each piece of character information based on the correspondence relationship.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017564153A JP6295381B1 (ja) | 2017-08-31 | 2017-08-31 | 表示タイミング決定装置、表示タイミング決定方法、及びプログラム |
EP17901343.8A EP3678376A4 (en) | 2017-08-31 | 2017-08-31 | DEVICE FOR DETERMINING THE DISPLAY TIME, METHOD FOR DETERMINING THE DISPLAY TIME AND PROGRAM |
PCT/JP2017/031368 WO2019043871A1 (ja) | 2017-08-31 | 2017-08-31 | 表示タイミング決定装置、表示タイミング決定方法、及びプログラム |
US16/091,107 US10348938B2 (en) | 2017-08-31 | 2017-08-31 | Display timing determination device, display timing determination method, and program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2017/031368 WO2019043871A1 (ja) | 2017-08-31 | 2017-08-31 | 表示タイミング決定装置、表示タイミング決定方法、及びプログラム |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019043871A1 (ja) | 2019-03-07 |
Family
ID=61629020
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2017/031368 WO2019043871A1 (ja) | 2017-08-31 | 2017-08-31 | 表示タイミング決定装置、表示タイミング決定方法、及びプログラム |
Country Status (4)
Country | Link |
---|---|
US (1) | US10348938B2 (ja) |
EP (1) | EP3678376A4 (ja) |
JP (1) | JP6295381B1 (ja) |
WO (1) | WO2019043871A1 (ja) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2021009608A (ja) * | 2019-07-02 | 2021-01-28 | キヤノン株式会社 | 画像処理装置、画像処理方法、及びプログラム |
JP2021009607A (ja) * | 2019-07-02 | 2021-01-28 | キヤノン株式会社 | 画像処理装置、画像処理方法、及びプログラム |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005045503A (ja) * | 2003-07-28 | 2005-02-17 | Toshiba Corp | 字幕信号処理装置、字幕信号処理方法及び字幕信号処理プログラム |
JP2008172421A (ja) | 2007-01-10 | 2008-07-24 | Sony Corp | 記録装置および方法、再生装置および方法、並びにプログラム |
JP2010157816A (ja) * | 2008-12-26 | 2010-07-15 | Toshiba Corp | 字幕情報作成装置、字幕情報作成方法及びプログラム |
WO2013038636A1 (ja) * | 2011-09-14 | 2013-03-21 | シャープ株式会社 | 表示装置及び録画再生装置 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6442518B1 (en) * | 1999-07-14 | 2002-08-27 | Compaq Information Technologies Group, L.P. | Method for refining time alignments of closed captions |
KR100771624B1 (ko) * | 2006-03-06 | 2007-10-30 | 엘지전자 주식회사 | 텔레비전 수신기의 언어 설정 장치 및 방법 |
US20100332225A1 (en) * | 2009-06-29 | 2010-12-30 | Nexidia Inc. | Transcript alignment |
US8843368B2 (en) * | 2009-08-17 | 2014-09-23 | At&T Intellectual Property I, L.P. | Systems, computer-implemented methods, and tangible computer-readable storage media for transcription alignment |
US8281231B2 (en) * | 2009-09-11 | 2012-10-02 | Digitalsmiths, Inc. | Timeline alignment for closed-caption text using speech recognition transcripts |
US8947596B2 (en) * | 2013-06-27 | 2015-02-03 | Intel Corporation | Alignment of closed captions |
- 2017-08-31 JP JP2017564153A patent/JP6295381B1/ja active Active
- 2017-08-31 EP EP17901343.8A patent/EP3678376A4/en active Pending
- 2017-08-31 US US16/091,107 patent/US10348938B2/en active Active
- 2017-08-31 WO PCT/JP2017/031368 patent/WO2019043871A1/ja active Application Filing
Non-Patent Citations (2)
Title |
---|
See also references of EP3678376A4 |
YAMASAKI ,HIRONOBU ET AL: "Synchronizing Closed Caption Stream with Speech Stream in Video Data", IPSJ SIG NOTES, vol. 2000, no. 19, 18 February 2000 (2000-02-18), pages 67 - 72, XP055613547 * |
Also Published As
Publication number | Publication date |
---|---|
US20190132491A1 (en) | 2019-05-02 |
EP3678376A4 (en) | 2021-04-14 |
US10348938B2 (en) | 2019-07-09 |
JP6295381B1 (ja) | 2018-03-14 |
EP3678376A1 (en) | 2020-07-08 |
JPWO2019043871A1 (ja) | 2019-11-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109754783B (zh) | 用于确定音频语句的边界的方法和装置 | |
US10037313B2 (en) | Automatic smoothed captioning of non-speech sounds from audio | |
US9159338B2 (en) | Systems and methods of rendering a textual animation | |
CN108780643A (zh) | 自动配音方法和装置 | |
US10014029B2 (en) | Video processing apparatus and method | |
JP2008533580A (ja) | オーディオ及び/又はビジュアルデータの要約 | |
CN109922268B (zh) | 视频的拍摄方法、装置、设备及存储介质 | |
WO2017062961A1 (en) | Methods and systems for interactive multimedia creation | |
US9569168B2 (en) | Automatic rate control based on user identities | |
CN104980790A (zh) | 语音字幕的生成和装置、播放方法和装置 | |
KR20050086942A (ko) | 오디오 신호의 증대 방법 및 시스템 | |
KR20090026942A (ko) | 메타데이터를 자동적으로 생성/갱신하는 멀티미디어 데이터기록 방법 및 장치 | |
US9749550B2 (en) | Apparatus and method for tuning an audiovisual system to viewer attention level | |
JP6641045B1 (ja) | コンテンツ生成システム、及びコンテンツ生成方法 | |
KR101389730B1 (ko) | 동영상 파일의 주제별 분할 위치 생성 방법 | |
JP6295381B1 (ja) | 表示タイミング決定装置、表示タイミング決定方法、及びプログラム | |
US20210390937A1 (en) | System And Method Generating Synchronized Reactive Video Stream From Auditory Input | |
JP2008047998A (ja) | 動画再生装置及び動画再生方法 | |
US20150051911A1 (en) | Method for dividing letter sequences into pronunciation units, method for representing tones of letter sequences using same, and storage medium storing video data representing the tones of letter sequences | |
JP2009237285A (ja) | 人物名付与装置および方法 | |
CN110324702B (zh) | 视频播放过程中的信息推送方法和装置 | |
JP2011254342A (ja) | 映像編集方法,映像編集装置および映像編集プログラム | |
CN112995530A (zh) | 视频的生成方法、装置及设备 | |
CN114697689A (zh) | 数据处理方法、装置、电子设备和存储介质 | |
CN114143587A (zh) | 一种用于在目标音乐视频中乐谱展示的方法与设备 |
Legal Events
- ENP — Entry into the national phase: Ref document number 2017564153; Country of ref document: JP; Kind code of ref document: A
- WWE — WIPO information: entry into national phase: Ref document number 2017901343; Country of ref document: EP
- ENP — Entry into the national phase: Ref document number 2017901343; Country of ref document: EP; Effective date: 20180927
- 121 — Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number 17901343; Country of ref document: EP; Kind code of ref document: A1
- NENP — Non-entry into the national phase: Ref country code: DE
- ENP — Entry into the national phase: Ref document number 2017901343; Country of ref document: EP; Effective date: 20200331