WO2006111041A1

WO2006111041A1 - Subtitle editing method and the device thereof

Info

Publication number: WO2006111041A1
Application number: PCT/CN2005/000535
Authority: WO
Inventors: Rong Yi
Original assignee: Rong Yi
Priority date: 2005-04-19
Filing date: 2005-04-19
Publication date: 2006-10-26

Abstract

A subtitle editing method is disclosed. The method comprises the steps of: storing an audio file and a character set in memory; dividing the character set into several load segments, each of which includes one or more character regions; converting the audio file into characteristic values that are functions of two variables, time and frequency, and displaying those values graphically; demarcating the start time and end time of each character region according to the displayed graphics; and storing the demarcated character regions. The beneficial technical effect of the invention is that, because the audio data is converted into characteristic values over time and frequency and displayed graphically, the visual information available to the editor is greatly enriched, so that the start and end positions of the character regions can sometimes be observed directly from the graphics. The invention reduces the intensity and complexity of the work and improves the precision of time demarcation, making subtitle editing easy and pleasant work.

Description

Subtitle editing production method and device

[Technical Field]

The invention relates to a method and a device for editing and producing subtitles.

[Background Art]

The prevalence of popular music, together with the popularity of multimedia playback software and equipment, has made karaoke an easy form of entertainment. When enjoying, learning, or singing a song, subtitles that appear in time with the rhythm of the music undoubtedly make the experience easier and more complete. Music files with subtitles are therefore often more popular than ordinary music files without them.

In the early karaoke subtitle production process, special hardware was required to synchronize subtitles and songs through cumbersome operations. After computer platforms became widespread, a variety of lyric subtitle editing systems appeared, which implemented karaoke special-effect subtitles and progressively displayed subtitles. The general operation of this type of editing system is to load a music file, input the text of the song without time labels, then play the music file and determine the start and end time labels of each sentence or each word through various operations. Since existing lyric subtitle editing systems are not efficient to use, making the subtitles for one song generally requires playing the song file many times, which is difficult for ordinary users.

In the entire subtitle production process, determining the time stamps of the lyrics is the core task and the one that takes the most time and effort. Earlier editing systems did not convert the audio signal into a visual graphic. The producer listened to the melody and indicated the times by pressing keys, clicking the mouse, and so on. This editing method is very subjective and depends heavily on the producer's musical literacy and even reaction speed, so the error of the final time calibration is very large, and usually only whole sentences of the lyrics can be calibrated. More advanced editing systems now convert the audio signal into a waveform. Figure 1 is a screenshot of the waveform output window of such an editing system's interface. This system allows the producer to observe the "position pointer" (the red line in the waveform in Figure 1) on the waveform display while listening to the music, in order to determine the time range of each word. Compared with the editing mode in which time division is based on hearing alone, this kind of system is a great improvement. It gives the operator some auxiliary information that reduces the difficulty of time determination and improves the accuracy of calibration. With such a system, it is already possible to time-calibrate individual characters (Chinese) or words (Western languages). However, as can be seen from Figure 1, the waveform contains very little information, and the operator cannot obtain intuitive information from it. The operator still relies on listening to the music to set the times, which requires intense concentration, involves a heavy workload, causes fatigue easily, and is not efficient. It is very difficult for an operator to distinguish the sung words from the instrumental accompaniment; and for alphabetic languages such as English and French, it is very difficult to refine the time division to a single letter or phoneme.

[Summary of the Invention]

The object of the present invention is to provide a subtitle editing method and device that greatly reduce the difficulty of time calibration of character areas and make it easy to calibrate at the level of single characters (for Chinese text) or of syllables, letters, or phonemes (for Western text).

The technical solution for achieving the above object is a method for editing and producing subtitles, comprising the following steps:

1) storing the sound file and the character set in the memory;

2) dividing the character set into a plurality of load segments, each of which includes one or more character regions;

3) calibrating the start and end times of the character area according to the sound file; this step is implemented by a process including the following steps:

3-1) converting the sound file into feature values with time and frequency as two-dimensional variables, and displaying them as a graphic;

3-2) calibrating the start and end times of the character area according to the displayed graphic and the synchronously played sound file;

4) storing the time-calibrated character area.

Preferably, the step 3-1) includes the following steps,

3-1-1) Dividing the sound file into multiple frames in time series, respectively calculating the spectrum of each frame, and obtaining the feature value with the time and frequency as two-dimensional variables;

3-1-2) Create a two-dimensional plane with time and frequency as the coordinate axes, and display and output the corresponding feature values in the form of gradients; or display and output the corresponding feature values three-dimensionally.

Preferably, in step 3-1-2), the feature values are logarithmically normalized to a set maximum value, and then the graphic output is performed in the form of a gradient or a height.
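The logarithmic normalization mentioned in this step can be sketched as follows. This is only an illustration under stated assumptions: the text does not give a formula, so the use of `log(1 + x)` (to keep zero values at zero), the helper name `log_normalize`, and the default maximum of 255 (taken from Embodiment 2) are all choices made here, not details confirmed by the source.

```python
import math


def log_normalize(values, max_level=255):
    """Compress spectral magnitudes logarithmically, then scale to 0..max_level.

    Assumption: log(1 + x) is used so that a magnitude of 0 maps to level 0.
    """
    logs = [math.log1p(v) for v in values]
    peak = max(logs)
    if peak == 0:
        # A silent frame has no peak to scale against.
        return [0 for _ in logs]
    return [round(max_level * v / peak) for v in logs]
```

The logarithm reduces the dynamic range, so weak components remain visible next to strong ones once the values are drawn as a gradient or a height.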

Preferably, when the time-calibrated character region is stored in step 4), a data structure is used that includes at least the start and end positions, in the character set, of the character string corresponding to the character region, and the playback start and stop times corresponding to that character string.

Preferably, when the start and end times of the character areas are calibrated, the start time of the next character area is determined based on the stop time of the previous character area.

Preferably, in the calibration of the start and end time of the character area, the start and end time difference of the other calibrated character area, that is, the play time thereof, is copied as the play time of the character area to be calibrated.

Preferably, in step 3), the graphic is displayed in segments or scrolled continuously, in the order of its time axis, in a display window having an operation interface, and the time axis displayed in the display window has a settable time span; when the graphic is displayed in segments, an indication mark moving along the time axis indicates the corresponding position in the synchronously played sound file; when the graphic is scrolled continuously, an indication mark at a fixed position indicates the corresponding position in the synchronously played sound file.

Further preferably, the moving speed of the indicator mark or the continuous scrolling speed of the graphic, and the playing speed of the synchronizedly played sound file are lower than the original playing speed of the sound file.

For the purpose of the present invention, a subtitle editing and creating apparatus is further provided, including: a data storage device for storing a sound file and a character set as materials;

a data processing device, configured to convert the sound file into feature values whose time and frequency are two-dimensional variables, and divide the character set into a plurality of load segments, each of the load segments including one or more character regions;

a graphic display device, configured to display and output the converted sound file in a graphic form;

an instruction receiving device, configured to receive an editing instruction issued by a user, and convert the instruction into a command signal that is identifiable by the instruction executing device;

The command execution device is configured to change a range of characters included in the character area according to the command signal, and perform calibration of the start and end time of each character area.

With the above technical solution, combined with the embodiments described in detail below, the beneficial technical effects of the present invention are as follows:

1) Converting the sound file data into feature values with time and frequency as two-dimensional variables, and displaying them as a graphic, greatly enriches the visual information available to the producer. In many cases the start and end positions of a character area can be observed directly from the graphic, which greatly reduces the difficulty and intensity of the producer's work, improves the accuracy of the time calibration, and makes subtitle editing a pleasant task.

2) Displaying a two-dimensional sound graphic in the form of gradients gives a clean interface that is more in line with the editor's usage habits, and the display program is relatively simple; a three-dimensional display provides more stereoscopic visual information and expresses the information more richly and completely.

3) The feature values are normalized so that the output graphic has peaks of the same gradient or height at each point in time, which makes it easier for the operator to see the meaningful changes.

4) Most of the time (for example, where the song has no interlude), the character areas are arranged consecutively on the time axis, so the start time of the next character area can be determined from the stop time of the previous character area, saving the operator editing steps.

5) The same melody is often sung to different lyrics within a song, with each character unit of the lyric text having the same singing time (for example, in songs with several verses, each verse repeats the same melody at the same tempo). Copying the play time of a calibrated character area as the play time of a to-be-calibrated area with the same melody can greatly save the operator's editing time and improve editing efficiency. In some cases the time calibration of all the lyrics can even be completed without playing the entire song, an advantage that no current lyric editing system has.

6) By adjusting the span of the time axis of the display window, the length of the sound file displayed on a single screen can be changed, which increases the operability of the editing system.

7) Playing the sound and the corresponding graphic at a low speed gives the operator sufficient time for recognition and editing, which can improve the editing precision and reduce the number of replays.

The present invention will be further described in detail below by way of embodiments with reference to the accompanying drawings.

[Description of the Drawings]

FIG. 1 is a screenshot of an existing subtitle editing system operation interface waveform output window.

FIG. 2 is a block diagram showing a circuit configuration of a subtitle editing and production apparatus provided by the present invention.

FIG. 3 is a flowchart of a method for creating a caption editing provided by the present invention.

Figure 4 is a two-dimensional grayscale spectrum of a lyric.

Figure 5 is a screenshot of a display window in which the spectrum is displayed in a green gradient.

Figure 6 is a screenshot of the operation interface for time calibration of the character area.

Figure 7 is a spectrum diagram displayed in three dimensions.

[Detailed Description]

Embodiment 1: A subtitle editing and production apparatus which, with reference to the circuit configuration block diagram shown in FIG. 2, includes: a storage device 1 for storing a sound file and a character set as materials; a data processing device 2 for converting the sound file into feature values with time and frequency as two-dimensional variables, and for dividing the character set into a plurality of load segments, each of which includes one or more character regions; a graphic display device 3 for displaying the converted sound file in graphical form; an instruction receiving device 4 for receiving an editing instruction issued by the user and converting it into a command signal that the instruction execution device can recognize; and an instruction execution device 5 for changing the range of characters included in a character region according to the command signal, and for calibrating the start and end times of each character region.

In the present embodiment, the data processing device 2 and the instruction execution device 5 may be realized by a computer's microprocessor reading and executing a processing program stored on a temporary or fixed storage device. In this configuration, the graphic display device 3 is a device capable of providing a display window for the processing results, such as a monitor or a projector, and the instruction receiving device 4 can generally be any device capable of transmitting recognizable commands to the microprocessor, for example a keyboard, mouse, or trackball.

Embodiment 2: A subtitle editing and production method which, with reference to the flowchart shown in FIG. 3, includes the following steps:

1) Store the sound file and character set in the memory. These material files may be loaded into the current storage device from an external storage device or over a communication network, or may be recorded directly through a voice input device. If needed, a material file should also be converted to a format that the program can edit. For example, a sound file downloaded from the network is usually in a highly compressed format and needs to be decompressed by a decoder. Generally, the sound file needs to be converted into a PCM (Pulse Code Modulation) audio data stream for subsequent data conversion processing. Character sets are usually highly compatible, and common text files (such as txt, rtf, and Word files) can generally be used.

2) The character set is divided into a plurality of load segments, each of which comprises one or more character regions (Regions). A load segment of the character set typically corresponds to one line displayed in the editing interface, usually a natural sentence of the text (for example, delimited by a line break entered with the Enter key). Regions are divided by specific rules. For example, the space character can be used as a separator between Regions (usually applied to Western languages, giving one Region per word), while for Chinese text the division is usually by single character, that is, each character is a Region. Each Region can contain one or more characters, and the user can expand or reduce the range of characters contained in a Region by a specific operation (for example, inputting merge or split instructions through an input device).

3) The start and end times of the character areas are calibrated according to the sound file. This is the core of the entire subtitle production process, and the part that takes the most time and effort. The following process provides a way to accomplish this step easily, intuitively, and with high accuracy.
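The division of the character set described in step 2) can be sketched as below. This is a minimal illustration, not the patent's implementation: the `Region` class, the helper names, and the simple CJK range check are assumptions introduced here. Lines become load segments, Western text is split on spaces into word Regions, and each Chinese character becomes its own Region.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    start: int  # index of the first character within the whole text
    end: int    # index one past the last character


def is_cjk(ch: str) -> bool:
    # Assumption: only the main CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def split_line(line: str, offset: int) -> List[Region]:
    """Split one load segment into Regions: words, or single CJK characters."""
    regions = []
    word_start = None
    for i, ch in enumerate(line):
        if is_cjk(ch) or ch.isspace():
            if word_start is not None:  # close the word in progress
                regions.append(Region(offset + word_start, offset + i))
                word_start = None
            if is_cjk(ch):
                regions.append(Region(offset + i, offset + i + 1))
        elif word_start is None:
            word_start = i
    if word_start is not None:
        regions.append(Region(offset + word_start, offset + len(line)))
    return regions


def split_text(text: str) -> List[List[Region]]:
    """Return one list of Regions per non-empty line (load segment)."""
    segments = []
    offset = 0
    for line in text.split("\n"):
        regions = split_line(line, offset)
        if regions:
            segments.append(regions)
        offset += len(line) + 1  # +1 for the line break
    return segments


def merge(regions: List[Region], i: int) -> List[Region]:
    """Merge Region i with Region i+1, as a 'merge' editing instruction might."""
    merged = Region(regions[i].start, regions[i + 1].end)
    return regions[:i] + [merged] + regions[i + 2:]
```

Storing character positions rather than copies of the text keeps each Region tied to the original character set, which matches the data structure described later for storage.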

3-1) Convert the sound file into feature values with time and frequency as two-dimensional variables, and display them as a graphic. Chinese patent No. ZL00802335.2 discloses a method that converts a sound into a graphical "sound spectrum map", statically compares two such maps, and performs speech recognition according to their degree of match. In the present invention, the sound graphic (hereinafter referred to as the "spectrum map") is used for dynamic output and observation, but the conversion of sound into a graphic can basically adopt the method of the above patent. Specifically,

3-1-1) First obtain the PCM audio data stream of the sound file (if necessary, decompress it with third-party software, etc.), with a sampling rate of 44100 Hz. The audio data is then divided into a plurality of frames at a set unit time interval (usually 512 sampling points, so the time interval is number of sampling points / sampling rate = 512/44100 ≈ 11.61 ms), so that the number of sampling points per frame N is 512. Each divided data sequence is multiplied by a Hamming (or Hanning) window function and then subjected to a fast Fourier transform, which gives the raw spectral values of each frame. These are the feature values with time and frequency as two-dimensional variables;
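The framing, windowing, and transform in step 3-1-1) can be sketched with the Python standard library alone. This is a hedged illustration: the function names are hypothetical, frames are taken without overlap (the text does not mention overlap), and a plain recursive radix-2 FFT stands in for whatever transform routine an actual implementation would use.

```python
import cmath
import math

SAMPLE_RATE = 44100  # Hz, as stated in the text
FRAME_SIZE = 512     # sampling points per frame (N)
FRAME_MS = 1000.0 * FRAME_SIZE / SAMPLE_RATE  # about 11.61 ms per frame


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(samples, frame_size=FRAME_SIZE):
    """Return one list of spectral magnitudes per frame.

    Index [t][f] addresses frame t (time) and frequency bin f, so the
    result is a set of values over time and frequency.
    """
    window = hamming(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        chunk = samples[start:start + frame_size]
        windowed = [s * w for s, w in zip(chunk, window)]
        spectrum = fft(windowed)
        # Keep the non-redundant half of the spectrum of a real signal.
        frames.append([abs(c) for c in spectrum[:frame_size // 2]])
    return frames
```

Bin `f` of a frame corresponds to a frequency of `f * SAMPLE_RATE / FRAME_SIZE` Hz, so with these settings each bin is roughly 86 Hz wide.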

3-1-2) Normalize the feature values with 255 as the maximum value, and establish a two-dimensional plane with time and frequency as the coordinate axes, in which the horizontal axis represents time and the vertical axis represents frequency. The feature values corresponding to the points on the plane are displayed in the form of a gradient, that is, they are converted into RGB color values; for example, the 256 gradient levels are converted directly into green component values, so that a color or monochrome two-dimensional image is displayed. Figure 4 shows the spectrum, converted to 256-level grayscale, of the lyric "Happy birthday to you" in the song "Happy Birthday". The resulting sound graphic is stored and, according to the editing instructions, is displayed in segments or scrolled continuously, in the order of the time axis, in a display window having an operation interface; the time axis displayed in the display window has a settable time span. Figure 5 shows a screenshot of the display window with a window span of 8000 ms, in which the spectrum is displayed with a 256-level green gradient. In this embodiment, the spectrum map is displayed in segments, and an indication mark (the white vertical line in FIG. 5) moving along the time axis indicates the corresponding time position in the synchronously played sound file. The speed at which the indication mark moves along the time axis and the playback speed of the synchronously played sound file are both 0.5 times the original playback speed of the sound file.
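The normalization and green-gradient mapping in step 3-1-2) might look like the sketch below. Two points are assumptions rather than statements from the text: each frame is normalized to its own peak (one reading of the remark that peaks reach the same gradient at every point in time; a single global maximum would be another reading), and the image is built as nested lists of RGB tuples with the lowest frequency on the bottom row. The function names are hypothetical.

```python
def normalize_frames(frames, max_level=255):
    """Scale every frame so that its own peak maps to max_level."""
    levels = []
    for frame in frames:
        peak = max(frame)
        if peak == 0:
            levels.append([0] * len(frame))  # silent frame stays black
        else:
            levels.append([int(max_level * v / peak) for v in frame])
    return levels


def to_green_image(levels):
    """Build rows of (R, G, B) pixels: x is time, y is frequency.

    Row 0 is the highest frequency bin, so low frequencies sit at the
    bottom of the picture when rows are drawn top to bottom.
    """
    n_bins = len(levels[0])
    image = []
    for b in range(n_bins - 1, -1, -1):
        image.append([(0, frame[b], 0) for frame in levels])
    return image
```

For example, `to_green_image(normalize_frames(frames))` turns a list of per-frame magnitudes into pixel rows that any drawing routine could paint column by column as the time axis advances.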

3-2) The start and end times of the character area are calibrated according to the displayed graphic and the synchronously played sound file. Figure 6 is a screenshot of the operation interface during time calibration of a character area. When editing, the program first reads a section of the spectrum map spanning 4000 ms into the right half of the display window, and the indication mark slides along the time axis from the middle of the window at the set speed. When the mark reaches the right end of the window, the program increases the time at the left end of the window by 4 s (that is, moves the spectrum displayed in the right half to the left half), reads the spectrum of the next 4000 ms, and returns the indication mark to the middle of the window to continue moving. This repeats until playback ends. The operator listens to the music played at slow speed while observing the changing position of the indication mark (the white vertical line in Fig. 6). After identifying the start time or stop time of the character area currently being calibrated (the character underlined in red in the figure), the operator pauses playback, which stops the indication mark, and then calibrates the start and end of the current character area in the display window through an input device (such as a mouse or keyboard). The area marked by the yellow line in Figure 6 is the play area of the corresponding character, the white line indicates the current playback time point, and the red vertical line is the time label currently being edited. Generally, the current playback time point can be set as the start (or stop) time of the character area being calibrated by inputting a confirmation signal. During editing, the operator can also adjust the range of characters covered by a character area, for example expanding it to a phrase or narrowing it to a single phoneme or letter. For a character area that has already been calibrated, the start and end time labels can still be changed.
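The paging behavior of the segmented display described above can be expressed as a small calculation. This is a sketch under assumptions: the 8000 ms window span and the 0.5 playback speed are taken from the embodiment, while the function names and the convention that the indicator position is a fraction of the window width are introduced here for illustration.

```python
WINDOW_SPAN_MS = 8000             # time span shown in the display window
PAGE_MS = WINDOW_SPAN_MS // 2     # the window advances half a span at a time
PLAYBACK_SPEED = 0.5              # audio and indicator run at half speed


def audio_position_ms(elapsed_wall_clock_ms, speed=PLAYBACK_SPEED):
    """Position in the sound file after a given amount of real time."""
    return elapsed_wall_clock_ms * speed


def view_state(playback_ms):
    """Return (window_left_ms, indicator_fraction) for a playback position.

    The indicator travels only through the right half of the window
    (fraction 0.5 up to 1.0). When it reaches the right edge, the left
    edge advances by PAGE_MS and the indicator returns to the middle.
    A negative left edge means the left half is still empty.
    """
    page = int(playback_ms // PAGE_MS)
    window_left = (page - 1) * PAGE_MS
    indicator = (playback_ms - window_left) / WINDOW_SPAN_MS
    return window_left, indicator
```

Keeping the already played 4000 ms in the left half lets the operator still see, and click on, the section that has just been heard.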

Since most of the time (for example, where the song has no interlude) the character areas are arranged consecutively on the time axis, the start time of the next character area can be determined based on the stop time of the previous character area. For example, the program can be set so that, if no start time has been set, the stop time of the previous character area is used as the start time of the next character area, which saves the operator editing steps. In addition, since the same melody is often sung to different lyric texts (or the same lyric text) within a song, and each character unit in the lyric text then has the same singing time, copying the play time of a calibrated character area as the play time of a to-be-calibrated character area with the same melody can greatly save the operator's editing time and improve editing efficiency. For example, in the song "Happy Birthday" being edited in Fig. 6, the first two sentences "Happy birthday to you" have the same rhythm. After each character region in the first sentence "Happy birthday to you" has been calibrated (the regions "Ha", "ppy", "bir", "th", "day", "to", and "you" respectively), it is only necessary to calibrate the start time label of the first character area "Ha" in the second sentence "Happy birthday to you", and then copy the play time of each character area of the first sentence to the corresponding character area of the second sentence. This completes the time calibration of all the character areas of the second sentence, saving considerable time and labor.
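The two shortcuts above, chaining each start time to the previous stop time and copying play times from an already calibrated line, can be combined in a few lines. This is a minimal sketch: the function name and the millisecond timings in the usage example are hypothetical and are not measurements from the patent.

```python
def apply_copied_durations(calibrated, first_start_ms):
    """Time a new line by reusing the play times of a calibrated line.

    calibrated: list of (start_ms, end_ms) pairs, one per character region
    of the line that has already been calibrated.
    first_start_ms: the only value the operator sets for the new line.
    """
    result = []
    start = first_start_ms
    for src_start, src_end in calibrated:
        end = start + (src_end - src_start)  # copy the play time
        result.append((start, end))
        start = end  # next region starts where the previous one stops
    return result


# Hypothetical timings for the first regions of a calibrated line.
first_line = [(1000, 1300), (1300, 1500), (1500, 2100)]
second_line = apply_copied_durations(first_line, 5000)
```

Here `second_line` holds the same three play times shifted to begin at 5000 ms, so the operator only has to mark one time label for the repeated melody.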

4) Store the time-calibrated character areas. For storage, a data structure is adopted that includes at least the start and end positions, in the character set, of the character string corresponding to the character area, and the playback start and stop times corresponding to that character string. The stored files can be read by corresponding subtitle display plug-ins to display subtitles in sync with playback in various playback software.
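A record holding the four fields named in step 4) could be sketched as follows. The field names, the `TimedRegion` class, and the choice of JSON as the file format are assumptions made for illustration; the text specifies only which pieces of information the structure must contain.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TimedRegion:
    char_start: int  # start position of the string in the character set
    char_end: int    # end position of the string in the character set
    start_ms: int    # playback start time of the string
    end_ms: int      # playback stop time of the string


def dump_regions(regions):
    """Serialize calibrated regions to a JSON string (format is an assumption)."""
    return json.dumps([asdict(r) for r in regions])


def load_regions(payload):
    """Rebuild the regions, as a subtitle display plug-in might when reading the file."""
    return [TimedRegion(**item) for item in json.loads(payload)]
```

Because each record refers to positions in the character set rather than storing text, the lyrics file and the timing file can be edited and stored separately.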

Embodiment 3: Another subtitle editing and production method, whose process is basically the same as that of Embodiment 2, except that in step 3-1-2) the feature values are displayed three-dimensionally in the form of heights. Figure 7 shows a three-dimensional map of the lyric "Happy birthday to you" from the song "Happy Birthday", converted to 256 levels of height, in which the height of each point is also indicated by a different color. It can be seen that a three-dimensional display provides more stereoscopic visual information, and expresses the information more richly and completely.

To aid understanding of the present invention, the above embodiments introduce a basic method of converting sound into a graphic. In actual operation, various targeted enhancements and optimizations may be applied to the original sound file or to the sound spectrum as needed, or the resulting graphic may be visually modified and edited, to obtain an effect better suited to the requirements. Such specific processing based on the present invention does not depart from the scope of the present invention.

The subtitle editing method and device provided by the invention can be used not only for editing song subtitles but also for pure speech subtitles, such as movie and TV subtitles, and as an auxiliary tool for learning foreign languages. Since this method greatly reduces the difficulty of subtitle production and improves the accuracy of time label editing, it will make "subtitle DIY" a new form of entertainment for ordinary non-professional users.

Claims


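1. A method for editing and producing subtitles, comprising the following steps:

1) storing the sound file and the character set in the memory;

2) dividing the character set into a plurality of load segments, each of which includes one or more character regions;

3) calibrating the start and end times of the character region according to the sound file;

4) storing the time-calibrated character region;

The method is characterized in that step 3) comprises the following steps:

3-1) converting the sound file into feature values with time and frequency as two-dimensional variables, and displaying them as a graphic;

3-2) calibrating the start and end times of the character region according to the displayed graphic and the synchronously played sound file.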

2. The subtitle editing method according to claim 1, wherein: when the time-calibrated character region is stored in step 4), a data structure is used that includes at least the start and end positions, in the character set, of the character string corresponding to the character region, and the playback start and stop times corresponding to the character string.

3. The subtitle editing method according to claim 1, wherein: when the start and end times of the character regions are calibrated, the start time of the following character region is determined based on the stop time of the preceding character region.

4. The subtitle editing method according to claim 2, wherein: when the start and end times of the character regions are calibrated, the start time of the following character region is determined based on the stop time of the preceding character region.

5. The subtitle editing method according to claim 1, wherein: when the start and end times of the character region are calibrated, the difference between the start and end times of another, already calibrated character region, that is, its play time, is copied as the play time of the character region to be calibrated.

6. The subtitle editing method according to claim 2, wherein: when the start and end times of the character region are calibrated, the difference between the start and end times of another, already calibrated character region, that is, its play time, is copied as the play time of the character region to be calibrated.

7. The subtitle editing method according to claim 3, wherein: when the start and end times of the character region are calibrated, the difference between the start and end times of another, already calibrated character region, that is, its play time, is copied as the play time of the character region to be calibrated.

8. The subtitle editing method according to claim 4, wherein: when the start and end times of the character region are calibrated, the difference between the start and end times of another, already calibrated character region, that is, its play time, is copied as the play time of the character region to be calibrated.

9. The subtitle editing method according to any one of claims 1 to 8, wherein step 3-1) comprises the following steps:

3-1-1) dividing the sound file into multiple frames in time sequence, calculating the spectrum of each frame, and obtaining the feature values with time and frequency as two-dimensional variables;

3-1-2) creating a two-dimensional plane with time and frequency as the coordinate axes, and displaying and outputting the corresponding feature values in the form of gradients; or displaying and outputting the corresponding feature values three-dimensionally in the form of heights.

10. The subtitle editing method according to claim 9, wherein: in step 3-1-2), the logarithm of the feature values is taken and normalized to a set maximum value, and the graphic is then output in the form of a gradient or a height.

11. The subtitle editing method according to any one of claims 1 to 8, wherein: in step 3), the graphic is displayed in segments or scrolled continuously, in the order of its time axis, in a display window having an operation interface, and the time axis displayed in the display window has a settable time span; when the graphic is displayed in segments, an indication mark moving along the time axis indicates the corresponding position in the synchronously played sound file; when the graphic is scrolled continuously, an indication mark at a fixed position indicates the corresponding position in the synchronously played sound file.

12. The subtitle editing method according to claim 9, wherein: in step 3), the graphic is displayed in segments or scrolled continuously, in the order of its time axis, in a display window having an operation interface, and the time axis displayed in the display window has a settable time span; when the graphic is displayed in segments, an indication mark moving along the time axis indicates the corresponding position in the synchronously played sound file; when the graphic is scrolled continuously, an indication mark at a fixed position indicates the corresponding position in the synchronously played sound file.

13. The subtitle editing method according to claim 10, wherein: in step 3), the graphic is displayed in segments or scrolled continuously, in the order of its time axis, in a display window having an operation interface, and the time axis displayed in the display window has a settable time span; when the graphic is displayed in segments, an indication mark moving along the time axis indicates the corresponding position in the synchronously played sound file; when the graphic is scrolled continuously, an indication mark at a fixed position indicates the corresponding position in the synchronously played sound file.

14. The subtitle editing method according to claim 11, wherein: the moving speed of the indication mark or the continuous scrolling speed of the graphic, and the playing speed of the synchronously played sound file, are lower than the original playback speed of the sound file.

15. The subtitle editing method according to claim 12, wherein: the moving speed of the indication mark or the continuous scrolling speed of the graphic, and the playing speed of the synchronously played sound file, are lower than the original playback speed of the sound file.

16. The subtitle editing method according to claim 13, wherein: the moving speed of the indication mark or the continuous scrolling speed of the graphic, and the playing speed of the synchronously played sound file, are lower than the original playback speed of the sound file.

17. A subtitle editing production device, comprising:

a data storage device for storing a sound file and a character set as a material;