TW201832222A - Method and apparatus for automatically generating dubbing characters, and electronic device - Google Patents

Method and apparatus for automatically generating dubbing characters, and electronic device Download PDF

Info

Publication number
TW201832222A
Authority
TW
Taiwan
Prior art keywords
text
basic semantic
semantic unit
information
basic
Prior art date
Application number
TW106126945A
Other languages
Chinese (zh)
Other versions
TWI749045B (en)
Inventor
陽鶴翔
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW201832222A
Application granted granted Critical
Publication of TWI749045B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

A method and apparatus for automatically generating voice-over text (dubbing characters), and an electronic device. The method comprises: recognizing audio information to acquire start and end time information for each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing it to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units into which the start and end time information has been recorded to generate voice-over text corresponding to the audio information. With this method a dynamic lyric file can be produced without manual work, which improves production efficiency, reduces production cost, and simplifies the production procedure.

Description

Method, apparatus and electronic device for automatically generating voice-over text

The present application relates to the field of computer technology, and in particular to a method for automatically generating voice-over text; it also relates to an apparatus for automatically generating voice-over text and to an electronic device.

With the development of audio processing technology, users have come to expect more from the listening experience: they want an audio playback application not only to play an audio file but also to display, in synchronization, the lyric file corresponding to it. Synchronized lyric display lets listeners read the lyrics of an audio file while hearing the melody, and it has become one of the essential features of audio playback applications and players.

To meet this demand, the lyrics used for synchronized display are currently produced mainly by hand: an operator listens to the audio while annotating the lyrics with timestamps, generates a corresponding lyric file for every audio file in the audio database, and imports the generated lyric files into the playback application, so that the matching lyric file is displayed in synchronization when an audio file is played.

Under the existing production scheme for synchronized lyrics, then, generating lyric files manually is a cumbersome process that is both inefficient and costly. As audio libraries keep growing, the drawbacks of the manual approach become increasingly severe.

The present application provides a method for automatically generating voice-over text that solves the above problems in the prior art. It also relates to an apparatus for automatically generating voice-over text and to an electronic device.

An embodiment of the present application provides a method for automatically generating voice-over text. The method includes: recognizing audio information and acquiring start and end time information for each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing the text information to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information has been recorded, to generate voice-over text corresponding to the audio information.

Optionally, processing the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information includes: for each single sentence in the text information, acquiring the text basic semantic units that make up the sentence; determining the start and end time information of the sentence from the start and end time information recorded in the acquired text basic semantic units; and integrating the sentences whose start and end time information has been determined into voice-over text that corresponds to the audio information and carries start and end time information for every sentence.

Optionally, when acquiring, for each single sentence in the text information, the text basic semantic units that make up the sentence, if at least two sets of start and end time information are recorded in a text basic semantic unit, candidate text basic semantic unit groups for the sentence are formed, one for each set of start and end time information.

Optionally, after the step of forming the candidate text basic semantic unit groups of the sentence according to the number of sets of start and end time information, the method includes: filtering, according to a predetermined calculation method, all the start and end time information of the text basic semantic units in each group, and determining the text basic semantic unit group that makes up the sentence.

Optionally, the predetermined calculation method includes: within each text basic semantic unit group, computing the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit; obtaining, for each group, the sum of these gaps between start times and end times; and taking the sum of the gaps as the error value of that group.

Optionally, filtering all the start and end time information of the text basic semantic units in each group and determining the text basic semantic unit group that makes up the sentence includes: filtering the groups and retaining those whose error value is below a preset threshold.

Optionally, after the step of retaining the text basic semantic unit groups whose error value is below the preset threshold, the method includes: counting, within each retained group, how many times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and selecting the group with the largest count.

Optionally, recognizing the text information to obtain text basic semantic units includes: recognizing, from the text information, each word in the order in which it appears within each sentence, to obtain the text basic semantic units in the text information.

Optionally, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is likewise set to a null value.

Optionally, after the step of determining the text basic semantic unit group that makes up the sentence, the method includes: estimating, in a predetermined estimation manner, start and end time information for the text basic semantic units whose value is null.

Optionally, the predetermined estimation manner includes: computing the average duration of the text basic semantic units in the text basic semantic unit group; writing the end time of the preceding text basic semantic unit into the start time of the null-valued text basic semantic unit; and writing that end time plus the average duration into the end time of the null-valued text basic semantic unit.
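By way of illustration only, this optional estimation step can be sketched in Python as follows; the dict-based unit representation, the zero start time for a leading null unit, and all names are assumptions made for the example, not details fixed by the application:

    # Sketch of the estimation step for text basic semantic units whose
    # timing is null. Assumes at least one unit in the group carries timing.
    def fill_missing_times(units):
        timed = [u for u in units if u["start"] is not None]
        avg = sum(u["end"] - u["start"] for u in timed) / len(timed)  # average duration
        for i, u in enumerate(units):
            if u["start"] is None:
                # start = end time of the preceding unit (0 if there is none)
                u["start"] = units[i - 1]["end"] if i > 0 else 0
                # end = that start time plus the average duration
                u["end"] = u["start"] + avg
        return units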

Correspondingly, an embodiment of the present application further provides an apparatus for automatically generating voice-over text. The apparatus includes: an audio recognition unit configured to recognize audio information and acquire start and end time information for each recognized audio basic semantic unit; a text recognition unit configured to acquire the text information corresponding to the audio information and recognize it to obtain text basic semantic units; a time writing unit configured to record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and a voice-over text generating unit configured to process the text basic semantic units in which the start and end time information has been recorded and generate voice-over text corresponding to the audio information.

Optionally, the voice-over text generating unit includes: a text semantics acquiring subunit configured to acquire, for each single sentence in the text information, the text basic semantic units that make up the sentence; a time information determining subunit configured to determine the start and end time information of the sentence from the start and end time information recorded in the acquired text basic semantic units; and a voice-over text generating subunit configured to integrate the sentences whose start and end time information has been determined into voice-over text that corresponds to the audio information and carries start and end time information for every sentence.

Optionally, the text semantics acquiring subunit is specifically configured so that, when acquiring the text basic semantic units that make up a single sentence of the text information, if at least two sets of start and end time information are recorded in a text basic semantic unit, candidate text basic semantic unit groups for the sentence are formed, one for each set of start and end time information.

Optionally, the apparatus for automatically generating voice-over text further includes: a text semantics screening subunit configured to filter, after the candidate text basic semantic unit groups of the sentence have been formed according to the number of sets of start and end time information, all the start and end time information of the text basic semantic units in each group according to a predetermined calculation method, and to determine the text basic semantic unit group that makes up the sentence.

Optionally, the text semantics screening subunit includes: an error calculating subunit configured to compute, within each text basic semantic unit group, the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these gaps for each group, and take the sum as the error value of that group.

Optionally, the text semantics screening subunit further includes: a filtering subunit configured to filter the text basic semantic unit groups and retain those whose error value is below a preset threshold.

Optionally, the text semantics screening subunit further includes: a counting subunit configured to count, after the groups whose error value is below the preset threshold have been retained, how many times within each retained group the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to select the group with the largest count.

Optionally, the text recognition unit is specifically configured to recognize, from the text information, each word in the order in which it appears within each sentence, to obtain the text basic semantic units in the text information.

Optionally, the time writing unit is specifically configured so that, when the start and end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start and end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is likewise set to a null value.

Optionally, the apparatus for automatically generating voice-over text further includes: a time estimating unit configured to estimate, after the text basic semantic unit group that makes up the sentence has been determined, start and end time information for the null-valued text basic semantic units in a predetermined estimation manner.

Optionally, the time estimating unit includes: an average time calculating subunit configured to compute the average duration of the text basic semantic units in the text basic semantic unit group; a start time writing subunit configured to write the end time of the preceding text basic semantic unit into the start time of the null-valued text basic semantic unit; and an end time writing subunit configured to write that end time plus the average duration into the end time of the null-valued text basic semantic unit.

In addition, an embodiment of the present application further provides an electronic device, including: a display; a processor; and a memory storing a voice-over text generating program that, when read and executed by the processor, performs the following operations: recognizing audio information and acquiring start and end time information for each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing it to obtain text basic semantic units; recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information.

Compared with the prior art, the present application has the following advantages. The method, apparatus and electronic device for automatically generating voice-over text provided herein recognize audio information to acquire start and end time information for each recognized audio basic semantic unit; acquire the text information corresponding to the audio information and recognize it to obtain text basic semantic units; record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and process the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information. By performing speech recognition on the audio information, the technical solution obtains start and end time information for every audio basic semantic unit; by recognizing the corresponding text information, it determines the number and written form of the text basic semantic units within each single sentence, so that the audio basic semantic units recognized from the audio correspond to the text basic semantic units recognized from the text. Once this correspondence is established, the time information of each sentence in the text is determined from the start and end times of the audio basic semantic units, so that every sentence carries its own time information. Dynamic lyric files therefore no longer have to be produced by hand, which improves production efficiency, reduces production cost, and simplifies the production workflow.

301‧‧‧Audio recognition unit

303‧‧‧Text recognition unit

305‧‧‧Time writing unit

307‧‧‧Voice-over text generating unit

401‧‧‧Display

403‧‧‧Processor

405‧‧‧Memory

In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments recorded in this application, and a person of ordinary skill in the art may derive other drawings from them.

FIG. 1 is a flowchart of the method for automatically generating voice-over text according to an embodiment of the present application; FIG. 2 is a flowchart of processing the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information, according to an embodiment of the present application; FIG. 3 is a schematic diagram of the apparatus for automatically generating voice-over text according to an embodiment of the present application; and FIG. 4 is a schematic diagram of the electronic device according to an embodiment of the present application.

Numerous specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and a person skilled in the art can make similar generalizations without departing from its substance; the present application is therefore not limited by the specific implementations disclosed below.

To make the above objects, features and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

The embodiments of the present application provide a method for automatically generating voice-over text, together with an apparatus for automatically generating voice-over text and an electronic device. Each is described in detail in the embodiments below.

At present the lyrics used for synchronized display during audio playback are produced mainly by hand: an operator listens to the audio while annotating the lyrics with timestamps, generates a corresponding lyric file for every audio file in the audio database, and imports the generated lyric files into the playback application, so that the matching lyric file is displayed in synchronization when an audio file is played. Under this scheme, generating lyric files manually is cumbersome, inefficient and costly, and as audio libraries keep growing the drawbacks of the manual approach become increasingly severe. To address this problem, the technical solution of the present application performs speech recognition on the audio information to obtain start and end time information for every audio basic semantic unit; recognizes the text information corresponding to the audio to determine the number and written form of the text basic semantic units within each single sentence, so that the audio basic semantic units recognized from the audio correspond to the text basic semantic units recognized from the text; and, once this correspondence is established, determines the time information of each sentence in the text from the start and end times of the audio basic semantic units, so that the lyrics in the text carry time information. The function of automatically producing a dynamic lyric file is thus achieved.

Before the specific steps of this embodiment are described in detail, the dynamic lyrics involved in the technical solution are briefly explained.

Dynamic lyrics are produced by using an editor to align the lyrics with the times at which they occur in the song; the lyrics are then displayed line by line, in synchronization, while the song is played. Common dynamic lyric file formats include lrc and qrc.

lrc is an abbreviation of the English word lyric and is used as the filename extension of dynamic lyric files. A lyric file with the lrc extension can be displayed synchronously by all kinds of digital players. lrc lyrics are a plain-text, lyrics-specific format built from "tags" of the form "*:*:*" (where "*" is a wildcard standing for one or more actual characters; in a real lyric file "*" holds the time content, e.g. "01:01:00" means 1 minute 1 second, and ":" separates the minutes, seconds and milliseconds). Such a lyric file can be viewed and edited with word-processing software: write it in Notepad in the format above and change the extension to lrc to obtain a "filename.LRC" lyric file. The standard format of an lrc dynamic lyric line is [minutes:seconds:milliseconds]lyric.

An lrc lyric text contains two kinds of tags. The first are identification tags in the format "[name:value]", mainly the following predefined tags: [ar:artist], [ti:title], [al:album] and [by:editor] (the creator of the lrc lyrics).

The second are time tags of the form "[mm:ss]" or "[mm:ss.ff]". A time tag must stand at the beginning of a lyric line, and one line may carry several time tags (for example, a refrain that recurs in the lyrics). When playback reaches a given point in time, the corresponding time tag is looked up and the lyric text following it is displayed, which accomplishes the "lyric synchronization" function.
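As an illustration (the artist, title, album and editor values are invented for the example; the lyric lines reuse the example developed later in this description), a minimal lrc file combining both kinds of tags might look like this:

    [ar:Example Artist]
    [ti:Example Title]
    [al:Example Album]
    [by:example editor]
    [00:01:00]我想
    [00:01:40]你了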

When an lrc dynamic lyric file is used, the song and the lrc file must have the same file name (that is, apart from the different extensions .mp3, .wma, .lrc and so on, the text before the dot must be identical) and must be placed in the same directory (the same folder); the lyrics are then displayed in synchronization when the song is played in a player that supports lyric display.

An embodiment of the present application provides a method for generating voice-over text, implemented as follows. Please refer to FIG. 1, which is a flowchart of the method for automatically generating voice-over text according to an embodiment of the present application.

The method for automatically generating voice-over text includes:

Step S101: recognize the audio information and acquire start and end time information for each recognized audio basic semantic unit.

In this embodiment, recognizing the audio information chiefly means converting the speech signal of the audio information into recognizable text information, for example obtaining, in the form of text information, the audio basic semantic units converted from the speech signal. The audio basic semantic units include Chinese characters, Chinese words, pinyin, digits, English letters and/or English words. Specifically, the speech recognition process may use methods such as statistical pattern recognition.

In a specific implementation, the audio information can be recognized with the CMU Sphinx speech recognition system. CMU Sphinx is a large-vocabulary speech recognition system modeled with continuous hidden Markov models (CHMM). It supports several modes of operation, including a high-accuracy flat decoder and a fast-search tree decoder.

It should be noted that the text information contains the audio basic semantic units recognized from the audio information together with the start and end time information of those units within the audio. Understandably, the audio information may be a song file in mp3 or another music format; an mp3 file is an audio file of a certain duration that directly records real sound, so when an mp3 file is recognized and the recognized audio basic semantic units are output in the form of text information, the start and end time at which each unit is played within the audio is recorded as well.

In this embodiment, the text information output after recognizing the audio records each recognized audio basic semantic unit and its time information in the following format: <word, TIMECLASS>. Here word is the recognized audio basic semantic unit, and TIMECLASS is the time annotation, recorded as a start time and an end time {startTime, endTime}, which give the moment at which the unit occurs during playback, i.e. the offset, in milliseconds, from time 0 at the start of the audio.

The method for generating voice-over text is illustrated below with a concrete example. Suppose the audio information is an mp3 file 10 seconds long, and the lyric "我想了又想" occurs 1 second into playback. The recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio are then: <word:"我",{startTime:1000,endTime:1100}>; <word:"想",{startTime:1200,endTime:1300}>; <word:"了",{startTime:1400,endTime:1500}>; <word:"又",{startTime:1600,endTime:1700}>; <word:"想",{startTime:1800,endTime:1900}>.
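As a sketch (the class and variable names are illustrative, not prescribed by the application), the recognizer output above can be held in Python structures like these:

    from dataclasses import dataclass

    @dataclass
    class AudioUnit:
        word: str      # recognized audio basic semantic unit
        start_ms: int  # startTime: offset from playback time 0, in milliseconds
        end_ms: int    # endTime

    # Recognition result for the lyric "我想了又想" in the example above.
    audio_units = [
        AudioUnit("我", 1000, 1100),
        AudioUnit("想", 1200, 1300),
        AudioUnit("了", 1400, 1500),
        AudioUnit("又", 1600, 1700),
        AudioUnit("想", 1800, 1900),
    ]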

It should be noted that if the audio information is Chinese audio, the recognized audio basic semantic unit recorded in the output text information is a single Chinese character; by the same token, if the audio information is English audio, the recognized audio basic semantic unit is a single English word.

Understandably, the start and end time information of an audio basic semantic unit is recorded in milliseconds. The lyric "我想了又想" occurs when the mp3 file has played for 1 second, and the audio basic semantic unit "我" occurs between 1 and 1.1 seconds of playback, so the recorded time information of "我" is {startTime:1000, endTime:1100}.

Step S103: acquire the text information corresponding to the audio information and recognize the text information to obtain text basic semantic units.

In this embodiment, acquiring the text information corresponding to the audio information and recognizing it to obtain text basic semantic units can be implemented as follows: search the Internet for the text information corresponding to the audio; after obtaining it, recognize every basic semantic unit in the text and, for each recognized unit, form a text basic semantic unit whose time information is null, thereby obtaining the text basic semantic units.

It should be noted that a basic semantic unit is a single-word item of the text information, including Chinese characters, Chinese words, pinyin, digits, English letters and/or English words.

Continuing the concrete example above: the audio information is an mp3 file, and the lyric text corresponding to it is found by searching the Internet; its content is "我想了又想". After the lyric text is obtained, every basic semantic unit in it is recognized, and for each one a text basic semantic unit with null time information is formed: <word:"我",timeList{ }>; <word:"想",timeList{ }>; <word:"了",timeList{ }>; <word:"又",timeList{ }>; <word:"想",timeList{ }>.

Step S105: record the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit.

In this embodiment, recording the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit can be implemented as follows: match each audio basic semantic unit recognized from the audio against the text basic semantic units formed by recognizing the corresponding text information, and put the start and end time information of the audio basic semantic unit into the text basic semantic unit that corresponds to it.

For example, the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio are: <word:"我",{startTime:1000,endTime:1100}>; <word:"想",{startTime:1200,endTime:1300}>. Recognizing every basic semantic unit in the text information yields, for each one, a text basic semantic unit with null time information: <word:"我",timeList{ }>; <word:"想",timeList{ }>. These text basic semantic units are then matched against the recognized audio basic semantic units.

Because the audio basic semantic units "我" and "想" recognized from the audio have the same written form as the text basic semantic units "我" and "想" formed by recognizing the lyric text, the start and end time information of the audio units "我" and "想" is put into the text units "我" and "想": <word:"我",timeList{startTime:1000,endTime:1100}>; <word:"想",timeList{startTime:1200,endTime:1300}>.

It should be noted that the same audio basic semantic unit may occur more than once in the audio information; in a song, for example, the same character can appear several times. Therefore, when step S105 records the start and end time information of each audio basic semantic unit into the corresponding text basic semantic unit and identical audio basic semantic units exist, this can be implemented as follows: the start and end time information obtained from the audio is put into every text basic semantic unit identical to that audio basic semantic unit.

Continuing the concrete example above, the recognized audio basic semantic units and their time information recorded in the text information obtained by recognizing the audio are: <word:"我",{startTime:1000,endTime:1100}>; <word:"想",{startTime:1200,endTime:1300}>; <word:"了",{startTime:1400,endTime:1500}>; <word:"又",{startTime:1600,endTime:1700}>; <word:"想",{startTime:1800,endTime:1900}>.

After the text information is obtained, every basic semantic unit in it is recognized, and for each one a text basic semantic unit with null time information is formed: <word:"我",timeList{ }>; <word:"想",timeList{ }>; <word:"了",timeList{ }>; <word:"又",timeList{ }>; <word:"想",timeList{ }>.

Because the audio basic semantic units "我", "想", "了", "又" and "想" recognized from the audio have the same written form as the text basic semantic units "我", "想", "了", "又" and "想" formed from the lyric text, the start and end time information of the audio units is put into the corresponding text basic semantic units: <word:"我",timeList{startTime:1000,endTime:1100}>; <word:"想",timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>; <word:"了",timeList{startTime:1400,endTime:1500}>; <word:"又",timeList{startTime:1600,endTime:1700}>; <word:"想",timeList{startTime:1200,endTime:1300},{startTime:1800,endTime:1900}>.

Understandably, in the example above, because the character "想" occurs twice in both the audio information and the text, the start and end times of both occurrences of "想" obtained from the audio are put into each text basic semantic unit corresponding to "想".
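Continuing the Python sketch above (the helper names are again illustrative), the matching step of S105, including this handling of repeated units, can be outlined as:

    def record_times(audio_units, text_words):
        # For every word of the lyric text, collect the start/end times of ALL
        # identically written audio units; an empty timeList stands for a null value.
        text_units = []
        for word in text_words:
            time_list = [(u.start_ms, u.end_ms) for u in audio_units if u.word == word]
            text_units.append({"word": word, "timeList": time_list})
        return text_units

    text_units = record_times(audio_units, list("我想了又想"))
    # Both units for "想" receive both time pairs:
    # {"word": "想", "timeList": [(1200, 1300), (1800, 1900)]}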

Step S107: process the text basic semantic units in which the start and end time information has been recorded, and generate voice-over text corresponding to the audio information.

In this embodiment, this processing can be implemented as follows: for each specific single sentence in the text information, determine the text basic semantic units that make up the sentence; determine the start and end time information of the sentence from the start and end time information in those units; and collate the start and end time information of all sentences to generate voice-over text that corresponds to the audio information and in which the start and end time information of every sentence has been determined.

It should be noted that, when single sentences are determined in the text information, each single sentence in the text can be distinguished by the line breaks between sentences.

Processing the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information specifically includes steps S107-1 to S107-3, further described below with reference to FIG. 2.

Please refer to FIG. 2, which is a flowchart of processing the text basic semantic units in which the start and end time information has been recorded to generate voice-over text corresponding to the audio information, according to an embodiment of the present application.

The processing includes:

Step S107-1: for each single sentence in the text information, acquire the text basic semantic units that make up the sentence.

In this embodiment, this can be implemented as follows: distinguish each single sentence in the text information by the line breaks, and for a given sentence acquire the text basic semantic units that make it up.

For example, if the specific sentences in the text information are "我想" and "你了", the text basic semantic units making up these sentences are "我" and "想", and "你" and "了". The units "我" and "想" are: <word:"我",timeList{startTime:1000,endTime:1100}>; <word:"想",timeList{startTime:1200,endTime:1300}>. The units "你" and "了" are: <word:"你",timeList{startTime:1400,endTime:1500}>; <word:"了",timeList{startTime:1600,endTime:1700}>.

Step S107-2: determine the start and end time information of the sentence from the start and end time information recorded in the acquired text basic semantic units.

In this embodiment, this can be implemented as follows: take the earliest start time among the text basic semantic units making up the sentence as the sentence's start time, take the latest end time among those units as the sentence's end time, and use this start time and end time as the start and end time information of the sentence.

For example, the time information of the sentence "我想" determined from the time information of the two text basic semantic units above is timeList{startTime:1000,endTime:1300}, and the time information of the sentence "你了" determined from the other two units is timeList{startTime:1400,endTime:1700}.
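Under the same illustrative representation, step S107-2 reduces to taking the minimum start and the maximum end over the units of the sentence (at this point each unit is assumed to carry a single time pair):

    def sentence_time(units):
        starts = [t[0] for u in units for t in u["timeList"]]
        ends = [t[1] for u in units for t in u["timeList"]]
        return min(starts), max(ends)  # (earliest start, latest end)

    # sentence_time for "我想" -> (1000, 1300); for "你了" -> (1400, 1700)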

Step S107-3: integrate the sentences whose start and end time information has been determined into voice-over text that corresponds to the audio information and carries start and end time information for every sentence.

For example, after the time information of all the sentences "我想" and "你了" in the text has been determined, the text carrying the time information of these two sentences (that is, the dynamic lrc lyrics) is output:
[00:01:00]我想

[00:01:40]你了。
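For illustration, converting a sentence start time in milliseconds into the [minutes:seconds:hundredths] tag used above might be done as follows (a sketch, not a routine specified by the application):

    def lrc_tag(start_ms):
        minutes, rest = divmod(start_ms, 60_000)
        seconds, ms = divmod(rest, 1000)
        return f"[{minutes:02d}:{seconds:02d}:{ms // 10:02d}]"

    print(lrc_tag(1000) + "我想")  # [00:01:00]我想
    print(lrc_tag(1400) + "你了")  # [00:01:40]你了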

Understandably, when the audio information is played and the display time of each sentence is reached, the corresponding sentence of the voice-over text is displayed.

In this embodiment, because the same audio basic semantic unit may occur more than once in the audio information (in a song, the same character can appear several times), step S107-1 can handle identical basic semantic units as follows when acquiring the text basic semantic units that make up each sentence: if at least two sets of start and end time information are recorded in a text basic semantic unit, candidate text basic semantic unit groups for the sentence are formed, one for each set of start and end time information.

Continuing with the specific example above: the single sentence in the text is "我想了又想", and the text basic semantic units composing it, "我", "想", "了", "又" and "想", are:

<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1200, endTime:1300}, {startTime:1800, endTime:1900}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1200, endTime:1300}, {startTime:1800, endTime:1900}>.

Since the two text basic semantic units "想" in the sentence "我想了又想" each carry two sets of time information, forming the text basic semantic unit groups constituting the sentence according to the number of sets of start-and-end time information yields the following four groups.

Group 1:
<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1200, endTime:1300}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1200, endTime:1300}>.

Group 2:
<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1200, endTime:1300}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1800, endTime:1900}>.

Group 3:
<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1800, endTime:1900}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1200, endTime:1300}>.

Group 4:
<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1800, endTime:1900}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1800, endTime:1900}>.
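Conceptually, forming these groups is a Cartesian product over the time sets recorded in each unit. The following Python sketch illustrates this under an assumed tuple-based layout; the word/timeList records above are shown as plain tuples, and nothing in this layout is prescribed by the embodiment:

```python
# Illustrative only: each unit carries a list of (start, end) candidates,
# and one candidate group is produced per combination of choices.
from itertools import product

sentence = [
    ("我", [(1000, 1100)]),
    ("想", [(1200, 1300), (1800, 1900)]),
    ("了", [(1400, 1500)]),
    ("又", [(1600, 1700)]),
    ("想", [(1200, 1300), (1800, 1900)]),
]

def candidate_groups(units):
    """Yield one [(word, (start, end)), ...] list per timing combination."""
    words = [word for word, _ in units]
    for combo in product(*(times for _, times in units)):
        yield list(zip(words, combo))

groups = list(candidate_groups(sentence))
assert len(groups) == 4  # 2 choices x 2 choices for the two "想" units
```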

Since in reality each text basic semantic unit of the sentence should carry only one set of time information, the text basic semantic unit groups whose time information is implausible need to be filtered out. Therefore, after the step of forming, according to the number of sets of start-and-end time information, the text basic semantic unit groups constituting the sentence, the method further includes the following step: screening, according to a predetermined calculation method, all the start-and-end time information of each text basic semantic unit in each text basic semantic unit group, to determine the text basic semantic unit group that constitutes the sentence.

In this embodiment, the predetermined calculation method computes as follows: within each text basic semantic unit group, compute the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit; obtain the sum of these time gaps for each group; and use the sum of the time gaps as the error value of that text basic semantic unit group.

It should be noted that the time gap refers to the interval between the start time recorded in each text basic semantic unit and the end time of the preceding text basic semantic unit. When the text basic semantic unit groups composing the sentence are formed, a unit's start time may be earlier than the preceding unit's end time; to prevent negative time gaps from distorting the calculation of the error value, the positive value of each time gap must be taken.

Methods for obtaining the positive value of the time gap include taking the absolute value, squaring, and so on; the description below uses squaring. Understandably, since what is needed is the gap between each unit's start time and the preceding unit's end time, the positive value of the time gap is obtained by computing the squared difference.

Specifically, the mathematical form of the predetermined calculation method is:

error value = (startTime₂ − endTime₁)² + (startTime₃ − endTime₂)² + … + (startTimeₙ − endTimeₙ₋₁)²

The calculations for the four groups of time sets above are detailed below. (For ease of illustration, the calculations are carried out in seconds.)

Group 1: (1.2 − 1.1)² + (1.4 − 1.3)² + (1.6 − 1.5)² + (1.2 − 1.7)² = 0.28

Group 2: (1.2 − 1.1)² + (1.4 − 1.3)² + (1.6 − 1.5)² + (1.8 − 1.7)² = 0.04

Group 3: (1.8 − 1.1)² + (1.4 − 1.9)² + (1.6 − 1.5)² + (1.2 − 1.7)² = 1

Group 4: (1.8 − 1.1)² + (1.4 − 1.9)² + (1.6 − 1.5)² + (1.8 − 1.7)² = 0.76
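As a sketch, the error value above can be computed as follows; the tuple layout from the earlier sketch is assumed, and times recorded in milliseconds are scaled to seconds only to match the worked figures:

```python
# Sum of squared gaps between each unit's start time and the previous
# unit's end time; squaring keeps every gap positive.
def group_error(group):
    error = 0.0
    for (_, (_, prev_end)), (_, (start, _)) in zip(group, group[1:]):
        error += ((start - prev_end) / 1000.0) ** 2
    return error

group1 = [("我", (1000, 1100)), ("想", (1200, 1300)), ("了", (1400, 1500)),
          ("又", (1600, 1700)), ("想", (1200, 1300))]
print(round(group_error(group1), 2))  # 0.28, matching Group 1 above
```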

In this embodiment, the preset threshold may be a reasonable value configured empirically by those skilled in the art, or it may be the smallest of the error values. After the error values have been calculated, the text basic semantic unit groups are filtered, and the groups whose error value is below the preset threshold are retained.

When the preset threshold is the smallest of the error values, filtering the text basic semantic unit groups and retaining those whose error value is below the preset threshold may be implemented as follows: retain the group with the smallest error value that constitutes the sentence, and filter out the other groups constituting the sentence.

It should be noted that, when the text basic semantic unit groups constituting the sentence are filtered, groups with identical error values may occur; in that case, filtering by error value still cannot yield a single group carrying only one set of time information. To solve this problem, an embodiment of the present application provides a preferred implementation: after the step of filtering the groups and retaining those whose error value is below the preset threshold, it is further necessary to count, within each retained group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to select the group with the largest such count.

A specific example follows.

Suppose the text basic semantic unit groups constituting the sentence also include a fifth group:

<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1200, endTime:1300}>;
<word:"了", timeList{startTime:1400, endTime:1500}>;
<word:"又", timeList{startTime:1600, endTime:1700}>;
<word:"想", timeList{startTime:1600, endTime:1700}>.

The error value of the fifth group is then: (1.2 − 1.1)² + (1.4 − 1.3)² + (1.6 − 1.5)² + (1.6 − 1.7)² = 0.04

After filtering by error value, the retained groups with the smallest error value are Group 2 and Group 5. Groups 2 and 5 must therefore additionally be judged for plausibility according to the chronological order of the text basic semantic units within the sentence, that is, by counting the number of times the start time of each retained text basic semantic unit is greater than the end time of the preceding text basic semantic unit in the sentence.

For example, in Group 2 the start time of the first "想" is greater than the end time of the preceding unit "我"; the start time of "了" is greater than the end time of "想"; the start time of "又" is greater than the end time of "了"; and the start time of the second "想" is greater than the end time of "又". The plausibility count of Group 2 is therefore 4. By the same reasoning, the plausibility count of Group 5 is 3, so the group with the count of 4 is selected as the time set group of the text basic semantic units constituting the sentence.
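The tie-break can be sketched as a simple count, again under the assumed tuple layout:

```python
# Count how often a unit starts strictly after the previous unit ends;
# among equal-error groups, the group with the largest count is kept.
def monotonic_count(group):
    return sum(1 for (_, (_, prev_end)), (_, (start, _)) in zip(group, group[1:])
               if start > prev_end)

group2 = [("我", (1000, 1100)), ("想", (1200, 1300)), ("了", (1400, 1500)),
          ("又", (1600, 1700)), ("想", (1800, 1900))]
group5 = [("我", (1000, 1100)), ("想", (1200, 1300)), ("了", (1400, 1500)),
          ("又", (1600, 1700)), ("想", (1600, 1700))]

assert monotonic_count(group2) == 4 and monotonic_count(group5) == 3
best = max([group2, group5], key=monotonic_count)  # selects group2
```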

As a preferred implementation, in the method for automatically generating dubbing text provided by the embodiments of the present application, when step S103 is performed to acquire the text information corresponding to the audio information and recognize it to obtain text basic semantic units, the text basic semantic units in the text information are obtained by recognizing the text word by word, in the order of the words within each sentence.

As a preferred implementation, in the method for automatically generating dubbing text provided by the embodiments of the present application, speech recognition has a limited recognition rate; that is, the audio information cannot necessarily be recognized without error, so when the audio information is recognized in step S101 there may be audio basic semantic units that go unrecognized. When step S103 is performed to acquire the text information corresponding to the audio information and recognize it to obtain text basic semantic units, the information in the text is a character string the computer can recognize, so every basic semantic unit in the text information can be identified and formed into a text basic semantic unit. Therefore, when step S105 records the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit, if the start-and-end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is given a null value.

Understandably, if the audio information contains audio basic semantic units that went unrecognized during recognition, that is, the audio basic semantic unit is empty and the start-and-end time information in it is likewise null, then, when step S105 records the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit, the number of text basic semantic units formed will be greater than the number of audio basic semantic units recognized from the speech, and the start-and-end time information in the unmatched text basic semantic units is set to a null value.

For example, the audio basic semantic units recognized from the audio information, together with their time information, are:

<word:"我", {startTime:1000, endTime:1100}>;
<word:"想", {startTime:1200, endTime:1300}>;
<word:"又", {startTime:1600, endTime:1700}>.

For each text basic semantic unit of the lyrics in the lyrics text, a text basic semantic unit with empty time information is formed:

<word:"我", timeList{ }>;
<word:"想", timeList{ }>;
<word:"了", timeList{ }>;
<word:"又", timeList{ }>.

Since only "我", "想" and "又" were recognized from the audio information, while recognizing the text basic semantic units of the lyrics text yields the units "我", "想", "了" and "又", the time information of the audio basic semantic units above is placed into the corresponding text basic semantic units:

<word:"我", timeList{startTime:1000, endTime:1100}>;
<word:"想", timeList{startTime:1200, endTime:1300}>;
<word:"了", timeList{ }>;
<word:"又", timeList{startTime:1600, endTime:1700}>.
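One way to picture step S105's matching is the in-order alignment sketched below. The embodiment does not fix a matching strategy, so the next-unused-occurrence rule and the use of None to stand for an empty timeList{ } are assumptions made for illustration:

```python
# Attach each recognized (word, start, end) triple to the next unused
# occurrence of that word in the lyrics; unmatched words keep None.
def attach_times(lyric_words, recognized):
    result = [[word, None] for word in lyric_words]
    i = 0
    for word, start, end in recognized:
        while i < len(result) and result[i][0] != word:
            i += 1  # skip lyric words the recognizer missed
        if i < len(result):
            result[i][1] = (start, end)
            i += 1
    return result

print(attach_times(["我", "想", "了", "又"],
                   [("我", 1000, 1100), ("想", 1200, 1300), ("又", 1600, 1700)]))
# [['我', (1000, 1100)], ['想', (1200, 1300)], ['了', None], ['又', (1600, 1700)]]
```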

As a preferred implementation, in the method for automatically generating dubbing text provided by the embodiments of the present application, when step S107-1 is performed to obtain, for each single sentence in the text information, the text basic semantic units constituting the sentence, if there are text basic semantic units whose value is null, then after the step of determining the text basic semantic unit group constituting the sentence, in order that every text basic semantic unit carries start-and-end time information, start-and-end time information is estimated for the null-valued text basic semantic units according to a predetermined estimation method.

The predetermined estimation method includes: calculating the average time information of the text basic semantic units in the text basic semantic unit group; placing the end time of the text basic semantic unit preceding the null-valued unit into the start time of the null-valued unit; and placing that end time plus the average time information into the end time of the null-valued unit.

In this embodiment, calculating the average time information of the text basic semantic units in the group may be implemented as follows: subtract the start time from the end time of each text basic semantic unit composing the sentence to obtain each unit's playing time in the audio information, and divide the sum of the playing times of the sentence's text basic semantic units by the number of units in the sentence to obtain the average time information of the text basic semantic units composing the sentence.

Understandably, since the text basic semantic units are formed in the order of the basic semantic units within each sentence of the text information, the time can be estimated from the end time recorded in the time information of the unit preceding the null-valued unit: the end time in the preceding text basic semantic unit is placed into the start time of the null-valued unit; that is, the end time of the text basic semantic unit adjacent to and preceding the null-valued unit is taken as the start time of the null-valued unit.

After the start time of the null-valued text basic semantic unit has been determined, its end time is determined from the average playing time in the audio information of the text basic semantic units of the sentence: the determined start time of the null-valued unit plus the average time information is placed into the end time of the null-valued unit.
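A minimal sketch of this first estimation method, assuming the earlier tuple layout with None marking a null-valued unit and assuming the first unit of the sentence is timed:

```python
# Fill a null unit: start = previous unit's end; end = start + average
# playing time of the timed units in the sentence.
def fill_by_average(group):
    durations = [t[1] - t[0] for _, t in group if t is not None]
    avg = sum(durations) / len(durations)
    for idx, (word, t) in enumerate(group):
        if t is None:
            prev_end = group[idx - 1][1][1]  # end time of the previous unit
            group[idx] = (word, (prev_end, prev_end + avg))
    return group

filled = fill_by_average([("我", (1000, 1100)), ("想", (1200, 1300)),
                          ("了", None), ("又", (1600, 1700))])
# "了" becomes (1300, 1400): start 1300 from "想", plus the 100 ms average
```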

It should be noted that, since step S103 acquires the text information corresponding to the audio information and recognizes it word by word in order within each sentence to obtain the text basic semantic units, the start-and-end time information of a null-valued text basic semantic unit can also be estimated in another way: directly take the end time in the time information of the text basic semantic unit preceding the null-valued unit, and the start time in the time information of the text basic semantic unit following it, as the start time and the end time, respectively, in the time information of the null-valued unit.

Understandably, since the text basic semantic units are formed in the order of the units within each sentence of the text, a null-valued text basic semantic unit lies between its adjacent preceding and following units, so its time can be estimated from the end time in the time information of the preceding unit and the start time in the time information of the following unit.
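The alternative can be sketched in the same style; it assumes the null-valued unit sits between two timed neighbours:

```python
# Fill a null unit from its neighbours: start = previous unit's end,
# end = next unit's start.
def fill_by_neighbours(group):
    for idx, (word, t) in enumerate(group):
        if t is None:
            start = group[idx - 1][1][1]  # end time of the previous unit
            end = group[idx + 1][1][0]    # start time of the next unit
            group[idx] = (word, (start, end))
    return group

filled = fill_by_neighbours([("我", (1000, 1100)), ("想", (1200, 1300)),
                             ("了", None), ("又", (1600, 1700))])
# "了" becomes (1300, 1600)
```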

The above embodiments provide a method for automatically generating dubbing text. Corresponding to that method, the present application also provides an apparatus for automatically generating dubbing text. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for the relevant parts, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative. The apparatus embodiment is as follows: please refer to FIG. 3, which shows a schematic diagram of an apparatus for automatically generating dubbing text according to an embodiment of the present application.

The apparatus for automatically generating dubbing text includes: an audio recognition unit 301, a text recognition unit 303, a time writing unit 305, and a dubbing text generation unit 307. The audio recognition unit 301 is configured to recognize audio information and acquire the start-and-end time information of each recognized audio basic semantic unit. The text recognition unit 303 is configured to acquire the text information corresponding to the audio information and recognize it to obtain text basic semantic units. The time writing unit 305 is configured to record the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit. The dubbing text generation unit 307 is configured to process the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information.
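Read as a dataflow, units 301 through 307 form a four-stage pipeline. The sketch below is a structural illustration only; the injected callables stand in for the units and are not an implementation from the embodiment:

```python
# Wire the four units together: recognize audio, recognize text, write
# times into the text units, then generate the dubbing text.
class DubbingTextApparatus:
    def __init__(self, audio_recognition, text_recognition,
                 time_writing, dubbing_text_generation):
        self.audio_recognition = audio_recognition                # unit 301
        self.text_recognition = text_recognition                  # unit 303
        self.time_writing = time_writing                          # unit 305
        self.dubbing_text_generation = dubbing_text_generation    # unit 307

    def run(self, audio, text):
        audio_units = self.audio_recognition(audio)    # start/end per audio unit
        text_units = self.text_recognition(text)      # word-level text units
        timed_units = self.time_writing(audio_units, text_units)
        return self.dubbing_text_generation(timed_units)
```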

Optionally, the dubbing text generation unit 307 includes: a text semantics acquisition subunit, a time information determination subunit, and a dubbing text generation subunit. The text semantics acquisition subunit is configured to acquire, for each single sentence in the text information, the text basic semantic units constituting the sentence. The time information determination subunit is configured to determine the start-and-end time information of the sentence according to the start-and-end time information recorded in the acquired text basic semantic units. The dubbing text generation subunit is configured to integrate the sentences whose start-and-end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start-and-end time information of each sentence.

Optionally, the text semantics acquisition subunit is specifically configured so that, when the text basic semantic units constituting each single sentence in the text information are acquired, if at least two sets of start-and-end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the sentence are formed separately according to the number of sets of start-and-end time information.

Optionally, the apparatus for automatically generating dubbing text further includes a text semantics screening subunit, configured to screen, after the text basic semantic unit groups constituting the sentence have been formed according to the number of sets of start-and-end time information, all the start-and-end time information of each text basic semantic unit in each group according to a predetermined calculation method, to determine the text basic semantic unit group constituting the sentence.

Optionally, the text semantics screening subunit includes an error calculation subunit, configured to compute, within each text basic semantic unit group, the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time gaps for each group, and use the sum of the time gaps as the error value of that text basic semantic unit group.

Optionally, the text semantics screening subunit further includes a filtering subunit, configured to filter the text basic semantic unit groups and retain the groups whose error value is below the preset threshold.

Optionally, the text semantics screening subunit further includes a time count calculation subunit, configured to count, after the groups whose error value is below the preset threshold have been retained, the number of times within each retained group that the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to select the group with the largest such count.

Optionally, the text recognition unit 303 is specifically configured to obtain the text basic semantic units in the text information by recognizing the text word by word, in the order of the words within each sentence.

Optionally, the time writing unit 305 is specifically configured so that, when the start-and-end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start-and-end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is given a null value.

Optionally, the apparatus for automatically generating dubbing text further includes a time estimation unit, configured to estimate, after the text basic semantic unit group constituting the sentence has been determined, start-and-end time information for the null-valued text basic semantic units according to a predetermined estimation method.

Optionally, the time estimation unit includes: an average time calculation subunit, configured to calculate the average time information of the text basic semantic units in the text basic semantic unit group; a start time writing subunit, configured to place the end time of the text basic semantic unit preceding a null-valued unit into the start time of that null-valued unit; and an end time writing subunit, configured to place that end time plus the average time information into the end time of the null-valued unit.

The above embodiments provide a method for automatically generating dubbing text and an apparatus for automatically generating dubbing text; in addition, the present application also provides an electronic device. The electronic device embodiment is as follows: please refer to FIG. 4, which shows a schematic diagram of an electronic device according to an embodiment of the present application.

The electronic device includes: a display 401; a processor 403; and a memory 405. The memory 405 is configured to store a dubbing text generation program which, when read and executed by the processor, performs the following operations: recognizing audio information to acquire the start-and-end time information of each recognized audio basic semantic unit; acquiring the text information corresponding to the audio information and recognizing it to obtain text basic semantic units; recording the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.

The memory may include non-persistent memory in computer-readable media, in the form of random-access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

2. Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, and the like) containing computer-usable program code.

Although the present application is disclosed above by way of preferred embodiments, they are not intended to limit the present application. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application shall be subject to the scope defined by the claims of the present application.

Claims (23)

1. A method for automatically generating dubbing text, comprising: recognizing audio information to acquire start-and-end time information of each recognized audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information to obtain text basic semantic units; recording the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information.

2. The method for automatically generating dubbing text according to claim 1, wherein processing the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information, comprises: for each single sentence in the text information, acquiring the text basic semantic units constituting the sentence; determining the start-and-end time information of the sentence according to the start-and-end time information recorded in the acquired text basic semantic units; and integrating the sentences whose start-and-end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start-and-end time information of each sentence.

3. The method for automatically generating dubbing text according to claim 2, wherein, when the text basic semantic units constituting each single sentence in the text information are acquired, if at least two sets of start-and-end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the sentence are formed separately according to the number of sets of start-and-end time information.

4. The method for automatically generating dubbing text according to claim 3, wherein, after the step of forming, according to the number of sets of start-and-end time information, the text basic semantic unit groups constituting the sentence, the method comprises: screening, according to a predetermined calculation method, all the start-and-end time information of each text basic semantic unit in each text basic semantic unit group, to determine the text basic semantic unit group constituting the sentence.
5. The method for automatically generating dubbing text according to claim 4, wherein the predetermined calculation method comprises: computing, within each text basic semantic unit group, the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit; obtaining the sum of these time gaps for each group; and using the sum of the time gaps as the error value of that text basic semantic unit group.

6. The method for automatically generating dubbing text according to claim 5, wherein screening all the start-and-end time information of each text basic semantic unit in each group to determine the text basic semantic unit group constituting the sentence comprises: filtering the text basic semantic unit groups and retaining the groups whose error value is below a preset threshold.

7. The method for automatically generating dubbing text according to claim 6, wherein, after the step of retaining the groups whose error value is below the preset threshold, the method comprises: counting, within each retained group, the number of times the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and selecting the group with the largest such count.

8. The method for automatically generating dubbing text according to any one of claims 1 to 7, wherein recognizing the text information to obtain text basic semantic units comprises: obtaining the text basic semantic units in the text information by recognizing the text word by word, in the order of the words within each sentence.

9. The method for automatically generating dubbing text according to claim 8, wherein, when the start-and-end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start-and-end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is given a null value.
10. The method for automatically generating dubbing text according to claim 9, wherein, after the step of determining the text basic semantic unit group constituting the sentence, the method comprises: estimating, according to a predetermined estimation method, start-and-end time information for the text basic semantic units whose value is null.

11. The method for automatically generating dubbing text according to claim 10, wherein the predetermined estimation method comprises: calculating the average time information of the text basic semantic units in the text basic semantic unit group; placing the end time of the text basic semantic unit preceding a null-valued unit into the start time of that null-valued unit; and placing that end time plus the average time information into the end time of the null-valued unit.

12. An apparatus for automatically generating dubbing text, comprising: an audio recognition unit, configured to recognize audio information and acquire start-and-end time information of each recognized audio basic semantic unit; a text recognition unit, configured to acquire text information corresponding to the audio information and recognize the text information to obtain text basic semantic units; a time writing unit, configured to record the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and a dubbing text generation unit, configured to process the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information.

13. The apparatus for automatically generating dubbing text according to claim 12, wherein the dubbing text generation unit comprises: a text semantics acquisition subunit, configured to acquire, for each single sentence in the text information, the text basic semantic units constituting the sentence; a time information determination subunit, configured to determine the start-and-end time information of the sentence according to the start-and-end time information recorded in the acquired text basic semantic units; and a dubbing text generation subunit, configured to integrate the sentences whose start-and-end time information has been determined, to form dubbing text that corresponds to the audio information and carries the start-and-end time information of each sentence.
14. The apparatus for automatically generating dubbing text according to claim 13, wherein the text semantics acquisition subunit is specifically configured so that, when the text basic semantic units constituting each single sentence in the text information are acquired, if at least two sets of start-and-end time information are recorded in a text basic semantic unit, text basic semantic unit groups constituting the sentence are formed separately according to the number of sets of start-and-end time information.

15. The apparatus for automatically generating dubbing text according to claim 14, further comprising: a text semantics screening subunit, configured to screen, after the text basic semantic unit groups constituting the sentence have been formed according to the number of sets of start-and-end time information, all the start-and-end time information of each text basic semantic unit in each group according to a predetermined calculation method, to determine the text basic semantic unit group constituting the sentence.

16. The apparatus for automatically generating dubbing text according to claim 15, wherein the text semantics screening subunit comprises: an error calculation subunit, configured to compute, within each text basic semantic unit group, the time gap between the start time of each text basic semantic unit and the end time of the preceding text basic semantic unit, obtain the sum of these time gaps for each group, and use the sum of the time gaps as the error value of that text basic semantic unit group.

17. The apparatus for automatically generating dubbing text according to claim 15, wherein the text semantics screening subunit further comprises: a filtering subunit, configured to filter the text basic semantic unit groups and retain the groups whose error value is below a preset threshold.

18. The apparatus for automatically generating dubbing text according to claim 17, wherein the text semantics screening subunit further comprises: a time count calculation subunit, configured to count, after the groups whose error value is below the preset threshold have been retained, the number of times within each retained group that the start time of a text basic semantic unit is greater than the end time of the preceding text basic semantic unit, and to select the group with the largest such count.

19. The apparatus for automatically generating dubbing text according to any one of claims 12 to 18, wherein the text recognition unit is specifically configured to obtain the text basic semantic units in the text information by recognizing the text word by word, in the order of the words within each sentence.

20. The apparatus for automatically generating dubbing text according to claim 19, wherein the time writing unit is specifically configured so that, when the start-and-end time information of each audio basic semantic unit is recorded into the corresponding text basic semantic unit, if the start-and-end time information of an audio basic semantic unit is null, the corresponding text basic semantic unit is given a null value.

21. The apparatus for automatically generating dubbing text according to claim 20, further comprising: a time estimation unit, configured to estimate, after the text basic semantic unit group constituting the sentence has been determined, start-and-end time information for the text basic semantic units whose value is null, according to a predetermined estimation method.

22. The apparatus for automatically generating dubbing text according to claim 21, wherein the time estimation unit comprises: an average time calculation subunit, configured to calculate the average time information of the text basic semantic units in the text basic semantic unit group; a start time writing subunit, configured to place the end time of the text basic semantic unit preceding a null-valued unit into the start time of that null-valued unit; and an end time writing subunit, configured to place that end time plus the average time information into the end time of the null-valued unit.
23. An electronic device, comprising: a display; a processor; and a memory, the memory being configured to store a dubbing text generation program which, when read and executed by the processor, performs the following operations: recognizing audio information to acquire start-and-end time information of each recognized audio basic semantic unit; acquiring text information corresponding to the audio information and recognizing the text information to obtain text basic semantic units; recording the start-and-end time information of each audio basic semantic unit into the corresponding text basic semantic unit; and processing the text basic semantic units in which the start-and-end time information has been recorded, to generate dubbing text corresponding to the audio information.
TW106126945A 2016-12-22 2017-08-09 Method, device and electronic equipment for automatically generating dubbing text TWI749045B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201611196447.6 2016-12-22
201611196447.6 2016-12-22
CN201611196447.6A CN108228658B (en) 2016-12-22 2016-12-22 Method and device for automatically generating dubbing characters and electronic equipment

Publications (2)

Publication Number Publication Date
TW201832222A true TW201832222A (en) 2018-09-01
TWI749045B TWI749045B (en) 2021-12-11

Family

ID=62624697

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106126945A TWI749045B (en) 2016-12-22 2017-08-09 Method, device and electronic equipment for automatically generating dubbing text

Country Status (3)

Country Link
CN (1) CN108228658B (en)
TW (1) TWI749045B (en)
WO (1) WO2018113535A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858492A (en) * 2018-08-23 2020-03-03 阿里巴巴集团控股有限公司 Audio editing method, device, equipment and system and data processing method
CN110728116B (en) * 2019-10-23 2023-12-26 深圳点猫科技有限公司 Method and device for generating video file dubbing manuscript

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
JP4026543B2 (en) * 2003-05-26 2007-12-26 日産自動車株式会社 Vehicle information providing method and vehicle information providing device
CN100501738C (en) * 2006-10-24 2009-06-17 北京搜狗科技发展有限公司 Searching method, system and apparatus for playing media file
CN101616264B (en) * 2008-06-27 2011-03-30 中国科学院自动化研究所 Method and system for cataloging news video
CN101615417B (en) * 2009-07-24 2011-01-26 北京海尔集成电路设计有限公司 Synchronous Chinese lyrics display method which is accurate to words
GB2502944A (en) * 2012-03-30 2013-12-18 Jpal Ltd Segmentation and transcription of speech
CN104599693B (en) * 2015-01-29 2018-07-13 语联网(武汉)信息技术有限公司 The production method of lines sychronization captions
CN204559707U (en) * 2015-04-23 2015-08-12 南京信息工程大学 There is the prompter device of speech identifying function
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device

Also Published As

Publication number Publication date
CN108228658B (en) 2022-06-03
WO2018113535A1 (en) 2018-06-28
TWI749045B (en) 2021-12-11
CN108228658A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2020024690A1 (en) Speech labeling method and apparatus, and device
US9153233B2 (en) Voice-controlled selection of media files utilizing phonetic data
US8666727B2 (en) Voice-controlled data system
KR101292698B1 (en) Method and apparatus for attaching metadata
WO2017157142A1 (en) Song melody information processing method, server and storage medium
US20180286459A1 (en) Audio processing
WO2018059342A1 (en) Method and device for processing dual-source audio data
EP3373299B1 (en) Audio data processing method and device
US20110046955A1 (en) Speech processing apparatus, speech processing method and program
WO2011146366A1 (en) Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
CN103123644A (en) Voice data retrieval system and program product therefor
CN109213977A The generation system of court's trial notes
TWI749045B (en) Method, device and electronic equipment for automatically generating dubbing text
US11016945B1 (en) Identifying and utilizing synchronized content
JP5465926B2 (en) Speech recognition dictionary creation device and speech recognition dictionary creation method
CN112750421B (en) Singing voice synthesis method and device and readable storage medium
CN106782601A (en) A kind of multimedia data processing method and its device
JP4697432B2 (en) Music playback apparatus, music playback method, and music playback program
US20100222905A1 (en) Electronic apparatus with an interactive audio file recording function and method thereof
CN115329125A (en) Song skewer burning splicing method and device
CN109165283A (en) Resource recommendation method, device, equipment and storage medium
EP1826686B1 (en) Voice-controlled multimedia retrieval system
CN114999464A (en) Voice data processing method and device
CN110895575A (en) Audio processing method and device
JP2009204872A (en) Creation system of dictionary for speech recognition