CN104038804B - Captioning synchronization apparatus and method based on speech recognition - Google Patents
- Publication number
- CN104038804B, CN104038804A, CN201310069142.9A, CN201310069142A
- Authority
- CN
- China
- Prior art keywords
- text information
- voice
- module
- word
- captions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
Provided are a caption synchronization apparatus and method based on speech recognition. The caption synchronization apparatus includes: a speech recognition module that extracts speech from the foreground sound of an audio stream, samples and recognizes the extracted speech, and generates corresponding text information; a dynamic sampling adjustment module that evaluates the semantic recognition degree of the generated text information and, according to the evaluation result, controls the speech recognition module to adjust its sampling rate so as to obtain text information with a high semantic recognition degree; a caption semantic comparison module that semantically matches the text information having a high semantic recognition degree against the words of the additional multilingual captions of the playing video; a caption synchronization module that, if the caption semantic comparison module finds a sentence in the caption file corresponding to the text information of the recognized speech, adjusts the time information of the caption file according to the time information of the speech; and a caption display module that displays captions according to the adjusted time information of the caption file.
Description
Technical field
The present invention relates to the technical field of speech recognition and caption synchronization. More particularly, it relates to an apparatus and method for automatically synchronizing captions with video by means of speech recognition while a TV program is playing.
Background technology
At present, digital television signal streams support only a limited number of caption languages and cannot satisfy the needs of different audiences at the same time. In places such as hotels, guests speaking many different languages stay, and these viewers have special requirements when watching digital TV captions. There is therefore a demand for displaying additional multilingual captions while digital TV video is playing. Furthermore, since TV programs may be interrupted by commercials, emergency notices, and the like, the display of the additional multilingual captions must remain synchronized with the audio and video at all times.
The content of the invention
The present invention proposes a scheme that uses speech recognition technology to keep additional captions displayed in synchronization even when a TV program is interrupted by commercials. By applying dynamic speech sampling to the additional-language captions, effective audio information is obtained in a reasonable manner, the caption text is matched, and its display timestamps are adjusted, so that the caption text can be effectively adjusted when interruptions and similar events occur in a digital TV program, keeping the captions displayed in synchronization.
According to an aspect of the present invention, there is provided a caption synchronization apparatus based on speech recognition, including: a speech recognition module that extracts speech from the foreground sound of an audio stream corresponding to a playing video, samples and recognizes the extracted speech, and generates text information corresponding to the recognized speech; a dynamic sampling adjustment module that evaluates the semantic recognition degree of the text information generated by the speech recognition module and, according to the evaluation result, controls the speech recognition module to adjust its sampling rate so as to obtain text information with a high semantic recognition degree; a caption semantic comparison module that semantically matches the text information having a high semantic recognition degree against the words of the additional multilingual captions of the playing video; a caption synchronization module that, if the caption semantic comparison module finds a sentence in the caption file corresponding to the text information of the recognized speech, adjusts the time information of the caption file according to the time information of the speech; and a caption display module that displays captions according to the time information of the caption file as adjusted by the caption synchronization module.
According to an aspect of the present invention, the caption synchronization apparatus further includes a language selection module that determines the language of the captions to be displayed according to the user's selection.
According to an aspect of the present invention, when the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is within a predetermined range [m, n], the dynamic sampling adjustment module determines that the text information has a high semantic recognition degree, where m and n are natural numbers.
According to an aspect of the present invention, if the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is less than the minimum number m, the dynamic sampling adjustment module controls the speech recognition module to raise the sampling frequency and sample the speech; if the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is greater than the maximum number n, the dynamic sampling adjustment module controls the speech recognition module to lower the sampling frequency and sample the speech.
According to an aspect of the present invention, the dynamic sampling adjustment module also considers the semantic meaning of the speech words in the text information generated by the speech recognition module when evaluating the semantic recognition degree of the text information.
According to an aspect of the present invention, the caption semantic comparison module uses a fuzzy algorithm to score the characters of the words of the additional multilingual captions of the playing video, so as to find the highest-scoring sentence in the caption file as the sentence matching the text information.
According to an aspect of the present invention, if the caption semantic comparison module does not find a sentence in the caption file corresponding to the text information of the recognized speech, it notifies the dynamic sampling adjustment module to raise the sampling frequency of the speech recognition module.
According to another aspect of the present invention, there is provided a caption synchronization method based on speech recognition, including: (a) extracting speech from the foreground sound of an audio stream corresponding to a playing video, and sampling and recognizing the extracted speech so as to generate text information corresponding to the recognized speech; (b) evaluating the semantic recognition degree of the generated text information and, according to the evaluation result, controlling the speech recognition module to adjust its sampling rate so as to obtain text information with a high semantic recognition degree; (c) semantically matching the text information having a high semantic recognition degree against the words of the additional multilingual captions of the playing video, to find a sentence in the caption file corresponding to the text information of the recognized speech; (d) adjusting the time information of the caption file according to the time information of the speech; and (e) displaying captions according to the adjusted time information of the caption file.
According to another aspect of the present invention, the caption synchronization method further includes determining the language of the captions to be displayed according to the user's selection.
According to another aspect of the present invention, in step (b), when the number of speech words in the text information generated in step (a) is determined to be within a predetermined range [m, n], the text information is determined to have a high semantic recognition degree, where m and n are natural numbers.
According to another aspect of the present invention, in step (b), if the number of speech words in the text information generated in step (a) is determined to be less than the minimum number m, the method returns to step (a) and raises the sampling frequency to sample the speech; if the number of speech words in the text information generated in step (a) is determined to be greater than the maximum number n, the method returns to step (a) and lowers the sampling frequency to sample the speech.
According to another aspect of the present invention, in step (b), the semantic meaning of the speech words in the text information generated in step (a) is considered when evaluating the semantic recognition degree of the text information.
According to another aspect of the present invention, in step (c), a fuzzy algorithm is used to score the characters of the words of the additional multilingual captions of the playing video, so as to find the highest-scoring sentence in the caption file as the sentence matching the text information.
According to another aspect of the present invention, if no sentence corresponding to the text information of the recognized speech is found in the caption file in step (c), the method returns to step (a) and raises the sampling frequency of the speech recognition.
Brief description of the drawings
The above and other objects and features of the present invention will become clearer from the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram showing a caption synchronization apparatus based on speech recognition according to an embodiment of the present invention;
Fig. 2 is a flowchart showing a caption synchronization method based on speech recognition according to an embodiment of the present invention.
Embodiment
The following description, made with reference to the accompanying drawings, is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to their bibliographical meanings, but are merely used by the inventor to enable a clear and consistent understanding of the invention. Accordingly, it should be apparent to those skilled in the art that the following description of exemplary embodiments of the present invention is provided for illustration only, and not for the purpose of limiting the invention as defined by the claims and their equivalents.
Fig. 1 is a block diagram showing a caption synchronization apparatus 100 based on speech recognition according to an embodiment of the present invention.
As shown in Fig. 1, the caption synchronization apparatus 100 based on speech recognition according to an embodiment of the present invention includes a language selection module 110, a speech recognition module 120, a dynamic sampling adjustment module 130, a caption semantic comparison module 140, a caption synchronization module 150, and a caption display module 160. The caption synchronization apparatus 100 according to an embodiment of the present invention can be integrated into a digital broadcast receiving apparatus or a video playback apparatus.
The language selection module 110 can determine the caption language to be displayed according to the user's selection. For example, the user sends a signal to the caption synchronization apparatus 100 through a controller such as a remote control, thereby selecting the caption language to be used.
The speech recognition module 120 extracts the speech in the foreground sound from the audio stream corresponding to the video stream of the playing TV program or other playing content, and samples and recognizes the extracted speech, thereby generating text information corresponding to the recognized speech. By extracting the foreground primary speech, background sounds in the playing video, such as cars or background music in a movie or TV program, can be removed, which improves the accuracy of the speech recognition. Any foreground-speech extraction method and speech recognition engine known in the art can be used to implement the speech recognition module 120.
The dynamic sampling adjustment module 130 evaluates the semantic recognition degree of the text information generated by the speech recognition module 120, and determines from the evaluation result whether the sampling frequency of the speech recognition module 120 needs to be adjusted. According to an embodiment of the present invention, the dynamic sampling adjustment module 130 can determine whether the number of speech words in the text information generated by the speech recognition module 120 is within a predetermined range [m, n]. If it determines that the number of speech words in the text information is less than the minimum number m or greater than the maximum number n, the dynamic sampling adjustment module 130 determines that the semantic recognition degree is low and that the sampling rate needs to be adjusted. When the dynamic sampling adjustment module 130 determines that the number of speech words in the text information generated by the speech recognition module 120 is less than the minimum number m, it determines that the sampling frequency needs to be raised, and controls the speech recognition module 120 to sample the speech at the raised sampling frequency. When the dynamic sampling adjustment module 130 determines that the number of speech words in the text information generated by the speech recognition module 120 is greater than the maximum number n, it determines that the sampling frequency can be lowered, and controls the speech recognition module 120 to sample the speech at the lowered sampling frequency. That is, when a person in the audio speaks very quickly, the number of sentence characters obtained per unit time increases, which raises the error rate of caption matching; in this case, the semantic recognition degree of the current audio can be determined to be low. Conversely, when a person in the audio speaks very slowly, the number of sentence characters obtained per unit time decreases, which likewise raises the error rate of caption matching; in this case, too, the semantic recognition degree of the current audio can be determined to be low. Therefore, only by controlling the sampling frequency so as to obtain a suitable number of characters can the semantic recognition degree be determined to be high.
In addition, according to embodiments of the present invention, when evaluating the semantic recognition degree, the dynamic sampling adjustment module 130 can also consider the semantic meaning of the speech words in the text information generated by the speech recognition module 120 in determining whether the sampling frequency needs to be adjusted. For example, when the speech words in the text information generated by the speech recognition module 120 include many words of low semantic value (for example, onomatopoeic interjections repeated in succession), the dynamic sampling adjustment module 130 can determine that the semantic recognition degree of the text information generated by the speech recognition module 120 is low, and control the speech recognition module 120 to raise the sampling frequency.
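The word-count test against [m, n] combined with the low-semantic-word check can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation; the defaults for m, n, the filler-word set, the filler-ratio limit, and the rate step are all assumptions chosen for the example:

```python
def evaluate_recognition(words, m=5, n=30, filler_ratio_limit=0.5,
                         rate=16000, step=1.25):
    """Decide whether recognized text has a usable semantic recognition
    degree, and propose a new sampling rate when it does not.

    words: list of recognized speech words
    [m, n]: acceptable word-count range (illustrative defaults)
    Returns (ok, new_rate).
    """
    # Words of low semantic value, e.g. onomatopoeic fillers (illustrative set)
    fillers = {"uh", "um", "ah"}
    filler_ratio = sum(w in fillers for w in words) / max(len(words), 1)

    if len(words) < m or filler_ratio > filler_ratio_limit:
        return False, rate * step   # too little usable text: sample faster
    if len(words) > n:
        return False, rate / step   # speech too fast: sample slower
    return True, rate               # within [m, n]: accept the text
```

A caller would loop: re-sample at `new_rate` whenever `ok` is false, and pass the accepted text on to caption matching otherwise.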
Next, after text information with a higher semantic recognition degree has been obtained through the assessment by the dynamic sampling adjustment module 130, the caption semantic comparison module 140 semantically matches the text information against the words of the additional multilingual captions of the playing video. Here, the caption semantic comparison module 140 can use a fuzzy algorithm to score the characters of the words of the additional multilingual captions, so as to find the highest-scoring sentence in the caption file. That is, among the sentences in the caption file whose score exceeds a predetermined value, the caption semantic comparison module 140 takes the highest-scoring sentence as the sentence corresponding to the recognized text information.
An example of scoring sentences with a fuzzy algorithm is given below. Of course, those skilled in the art can use other ways to find the sentence in the caption file that semantically matches a given sentence.
Given the two character strings ACAATCC and AGCATGC, matching them completely requires operations such as modification, deletion, and addition. To make the degree of approximation easier to calculate, the edit distance is converted into an approximation score: a matched character scores 2 points, while a modification, deletion, or addition scores -1 point. To obtain the approximation score of a complete match, a score matrix can be built by the following recurrence formulas; the approximation score is the value S(n, n) in the n-th order matrix S, where n is the length of the matched string plus 1. Here V stands for Value (the score), D stands for Difference Value (the per-operation score), S stands for String (the string to be matched), T stands for Template, and i and j denote the row and column of the matrix respectively, starting from 0.
The initial values are obtained directly:
V(0, 0) = 0;
V(0, j) = V(0, j-1) + D(_, T[j]); (insert j times)
V(i, 0) = V(i-1, 0) + D(S[i], _); (delete i times)
The remaining values are obtained by the recurrence
V(i, j) = max{ V(i-1, j-1) + D(S[i], T[j]), V(i-1, j) + D(S[i], _), V(i, j-1) + D(_, T[j]) }.
Using the formulas above, take the calculation of V(1, 2) as an example.
Given i = 1, j = 2, it is known that:
V(0, 1) = -1, V(0, 2) = -2, V(1, 1) = 2;
D(S[1], T[2]) = -1 (i.e., A compared with G),
D(S[1], _) = -1 (i.e., A compared with a gap),
D(_, T[2]) = -1 (i.e., a gap compared with G).
The three candidates are:
V(0, 1) + D(S[1], T[2]) = -2,
V(0, 2) + D(S[1], _) = -3,
V(1, 1) + D(_, T[2]) = 1;
so V(1, 2) = max(-2, -3, 1) = 1.
Filling in the whole matrix finally yields an optimal approximation score of 7; that is, the similarity score of the two strings under the shortest edit distance is 7.
The above is only one example of a method for scoring character strings; any known method can be used to evaluate the similarity between the recognized text information and the sentences in the caption file.
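The scoring scheme in the worked example (match +2; modification, deletion, or addition -1) can be sketched directly in Python. This is an illustrative implementation, not the patent's code, and `similarity_score` is an assumed name; it reproduces both V(1, 2) = 1 and the final score of 7 for ACAATCC against AGCATGC:

```python
def similarity_score(s, t, match=2, diff=-1):
    """Approximation score between two strings: +2 per matched character,
    -1 per modification, deletion, or addition (the D values of the example)."""
    rows, cols = len(s) + 1, len(t) + 1
    V = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        V[0][j] = V[0][j - 1] + diff          # insert T[j]
    for i in range(1, rows):
        V[i][0] = V[i - 1][0] + diff          # delete S[i]
    for i in range(1, rows):
        for j in range(1, cols):
            d = match if s[i - 1] == t[j - 1] else diff
            V[i][j] = max(V[i - 1][j - 1] + d,    # match / modify
                          V[i - 1][j] + diff,     # delete S[i]
                          V[i][j - 1] + diff)     # insert T[j]
    return V[len(s)][len(t)]

print(similarity_score("ACAATCC", "AGCATGC"))  # 7, as in the worked example
```

This is the classic global-alignment (Needleman-Wunsch-style) dynamic program; the caption comparison module would run it between the recognized text and each caption sentence and keep the highest scorer above the predetermined threshold.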
If the scores of all sentences are below the predetermined value, the caption semantic comparison module 140 determines that no sentence corresponding to the recognized text information exists in the caption file. According to embodiments of the present invention, when the caption semantic comparison module 140 does not find a sentence in the caption file corresponding to the recognized text information, it sends a command to the dynamic sampling adjustment module 130 to raise the sampling frequency, so that the dynamic sampling adjustment module 130 can control the speech recognition module 120 to continue recognizing the speech at the raised sampling frequency. The operations of the speech recognition module 120, the dynamic sampling adjustment module 130, and the caption semantic comparison module 140 described above are then repeated until speech with a sufficiently high semantic similarity to a sentence in the caption file is found.
If the caption semantic comparison module 140 finds the sentence in the caption file corresponding to the sampled speech, the caption synchronization module 150 adjusts the time information of the caption file according to the time information of the speech. That is, the caption synchronization module 150 adjusts the caption display time according to the offset between the time information of the sampled speech and the time information of the sentence found by the caption semantic comparison module 140.
Finally, the caption display module 160 displays the captions according to the caption time information as adjusted by the caption synchronization module 150.
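The offset-based timestamp adjustment performed by the caption synchronization module 150 can be sketched as follows. This is an illustrative sketch; the `(start, end, text)` cue format and the function name are assumptions, not the patent's implementation:

```python
def resynchronize(cues, matched_index, speech_time):
    """Shift every caption cue by the offset between the time of the
    sampled speech and the start time of the matched caption sentence.

    cues: list of (start, end, text) tuples, times in seconds
    matched_index: index of the cue the recognized speech matched
    speech_time: timestamp of the sampled speech in the playing stream
    """
    offset = speech_time - cues[matched_index][0]
    return [(start + offset, end + offset, text)
            for start, end, text in cues]
```

For example, if the speech matching the second cue is heard at 14.5 s while that cue is stored with a start time of 12.0 s, every cue is shifted 2.5 s later, so the captions stay aligned even after a commercial break lengthens the program.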
It should be understood that the modules described above can be further combined into fewer modules, or divided into more modules according to the operations they perform.
A caption synchronization method based on speech recognition according to an embodiment of the present invention is described below with reference to the flowchart of Fig. 2.
First, in step S210, the speech in the foreground sound is extracted from the audio stream corresponding to the video stream, and the extracted speech is sampled and recognized, thereby generating text information corresponding to the recognized speech. Here, the language of the text information can be selected by the user.
Next, in step S220, the semantic recognition degree of the generated text information is evaluated. Then, in step S230, whether the sampling frequency of the speech recognition needs to be adjusted is determined according to the evaluation result. According to embodiments of the present invention, whether to adjust the sampling frequency of the speech recognition can be decided by determining whether the number of speech words in the text information generated by the speech recognition module 120 is within the predetermined range [m, n]. In addition, the semantic meaning of the speech words in the text information can also be considered in determining whether the sampling rate needs to be adjusted. If it is determined that the sampling rate needs to be adjusted, the sampling rate is adjusted in step S235 according to the evaluation result of the semantic recognition degree, and the method returns to step S210 so that the semantic recognition degree can be evaluated again. If it is determined that no sampling-rate adjustment is needed, the method proceeds to step S240.
After text information with a higher semantic recognition degree has been obtained through the assessment of step S230, the text information is semantically matched in step S240 against the words of the additional multilingual captions of the playing video.
Next, in step S250, it is determined whether a sentence matching the recognized text information has been found among the words of the additional multilingual captions.
If it is determined in step S250 that a sentence matching the text information has been found, the display time of the captions is adjusted in step S260 according to the time information of the speech corresponding to the text information. Otherwise, if no matching sentence is found, the sampling frequency is raised in step S255, and the method returns to step S210 to extract, sample, and recognize the speech again.
The above operations S210-S255 are repeated until a sentence corresponding to the text information of the extracted speech is found in the caption file.
Finally, in step S270, the captions are displayed according to the adjusted caption display time.
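The S210-S260 control flow can be sketched as a loop, assuming pluggable `recognize` and `score` functions (both hypothetical names standing in for the speech recognition engine and the fuzzy scorer); the threshold, rate step, and round limit are illustrative, not values from the patent:

```python
def synchronize(recognize, score, cues, rate, threshold=5, max_rounds=10):
    """One run of the S210-S260 loop: recognize speech, match it against
    each caption cue, and raise the sampling rate until a cue scores at
    or above threshold, then shift all cue times by the observed offset.

    recognize(rate) -> (text, speech_time); score(a, b) -> number
    cues: list of (start, end, text); returns the re-timed cues.
    """
    for _ in range(max_rounds):
        text, speech_time = recognize(rate)                    # S210-S230
        scores = [score(text, c[2]) for c in cues]             # S240
        best = max(range(len(cues)), key=scores.__getitem__)
        if scores[best] >= threshold:                          # S250
            offset = speech_time - cues[best][0]               # S260
            return [(s + offset, e + offset, t) for s, e, t in cues]
        rate *= 1.25                                           # S255, back to S210
    return cues  # no confident match: keep the original timing
```

The real apparatus runs this loop continuously rather than for a bounded number of rounds, and S270 then renders the returned cues at their shifted times.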
The present invention proposes a solution for the synchronized display of captions using speech recognition technology. By using dynamic speech sampling, effective audio information is obtained in a reasonable manner, the caption text is matched, and its display time information is adjusted, so that the caption text can be effectively adjusted when interruptions and similar events occur in a digital TV program, keeping the captions displayed in synchronization.
The method according to the invention may be recorded in computer-readable media including program instructions for performing various operations implemented by a computer. The media may contain program instructions alone, or data files, data structures, and the like combined with program instructions. Examples of computer-readable media include magnetic media (such as hard disks, floppy disks, and magnetic tape); optical media (such as CD-ROMs and DVDs); magneto-optical media; and hardware devices specially configured to store and execute program instructions (such as read-only memory (ROM), random access memory (RAM), and flash memory). The media may also include transmission media (such as optical or metal lines and waveguides) carrying a carrier wave that conveys signals specifying program instructions, data structures, and the like. Examples of program instructions include machine code, such as that produced by a compiler, and files containing high-level code executable by a computer using an interpreter.
Although the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it should be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
Claims (14)
1. A caption synchronization apparatus based on speech recognition, including:
a speech recognition module that extracts speech from the foreground sound of an audio stream corresponding to a playing video, samples and recognizes the extracted speech, and generates text information corresponding to the recognized speech;
a dynamic sampling adjustment module that evaluates the semantic recognition degree of the text information generated by the speech recognition module by determining whether the number of speech words in the text information is within a predetermined range, and, according to the evaluation result, controls the speech recognition module to adjust its sampling rate so as to obtain text information with a high semantic recognition degree;
a caption semantic comparison module that semantically matches the text information having a high semantic recognition degree against the words of the additional multilingual captions of the playing video;
a caption synchronization module that, if the caption semantic comparison module finds a sentence in the caption file corresponding to the text information of the recognized speech, adjusts the time information of the caption file according to the time information of the speech;
a caption display module that displays captions according to the time information of the caption file as adjusted by the caption synchronization module.
2. The caption synchronization apparatus as claimed in claim 1, further including:
a language selection module that determines the language of the captions to be displayed according to the user's selection.
3. The caption synchronization apparatus as claimed in claim 1, wherein, when the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is within a predetermined range [m, n], the dynamic sampling adjustment module determines that the text information has a high semantic recognition degree, where m and n are natural numbers.
4. The caption synchronization apparatus as claimed in claim 3, wherein:
if the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is less than the minimum number m, the dynamic sampling adjustment module controls the speech recognition module to raise the sampling frequency and sample the speech;
if the dynamic sampling adjustment module determines that the number of speech words in the text information generated by the speech recognition module is greater than the maximum number n, the dynamic sampling adjustment module controls the speech recognition module to lower the sampling frequency and sample the speech.
5. The caption synchronization apparatus as claimed in claim 3 or 4, wherein the dynamic sampling adjustment module considers the semantic meaning of the speech words in the text information generated by the speech recognition module when evaluating the semantic recognition degree of the text information.
6. The caption synchronization apparatus as claimed in claim 1, wherein the caption semantic comparison module uses a fuzzy algorithm to score the characters of the words of the additional multilingual captions of the playing video, so as to find the highest-scoring sentence in the caption file as the sentence matching the text information.
7. The caption synchronization apparatus as claimed in claim 1, wherein, if the caption semantic comparison module does not find a sentence in the caption file corresponding to the text information of the recognized speech, it notifies the dynamic sampling adjustment module to raise the sampling frequency of the speech recognition module.
8. A caption synchronization method based on speech recognition, including:
(a) extracting speech from the foreground sound of an audio stream corresponding to a playing video, and sampling and recognizing the extracted speech so as to generate text information corresponding to the recognized speech;
(b) evaluating the semantic recognition degree of the generated text information by determining whether the number of speech words in the text information is within a predetermined range, and, according to the evaluation result, controlling the speech recognition module to adjust its sampling rate so as to obtain text information with a high semantic recognition degree;
(c) semantically matching the text information having a high semantic recognition degree against the words of the additional multilingual captions of the playing video, to find a sentence in the caption file corresponding to the text information of the recognized speech;
(d) adjusting the time information of the caption file according to the time information of the speech;
(e) displaying captions according to the adjusted time information of the caption file.
9. The caption synchronization method as claimed in claim 8, further including:
determining the language of the captions to be displayed according to the user's selection.
10. The caption synchronization method as claimed in claim 8, wherein, in step (b), when the number of speech words in the text information generated in step (a) is determined to be within a predetermined range [m, n], the text information is determined to have a high semantic recognition degree, where m and n are natural numbers.
11. The captioning synchronization method as claimed in claim 8, wherein, in step (b):
if the number of phonetic words in the text information generated in step (a) is determined to be less than the minimum number m, the method returns to step (a) and increases the sampling frequency to sample the voice;
if the number of phonetic words in the text information generated in step (a) is determined to be greater than the maximum number n, the method returns to step (a) and reduces the sampling frequency to sample the voice.
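Claims 10 and 11 state only the direction of the adjustment (raise the rate below m words, lower it above n). A minimal sketch of that rule follows; the range [m, n], the 8 kHz step, and the 8–48 kHz bounds are assumed values for illustration:

```python
def adjust_sample_rate(word_count, rate, m=3, n=30, step=8000):
    """Return a new sampling rate based on how many phonetic words were recognized.

    The range [m, n], step size, and rate bounds are illustrative assumptions;
    the patent specifies only that m and n are natural numbers.
    """
    if word_count < m:                     # too few words: raise the sampling rate
        return min(rate + step, 48000)
    if word_count > n:                     # implausibly many words: lower the rate
        return max(rate - step, 8000)
    return rate                            # within [m, n]: keep the current rate
```

A word count inside [m, n] leaves the rate untouched, which is what lets the loop in claim 8 terminate once recognition quality is adequate.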
12. The captioning synchronization method as claimed in claim 10 or 11, wherein, in step (b), the semantic meaning of the phonetic words in the text information generated in step (a) is also considered when evaluating the semantic recognition degree of the text information.
13. The captioning synchronization method as claimed in claim 8, wherein, in step (c), a fuzzy matching algorithm is used to perform character scoring on the words of the multilingual subtitles attached to the video being played, and the highest-scoring sentence in the subtitle file is selected as the sentence matching the text information.
14. The captioning synchronization method as claimed in claim 8, wherein, if no sentence corresponding to the text information of the recognized voice is found in the subtitle file in step (c), the method returns to step (a) and increases the sampling frequency of speech recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310069142.9A CN104038804B (en) | 2013-03-05 | 2013-03-05 | Captioning synchronization apparatus and method based on speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104038804A CN104038804A (en) | 2014-09-10 |
CN104038804B true CN104038804B (en) | 2017-09-29 |
Family
ID=51469372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310069142.9A Active CN104038804B (en) | 2013-03-05 | 2013-03-05 | Captioning synchronization apparatus and method based on speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104038804B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104202425A (en) * | 2014-09-19 | 2014-12-10 | 武汉易象禅网络科技有限公司 | Real-time online data transmission system and remote course data transmission method |
CN105741841B (en) * | 2014-12-12 | 2019-12-03 | 深圳Tcl新技术有限公司 | Sound control method and electronic equipment |
KR102413692B1 (en) * | 2015-07-24 | 2022-06-27 | 삼성전자주식회사 | Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device |
CN105374366A (en) * | 2015-10-09 | 2016-03-02 | 广东小天才科技有限公司 | Method and system for wearable device to identify meaning |
CN105848005A (en) * | 2016-03-28 | 2016-08-10 | 乐视控股(北京)有限公司 | Video subtitle display method and video subtitle display device |
CN106486125A (en) * | 2016-09-29 | 2017-03-08 | 安徽声讯信息技术有限公司 | A kind of simultaneous interpretation system based on speech recognition technology |
CN106792097A (en) * | 2016-12-27 | 2017-05-31 | 深圳Tcl数字技术有限公司 | Audio signal captions matching process and device |
CN106604125B (en) * | 2016-12-29 | 2019-06-14 | 北京奇艺世纪科技有限公司 | A kind of determination method and device of video caption |
CN107241616B (en) * | 2017-06-09 | 2018-10-26 | 腾讯科技(深圳)有限公司 | video lines extracting method, device and storage medium |
CN108289244B (en) * | 2017-12-28 | 2021-05-25 | 努比亚技术有限公司 | Video subtitle processing method, mobile terminal and computer readable storage medium |
CN108366305A (en) * | 2018-02-07 | 2018-08-03 | 深圳佳力拓科技有限公司 | A kind of code stream without subtitle shows the method and system of subtitle by speech recognition |
CN108366182B (en) * | 2018-02-13 | 2020-07-07 | 京东方科技集团股份有限公司 | Calibration method and device for synchronous broadcast of text voice and computer storage medium |
CN108259963A (en) * | 2018-03-19 | 2018-07-06 | 成都星环科技有限公司 | A kind of TV ends player |
CN108449629B (en) * | 2018-03-31 | 2020-06-05 | 湖南广播电视台广播传媒中心 | Audio voice and character synchronization method, editing method and editing system |
CN109195007B (en) * | 2018-10-19 | 2021-09-07 | 深圳市轱辘车联数据技术有限公司 | Video generation method, device, server and computer readable storage medium |
CN109949793A (en) * | 2019-03-06 | 2019-06-28 | 百度在线网络技术(北京)有限公司 | Method and apparatus for output information |
CN110689220B (en) * | 2019-08-20 | 2023-04-28 | 国网山东省电力公司莱芜供电公司 | Automatic point-setting machine for realizing dispatching automation |
CN110619868B (en) * | 2019-08-29 | 2021-12-17 | 深圳市优必选科技股份有限公司 | Voice assistant optimization method, voice assistant optimization device and intelligent equipment |
CN110557668B (en) * | 2019-09-06 | 2022-05-03 | 常熟理工学院 | Sound and subtitle accurate alignment system based on wavelet ant colony |
CN110798733A (en) * | 2019-10-30 | 2020-02-14 | 中央电视台 | Subtitle generating method and device, computer storage medium and electronic equipment |
CN114333918A (en) * | 2020-09-27 | 2022-04-12 | 广州市久邦数码科技有限公司 | Method and device for matching audio book subtitles |
CN115474066A (en) * | 2021-06-11 | 2022-12-13 | 北京有竹居网络技术有限公司 | Subtitle processing method and device, electronic equipment and storage medium |
CN113689865A (en) * | 2021-08-24 | 2021-11-23 | 广东优碧胜科技有限公司 | Sampling rate switching method and device, electronic equipment and voice system |
CN116471436A (en) * | 2023-04-12 | 2023-07-21 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6199041B1 (en) * | 1998-11-20 | 2001-03-06 | International Business Machines Corporation | System and method for sampling rate transformation in speech recognition |
CN101320560A (en) * | 2008-07-01 | 2008-12-10 | 上海大学 | Method for speech recognition system improving discrimination by using sampling velocity conversion |
CN101505397A (en) * | 2009-02-20 | 2009-08-12 | 深圳华为通信技术有限公司 | Method and system for audio and video subtitle synchronous presenting |
CN101808202A (en) * | 2009-02-18 | 2010-08-18 | 联想(北京)有限公司 | Method, system and computer for realizing sound-and-caption synchronization in video file |
CN102708861A (en) * | 2012-06-15 | 2012-10-03 | 天格科技(杭州)有限公司 | Poor speech recognition method based on support vector machine |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100640893B1 (en) * | 2004-09-07 | 2006-11-02 | 엘지전자 주식회사 | Baseband modem and mobile terminal for voice recognition |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019164535A1 (en) * | 2018-02-26 | 2019-08-29 | Google Llc | Automated voice translation dubbing for prerecorded videos |
KR102481871B1 (en) * | 2018-02-26 | 2022-12-28 | 구글 엘엘씨 | Automated voice translation dubbing of pre-recorded videos |
KR20230005430A (en) * | 2018-02-26 | 2023-01-09 | 구글 엘엘씨 | Automated voice translation dubbing for prerecorded videos |
KR102598824B1 (en) | 2018-02-26 | 2023-11-06 | 구글 엘엘씨 | Automated voice translation dubbing for prerecorded videos |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104038804B (en) | Captioning synchronization apparatus and method based on speech recognition | |
CN111968649B (en) | Subtitle correction method, subtitle display method, device, equipment and medium | |
CN109635270B (en) | Bidirectional probabilistic natural language rewrite and selection | |
US7949530B2 (en) | Conversation controller | |
CN105244022B (en) | Audio-video method for generating captions and device | |
JP5610197B2 (en) | SEARCH DEVICE, SEARCH METHOD, AND PROGRAM | |
US7949532B2 (en) | Conversation controller | |
US20210158795A1 (en) | Generating audio for a plain text document | |
CN106878805A (en) | A kind of mixed languages subtitle file generation method and device | |
WO2011068170A1 (en) | Search device, search method, and program | |
US8688725B2 (en) | Search apparatus, search method, and program | |
JPH11191000A (en) | Method for aligning text and voice signal | |
US20180047387A1 (en) | System and method for generating accurate speech transcription from natural speech audio signals | |
US9818450B2 (en) | System and method of subtitling by dividing script text into two languages | |
CN109754783A (en) | Method and apparatus for determining the boundary of audio sentence | |
JP2015212732A (en) | Sound metaphor recognition device and program | |
CN111739556A (en) | System and method for voice analysis | |
KR101410601B1 (en) | Spoken dialogue system using humor utterance and method thereof | |
CN110324709A (en) | A kind of processing method, device, terminal device and storage medium that video generates | |
JP2012181358A (en) | Text display time determination device, text display system, method, and program | |
CN111079423A (en) | Method for generating dictation, reading and reporting audio, electronic equipment and storage medium | |
Levin et al. | Automated closed captioning for Russian live broadcasting | |
CN105931641A (en) | Subtitle data generation method and device | |
US11967248B2 (en) | Conversation-based foreign language learning method using reciprocal speech transmission through speech recognition function and TTS function of terminal | |
CN102970618A (en) | Video on demand method based on syllable identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant