CN110381389A - Artificial-intelligence-based caption generation method and device - Google Patents
Artificial-intelligence-based caption generation method and device
- Publication number
- CN110381389A, CN201910740413.6A
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- group
- caption text
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Security & Cryptography (AREA)
- Studio Circuits (AREA)
Abstract
The embodiments of this application disclose an artificial-intelligence-based caption generation method and device, involving at least the speech processing and natural language processing technologies of artificial intelligence. For multiple speech segments, the text corresponding to each segment is obtained through speech recognition, and the duration of each silent segment is determined. Following the order of the audio stream's time axis, starting from a target speech segment, it is successively determined whether the duration of each silent segment exceeds a preset duration; the text corresponding to the speech segments between the target speech segment and the first silent segment whose duration exceeds the preset duration is added to a to-be-processed text group, and the separators in the to-be-processed text group serve as the basis for determining the caption text. Because the text portions between separators form complete sentences with coherent semantics, and the preset duration makes it possible to determine whether a silent segment is a pause between sentences, the likelihood of incomplete sentences appearing in the caption text is reduced, which helps users who watch the audio/video understand its content.
Description
This application is a divisional application of the Chinese patent application with application No. 201811355311.4, filed on November 14, 2018 and entitled "A caption generation method and device".
Technical field
This application relates to the field of audio processing, and in particular to an artificial-intelligence-based caption generation method and device.
Background art
When watching audio/video such as a network live broadcast or a film, a user can understand the audio/video content through the subtitles displayed on the picture.
In the traditional way of generating audio/video subtitles, the audio stream is processed mainly according to silent segments to generate subtitles. A silent segment can be a segment of the audio/video's audio stream that contains no speech; the audio stream is cut into multiple speech segments according to the silent segments, and for any one speech segment, the subtitle for that segment can be generated from the text corresponding to its speech.
However, because the traditional approach cuts the audio stream according to only this single audio signal feature, it is difficult to distinguish a speaker's pause within a sentence from a pause between sentences, so inappropriate speech segments are often cut out. The subtitles generated from them then contain incomplete sentences, which not only make it difficult to help the user understand the audio/video content, but may even mislead the user and cause a poor experience.
Summary of the invention
In order to solve the above technical problem, this application provides a caption generation method and device. By determining the caption text through separators, the likelihood of incomplete sentences appearing in the caption text is greatly reduced; when the caption text is displayed as the subtitle for the corresponding interval of the audio stream time axis, it helps users who watch the audio/video understand its content and improves the user experience.
The embodiments of this application disclose the following technical solutions:
In a first aspect, an embodiment of this application provides a caption generation method, the method including:
obtaining multiple speech segments cut from the same audio stream according to silent segments;
performing speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments including separators added according to text semantics;
when determining a subtitle according to the text corresponding to a target speech segment among the multiple speech segments, determining a to-be-processed text group, the to-be-processed text group including at least the text corresponding to the target speech segment;
determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group; and
using the caption text as the subtitle for the corresponding interval of the audio stream time axis.
In a second aspect, an embodiment of this application provides a caption generation device, the device including an acquiring unit, a recognition unit, a first determination unit, a second determination unit, and a generation unit:
the acquiring unit is configured to obtain multiple speech segments cut from the same audio stream according to silent segments;
the recognition unit is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments including separators added according to text semantics;
the first determination unit is configured to determine, when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments, a to-be-processed text group, the to-be-processed text group including at least the text of the target speech segment;
the second determination unit is configured to determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group; and
the generation unit is configured to use the caption text as the subtitle for the corresponding interval of the audio stream time axis.
In a third aspect, an embodiment of this application provides a device for caption generation, the device including a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the caption generation method of any item of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium, the computer-readable storage medium being configured to store program code, the program code being configured to execute the caption generation method of any item of the first aspect.
It can be seen from the above technical solutions that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, that text including separators added according to text semantics. When a subtitle is determined according to the text corresponding to a target speech segment among them, a to-be-processed text group for generating the subtitle is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in the to-be-processed text group serve as the basis for determining caption text from that group. Because the separators in the to-be-processed text group were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators form complete sentences with coherent semantics; the likelihood of incomplete sentences appearing in the caption text determined through the separators is therefore greatly reduced. When this caption text is displayed as the subtitle for the corresponding interval of the audio stream time axis, it helps users who watch the audio/video understand its content and improves the user experience.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the caption generation method provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the relationship between an audio stream, silent segments, and speech segments provided by an embodiment of this application;
Fig. 3 is a flowchart of a caption generation method provided by an embodiment of this application;
Fig. 4 is a flowchart of a method for determining a to-be-processed text group provided by an embodiment of this application;
Fig. 5 is a flowchart of a method for generating, from the caption text, the subtitle for the corresponding interval of the audio stream time axis, provided by an embodiment of this application;
Fig. 6 is an example diagram of determining the audio stream time axis interval corresponding to the caption text, provided by an embodiment of this application;
Fig. 7 is a flowchart of a caption generation method provided by an embodiment of this application;
Fig. 8 is a structural flowchart of caption generation provided by an embodiment of this application;
Fig. 9a is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9b is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9c is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9d is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9e is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 10 is a structure diagram of a device for caption generation provided by an embodiment of this application;
Fig. 11 is a structure diagram of a device for caption generation provided by an embodiment of this application.
Detailed description of embodiments
In order to make those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
In traditional caption generation methods, the audio stream is processed mainly according to silent segments to generate subtitles. A silent segment can, to some extent, reflect the user's pause between sentences, but different users have different speaking habits, and some users also pause within a sentence. Take the sentence "In this sunny day, two children are playing hide-and-seek" as an example, where a space marks an in-sentence pause: because of the user's speaking habits, or because the user needs to think while expressing the sentence, a pause occurs between "In this sunny" and "day, two children are playing hide-and-seek".
If the audio stream is cut by silent segments alone, the audio containing "In this sunny" may be cut into one speech segment and the audio containing "day, two children are playing hide-and-seek" into another. With one subtitle corresponding to each speech segment, "In this sunny" becomes one subtitle and "day, two children are playing hide-and-seek" another, so the generated subtitles contain incomplete sentences. When the subtitles are displayed, the user first sees the subtitle "In this sunny" and only afterwards the subtitle "day, two children are playing hide-and-seek", which may affect the user's understanding and cause a poor experience.
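The failure mode just described can be reproduced in a couple of lines: with silence-only cutting, each speech segment becomes its own subtitle, so an in-sentence pause yields an incomplete first subtitle. The segment texts below are the translated example from this paragraph; the function is an illustration of the traditional behaviour, not any real system's code.

```python
# Silence-only captioning: one subtitle per speech segment, as in the
# traditional approach criticized above. Segment texts reproduce the
# example in this paragraph.
def naive_captions(segment_texts):
    return [text.strip() for text in segment_texts]

subtitles = naive_captions(
    ["In this sunny", "day, two children are playing hide-and-seek"])
# The first subtitle, "In this sunny", is an incomplete sentence.
```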
For this reason, an embodiment of this application provides a caption generation method. On the basis of cutting the audio stream into multiple speech segments according to silent segments, the method takes separators as the basis for determining the caption text. Because separators are added on the basis of semantics when the text in the speech segments is recognized, the text portions between separators form complete sentences with coherent semantics. Determining the caption text from the to-be-processed text group through separators therefore greatly reduces the likelihood of incomplete sentences appearing in the caption text, and the displayed subtitles help users who watch the audio/video understand its content, improving the user experience.
The caption generation method provided by the embodiments of this application is realized on the basis of artificial intelligence. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, special-purpose artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The artificial intelligence software technologies mainly involved in the embodiments of this application include the above speech processing and natural language processing directions.
For example, the speech recognition technology in speech technology (Speech Technology) may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, text preprocessing and machine translation in natural language processing (NLP) may be involved, including word/sentence segmentation, part-of-speech tagging, sentence classification, translation word selection, sentence generation, part-of-speech change (word-activity), editing and outputting, and the like.
It can be understood that, compared with traditional caption generation methods, the caption generation method provided by the embodiments of this application reduces the likelihood of incomplete sentences appearing in the caption text and needs no later manual proofreading. It can therefore be applied in real-time scenarios such as live video streaming, video chat, and games; of course, the caption generation method provided by the embodiments of this application can also be applied in non-live scenarios, for example to generate subtitles for a recorded audio/video file.
The caption generation method provided by the embodiments of this application can be applied in an audio/video processing device with the capability of generating subtitles; the device can be a terminal device or a server.
The audio/video processing device can have the ability to implement automatic speech recognition (ASR), voiceprint recognition, and other speech technologies. Enabling an audio/video processing device to listen, see, and feel is the development direction of future human-computer interaction, and speech is one of the most promising modes of future human-computer interaction. In the embodiments of this application, by implementing the above speech technologies, the audio/video processing device can perform speech recognition on the acquired speech segments to obtain the text corresponding to each speech segment.
The audio/video processing device can also have the ability to implement natural language processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies the theories and methods that enable effective communication between humans and computers in natural language; it is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistic research. Natural language processing technologies generally include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and other technologies.
In the embodiments of this application, by implementing the above NLP technologies, the audio/video processing device can realize the process of determining the caption text from the previously determined text, and can perform functions such as translating the caption text.
If the audio/video processing device is a terminal device, the terminal device can be a smart terminal, a computer, a personal digital assistant (PDA), a tablet computer, or the like.
If the audio/video processing device is a server, the server can be an independent server or a cluster server. When the server obtains the caption text by using the caption generation method, the caption text is displayed, as the subtitle for the corresponding interval of the audio stream time axis, on the terminal device corresponding to the user, so as to display subtitles in real time during live video streaming.
In order to facilitate understanding of the technical solution of this application, the caption generation method provided by the embodiments of this application is introduced below with reference to a practical application scenario.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the caption generation method provided by an embodiment of this application. The application scenario is introduced by taking the caption generation method as applied to a server (that is, the audio/video processing device is a server) as an example. The application scenario includes a server 101; the server 101 can obtain multiple speech segments cut from the same audio stream according to silent segments, for example speech segment 1, speech segment 2, and speech segment 3 in Fig. 1. These speech segments come from the same audio stream and are obtained in the chronological order in which the speech segments were generated.
The audio stream includes the speech uttered by a person in a to-be-processed object. The to-be-processed object can be audio/video generated in a live-streaming scenario, or a determined audio/video file, such as a recorded or downloaded audio/video file; the to-be-processed object includes the audio stream. The speech uttered by a person can be a broadcaster speaking in a live-streaming scenario, or a played audio file that includes speech, such as a recording or a played song.
A speech segment can refer to a part of the audio stream that includes speech information, while a silent segment can refer to a part of the audio stream that has no speech information; a silent segment can reflect an in-sentence pause or a between-sentence pause that occurs while the user is speaking.
The relationship between the audio stream, the silent segments, and the speech segments can be as shown in Fig. 2. As can be seen from Fig. 2, for the audio stream corresponding to moments 0 to t1 on the time axis, the audio stream can be cut into multiple speech segments according to the silent segments obtained in the process of acquiring the audio stream, for example speech segment 1, speech segment 2, speech segment 3, and speech segment 4 in Fig. 2.
It should be noted that the speech segments can be cut by the server according to the silent segments while the audio stream is being obtained, or the server can directly acquire speech segments that have already been cut.
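The cutting shown in Fig. 2 can be sketched as grouping voice-activity frames: a run of speech frames forms a speech segment, and the gaps between runs are the silent segments. The frame representation and hop size below are assumptions for illustration, not the patent's mechanism.

```python
# Sketch of cutting an audio stream into speech segments at silent
# segments, as in Fig. 2. Frames are (start_time, is_speech) pairs at
# a fixed hop; this VAD-style representation is an assumption.
def cut_speech_segments(frames, hop=1):
    segments, start = [], None
    for t, is_speech in frames:
        if is_speech and start is None:
            start = t                      # a speech segment begins
        elif not is_speech and start is not None:
            segments.append((start, t))    # a silent segment ends it
            start = None
    if start is not None:                  # stream ends mid-speech
        segments.append((start, frames[-1][0] + hop))
    return segments
```

For example, frames marked speech at times 0, 1, 3, and 5 (and silent at 2 and 4) yield the segments `[(0, 2), (3, 4), (5, 6)]`.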
The server 101 performs speech recognition on the multiple acquired speech segments to obtain the text corresponding to each of the multiple speech segments; the corresponding text includes separators added according to text semantics.
Because the separators are added on the basis of semantics when the text in the speech segments is recognized, the text portions between separators form complete sentences with coherent semantics; the likelihood of incomplete sentences appearing in the caption text subsequently determined through the separators is therefore greatly reduced.
The separators may include punctuation marks and special characters; the punctuation marks may include the full stop, comma, exclamation mark, question mark, and so on, and the special characters may include the space, underscore, vertical bar, slash, and so on.
When the server 101 determines a subtitle according to the text corresponding to some speech segment among the multiple speech segments, such as a target speech segment (Fig. 1 takes speech segment 2 as the target speech segment), it determines a to-be-processed text group; the to-be-processed text group includes at least the text corresponding to the target speech segment.
It should be noted that the embodiments of this application determine the caption text on the basis of separators for each speech segment. When a subtitle is determined according to the text corresponding to the target speech segment, the target speech segment is not necessarily the first speech segment of the audio stream to be processed; the last time caption text was determined on the basis of separators, part of the text corresponding to the target speech segment may already have been used to determine caption text. The text corresponding to the target speech segment may therefore not be the full text corresponding to the target speech segment, but the remaining text of the target speech segment after the previous subtitle was generated.
Taking "On this sunny day, two children are playing hide-and-seek" as an example, the texts corresponding to the cut speech segments may be "On this sun" and "ny day, two children are playing hide-and-seek", where "," is the separator. When a subtitle text is determined on the basis of the separator ",", the part "ny day" of the second segment's text is combined with the first segment's text "On this sun" to generate the subtitle text. Therefore, when the speech segment whose text is "ny day, two children are playing hide-and-seek" serves as the target speech segment, the text corresponding to it is "two children are playing hide-and-seek", that is, only part of its text.
Of course, the text corresponding to the target speech segment may also be its full text; the embodiment of the present application does not limit this.
For example, if the target speech segment is the first speech segment of the audio stream to be processed, or if its text was not used to generate a subtitle text the last time subtitle texts were determined according to separators, then the text corresponding to the target speech segment is its full text.
It should be noted that the to-be-processed text group may include only the text corresponding to the target speech segment, or may include the texts corresponding to multiple speech segments including the target speech segment. In the latter case, the to-be-processed text group may be obtained by splicing the text corresponding to the target speech segment with the texts corresponding to one or more speech segments following it; how the to-be-processed text group is determined will be introduced later. Subtitle texts are then determined from the to-be-processed text group by means of the separators, and the subtitle of the corresponding audio-stream time-axis section is generated.
In this embodiment, the subtitle text may be recognized on the basis of the language used in the speech segment, but the subtitle is not limited to that language. The language of the subtitle may be determined according to the user's needs: it may be the language of the speech segment, another language, or several languages. For example, if the subtitle text is English, the displayed subtitle may be an English subtitle, a Chinese subtitle, or of course a Chinese-English subtitle, and so on.
Next, the subtitle generating method provided by the embodiments of the present application will be introduced with reference to the drawings.
Referring to Fig. 3, which shows a flow chart of a subtitle generating method, the method comprises:
S301: obtaining multiple speech segments that come from the same audio stream and are cut according to silent segments.
Many speech segments may be obtained by cutting according to silent segments, and they may belong to different audio streams. In this embodiment, the obtained speech segments come from the same audio stream and are obtained successively in the order in which they were generated.
S302: performing speech recognition on the multiple speech segments to obtain the texts respectively corresponding to the multiple speech segments.
When the texts corresponding to the multiple speech segments are obtained by speech recognition, separators may be added to those texts on the basis of the text semantics, so that subtitle texts can subsequently be determined by means of the separators.
S303: determining a to-be-processed text group when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments.
For the speech segments of the same audio stream, the corresponding texts need to be processed in the order in which the speech segments were generated. The text corresponding to the speech segment currently serving as the basis for the subtitle is the text corresponding to the target speech segment, and the determined to-be-processed text group includes at least the text of the target speech segment.
A speech segment may have been cut at a pause between sentences, or at a pause within a sentence. In order to reduce the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence, this embodiment provides a method for determining the to-be-processed text group.
Referring to Fig. 4, the method comprises:
S401: determining the time lengths of the silent segments between the multiple speech segments.
The time length of a silent segment can, to a certain extent, indicate whether the silent segment is a pause between sentences or a pause within a sentence. In general, the silent segment produced by a pause within a sentence has a smaller time length, while the silent segment produced by a pause between sentences has a larger time length. Therefore, according to the determined time lengths of the silent segments, it can be known which speech segments may be combined with the target speech segment to form the to-be-processed text group.
The time length of a silent segment may be determined as follows: when the speech segments are obtained, for the current speech segment, record its ending timestamp T_sil_begin and the beginning timestamp T_sil_end of the next speech segment, and successively calculate the time length T_sil of the silent segment after the current speech segment, i.e. T_sil = T_sil_end - T_sil_begin.
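Under the timestamp convention just described, computing the silent-segment lengths from recorded segment boundaries is straightforward; a hypothetical sketch (times in milliseconds, names illustrative):

```python
# Sketch of T_sil = T_sil_end - T_sil_begin for every gap between
# consecutive speech segments. `segments` holds (start, end) timestamps
# in ms on the audio-stream time axis; names are illustrative.
def silence_durations(segments):
    durations = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        durations.append(next_start - prev_end)  # silence after prev segment
    return durations
```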
S402: successively determining, starting from the target speech segment and in the order of the audio-stream time axis, whether the time length of each silent segment is greater than a preset duration.
The preset duration is determined according to how long users normally pause between sentences when speaking; according to the preset duration, it can be judged whether a silent segment is likely to be a pause between sentences or a pause within a sentence.
Referring to Fig. 2, in the order of the audio-stream time axis there are, in sequence, speech segment 1, silent segment A, speech segment 2, silent segment B, speech segment 3, silent segment C and speech segment 4, where the time length of silent segment A is T_sil-1, that of silent segment B is T_sil-2, and that of silent segment C is T_sil-3. If speech segment 1 is the target speech segment, it needs to be successively determined, starting from silent segment A, whether the time length of each silent segment is greater than the preset duration. If it is not, that silent segment may be considered a pause within a sentence, and it is then determined whether the time length of silent segment B is greater than the preset duration, and so on, until the time length of some silent segment is found to be greater than the preset duration. At that point the silent segment may be considered a pause between sentences, i.e. the texts corresponding to the speech segments before and after it may belong to two different sentences.
S403: if the time length of a target silent segment is determined to be greater than the preset duration, adding the texts corresponding to the speech segments between the target silent segment and the target speech segment to the to-be-processed text group.
In the process of successively determining whether the time lengths of the silent segments are greater than the preset duration, the silent segments before the target silent segment all have time lengths smaller than the preset duration, so the texts corresponding to the speech segments between the target silent segment and the target speech segment are likely to belong to the same sentence, and these texts can be spliced together. Once the time length of some silent segment (the target silent segment) is found to be greater than the preset duration, the step of determining whether a silent segment's time length is greater than the preset duration can stop. In order to reduce the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence, the texts corresponding to the speech segments between the target silent segment and the target speech segment are spliced to form the to-be-processed text group.
Referring to Fig. 2, if it is successively determined that T_sil-1 is less than the preset duration, T_sil-2 is less than the preset duration, and T_sil-3 is greater than the preset duration, then silent segment A and silent segment B may be pauses within a sentence, while silent segment C may be a pause between sentences. Silent segment C can be taken as the target silent segment, and the texts corresponding to speech segment 1, speech segment 2 and speech segment 3 can be spliced into the to-be-processed text group.
By means of the time lengths of the silent segments, this method successively determines whether each silent segment after the target speech segment reflects a pause within a sentence, so that the texts of speech segments cut apart by a pause within a sentence are spliced together to constitute the to-be-processed text group, reducing the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence.
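The splicing logic of S401-S403 can be sketched as follows, under the assumption that the segment texts and the durations of the silences after each segment are already available (all names illustrative):

```python
# Sketch of S401-S403 under illustrative names: starting from the target
# speech segment, splice the following segments' texts while each
# intervening silent segment is no longer than `preset_duration`; stop
# at the first longer one (the target silent segment).
def build_pending_group(texts, silence_durations, preset_duration):
    # texts[0] is the target segment's text; silence_durations[i] is the
    # length (ms) of the silence after segment i.
    group = texts[0]
    for text, dur in zip(texts[1:], silence_durations):
        if dur > preset_duration:
            break  # pause between sentences: stop splicing here
        group += text
    return group
```

In the Fig. 2 example, with the first two silences below the threshold and the third above it, the texts of segments 1, 2 and 3 are spliced and segment 4 is left for the next round.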
S304: determining a subtitle text from the to-be-processed text group according to the separators in the to-be-processed text group.
Because the separators in the to-be-processed text group are added on the basis of semantics when the text in a speech segment is recognized, the text portions between separators are complete sentences that convey reasonable semantics; the possibility of incomplete sentences appearing in the subtitle text determined by means of the separators is therefore greatly reduced.
For example, the text corresponding to speech segment 1 is "On this sun" and the text corresponding to speech segment 2 is "ny day, two children are playing hide-and-seek". When speech segment 1 is the target speech segment, the to-be-processed text group determined through S303 is "On this sunny day, two children are playing hide-and-seek", where "," is the separator; according to the separator in this to-be-processed text group, "On this sunny day" can be determined as the subtitle text. When processing continues, the part "ny day" of the text corresponding to speech segment 2 has already been used in the previous round to generate a subtitle text, but the text "two children are playing hide-and-seek" remains. Therefore, when a subtitle is determined according to the text corresponding to speech segment 2 (now the target speech segment), the text corresponding to the target speech segment is the remaining part "two children are playing hide-and-seek" rather than the full text "ny day, two children are playing hide-and-seek", and S303-S305 continue to be executed for this remaining text.
Compared with the traditional approach, in which the text "On this sun" corresponding to speech segment 1 and the text "ny day, two children are playing hide-and-seek" corresponding to speech segment 2 each correspond to one subtitle text, both containing incomplete sentences, the method of the embodiment of the present application ensures that the determined subtitle texts "On this sunny day" and "two children are playing hide-and-seek" are complete sentences, thereby reducing the possibility of incomplete sentences appearing in subtitle texts.
S305: using the subtitle text as the subtitle of the corresponding audio-stream time-axis section.
When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis section, it can help users watching the audio/video to understand the audio/video content, improving the user experience.
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments that come from the same audio stream and are cut according to silent segments, speech recognition is performed on the multiple speech segments to obtain their respectively corresponding texts, which include separators added according to the text semantics. When a subtitle is determined according to the text corresponding to a target speech segment, a to-be-processed text group used for generating the subtitle is determined, including at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining a subtitle text from it. Because the separators are added on the basis of semantics when the text in a speech segment is recognized, the text portions between separators are complete sentences that convey reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by means of the separators is greatly reduced. When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis section, it can help users watching the audio/video to understand the audio/video content, improving the user experience.
The above embodiment describes the subtitle generating method. In the process of generating subtitles, the subtitle text needs to be determined from the to-be-processed text group according to the separators. Since the to-be-processed text group and the separators in it may differ from case to case (for example, the display subtitle length may also need to be considered when determining the subtitle text, as well as which separator it is appropriate to determine the subtitle text by), the way the subtitle text is determined can differ in different cases. In this embodiment, the subtitle text can be determined in different cases with reference to the following formula:

L_text = L_sil, if L_sil ≤ L_seg;
L_text = L_punc, if L_sil > L_seg and L_punc > 0;
L_text = min(L_sil, L_max), if L_sil > L_seg and L_punc = 0.

Here L_text may denote the length of the determined subtitle text; L_sil may denote the text length of the to-be-processed text group; L_seg may denote the preset quantity, determined according to the display subtitle length; L_punc may denote the text length from the first character of the to-be-processed text group to the last separator among its first preset-quantity characters, or from the first character to the last separator; L_max may denote the maximum quantity, i.e. the number of characters corresponding to the maximum display subtitle length.
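The three cases can be sketched as one selection function; the parameter names mirror the quantities L_sil, L_punc, L_seg and L_max above, and the boundary handling (strict versus non-strict comparison) is an assumption where the text leaves it open:

```python
# The three length cases as one selection function; parameter names
# mirror L_sil, L_punc, L_seg, L_max. Boundary handling (<= vs <) is an
# assumption where the text leaves it open.
def subtitle_length(l_sil, l_punc, l_seg, l_max):
    if l_sil <= l_seg:        # group already fits the display length
        return l_sil
    if l_punc > 0:            # cut at the last usable separator
        return l_punc
    return min(l_sil, l_max)  # no separator: cap at the maximum length
```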
Based on the above formula, an appropriate subtitle text can be determined in different cases. Next, the ways of determining the subtitle text from the to-be-processed text group in the different cases will be introduced one by one.
The first case may be that the text length of the to-be-processed text group is not greater than the preset quantity, i.e. L_sil ≤ L_seg; in this case, the subtitle text can be determined by the formula L_text = L_sil.
Specifically, under normal circumstances, the display subtitle length is influenced by the subtitle's font size, the display screen size, the user experience and so on, and a displayed subtitle needs to have a reasonable length, namely the display subtitle length, which can be expressed as the number of characters in the displayed subtitle. In this way, after the to-be-processed text group is obtained, it can be judged whether its character quantity is greater than the preset quantity, i.e. whether L_sil is greater than L_seg; the preset quantity is determined according to the display subtitle length and is the number of characters in a subtitle that meets the display subtitle length. If not, the to-be-processed text group can be considered to meet the display subtitle length requirement and can be directly determined as the subtitle text, i.e. L_text = L_sil.
The second case may be that the text length of the to-be-processed text group is greater than the preset quantity and separators are present, i.e. L_sil > L_seg and L_punc > 0; in this case, the subtitle text can be determined by the formula L_text = L_punc.
If the character quantity of the to-be-processed text group is judged to be greater than the preset quantity, i.e. L_sil > L_seg, the character quantity of the to-be-processed text group can be considered excessive, and the to-be-processed text group needs to be truncated so that a subtitle text meeting the display subtitle length requirement is obtained from it. If it is determined that separators exist in the to-be-processed text group, i.e. L_punc > 0, then S304 can be executed to determine the subtitle text, i.e. L_text = L_punc.
It should be noted that the way of determining the subtitle text according to the separators was briefly introduced in the embodiment corresponding to Fig. 3 (in S304). Next, how the subtitle text is determined from the to-be-processed text group according to the separators, i.e. how L_punc is determined, will be introduced.
It should be noted that there are two determination modes for determining the subtitle text from the to-be-processed text group according to the separators. The first determination mode may be: determining, as the subtitle text, the part of the to-be-processed text group from the first character to the last separator; L_punc is then the text length from the first character of the to-be-processed text group to the last separator.
For example, the to-be-processed text group is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it. But". According to the first determination mode, the last separator in the to-be-processed text group is the full stop ".", so the part from the first character to "." can serve as the subtitle text, i.e. the subtitle text is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it."
However, in some cases, in order to further ensure that the subtitle text determined from the to-be-processed text group according to the separators meets the display subtitle length requirement, the display subtitle length can also be taken into account while the subtitle text is determined. That is, the second determination mode may be: determining, as the subtitle text, the part of the to-be-processed text group from the first character to the last separator among its first preset-quantity characters, the preset quantity being determined according to the display subtitle length; L_punc is then the text length from the first character of the to-be-processed text group to the last separator among its first preset-quantity characters.
For example, the to-be-processed text group is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it. But" and the preset quantity is 25. According to the second determination mode, the last separator among the first 25 characters of the to-be-processed text group is the comma after "day", so the part from the first character to that "," can serve as the subtitle text, i.e. the subtitle text is "On a clear day,". As can be seen, the subtitle text determined by the second determination mode meets the display subtitle length requirement, and the user experience is better.
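A hypothetical sketch of the two determination modes follows; the separator set is an assumption. Passing `preset_quantity=None` searches the whole group (first mode), while passing a number searches only its first `preset_quantity` characters (second mode):

```python
# Hypothetical sketch of the two determination modes for cutting the
# subtitle text at a separator. The separator set is an assumption.
SEPARATORS = set(",.!?;:")

def cut_at_separator(group, preset_quantity=None):
    window = group if preset_quantity is None else group[:preset_quantity]
    for i in range(len(window) - 1, -1, -1):  # scan backwards
        if window[i] in SEPARATORS:
            return group[:i + 1]  # keep the separator itself
    return ""  # no separator inside the window
```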
The third case may be that the text length of the to-be-processed text group is greater than the preset quantity and no separator is present, i.e. L_sil > L_seg and L_punc = 0; in this case, the subtitle text can be determined by the formula L_text = min(L_sil, L_max).
It should be noted that the premise of determining the subtitle text from the to-be-processed text group according to the separators in S304 is that the to-be-processed text group includes separators. However, in some cases the to-be-processed text group may include no separator; for example, it may be "the home address of that child in red clothes is Room 301 Unit 3 Building 2 No. 5 Zhongguancun South Street Haidian District Beijing". Next, the way of determining the subtitle text from the to-be-processed text group when its character quantity is greater than the preset quantity and it includes no separator will be introduced.
The display subtitle length is a reasonable subtitle length for display, but the subtitle length is also limited by the maximum display subtitle length. Therefore, besides determining the subtitle text by means of the display subtitle length, the subtitle text can also be determined according to the maximum display subtitle length. That the character quantity of the to-be-processed text group is greater than the preset quantity only means that it exceeds the normal display subtitle length; it does not mean the group is unacceptable, i.e. it does not mean the to-be-processed text group cannot serve as the subtitle text, as long as its character quantity does not exceed the number of characters corresponding to the maximum display subtitle length.
Specifically, when it is determined that the character quantity of the to-be-processed text group is greater than the preset quantity and the to-be-processed text group includes no separator, it can further be judged whether the character quantity of the to-be-processed text group is greater than the maximum quantity, i.e. whether L_sil is greater than L_max, where the maximum quantity L_max is the number of characters corresponding to the maximum display subtitle length. If so, the character quantity of the to-be-processed text group exceeds the acceptable maximum display subtitle length, and part of the characters needs to be cut from the to-be-processed text group as the subtitle text; for example, the first maximum-quantity characters of the to-be-processed text group can be determined as the subtitle text. If not, the character quantity of the to-be-processed text group is within the acceptable maximum display subtitle length, and the to-be-processed text group can be directly determined as the subtitle text. In this way, the text with the smaller length is determined from the to-be-processed text group as the subtitle text, i.e. L_text = min(L_sil, L_max).
For example, the to-be-processed text group is "the home address of that child in red clothes is Room 301 Unit 3 Building 2 No. 5 Zhongguancun South Street Haidian District Beijing" and the maximum quantity is 30. If the character quantity of the to-be-processed text group is 43, then it is greater than the maximum quantity 30, so the first 30 characters of the to-be-processed text group can be determined as the subtitle text.
For another example, the to-be-processed text group may be "the home address of that child is No. 5 Zhongguancun South Street Haidian District Beijing" and the maximum quantity 30. If the character quantity of the to-be-processed text group is 26, then it is less than the maximum quantity 30, so the to-be-processed text group can be directly determined as the subtitle text.
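The no-separator branch then reduces to a cap at the maximum quantity; a minimal sketch with illustrative names:

```python
# Minimal sketch of the no-separator branch: keep the whole group if it
# fits within the maximum quantity, otherwise keep only its first
# max_quantity characters.
def cap_at_max(group, max_quantity):
    return group if len(group) <= max_quantity else group[:max_quantity]
```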
The purpose of determining the subtitle text is to generate a subtitle for the corresponding audio stream. Next, how the subtitle of the corresponding audio-stream time-axis section is generated according to the subtitle text will be introduced.
It should be noted that in traditional subtitle generating methods the subtitle text is determined only by cutting at silent segments, and the subtitle of the corresponding audio-stream time-axis section is then generated according to the subtitle text, so only the time offset of each speech segment needs to be recorded at cutting time. In the embodiment of the present application, however, since the subtitle text is determined from the to-be-processed text group according to the separators, the text may be repartitioned at the separators, and relying only on the time offsets of the speech segments can hardly guarantee the accuracy of the moments to which the determined subtitle text corresponds on the time axis. Therefore, this embodiment provides a method for generating the subtitle of the corresponding audio-stream time-axis section according to the subtitle text. Referring to Fig. 5, the method comprises:
S501: determining the relative start moment, within its corresponding speech segment, of the first character of the subtitle text.
S502: determining the start moment of the audio-stream time-axis section corresponding to the subtitle text according to the relative start moment and the time offset, on the audio-stream time axis, of the speech segment corresponding to the first character.
S503: determining the relative end moment, within its corresponding speech segment, of the last character of the subtitle text.
S504: determining the end moment of the audio-stream time-axis section corresponding to the subtitle text according to the relative end moment and the time offset, on the audio-stream time axis, of the speech segment corresponding to the last character.
In this way, with the start moment and the end moment of the audio-stream time-axis section corresponding to the subtitle text, the subtitle of that section can be generated according to the subtitle text.
It can be understood that, when the subtitle text is repartitioned according to the separators, the relative start moment and relative end moment of each character in the subtitle text can be determined by the speech recognition engine. For example, for Word_1 in the subtitle text, it can be determined that its relative start moment start is 500 ms and its relative end moment end is 750 ms, and so on.
Referring to Fig. 6, suppose the determined subtitle text lies between A and B in the figure, where the position of A corresponds to the first character of the subtitle text and the position of B corresponds to the last character. The start moment of the audio-stream time-axis section corresponding to the subtitle text is the moment corresponding to position A, and the end moment of that section is the moment corresponding to position B.
As can be seen from Fig. 6, the relative start moment of the first character within its corresponding speech segment is t1, the speech segment corresponding to the first character is speech segment 2, and its time offset on the audio-stream time axis is t2. Thus, according to the relative start moment t1 and the time offset t2 of speech segment 2 on the audio-stream time axis, the start moment of the audio-stream time-axis section corresponding to the subtitle text can be determined to be t1+t2. The relative end moment of the last character within its corresponding speech segment is t3, the speech segment corresponding to the last character is speech segment 3, and its time offset on the audio-stream time axis is t4. Thus, according to the relative end moment t3 and the time offset t4 of speech segment 3 on the audio-stream time axis, the end moment of the audio-stream time-axis section corresponding to the subtitle text can be determined to be t3+t4.
On the basis of the time offsets of the speech segments, this method also combines the relative moments of the characters within their corresponding speech segments, so as to guarantee the accuracy of the moments to which the determined subtitle text corresponds on the time axis.
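The computation of S501-S504 can be sketched in a few lines; all times are assumed to be in milliseconds and the names are illustrative:

```python
# Sketch of S501-S504: the absolute start/end of the subtitle's
# time-axis section is a character's relative moment inside its speech
# segment plus that segment's time offset on the audio-stream time axis.
def subtitle_interval(rel_start, start_seg_offset, rel_end, end_seg_offset):
    start = start_seg_offset + rel_start  # t2 + t1 in Fig. 6
    end = end_seg_offset + rel_end        # t4 + t3 in Fig. 6
    return start, end
```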
It can be understood that in many cases the language of the speech segments in the audio/video is not the language the user uses every day. In order to help the user watching the audio/video understand its content, the subtitle text serving as the subtitle should be expressed in the user's everyday language. Therefore, in this embodiment, the subtitle text determined in S304 can also be translated according to the subtitle display language to obtain a translated subtitle text, and the translated subtitle text is used as the subtitle of the corresponding audio-stream time-axis section.
The subtitle display language may include Chinese, bilingual Chinese-English, English and so on, and can be set by the user according to the user's own needs. For example, if the language of the speech segments in the audio/video is English and the user's language is Chinese, the subtitle display language can be Chinese. In this way, the subtitle text whose language is English can be translated into a subtitle text whose language is Chinese; the Chinese subtitle text serves as the translated subtitle text and is used as the subtitle of the corresponding audio-stream time-axis section, making it convenient for the user to understand the audio/video content.
Next, the subtitle generating method provided by the embodiments of the present application will be introduced with reference to a concrete scenario: a live video broadcast of a speaker. Suppose the speaker gives the speech in English; then, in order to help the audience watching the live broadcast understand the speech content, subtitles need to be generated for the speech in real time. To facilitate the audience's study and understanding, the generated subtitles can be bilingual Chinese-English. In this scenario, referring to Fig. 7, the subtitle generating method comprises:
S701: obtaining multiple speech segments that come from the same audio stream and are cut according to silent segments.
S702: determining the time lengths of the silent segments between the multiple speech segments.
S703: performing speech recognition on the multiple speech segments to obtain the texts respectively corresponding to the multiple speech segments.
S704: when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments, successively determining, starting from the target speech segment and in the order of the audio-stream time axis, whether the time length of each silent segment is greater than a preset duration; if so, executing S705; if not, executing S704 for the next silent segment.
S705: add the text corresponding to the speech segments located between the target silent segment and the target speech segment to the to-be-processed text group.
S706: judge whether the character count of the to-be-processed text group is greater than a preset quantity; if so, execute S707; if not, execute S711.
S707: determine whether the to-be-processed text group contains a separator; if so, execute S708; if not, execute S709.
S708: determine the caption text from the to-be-processed text group according to the separators in the to-be-processed text group.
S709: judge whether the character count of the to-be-processed text group is greater than a maximum quantity; if so, execute S710; if not, execute S711.
S710: determine the first maximum-quantity characters of the to-be-processed text group as the caption text.
S711: determine the to-be-processed text group as the caption text.
S712: translate the caption text by machine translation.
S713: use the caption text and the translated caption text as the subtitle for the corresponding audio-stream timeline interval.
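For illustration only, the S706-S711 decisions (character-count check, separator-based split, maximum-length truncation) might be sketched as follows. The separator set, the preset quantity, and the maximum quantity are assumptions for the example; in the embodiment they depend on the display subtitle length, and the separators are those added by the speech-recognition step according to text semantics.

```python
# Minimal sketch of the S706-S711 caption-splitting decisions.
# SEPARATORS, PRESET_QUANTITY and MAX_QUANTITY are illustrative assumptions.
SEPARATORS = set("，。！？,.!?")
PRESET_QUANTITY = 20   # characters fitting the display subtitle length (assumed)
MAX_QUANTITY = 35      # characters for the maximum display subtitle length (assumed)

def determine_caption_text(pending_text: str) -> str:
    """Return the caption text taken from the to-be-processed text group."""
    if len(pending_text) <= PRESET_QUANTITY:            # S706 -> S711
        return pending_text
    # S707: look for the last separator within the first PRESET_QUANTITY characters
    head = pending_text[:PRESET_QUANTITY]
    last_sep = max((i for i, ch in enumerate(head) if ch in SEPARATORS),
                   default=-1)
    if last_sep >= 0:                                   # S708: cut at separator
        return pending_text[:last_sep + 1]
    if len(pending_text) > MAX_QUANTITY:                # S709 -> S710: truncate
        return pending_text[:MAX_QUANTITY]
    return pending_text                                 # S711
```

The remainder of the to-be-processed text group (everything after the returned caption text) would stay in the group and be carried into the next round of S704.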
In this application scenario, the architecture and flow of subtitle generation may be as shown in Fig. 8. Speech segment 1 ... speech segment 4, etc. are obtained by cutting based on silent segments, corresponding to S701 in Fig. 7; the text corresponding to the speech segments is then divided based on a combination of silent segments and semantics to obtain caption text, corresponding to S702-S711 in Fig. 7; the caption text is translated by machine translation to obtain the translated caption text (for example, machine translation of caption text 1 yields caption text 1', etc.), corresponding to S712 in Fig. 7; finally, the machine-translated caption text is merged with the audio-stream timeline to generate the corresponding subtitles, corresponding to S713 in Fig. 7. After the subtitles are obtained, they can be pushed and played in real time.
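The overall Fig. 8 flow might be sketched as the following pipeline. `recognize()` and `translate()` stand in for real speech-recognition and machine-translation services and are hypothetical placeholders, not APIs named by the embodiment; the per-group splitting of S702-S711 is omitted here for brevity.

```python
# Illustrative end-to-end pipeline for Fig. 8; recognize()/translate()
# are placeholder stubs, not real ASR/MT interfaces.
def recognize(segment):
    return segment["text"]           # placeholder for speech recognition

def translate(text):
    return text + " (translated)"    # placeholder for machine translation

def generate_subtitles(segments):
    """segments: list of dicts with 'text', 'start', 'end' (seconds).
    Returns (start, end, caption, translated_caption) tuples."""
    subtitles = []
    for seg in segments:                    # S701/S703: segments and their text
        caption = recognize(seg)            # S702-S711 splitting omitted
        translated = translate(caption)     # S712: machine translation
        subtitles.append((seg["start"], seg["end"], caption, translated))  # S713
    return subtitles
```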
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, and the text corresponding to the multiple speech segments contains separators added according to text semantics. When subtitles are determined according to the text corresponding to a target speech segment, a to-be-processed text group for generating subtitles is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining caption text from the group. Because these separators were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators are complete sentences that embody reasonable semantics; the possibility of incomplete sentences appearing in the caption text determined by the separators is therefore greatly reduced. When such caption text is displayed as the subtitle of the corresponding audio-stream timeline interval, it helps users watching the audio/video to understand the content, improving user experience.
Based on the subtitle generation method provided by the foregoing embodiments, this embodiment provides a subtitle generation apparatus 900. Referring to Fig. 9a, the apparatus 900 includes an acquiring unit 901, a recognition unit 902, a first determination unit 903, a second determination unit 904, and a generation unit 905:
The acquiring unit 901 is configured to obtain multiple speech segments cut from the same audio stream according to silent segments;
The recognition unit 902 is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments; the text corresponding to the multiple speech segments contains separators added according to text semantics;
The first determination unit 903 is configured to, when processing the text of a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text of the target speech segment;
The second determination unit 904 is configured to determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
The generation unit 905 is configured to use the caption text as the subtitle for the corresponding audio-stream timeline interval.
In one implementation, referring to Fig. 9b, the apparatus 900 further includes a third determination unit 906:
The third determination unit 906 is configured to determine the time length of the silent segments between the multiple speech segments;
The first determination unit 903 is specifically configured to determine, in the order of the audio-stream timeline and starting from the target speech segment, whether the time length of each silent segment is greater than a preset duration; and, if the time length of a target silent segment is determined to be greater than the preset duration, to add the text corresponding to the speech segments located between the target silent segment and the target speech segment to the to-be-processed text group.
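As an illustrative sketch of this accumulation behavior: scan the segments along the timeline starting from the target speech segment, gathering their text until a silent gap longer than the preset duration is reached. The data layout (a list of `(text, silence_after_s)` pairs) and the preset-duration value are assumptions for the example.

```python
# Sketch: accumulate segment texts along the audio-stream timeline until
# a silent gap longer than PRESET_DURATION follows a segment.
PRESET_DURATION = 0.8  # seconds (assumed)

def build_pending_text_group(segments, target_index):
    """segments: list of (text, silence_after_s) pairs in timeline order.
    Returns the concatenated text from the target speech segment up to and
    including the segment followed by the first long silent gap."""
    pending = []
    for text, silence_after in segments[target_index:]:
        pending.append(text)
        if silence_after > PRESET_DURATION:   # target silent segment reached
            break
    return "".join(pending)
```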
In one implementation, referring to Fig. 9c, the apparatus 900 further includes a first judging unit 907 and a fourth determination unit 908:
The first judging unit 907 is configured to judge whether the character count of the to-be-processed text group is greater than a preset quantity, the preset quantity being determined according to the display subtitle length;
If the first judging unit 907 judges that the character count of the to-be-processed text group is greater than the preset quantity, the second determination unit 904 is triggered to execute the step of determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
The fourth determination unit 908 is configured to determine the to-be-processed text group as the caption text if the first judging unit 907 judges that the character count of the to-be-processed text group is not greater than the preset quantity.
In one implementation, the second determination unit 904 is specifically configured to:
determine as the caption text the portion of the to-be-processed text group from the first character to the last separator; or,
determine as the caption text the portion of the to-be-processed text group from the first character to the last separator within the first preset-quantity characters, the preset quantity being determined according to the display subtitle length.
In one implementation, if the first judging unit 907 judges that the character count of the to-be-processed text group is greater than the preset quantity and the to-be-processed text group contains no separator, then, referring to Fig. 9d, the apparatus 900 further includes a second judging unit 909 and a fifth determination unit 910:
The second judging unit 909 is configured to judge whether the character count of the to-be-processed text group is greater than a maximum quantity, the maximum quantity being the character count corresponding to the maximum display subtitle length;
The fifth determination unit 910 is configured to determine the first maximum-quantity characters of the to-be-processed text group as the caption text if the second judging unit 909 judges that the character count of the to-be-processed text group is greater than the maximum quantity;
If the second judging unit 909 judges that the character count of the to-be-processed text group is not greater than the maximum quantity, the fourth determination unit 908 is triggered to execute the step of determining the to-be-processed text group as the caption text.
In one implementation, referring to Fig. 9e, the apparatus 900 further includes a sixth determination unit 911, a seventh determination unit 912, an eighth determination unit 913, and a ninth determination unit 914:
The sixth determination unit 911 is configured to determine the relative start moment of the first character of the caption text within its corresponding speech segment;
The seventh determination unit 912 is configured to determine the start moment of the audio-stream timeline interval corresponding to the caption text according to the relative start moment and the time offset, on the audio-stream timeline, of the speech segment corresponding to the first character;
The eighth determination unit 913 is configured to determine the relative end moment of the last character of the caption text within its corresponding speech segment;
The ninth determination unit 914 is configured to determine the end moment of the audio-stream timeline interval corresponding to the caption text according to the relative end moment and the time offset, on the audio-stream timeline, of the speech segment corresponding to the last character.
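The four determinations made by units 911-914 amount to a small calculation, sketched below under the assumption that all moments and offsets are expressed in seconds; the parameter names are illustrative, not terms fixed by the embodiment.

```python
# Sketch of units 911-914: map the relative moments of the caption's first
# and last characters inside their speech segments to absolute moments on
# the audio-stream timeline. Parameter names are illustrative assumptions.
def caption_interval(first_char_rel_start, first_seg_offset,
                     last_char_rel_end, last_seg_offset):
    """Segment offsets are the positions of the corresponding speech
    segments on the audio-stream timeline (seconds)."""
    start = first_seg_offset + first_char_rel_start   # units 911 + 912
    end = last_seg_offset + last_char_rel_end         # units 913 + 914
    return start, end
```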
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, and the text corresponding to the multiple speech segments contains separators added according to text semantics. When subtitles are determined according to the text corresponding to a target speech segment, a to-be-processed text group for generating subtitles is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining caption text from the group. Because these separators were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators are complete sentences that embody reasonable semantics; the possibility of incomplete sentences appearing in the caption text determined by the separators is therefore greatly reduced. When such caption text is displayed as the subtitle of the corresponding audio-stream timeline interval, it helps users watching the audio/video to understand the content, improving user experience.
The embodiments of the present application further provide a device for subtitle generation, which is introduced below with reference to the accompanying drawings. Referring to Fig. 10, the embodiments of the present application provide a device 1000 for subtitle generation. The device 1000 may be a server, which may vary considerably due to differences in configuration or performance, and may include one or more central processing units (CPUs) 1022 (for example, one or more processors), memory 1032, and one or more storage media 1030 (such as one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored in the storage medium 1030 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the device 1000 for subtitle generation, the series of instruction operations in the storage medium 1030.
The device 1000 for subtitle generation may further include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041 such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 10.
The CPU 1022 is configured to execute the following steps:
obtain multiple speech segments cut from the same audio stream according to silent segments;
perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments containing separators added according to text semantics;
when determining subtitles according to the text corresponding to a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text corresponding to the target speech segment;
determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
use the caption text as the subtitle for the corresponding audio-stream timeline interval.
Referring to Fig. 11, the embodiments of the present application provide a device 1100 for subtitle generation. The device 1100 may also be a terminal device, which may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and so on. The terminal device being a mobile phone is taken as an example:
Fig. 11 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Fig. 11, the mobile phone includes components such as a radio frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will understand that the mobile phone structure shown in Fig. 11 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or use a different component arrangement.
Each component part of the mobile phone is specifically introduced below with reference to Fig. 11:
The RF circuit 1110 may be used for receiving and sending signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, it delivers the information to the processor 1180 for processing, and it sends uplink data to the base station. In general, the RF circuit 1110 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1110 may also communicate with networks and other devices by wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
The memory 1120 may be used to store software programs and modules; the processor 1180 executes the various function applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, collects touch operations by the user on or near it (such as operations by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1131) and drives the corresponding connected apparatus according to a preset formula. Optionally, the touch panel 1131 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1131, the input unit 1130 may also include other input devices 1132. Specifically, the other input devices 1132 may include but are not limited to one or more of a physical keyboard, function keys (such as volume control keys, a switch key, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 1140 may include a display panel 1141; optionally, the display panel 1141 may be configured in forms such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display. Further, the touch panel 1131 may cover the display panel 1141; after detecting a touch operation on or near it, the touch panel 1131 transmits the operation to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides the corresponding visual output on the display panel 1141 according to the type of the touch event. Although in Fig. 11 the touch panel 1131 and the display panel 1141 implement the input and output functions of the mobile phone as two independent components, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1150, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally on three axes), can detect the magnitude and direction of gravity when static, and can be used for applications that recognize the mobile phone's posture (such as landscape/portrait switching, related games, magnetometer pose calibration), vibration-recognition-related functions (such as a pedometer, tapping), and so on; other sensors that may also be configured in the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, will not be described here.
The audio circuit 1160, a loudspeaker 1161, and a microphone 1162 may provide an audio interface between the user and the mobile phone. The audio circuit 1160 may transfer the electrical signal converted from received audio data to the loudspeaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts the collected sound signal into an electrical signal, which the audio circuit 1160 receives and converts into audio data; the audio data is then output to the processor 1180 for processing and sent via the RF circuit 1110 to, for example, another mobile phone, or output to the memory 1120 for further processing.
WiFi belongs to short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although Fig. 11 shows the WiFi module 1170, it can be understood that the module is not an essential component of the mobile phone and may be omitted as needed within the scope that does not change the essence of the invention.
The processor 1180 is the control center of the mobile phone; it connects all parts of the whole mobile phone using various interfaces and lines, and executes the various functions and data processing of the mobile phone by running or executing the software programs and/or modules stored in the memory 1120 and calling the data stored in the memory 1120, thereby monitoring the mobile phone as a whole. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 1180.
The mobile phone further includes a power supply 1190 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 1180 through a power management system, so as to implement functions such as managing charging, discharging, and power consumption through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and so on, which will not be described here.
In this embodiment, the processor 1180 included in the terminal device also has the following functions:
obtain multiple speech segments cut from the same audio stream according to silent segments;
perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments containing separators added according to text semantics;
when determining subtitles according to the text corresponding to a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text corresponding to the target speech segment;
determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
use the caption text as the subtitle for the corresponding audio-stream timeline interval.
The embodiments of the present application further provide a computer-readable storage medium for storing program code, the program code being used to execute any one implementation of the subtitle generation method described in the foregoing embodiments.
The terms "first", "second", "third", "fourth", etc. (if any) in the description of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so termed are interchangeable under appropriate circumstances, so that the embodiments of the present application described herein can be implemented, for example, in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" refers to one or more, and "multiple" refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, "A and/or B" may indicate the three cases of only A existing, only B existing, and both A and B existing, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary; for example, the division of the units is only a logical function division, and other division manners are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (13)
1. A subtitle generation method based on artificial intelligence, characterized in that the method comprises:
obtaining multiple speech segments cut from the same audio stream according to silent segments;
performing speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments;
determining the time length of the silent segments between the multiple speech segments;
determining, in the order of the audio-stream timeline and starting from a target speech segment among the multiple speech segments, whether the time length of each silent segment is greater than a preset duration;
if the time length of a target silent segment is determined to be greater than the preset duration, adding the text corresponding to the speech segments located between the target silent segment and the target speech segment to a to-be-processed text group, the to-be-processed text group including at least the text corresponding to the target speech segment;
determining caption text from the to-be-processed text group according to separators in the to-be-processed text group, the separators in the to-be-processed text group being added according to text semantics;
using the caption text as the subtitle for the corresponding audio-stream timeline interval.
2. The method according to claim 1, characterized in that the method further comprises:
judging whether the character count of the to-be-processed text group is greater than a preset quantity, the preset quantity being determined according to the display subtitle length;
if so, executing the step of determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
if not, determining the to-be-processed text group as the caption text.
3. The method according to claim 1, characterized in that determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group comprises:
determining as the caption text the portion of the to-be-processed text group from the first character to the last separator; or,
determining as the caption text the portion of the to-be-processed text group from the first character to the last separator within the first preset-quantity characters, the preset quantity being determined according to the display subtitle length.
4. The method according to claim 2, wherein, if the character count of the text group to be processed is judged to exceed the preset count and the text group to be processed contains no separator, the method further comprises:
judging whether the character count of the text group to be processed exceeds a maximum count, the maximum count being the character count corresponding to the maximum display subtitle length;
if so, determining the first maximum-count characters of the text group to be processed as the caption text;
if not, determining the text group to be processed as the caption text.
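Claim 4's fallback for separator-free groups reduces to a hard truncation at the longest displayable subtitle. A sketch with an assumed illustrative maximum count:

```python
MAX_COUNT = 30  # illustrative: character count of the longest displayable subtitle

def caption_without_separator(text_group: str) -> str:
    """With no separator to cut at, keep the whole group if it fits the
    longest displayable subtitle, otherwise truncate to the maximum count."""
    return text_group[:MAX_COUNT] if len(text_group) > MAX_COUNT else text_group
```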
5. The method according to claim 1, wherein, after determining a caption text from the text group to be processed according to the separators in the text group to be processed, the method further comprises:
determining the relative start time of the first character of the caption text within its corresponding speech segment;
determining the start time of the audio stream timeline interval corresponding to the caption text according to the relative start time and the time offset, on the audio stream timeline, of the speech segment corresponding to the first character;
determining the relative end time of the last character of the caption text within its corresponding speech segment;
determining the end time of the audio stream timeline interval corresponding to the caption text according to the relative end time and the time offset, on the audio stream timeline, of the speech segment corresponding to the last character.
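The timestamp arithmetic of claim 5 amounts to adding each boundary character's in-segment relative time to its segment's offset on the audio stream timeline. A minimal sketch, all times in seconds and the example values hypothetical:

```python
def caption_interval(first_char_rel_start: float, first_seg_offset: float,
                     last_char_rel_end: float, last_seg_offset: float):
    """Absolute [start, end] of the caption on the audio stream timeline."""
    start = first_seg_offset + first_char_rel_start  # claim 5, start time
    end = last_seg_offset + last_char_rel_end        # claim 5, end time
    return start, end
```

For instance, a caption whose first character begins 0.5 s into a segment offset at 10 s starts at 10.5 s on the timeline.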
6. The method according to claim 1, wherein the method further comprises:
translating the caption text according to the subtitle display language to obtain a translated caption text;
wherein using the caption text as the subtitle for the corresponding audio stream timeline interval comprises:
using the translated caption text as the subtitle for the corresponding audio stream timeline interval.
7. An artificial-intelligence-based subtitle generating apparatus, wherein the apparatus comprises an acquiring unit, a recognition unit, a first determining unit, a second determining unit, a third determining unit and a generating unit:
the acquiring unit is configured to acquire multiple speech segments cut from the same audio stream according to silent clips;
the recognition unit is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments;
the third determining unit is configured to determine the time lengths of the silent clips between the multiple speech segments;
the first determining unit is configured to determine, in the order of the audio stream timeline and starting from a target speech segment, whether the time length of each silent clip exceeds a preset duration, and, if the time length of a target silent clip is determined to exceed the preset duration, add the text corresponding to the speech segments between the target silent clip and the target speech segment to a text group to be processed, the text group to be processed at least comprising the text of the target speech segment;
the second determining unit is configured to determine a caption text from the text group to be processed according to separators in the text group to be processed, the separators in the text group to be processed being added according to text semantics;
the generating unit is configured to use the caption text as the subtitle for the corresponding audio stream timeline interval.
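One way to read the grouping logic of the first determining unit: accumulate recognized texts until a silent clip longer than the preset duration is reached. A sketch under that assumption; the segment representation and the threshold value are illustrative, not from the claim:

```python
PRESET_DURATION = 0.8  # illustrative silence threshold in seconds

def build_text_groups(segments):
    """`segments` is a list of (recognized_text, silence_after) pairs, where
    `silence_after` is the length of the silent clip following each speech
    segment. Texts accumulate until a silence exceeds the preset duration,
    at which point they form one text group to be processed."""
    groups, current = [], []
    for text, silence_after in segments:
        current.append(text)
        if silence_after > PRESET_DURATION:
            groups.append("".join(current))  # long silence closes the group
            current = []
    if current:
        groups.append("".join(current))  # trailing texts not yet closed
    return groups
```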
8. The apparatus according to claim 7, wherein the apparatus further comprises a first judging unit and a fourth determining unit:
the first judging unit is configured to judge whether the character count of the text group to be processed exceeds a preset count, the preset count being determined according to the display subtitle length, and, if the first judging unit judges that the character count of the text group to be processed exceeds the preset count, to trigger the second determining unit to execute the step of determining a caption text from the text group to be processed according to the separators in the text group to be processed;
the fourth determining unit is configured to, if the first judging unit judges that the character count of the text group to be processed does not exceed the preset count, determine the text group to be processed as the caption text.
9. The apparatus according to claim 7, wherein the second determining unit is specifically configured to:
determine, as the caption text, the part of the text group to be processed from the first character to the last separator; or
determine, as the caption text, the part of the text group to be processed from the first character to the last separator within the first preset count of characters, the preset count being determined according to the display subtitle length.
10. The apparatus according to claim 8, wherein, if the first judging unit judges that the character count of the text group to be processed exceeds the preset count and the text group to be processed contains no separator, the apparatus further comprises a second judging unit and a fifth determining unit:
the second judging unit is configured to judge whether the character count of the text group to be processed exceeds a maximum count, the maximum count being the character count corresponding to the maximum display subtitle length;
the fifth determining unit is configured to, if the second judging unit judges that the character count of the text group to be processed exceeds the maximum count, determine the first maximum-count characters of the text group to be processed as the caption text;
if the second judging unit judges that the character count of the text group to be processed does not exceed the maximum count, the fourth determining unit is triggered to execute the step of determining the text group to be processed as the caption text.
11. The apparatus according to claim 7, wherein the apparatus further comprises a sixth determining unit, a seventh determining unit, an eighth determining unit and a ninth determining unit:
the sixth determining unit is configured to determine the relative start time of the first character of the caption text within its corresponding speech segment;
the seventh determining unit is configured to determine the start time of the audio stream timeline interval corresponding to the caption text according to the relative start time and the time offset, on the audio stream timeline, of the speech segment corresponding to the first character;
the eighth determining unit is configured to determine the relative end time of the last character of the caption text within its corresponding speech segment;
the ninth determining unit is configured to determine the end time of the audio stream timeline interval corresponding to the caption text according to the relative end time and the time offset, on the audio stream timeline, of the speech segment corresponding to the last character.
12. An artificial-intelligence-based subtitle generating device, wherein the device comprises a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor;
the processor is configured to execute, according to instructions in the program code, the artificial-intelligence-based subtitle generating method according to any one of claims 1-6.
13. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store program code, and the program code is configured to execute the artificial-intelligence-based subtitle generating method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Division CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110381389A true CN110381389A (en) | 2019-10-25 |
CN110381389B CN110381389B (en) | 2022-02-25 |
Family
ID=65389096
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
CN201910740413.6A Active CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN109379641B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639233A (en) * | 2020-05-06 | 2020-09-08 | 广东小天才科技有限公司 | Learning video subtitle adding method and device, terminal equipment and storage medium |
CN111916053A (en) * | 2020-08-17 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112188241A (en) * | 2020-10-09 | 2021-01-05 | 上海网达软件股份有限公司 | Method and system for real-time subtitle generation of live stream |
CN112686018A (en) * | 2020-12-23 | 2021-04-20 | 科大讯飞股份有限公司 | Text segmentation method, device, equipment and storage medium |
CN112750425A (en) * | 2020-01-22 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and computer readable storage medium |
CN113225618A (en) * | 2021-05-06 | 2021-08-06 | 阿里巴巴新加坡控股有限公司 | Video editing method and device |
CN113596579A (en) * | 2021-07-29 | 2021-11-02 | 北京字节跳动网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113660432A (en) * | 2021-08-17 | 2021-11-16 | 安徽听见科技有限公司 | Translation subtitle production method and device, electronic equipment and storage medium |
WO2022037383A1 (en) * | 2020-08-17 | 2022-02-24 | 北京字节跳动网络技术有限公司 | Voice processing method and apparatus, electronic device, and computer readable medium |
CN114420125A (en) * | 2020-10-12 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Audio processing method, device, electronic equipment and medium |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN112037768A (en) * | 2019-05-14 | 2020-12-04 | 北京三星通信技术研究有限公司 | Voice translation method and device, electronic equipment and computer readable storage medium |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110648653A (en) * | 2019-09-27 | 2020-01-03 | 安徽咪鼠科技有限公司 | Subtitle realization method, device and system based on intelligent voice mouse and storage medium |
CN110933485A (en) * | 2019-10-21 | 2020-03-27 | 天脉聚源(杭州)传媒科技有限公司 | Video subtitle generating method, system, device and storage medium |
CN110660393B (en) * | 2019-10-31 | 2021-12-03 | 广东美的制冷设备有限公司 | Voice interaction method, device, equipment and storage medium |
CN110992960A (en) * | 2019-12-18 | 2020-04-10 | Oppo广东移动通信有限公司 | Control method, control device, electronic equipment and storage medium |
CN111353038A (en) * | 2020-05-25 | 2020-06-30 | 深圳市友杰智新科技有限公司 | Data display method and device, computer equipment and storage medium |
CN111832279B (en) * | 2020-07-09 | 2023-12-05 | 抖音视界有限公司 | Text partitioning method, apparatus, device and computer readable medium |
CN111986654B (en) * | 2020-08-04 | 2024-01-19 | 云知声智能科技股份有限公司 | Method and system for reducing delay of voice recognition system |
CN112272277B (en) * | 2020-10-23 | 2023-07-18 | 岭东核电有限公司 | Voice adding method and device in nuclear power test and computer equipment |
CN113886612A (en) * | 2020-11-18 | 2022-01-04 | 北京字跳网络技术有限公司 | Multimedia browsing method, device, equipment and medium |
CN113099292A (en) * | 2021-04-21 | 2021-07-09 | 湖南快乐阳光互动娱乐传媒有限公司 | Multi-language subtitle generating method and device based on video |
CN112995736A (en) * | 2021-04-22 | 2021-06-18 | 南京亿铭科技有限公司 | Speech subtitle synthesis method, apparatus, computer device, and storage medium |
CN113938758A (en) * | 2021-12-08 | 2022-01-14 | 沈阳开放大学 | Method for quickly adding subtitles in video editor |
CN114268829B (en) * | 2021-12-22 | 2024-01-16 | 中电金信软件有限公司 | Video processing method, video processing device, electronic equipment and computer readable storage medium |
CN114554238B (en) * | 2022-02-23 | 2023-08-11 | 北京有竹居网络技术有限公司 | Live broadcast voice simultaneous transmission method, device, medium and electronic equipment |
CN115831120B (en) * | 2023-02-03 | 2023-06-16 | 北京探境科技有限公司 | Corpus data acquisition method and device, electronic equipment and readable storage medium |
CN116471436B (en) * | 2023-04-12 | 2024-05-31 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
CN116612781B (en) * | 2023-07-20 | 2023-09-29 | 深圳市亿晟科技有限公司 | Visual processing method, device and equipment for audio data and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143331A (en) * | 2013-05-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and system for adding punctuations |
CN105244022A (en) * | 2015-09-28 | 2016-01-13 | 科大讯飞股份有限公司 | Audio and video subtitle generation method and apparatus |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN106331893A (en) * | 2016-08-31 | 2017-01-11 | 科大讯飞股份有限公司 | Real-time subtitle display method and system |
CN107632980A (en) * | 2017-08-03 | 2018-01-26 | 北京搜狗科技发展有限公司 | Voice translation method and device, the device for voiced translation |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6697564B1 (en) * | 2000-03-03 | 2004-02-24 | Siemens Corporate Research, Inc. | Method and system for video browsing and editing by employing audio |
KR100521914B1 (en) * | 2002-04-24 | 2005-10-13 | 엘지전자 주식회사 | Method for managing a summary of playlist information |
AU2003241205B2 (en) * | 2002-06-24 | 2009-03-26 | Lg Electronics Inc. | Recording medium having data structure for managing reproduction of multiple title video data recorded thereon and recording and reproducing methods and apparatuses |
CN100547670C (en) * | 2004-03-17 | 2009-10-07 | Lg电子株式会社 | Be used to reproduce recording medium, the method and apparatus of text subtitle stream |
WO2010105245A2 (en) * | 2009-03-12 | 2010-09-16 | Exbiblio B.V. | Automatically providing content associated with captured information, such as information captured in real-time |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
US9898773B2 (en) * | 2014-11-18 | 2018-02-20 | Microsoft Technology Licensing, Llc | Multilingual content based recommendation system |
CN106878805A (en) * | 2017-02-06 | 2017-06-20 | 广东小天才科技有限公司 | A kind of mixed languages subtitle file generation method and device |
CN110444197B (en) * | 2018-05-10 | 2023-01-03 | 腾讯科技(北京)有限公司 | Data processing method, device and system based on simultaneous interpretation and storage medium |
2018
- 2018-11-14 CN CN201811355311.4A patent/CN109379641B/en active Active
- 2018-11-14 CN CN201910740413.6A patent/CN110381389B/en active Active
- 2018-11-14 CN CN201910741161.9A patent/CN110418208B/en active Active
- 2018-11-14 CN CN201910740405.1A patent/CN110381388B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110381388B (en) | 2021-04-13 |
CN109379641B (en) | 2022-06-03 |
CN110381388A (en) | 2019-10-25 |
CN110418208A (en) | 2019-11-05 |
CN110418208B (en) | 2021-07-27 |
CN110381389B (en) | 2022-02-25 |
CN109379641A (en) | 2019-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110381389A (en) | A kind of method for generating captions and device based on artificial intelligence | |
JP7312853B2 (en) | AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM | |
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN108304846B (en) | Image recognition method, device and storage medium | |
CN110544488B (en) | Method and device for separating multi-person voice | |
CN108287739A (en) | A kind of guiding method of operating and mobile terminal | |
CN112040263A (en) | Video processing method, video playing method, video processing device, video playing device, storage medium and equipment | |
CN110570840B (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN109063583A (en) | A kind of learning method and electronic equipment based on read operation | |
CN107040452B (en) | Information processing method and device and computer readable storage medium | |
CN110322760B (en) | Voice data generation method, device, terminal and storage medium | |
CN108735216A (en) | A kind of voice based on semantics recognition searches topic method and private tutor's equipment | |
CN111538456A (en) | Human-computer interaction method, device, terminal and storage medium based on virtual image | |
WO2016119165A1 (en) | Chat history display method and apparatus | |
CN109462768A (en) | A kind of caption presentation method and terminal device | |
CN110033769A (en) | A kind of typing method of speech processing, terminal and computer readable storage medium | |
CN110430475A (en) | A kind of interactive approach and relevant apparatus | |
CN110471589A (en) | Information display method and terminal device | |
CN109634550A (en) | A kind of voice operating control method and terminal device | |
CN111816168A (en) | Model training method, voice playing method, device and storage medium | |
CN111639209A (en) | Book content searching method, terminal device and storage medium | |
CN110333803A (en) | A kind of multimedia object selection method and terminal device | |
CN109325219A (en) | A kind of method, apparatus and system generating recording documents | |
CN109725798A (en) | The switching method and relevant apparatus of Autonomous role | |
CN108255389A (en) | Image edit method, mobile terminal and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||