CN110381389A - Artificial-intelligence-based caption generation method and device - Google Patents
Artificial-intelligence-based caption generation method and device
- Publication number
- CN110381389A, CN201910740413.6A
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- group
- caption text
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Computer Security & Cryptography (AREA)
- Studio Circuits (AREA)
Abstract
The embodiments of this application disclose an artificial-intelligence-based caption generation method and device, involving at least the speech processing and natural language processing technologies of artificial intelligence. For multiple speech segments, the text corresponding to each segment is obtained through speech recognition, and the duration of each silent segment is determined. Following the order of the audio stream's time axis, starting from a target speech segment, it is successively determined whether the duration of each silent segment exceeds a preset duration; the text corresponding to the speech segments between the target speech segment and the first silent segment whose duration exceeds the preset duration is added to a to-be-processed text group, and the separators in the to-be-processed text group serve as the basis for determining the caption text. Because the text portions between separators form complete sentences with coherent semantics, and the preset duration makes it possible to determine whether a silent segment is a pause between sentences, the likelihood of incomplete sentences appearing in the caption text is reduced, which helps users who watch the audio/video understand its content.
Description
This application is a divisional application of the Chinese patent application with application No. 201811355311.4, filed on November 14, 2018 and entitled "A caption generation method and device".
Technical field
This application relates to the field of audio processing, and in particular to an artificial-intelligence-based caption generation method and device.
Background art
When watching audio/video such as a network live broadcast or a film, a user can understand the audio/video content through the subtitles displayed on the picture.
In the traditional way of generating audio/video subtitles, the audio stream is processed mainly according to silent segments to generate subtitles. A silent segment can be a segment of the audio/video's audio stream that contains no speech; the audio stream is cut into multiple speech segments according to the silent segments, and for any one speech segment, the subtitle for that segment can be generated from the text corresponding to its speech.
However, because the traditional approach cuts the audio stream according to only this single audio signal feature, it is difficult to distinguish a speaker's pause within a sentence from a pause between sentences, so inappropriate speech segments are often cut out. The subtitles generated from them then contain incomplete sentences, which not only make it difficult to help the user understand the audio/video content, but may even mislead the user and cause a poor experience.
Summary of the invention
In order to solve the above technical problem, this application provides a caption generation method and device. By determining the caption text through separators, the likelihood of incomplete sentences appearing in the caption text is greatly reduced; when the caption text is displayed as the subtitle for the corresponding interval of the audio stream time axis, it helps users who watch the audio/video understand its content and improves the user experience.
The embodiments of this application disclose the following technical solutions:
In a first aspect, an embodiment of this application provides a caption generation method, the method including:
obtaining multiple speech segments cut from the same audio stream according to silent segments;
performing speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments including separators added according to text semantics;
when determining a subtitle according to the text corresponding to a target speech segment among the multiple speech segments, determining a to-be-processed text group, the to-be-processed text group including at least the text corresponding to the target speech segment;
determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group; and
using the caption text as the subtitle for the corresponding interval of the audio stream time axis.
In a second aspect, an embodiment of this application provides a caption generation device, the device including an acquiring unit, a recognition unit, a first determination unit, a second determination unit, and a generation unit:
the acquiring unit is configured to obtain multiple speech segments cut from the same audio stream according to silent segments;
the recognition unit is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments including separators added according to text semantics;
the first determination unit is configured to determine, when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments, a to-be-processed text group, the to-be-processed text group including at least the text of the target speech segment;
the second determination unit is configured to determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group; and
the generation unit is configured to use the caption text as the subtitle for the corresponding interval of the audio stream time axis.
In a third aspect, an embodiment of this application provides a device for caption generation, the device including a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the caption generation method of any item of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium, the computer-readable storage medium being configured to store program code, the program code being configured to execute the caption generation method of any item of the first aspect.
It can be seen from the above technical solutions that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, that text including separators added according to text semantics. When a subtitle is determined according to the text corresponding to a target speech segment among them, a to-be-processed text group for generating the subtitle is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in the to-be-processed text group serve as the basis for determining caption text from that group. Because the separators in the to-be-processed text group were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators form complete sentences with coherent semantics; the likelihood of incomplete sentences appearing in the caption text determined through the separators is therefore greatly reduced. When this caption text is displayed as the subtitle for the corresponding interval of the audio stream time axis, it helps users who watch the audio/video understand its content and improves the user experience.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of this application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the caption generation method provided by an embodiment of this application;
Fig. 2 is a schematic diagram of the relationship between an audio stream, silent segments, and speech segments provided by an embodiment of this application;
Fig. 3 is a flowchart of a caption generation method provided by an embodiment of this application;
Fig. 4 is a flowchart of a method for determining a to-be-processed text group provided by an embodiment of this application;
Fig. 5 is a flowchart of a method for generating, from the caption text, the subtitle for the corresponding interval of the audio stream time axis, provided by an embodiment of this application;
Fig. 6 is an example diagram of determining the audio stream time axis interval corresponding to the caption text, provided by an embodiment of this application;
Fig. 7 is a flowchart of a caption generation method provided by an embodiment of this application;
Fig. 8 is a structural flowchart of caption generation provided by an embodiment of this application;
Fig. 9a is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9b is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9c is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9d is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 9e is a structure diagram of a caption generation device provided by an embodiment of this application;
Fig. 10 is a structure diagram of a device for caption generation provided by an embodiment of this application;
Fig. 11 is a structure diagram of a device for caption generation provided by an embodiment of this application.
Detailed description of embodiments
In order to make those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
In traditional caption generation methods, the audio stream is processed mainly according to silent segments to generate subtitles. A silent segment can, to some extent, reflect the user's pause between sentences, but different users have different speaking habits, and some users also pause within a sentence. Take the sentence "In this sunny day, two children are playing hide-and-seek" as an example, where a space marks an in-sentence pause: because of the user's speaking habits, or because the user needs to think while expressing the sentence, a pause occurs between "In this sunny" and "day, two children are playing hide-and-seek".
If the audio stream is cut by silent segments alone, the audio containing "In this sunny" may be cut into one speech segment and the audio containing "day, two children are playing hide-and-seek" into another. With one subtitle corresponding to each speech segment, "In this sunny" becomes one subtitle and "day, two children are playing hide-and-seek" another, so the generated subtitles contain incomplete sentences. When the subtitles are displayed, the user first sees the subtitle "In this sunny" and only afterwards the subtitle "day, two children are playing hide-and-seek", which may affect the user's understanding and cause a poor experience.
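The failure mode just described can be reproduced in a couple of lines: with silence-only cutting, each speech segment becomes its own subtitle, so an in-sentence pause yields an incomplete first subtitle. The segment texts below are the translated example from this paragraph; the function is an illustration of the traditional behaviour, not any real system's code.

```python
# Silence-only captioning: one subtitle per speech segment, as in the
# traditional approach criticized above. Segment texts reproduce the
# example in this paragraph.
def naive_captions(segment_texts):
    return [text.strip() for text in segment_texts]

subtitles = naive_captions(
    ["In this sunny", "day, two children are playing hide-and-seek"])
# The first subtitle, "In this sunny", is an incomplete sentence.
```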
For this reason, an embodiment of this application provides a caption generation method. On the basis of cutting the audio stream into multiple speech segments according to silent segments, the method takes separators as the basis for determining the caption text. Because separators are added on the basis of semantics when the text in the speech segments is recognized, the text portions between separators form complete sentences with coherent semantics. Determining the caption text from the to-be-processed text group through separators therefore greatly reduces the likelihood of incomplete sentences appearing in the caption text, and the displayed subtitles help users who watch the audio/video understand its content, improving the user experience.
The caption generation method provided by the embodiments of this application is realized on the basis of artificial intelligence. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, special-purpose artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The artificial intelligence software technologies mainly involved in the embodiments of this application include the above speech processing and natural language processing directions.
For example, the speech recognition technology in speech technology (Speech Technology) may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, text preprocessing and machine translation in natural language processing (NLP) may be involved, including word/sentence segmentation, part-of-speech tagging, sentence classification, translation word selection, sentence generation, part-of-speech change (word-activity), editing and outputting, and the like.
It can be understood that, compared with traditional caption generation methods, the caption generation method provided by the embodiments of this application reduces the likelihood of incomplete sentences appearing in the caption text and needs no later manual proofreading. It can therefore be applied in real-time scenarios such as live video streaming, video chat, and games; of course, the caption generation method provided by the embodiments of this application can also be applied in non-live scenarios, for example to generate subtitles for a recorded audio/video file.
The caption generation method provided by the embodiments of this application can be applied in an audio/video processing device with the capability of generating subtitles; the device can be a terminal device or a server.
The audio/video processing device can have the ability to implement automatic speech recognition (ASR), voiceprint recognition, and other speech technologies. Enabling an audio/video processing device to listen, see, and feel is the development direction of future human-computer interaction, and speech is one of the most promising modes of future human-computer interaction. In the embodiments of this application, by implementing the above speech technologies, the audio/video processing device can perform speech recognition on the acquired speech segments to obtain the text corresponding to each speech segment.
The audio/video processing device can also have the ability to implement natural language processing (NLP), an important direction in the fields of computer science and artificial intelligence. NLP studies the theories and methods that enable effective communication between humans and computers in natural language; it is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistic research. Natural language processing technologies generally include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and other technologies.
In the embodiments of this application, by implementing the above NLP technologies, the audio/video processing device can realize the process of determining the caption text from the previously determined text, and can perform functions such as translating the caption text.
If the audio/video processing device is a terminal device, the terminal device can be a smart terminal, a computer, a personal digital assistant (PDA), a tablet computer, or the like.
If the audio/video processing device is a server, the server can be an independent server or a cluster server. When the server obtains the caption text by using the caption generation method, the caption text is displayed, as the subtitle for the corresponding interval of the audio stream time axis, on the terminal device corresponding to the user, so as to display subtitles in real time during live video streaming.
In order to facilitate understanding of the technical solution of this application, the caption generation method provided by the embodiments of this application is introduced below with reference to a practical application scenario.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the caption generation method provided by an embodiment of this application. The application scenario is introduced by taking the caption generation method as applied to a server (that is, the audio/video processing device is a server) as an example. The application scenario includes a server 101; the server 101 can obtain multiple speech segments cut from the same audio stream according to silent segments, for example speech segment 1, speech segment 2, and speech segment 3 in Fig. 1. These speech segments come from the same audio stream and are obtained in the chronological order in which the speech segments were generated.
The audio stream includes the speech uttered by a person in a to-be-processed object. The to-be-processed object can be audio/video generated in a live-streaming scenario, or a determined audio/video file, such as a recorded or downloaded audio/video file; the to-be-processed object includes the audio stream. The speech uttered by a person can be a broadcaster speaking in a live-streaming scenario, or a played audio file that includes speech, such as a recording or a played song.
A speech segment can refer to a part of the audio stream that includes speech information, while a silent segment can refer to a part of the audio stream that has no speech information; a silent segment can reflect an in-sentence pause or a between-sentence pause that occurs while the user is speaking.
The relationship between the audio stream, the silent segments, and the speech segments can be as shown in Fig. 2. As can be seen from Fig. 2, for the audio stream corresponding to moments 0 to t1 on the time axis, the audio stream can be cut into multiple speech segments according to the silent segments obtained in the process of acquiring the audio stream, for example speech segment 1, speech segment 2, speech segment 3, and speech segment 4 in Fig. 2.
It should be noted that the speech segments can be cut by the server according to the silent segments while the audio stream is being obtained, or the server can directly acquire speech segments that have already been cut.
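The cutting shown in Fig. 2 can be sketched as grouping voice-activity frames: a run of speech frames forms a speech segment, and the gaps between runs are the silent segments. The frame representation and hop size below are assumptions for illustration, not the patent's mechanism.

```python
# Sketch of cutting an audio stream into speech segments at silent
# segments, as in Fig. 2. Frames are (start_time, is_speech) pairs at
# a fixed hop; this VAD-style representation is an assumption.
def cut_speech_segments(frames, hop=1):
    segments, start = [], None
    for t, is_speech in frames:
        if is_speech and start is None:
            start = t                      # a speech segment begins
        elif not is_speech and start is not None:
            segments.append((start, t))    # a silent segment ends it
            start = None
    if start is not None:                  # stream ends mid-speech
        segments.append((start, frames[-1][0] + hop))
    return segments
```

For example, frames marked speech at times 0, 1, 3, and 5 (and silent at 2 and 4) yield the segments `[(0, 2), (3, 4), (5, 6)]`.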
The server 101 performs speech recognition on the multiple acquired speech segments to obtain the text corresponding to each of the multiple speech segments; the corresponding text includes separators added according to text semantics.
Because the separators are added on the basis of semantics when the text in the speech segments is recognized, the text portions between separators form complete sentences with coherent semantics; the likelihood of incomplete sentences appearing in the caption text subsequently determined through the separators is therefore greatly reduced.
The separators may include punctuation marks and special characters; the punctuation marks may include the full stop, comma, exclamation mark, question mark, and so on, and the special characters may include the space, underscore, vertical bar, slash, and so on.
When the server 101 determines a subtitle according to the text corresponding to some speech segment among the multiple speech segments, such as a target speech segment (Fig. 1 takes speech segment 2 as the target speech segment), it determines a to-be-processed text group; the to-be-processed text group includes at least the text corresponding to the target speech segment.
It should be noted that the embodiments of this application determine the caption text on the basis of separators for each speech segment. When a subtitle is determined according to the text corresponding to the target speech segment, the target speech segment is not necessarily the first speech segment of the audio stream to be processed; the last time caption text was determined on the basis of separators, part of the text corresponding to the target speech segment may already have been used to determine caption text. The text corresponding to the target speech segment may therefore not be the full text corresponding to the target speech segment, but the remaining text of the target speech segment after the previous subtitle was generated.
Taking "On this sunny day, two children are playing hide-and-seek" as an example, the texts corresponding to the cut speech segments may be "On this sun" and "ny day, two children are playing hide-and-seek", where "," is the separator. When a subtitle text is determined on the basis of the separator ",", the part "ny day" of the second segment's text is combined with the first segment's text "On this sun" to generate the subtitle text. Therefore, when the speech segment whose text is "ny day, two children are playing hide-and-seek" serves as the target speech segment, the text corresponding to it is "two children are playing hide-and-seek", that is, only part of its text.
Of course, the text corresponding to the target speech segment may also be its full text; the embodiment of the present application does not limit this.
For example, if the target speech segment is the first speech segment of the audio stream to be processed, or if its text was not used to generate a subtitle text the last time subtitle texts were determined according to separators, then the text corresponding to the target speech segment is its full text.
It should be noted that the to-be-processed text group may include only the text corresponding to the target speech segment, or may include the texts corresponding to multiple speech segments including the target speech segment. In the latter case, the to-be-processed text group may be obtained by splicing the text corresponding to the target speech segment with the texts corresponding to one or more speech segments following it; how the to-be-processed text group is determined will be introduced later. Subtitle texts are then determined from the to-be-processed text group by means of the separators, and the subtitle of the corresponding audio-stream time-axis section is generated.
In this embodiment, the subtitle text may be recognized on the basis of the language used in the speech segment, but the subtitle is not limited to that language. The language of the subtitle may be determined according to the user's needs: it may be the language of the speech segment, another language, or several languages. For example, if the subtitle text is English, the displayed subtitle may be an English subtitle, a Chinese subtitle, or of course a Chinese-English subtitle, and so on.
Next, the subtitle generating method provided by the embodiments of the present application will be introduced with reference to the drawings.
Referring to Fig. 3, which shows a flow chart of a subtitle generating method, the method comprises:
S301: obtaining multiple speech segments that come from the same audio stream and are cut according to silent segments.
Many speech segments may be obtained by cutting according to silent segments, and they may belong to different audio streams. In this embodiment, the obtained speech segments come from the same audio stream and are obtained successively in the order in which they were generated.
S302: performing speech recognition on the multiple speech segments to obtain the texts respectively corresponding to the multiple speech segments.
When the texts corresponding to the multiple speech segments are obtained by speech recognition, separators may be added to those texts on the basis of the text semantics, so that subtitle texts can subsequently be determined by means of the separators.
S303: determining a to-be-processed text group when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments.
For the speech segments of the same audio stream, the corresponding texts need to be processed in the order in which the speech segments were generated. The text corresponding to the speech segment currently serving as the basis for the subtitle is the text corresponding to the target speech segment, and the determined to-be-processed text group includes at least the text of the target speech segment.
A speech segment may have been cut at a pause between sentences, or at a pause within a sentence. In order to reduce the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence, this embodiment provides a method for determining the to-be-processed text group.
Referring to Fig. 4, the method comprises:
S401: determining the time lengths of the silent segments between the multiple speech segments.
The time length of a silent segment can, to a certain extent, indicate whether the silent segment is a pause between sentences or a pause within a sentence. In general, the silent segment produced by a pause within a sentence has a smaller time length, while the silent segment produced by a pause between sentences has a larger time length. Therefore, according to the determined time lengths of the silent segments, it can be known which speech segments may be combined with the target speech segment to form the to-be-processed text group.
The time length of a silent segment may be determined as follows: when the speech segments are obtained, for the current speech segment, record its ending timestamp T_sil_begin and the beginning timestamp T_sil_end of the next speech segment, and successively calculate the time length T_sil of the silent segment after the current speech segment, i.e. T_sil = T_sil_end - T_sil_begin.
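Under the timestamp convention just described, computing the silent-segment lengths from recorded segment boundaries is straightforward; a hypothetical sketch (times in milliseconds, names illustrative):

```python
# Sketch of T_sil = T_sil_end - T_sil_begin for every gap between
# consecutive speech segments. `segments` holds (start, end) timestamps
# in ms on the audio-stream time axis; names are illustrative.
def silence_durations(segments):
    durations = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        durations.append(next_start - prev_end)  # silence after prev segment
    return durations
```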
S402: successively determining, starting from the target speech segment and in the order of the audio-stream time axis, whether the time length of each silent segment is greater than a preset duration.
The preset duration is determined according to how long users normally pause between sentences when speaking; according to the preset duration, it can be judged whether a silent segment is likely to be a pause between sentences or a pause within a sentence.
Referring to Fig. 2, in the order of the audio-stream time axis there are, in sequence, speech segment 1, silent segment A, speech segment 2, silent segment B, speech segment 3, silent segment C and speech segment 4, where the time length of silent segment A is T_sil-1, that of silent segment B is T_sil-2, and that of silent segment C is T_sil-3. If speech segment 1 is the target speech segment, it needs to be successively determined, starting from silent segment A, whether the time length of each silent segment is greater than the preset duration. If it is not, that silent segment may be considered a pause within a sentence, and it is then determined whether the time length of silent segment B is greater than the preset duration, and so on, until the time length of some silent segment is found to be greater than the preset duration. At that point the silent segment may be considered a pause between sentences, i.e. the texts corresponding to the speech segments before and after it may belong to two different sentences.
S403: if the time length of a target silent segment is determined to be greater than the preset duration, adding the texts corresponding to the speech segments between the target silent segment and the target speech segment to the to-be-processed text group.
In the process of successively determining whether the time lengths of the silent segments are greater than the preset duration, the silent segments before the target silent segment all have time lengths smaller than the preset duration, so the texts corresponding to the speech segments between the target silent segment and the target speech segment are likely to belong to the same sentence, and these texts can be spliced together. Once the time length of some silent segment (the target silent segment) is found to be greater than the preset duration, the step of determining whether a silent segment's time length is greater than the preset duration can stop. In order to reduce the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence, the texts corresponding to the speech segments between the target silent segment and the target speech segment are spliced to form the to-be-processed text group.
Referring to Fig. 2, if it is successively determined that T_sil-1 is less than the preset duration, T_sil-2 is less than the preset duration, and T_sil-3 is greater than the preset duration, then silent segment A and silent segment B may be pauses within a sentence, while silent segment C may be a pause between sentences. Silent segment C can be taken as the target silent segment, and the texts corresponding to speech segment 1, speech segment 2 and speech segment 3 can be spliced into the to-be-processed text group.
By means of the time lengths of the silent segments, this method successively determines whether each silent segment after the target speech segment reflects a pause within a sentence, so that the texts of speech segments cut apart by a pause within a sentence are spliced together to constitute the to-be-processed text group, reducing the possibility that a pause within a sentence causes the to-be-processed text group to include an incomplete sentence.
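The splicing logic of S401-S403 can be sketched as follows, under the assumption that the segment texts and the durations of the silences after each segment are already available (all names illustrative):

```python
# Sketch of S401-S403 under illustrative names: starting from the target
# speech segment, splice the following segments' texts while each
# intervening silent segment is no longer than `preset_duration`; stop
# at the first longer one (the target silent segment).
def build_pending_group(texts, silence_durations, preset_duration):
    # texts[0] is the target segment's text; silence_durations[i] is the
    # length (ms) of the silence after segment i.
    group = texts[0]
    for text, dur in zip(texts[1:], silence_durations):
        if dur > preset_duration:
            break  # pause between sentences: stop splicing here
        group += text
    return group
```

In the Fig. 2 example, with the first two silences below the threshold and the third above it, the texts of segments 1, 2 and 3 are spliced and segment 4 is left for the next round.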
S304: determining a subtitle text from the to-be-processed text group according to the separators in the to-be-processed text group.
Because the separators in the to-be-processed text group are added on the basis of semantics when the text in a speech segment is recognized, the text portions between separators are complete sentences that convey reasonable semantics; the possibility of incomplete sentences appearing in the subtitle text determined by means of the separators is therefore greatly reduced.
For example, the text corresponding to speech segment 1 is "On this sun" and the text corresponding to speech segment 2 is "ny day, two children are playing hide-and-seek". When speech segment 1 is the target speech segment, the to-be-processed text group determined through S303 is "On this sunny day, two children are playing hide-and-seek", where "," is the separator; according to the separator in this to-be-processed text group, "On this sunny day" can be determined as the subtitle text. When processing continues, the part "ny day" of the text corresponding to speech segment 2 has already been used in the previous round to generate a subtitle text, but the text "two children are playing hide-and-seek" remains. Therefore, when a subtitle is determined according to the text corresponding to speech segment 2 (now the target speech segment), the text corresponding to the target speech segment is the remaining part "two children are playing hide-and-seek" rather than the full text "ny day, two children are playing hide-and-seek", and S303-S305 continue to be executed for this remaining text.
Compared with the traditional approach, in which the text "On this sun" corresponding to speech segment 1 and the text "ny day, two children are playing hide-and-seek" corresponding to speech segment 2 each correspond to one subtitle text, both containing incomplete sentences, the method of the embodiment of the present application ensures that the determined subtitle texts "On this sunny day" and "two children are playing hide-and-seek" are complete sentences, thereby reducing the possibility of incomplete sentences appearing in subtitle texts.
S305: using the subtitle text as the subtitle of the corresponding audio-stream time-axis section.
When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis section, it can help users watching the audio/video to understand the audio/video content, improving the user experience.
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments that come from the same audio stream and are cut according to silent segments, speech recognition is performed on the multiple speech segments to obtain their respectively corresponding texts, which include separators added according to the text semantics. When a subtitle is determined according to the text corresponding to a target speech segment, a to-be-processed text group used for generating the subtitle is determined, including at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining a subtitle text from it. Because the separators are added on the basis of semantics when the text in a speech segment is recognized, the text portions between separators are complete sentences that convey reasonable semantics, so the possibility of incomplete sentences appearing in the subtitle text determined by means of the separators is greatly reduced. When the subtitle text is displayed as the subtitle of the corresponding audio-stream time-axis section, it can help users watching the audio/video to understand the audio/video content, improving the user experience.
The above embodiment describes the subtitle generating method. In the process of generating subtitles, the subtitle text needs to be determined from the to-be-processed text group according to the separators. Since the to-be-processed text group and the separators in it may differ from case to case (for example, the display subtitle length may also need to be considered when determining the subtitle text, as well as which separator it is appropriate to determine the subtitle text by), the way the subtitle text is determined can differ in different cases. In this embodiment, the subtitle text can be determined in different cases with reference to the following formula:

L_text = L_sil, if L_sil ≤ L_seg;
L_text = L_punc, if L_sil > L_seg and L_punc > 0;
L_text = min(L_sil, L_max), if L_sil > L_seg and L_punc = 0.

Here L_text may denote the length of the determined subtitle text; L_sil may denote the text length of the to-be-processed text group; L_seg may denote the preset quantity, determined according to the display subtitle length; L_punc may denote the text length from the first character of the to-be-processed text group to the last separator among its first preset-quantity characters, or from the first character to the last separator; L_max may denote the maximum quantity, i.e. the number of characters corresponding to the maximum display subtitle length.
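The three cases can be sketched as one selection function; the parameter names mirror the quantities L_sil, L_punc, L_seg and L_max above, and the boundary handling (strict versus non-strict comparison) is an assumption where the text leaves it open:

```python
# The three length cases as one selection function; parameter names
# mirror L_sil, L_punc, L_seg, L_max. Boundary handling (<= vs <) is an
# assumption where the text leaves it open.
def subtitle_length(l_sil, l_punc, l_seg, l_max):
    if l_sil <= l_seg:        # group already fits the display length
        return l_sil
    if l_punc > 0:            # cut at the last usable separator
        return l_punc
    return min(l_sil, l_max)  # no separator: cap at the maximum length
```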
Based on the above formula, an appropriate subtitle text can be determined in different cases. Next, the ways of determining the subtitle text from the to-be-processed text group in the different cases will be introduced one by one.
The first case may be that the text length of the to-be-processed text group is not greater than the preset quantity, i.e. L_sil ≤ L_seg; in this case, the subtitle text can be determined by the formula L_text = L_sil.
Specifically, under normal circumstances, the display subtitle length is influenced by the subtitle's font size, the display screen size, the user experience and so on, and a displayed subtitle needs to have a reasonable length, namely the display subtitle length, which can be expressed as the number of characters in the displayed subtitle. In this way, after the to-be-processed text group is obtained, it can be judged whether its character quantity is greater than the preset quantity, i.e. whether L_sil is greater than L_seg; the preset quantity is determined according to the display subtitle length and is the number of characters in a subtitle that meets the display subtitle length. If not, the to-be-processed text group can be considered to meet the display subtitle length requirement and can be directly determined as the subtitle text, i.e. L_text = L_sil.
The second case may be that the text length of the to-be-processed text group is greater than the preset quantity and separators are present, i.e. L_sil > L_seg and L_punc > 0; in this case, the subtitle text can be determined by the formula L_text = L_punc.
If the character quantity of the to-be-processed text group is judged to be greater than the preset quantity, i.e. L_sil > L_seg, the character quantity of the to-be-processed text group can be considered excessive, and the to-be-processed text group needs to be truncated so that a subtitle text meeting the display subtitle length requirement is obtained from it. If it is determined that separators exist in the to-be-processed text group, i.e. L_punc > 0, then S304 can be executed to determine the subtitle text, i.e. L_text = L_punc.
It should be noted that the way of determining the subtitle text according to the separators was briefly introduced in the embodiment corresponding to Fig. 3 (in S304). Next, how the subtitle text is determined from the to-be-processed text group according to the separators, i.e. how L_punc is determined, will be introduced.
It should be noted that there are two determination modes for determining the subtitle text from the to-be-processed text group according to the separators. The first determination mode may be: determining, as the subtitle text, the part of the to-be-processed text group from the first character to the last separator; L_punc is then the text length from the first character of the to-be-processed text group to the last separator.
For example, the to-be-processed text group is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it. But". According to the first determination mode, the last separator in the to-be-processed text group is the full stop ".", so the part from the first character to "." can serve as the subtitle text, i.e. the subtitle text is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it."
However, in some cases, in order to further ensure that the subtitle text determined from the to-be-processed text group according to the separators meets the display subtitle length requirement, the display subtitle length can also be taken into account while the subtitle text is determined. That is, the second determination mode may be: determining, as the subtitle text, the part of the to-be-processed text group from the first character to the last separator among its first preset-quantity characters, the preset quantity being determined according to the display subtitle length; L_punc is then the text length from the first character of the to-be-processed text group to the last separator among its first preset-quantity characters.
For example, the to-be-processed text group is "On a clear day, two children are playing hide-and-seek, they get a big kick out of it. But" and the preset quantity is 25. According to the second determination mode, the last separator among the first 25 characters of the to-be-processed text group is the comma after "day", so the part from the first character to that "," can serve as the subtitle text, i.e. the subtitle text is "On a clear day,". As can be seen, the subtitle text determined by the second determination mode meets the display subtitle length requirement, and the user experience is better.
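A hypothetical sketch of the two determination modes follows; the separator set is an assumption. Passing `preset_quantity=None` searches the whole group (first mode), while passing a number searches only its first `preset_quantity` characters (second mode):

```python
# Hypothetical sketch of the two determination modes for cutting the
# subtitle text at a separator. The separator set is an assumption.
SEPARATORS = set(",.!?;:")

def cut_at_separator(group, preset_quantity=None):
    window = group if preset_quantity is None else group[:preset_quantity]
    for i in range(len(window) - 1, -1, -1):  # scan backwards
        if window[i] in SEPARATORS:
            return group[:i + 1]  # keep the separator itself
    return ""  # no separator inside the window
```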
The third case may be that the text length of the to-be-processed text group is greater than the preset quantity and no separator is present, i.e. L_sil > L_seg and L_punc = 0; in this case, the subtitle text can be determined by the formula L_text = min(L_sil, L_max).
It should be noted that the premise of determining the subtitle text from the to-be-processed text group according to the separators in S304 is that the to-be-processed text group includes separators. However, in some cases the to-be-processed text group may include no separator; for example, it may be "the home address of that child in red clothes is Room 301 Unit 3 Building 2 No. 5 Zhongguancun South Street Haidian District Beijing". Next, the way of determining the subtitle text from the to-be-processed text group when its character quantity is greater than the preset quantity and it includes no separator will be introduced.
The display subtitle length is a reasonable subtitle length for display, but the subtitle length is also limited by the maximum display subtitle length. Therefore, besides determining the subtitle text by means of the display subtitle length, the subtitle text can also be determined according to the maximum display subtitle length. That the character quantity of the to-be-processed text group is greater than the preset quantity only means that it exceeds the normal display subtitle length; it does not mean the group is unacceptable, i.e. it does not mean the to-be-processed text group cannot serve as the subtitle text, as long as its character quantity does not exceed the number of characters corresponding to the maximum display subtitle length.
Specifically, when it is determined that the character quantity of the to-be-processed text group is greater than the preset quantity and the to-be-processed text group includes no separator, it can further be judged whether the character quantity of the to-be-processed text group is greater than the maximum quantity, i.e. whether L_sil is greater than L_max, where the maximum quantity L_max is the number of characters corresponding to the maximum display subtitle length. If so, the character quantity of the to-be-processed text group exceeds the acceptable maximum display subtitle length, and part of the characters needs to be cut from the to-be-processed text group as the subtitle text; for example, the first maximum-quantity characters of the to-be-processed text group can be determined as the subtitle text. If not, the character quantity of the to-be-processed text group is within the acceptable maximum display subtitle length, and the to-be-processed text group can be directly determined as the subtitle text. In this way, the text with the smaller length is determined from the to-be-processed text group as the subtitle text, i.e. L_text = min(L_sil, L_max).
For example, the to-be-processed text group is "the home address of that child in red clothes is Room 301 Unit 3 Building 2 No. 5 Zhongguancun South Street Haidian District Beijing" and the maximum quantity is 30. If the character quantity of the to-be-processed text group is 43, then it is greater than the maximum quantity 30, so the first 30 characters of the to-be-processed text group can be determined as the subtitle text.
For another example, the to-be-processed text group may be "the home address of that child is No. 5 Zhongguancun South Street Haidian District Beijing" and the maximum quantity 30. If the character quantity of the to-be-processed text group is 26, then it is less than the maximum quantity 30, so the to-be-processed text group can be directly determined as the subtitle text.
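The no-separator branch then reduces to a cap at the maximum quantity; a minimal sketch with illustrative names:

```python
# Minimal sketch of the no-separator branch: keep the whole group if it
# fits within the maximum quantity, otherwise keep only its first
# max_quantity characters.
def cap_at_max(group, max_quantity):
    return group if len(group) <= max_quantity else group[:max_quantity]
```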
The purpose of determining the subtitle text is to generate a subtitle for the corresponding audio stream. Next, how the subtitle of the corresponding audio-stream time-axis section is generated according to the subtitle text will be introduced.
It should be noted that in traditional subtitle generating methods the subtitle text is determined only by cutting at silent segments, and the subtitle of the corresponding audio-stream time-axis section is then generated according to the subtitle text, so only the time offset of each speech segment needs to be recorded at cutting time. In the embodiment of the present application, however, since the subtitle text is determined from the to-be-processed text group according to the separators, the text may be repartitioned at the separators, and relying only on the time offsets of the speech segments can hardly guarantee the accuracy of the moments to which the determined subtitle text corresponds on the time axis. Therefore, this embodiment provides a method for generating the subtitle of the corresponding audio-stream time-axis section according to the subtitle text. Referring to Fig. 5, the method comprises:
S501: determining the relative start moment, within its corresponding speech segment, of the first character of the subtitle text.
S502: determining the start moment of the audio-stream time-axis section corresponding to the subtitle text according to the relative start moment and the time offset, on the audio-stream time axis, of the speech segment corresponding to the first character.
S503: determining the relative end moment, within its corresponding speech segment, of the last character of the subtitle text.
S504: determining the end moment of the audio-stream time-axis section corresponding to the subtitle text according to the relative end moment and the time offset, on the audio-stream time axis, of the speech segment corresponding to the last character.
In this way, with the start moment and the end moment of the audio-stream time-axis section corresponding to the subtitle text, the subtitle of that section can be generated according to the subtitle text.
It can be understood that, when the subtitle text is repartitioned according to the separators, the relative start moment and relative end moment of each character in the subtitle text can be determined by the speech recognition engine. For example, for Word_1 in the subtitle text, it can be determined that its relative start moment start is 500 ms and its relative end moment end is 750 ms, and so on.
Referring to Fig. 6, suppose the determined subtitle text lies between A and B in the figure, where the position of A corresponds to the first character of the subtitle text and the position of B corresponds to the last character. The start moment of the audio-stream time-axis section corresponding to the subtitle text is the moment corresponding to position A, and the end moment of that section is the moment corresponding to position B.
As can be seen from Fig. 6, the relative start moment of the first character within its corresponding speech segment is t1, the speech segment corresponding to the first character is speech segment 2, and its time offset on the audio-stream time axis is t2. Thus, according to the relative start moment t1 and the time offset t2 of speech segment 2 on the audio-stream time axis, the start moment of the audio-stream time-axis section corresponding to the subtitle text can be determined to be t1+t2. The relative end moment of the last character within its corresponding speech segment is t3, the speech segment corresponding to the last character is speech segment 3, and its time offset on the audio-stream time axis is t4. Thus, according to the relative end moment t3 and the time offset t4 of speech segment 3 on the audio-stream time axis, the end moment of the audio-stream time-axis section corresponding to the subtitle text can be determined to be t3+t4.
On the basis of the time offsets of the speech segments, this method also combines the relative moments of the characters within their corresponding speech segments, so as to guarantee the accuracy of the moments to which the determined subtitle text corresponds on the time axis.
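The computation of S501-S504 can be sketched in a few lines; all times are assumed to be in milliseconds and the names are illustrative:

```python
# Sketch of S501-S504: the absolute start/end of the subtitle's
# time-axis section is a character's relative moment inside its speech
# segment plus that segment's time offset on the audio-stream time axis.
def subtitle_interval(rel_start, start_seg_offset, rel_end, end_seg_offset):
    start = start_seg_offset + rel_start  # t2 + t1 in Fig. 6
    end = end_seg_offset + rel_end        # t4 + t3 in Fig. 6
    return start, end
```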
It can be understood that in many cases the language of the speech segments in the audio/video is not the language the user uses every day. In order to help the user watching the audio/video understand its content, the subtitle text serving as the subtitle should be expressed in the user's everyday language. Therefore, in this embodiment, the subtitle text determined in S304 can also be translated according to the subtitle display language to obtain a translated subtitle text, and the translated subtitle text is used as the subtitle of the corresponding audio-stream time-axis section.
The subtitle display language may include Chinese, bilingual Chinese-English, English and so on, and can be set by the user according to the user's own needs. For example, if the language of the speech segments in the audio/video is English and the user's language is Chinese, the subtitle display language can be Chinese. In this way, the subtitle text whose language is English can be translated into a subtitle text whose language is Chinese; the Chinese subtitle text serves as the translated subtitle text and is used as the subtitle of the corresponding audio-stream time-axis section, making it convenient for the user to understand the audio/video content.
Next, the subtitle generating method provided by the embodiments of the present application will be introduced with reference to a concrete scenario: a live video broadcast of a speaker. Suppose the speaker gives the speech in English; then, in order to help the audience watching the live broadcast understand the speech content, subtitles need to be generated for the speech in real time. To facilitate the audience's study and understanding, the generated subtitles can be bilingual Chinese-English. In this scenario, referring to Fig. 7, the subtitle generating method comprises:
S701: obtaining multiple speech segments that come from the same audio stream and are cut according to silent segments.
S702: determining the time lengths of the silent segments between the multiple speech segments.
S703: performing speech recognition on the multiple speech segments to obtain the texts respectively corresponding to the multiple speech segments.
S704: when a subtitle is determined according to the text corresponding to a target speech segment among the multiple speech segments, successively determining, starting from the target speech segment and in the order of the audio-stream time axis, whether the time length of each silent segment is greater than a preset duration; if so, executing S705; if not, executing S704 for the next silent segment.
S705: add the text corresponding to the speech segments located between the target silent segment and the target speech segment to the to-be-processed text group.
S706: judge whether the character count of the to-be-processed text group is greater than a preset quantity; if so, execute S707; if not, execute S711.
S707: determine whether the to-be-processed text group contains a separator; if so, execute S708; if not, execute S709.
S708: determine the caption text from the to-be-processed text group according to the separators in the to-be-processed text group.
S709: judge whether the character count of the to-be-processed text group is greater than a maximum quantity; if so, execute S710; if not, execute S711.
S710: determine the first maximum-quantity characters of the to-be-processed text group as the caption text.
S711: determine the to-be-processed text group as the caption text.
S712: translate the caption text by machine translation.
S713: use the caption text and the translated caption text as the subtitle for the corresponding audio-stream timeline interval.
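For illustration only, the S706-S711 decisions (character-count check, separator-based split, maximum-length truncation) might be sketched as follows. The separator set, the preset quantity, and the maximum quantity are assumptions for the example; in the embodiment they depend on the display subtitle length, and the separators are those added by the speech-recognition step according to text semantics.

```python
# Minimal sketch of the S706-S711 caption-splitting decisions.
# SEPARATORS, PRESET_QUANTITY and MAX_QUANTITY are illustrative assumptions.
SEPARATORS = set("，。！？,.!?")
PRESET_QUANTITY = 20   # characters fitting the display subtitle length (assumed)
MAX_QUANTITY = 35      # characters for the maximum display subtitle length (assumed)

def determine_caption_text(pending_text: str) -> str:
    """Return the caption text taken from the to-be-processed text group."""
    if len(pending_text) <= PRESET_QUANTITY:            # S706 -> S711
        return pending_text
    # S707: look for the last separator within the first PRESET_QUANTITY characters
    head = pending_text[:PRESET_QUANTITY]
    last_sep = max((i for i, ch in enumerate(head) if ch in SEPARATORS),
                   default=-1)
    if last_sep >= 0:                                   # S708: cut at separator
        return pending_text[:last_sep + 1]
    if len(pending_text) > MAX_QUANTITY:                # S709 -> S710: truncate
        return pending_text[:MAX_QUANTITY]
    return pending_text                                 # S711
```

The remainder of the to-be-processed text group (everything after the returned caption text) would stay in the group and be carried into the next round of S704.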
In this application scenario, the architecture and flow of subtitle generation may be as shown in Fig. 8. Speech segment 1 ... speech segment 4, etc. are obtained by cutting based on silent segments, corresponding to S701 in Fig. 7; the text corresponding to the speech segments is then divided based on a combination of silent segments and semantics to obtain caption text, corresponding to S702-S711 in Fig. 7; the caption text is translated by machine translation to obtain the translated caption text (for example, machine translation of caption text 1 yields caption text 1', etc.), corresponding to S712 in Fig. 7; finally, the machine-translated caption text is merged with the audio-stream timeline to generate the corresponding subtitles, corresponding to S713 in Fig. 7. After the subtitles are obtained, they can be pushed and played in real time.
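The overall Fig. 8 flow might be sketched as the following pipeline. `recognize()` and `translate()` stand in for real speech-recognition and machine-translation services and are hypothetical placeholders, not APIs named by the embodiment; the per-group splitting of S702-S711 is omitted here for brevity.

```python
# Illustrative end-to-end pipeline for Fig. 8; recognize()/translate()
# are placeholder stubs, not real ASR/MT interfaces.
def recognize(segment):
    return segment["text"]           # placeholder for speech recognition

def translate(text):
    return text + " (translated)"    # placeholder for machine translation

def generate_subtitles(segments):
    """segments: list of dicts with 'text', 'start', 'end' (seconds).
    Returns (start, end, caption, translated_caption) tuples."""
    subtitles = []
    for seg in segments:                    # S701/S703: segments and their text
        caption = recognize(seg)            # S702-S711 splitting omitted
        translated = translate(caption)     # S712: machine translation
        subtitles.append((seg["start"], seg["end"], caption, translated))  # S713
    return subtitles
```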
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, and the text corresponding to the multiple speech segments contains separators added according to text semantics. When subtitles are determined according to the text corresponding to a target speech segment, a to-be-processed text group for generating subtitles is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining caption text from the group. Because these separators were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators are complete sentences that embody reasonable semantics; the possibility of incomplete sentences appearing in the caption text determined by the separators is therefore greatly reduced. When such caption text is displayed as the subtitle of the corresponding audio-stream timeline interval, it helps users watching the audio/video to understand the content, improving user experience.
Based on the subtitle generation method provided by the foregoing embodiments, this embodiment provides a subtitle generation apparatus 900. Referring to Fig. 9a, the apparatus 900 includes an acquiring unit 901, a recognition unit 902, a first determination unit 903, a second determination unit 904, and a generation unit 905:
The acquiring unit 901 is configured to obtain multiple speech segments cut from the same audio stream according to silent segments;
The recognition unit 902 is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments; the text corresponding to the multiple speech segments contains separators added according to text semantics;
The first determination unit 903 is configured to, when processing the text of a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text of the target speech segment;
The second determination unit 904 is configured to determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
The generation unit 905 is configured to use the caption text as the subtitle for the corresponding audio-stream timeline interval.
In one implementation, referring to Fig. 9b, the apparatus 900 further includes a third determination unit 906:
The third determination unit 906 is configured to determine the time length of the silent segments between the multiple speech segments;
The first determination unit 903 is specifically configured to determine, in the order of the audio-stream timeline and starting from the target speech segment, whether the time length of each silent segment is greater than a preset duration; and, if the time length of a target silent segment is determined to be greater than the preset duration, to add the text corresponding to the speech segments located between the target silent segment and the target speech segment to the to-be-processed text group.
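As an illustrative sketch of this accumulation behavior: scan the segments along the timeline starting from the target speech segment, gathering their text until a silent gap longer than the preset duration is reached. The data layout (a list of `(text, silence_after_s)` pairs) and the preset-duration value are assumptions for the example.

```python
# Sketch: accumulate segment texts along the audio-stream timeline until
# a silent gap longer than PRESET_DURATION follows a segment.
PRESET_DURATION = 0.8  # seconds (assumed)

def build_pending_text_group(segments, target_index):
    """segments: list of (text, silence_after_s) pairs in timeline order.
    Returns the concatenated text from the target speech segment up to and
    including the segment followed by the first long silent gap."""
    pending = []
    for text, silence_after in segments[target_index:]:
        pending.append(text)
        if silence_after > PRESET_DURATION:   # target silent segment reached
            break
    return "".join(pending)
```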
In one implementation, referring to Fig. 9c, the apparatus 900 further includes a first judging unit 907 and a fourth determination unit 908:
The first judging unit 907 is configured to judge whether the character count of the to-be-processed text group is greater than a preset quantity, the preset quantity being determined according to the display subtitle length;
If the first judging unit 907 judges that the character count of the to-be-processed text group is greater than the preset quantity, the second determination unit 904 is triggered to execute the step of determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
The fourth determination unit 908 is configured to determine the to-be-processed text group as the caption text if the first judging unit 907 judges that the character count of the to-be-processed text group is not greater than the preset quantity.
In one implementation, the second determination unit 904 is specifically configured to:
determine as the caption text the portion of the to-be-processed text group from the first character to the last separator; or,
determine as the caption text the portion of the to-be-processed text group from the first character to the last separator within the first preset-quantity characters, the preset quantity being determined according to the display subtitle length.
In one implementation, if the first judging unit 907 judges that the character count of the to-be-processed text group is greater than the preset quantity and the to-be-processed text group contains no separator, then, referring to Fig. 9d, the apparatus 900 further includes a second judging unit 909 and a fifth determination unit 910:
The second judging unit 909 is configured to judge whether the character count of the to-be-processed text group is greater than a maximum quantity, the maximum quantity being the character count corresponding to the maximum display subtitle length;
The fifth determination unit 910 is configured to determine the first maximum-quantity characters of the to-be-processed text group as the caption text if the second judging unit 909 judges that the character count of the to-be-processed text group is greater than the maximum quantity;
If the second judging unit 909 judges that the character count of the to-be-processed text group is not greater than the maximum quantity, the fourth determination unit 908 is triggered to execute the step of determining the to-be-processed text group as the caption text.
In one implementation, referring to Fig. 9e, the apparatus 900 further includes a sixth determination unit 911, a seventh determination unit 912, an eighth determination unit 913, and a ninth determination unit 914:
The sixth determination unit 911 is configured to determine the relative start moment of the first character of the caption text within its corresponding speech segment;
The seventh determination unit 912 is configured to determine the start moment of the audio-stream timeline interval corresponding to the caption text according to the relative start moment and the time offset, on the audio-stream timeline, of the speech segment corresponding to the first character;
The eighth determination unit 913 is configured to determine the relative end moment of the last character of the caption text within its corresponding speech segment;
The ninth determination unit 914 is configured to determine the end moment of the audio-stream timeline interval corresponding to the caption text according to the relative end moment and the time offset, on the audio-stream timeline, of the speech segment corresponding to the last character.
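The four determinations made by units 911-914 amount to a small calculation, sketched below under the assumption that all moments and offsets are expressed in seconds; the parameter names are illustrative, not terms fixed by the embodiment.

```python
# Sketch of units 911-914: map the relative moments of the caption's first
# and last characters inside their speech segments to absolute moments on
# the audio-stream timeline. Parameter names are illustrative assumptions.
def caption_interval(first_char_rel_start, first_seg_offset,
                     last_char_rel_end, last_seg_offset):
    """Segment offsets are the positions of the corresponding speech
    segments on the audio-stream timeline (seconds)."""
    start = first_seg_offset + first_char_rel_start   # units 911 + 912
    end = last_seg_offset + last_char_rel_end         # units 913 + 914
    return start, end
```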
It can be seen from the above technical solution that, in the process of generating subtitles for multiple speech segments cut from the same audio stream according to silent segments, speech recognition is performed on the multiple speech segments to obtain the text corresponding to each segment, and the text corresponding to the multiple speech segments contains separators added according to text semantics. When subtitles are determined according to the text corresponding to a target speech segment, a to-be-processed text group for generating subtitles is determined; the to-be-processed text group includes at least the text corresponding to the target speech segment. After the to-be-processed text group is determined, the separators in it can serve as the basis for determining caption text from the group. Because these separators were added on the basis of semantics when the text in the speech segments was recognized, the text portions between separators are complete sentences that embody reasonable semantics; the possibility of incomplete sentences appearing in the caption text determined by the separators is therefore greatly reduced. When such caption text is displayed as the subtitle of the corresponding audio-stream timeline interval, it helps users watching the audio/video to understand the content, improving user experience.
The embodiments of the present application further provide a device for subtitle generation, which is introduced below with reference to the accompanying drawings. Referring to Fig. 10, the embodiments of the present application provide a device 1000 for subtitle generation. The device 1000 may be a server, which may vary considerably due to differences in configuration or performance, and may include one or more central processing units (CPUs) 1022 (for example, one or more processors), memory 1032, and one or more storage media 1030 (such as one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may provide transient or persistent storage. The program stored in the storage medium 1030 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 and to execute, on the device 1000 for subtitle generation, the series of instruction operations in the storage medium 1030.
The device 1000 for subtitle generation may further include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041 such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 10.
The CPU 1022 is configured to execute the following steps:
obtain multiple speech segments cut from the same audio stream according to silent segments;
perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments containing separators added according to text semantics;
when determining subtitles according to the text corresponding to a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text corresponding to the target speech segment;
determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
use the caption text as the subtitle for the corresponding audio-stream timeline interval.
Referring to Fig. 11, the embodiments of the present application provide a device 1100 for subtitle generation. The device 1100 may also be a terminal device, which may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and so on. The terminal device being a mobile phone is taken as an example:
Fig. 11 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Fig. 11, the mobile phone includes components such as a radio frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will understand that the mobile phone structure shown in Fig. 11 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or use a different component arrangement.
Each component part of the mobile phone is specifically introduced below with reference to Fig. 11:
The RF circuit 1110 may be used for receiving and sending signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, it delivers the information to the processor 1180 for processing, and it sends uplink data to the base station. In general, the RF circuit 1110 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1110 may also communicate with networks and other devices by wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.
The memory 1120 may be used to store software programs and modules; the processor 1180 executes the various function applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, collects touch operations by the user on or near it (such as operations by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1131) and drives the corresponding connected apparatus according to a preset formula. Optionally, the touch panel 1131 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1131, the input unit 1130 may also include other input devices 1132. Specifically, the other input devices 1132 may include but are not limited to one or more of a physical keyboard, function keys (such as volume control keys, a switch key, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 1140 may include a display panel 1141; optionally, the display panel 1141 may be configured in forms such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display. Further, the touch panel 1131 may cover the display panel 1141; after detecting a touch operation on or near it, the touch panel 1131 transmits the operation to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides the corresponding visual output on the display panel 1141 according to the type of the touch event. Although in Fig. 11 the touch panel 1131 and the display panel 1141 implement the input and output functions of the mobile phone as two independent components, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1150, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally on three axes), can detect the magnitude and direction of gravity when static, and can be used for applications that recognize the mobile phone's posture (such as landscape/portrait switching, related games, magnetometer pose calibration), vibration-recognition-related functions (such as a pedometer, tapping), and so on; other sensors that may also be configured in the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, will not be described here.
The audio circuit 1160, a loudspeaker 1161, and a microphone 1162 may provide an audio interface between the user and the mobile phone. The audio circuit 1160 may transfer the electrical signal converted from received audio data to the loudspeaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts the collected sound signal into an electrical signal, which the audio circuit 1160 receives and converts into audio data; the audio data is then output to the processor 1180 for processing and sent via the RF circuit 1110 to, for example, another mobile phone, or output to the memory 1120 for further processing.
WiFi belongs to short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although Fig. 11 shows the WiFi module 1170, it can be understood that the module is not an essential component of the mobile phone and may be omitted as needed within the scope that does not change the essence of the invention.
The processor 1180 is the control center of the mobile phone; it connects all parts of the whole mobile phone using various interfaces and lines, and executes the various functions and data processing of the mobile phone by running or executing the software programs and/or modules stored in the memory 1120 and calling the data stored in the memory 1120, thereby monitoring the mobile phone as a whole. Optionally, the processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 1180.
The mobile phone further includes a power supply 1190 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 1180 through a power management system, so as to implement functions such as managing charging, discharging, and power consumption through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and so on, which will not be described here.
In this embodiment, the processor 1180 included in the terminal device also has the following functions:
obtain multiple speech segments cut from the same audio stream according to silent segments;
perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments, the text corresponding to the multiple speech segments containing separators added according to text semantics;
when determining subtitles according to the text corresponding to a target speech segment among the multiple speech segments, determine a to-be-processed text group that includes at least the text corresponding to the target speech segment;
determine caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
use the caption text as the subtitle for the corresponding audio-stream timeline interval.
The embodiments of the present application further provide a computer-readable storage medium for storing program code, the program code being used to execute any one implementation of the subtitle generation method described in the foregoing embodiments.
The terms "first", "second", "third", "fourth", etc. (if any) in the description of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so termed are interchangeable under appropriate circumstances, so that the embodiments of the present application described herein can be implemented, for example, in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.
It should be understood that in the present application, "at least one (item)" refers to one or more, and "multiple" refers to two or more. "And/or" describes an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, "A and/or B" may indicate the three cases of only A existing, only B existing, and both A and B existing, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary; for example, the division of the units is only a logical function division, and other division manners are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (13)
1. A subtitle generation method based on artificial intelligence, characterized in that the method comprises:
obtaining multiple speech segments cut from the same audio stream according to silent segments;
performing speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments;
determining the time length of the silent segments between the multiple speech segments;
determining, in the order of the audio-stream timeline and starting from a target speech segment among the multiple speech segments, whether the time length of each silent segment is greater than a preset duration;
if the time length of a target silent segment is determined to be greater than the preset duration, adding the text corresponding to the speech segments located between the target silent segment and the target speech segment to a to-be-processed text group, the to-be-processed text group including at least the text corresponding to the target speech segment;
determining caption text from the to-be-processed text group according to separators in the to-be-processed text group, the separators in the to-be-processed text group being added according to text semantics;
using the caption text as the subtitle for the corresponding audio-stream timeline interval.
2. The method according to claim 1, characterized in that the method further comprises:
judging whether the character count of the to-be-processed text group is greater than a preset quantity, the preset quantity being determined according to the display subtitle length;
if so, executing the step of determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group;
if not, determining the to-be-processed text group as the caption text.
3. The method according to claim 1, characterized in that determining caption text from the to-be-processed text group according to the separators in the to-be-processed text group comprises:
determining as the caption text the portion of the to-be-processed text group from the first character to the last separator; or,
determining as the caption text the portion of the to-be-processed text group from the first character to the last separator within the first preset-quantity characters, the preset quantity being determined according to the display subtitle length.
4. The method according to claim 2, wherein, if the character count of the text group to be processed is judged to exceed the preset count and the text group to be processed contains no separator, the method further comprises:
judging whether the character count of the text group to be processed exceeds a maximum count, the maximum count being the character count corresponding to the maximum display subtitle length;
if so, determining the first maximum-count characters of the text group to be processed as the caption text;
if not, determining the text group to be processed as the caption text.
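Claim 4's fallback for separator-free groups reduces to a hard truncation at the longest displayable subtitle. A sketch with an assumed illustrative maximum count:

```python
MAX_COUNT = 30  # illustrative: character count of the longest displayable subtitle

def caption_without_separator(text_group: str) -> str:
    """With no separator to cut at, keep the whole group if it fits the
    longest displayable subtitle, otherwise truncate to the maximum count."""
    return text_group[:MAX_COUNT] if len(text_group) > MAX_COUNT else text_group
```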
5. The method according to claim 1, wherein, after determining a caption text from the text group to be processed according to the separators in the text group to be processed, the method further comprises:
determining the relative start time of the first character of the caption text within its corresponding speech segment;
determining the start time of the audio stream timeline interval corresponding to the caption text according to the relative start time and the time offset, on the audio stream timeline, of the speech segment corresponding to the first character;
determining the relative end time of the last character of the caption text within its corresponding speech segment;
determining the end time of the audio stream timeline interval corresponding to the caption text according to the relative end time and the time offset, on the audio stream timeline, of the speech segment corresponding to the last character.
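The timestamp arithmetic of claim 5 amounts to adding each boundary character's in-segment relative time to its segment's offset on the audio stream timeline. A minimal sketch, all times in seconds and the example values hypothetical:

```python
def caption_interval(first_char_rel_start: float, first_seg_offset: float,
                     last_char_rel_end: float, last_seg_offset: float):
    """Absolute [start, end] of the caption on the audio stream timeline."""
    start = first_seg_offset + first_char_rel_start  # claim 5, start time
    end = last_seg_offset + last_char_rel_end        # claim 5, end time
    return start, end
```

For instance, a caption whose first character begins 0.5 s into a segment offset at 10 s starts at 10.5 s on the timeline.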
6. The method according to claim 1, wherein the method further comprises:
translating the caption text according to the subtitle display language to obtain a translated caption text;
wherein using the caption text as the subtitle for the corresponding audio stream timeline interval comprises:
using the translated caption text as the subtitle for the corresponding audio stream timeline interval.
7. An artificial-intelligence-based subtitle generating apparatus, wherein the apparatus comprises an acquiring unit, a recognition unit, a first determining unit, a second determining unit, a third determining unit and a generating unit:
the acquiring unit is configured to acquire multiple speech segments cut from the same audio stream according to silent clips;
the recognition unit is configured to perform speech recognition on the multiple speech segments to obtain the text corresponding to each of the multiple speech segments;
the third determining unit is configured to determine the time lengths of the silent clips between the multiple speech segments;
the first determining unit is configured to determine, in the order of the audio stream timeline and starting from a target speech segment, whether the time length of each silent clip exceeds a preset duration, and, if the time length of a target silent clip is determined to exceed the preset duration, add the text corresponding to the speech segments between the target silent clip and the target speech segment to a text group to be processed, the text group to be processed at least comprising the text of the target speech segment;
the second determining unit is configured to determine a caption text from the text group to be processed according to separators in the text group to be processed, the separators in the text group to be processed being added according to text semantics;
the generating unit is configured to use the caption text as the subtitle for the corresponding audio stream timeline interval.
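One way to read the grouping logic of the first determining unit: accumulate recognized texts until a silent clip longer than the preset duration is reached. A sketch under that assumption; the segment representation and the threshold value are illustrative, not from the claim:

```python
PRESET_DURATION = 0.8  # illustrative silence threshold in seconds

def build_text_groups(segments):
    """`segments` is a list of (recognized_text, silence_after) pairs, where
    `silence_after` is the length of the silent clip following each speech
    segment. Texts accumulate until a silence exceeds the preset duration,
    at which point they form one text group to be processed."""
    groups, current = [], []
    for text, silence_after in segments:
        current.append(text)
        if silence_after > PRESET_DURATION:
            groups.append("".join(current))  # long silence closes the group
            current = []
    if current:
        groups.append("".join(current))  # trailing texts not yet closed
    return groups
```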
8. The apparatus according to claim 7, wherein the apparatus further comprises a first judging unit and a fourth determining unit:
the first judging unit is configured to judge whether the character count of the text group to be processed exceeds a preset count, the preset count being determined according to the display subtitle length, and, if the first judging unit judges that the character count of the text group to be processed exceeds the preset count, to trigger the second determining unit to execute the step of determining a caption text from the text group to be processed according to the separators in the text group to be processed;
the fourth determining unit is configured to, if the first judging unit judges that the character count of the text group to be processed does not exceed the preset count, determine the text group to be processed as the caption text.
9. The apparatus according to claim 7, wherein the second determining unit is specifically configured to:
determine, as the caption text, the part of the text group to be processed from the first character to the last separator; or
determine, as the caption text, the part of the text group to be processed from the first character to the last separator within the first preset count of characters, the preset count being determined according to the display subtitle length.
10. The apparatus according to claim 8, wherein, if the first judging unit judges that the character count of the text group to be processed exceeds the preset count and the text group to be processed contains no separator, the apparatus further comprises a second judging unit and a fifth determining unit:
the second judging unit is configured to judge whether the character count of the text group to be processed exceeds a maximum count, the maximum count being the character count corresponding to the maximum display subtitle length;
the fifth determining unit is configured to, if the second judging unit judges that the character count of the text group to be processed exceeds the maximum count, determine the first maximum-count characters of the text group to be processed as the caption text;
if the second judging unit judges that the character count of the text group to be processed does not exceed the maximum count, the fourth determining unit is triggered to execute the step of determining the text group to be processed as the caption text.
11. The apparatus according to claim 7, wherein the apparatus further comprises a sixth determining unit, a seventh determining unit, an eighth determining unit and a ninth determining unit:
the sixth determining unit is configured to determine the relative start time of the first character of the caption text within its corresponding speech segment;
the seventh determining unit is configured to determine the start time of the audio stream timeline interval corresponding to the caption text according to the relative start time and the time offset, on the audio stream timeline, of the speech segment corresponding to the first character;
the eighth determining unit is configured to determine the relative end time of the last character of the caption text within its corresponding speech segment;
the ninth determining unit is configured to determine the end time of the audio stream timeline interval corresponding to the caption text according to the relative end time and the time offset, on the audio stream timeline, of the speech segment corresponding to the last character.
12. An artificial-intelligence-based subtitle generating device, wherein the device comprises a processor and a memory:
the memory is configured to store program code and transfer the program code to the processor;
the processor is configured to execute, according to instructions in the program code, the artificial-intelligence-based subtitle generating method according to any one of claims 1-6.
13. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store program code, and the program code is configured to execute the artificial-intelligence-based subtitle generating method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910740413.6A CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201811355311.4A CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Division CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110381389A true CN110381389A (en) | 2019-10-25 |
CN110381389B CN110381389B (en) | 2022-02-25 |
Family
ID=65389096
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
CN201910740413.6A Active CN110381389B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811355311.4A Active CN109379641B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910741161.9A Active CN110418208B (en) | 2018-11-14 | 2018-11-14 | Subtitle determining method and device based on artificial intelligence |
CN201910740405.1A Active CN110381388B (en) | 2018-11-14 | 2018-11-14 | Subtitle generating method and device based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN109379641B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639233A (en) * | 2020-05-06 | 2020-09-08 | 广东小天才科技有限公司 | Learning video subtitle adding method and device, terminal equipment and storage medium |
CN111916053A (en) * | 2020-08-17 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112188241A (en) * | 2020-10-09 | 2021-01-05 | 上海网达软件股份有限公司 | Method and system for real-time subtitle generation of live stream |
CN112686018A (en) * | 2020-12-23 | 2021-04-20 | 科大讯飞股份有限公司 | Text segmentation method, device, equipment and storage medium |
CN112750425A (en) * | 2020-01-22 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and computer readable storage medium |
CN113225618A (en) * | 2021-05-06 | 2021-08-06 | 阿里巴巴新加坡控股有限公司 | Video editing method and device |
CN113596579A (en) * | 2021-07-29 | 2021-11-02 | 北京字节跳动网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113660432A (en) * | 2021-08-17 | 2021-11-16 | 安徽听见科技有限公司 | Translation subtitle production method and device, electronic equipment and storage medium |
WO2022037383A1 (en) * | 2020-08-17 | 2022-02-24 | 北京字节跳动网络技术有限公司 | Voice processing method and apparatus, electronic device, and computer readable medium |
CN114420125A (en) * | 2020-10-12 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Audio processing method, device, electronic equipment and medium |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN112037768A (en) * | 2019-05-14 | 2020-12-04 | 北京三星通信技术研究有限公司 | Voice translation method and device, electronic equipment and computer readable storage medium |
CN110379413B (en) * | 2019-06-28 | 2022-04-19 | 联想(北京)有限公司 | Voice processing method, device, equipment and storage medium |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110648653A (en) * | 2019-09-27 | 2020-01-03 | 安徽咪鼠科技有限公司 | Subtitle realization method, device and system based on intelligent voice mouse and storage medium |
CN110933485A (en) * | 2019-10-21 | 2020-03-27 | 天脉聚源(杭州)传媒科技有限公司 | Video subtitle generating method, system, device and storage medium |
CN110660393B (en) * | 2019-10-31 | 2021-12-03 | 广东美的制冷设备有限公司 | Voice interaction method, device, equipment and storage medium |
CN110992960A (en) * | 2019-12-18 | 2020-04-10 | Oppo广东移动通信有限公司 | Control method, control device, electronic equipment and storage medium |
CN111353038A (en) * | 2020-05-25 | 2020-06-30 | 深圳市友杰智新科技有限公司 | Data display method and device, computer equipment and storage medium |
CN111832279B (en) * | 2020-07-09 | 2023-12-05 | 抖音视界有限公司 | Text partitioning method, apparatus, device and computer readable medium |
CN111986654B (en) * | 2020-08-04 | 2024-01-19 | 云知声智能科技股份有限公司 | Method and system for reducing delay of voice recognition system |
CN112272277B (en) * | 2020-10-23 | 2023-07-18 | 岭东核电有限公司 | Voice adding method and device in nuclear power test and computer equipment |
CN113886612A (en) * | 2020-11-18 | 2022-01-04 | 北京字跳网络技术有限公司 | Multimedia browsing method, device, equipment and medium |
CN113099292A (en) * | 2021-04-21 | 2021-07-09 | 湖南快乐阳光互动娱乐传媒有限公司 | Multi-language subtitle generating method and device based on video |
CN112995736A (en) * | 2021-04-22 | 2021-06-18 | 南京亿铭科技有限公司 | Speech subtitle synthesis method, apparatus, computer device, and storage medium |
CN113938758A (en) * | 2021-12-08 | 2022-01-14 | 沈阳开放大学 | Method for quickly adding subtitles in video editor |
CN114268829B (en) * | 2021-12-22 | 2024-01-16 | 中电金信软件有限公司 | Video processing method, video processing device, electronic equipment and computer readable storage medium |
CN114554238B (en) * | 2022-02-23 | 2023-08-11 | 北京有竹居网络技术有限公司 | Live broadcast voice simultaneous transmission method, device, medium and electronic equipment |
CN115831120B (en) * | 2023-02-03 | 2023-06-16 | 北京探境科技有限公司 | Corpus data acquisition method and device, electronic equipment and readable storage medium |
CN116471436B (en) * | 2023-04-12 | 2024-05-31 | 央视国际网络有限公司 | Information processing method and device, storage medium and electronic equipment |
CN116612781B (en) * | 2023-07-20 | 2023-09-29 | 深圳市亿晟科技有限公司 | Visual processing method, device and equipment for audio data and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143331A (en) * | 2013-05-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and system for adding punctuations |
CN105244022A (en) * | 2015-09-28 | 2016-01-13 | 科大讯飞股份有限公司 | Audio and video subtitle generation method and apparatus |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN106331893A (en) * | 2016-08-31 | 2017-01-11 | 科大讯飞股份有限公司 | Real-time subtitle display method and system |
CN107632980A (en) * | 2017-08-03 | 2018-01-26 | 北京搜狗科技发展有限公司 | Voice translation method and device, the device for voiced translation |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6697564B1 (en) * | 2000-03-03 | 2004-02-24 | Siemens Corporate Research, Inc. | Method and system for video browsing and editing by employing audio |
KR100521914B1 (en) * | 2002-04-24 | 2005-10-13 | 엘지전자 주식회사 | Method for managing a summary of playlist information |
AU2003241205B2 (en) * | 2002-06-24 | 2009-03-26 | Lg Electronics Inc. | Recording medium having data structure for managing reproduction of multiple title video data recorded thereon and recording and reproducing methods and apparatuses |
CN100547670C (en) * | 2004-03-17 | 2009-10-07 | Lg电子株式会社 | Be used to reproduce recording medium, the method and apparatus of text subtitle stream |
WO2010105245A2 (en) * | 2009-03-12 | 2010-09-16 | Exbiblio B.V. | Automatically providing content associated with captured information, such as information captured in real-time |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
US9898773B2 (en) * | 2014-11-18 | 2018-02-20 | Microsoft Technology Licensing, Llc | Multilingual content based recommendation system |
CN106878805A (en) * | 2017-02-06 | 2017-06-20 | 广东小天才科技有限公司 | A kind of mixed languages subtitle file generation method and device |
CN110444197B (en) * | 2018-05-10 | 2023-01-03 | 腾讯科技(北京)有限公司 | Data processing method, device and system based on simultaneous interpretation and storage medium |
2018
- 2018-11-14 CN CN201811355311.4A patent/CN109379641B/en active Active
- 2018-11-14 CN CN201910740413.6A patent/CN110381389B/en active Active
- 2018-11-14 CN CN201910741161.9A patent/CN110418208B/en active Active
- 2018-11-14 CN CN201910740405.1A patent/CN110381388B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN110381388B (en) | 2021-04-13 |
CN109379641B (en) | 2022-06-03 |
CN110381388A (en) | 2019-10-25 |
CN110418208A (en) | 2019-11-05 |
CN110418208B (en) | 2021-07-27 |
CN110381389B (en) | 2022-02-25 |
CN109379641A (en) | 2019-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110381389A (en) | A kind of method for generating captions and device based on artificial intelligence | |
JP7312853B2 (en) | AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM | |
CN110288077B (en) | Method and related device for synthesizing speaking expression based on artificial intelligence | |
CN108304846B (en) | Image recognition method, device and storage medium | |
CN110544488B (en) | Method and device for separating multi-person voice | |
CN108287739A (en) | A kind of guiding method of operating and mobile terminal | |
CN112040263A (en) | Video processing method, video playing method, video processing device, video playing device, storage medium and equipment | |
CN110570840B (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN109063583A (en) | A kind of learning method and electronic equipment based on read operation | |
CN107040452B (en) | Information processing method and device and computer readable storage medium | |
CN110322760B (en) | Voice data generation method, device, terminal and storage medium | |
CN108735216A (en) | A kind of voice based on semantics recognition searches topic method and private tutor's equipment | |
CN111538456A (en) | Human-computer interaction method, device, terminal and storage medium based on virtual image | |
WO2016119165A1 (en) | Chat history display method and apparatus | |
CN109462768A (en) | A kind of caption presentation method and terminal device | |
CN110033769A (en) | A kind of typing method of speech processing, terminal and computer readable storage medium | |
CN110430475A (en) | A kind of interactive approach and relevant apparatus | |
CN110471589A (en) | Information display method and terminal device | |
CN109634550A (en) | A kind of voice operating control method and terminal device | |
CN111816168A (en) | Model training method, voice playing method, device and storage medium | |
CN111639209A (en) | Book content searching method, terminal device and storage medium | |
CN110333803A (en) | A kind of multimedia object selection method and terminal device | |
CN109325219A (en) | A kind of method, apparatus and system generating recording documents | |
CN109725798A (en) | The switching method and relevant apparatus of Autonomous role | |
CN108255389A (en) | Image edit method, mobile terminal and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||