CN106331844A - Method and device for generating subtitles of media file and electronic equipment - Google Patents

Method and device for generating subtitles of media file and electronic equipment

Info

Publication number
CN106331844A
CN106331844A (application CN201610683362.4A)
Authority
CN
China
Prior art keywords
audio
information
segmentation
frequency information
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610683362.4A
Other languages
Chinese (zh)
Inventor
田昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Internet Security Software Co Ltd
Original Assignee
Beijing Kingsoft Internet Security Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Internet Security Software Co Ltd filed Critical Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN201610683362.4A priority Critical patent/CN106331844A/en
Publication of CN106331844A publication Critical patent/CN106331844A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a method, a device and electronic equipment for generating subtitles of a media file. The method includes: segmenting audio information of the media file to obtain multiple segments of segmented audio information; processing the segments to obtain target audio information whose end frame contains no voice audio information; recognizing the target audio information as corresponding text; and processing the text to generate subtitle information of the media file. By segmenting and processing the audio of the media file to obtain target audio information whose end frame contains no voice audio information, the invention avoids splitting the speech of a single sentence across two audio segments and improves the accuracy of subtitles generated by speech recognition.

Description

Method, device and electronic equipment for generating subtitles of a media file
Technical field
The present invention relates to the field of electronic technology, and in particular to a method, a device and electronic equipment for generating subtitles of a media file.
Background art
When playing media files, the following situations are frequently encountered: (1) pronunciation differs greatly between regions, and many people do not understand Mandarin; (2) the lines of some media files include regional dialects that many people find hard to understand; (3) some media files have loud background sound, or the playback environment is noisy, so many viewers cannot hear the lines. In these situations, displaying the speech content of the media file as subtitles can help viewers better understand it. However, many media files have no subtitles, or their subtitle timestamps are misaligned, which makes the content hard to follow.
The prior art uses speech recognition to generate subtitles for media files, mainly by splitting the audio at a preset time interval and recognizing each piece separately. However, splitting the audio at arbitrary fixed-time boundaries often makes the speech recognition inaccurate.
Therefore, there is a need to solve the technical problem of improving the accuracy of subtitles generated by speech recognition.
Summary of the invention
The present invention provides a method, a device and electronic equipment for generating subtitles of a media file. The audio of the media file is segmented and processed to obtain target audio information whose end frame contains no voice audio information; speech recognition is then performed on the target audio information to generate the subtitle information of the media file. This avoids splitting the speech of a single sentence across two audio segments and improves the accuracy of subtitles generated by speech recognition.
In one aspect, an embodiment of the invention provides a method for generating subtitles of a media file, applied to electronic equipment, the method including:
segmenting the audio information of the media file to obtain multiple segments of segmented audio information;
processing the multiple segments of segmented audio information to obtain target audio information whose end frame contains no voice audio information;
recognizing the target audio information as corresponding text;
processing the text to generate subtitle information of the media file.
Specifically, segmenting the audio information of the media file to obtain multiple segments of segmented audio information includes:
decoding the media file to obtain its audio information;
determining a split time according to the processing capability of the processor of the electronic equipment;
segmenting the audio information of the media file according to the split time to obtain the multiple segments of segmented audio information.
Specifically, processing the segments to obtain target audio information whose end frame contains no voice audio information includes:
starting from the first segment, successively splicing each adjacent next segment onto it and, after each splice, judging whether the end frame of the spliced audio contains voice audio information, until the audio obtained by splicing the first through the n-th segments has an end frame that contains no voice audio information; this completes one processing operation, and the audio obtained by splicing the first through the n-th segments is target audio information;
the processing operation is then restarted from the (n+1)-th segment, where n is an integer greater than 1.
Alternatively, processing the segments to obtain target audio information whose end frame contains no voice audio information includes:
judging, segment by segment starting from the first segment, whether the end frame contains voice audio information, until the end frame of the n-th segment is found to contain voice audio information; starting from the n-th segment, successively splicing the following segments and, after each splice, judging whether the end frame of the spliced audio contains voice audio information, until the audio obtained by splicing the n-th through the (n+i)-th segments has an end frame that contains no voice audio information; this completes one processing operation, in which every segment whose end frame contains no voice audio information, as well as the audio obtained by splicing the n-th through the (n+i)-th segments, is target audio information;
the processing operation is then restarted from the (n+i+1)-th segment, where n is an integer greater than 0 and i is an integer greater than 0.
Alternatively, processing the segments to obtain target audio information whose end frame contains no voice audio information includes:
judging all segments to separate those whose end frame contains voice audio information from those whose end frame does not;
if the n-th through the (n+i)-th segments all have end frames containing voice audio information and the (n+i+1)-th segment has an end frame containing no voice audio information, splicing the n-th through the (n+i+1)-th segments together; each segment whose end frame contains no voice audio information, as well as the audio obtained by splicing the n-th through the (n+i+1)-th segments, is target audio information;
where n is an integer greater than 0 and i is an integer greater than or equal to 0.
When multiple pieces of target audio information are obtained, recognizing the target audio information as corresponding text specifically includes:
performing multi-threaded speech recognition on the pieces of target audio information with the speech recognition module of the electronic equipment to obtain the text corresponding to each piece.
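As a rough sketch of the multi-threaded recognition described above, the pieces of target audio information can be dispatched to a thread pool while their order is preserved; the `recognize` function here is a hypothetical stand-in for the real speech recognition module, not something named in the patent:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(piece):
    # Hypothetical stand-in for the device's speech recognition module:
    # it just labels the piece so the parallel structure can be shown.
    return "text-for-%s" % piece

def recognize_all(pieces, max_workers=4):
    """Recognize each piece of target audio on its own thread; map()
    returns results in the original order of the pieces."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize, pieces))

texts = recognize_all(["seg-a", "seg-b", "seg-c"])
```

Because `Executor.map` yields results in input order, the recognized text lines stay aligned with their audio pieces even though recognition runs concurrently.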
Alternatively, recognizing the target audio information as corresponding text specifically includes:
sending the target audio information to a cloud server and receiving the text obtained by the cloud server through speech recognition.
Specifically, processing the text to generate the subtitle information of the media file includes:
obtaining the timestamp information of the target audio information;
generating the subtitle information from the text according to the timestamp information.
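The timestamp-based subtitle generation above could look roughly like the following sketch, which assembles text and millisecond timestamps into the common SubRip (SRT) layout; the helper names and the choice of SRT are illustrative assumptions, not taken from the patent:

```python
def format_ts(ms):
    """Format a millisecond offset as an SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def build_srt(entries):
    """entries: list of (start_ms, end_ms, text) tuples, one per piece
    of target audio information, in playback order."""
    blocks = []
    for idx, (start, end, text) in enumerate(entries, 1):
        blocks.append("%d\n%s --> %s\n%s"
                      % (idx, format_ts(start), format_ts(end), text))
    return "\n\n".join(blocks) + "\n"

srt = build_srt([(0, 1800, "first line"), (1800, 4200, "second line")])
```

Since each piece of target audio carries its own start and end time, the resulting subtitle blocks stay synchronized with the media file when played back.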
Preferably, the method further includes importing the subtitle information into the media file and synchronously displaying the text in the subtitle information.
In another aspect, an embodiment of the invention provides a device for generating subtitles of a media file, applied to electronic equipment and including a segmentation module, a processing module, a speech recognition module and a subtitle generation module, wherein:
the segmentation module is configured to segment the audio information of the media file to obtain multiple segments of segmented audio information;
the processing module is configured to process the segments to obtain target audio information whose end frame contains no voice audio information;
the speech recognition module is configured to recognize the target audio information as corresponding text;
the subtitle generation module is configured to process the text to generate the subtitle information of the media file.
The segmentation module includes a first acquisition unit and a segmentation unit, wherein:
the first acquisition unit is configured to decode the media file to obtain its audio information;
the segmentation unit is configured to determine a split time according to the processing capability of the processor of the electronic equipment, and to segment the audio information of the media file according to the split time to obtain the multiple segments of segmented audio information.
The processing module includes a splicing unit and a judging unit, wherein:
the splicing unit splices the segments successively starting from the first segment, and after each splice the judging unit judges whether the end frame of the spliced audio contains voice audio information, until the judging unit determines that the audio obtained by splicing the first through the n-th segments has an end frame containing no voice audio information; this completes one processing operation, and the audio obtained by splicing the first through the n-th segments is target audio information;
the splicing unit and the judging unit then restart the processing operation from the (n+1)-th segment, where n is an integer greater than 1.
Alternatively, the processing module includes a splicing unit and a judging unit, wherein:
the judging unit judges, segment by segment starting from the first segment, whether the end frame contains voice audio information, until the end frame of the n-th segment is found to contain voice audio information; the splicing unit then splices the following segments onto the n-th segment successively, and after each splice the judging unit judges whether the end frame of the spliced audio contains voice audio information, until it determines that the audio obtained by splicing the n-th through the (n+i)-th segments has an end frame containing no voice audio information; this completes one processing operation, in which every segment whose end frame contains no voice audio information, as well as the audio obtained by splicing the n-th through the (n+i)-th segments, is target audio information;
the splicing unit and the judging unit then restart the processing operation from the (n+i+1)-th segment, where n is an integer greater than 0 and i is an integer greater than 0.
Alternatively, the processing module includes a splicing unit and a judging unit, wherein:
the judging unit judges all segments to separate those whose end frame contains voice audio information from those whose end frame does not;
if the n-th through the (n+i)-th segments all have end frames containing voice audio information and the (n+i+1)-th segment has an end frame containing no voice audio information, the splicing unit splices the n-th through the (n+i+1)-th segments together; each segment whose end frame contains no voice audio information, as well as the audio obtained by splicing the n-th through the (n+i+1)-th segments, is target audio information;
where n is an integer greater than 0 and i is an integer greater than or equal to 0.
When multiple pieces of target audio information are obtained, the speech recognition module is specifically configured to perform multi-threaded speech recognition on them to obtain the text corresponding to each piece.
Alternatively, the speech recognition module is specifically configured to send the target audio information to a cloud server and receive the text obtained by the cloud server through speech recognition.
The subtitle generation module includes a second acquisition unit and a subtitle generating unit, wherein:
the second acquisition unit is configured to obtain the timestamp information of the target audio information;
the subtitle generating unit is configured to generate the subtitle information from the text according to the timestamp information.
Preferably, the device further includes a subtitle display module configured to import the subtitle information into the media file and synchronously display the text in the subtitle information.
In yet another aspect, an embodiment of the invention provides a terminal including the device for generating subtitles of a media file described above.
In yet another aspect, an embodiment of the invention provides electronic equipment including a housing, a processor, a memory, a display screen, a circuit board and a power supply circuit, wherein the circuit board is arranged inside the space enclosed by the housing; the processor and the memory are arranged on the circuit board; the display screen is embedded in the housing and connected to the circuit board; the power supply circuit supplies power to each circuit or component of the electronic equipment; the memory stores executable program code and data; and the processor, by reading the executable program code stored in the memory, runs the program corresponding to that code so as to perform the following steps:
segmenting the audio information of a media file to obtain multiple segments of segmented audio information;
processing the multiple segments of segmented audio information to obtain target audio information whose end frame contains no voice audio information;
recognizing the target audio information as corresponding text;
processing the text to generate the subtitle information of the media file.
The above solutions of the invention provide at least the following beneficial effects:
The invention obtains the audio information of a media file; segments it to obtain multiple segments of segmented audio information; processes the segments to obtain target audio information whose end frame contains no voice audio information; recognizes the target audio information as corresponding text; and processes the text to generate the subtitle information of the media file. By segmenting and processing the audio of the media file to obtain target audio information whose end frame contains no voice audio information, the invention avoids splitting the speech of a single sentence across two audio segments and thus improves the accuracy of subtitles generated by speech recognition.
Brief description of the drawings
Specific embodiments of the invention are described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the method for generating subtitles of a media file in embodiment one of the invention;
Fig. 2 is a schematic diagram of the method for generating subtitles of a media file in embodiment two of the invention;
Fig. 3 is a schematic structural diagram of the device for generating subtitles of a media file in embodiment three of the invention;
Fig. 4 is a schematic structural diagram of the device for generating subtitles of a media file in embodiment four of the invention;
Fig. 5 is a schematic structural diagram of the processing module in embodiment four of the invention;
Fig. 6 is a schematic structural diagram of the speech recognition module in embodiment four of the invention;
Fig. 7 is a schematic structural diagram of the electronic equipment in embodiment five of the invention.
Detailed description of the invention
To make the technical solutions and advantages of the invention clearer, exemplary embodiments of the invention are described in more detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the invention rather than an exhaustive list of all embodiments. In addition, where no conflict arises, the embodiments in this description and the features in the embodiments may be combined with each other.
Aiming at the prior-art problem that subtitles generated for media files by speech recognition have low accuracy, embodiments of the invention provide a method, a device and electronic equipment for generating subtitles of a media file. By segmenting and processing the audio of the media file, target audio information whose end frame contains no voice audio information is obtained; this avoids splitting the speech of a single sentence across two audio segments and thus improves the accuracy of subtitles generated by speech recognition.
In embodiments of the invention, the media file may be a video file or a video stream, whose sources include but are not limited to: (1) video files stored in a storage device; (2) live video streams, such as live television video streams and online live video streams.
Embodiment one
Fig. 1 is a schematic flowchart of the first embodiment of the method for generating subtitles of a media file provided by the invention. The method provided by embodiment one of the invention includes:
Step 101: segment the audio information of the media file to obtain multiple segments of segmented audio information;
Step 102: process the segments to obtain target audio information whose end frame contains no voice audio information;
Step 103: recognize the target audio information as corresponding text;
Step 104: process the text to generate the subtitle information of the media file.
In this embodiment of the invention, by segmenting and processing the audio of the media file, target audio information whose end frame contains no voice audio information is obtained, which avoids splitting the speech of a single sentence across two audio segments and thus improves the accuracy of subtitles generated by speech recognition.
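The four steps above can be sketched as a small pipeline in which each stage is pluggable; the toy stage functions passed in below (word splitting, identity merge, punctuation stripping) are placeholders for illustration only, not the patent's actual stages:

```python
def generate_subtitles(audio, segment, merge, recognize, to_caption):
    """Steps 101-104 as a pipeline; each stage is passed in as a function."""
    segments = segment(audio)                        # step 101: segmentation
    targets = merge(segments)                        # step 102: splice into target audio
    texts = [recognize(t) for t in targets]          # step 103: speech recognition
    return [to_caption(i, t) for i, t in enumerate(texts, 1)]  # step 104: captions

# Toy stages: "recognition" just strips punctuation from each word.
caps = generate_subtitles(
    "hello. world.",
    lambda a: a.split(),
    lambda segs: segs,
    lambda t: t.strip("."),
    lambda i, t: "%d: %s" % (i, t),
)
```

Keeping the stages separate mirrors the module split used in the device embodiments (segmentation, processing, speech recognition and subtitle generation modules).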
Fig. 2 is a schematic flowchart of the second embodiment of the method for generating subtitles of a media file provided by the invention. The method provided by embodiment two of the invention includes:
Step 201: segment the audio information of the media file to obtain multiple segments of segmented audio information.
In this embodiment, before the audio information of the media file is segmented, a step of receiving a subtitle acquisition instruction from the user may be included. Specifically, when the terminal or electronic equipment plays a media file, a button for acquiring subtitles may be displayed at a certain position of the media file, such as but not limited to the upper left or upper right corner; the user clicks the button when subtitles are wanted, and the terminal or electronic equipment performs the subsequent steps after receiving the instruction.
The audio is segmented according to a preset time, which may be, but is not limited to, chosen as follows: different times are preset according to the processing capability of the terminal's processor; for processors with high processing capability the split time is 300 milliseconds or less, and for processors with low processing capability it is 300-500 milliseconds.
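Under the stated rule that the split time is chosen from the processor's capability class, a minimal sketch of the segmentation step might be (the concrete values 300 and 500 ms are illustrative picks within the ranges named above):

```python
def choose_split_ms(high_capability):
    """Preset split time by processor capability class; 300 and 500 ms
    are illustrative choices within the ranges given in the text."""
    return 300 if high_capability else 500

def split_samples(samples, sample_rate, split_ms):
    """Cut a PCM sample buffer into segments of split_ms milliseconds;
    the final segment may be shorter than the rest."""
    step = sample_rate * split_ms // 1000
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 10 samples at a toy 8 Hz rate, 500 ms split -> 4-sample segments
segments = split_samples(list(range(10)), sample_rate=8, split_ms=500)
```

A shorter split time gives finer-grained segments and more frequent end-frame checks, at the cost of more processing, which is why the preset is tied to processor capability.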
Step 202: process the multiple segments of segmented audio information to obtain target audio information whose end frame contains no voice audio information.
This embodiment provides three methods for processing the segments. The first method first splices segmented audio information together and then judges whether the end frame of the spliced audio contains voice audio information. The second method first judges whether the end frame of each segment contains voice audio information, and splices only from a segment whose end frame is found to contain voice audio information. The third method first judges, for all segments, whether the end frame contains voice audio information, obtaining all segments whose end frames do and do not contain voice audio information, and then splices each run of segments whose end frames contain voice audio information together with the segment that follows the run.
The first processing method in this embodiment is as follows:
starting from the first segment, successively splice each adjacent next segment onto it and, after each splice, judge whether the end frame of the spliced audio contains voice audio information, until the audio obtained by splicing the first through the n-th segments has an end frame that contains no voice audio information; this completes one processing operation, and the audio obtained by splicing the first through the n-th segments is target audio information;
the processing operation is then restarted from the (n+1)-th segment, where n is an integer greater than 1.
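A minimal sketch of this first method, with voiced samples modelled as 1 and silent samples as 0 and the end frame reduced to the last sample of a segment (a toy model, not the patent's frame definition), might be:

```python
def splice_method_one(segments, end_frame_silent):
    """First method: splice from the first segment onward; whenever the
    spliced audio's end frame is silent, emit it as target audio and
    restart from the next segment."""
    targets, spliced = [], []
    for seg in segments:
        spliced = spliced + seg
        if end_frame_silent(seg):      # end frame of the spliced audio is
            targets.append(spliced)    # the end frame of its last segment
            spliced = []
    if spliced:                        # trailing spliced audio is always a target
        targets.append(spliced)
    return targets

# 1 marks a voiced sample, 0 a silent one; end frame = last sample
targets = splice_method_one([[1, 1], [1, 0], [1, 1]], lambda s: s[-1] == 0)
```

Here the first two segments merge into one target because the first segment ends voiced, and the final voiced-ending segment still becomes a target on its own, matching the rule for the last piece of audio.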
The second processing method in this embodiment is as follows:
judge, segment by segment starting from the first segment, whether the end frame contains voice audio information, until the end frame of the n-th segment is found to contain voice audio information; starting from the n-th segment, successively splice the following segments and, after each splice, judge whether the end frame of the spliced audio contains voice audio information, until the audio obtained by splicing the n-th through the (n+i)-th segments has an end frame that contains no voice audio information; this completes one processing operation, in which every segment whose end frame contains no voice audio information, as well as the audio obtained by splicing the n-th through the (n+i)-th segments, is target audio information;
the processing operation is then restarted from the (n+i+1)-th segment, where n is an integer greater than 0 and i is an integer greater than 0.
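The second method can be sketched under the same toy model (1 = voiced sample, 0 = silent, end frame = last sample of a segment); segments whose own end frame is already silent become targets without any splicing:

```python
def splice_method_two(segments, end_frame_silent):
    """Second method: a segment whose end frame is silent is a target by
    itself; a segment whose end frame is voiced starts a splice run that
    continues until the run's end frame is silent."""
    targets, run = [], []
    for seg in segments:
        if run:                        # already inside a splice run
            run = run + seg
            if end_frame_silent(seg):
                targets.append(run)
                run = []
        elif end_frame_silent(seg):
            targets.append(seg)        # stand-alone target, no splicing needed
        else:
            run = list(seg)            # voiced end frame: start splicing
    if run:                            # trailing run is kept regardless
        targets.append(run)
    return targets

targets = splice_method_two([[1, 0], [1, 1], [0, 0], [1, 1]],
                            lambda s: s[-1] == 0)
```

Compared with the first method, segments that already end in silence are passed through untouched, so splicing work is only spent where a sentence actually crosses a segment boundary.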
The third method of processing the segmented audio information in this embodiment is as follows:
Judge all the segments of audio information to obtain the segments whose end frames contain speech audio information and the segments whose end frames do not;
If the n-th through (n+i)-th segments are segments whose end frames contain speech audio information and the (n+i+1)-th segment is one whose end frame does not, splice the n-th through (n+i+1)-th segments; every segment whose end frame contains no speech audio information, together with the audio information obtained by splicing the n-th through (n+i+1)-th segments, is target audio information;
wherein n is an integer greater than 0 and i is an integer greater than or equal to 0.
In all three processing methods above, the last segment of audio information (or the last spliced audio information) is treated as target audio information regardless of whether its end frame contains speech audio information.
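All three variants group consecutive segments so that each group ends in silence. A minimal Python sketch of that grouping, not taken from the patent: segments are modeled as byte strings, and `ends_with_speech` is a placeholder for the end-frame judgment.

```python
def merge_segments(segments, ends_with_speech):
    """Greedily splice consecutive segments; a spliced chunk is closed
    as one piece of target audio once its end frame holds no speech.
    The trailing chunk is kept regardless of how it ends."""
    targets, current = [], []
    for seg in segments:
        current.append(seg)
        # The end frame of the spliced audio is the end frame of the
        # segment just appended, so judging `seg` alone is equivalent.
        if not ends_with_speech(seg):
            targets.append(b"".join(current))
            current = []
    if current:  # last chunk: target audio even if it ends in speech
        targets.append(b"".join(current))
    return targets
```

Under this model, the first variant (judge after every splice), the second (pre-judge, then splice from the first speech-ending segment), and the third (label all segments first, then merge) all produce the same grouping.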
In all three processing methods above, judging whether the end frame of the resulting audio information contains speech audio information may be performed, without limitation, using voice activity detection (VAD) based on short-time energy and zero-crossing rate. Specifically, a threshold is set for the short-time energy and another for the zero-crossing rate; the short-time energy and zero-crossing rate of the end frame are computed, and if both exceed their thresholds, the end frame is considered to contain speech audio information.
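The short-time-energy and zero-crossing-rate test can be sketched as follows. This is a rough illustration only; the frame length and both threshold values are placeholders, since the patent does not fix them.

```python
def end_frame_has_speech(samples, frame_len=400,
                         energy_thresh=1e-3, zcr_thresh=0.05):
    """Judge the last frame of a chunk of float samples: it is deemed
    speech only when both short-time energy and zero-crossing rate
    exceed their thresholds."""
    frame = samples[-frame_len:]
    energy = sum(x * x for x in frame) / len(frame)               # short-time energy
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    zcr = crossings / (len(frame) - 1)                            # zero-crossing rate
    return energy > energy_thresh and zcr > zcr_thresh
```

A frame of silence fails the energy test, while an oscillating signal passes both, matching the patent's "both above threshold" rule.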
Step 203: identify the target audio information as corresponding text.
In this embodiment, the target audio information may be identified as corresponding text as follows: using the speech recognition module of the electronic device or terminal, perform multithreaded speech recognition on the multiple pieces of target audio information to obtain the text corresponding to each piece of target audio information.
Of course, in this embodiment the target audio information may also be identified as corresponding text in other ways: (1) using the speech recognition module of the electronic device or terminal, perform speech recognition on the target audio information to obtain the corresponding text; or (2) send the target audio information to a cloud server and receive the text obtained by the cloud server's speech recognition.
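The multithreaded variant maps naturally onto a thread pool. A sketch under the assumption that `recognize_one` is whatever per-chunk recognizer (a local module or a cloud call) is available; it is not an API named by the patent.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(target_chunks, recognize_one, max_workers=4):
    """Recognize several pieces of target audio concurrently while
    keeping the transcripts in the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so transcript i belongs to chunk i.
        return list(pool.map(recognize_one, target_chunks))
```

Order preservation matters here because each transcript must later be paired with the timestamps of its chunk.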
Step 204: process the text to generate the subtitle information of the media file.
In this embodiment, processing the text to generate the subtitle information of the media file may be: obtain the timestamp information of the target audio information and generate the subtitle information from the text according to the timestamp information. Specifically, the recognized text is added to a text document that stores the recognized text of each piece of target audio information; then, according to the content of the document and the timestamps, the text is written into the subtitle information in a format that adds subtitles according to time codes.
There are many kinds of subtitles; the more commonly used subtitle formats fall into two classes, graphical and text. Compared with graphical subtitles, text subtitles are small, simple in format, and easy to create and modify. Text subtitle formats include utf, idx, sub, srt, smi, rt, txt, ssa, aq, jss, js, and ass, of which srt text subtitles are the most widely used: they are compatible with common media players, and players such as MPC and QQPlayer load subtitles of this type automatically. Therefore, in this embodiment the subtitle information uses the srt format; of course, this embodiment does not limit the format of the subtitle information, as long as the format is supported by the media player used.
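For reference, one way to write such time-coded subtitles in the srt format is sketched below. The cue boundaries are assumed to come from the timestamps of the target audio pieces; this is an illustration, not the patent's implementation.

```python
def to_srt_time(ms):
    """Render a millisecond offset as an srt time code HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def build_srt(cues):
    """cues: iterable of (start_ms, end_ms, text), one per target audio piece."""
    blocks = [f"{i}\n{to_srt_time(a)} --> {to_srt_time(b)}\n{text}\n"
              for i, (a, b, text) in enumerate(cues, 1)]
    return "\n".join(blocks)
```

Each srt block is a running index, a `start --> end` time-code line, the text, and a blank separator line, which is what most players expect when auto-loading subtitles.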
Step 205: import the subtitle information into the media file and synchronously display the text in the subtitle information.
In this embodiment, the subtitle information is stored in the folder containing the media file; when the media file is played, the subtitle information can be imported automatically and displayed in synchronization.
In addition, to optimize the display of the subtitles, longer sentences in the subtitle information may be split across lines.
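Splitting long sentences across display lines can be as simple as wrapping at a fixed width; the 32-character default here is an arbitrary choice for illustration, not a value from the patent.

```python
import textwrap

def wrap_caption(text, width=32):
    """Break an overlong subtitle line into several display lines."""
    return "\n".join(textwrap.wrap(text, width))
```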
In this embodiment of the present invention, parsing and processing the audio of the media file yields target audio information whose end frames contain no speech audio information, which avoids splitting the speech of one sentence across two pieces of audio and improves the accuracy of subtitle generation by speech recognition; performing multithreaded speech recognition on multiple pieces of target audio information allows them to be recognized simultaneously, improving the efficiency of subtitle generation.
Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for generating subtitles for a media file. Since the principle by which this apparatus solves the problem is similar to that of the method for generating subtitles for a media file, its implementation may refer to the implementation of the method, and repeated details are not described again.
As shown in Fig. 3, an embodiment of the present invention provides an apparatus for generating subtitles for a media file, which may include:
a segmentation module 301, configured to segment the audio information of the media file to obtain multiple segments of audio information;
a processing module 302, configured to process the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information;
a speech recognition module 303, configured to identify the target audio information as corresponding text; and
a subtitle generation module 304, configured to process the text to generate the subtitle information of the media file.
The apparatus for generating subtitles for a media file in this embodiment of the present invention segments and splices the audio information in the media file to obtain target audio information whose end frames contain no speech audio information, which avoids splitting the speech of one sentence across two pieces of audio and thereby improves the accuracy of subtitle generation by speech recognition.
As shown in Fig. 4, an embodiment of the present invention further provides another apparatus for generating subtitles for a media file, which may include:
a segmentation module 401, configured to segment the audio information of the media file to obtain multiple segments of audio information.
In this embodiment, the apparatus may further include a receiving module configured to receive a user's subtitle-acquisition instruction. Specifically, when the media file is played, a button for acquiring subtitles may be displayed to the user at some position of the media file, such as, without limitation, its upper-left or upper-right corner; the user clicks this button when subtitles are needed, and after the receiving module receives the subtitle-acquisition instruction, the other modules perform the subsequent steps.
The segmentation module 401 includes a first acquisition unit 4011 and a segmentation unit 4012, wherein the first acquisition unit 4011 decodes the media file to obtain its audio information, and the segmentation unit 4012 determines a split duration according to the processing capability of the processor and segments the audio information of the media file according to the split duration to obtain multiple segments of audio information.
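The segmentation unit's fixed-duration cut can be sketched as follows. This is illustrative only: how the split duration is derived from the processor's capability is left as an input, and mono 16-bit PCM is an assumed audio layout.

```python
def split_audio(pcm, sample_rate, split_seconds, bytes_per_sample=2):
    """Cut decoded mono PCM bytes into segments of `split_seconds` each;
    the final segment may be shorter than the rest."""
    step = int(sample_rate * split_seconds) * bytes_per_sample
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```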
A processing module 402 is configured to process the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information.
As shown in Fig. 5, the processing module 402 includes a splicing unit 4021 and a judging unit 4022.
The splicing unit 4021 splices from the first segment of audio information in turn, and after each splice the judging unit 4022 judges whether the end frame of the resulting audio information contains speech audio information, until the judging unit 4022 judges that the end frame of the audio information obtained by splicing the first through n-th segments contains no speech audio information; one processing operation is then complete, and the audio information obtained by splicing the first through n-th segments is target audio information.
The splicing unit 4021 and the judging unit 4022 then restart the processing operation from the (n+1)-th segment of audio information, where n is an integer greater than 1.
In another aspect, this embodiment further provides another processing module 402, including a splicing unit 4021 and a judging unit 4022.
Starting from the first segment of audio information, the judging unit 4022 judges in turn whether each segment's end frame contains speech audio information, until the end frame of the n-th segment is judged to contain speech audio information.
The splicing unit 4021 then splices from the n-th segment onward, and after each splice the judging unit 4022 judges whether the end frame of the resulting audio contains speech audio information, until the judging unit 4022 judges that the end frame of the audio information obtained by splicing the n-th through (n+i)-th segments contains no speech audio information; one processing operation is then complete, and every earlier segment whose end frame contains no speech audio information, together with the audio information obtained by splicing the n-th through (n+i)-th segments, is target audio information.
The splicing unit 4021 and the judging unit 4022 then restart the processing operation from the (n+i+1)-th segment of audio information, where n is an integer greater than 0 and i is an integer greater than 0.
In another aspect, this embodiment further provides another processing module 402, including a splicing unit 4021 and a judging unit 4022.
The judging unit 4022 judges all the segments of audio information to obtain the segments whose end frames contain speech audio information and the segments whose end frames do not.
If the n-th through (n+i)-th segments are segments whose end frames contain speech audio information and the (n+i+1)-th segment is one whose end frame does not, the splicing unit 4021 splices the n-th through (n+i+1)-th segments; every segment whose end frame contains no speech audio information, together with the audio information obtained by splicing the n-th through (n+i+1)-th segments, is target audio information, where n is an integer greater than 0 and i is an integer greater than or equal to 0.
A speech recognition module 403 is configured to identify the target audio information as corresponding text.
In this embodiment, as shown in Fig. 6, the speech recognition module may include multiple speech recognition units configured to perform multithreaded speech recognition on the multiple pieces of target audio information to obtain the text corresponding to each piece of target audio information.
Of course, the speech recognition module in this embodiment may also be configured to: (1) perform speech recognition on the target audio information to obtain the corresponding text; or (2) send the target audio information to a cloud server and receive the text obtained by the cloud server's speech recognition.
A subtitle generation module 404 is configured to process the text to generate the subtitle information of the media file.
In this embodiment, the subtitle generation module 404 may include a second acquisition unit and a subtitle generation unit, wherein the second acquisition unit obtains the timestamp information of the target audio information, and the subtitle generation unit generates the subtitle information from the text according to the timestamp information.
A subtitle display module 405 is configured to import the subtitle information into the media file and synchronously display the text in the imported subtitle information.
In this embodiment, the subtitle information is stored in the folder containing the media file; when the media file is played, the subtitle information can be imported automatically and displayed in synchronization by the subtitle display module.
In addition, to optimize the display of the subtitles, the subtitle display module may split longer sentences in the subtitle information across lines.
The apparatus for generating subtitles for a media file in this embodiment of the present invention segments and splices the audio information in the media file to obtain target audio information whose end frames contain no speech audio information, which avoids splitting the speech of one sentence across two pieces of audio and improves the accuracy of subtitle generation by speech recognition; the speech recognition module performs multithreaded speech recognition on multiple pieces of target audio information so that they are recognized simultaneously, improving the efficiency of subtitle generation.
As shown in Fig. 7, an embodiment of the present invention further provides an electronic device, including a housing 501, a processor 502, a memory 503, a display screen (not shown), a circuit board 504, and a power supply circuit 505, wherein the circuit board 504 is disposed inside the space enclosed by the housing 501; the processor 502 and the memory 503 are disposed on the circuit board 504; the display screen is embedded on the outside of the housing 501 and connected to the circuit board 504; the power supply circuit 505 supplies power to each circuit or component of the electronic device; the memory 503 stores executable program code and data; and the processor 502, by reading the executable program code stored in the memory 503, runs a program corresponding to the executable program code so as to perform the following steps:
segmenting the audio information of the media file to obtain multiple segments of audio information;
processing the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information;
identifying the target audio information as corresponding text; and
processing the text to generate the subtitle information of the media file.
The electronic device in this embodiment of the present invention segments and splices the audio information in the media file to obtain target audio information whose end frames contain no speech audio information, which avoids splitting the speech of one sentence across two pieces of audio and improves the accuracy of subtitle generation by speech recognition.
For convenience of description, the apparatus above is described with its parts divided into modules or units by function. Of course, when implementing the present invention, the functions of the modules or units may be implemented in one or more pieces of software or hardware.
Those skilled in the art should appreciate that embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (apparatus), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Claims (10)

1. A method for generating subtitles for a media file, applied to an electronic device, characterized by comprising:
segmenting audio information of the media file to obtain multiple segments of audio information;
processing the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information;
identifying the target audio information as corresponding text; and
processing the text to generate subtitle information of the media file.
2. The method of claim 1, characterized in that segmenting the audio information of the media file to obtain multiple segments of audio information specifically comprises:
decoding the media file to obtain the audio information of the media file;
determining a split duration according to a processing capability of a processor of the electronic device; and
segmenting the audio information of the media file according to the split duration to obtain the multiple segments of audio information.
3. The method of claim 1, characterized in that processing the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information specifically comprises:
starting from a first segment of audio information, splicing the adjacent next segment in turn, and judging after each splice whether an end frame of the resulting audio information contains speech audio information, until it is judged that the end frame of the audio information obtained by splicing the first through n-th segments contains no speech audio information, whereupon one processing operation is complete and the audio information obtained by splicing the first through n-th segments is target audio information; and
restarting the processing operation from the (n+1)-th segment of audio information, wherein n is an integer greater than 1.
4. The method of claim 1, characterized in that processing the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information specifically comprises:
starting from a first segment of audio information, judging in turn whether each segment's end frame contains speech audio information, until the end frame of the n-th segment is judged to contain speech audio information; splicing from the n-th segment onward and judging after each splice whether the end frame of the resulting audio information contains speech audio information, until it is judged that the end frame of the audio information obtained by splicing the n-th through (n+i)-th segments contains no speech audio information, whereupon one processing operation is complete, and every earlier segment whose end frame contains no speech audio information, together with the audio information obtained by splicing the n-th through (n+i)-th segments, is target audio information; and
restarting the processing operation from the (n+i+1)-th segment of audio information, wherein n is an integer greater than 0 and i is an integer greater than 0.
5. The method of claim 1, characterized in that processing the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information specifically comprises:
judging the multiple segments of audio information to obtain segments whose end frames contain speech audio information and segments whose end frames do not contain speech audio information; and
if the n-th through (n+i)-th segments are segments whose end frames contain speech audio information and the (n+i+1)-th segment is a segment whose end frame does not contain speech audio information, splicing the n-th through (n+i+1)-th segments, every segment whose end frame contains no speech audio information, together with the audio information obtained by splicing the n-th through (n+i+1)-th segments, being target audio information;
wherein n is an integer greater than 0 and i is an integer greater than or equal to 0.
6. The method of claim 1, characterized in that, when multiple pieces of target audio information are obtained, identifying the target audio information as corresponding text specifically comprises:
performing, by a speech recognition module of the electronic device, multithreaded speech recognition on the multiple pieces of target audio information to obtain the text corresponding to each piece of target audio information.
7. The method of claim 1, characterized in that identifying the target audio information as corresponding text specifically comprises:
sending the target audio information to a cloud server, and receiving the text obtained by speech recognition at the cloud server.
8. The method of any one of claims 1-7, characterized in that processing the text to generate the subtitle information of the media file specifically comprises:
obtaining timestamp information of the target audio information; and
generating the subtitle information from the text according to the timestamp information.
9. The method of claim 8, characterized in that the method further comprises importing the subtitle information into the media file and synchronously displaying the text in the subtitle information.
10. An apparatus for generating subtitles for a media file, applied to an electronic device, characterized by comprising a segmentation module, a processing module, a speech recognition module, and a subtitle generation module, wherein:
the segmentation module is configured to segment audio information of the media file to obtain multiple segments of audio information;
the processing module is configured to process the multiple segments of audio information to obtain target audio information whose end frames contain no speech audio information;
the speech recognition module is configured to identify the target audio information as corresponding text; and
the subtitle generation module is configured to process the text to generate subtitle information of the media file.
CN201610683362.4A 2016-08-17 2016-08-17 Method and device for generating subtitles of media file and electronic equipment Pending CN106331844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610683362.4A CN106331844A (en) 2016-08-17 2016-08-17 Method and device for generating subtitles of media file and electronic equipment


Publications (1)

Publication Number Publication Date
CN106331844A true CN106331844A (en) 2017-01-11

Family

ID=57743164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610683362.4A Pending CN106331844A (en) 2016-08-17 2016-08-17 Method and device for generating subtitles of media file and electronic equipment

Country Status (1)

Country Link
CN (1) CN106331844A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090256959A1 (en) * 2008-04-07 2009-10-15 Sony Corporation Information presentation apparatus and information presentation method
CN101625862A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Method for detecting voice interval in automatic caption generating system
CN102024033A (en) * 2010-12-01 2011-04-20 北京邮电大学 Method for automatically detecting audio templates and chaptering videos
US20130120654A1 (en) * 2010-04-12 2013-05-16 David A. Kuspa Method and Apparatus for Generating Video Descriptions
CN103327397A (en) * 2012-03-22 2013-09-25 联想(北京)有限公司 Subtitle synchronous display method and system of media file
CN105810208A (en) * 2014-12-30 2016-07-27 富泰华工业(深圳)有限公司 Meeting recording device and method thereof for automatically generating meeting record
CN105845129A (en) * 2016-03-25 2016-08-10 乐视控股(北京)有限公司 Method and system for dividing sentences in audio and automatic caption generation method and system for video files


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019194742A1 (en) * 2018-04-04 2019-10-10 Nooggi Pte Ltd A method and system for promoting interaction during live streaming events
US11277674B2 (en) 2018-04-04 2022-03-15 Nooggi Pte Ltd Method and system for promoting interaction during live streaming events
CN113692619A (en) * 2019-05-02 2021-11-23 谷歌有限责任公司 Automatically subtitling audible portions of content on a computing device

Similar Documents

Publication Publication Date Title
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
KR101990023B1 (en) Method for chunk-unit separation rule and display automated key word to develop foreign language studying, and system thereof
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
US11190855B2 (en) Automatic generation of descriptive video service tracks
CN112437337B (en) Method, system and equipment for realizing live caption
CN108133632B (en) The training method and system of English Listening Comprehension
CN105635782A (en) Subtitle output method and device
CN106878805A (en) Mixed language subtitle file generation method and device
CN105898556A (en) Plug-in subtitle automatic synchronization method and device
CN109963092B (en) Subtitle processing method and device and terminal
CN111986655B (en) Audio content identification method, device, equipment and computer readable medium
CN103167360A (en) Method for achieving multilingual subtitle translation
CN114390220B (en) Animation video generation method and related device
CN112399269A (en) Video segmentation method, device, equipment and storage medium
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN106331844A (en) Method and device for generating subtitles of media file and electronic equipment
CN108831503B (en) Spoken language evaluation method and device
KR20220048958A (en) Method of filtering subtitles of a foreign language video and system performing the same
Hong et al. Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
CN106295592A (en) Method and device for identifying subtitles of media file and electronic equipment
CN114170856B (en) Machine-implemented hearing training method, apparatus, and readable storage medium
KR102088047B1 (en) Apparatus for studying language
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN108959163B (en) Subtitle display method for audio electronic book, electronic device and computer storage medium
CN111556372A (en) Method and device for adding subtitles to video and audio programs in real time

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170111