CN107124647A

CN107124647A - Method and device for automatically generating a subtitle file during panoramic video recording

Info

Publication number: CN107124647A
Application number: CN201710392422.1A
Authority: CN
Inventors: 陈鑫; 李晶; 陈勇
Original assignee: Shenzhen Coocaa Network Technology Co Ltd
Current assignee: Shenzhen Coocaa Network Technology Co Ltd
Priority date: 2017-05-27
Filing date: 2017-05-27
Publication date: 2017-09-01

Abstract

The invention discloses a method and device for automatically generating a subtitle file during panoramic video recording. The method comprises the steps of: acquiring, in real time, original audio data while the panoramic video is recorded; processing the original audio data to obtain secondary audio data, audio position data and audio time data; performing model matching on the secondary audio data to generate corresponding text data; and receiving the text data, audio position data and time data in real time and editing the text data in real time according to the audio position data and time data to form a subtitle file. The invention generates the subtitle file automatically while the panoramic video is being recorded, which saves labor and improves production efficiency. In addition, the subtitle file can be displayed, according to the audio position data, below the position of each role in the video, which makes the video easier for the user to watch.

Description

Method and device for automatically generating a subtitle file during panoramic video recording

Technical field

The present invention relates to the field of panoramic video recording, and in particular to a method and device for automatically generating a subtitle file during panoramic video recording.

Background art

In the prior art, screen recording and broadcasting usually requires speech to be transcribed into text by manual post-processing, after which a subtitle file must be produced by hand and its timing adjusted. When the recorded video is a panoramic video in which several roles speak, the hand-made subtitle file must additionally be adjusted manually so that the speakers can be told apart. This approach is clearly inefficient, wastes considerable labor, and is costly.

Therefore, the prior art still needs to be improved and developed.

Summary of the invention

In view of the above deficiencies of the prior art, the object of the present invention is to provide a method and device for automatically generating a subtitle file during panoramic video recording, so as to solve the problem that, in the prior art, the corresponding subtitle file must be produced and adjusted manually when a panoramic video is recorded.

The technical solution of the present invention is as follows:

A method for automatically generating a subtitle file during panoramic video recording, comprising the steps of:

acquiring, in real time, original audio data while the panoramic video is recorded;

processing the original audio data to obtain secondary audio data, audio position data and audio time data;

performing model matching on the secondary audio data to generate corresponding text data;

receiving the text data, audio position data and time data in real time, and editing the text data in real time according to the audio position data and audio time data to form a subtitle file.

In the method, the step of acquiring, in real time, original audio data while the panoramic video is recorded specifically includes:

acquiring the original audio data in real time through a six-microphone ring array mounted on the panoramic camera.

In the method, the step of processing the original audio data to obtain secondary audio data, audio position data and time data specifically includes:

performing noise suppression, dereverberation, echo cancellation, beamforming and array gain processing on the original audio data to obtain the secondary audio data and audio time data;

performing sound source localization on the original audio data to obtain the audio position data.

In the method, the step of performing model matching on the secondary audio data to generate corresponding text data specifically includes:

performing speech and semantic recognition on the secondary audio data by a DNN algorithm, and generating the recognized text data.

In the method, the step of receiving the text data, audio position data and audio time data in real time and editing the text data in real time according to the audio position data and time data to form a subtitle file specifically includes:

arranging the data, by the subtitle editing function of the coprocessor, in the format order [audio angle data] [time data] [text data] to form the subtitle file.

In the method, after the step of receiving the text data, audio position data and time data in real time and editing the text data in real time according to the audio position data and time data to form a subtitle file, the method further includes:

loading the subtitle file, according to the audio position data, below the corresponding sound-source direction in the panoramic video.

A device for automatically generating a subtitle file during panoramic video recording, comprising a six-microphone ring array, an array sound-source processor and a coprocessor electrically connected in sequence:

the six-microphone ring array is used to acquire, in real time, original audio data while the panoramic video is recorded;

the array sound-source processor is used to process the original audio data to obtain secondary audio data, audio position data and audio time data, and is also used to perform model matching on the secondary audio data to generate corresponding text data;

the coprocessor is used to receive the text data, audio position data and time data in real time, and to edit the text data in real time according to the audio position data and time data to form a subtitle file.

In the device, the six-microphone ring array consists of six acoustic sensors arranged in a ring, each of which is electrically connected to the array sound-source processor.

In the device, the array sound-source processor includes a sound source localization unit, which is used to perform sound source localization on the original audio data to obtain the audio position data.

In the device, the coprocessor further includes a loading unit, which is used to load the subtitle file, according to the audio position data, below the corresponding sound-source direction in the panoramic video.

Beneficial effects: compared with the traditional approach of producing the subtitle file by manual post-processing of the recorded video, the present invention generates the subtitle file automatically while the panoramic video is being recorded, which saves labor and improves production efficiency. Moreover, in the present invention the subtitle file can be displayed, according to the audio position data, below the position of each role in the video, which makes the video easier for the user to watch.

Brief description of the drawings

Fig. 1 is a flow chart of a preferred embodiment of the method of the present invention for automatically generating a subtitle file during panoramic video recording;

Fig. 2 is a structural schematic diagram of a preferred embodiment of the device of the present invention for automatically generating a subtitle file during panoramic video recording.

Detailed description of the embodiments

The present invention provides a method and device for automatically generating a subtitle file during panoramic video recording. To make the purpose, technical solution and effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.

Referring to Fig. 1, which is a flow chart of a preferred embodiment of the method of the present invention for automatically generating a subtitle file during panoramic video recording, the method includes the steps of:

S10: acquiring, in real time, original audio data while the panoramic video is recorded;

S20: processing the original audio data to obtain secondary audio data, audio position data and audio time data;

S30: performing model matching on the secondary audio data to generate corresponding text data;

S40: receiving the text data, audio position data and time data in real time, and editing the text data in real time according to the audio position data and time data to form a subtitle file.

Specifically, when a panoramic video is recorded in the prior art, the corresponding subtitle file must be produced and adjusted by manual post-processing; this approach is inefficient, wastes labor and is costly. To solve these problems, the present invention first processes the original audio data to obtain secondary audio data, audio position data and audio time data, then performs model matching on the secondary audio data to obtain corresponding text data, and finally edits the text data according to the audio position data and audio time data to form a subtitle file in a defined format. The present invention thus generates the subtitle file automatically during video recording, which saves labor and improves production efficiency. Moreover, in the present invention the subtitle file can be displayed, according to the audio position data, below the position of each role in the video, which makes the video easier for the user to watch.

Further, step S10 is specifically: acquiring the original audio data in real time through a six-microphone ring array mounted on the panoramic camera. Specifically, the six-microphone ring array consists of six acoustic sensors arranged in a ring. The six acoustic sensors act as six pickup beams, each covering 60°, so that speech signals can be collected over the full 360°. The six-microphone ring array also provides far-field pickup, with an effective pickup distance of up to 5 meters. By using the six-microphone ring array, the present invention can effectively collect the original audio data during panoramic recording.
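As a minimal sketch of how six 60° pickup beams can tile the full circle, the Python function below maps a sound-source azimuth to a beam index. The name `beam_index` and the convention that beam 0 starts at 0° are assumptions made for illustration; the patent does not say how the beams are numbered or oriented.

```python
BEAM_COUNT = 6                          # six acoustic sensors, one pickup beam each
BEAM_WIDTH_DEG = 360 / BEAM_COUNT       # 60 degrees per beam


def beam_index(azimuth_deg):
    """Return the index (0..5) of the beam covering the given azimuth.

    Assumption: beam k covers [k * 60, (k + 1) * 60) degrees.
    """
    wrapped = azimuth_deg % 360         # fold any angle into [0, 360)
    return int(wrapped // BEAM_WIDTH_DEG)
```

For instance, an azimuth of 125° falls in beam 2 under this convention, and negative angles wrap around to the last beam.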

Preferably, step S20, processing the original audio data to obtain secondary audio data, audio position data and time data, specifically includes:

S21: performing noise suppression, dereverberation, echo cancellation, beamforming and array gain processing on the original audio data to obtain the secondary audio data and audio time data;

Specifically, interfering sounds such as noise, reverberation and echo are usually present while a panoramic video is recorded, and they can seriously degrade recording quality. Recording quality must therefore be ensured in order to generate an accurate subtitle file. The present invention removes each of these interfering sounds in the array sound-source processor. Noise here generally means ambient noise, such as air-conditioner noise. This kind of noise usually has no spatial directivity and is not particularly strong, so it does not mask normal speech; it only reduces the clarity and intelligibility of the speech.

The principle of noise suppression is to perform spectral analysis on the digital signal sampled in real time, from which the intensity and spectral distribution of the background noise can be determined, and then to design a filter according to this model. When someone speaks, the signal is analyzed at the same time, so that ANC can determine the spectrum of the speaker. Based on the spectra of the background noise and of the speaker, the filter changes in real time according to the contrast between the two signals: it passes the speaker's spectrum and suppresses the spectrum of the ambient noise, reducing its energy. A reduction of 15 to 20 decibels, for example, already gives a clearly perceptible noise-suppression effect.
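The filter described above can be illustrated with a per-frequency-bin gain rule. The sketch below assumes a power-spectral-subtraction gain, which is one common way of realizing such a filter and is not named in the patent; the function name and parameters are hypothetical. The attenuation limit reflects the 15 to 20 dB reduction mentioned in the text.

```python
def suppression_gains(noisy_power, noise_power, max_attenuation_db=20.0):
    """Compute one amplitude gain per frequency bin.

    noisy_power: power spectrum of the current frame (speech plus noise).
    noise_power: estimated background-noise power spectrum.
    Assumption: spectral subtraction, gain = sqrt(max(1 - N/P, 0)),
    limited so that no bin is attenuated by more than max_attenuation_db.
    """
    floor = 10 ** (-max_attenuation_db / 20.0)   # e.g. 20 dB -> 0.1
    gains = []
    for p, n in zip(noisy_power, noise_power):
        if p <= 0.0:
            gains.append(floor)                  # empty bin: apply the floor
            continue
        gain = max(1.0 - n / p, 0.0) ** 0.5      # pass bins dominated by speech
        gains.append(max(gain, floor))           # cap the suppression depth
    return gains
```

Bins where the frame is well above the noise estimate keep a gain near 1, while bins that contain only noise are reduced to the floor rather than zeroed, which limits audible artifacts.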

Echo and reverberation are likewise removed by filters. After a sound source stops emitting sound, the sound waves are reflected and absorbed many times in the room, so that a mixture of several sound waves seems to persist for some time; this phenomenon is called reverberation. Reverberation can seriously affect speech signal processing, for example by degrading the cross-correlation function or the beam main lobe and thus reducing direction-finding accuracy. In many sound-capture situations, particularly when the sound source is far from the microphone, the audio signal collected by the microphone often contains strong reverberation, which seriously affects the clarity and intelligibility of speech and also degrades the performance of downstream audio processing systems (such as speech recognition systems). To improve audio quality in this case, reverberation suppression and cancellation techniques must be used. The present invention uses a microphone signal partitioner to decompose the microphone signal into one or more parts, uses a reverberation energy estimator to estimate the reverberant energy of some of the blocks, and finally performs speech processing using the estimated reverberation energy to obtain the dereverberated speech.

Echo is an extension of the concept of reverberation; the difference between the two is that echo has a longer time delay. In general, a reverberation delayed by more than 100 milliseconds can be clearly distinguished by humans: one sound appears to occur twice, and this is called an echo, as at the famous Echo Wall of the Temple of Heaven. Here, however, echo actually refers to the sound emitted by the voice-interaction device itself. Take the Echo speaker as an example: if the user calls "Alexa" while a song is playing, the microphone array captures both the music being played and the user's call of "Alexa", and speech recognition clearly cannot recognize these two sounds mixed together. Echo cancellation aims to remove the music and keep only the user's voice. The principle of echo cancellation is based on the correlation between the speech signal and the multipath echo it produces: a speech model of the far-end signal is established and used to estimate the echo, and the filter coefficients are adjusted continuously so that the estimate approaches the real echo more closely. The echo estimate is then subtracted from the microphone input signal, thereby cancelling the echo.
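The estimate-and-subtract loop described above can be sketched with an adaptive filter. The example below assumes a normalized LMS (NLMS) update, a standard choice for echo cancellation that the patent itself does not name; `nlms_echo_cancel`, `taps`, `mu` and `eps` are hypothetical names and parameters.

```python
def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from mic.

    far_end: samples played by the device's own loudspeaker.
    mic:     samples captured by the microphone (echo plus near-end sound).
    Returns (residual, weights): the echo-reduced signal and the final
    filter coefficients. Assumption: NLMS adaptation with step size mu.
    """
    weights = [0.0] * taps
    history = [0.0] * taps                   # most recent far-end samples
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # microphone minus estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

When the microphone signal contains only a delayed, scaled copy of the far-end signal, the coefficients converge toward that echo path and the residual decays toward zero; any near-end speech would remain in the residual.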

Beamforming is a general signal-processing method. In the present invention it refers to processing the output signals of the microphones in an array of a given geometry (for example by weighting, delaying and summing) so as to form spatial directivity. Beamforming mainly suppresses sound interference outside the main lobe, which also includes human voices: for example, when several people are talking around an Echo, the Echo recognizes the voice of only one of them.
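The weighting, delaying and summing mentioned above correspond to a delay-and-sum beamformer. The following is a minimal sketch under simplifying assumptions that are not in the patent: delays are whole numbers of samples, they are known in advance for the chosen look direction, and the function name is hypothetical.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each microphone channel on the look direction and sum.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the wanted source at each microphone,
              in whole samples (assumed known).
    weights:  optional per-channel weights; defaults to a plain average.
    """
    length = len(channels[0])
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)
    output = [0.0] * length
    for channel, delay, weight in zip(channels, delays, weights):
        for i in range(length):
            j = i + delay                    # undo this channel's arrival delay
            if 0 <= j < length:
                output[i] += weight * channel[j]
    return output
```

Sound arriving from the look direction adds coherently after alignment, while sound from other directions stays misaligned across channels and is attenuated by the averaging.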

Further, the present invention addresses the pickup-distance problem through array gain. If the signal is weak, speech recognition likewise cannot be guaranteed; array processing can appropriately increase the energy of the speech signal, making it easier to pick up distant speech.

The above processing yields the denoised secondary audio data together with the audio time data.

S22: performing sound source localization on the original audio data to obtain the audio position data.

Specifically, sound-source direction finding can be based on energy methods or on spectral estimation, and arrays also commonly use the TDOA (time difference of arrival) technique. Direction finding is generally performed in the voice wake-up stage; VAD technology can in fact also be counted in this category and is a key research topic for future power reduction. Localization accuracy of roughly ±15 degrees can be achieved. For example, the present invention may use a sound localization method based on acoustic energy: the acoustic sensor array records the arrival time of the sound at each node, and the sound source coordinates are found with a TDOA algorithm; the sound energy value at each node is recorded, and the sound attenuation coefficient is calculated from the acoustic energy attenuation model, the sound source coordinates and the node coordinates; the attenuation coefficient is substituted into the sound energy attenuation model; and the sound energy value at each node at a given moment is then used to calculate the sound source coordinates, that is, the audio position data.
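The TDOA part of this procedure can be sketched for a single microphone pair. The example below is an illustration only: it assumes a far-field source, estimates the delay by brute-force cross-correlation, and does not cover the energy-attenuation model described in the text. The function names, the default speed of sound and the parameters are assumptions.

```python
import math


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means the sound reached `other` later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    length = len(ref)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(length):
            j = i + lag
            if 0 <= j < length:
                score += ref[i] * other[j]   # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples, sample_rate, mic_spacing_m,
                      speed_of_sound=343.0):
    """Convert a pair delay to an angle from broadside (far-field model)."""
    ratio = delay_samples / sample_rate * speed_of_sound / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))       # guard against rounding overshoot
    return math.degrees(math.asin(ratio))
```

With six microphones, several pair-wise estimates of this kind could be combined to resolve the direction over the full circle; how the patent's processor does this is not specified.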

Further, in the present invention, step S30, performing model matching on the secondary audio data to generate corresponding text data, specifically includes:

The present invention processes speech in real time using two sets of iFlytek speech algorithms, one embedded in hardware and the other serving cloud-based speech processing; with these two sets of speech algorithms, the recognized raw text data is finally obtained. Preferably, an XFS3031CNP Chinese speech synthesis chip is used to recognize the secondary audio data and generate the corresponding text data. The XFS3031CNP chip has strong polyphonic-character and Chinese-surname handling, supports text in four encodings (GB2312, GBK, BIG5 and UNICODE), supports a variety of text control marks, and has intelligent text analysis and processing algorithms.

Specifically, abnormal speech detection is performed automatically on the speech data according to the speech recognition result: abnormal speech in the speech data is detected, the part of the recognized text that corresponds to the abnormal speech is marked, and the marked text is provided to the user. This alerts the user and reduces the risk of being misled by abnormally recognized text. Because both the detection of abnormal speech and the marking of the corresponding recognized text are completed automatically by the system, efficiency and accuracy improve significantly when the amount of data is large. In practice, abnormal speech detection can be based on state posterior probabilities, that is, the probability that each frame of the speech being processed belongs to each state. The state posterior probabilities of each frame of speech data can be obtained from the DNN (Deep Neural Network) model built for recognition.
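One simple way to turn per-frame state posteriors into abnormal-speech marks is to flag frames where no state is confidently recognized. The sketch below is an assumption for illustration: the patent does not give the decision rule, and the threshold, the minimum run length and both function names are hypothetical.

```python
def flag_abnormal_frames(posteriors, threshold=0.5):
    """Flag frames whose best state posterior is below the threshold.

    posteriors: one list of state probabilities per frame, as produced
    by some acoustic model (not implemented here).
    """
    return [max(frame) < threshold for frame in posteriors]


def abnormal_runs(flags, min_frames=2):
    """Group flagged frames into (start, end) runs, end exclusive.

    Runs shorter than min_frames are ignored as isolated glitches.
    """
    runs, start = [], None
    for i, flagged in enumerate(list(flags) + [False]):   # sentinel closes a run
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_frames:
                runs.append((start, i))
            start = None
    return runs
```

The resulting frame ranges could then be mapped to the words recognized in those ranges so that the corresponding text is marked for the user.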

Further, in the present invention, in step S40, the text data, audio position data and time data are received in real time, and the text data is edited in real time according to the audio position data and time data to form a subtitle file.

Specifically, the subtitle file is edited by the coprocessor. The coprocessor receives the text data, audio position data and audio time data from the array sound-source processor in real time and, using its subtitle editing function, arranges them in the format order [audio angle data] [time data] [text data] to form the subtitle file.
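The [audio angle data] [time data] [text data] ordering can be illustrated with a small formatter. Only the field order comes from the text; the concrete encoding below (whole degrees, an `HH:MM:SS.mmm` timestamp, one cue per line) and the names `Cue`, `format_cue` and `build_subtitle_file` are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    angle_deg: float   # audio position data: direction of the speaker
    start_s: float     # audio time data: seconds since recording began
    text: str          # recognized text data


def format_timestamp(seconds):
    """Render seconds as HH:MM:SS.mmm (an assumed time format)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def format_cue(cue):
    """Arrange one cue as [angle][time][text]."""
    return f"[{cue.angle_deg:.0f}][{format_timestamp(cue.start_s)}][{cue.text}]"


def build_subtitle_file(cues):
    """Return the subtitle file content, one cue per line, in time order."""
    ordered = sorted(cues, key=lambda cue: cue.start_s)
    return "\n".join(format_cue(cue) for cue in ordered) + "\n"
```

Under these assumptions, a cue spoken at 120° and 3.5 seconds into the recording would be written as `[120][00:00:03.500][Hello]`.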

Further, after video recording ends, the subtitle file and the recorded video are stored in the same directory, and the subtitle file is loaded, according to the audio position data, below the corresponding sound-source direction in the panoramic video, which makes the video easier for the user to watch.

Preferably, during panoramic video (VR) recording, the six-microphone ring array and the array sound-source processor are switched on at the same time to capture sound and generate text, and the coprocessor then generates a subtitle file carrying position and time information. For example, in a court hearing where video evidence must be preserved, a panoramic video is recorded. Because the scene involves several roles (presiding judge, counsel and defendant) located in different positions, the speech and text of each role must be preserved during recording. With the method of the invention, the subtitle file is generated automatically during recording, and during playback the text converted from each role's speech is displayed below that role's position.

Further, when the recorded panoramic video is played back, if playback is switched to 2D mode, the subtitles are loaded automatically at the bottom of the picture and the sound direction is not used; when the video is played as a panoramic video, the position information is used to generate the subtitles and display them at the speaker's position.
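The two playback cases can be sketched as a placement rule. The mode strings, the returned dictionary and the function name below are hypothetical and serve only to illustrate the distinction drawn in the text.

```python
def subtitle_placement(mode, source_azimuth_deg):
    """Decide where a subtitle cue is drawn during playback.

    mode: "2d" or "panoramic" (assumed labels for the two playback modes).
    source_azimuth_deg: audio position data stored with the cue.
    """
    if mode == "2d":
        # Flat playback: fixed position, sound direction ignored.
        return {"anchor": "bottom-center", "azimuth_deg": None}
    if mode == "panoramic":
        # Panoramic playback: place the cue below the speaker's direction.
        return {"anchor": "below-speaker",
                "azimuth_deg": source_azimuth_deg % 360}
    raise ValueError(f"unknown playback mode: {mode!r}")
```

A player would call this once per cue and render the text at the returned anchor.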

Based on the above method, the present invention also provides a device for automatically generating a subtitle file during panoramic video recording. As shown in Fig. 2, the device includes a six-microphone ring array 10, an array sound-source processor 20 and a coprocessor 30 electrically connected in sequence. The six-microphone ring array 10 consists of six acoustic sensors arranged in a ring, each of which is electrically connected to the array sound-source processor 20;

The six-microphone ring array 10 is used to acquire, in real time, original audio data while the panoramic video is recorded;

The array sound-source processor 20 is used to process the original audio data to obtain secondary audio data, audio position data and audio time data, and is also used to perform model matching on the secondary audio data to generate corresponding text data;

The coprocessor 30 is used to receive the text data, audio position data and time data in real time, and to edit the text data in real time according to the audio position data and time data to form a subtitle file.

The technical details of each processor in the above device and the specific instructions executed by each unit have been described in detail in the method above and are not repeated here.

In summary, in the method and device provided by the present invention for automatically generating a subtitle file during panoramic video recording, the six-microphone ring array and the array sound-source processor are first switched on to capture sound and generate text, and the coprocessor then generates a subtitle file carrying position and time information. Compared with the traditional approach of producing the subtitle file by manual post-processing of the recorded video, the present invention generates the subtitle file automatically while the panoramic video is being recorded, which saves labor and improves production efficiency. Moreover, in the present invention the subtitle file can be displayed, according to the audio position data, below the position of each role in the video, which makes the video easier for the user to watch.

It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art may make improvements or variations based on the above description, and all such improvements and variations shall fall within the scope of protection of the appended claims.

Claims

1. A method for automatically generating a subtitle file during panoramic video recording, characterized by comprising the steps of:

receiving the text data, audio position data and time data in real time, and editing the text data in real time according to the audio position data and time data to form a subtitle file.

2. The method for automatically generating a subtitle file during panoramic video recording according to claim 1, characterized in that the step of acquiring, in real time, original audio data while the panoramic video is recorded specifically includes:

3. The method for automatically generating a subtitle file during panoramic video recording according to claim 1, characterized in that the step of processing the original audio data to obtain secondary audio data, audio position data and time data specifically includes:

4. The method for automatically generating a subtitle file during panoramic video recording according to claim 1, characterized in that the step of performing model matching on the secondary audio data to generate corresponding text data specifically includes:

5. The method for automatically generating a subtitle file during panoramic video recording according to claim 1, characterized in that the step of receiving the text data, audio position data and time data in real time and editing the text data in real time according to the audio position data and time data to form a subtitle file specifically includes:

6. The method for automatically generating a subtitle file during panoramic video recording according to claim 1, characterized in that, after the step of receiving the text data, audio position data and time data in real time and editing the text data in real time according to the audio position data and time data to form a subtitle file, the method further includes:

7. A device for automatically generating a subtitle file during panoramic video recording, characterized by comprising a six-microphone ring array, an array sound-source processor and a coprocessor electrically connected in sequence:

8. The device for automatically generating a subtitle file during panoramic video recording according to claim 7, characterized in that the six-microphone ring array consists of six acoustic sensors arranged in a ring, each of which is electrically connected to the array sound-source processor.

9. The device for automatically generating a subtitle file during panoramic video recording according to claim 7, characterized in that the array sound-source processor includes a sound source localization unit, which is used to perform sound source localization on the original audio data to obtain the audio position data.

10. The device for automatically generating a subtitle file during panoramic video recording according to claim 9, characterized in that the coprocessor further includes a loading unit, which is used to load the subtitle file, according to the audio position data, below the corresponding sound-source direction in the panoramic video.