CN108597497A - A kind of accurate synchronization system of subtitle language and method, information data processing terminal - Google Patents
A kind of accurate synchronization system of subtitle language and method, information data processing terminal Download PDFInfo
- Publication number
- CN108597497A CN108597497A CN201810289373.3A CN201810289373A CN108597497A CN 108597497 A CN108597497 A CN 108597497A CN 201810289373 A CN201810289373 A CN 201810289373A CN 108597497 A CN108597497 A CN 108597497A
- Authority
- CN
- China
- Prior art keywords
- language
- translation
- analysis
- voice
- original text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 58
- 238000012545 processing Methods 0.000 title claims abstract description 10
- 238000013519 translation Methods 0.000 claims abstract description 67
- 238000004458 analytical method Methods 0.000 claims abstract description 41
- 238000005516 engineering process Methods 0.000 claims abstract description 36
- 238000006243 chemical reaction Methods 0.000 claims abstract description 14
- 238000001914 filtration Methods 0.000 claims abstract description 6
- 230000001360 synchronised effect Effects 0.000 claims description 19
- 238000012549 training Methods 0.000 claims description 12
- 230000011218 segmentation Effects 0.000 claims description 9
- 238000010276 construction Methods 0.000 claims description 7
- 238000005520 cutting process Methods 0.000 claims description 7
- 230000003044 adaptive effect Effects 0.000 claims description 6
- 230000003068 static effect Effects 0.000 claims description 6
- 238000004590 computer program Methods 0.000 claims description 5
- 238000003860 storage Methods 0.000 claims description 5
- 238000010219 correlation analysis Methods 0.000 claims description 4
- 230000014616 translation Effects 0.000 description 52
- 238000001514 detection method Methods 0.000 description 10
- 238000001228 spectrum Methods 0.000 description 8
- 238000009432 framing Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 6
- 230000008901 benefit Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 5
- 230000000694 effects Effects 0.000 description 5
- 238000005457 optimization Methods 0.000 description 5
- 206010011224 Cough Diseases 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 230000018109 developmental process Effects 0.000 description 3
- 238000000605 extraction Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 239000000686 essence Substances 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 208000001491 myopia Diseases 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 239000007787 solid Substances 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- UPLPHRJJTCUQAY-WIRWPRASSA-N 2,3-thioepoxy madol Chemical compound C([C@@H]1CC2)[C@@H]3S[C@@H]3C[C@]1(C)[C@@H]1[C@@H]2[C@@H]2CC[C@](C)(O)[C@@]2(C)CC1 UPLPHRJJTCUQAY-WIRWPRASSA-N 0.000 description 1
- 241001672694 Citrus reticulata Species 0.000 description 1
- PEDCQBHIVMGVHV-UHFFFAOYSA-N Glycerine Chemical compound OCC(O)CO PEDCQBHIVMGVHV-UHFFFAOYSA-N 0.000 description 1
- 241000252794 Sphinx Species 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 150000001875 compounds Chemical class 0.000 description 1
- 239000012141 concentrate Substances 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 206010016256 fatigue Diseases 0.000 description 1
- 230000036039 immunity Effects 0.000 description 1
- 238000011068 loading method Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000002156 mixing Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000033764 rhythmic process Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000005303 weighing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to computer software technical field, discloses a kind of accurate synchronization system of subtitle language and method, information data processing terminal, machine recognition module application multiple technologies improve noiseproof feature, ambient noise is eliminated using twice of Wiener filtering technology;Rubbish voice is removed using the method for Gaussian modeling, accuracy 95.83% is identified using the sound of GMM pairs of 36 kinds of natural environments;Voice starting point is effectively detected using harmonic detecting technique, improves 100% in prior art basis compared to conventional speech recognition methods recognition speed, while recognition accuracy is effectively promoted, and 2 times or more is reached.The present invention independently comes source language analysis, the conversion of original text translation and translation generation, establishes independent analysis and independently generates system.In such a system, the characteristics of the characteristics of not considering to translate language when analyzing primitive, generation does not consider primitive when translating language yet, the difference that primitive translates language are converted by original text translation to solve.
Description
Technical field
The invention belongs to computer software technical field more particularly to a kind of accurate synchronization system of subtitle language and method,
Information data processing terminal.
Background technology
Currently, the prior art commonly used in the trade is such:Such as phonetic dialing system of numerous areas in social life
System, bank's inquiry system, the system that orders tickets by telephone, information retrieval and translation system, education activities etc., there is speech recognition technology
Using, these are established has been above 98% in isolated word or the signer-independent sign language recognition system of small vocabulary, accuracy of identification,
It is widely recognized as by people.However as the development of Internet technology, video becomes a big flow of network, and regarding in recent years
Frequency live streaming " wreaking havoc " global network, more and more people pay close attention to network direct broadcasting, watch all kinds of races, grave news, all kinds of online
The demand straight line of news conference increases.And globalization process accelerates, live streaming is trend of the times to across the language viewing network of people online,
Foreign countries' network direct broadcasting translations such as NBA ball matches, European football cup, apple products news conference are urgently to be resolved hurrily.Current large vocabulary connects
Continuous speech recognition system is still unsatisfactory for practicability, popularity demand, especially similar to the big video such as TV, film, Living report
Flow field.The main reason for causing this phenomenon is the technical bottleneck of speech recognition.Speech recognition is primarily present following several
A problem:(1) continuous speech decomposition must be that phoneme or sound mother etc. are single by the first step of phonetic segmentation, speech recognition
Position, then needs to establish a rule, for understanding semanteme.(2) voice has ambiguity, and the polyphone in Chinese is ambiguity
One side, be on the other hand in English and Chinese, speaker's somewhat different word in speech may sound phase
As.(3) context-sensitive image, English word, Chinese character by words are influenced by context, and characteristics of speech sounds is in stress, tone, sound
Amount and the rate of articulation etc. can change.(4) influence of noise, environment are leading to speech recognition just when weighing noise and serious interference
True rate declines.Currently, be still the interpretive scheme of mainstream in video flow field human translation, however human translation is not only in work
Make to have a greatly reduced quality in efficiency, meanwhile, with the rapid soaring of country's human cost, human translation is also increasingly that many profits are looked forward to
Industry is tired out.It is therefore, a that accurately subtitle generation product can solve the above demand in real time.
In conclusion problem of the existing technology is:The working efficiency of human translation is low, with high costs.
Solve the difficulty and meaning of above-mentioned technical problem:In speech recognition, first have to according to corresponding algorithm to original
The voice signal and non-speech audio of voice carry out cutting, then carry out speech recognition for certain characteristic parameters of voice signal,
The pretreatment work of speech recognition technology includes the cutting of the selection and voice to voice recognition unit.Due to different language knot
The difference of structure, the selection for voice recognition unit is distinguishing, for example the sound rhythm parent structure of Chinese and English do not have this
Kind structure.
For Mandarin speech recognition, and word, syllable, sound mother may be selected as voice recognition unit, the primitive of selection is got over
Small, the flexibility of identification is higher, but stability reduces, and vice versa.In addition, Chinese structure is complicated, there are 1312 tonal sounds
Section, 432 syllables for not considering tone, 22 initial consonants, 38 simple or compound vowel of a Chinese syllable, the huge Chinese scale of construction and its labyrinth are that voice is known
The difficult point that other technology is captured.However, the breakthrough of this technology also by for video flow field supplier's main body from top to bottom and
Main body of consumption provides unprecedented convenient service, effectively improves the economic benefit in the field.
Invention content
The subtitle language is precisely synchronous without two big key technology of non-speech recognition and caption translating, in the 21st century, with
The implementation of computer network so that the development of speech recognition technology is more in congenial company or do congenial work, and also day is new for many representations, algorithm
The moon is different so that the exploitation of speech recognition system has derived more polynary combination.Traditional speech recognition thinking is in statistics voice
It on the basis of identification, is modeled using statistical model, in recent years, many decoding strategies and various decoding functions are applied to
In decoder, convenient door is opened for emerging audio recognition method.Meanwhile caption translating technology equally grows with each passing hour, companion
With the development of big data, multilingual sample database obtains facility, and semantic analysis constantly updates upgrading, faster more accurately translation
Algorithm so that subtitle language precisely synchronizes.
In view of the problems of the existing technology, the present invention provides a kind of accurate synchronization system of subtitle language and method, letters
Cease data processing terminal.
The invention is realized in this way
Another object of the present invention is to provide a kind of computer programs for realizing the accurate synchronous method of the subtitle language.
At a kind of information data for realizing the accurate synchronous method of the subtitle language
Manage terminal.
Another object of the present invention is to provide a kind of computer readable storage mediums, including instruction, when it is in computer
When upper operation so that computer executes the accurate synchronous method of subtitle language.
Advantages of the present invention and good effect are:Machine recognition module application multiple technologies improve noiseproof feature, using two
Time Wiener filtering technology eliminate ambient noise;Rubbish voice (the tinkle of bells, laugh, cough are removed using the method for Gaussian modeling
Cough the artificial persons such as sound voice), the accuracy being identified using the sound of GMM pairs of 36 kinds of natural environments is up to 95.83%;It adopts
Voice starting point is effectively detected with harmonic detecting technique, does not need the priori of noise, takes full advantage of voice in frequency domain
With the correlation of time-domain, it is adapted to various non-stationary Complex Noises.Experiment shows that this algorithm overcomes traditional voice endpoint
Detection also can using short-sighted energy, fundamental frequency, zero-crossing rate etc. as detection feature poor disadvantage of robustness under low signal-to-noise ratio environment
Overcome sub-belt energy not high disadvantage of performance under nonstationary noise and single-frequency noise environment.In training data and Acoustic Modeling etc.
Aspect has also all fully considered the interference of natural environment noise, and using the Training strategy of many condition, can significantly improve pair
In the robustness of noise.Using the Cross-word static state search space construction method based on WFST, effectively single pass is integrated each
The knowledge sources such as acoustic model, acoustical context, pronunciation dictionary, language model are statically compiled to state network by kind knowledge source.It is logical
Sufficient bitonic merging algorithm optimization network is crossed, search cyberspace has significantly been simplified.Double tune MERGING/SORTING ALGORITHMs can be straight
It is connected on memory space to be sorted and carries out data exchange, effectively save memory overhead.It is comparable in discrimination, than
Fast 4 times of WFST Open-Source Tools packet decoding speed or more.Using language model adaptation optimisation technique, based on real network service textual data
According to language model adaptation optimization training is carried out, it is adapted to the voice recognition tasks in different business field.
The present invention obtains effective voice segments by adding window framing technology, obtains voice signal that is continuous, stablizing, reduces and know
Other error;It effectively can enhance voice signal based on previous work, distinguish that the ability of non-useful voice signal excludes
Noise jamming reduces error, can improve 30% to speech recognition accuracy;It can relatively accurately detect in high impulse noise
The starting point of voice segments in environment, can effectively solve the problem that ambient noise problem, speech recognition accuracy is made to be increased to 95%.
The present invention eliminates ambient noise using twice of Wiener filtering technology;Rubbish is removed using the method for Gaussian modeling
Rubbish voice;Voice starting point is effectively detected using harmonic detecting technique;It is searched for using the Cross-word static state based on WFST empty
Between construction method, WFST has then been incorporated into Acoustic treatment feature used in field of speech recognition, by acoustic model, acoustical context,
Pronunciation dictionary and language model etc. are combined closely by WFST, and are scanned on WFST;Merger is adjusted to calculate by adequately double
Method optimizes network;It is adaptive using acoustic model is carried out based on the distinctive training criterion for minimizing sentence error rate;Support 32
The real-time identification of road audio data stream.
The present invention is using the voice in speech recognition technology identification Internet video, more compared to conventional speech recognition methods effect
Outstanding, recognition speed improves 100% in prior art basis, while recognition accuracy is effectively promoted, and reaches 2 times
More than.The present invention independently comes source language analysis, the conversion of original text translation and translation generation, establishes independent analysis and independently generates
System.In such a system, the characteristics of the characteristics of not considering to translate language when analyzing primitive, generation does not consider primitive when translating language yet,
The difference that primitive translates language is converted by original text translation to solve.
Description of the drawings
Fig. 1 is the accurate synchronous method flow chart of subtitle language provided in an embodiment of the present invention.
Fig. 2 is the accurate synchronous system architecture schematic diagram of subtitle language provided in an embodiment of the present invention;
In figure:1, machine recognition module;2, machine translation module;3, live streaming media module.
Fig. 3 is machine recognition module flow diagram provided in an embodiment of the present invention.
Fig. 4 is machine translation module flow diagram provided in an embodiment of the present invention.
Fig. 5 is the accurate synchronous method implementation flow chart of subtitle language provided in an embodiment of the present invention.
Fig. 6 is adding window sub-frame processing schematic diagram provided in an embodiment of the present invention.
Fig. 7 is that voice provided in an embodiment of the present invention divides completion schematic diagram automatically.
Specific implementation mode
In order to make the purpose , technical scheme and advantage of the present invention be clearer, with reference to embodiments, to the present invention
It is further elaborated.It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention, it is not used to
Limit the present invention.
The present invention provides a kind of subtitle language essence quasi synchronous system, realizes that the real-time subtitle of network direct broadcasting generates effect,
There is provided multilingual translation on line function simultaneously.
As shown in Figure 1, the accurate synchronous method of subtitle language provided in an embodiment of the present invention includes the following steps:
S101:After newest phonetic segmentation technology cutting time shaft, machine recognition is carried out by speech analysis system,
Machine recognition technology can quickly identify voice, and have the adaptability of different accents, can be in relatively noisy noise circumstance
Under, identify accurate voice data;
S102:On the basis of machine recognition goes out voice, intelligent translation is carried out to original text, is divided into source language analysis, original text is translated
Text conversion and translation generate 3 stages;
S103:Audio and video are transmitted in a network in a streaming manner with streaming media server, the network after opposite download
For broadcasting form, live streaming media is exactly to be placed on network server after continuous audio/video information is compressed, under user side
Side viewing is carried, without waiting for entire file download to finish.
As shown in Fig. 2, the accurate synchronization system of subtitle language provided in an embodiment of the present invention includes:
Machine recognition module 1 is carried out after newest phonetic segmentation technology cutting time shaft by speech analysis system
Machine recognition.Mechanical recognition system generally includes mainly following module:Obtain feature vector, acoustic model, language model,
Decoder.If Fig. 3 shows, wherein O and W are respectively the observation feature vector of training sentence and corresponding word sequence;P (O/W) is sound
Model probability is learned, indicates the matching degree of speech acoustics feature and word sequence W, when P (W) P (W/O) reaches maximum value, word order
Arrange outputs of the W* as speech recognition.
1) multiple technologies are applied to improve noiseproof feature
Ambient noise is eliminated using twice of Wiener filtering technology.
Using the method removal rubbish voice (artificial persons such as the tinkle of bells, laugh, cough voice) of Gaussian modeling.This
It is a kind of recognition methods of the natural environment sound based on gauss hybrid models (GMM).By the extraction frame for changing phonetic feature
The mixing exponent number of number and model improves the rate of identification, extracts Mel frequency cepstral coefficients (MFCCs) to analyze voice signal;
Gauss hybrid models are established based on MFCC feature sets using expectation-maximization algorithm for each sound;Sentenced using minimal error rate
Certainly rule and the method for ballot ruling are identified.The accuracy being identified using the sound of GMM pairs of 36 kinds of natural environments can
Up to 95.83%.
Voice starting point is effectively detected using harmonic detecting technique.The harmonic components of voice signal are one of human articulation
One apparent feature of basic characteristics and voice signal and non-speech audio, harmonic detecting technique are the harmonic waves voice
Sound end detecting method of the energy as detection feature.The method does not need the priori of noise, takes full advantage of voice
In the correlation of frequency domain and time-domain, it is adapted to various non-stationary Complex Noises.Experiment shows that this algorithm overcomes tradition
Speech terminals detection is poor using short-sighted energy, fundamental frequency, zero-crossing rate etc. as detection feature robustness under low signal-to-noise ratio environment
Disadvantage can also overcome sub-belt energy not high disadvantage of performance under nonstationary noise and single-frequency noise environment.
From the existing Basic Research Results such as the Auditory Perception of people and Mechanism of Speech Production, analysis extraction has noise immunity, mirror
Other property, complementary characteristic parameter.Natural environment noise has also all been fully considered in training data and Acoustic Modeling etc.
Interference, and using the Training strategy of many condition, the robustness for noise can be significantly improved.
2) it is based on WFST static state search space construction method and improves recognition efficiency
Using the Cross-word static state search space construction method based on WFST, effectively single pass integrates various knowledge
The knowledge sources such as acoustic model, acoustical context, pronunciation dictionary, language model are statically compiled to state network by source.Decoder is
One of core of speech recognition system, in recent years, many decoding strategies and various decoding functions are applied in decoder, example
Such as the Hvite decoding tools of HTK (Hidden Markow Model Toolkit), Sphinx decoders, TODE decoders etc..
These decoders have in common that the form application in the phonic knowledges such as represented acoustics, voice, dictionary source is in a decoder
It is very stiff so that modification later operation is very cumbersome, and introduces a kind of novel knowledge source in a decoder and will become arduousness
Task.WFST (Weighted Finite-state Transducer) is a kind of more flexible decoder architecture, reason
Thought is the grammar construct and characteristic come simulation language with WFST models.Concrete operation method is to lead WFST used in speech recognition
Domain has then incorporated Acoustic treatment feature, and acoustic model, acoustical context, pronunciation dictionary and language model etc. is close by WFST
In conjunction with, and scanned on WFST.
Optimize network by sufficient bitonic merging algorithm, significantly simplifies search cyberspace.Most sequence
Method is required for opening up that new memory headroom carrys out memory sequencing intermediate steps as a result, for example common quicksort, radix row
Bucket sort etc. in sequence and parallel sorting algorithm.Double tune MERGING/SORTING ALGORITHMs can be directly in memory space to be sorted into line number
According to exchange, memory overhead is effectively saved.Double tune MERGING/SORTING ALGORITHMs are a kind of sort algorithms based on OpenCL, utilize algorithm
Middle data locality feature reduces the number that CPU is synchronized in program, is calculated using vector to promote ALU utilization rates, and to read-write
Memory access optimizes.Program, which can be run on, to be supported on OpenCL and the video card and processor of vector calculating.It is total that PCI-E is not counted
Line data transmission period, 223The performance of the integer sorting of scale can reach 0.276GB/s on HD6870 video cards.In discrimination phase
In the case of, than fast 4 times of WFST Open-Source Tools packet decoding speed or more.
3) special training speech model improves adaptability
It trains criterion progress acoustic model adaptive using based on the distinctive for minimizing sentence error rate, can be directed to specific
The real network users accent data in area carry out adaptive training optimization, to adapt to user's accent of different regions;Using language mould
Type adaptive optimization technology carries out language model adaptation optimization training based on real network service text data, is adapted to difference
The voice recognition tasks of business scope.
4) the real-time identification of 32 road audio data streams can be supported
1 times can be reached in ordinary desktop computer in real time;
Concurrency:For common server (the bis- cpu of Intel Xeon E5, per eight cores of cpu), 32 road voice numbers can be supported
According to the real-time identification (or being equivalent to 1 hour voice data that can handle 32 hours of server) of stream.Using speech recognition technology
Identify the voice in Internet video, more excellent compared to conventional speech recognition methods effect, recognition speed is in prior art basis
On improve 100%, while recognition accuracy is effectively promoted, and 2 times or more is reached.
Experimental data comparison such as table 1
Machine translation module 2 needs to use machine translation mothod, machine translation for the video for using other language
Machine-independent identification, on the basis of machine recognition goes out voice, to original text carry out intelligent translation, translation method according to
Family demand is different and different, and current translation technology can reach higher accuracy rate and one section of 30 minutes video and only take
1 minute speed.The process of entire machine translation can be divided into source language analysis, the conversion of original text translation and translation and generate 3 stages.
In specific machine translation system, according to the purpose of different schemes and requirement, original text translation can be converted stage and original text
Analysis phase is combined together, and translation generation phase is independently got up, and establishes correlation analysis and independently generates system.Such
In system, the characteristics of considering to translate language when primitive is analyzed, and the characteristics of then do not consider primitive when translating language and generating.It is a variety of studying
When language is to the translation of language a kind of, such correlation analysis is preferably used to independently generate system.
The source language analysis stage is independently got up, the original text translation conversion stage is combined with translation generation phase, is established
Independent analysis correlation generation system.In such a system, the characteristics of not considering to translate language when primitive is analyzed, and when translating language generation
The characteristics of considering primitive.In a kind of translation of the language of research to multilingual, preferably use this independent analysis is related to generate
System.
Finally, source-text analysis, translation conversion, and translation generation may all be kept independent, establishing an independent-analysis, independent-generation system. In such a system, the target language is not considered when the source text is analyzed and the source language is not considered when the translation is generated; the differences between the source and target languages are resolved entirely by the translation conversion stage. For translation from many languages into many languages, such an independent-analysis, independent-generation system is appropriate.
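The three-stage transfer pipeline described above can be sketched as a toy word-for-word translator. This is only an illustration of how the stages compose: the tokenizer, the tiny lexicon, and the casing rule are illustrative assumptions, not the statistical or neural models a production system would use.

```python
def translate(text, lexicon):
    # Stage 1: source-text analysis (here: naive lowercasing and tokenization)
    tokens = text.lower().split()
    # Stage 2: translation conversion (here: word-for-word lexicon lookup,
    # passing unknown words through unchanged)
    converted = [lexicon.get(t, t) for t in tokens]
    # Stage 3: translation generation (here: re-joining and sentence casing)
    return " ".join(converted).capitalize()

# Hypothetical English->French lexicon, for illustration only
lexicon = {"hello": "bonjour", "world": "monde"}
print(translate("Hello world", lexicon))  # Bonjour monde
```

In the correlated-analysis variant, stages 1 and 2 would be fused into one language-pair-specific step; in the correlated-generation variant, stages 2 and 3 would be fused; in the fully independent variant each stage stays separate, as here.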
Live streaming media module 3 transmits audio and video over the network in streaming form via a streaming media server. In contrast to the download-then-watch form of network broadcast, streaming media places continuous, compressed audio/video information on a network server so that the user watches while downloading, without waiting for the entire file to finish downloading. The streaming media server transfers video files to the client over streaming protocols such as RTP/RTSP, MMS, RTMP, HTTP, and HLS for users to watch online.
The subtitle-language synchronization system of the present invention integrates the above machine recognition, machine translation, and streaming media technologies. It uses the latest speech recognition technology and achieves prominent improvements in recognition precision and speed; for example, a 30-minute English video uploaded to the system is recognized and translated in under 3 minutes in total.
The application principle of the present invention is further described below with reference to specific embodiments.
The present invention adopts both C/S and B/S architectures, provided either as a downloadable desktop program or as a browser-based web application. The system is composed of two subsystems, a front end and a back end. The front-end subsystem is responsible for streaming media playback, voice display, subtitle display, and user UI operations; the back-end subsystem is responsible for video download, video extraction, automatic voice time-axis cutting, speech analysis, the streaming media server, and machine translation. The results of analysis are returned to the front-end subsystem in real time, where they are presented to and interacted with by the user.
The invention is realized as follows. First, the subtitle-language accurate synchronization method cuts the time axis using speech segmentation technology; the flow is shown in Fig. 5.
(1) Windowing and framing
Windowing and framing the speech signal serves to obtain the short-term parameters that are useful for later speech recognition. Windowing is the basis of framing the speech signal: a window of fixed length is slid along the signal and used to weight it. There are usually 33 to 100 frames per second; a frame shift is set during framing so that overlapping frames are used, the frame shift being chosen as a proportion of the frame length, as shown in Fig. 6.
The windowing method moves the window sequence frame by frame from left to right along the sequence of speech sample values. Two windows are in common use, the rectangular window and the Hamming window, each with its own window function. Once the window function is determined, the speech signal can be framed, and all subsequent operations and transformations are carried out on each frame. Windowing and framing extract effective speech segments, yielding a continuous, stable speech signal and reducing recognition error.
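The framing step above can be sketched as follows. The 8 kHz sample rate, 25 ms frame length, and 10 ms frame shift are illustrative assumptions (they give 100 frames per second, within the 33-100 frames-per-second range stated above); the Hamming window is one of the two windows named in the text.

```python
import math

def frame_signal(samples, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames and apply a Hamming window."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        # Hamming window function: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
        frames.append([s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                       for n, s in enumerate(frame)])
    return frames

# 8 kHz audio: 25 ms frames (200 samples) shifted by 10 ms (80 samples)
frames = frame_signal([0.5] * 800, frame_len=200, frame_shift=80)
```

Because the shift is smaller than the frame length, consecutive frames overlap, which is the overlapping-segmentation method the text describes.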
(2) Artificially adding empirical constraints
To obtain more accurate speech segmentation results, some empirical constraints are added manually: for example, the spectral range is restricted to 250-3500 Hz, and the upper limit of the normalized spectral probability density is set to 0.9.
(3) Computing the subband spectral entropy
Building on the application of the basic spectral-entropy principle to speech signals, and in order to improve the ability to distinguish speech signals from non-speech segments and to eliminate the influence on the traditional spectral-entropy method of noise whose energy is concentrated at particular frequencies, an improved spectral-entropy automatic speech segmentation algorithm is proposed. The idea of subband spectral entropy is to further divide a frame into several subbands and then compute the spectral entropy of each subband separately, which removes the problem of every spectral line's amplitude being affected by noise. The spectral entropies of different noises do not differ greatly, which makes it easy to set a threshold for automatic segmentation. Compared with previous work, this effectively enhances the speech signal and the ability to distinguish non-useful speech signals, excludes noise interference, reduces error, and can improve speech recognition accuracy by 30%.
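A minimal sketch of the subband spectral entropy computation follows. The 4-subband split is an illustrative choice; the 0.9 probability cap follows the empirical constraint above, and the input is assumed to be one frame's power spectrum already band-limited to 250-3500 Hz.

```python
import math

def subband_spectral_entropy(power_spectrum, num_subbands=4, prob_cap=0.9):
    """Sum of the spectral entropies of the subbands of one frame's power spectrum."""
    band_len = len(power_spectrum) // num_subbands
    total_entropy = 0.0
    for b in range(num_subbands):
        band = power_spectrum[b * band_len:(b + 1) * band_len]
        band_energy = sum(band)
        if band_energy <= 0.0:
            continue  # empty subband contributes no entropy
        # Normalize the subband to a probability distribution, capped at prob_cap
        probs = [min(p / band_energy, prob_cap) for p in band]
        total_entropy -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total_entropy
```

Because each subband is normalized on its own, a noise peak at one frequency only perturbs the entropy of its subband rather than every spectral line's contribution, which is the robustness property the text claims.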
(4) Double-threshold endpoint detection
Endpoint detection is a basic problem in speech signal processing; its purpose is to determine the starting point and end point of speech within a segment of signal containing speech. Effective endpoint detection not only minimizes processing time but also suppresses the noise interference of silent segments and improves speech signal quality. Common endpoint detection methods include energy thresholds, pitch detection, spectral analysis, cepstral analysis, and LPC (Linear Prediction Coefficients) prediction residuals. Among these, the double-threshold decision method based on energy and zero-crossing rate is the most commonly used.
Unlike the decision flow of a single threshold, double-threshold comparison sets two thresholds for each characteristic parameter. When the speech signal parameter value rises above the lower of the two set thresholds, the signal may be entering a speech segment, and the decision continues. When, on this basis, the signal exceeds the higher preset threshold and stays there for several consecutive frames, the speech signal is judged to have entered a speech segment. The decision on the end point of a speech segment is generally the inverse process: when the parameter value falls below the higher set threshold, the speech segment may be ending, and the algorithm continues to monitor it; if the parameter value then falls below the lower set threshold and remains there for several consecutive frames, the speech signal is judged to have entered the background-noise segment. This method can relatively accurately detect the starting point of speech segments in environments with impulse noise, effectively solves the background-noise problem, and raises speech recognition accuracy to 95%, as shown in Fig. 7.
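The entry and exit decisions described above can be sketched as a small state machine over per-frame energies. The threshold values and the three-frame persistence requirement are illustrative assumptions, and the exit path is simplified to a single sustained drop below the low threshold.

```python
def detect_speech(energies, low, high, min_run=3):
    """Double-threshold endpoint detection over per-frame energies.
    Returns (start, end) frame-index pairs, end exclusive."""
    segments = []
    state = "silence"
    start = above = below = 0
    for i, e in enumerate(energies):
        if state == "silence":
            if e > low:                  # possibly entering a speech segment
                state, start, above = "candidate", i, 0
        elif state == "candidate":
            if e > high:
                above += 1
                if above >= min_run:     # sustained above the high threshold
                    state, below = "speech", 0
            elif e <= low:
                state = "silence"        # false alarm, back to silence
        else:  # "speech"
            if e < low:
                below += 1
                if below >= min_run:     # sustained silence: the segment ends
                    segments.append((start, i - min_run + 1))
                    state = "silence"
            else:
                below = 0
    if state == "speech":                # speech ran to the end of the signal
        segments.append((start, len(energies)))
    return segments

frames = [0.1, 0.1, 0.1, 0.5, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
print(detect_speech(frames, low=0.3, high=0.7))  # [(3, 7)]
```

The persistence counters are what make the method robust to impulse noise: a single frame spiking above the high threshold, or dipping below the low one, does not flip the state.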
Machine recognition is then carried out by the speech analysis system; the original text is translated intelligently in the 3 stages of source-text analysis, translation conversion, and translation generation; finally, audio and video are transmitted over the network in streaming form by the streaming media server.
The key innovations of the present invention are the machine recognition module and the machine translation module; the outstanding features of these two classes of modules are described separately below.
The method of machine recognition by the speech analysis system:
1. Background noise is eliminated using a two-pass Wiener filtering technique;
2. Garbage speech is removed using the method of Gaussian modeling;
3. The speech starting point is effectively detected using harmonic detection;
4. A cross-word static search space is constructed based on WFST; as used in the field of speech recognition, the WFST incorporates acoustic processing features, tightly combining the acoustic model, acoustic context, pronunciation dictionary, and language model, with search carried out over the WFST;
5. The network is optimized by a sufficient bitonic merging algorithm;
6. Acoustic model adaptation is carried out using a discriminative training criterion that minimizes the sentence error rate;
7. Real-time recognition of 32 concurrent audio data streams is supported.
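The text names a two-pass Wiener filtering step (item 1) without further detail. The following is a minimal sketch of applying a per-bin Wiener gain twice to a power spectrum, under the assumption that a noise power estimate is available; a real implementation would estimate the noise adaptively and work frame by frame.

```python
def wiener_pass(power_spectrum, noise_spectrum):
    """One pass of a per-bin Wiener gain: G = max(S - N, 0) / S."""
    out = []
    for s, n in zip(power_spectrum, noise_spectrum):
        gain = max(s - n, 0.0) / s if s > 0.0 else 0.0
        out.append(s * gain)
    return out

def two_pass_wiener(power_spectrum, noise_spectrum):
    # Apply the same gain rule twice ("two-pass" Wiener filtering):
    # the second pass further attenuates bins close to the noise floor
    return wiener_pass(wiener_pass(power_spectrum, noise_spectrum), noise_spectrum)

# A bin of power 4.0 over noise 1.0 is attenuated to 3.0, then toward 2.0
print(two_pass_wiener([4.0], [1.0]))
```

Bins at or below the noise estimate are zeroed on the first pass, so the second pass mainly sharpens the suppression of weakly voiced bins.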
The machine translation method composed of source-text analysis, translation conversion, and translation generation:
i: the translation conversion stage is combined with the source-text analysis stage and the translation generation stage is kept independent, establishing a correlated-analysis, independent-generation system;
ii: the source-text analysis stage is kept independent and the translation conversion stage is combined with the translation generation stage, establishing an independent-analysis, correlated-generation system;
iii: source-text analysis, translation conversion, and translation generation are all kept independent, establishing an independent-analysis, independent-generation system.
The subtitle-language accurate synchronization system provided by the present invention comprises: a machine recognition module, which cuts the time axis using speech segmentation technology and then carries out machine recognition via the speech analysis system; a machine translation module, which translates the original text intelligently once machine recognition has produced the speech; and a live streaming media module, which transmits audio and video over the network in streaming form via a streaming media server.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (such as a Solid State Disk (SSD)).
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall all be included within the protection scope of the present invention.
Claims (8)
1. A subtitle-language accurate synchronization method, characterized in that the subtitle-language accurate synchronization method cuts the time axis using speech segmentation technology and carries out machine recognition via a speech analysis system; translates the original text intelligently in the 3 stages of source-text analysis, translation conversion, and translation generation; and transmits audio and video over the network in streaming form via a streaming media server.
2. The subtitle-language accurate synchronization method according to claim 1, characterized in that cutting the time axis using speech segmentation technology and carrying out machine recognition via the speech analysis system further comprises: eliminating background noise using a two-pass Wiener filtering technique; removing garbage speech using the method of Gaussian modeling; effectively detecting the speech starting point using harmonic detection; constructing a cross-word static search space based on WFST, wherein the WFST as used in the field of speech recognition incorporates acoustic processing features, tightly combining the acoustic model, acoustic context, pronunciation dictionary, and language model, with search carried out over the WFST; optimizing the network by a sufficient bitonic merging algorithm;
carrying out acoustic model adaptation using a discriminative training criterion that minimizes the sentence error rate; and
supporting real-time recognition of 32 concurrent audio data streams.
3. The subtitle-language accurate synchronization method according to claim 1, characterized in that the 3 stages of source-text analysis, translation conversion, and translation generation comprise: combining the translation conversion stage with the source-text analysis stage and keeping the translation generation stage independent, establishing a correlated-analysis, independent-generation system; keeping the source-text analysis stage independent and combining the translation conversion stage with the translation generation stage, establishing an independent-analysis, correlated-generation system; and keeping source-text analysis, translation conversion, and translation generation all independent, establishing an independent-analysis, independent-generation system.
4. A subtitle-language accurate synchronization system implementing the subtitle-language accurate synchronization method according to claim 1, characterized in that the subtitle-language accurate synchronization system comprises:
a machine recognition module, which carries out machine recognition via the speech analysis system after cutting the time axis using speech segmentation technology;
a machine translation module, which translates the original text intelligently once machine recognition has produced the speech; and
a live streaming media module, which transmits audio and video over the network in streaming form via a streaming media server.
5. The subtitle-language accurate synchronization system according to claim 4, characterized in that the machine recognition module comprises: feature vector acquisition, an acoustic model, a language model, and a decoder.
6. A computer program implementing the subtitle-language accurate synchronization method according to any one of claims 1 to 3.
7. An information data processing terminal implementing the subtitle-language accurate synchronization method according to any one of claims 1 to 3.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the subtitle-language accurate synchronization method according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810289373.3A CN108597497B (en) | 2018-04-03 | 2018-04-03 | Subtitle voice accurate synchronization system and method and information data processing terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108597497A true CN108597497A (en) | 2018-09-28 |
CN108597497B CN108597497B (en) | 2020-09-08 |
Family
ID=63624291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810289373.3A Active CN108597497B (en) | 2018-04-03 | 2018-04-03 | Subtitle voice accurate synchronization system and method and information data processing terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108597497B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111369384A (en) * | 2019-12-23 | 2020-07-03 | 国网河南省电力公司郑州供电公司 | Power transformation operation and maintenance hidden danger overall process control system |
CN114079797A (en) * | 2020-08-14 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Live subtitle generation method and device, server, live client and live system |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101510222A (en) * | 2009-02-20 | 2009-08-19 | 北京大学 | Multilayer index voice document searching method and system thereof |
CN103297710A (en) * | 2013-06-19 | 2013-09-11 | 江苏华音信息科技有限公司 | Audio and video recorded broadcast device capable of marking Chinese and foreign language subtitles automatically in real time for Chinese |
CN103345922A (en) * | 2013-07-05 | 2013-10-09 | 张巍 | Large-length voice full-automatic segmentation method |
CN103971686A (en) * | 2013-01-30 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and system for automatically recognizing voice |
CN104159152A (en) * | 2014-08-26 | 2014-11-19 | 中译语通科技(北京)有限公司 | Automatic timeline generating method specific to film and television videos |
US20150042771A1 (en) * | 2013-08-07 | 2015-02-12 | United Video Properties, Inc. | Methods and systems for presenting supplemental content in media assets |
WO2016097165A1 (en) * | 2014-12-19 | 2016-06-23 | Softathome | Labelled audio-video stream for synchronizing the components thereof, method and equipment for analyzing the artifacts and synchronization of such a stream |
CN106448660A (en) * | 2016-10-31 | 2017-02-22 | 闽江学院 | Natural language fuzzy boundary determining method with introduction of big data analysis |
CN106649282A (en) * | 2015-10-30 | 2017-05-10 | 阿里巴巴集团控股有限公司 | Machine translation method and device based on statistics, and electronic equipment |
CN106816151A (en) * | 2016-12-19 | 2017-06-09 | 广东小天才科技有限公司 | A kind of captions alignment methods and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Alumäe et al. | Advanced rich transcription system for Estonian speech | |
CN107945805B (en) | A kind of across language voice identification method for transformation of intelligence | |
Van Den Oord et al. | Wavenet: A generative model for raw audio | |
Oord et al. | Wavenet: A generative model for raw audio | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
Ramu Reddy et al. | Identification of Indian languages using multi-level spectral and prosodic features | |
CN112750446B (en) | Voice conversion method, device and system and storage medium | |
CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model | |
CN108597497A (en) | A kind of accurate synchronization system of subtitle language and method, information data processing terminal | |
CN114550706A (en) | Smart campus voice recognition method based on deep learning | |
Dey et al. | Cross-corpora spoken language identification with domain diversification and generalization | |
Kadyan et al. | Prosody features based low resource Punjabi children ASR and T-NT classifier using data augmentation | |
Dua et al. | Noise robust automatic speech recognition: review and analysis | |
Masumura et al. | Improving speech-based end-of-turn detection via cross-modal representation learning with punctuated text data | |
Andra et al. | Improved transcription and speaker identification system for concurrent speech in Bahasa Indonesia using recurrent neural network | |
TW201828281A (en) | Method and device for constructing pronunciation dictionary capable of inputting a speech acoustic feature of the target vocabulary into a speech recognition decoder | |
Liu et al. | A New Speech Encoder Based on Dynamic Framing Approach. | |
Kynych et al. | Online Speaker Diarization Using Optimized SE-ResNet Architecture | |
Zeng et al. | Low-resource accent classification in geographically-proximate settings: A forensic and sociophonetics perspective | |
Dua et al. | A review on Gujarati language based automatic speech recognition (ASR) systems | |
Bohouta | Improving wake-up-word and general speech recognition systems | |
Jing et al. | Acquisition of english corpus machine translation based on speech recognition technology | |
Zhao et al. | Multi-speaker Chinese news broadcasting system based on improved Tacotron2 | |
Ferraro et al. | Benchmarking open source and paid services for speech to text: an analysis of quality and input variety | |
Gao et al. | Chinese question speech recognition integrated with domain characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||