CN108597497A - Subtitle voice accurate synchronization system and method, and information data processing terminal - Google Patents

Subtitle voice accurate synchronization system and method, and information data processing terminal

Info

Publication number
CN108597497A
CN108597497A
Authority
CN
China
Prior art keywords
language
translation
analysis
voice
original text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810289373.3A
Other languages
Chinese (zh)
Other versions
CN108597497B (en)
Inventor
孙宏亮
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201810289373.3A
Publication of CN108597497A
Application granted
Publication of CN108597497B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 15/05 - Word boundary detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 - Adaptation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of computer software and discloses a subtitle voice accurate synchronization system and method and an information data processing terminal. The machine recognition module applies multiple techniques to improve noise robustness: background noise is removed with a two-pass Wiener filter; garbage audio is removed by Gaussian mixture modeling, with GMM classification of 36 kinds of natural environment sounds reaching 95.83% accuracy; and voice starting points are detected effectively with a harmonic detection technique. Compared with conventional speech recognition methods, recognition speed is improved by 100% over the prior art, while recognition accuracy is effectively improved, by a factor of two or more. The invention separates source-text analysis, source-to-target transfer, and target-text generation into independent stages, establishing an independent-analysis, independent-generation system. In such a system, analysis of the source language does not consider the characteristics of the target language, generation of the target language does not consider the characteristics of the source language, and the differences between the source and target languages are resolved in the transfer stage.

Description

Subtitle voice accurate synchronization system and method, and information data processing terminal
Technical field
The invention belongs to the technical field of computer software, and more particularly relates to a subtitle voice accurate synchronization system and method and an information data processing terminal.
Background technology
At present, the prior art commonly used in the industry is as follows. Speech recognition technology is already applied in many areas of social life, such as voice dialing systems, bank inquiry systems, telephone ticket-booking systems, information retrieval and translation systems, and educational applications. These are established speaker-independent recognition systems for isolated words or small vocabularies, with recognition accuracy above 98%, and they are widely accepted. However, with the development of Internet technology, video has become a major share of network traffic, and in recent years live video streaming has swept the global network: more and more people follow live streams, and the demand for watching all kinds of sports events, breaking news, and online press conferences has grown sharply. As globalization accelerates, cross-language viewing of live network streams is the trend of the times; translation of foreign live streams such as NBA games, the UEFA European Championship, and Apple product launches is an urgent need. Current large-vocabulary continuous speech recognition systems still fail to meet the demands of practicality and broad adoption, especially in high-traffic video fields such as television, film, and live event coverage. The main cause of this situation is the technical bottleneck of speech recognition, which faces the following problems. (1) Speech segmentation: the first step of speech recognition must decompose continuous speech into units such as phonemes or initials and finals, and a set of rules must then be established for understanding semantics. (2) Ambiguity: polyphonic characters in Chinese are one aspect of ambiguity; another is that, in both English and Chinese, somewhat different words may sound similar in a speaker's utterance. (3) Context dependence: English words and Chinese characters and words are influenced by context, and speech characteristics such as stress, tone, volume, and speaking rate all vary. (4) Noise: heavy noise and serious interference from the environment cause the speech recognition accuracy to decline. At present, human translation is still the mainstream interpretation mode in the video field; however, human translation falls short in working efficiency, and with the rapid rise of domestic labor costs it is an increasing burden on many for-profit enterprises. Therefore, a product that can generate accurate subtitles in real time would address the above demands.
In summary, the problems in the prior art are: the working efficiency of human translation is low, and its cost is high.
The difficulty and significance of solving the above technical problems: in speech recognition, the speech and non-speech portions of the original audio must first be segmented according to a suitable algorithm, and speech recognition is then carried out on selected characteristic parameters of the speech signal. The preprocessing work of speech recognition thus includes the selection of the recognition unit and the segmentation of speech. Because language structures differ, the choice of recognition unit also differs; for example, Chinese syllables have an initial-final structure, while English has no such structure.
For Mandarin speech recognition, words, syllables, or initials and finals may all be selected as the recognition unit; the smaller the chosen primitive, the higher the flexibility of recognition but the lower its stability, and vice versa. In addition, the structure of Chinese is complex: there are 1312 tonal syllables, 432 syllables when tone is ignored, 22 initials, and 38 finals. The sheer volume and intricate structure of Chinese are the difficulties that speech recognition technology must overcome. A breakthrough here would, however, provide unprecedented convenience to both suppliers and consumers throughout the video streaming field, effectively improving the economic benefits of the field.
Summary of the invention
Accurate subtitle-voice synchronization is inseparable from two key technologies: speech recognition and subtitle translation. In the 21st century, with the spread of computer networks, speech recognition technology has developed rapidly; representations and algorithms change with each passing day, and the development of speech recognition systems has produced ever more diverse combinations. The traditional approach to speech recognition builds on statistical methods and models with statistical models; in recent years, many decoding strategies and decoding functions have been applied in decoders, opening a convenient door for emerging speech recognition methods. Meanwhile, subtitle translation technology advances just as quickly: with the development of big data, multilingual sample corpora are easy to obtain, semantic analysis is continuously upgraded, and faster and more accurate translation algorithms make accurate subtitle-voice synchronization possible.
In view of the problems in the prior art, the present invention provides a subtitle voice accurate synchronization system and method and an information data processing terminal.
The invention is realized as follows.
One object of the present invention is to provide a computer program implementing the subtitle voice accurate synchronization method.
Another object of the present invention is to provide an information data processing terminal implementing the subtitle voice accurate synchronization method.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the subtitle voice accurate synchronization method.
The advantages and positive effects of the present invention are as follows. The machine recognition module applies multiple techniques to improve noise robustness. Background noise is removed with a two-pass Wiener filter. Garbage audio (ringtones, laughter, coughs, and other human but non-speech sounds) is removed by Gaussian mixture modeling; GMM classification of the sounds of 36 kinds of natural environments reaches 95.83% accuracy. Voice starting points are detected with a harmonic detection technique that requires no prior knowledge of the noise, makes full use of the correlation of speech in both the frequency and time domains, and adapts to a variety of nonstationary, complex noises. Experiments show that this algorithm overcomes the poor robustness, in low signal-to-noise environments, of traditional endpoint detection based on features such as short-time energy, fundamental frequency, and zero-crossing rate, and also overcomes the poor performance of subband energy under nonstationary and single-frequency noise. The training data and acoustic modeling likewise take the interference of natural environmental noise fully into account, and a multi-condition training strategy significantly improves robustness to noise. A cross-word static search space construction method based on WFST effectively integrates the various knowledge sources in a single pass, statically compiling the acoustic model, acoustic context, pronunciation dictionary, and language model into one state network. The network is optimized with a full bitonic merge sort, which significantly simplifies the search space; bitonic merge sort exchanges data directly within the memory space to be sorted, effectively saving memory overhead. At comparable recognition accuracy, decoding is more than 4 times faster than the open-source WFST toolkit. Language model adaptation is trained on real network service text data, adapting the recognizer to the speech recognition tasks of different business domains.
The present invention obtains effective speech segments with a windowing-and-framing technique, yielding continuous, stable speech signals and reducing recognition errors. Building on this, it effectively enhances the speech signal and its ability to reject non-useful signals, excluding noise interference and reducing errors, which can improve speech recognition accuracy by 30%. It can accurately detect the starting points of speech segments in environments with high impulse noise, effectively solving the background noise problem and raising speech recognition accuracy to 95%.
The present invention removes background noise with a two-pass Wiener filter; removes garbage audio by Gaussian mixture modeling; detects voice starting points effectively with a harmonic detection technique; uses a cross-word static search space construction method based on WFST, incorporating the acoustic processing characteristics of the speech recognition field into WFST so that the acoustic model, acoustic context, pronunciation dictionary, and language model are tightly combined through WFST and searched on the WFST; optimizes the network with a full bitonic merge sort; performs acoustic model adaptation with a discriminative training criterion that minimizes the sentence error rate; and supports real-time recognition of 32 audio data streams.
The present invention uses speech recognition technology to recognize the speech in network video, with better results than conventional speech recognition methods: recognition speed is improved by 100% over the prior art, while recognition accuracy is effectively improved, by a factor of two or more. The invention separates source-text analysis, source-to-target transfer, and target-text generation into independent stages, establishing an independent-analysis, independent-generation system, in which analysis of the source language does not consider the characteristics of the target language, generation of the target language does not consider the characteristics of the source language, and the differences between the source and target languages are resolved in the transfer stage.
Description of the drawings
Fig. 1 is a flow chart of the subtitle voice accurate synchronization method provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the architecture of the subtitle voice accurate synchronization system provided by an embodiment of the present invention.
In the figure: 1, machine recognition module; 2, machine translation module; 3, streaming media live module.
Fig. 3 is a flow diagram of the machine recognition module provided by an embodiment of the present invention.
Fig. 4 is a flow diagram of the machine translation module provided by an embodiment of the present invention.
Fig. 5 is an implementation flow chart of the subtitle voice accurate synchronization method provided by an embodiment of the present invention.
Fig. 6 is a schematic diagram of windowing-and-framing processing provided by an embodiment of the present invention.
Fig. 7 is a schematic diagram of completed automatic speech segmentation provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the embodiments. It should be appreciated that the specific embodiments described here serve only to illustrate the present invention and are not intended to limit it.
The present invention provides a subtitle voice accurate synchronization system that realizes real-time subtitle generation for network live streams and simultaneously provides online multilingual translation.
As shown in Fig. 1, the subtitle voice accurate synchronization method provided by an embodiment of the present invention includes the following steps:
S101: after the time axis is cut with the latest speech segmentation technology, machine recognition is carried out by the speech analysis system; the machine recognition technology can recognize speech quickly, adapts to different accents, and can recognize accurate speech data in relatively loud noise environments;
S102: on the basis of the speech recognized by machine, the source text is translated intelligently, in 3 stages: source-text analysis, source-to-target transfer, and target-text generation;
S103: the streaming media server transmits the audio and video over the network in streaming mode. In contrast to the download-then-play form of network playback, streaming media live broadcasting places the continuous audio/video information, after compression, on a network server, and the user watches while downloading, without waiting for the entire file to finish downloading.
As shown in Fig. 2, the subtitle voice accurate synchronization system provided by an embodiment of the present invention includes:
Machine recognition module 1: after the time axis is cut with the latest speech segmentation technology, machine recognition is carried out by the speech analysis system. A machine recognition system generally includes the following modules: feature vector extraction, acoustic model, language model, and decoder. As shown in Fig. 3, O and W are respectively the observed feature vectors of the training sentences and the corresponding word sequence; P(O|W) is the acoustic model probability, indicating the degree of match between the acoustic features of the speech and the word sequence W. The word sequence W* for which P(O|W)P(W) reaches its maximum value is output as the speech recognition result.
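Written out, the decision rule this paragraph paraphrases is the standard maximum a posteriori (noisy-channel) formulation; since P(O) is constant across hypotheses, it drops out of the maximization:

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
      = \arg\max_{W} P(O \mid W)\, P(W)
```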
1) Multiple techniques are applied to improve noise robustness
Background noise is removed with a two-pass Wiener filter.
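As a minimal sketch of the two-pass idea, the cascade below applies scipy's adaptive Wiener filter twice; the filter window size and the test signal are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import wiener

def two_pass_wiener(signal: np.ndarray, size: int = 29) -> np.ndarray:
    """Cascade two adaptive Wiener filters; the window size is an assumption."""
    first_pass = wiener(signal, mysize=size)   # first pass: coarse noise suppression
    return wiener(first_pass, mysize=size)     # second pass: clean residual noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
denoised = two_pass_wiener(noisy)
print("RMS error before:", np.sqrt(np.mean((noisy - clean) ** 2)))
print("RMS error after: ", np.sqrt(np.mean((denoised - clean) ** 2)))
```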
Garbage audio (ringtones, laughter, coughs, and other human but non-speech sounds) is removed by Gaussian mixture modeling. This is a recognition method for natural environment sounds based on the Gaussian mixture model (GMM). The recognition rate is improved by varying the number of frames used for feature extraction and the mixture order of the model; Mel-frequency cepstral coefficients (MFCCs) are extracted to analyze the speech signal; a Gaussian mixture model is built for each sound from the MFCC feature set using the expectation-maximization algorithm; and recognition uses the minimum-error-rate decision rule together with a voting rule. The accuracy of GMM classification of the sounds of 36 kinds of natural environments can reach 95.83%.
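The sketch below illustrates this kind of GMM-over-MFCC classifier. The 13-coefficient MFCCs, 8-component diagonal mixtures, 16 kHz sample rate, and file names are assumptions for illustration (the patent fixes none of these), and the decision here is a simple average log-likelihood maximum rather than the voting rule described:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path: str) -> np.ndarray:
    """Return an (n_frames, 13) matrix of MFCC vectors for one audio file."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def train_models(class_files: dict) -> dict:
    """Fit one diagonal-covariance GMM per sound class via EM (sklearn's fit)."""
    models = {}
    for label, paths in class_files.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[label] = GaussianMixture(n_components=8,
                                        covariance_type="diag",
                                        random_state=0).fit(feats)
    return models

def classify(path: str, models: dict) -> str:
    """Pick the class whose model gives the highest average log-likelihood."""
    feats = mfcc_features(path)
    return max(models, key=lambda label: models[label].score(feats))

# Hypothetical usage; the file lists are placeholders.
# models = train_models({"speech": ["s1.wav"], "ringtone": ["r1.wav"]})
# print(classify("unknown.wav", models))
```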
Voice starting points are detected effectively with a harmonic detection technique. The harmonic component of the speech signal is one of the basic characteristics of human articulation and an obvious feature distinguishing speech from non-speech signals; the harmonic detection technique is a speech endpoint detection method that uses harmonic energy as the detection feature. The method requires no prior knowledge of the noise, makes full use of the correlation of speech in both the frequency and time domains, and adapts to a variety of nonstationary, complex noises. Experiments show that this algorithm overcomes the poor robustness, in low signal-to-noise environments, of traditional endpoint detection based on features such as short-time energy, fundamental frequency, and zero-crossing rate, and also overcomes the poor performance of subband energy under nonstationary and single-frequency noise.
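A minimal sketch of harmonic onset detection follows: voiced speech is quasi-periodic, so a strong normalized autocorrelation peak in the pitch-lag range marks a harmonic frame. The frame sizes, 60-400 Hz pitch range, and 0.4 threshold are illustrative assumptions:

```python
import numpy as np

def harmonic_score(frame: np.ndarray, sr: int) -> float:
    """Normalized autocorrelation peak over the pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    if ac[0] <= 0.0:
        return 0.0
    lo, hi = sr // 400, sr // 60    # lags corresponding to 400 Hz .. 60 Hz
    return float(ac[lo:hi].max() / ac[0])

def find_voice_onset(signal: np.ndarray, sr: int = 16000,
                     frame_len: int = 400, hop: int = 160,
                     threshold: float = 0.4):
    """Return the sample index of the first harmonic (voiced) frame, else None."""
    for start in range(0, signal.size - frame_len + 1, hop):
        if harmonic_score(signal[start:start + frame_len], sr) > threshold:
            return start
    return None
```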
Starting from existing basic research results on human auditory perception and speech production mechanisms, characteristic parameters with noise immunity, discriminability, and complementarity are analyzed and extracted. The interference of natural environmental noise is fully considered in the training data and acoustic modeling, and a multi-condition training strategy significantly improves robustness to noise.
2) A WFST-based static search space construction method improves recognition efficiency
A cross-word static search space construction method based on WFST effectively integrates the various knowledge sources in a single pass, statically compiling the acoustic model, acoustic context, pronunciation dictionary, and language model into one state network. The decoder is one of the cores of a speech recognition system; in recent years, many decoding strategies and decoding functions have been applied in decoders, such as the HVite decoding tool of HTK (Hidden Markov Model Toolkit), the Sphinx decoders, and the TODE decoder. What these decoders have in common is that the knowledge sources they represent (acoustics, phonetics, dictionary) are applied in the decoder in a very rigid form, making later modification very cumbersome, and introducing a novel knowledge source into such a decoder becomes an arduous task. The WFST (Weighted Finite-State Transducer) is a more flexible decoder structure whose idea is to simulate the grammatical structure and characteristics of the language with WFST models. The concrete method is to incorporate the acoustic processing characteristics of the speech recognition field into WFST, tightly combining the acoustic model, acoustic context, pronunciation dictionary, and language model through WFST, and searching on the WFST.
The network is optimized with a full bitonic merge sort, which significantly simplifies the search space. Most sorting methods need to allocate new memory to store intermediate results, for example the common quicksort, radix sort, bucket sort, and parallel sorting algorithms. Bitonic merge sort can exchange data directly within the memory space to be sorted, effectively saving memory overhead. The bitonic merge sort used here is an OpenCL-based sorting algorithm: it exploits data locality to reduce the number of CPU synchronizations in the program, uses vector computation to raise ALU utilization, and optimizes read/write memory access. The program can run on graphics cards and processors that support OpenCL and vector computation. Excluding PCI-E bus data transfer time, integer sorting at a scale of 2^23 elements reaches 0.276 GB/s on an HD6870 graphics card. At comparable recognition accuracy, decoding is more than 4 times faster than the open-source WFST toolkit.
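For concreteness, here is a sequential Python sketch of the in-place bitonic merge sort network (the OpenCL version described above runs the same compare-exchange pattern in parallel); the input length must be a power of two:

```python
def bitonic_sort(data: list, ascending: bool = True) -> None:
    """In-place bitonic merge sort; no auxiliary array is allocated."""
    n = len(data)
    assert n and n & (n - 1) == 0, "bitonic sort needs a power-of-two length"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # compare-exchange distance within a merge step
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction according to bit k of the index.
                    block_up = ((i & k) == 0) == ascending
                    if (data[i] > data[partner]) == block_up:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2

nums = [7, 3, 1, 8, 5, 2, 6, 4]
bitonic_sort(nums)
print(nums)   # [1, 2, 3, 4, 5, 6, 7, 8]
```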
3) A specially trained speech model improves adaptability
Acoustic model adaptation is performed with a discriminative training criterion that minimizes the sentence error rate; adaptive training can be optimized on the real network user accent data of a specific region, so as to adapt to the accents of users from different regions. Language model adaptation is trained on real network service text data, adapting the recognizer to the speech recognition tasks of different business domains.
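The patent does not spell out the adaptation algorithm; as one common realization (a stand-in technique, not the patent's stated method), the toy sketch below adapts a general unigram language model by linear interpolation with a model estimated from in-domain service text. The 0.7 mixing weight and the toy corpora are assumptions:

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(general, in_domain, lam=0.7):
    """Mix the two models: P(w) = lam * P_general(w) + (1 - lam) * P_domain(w)."""
    vocab = set(general) | set(in_domain)
    return {w: lam * general.get(w, 0.0) + (1.0 - lam) * in_domain.get(w, 0.0)
            for w in vocab}

general = unigram_model("the match will start at eight tonight".split())
in_domain = unigram_model("the striker scores and the crowd cheers".split())
adapted = interpolate(general, in_domain)
print(adapted["the"])   # probability mass shared between both sources
```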
4) Real-time recognition of 32 audio data streams is supported
A real-time factor of 1 can be reached on an ordinary desktop computer.
Concurrency: on a common server (dual Intel Xeon E5 CPUs, eight cores per CPU), real-time recognition of 32 audio data streams can be supported (equivalently, the server can process 32 hours of speech data in 1 hour). Using speech recognition technology to recognize the speech in network video gives better results than conventional speech recognition methods: recognition speed is improved by 100% over the prior art, while recognition accuracy is effectively improved, by a factor of two or more.
The experimental data are compared in Table 1.
Machine translation module 2: for video in another language, machine translation technology is needed. Machine translation is independent of machine recognition; on the basis of the speech recognized by machine, the source text is translated intelligently. The translation method varies with user needs, and current translation technology can reach high accuracy and a speed at which a 30-minute video takes only 1 minute. The whole machine translation process can be divided into 3 stages: source-text analysis, source-to-target transfer, and target-text generation. In a specific machine translation system, according to the purposes and requirements of different schemes, the transfer stage can be combined with the analysis stage while the generation stage is kept independent, establishing a correlated-analysis, independent-generation system. In such a system, the characteristics of the target language are considered during source analysis, but the characteristics of the source are not considered during target generation. When studying translation from multiple languages into one language, such a correlated-analysis, independent-generation system is preferable.
Alternatively, the analysis stage can be kept independent while the transfer stage is combined with the generation stage, establishing an independent-analysis, correlated-generation system. In such a system, the characteristics of the target language are not considered during source analysis, but the characteristics of the source are considered during target generation. When studying translation from one language into multiple languages, this independent-analysis, correlated-generation system is preferable.
Finally, source-text analysis, transfer, and target-text generation can all be kept independent, establishing an independent-analysis, independent-generation system. In such a system, the characteristics of the target language are not considered during source analysis, the characteristics of the source are not considered during target generation, and the differences between the source and target languages are resolved by the transfer stage. For translation from multiple languages into multiple languages, such an independent-analysis, independent-generation system is appropriate.
Streaming media live module 3: the streaming media server transmits the audio and video over the network in streaming mode. In contrast to the download-then-play form of network playback, streaming media live broadcasting places the continuous audio/video information, after compression, on a network server, and the user watches while downloading, without waiting for the entire file to finish downloading. The streaming media server transfers the video files to the client with streaming protocols such as RTP/RTSP, MMS, RTMP, HTTP, and HLS, for users to watch online.
The subtitle voice synchronization system of the present invention integrates the above machine recognition, machine translation, and streaming media technologies. Using the latest speech recognition technology brings prominent improvements in both the precision and the speed of speech recognition; for example, a 30-minute English video fed into the system is recognized and translated in under 3 minutes overall.
The application principle of the present invention is further described below with reference to specific embodiments.
The present invention uses both c/s and b/s architecture modes, offered as a downloadable desktop program or as a browser-based web application. The system is divided into front-end and back-end subsystems. The front-end subsystem is responsible for streaming media playback, speech display, subtitle display, and user UI controls; the back-end subsystem is responsible for video download, video extraction, automatic time-axis cutting of speech, speech analysis, the streaming media server, and machine translation, returning the analysis results to the front-end subsystem in real time, where they are presented to and interact with the user.
The invention is realized as follows. First, the subtitle voice accurate synchronization method cuts the time axis with speech segmentation technology; the flow is shown in Fig. 5.
(1) Windowing and framing
Windowing-and-framing processing is applied to the speech signal in order to obtain the short-time parameters useful to the subsequent speech recognition. Windowing is the basis of framing of the speech signal: the signal is weighted with a window of fixed length that can be moved. Usually there are 33-100 frames per second; a frame shift is set during framing, and the frames are overlapped. The ratio of the frame shift to the frame length is chosen as shown in Fig. 6.
The method of windowing is to move the window sequence frame by frame from left to right along the speech sample sequence. Two windows are common: the rectangular window and the Hamming window, each with its own window function. After the window function is determined, the speech signal can be framed, and subsequent operations and transformations are carried out per frame. The windowing-and-framing technique obtains effective speech segments, yielding continuous, stable speech signals and reducing recognition errors.
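A minimal sketch of the framing step, assuming the common 25 ms Hamming frames with a 10 ms shift (the patent states only the 33-100 frames-per-second range):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sr: int = 16000,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping Hamming-weighted frames (one per row)."""
    frame_len = int(sr * frame_ms / 1000)    # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)            # 160 samples: frames overlap
    window = np.hamming(frame_len)           # the moving weighting window
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = signal[i * hop:i * hop + frame_len] * window
    return frames
```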
(2) Empirical constraints are added manually
In order to obtain more accurate speech segmentation results, some empirical constraints are added manually: for example, the spectral range is set to 250-3500 Hz, and the upper limit of the normalized spectral probability density is set to 0.9.
(3) Computing the subband spectral entropy
To improve the ability of applications based on the basic spectral entropy principle to distinguish speech signals from non-speech segments, and to eliminate the influence, on the traditional spectral entropy method, of noise whose energy concentrates at specific frequencies, an improved spectral-entropy automatic speech segmentation algorithm is proposed. The idea of subband spectral entropy is to further divide a frame into several subbands and then compute the spectral entropy of each subband separately, which eliminates the problem that the amplitude of each spectral line can be affected by noise. The spectral entropy differences between different noises are not very obvious, which makes it easy to set the threshold for automatic segmentation. Building on the previous steps, this effectively enhances the speech signal and the ability to reject non-useful signals, excludes noise interference, reduces errors, and can improve speech recognition accuracy by 30%.
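A sketch of the subband spectral entropy feature under the constraints stated above (250-3500 Hz band, 0.9 cap on the normalized spectral probability); the 8-subband split is an assumption:

```python
import numpy as np

def subband_spectral_entropy(frame: np.ndarray, sr: int = 16000,
                             n_subbands: int = 8) -> float:
    """Mean spectral entropy over subbands of the 250-3500 Hz power spectrum."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(frame.size, d=1.0 / sr)
    band = spectrum[(freqs >= 250.0) & (freqs <= 3500.0)]   # empirical band limit
    entropies = []
    for sub in np.array_split(band, n_subbands):
        p = sub / (sub.sum() + 1e-12)    # normalized spectral probabilities
        p = np.minimum(p, 0.9)           # cap the probability density at 0.9
        p = p / (p.sum() + 1e-12)
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return float(np.mean(entropies))
```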
(4) Double-threshold endpoint detection
Endpoint detection is a basic problem in speech signal processing; its purpose is to determine the starting point and end point of speech from a segment of signal containing speech. Effective endpoint detection not only minimizes processing time but also suppresses the noise interference of silent segments and improves speech signal quality. Common endpoint detection methods include energy thresholds, pitch detection, spectrum analysis, cepstral analysis, and LPC (Linear Prediction Coefficients) prediction residuals. Among them, the double-threshold decision method based on energy and zero-crossing rate is the most commonly used.
Unlike a single-threshold decision flow, the double-threshold comparison sets two thresholds for each characteristic parameter. When the parameter value of the speech signal exceeds the lower first threshold, a speech segment has possibly begun, and the decision continues. When, on this basis, the speech signal exceeds the higher, preset second threshold and stays above it for several frames, the signal is judged to have entered a speech segment. The decision on the end point of a speech segment is generally the inverse process: when the parameter value of the speech signal falls below the higher first threshold, the speech segment has possibly ended, and the algorithm continues detection; if the parameter value then falls below the lower second threshold and remains there for several frames, the speech signal is judged to have entered a background noise segment. The method can accurately detect the starting points of speech segments in environments with impulse noise, effectively solving the background noise problem and raising speech recognition accuracy to 95%, as shown in Fig. 7.
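A sketch of the double-threshold start-point decision on short-time frame energy; the threshold values and the required run of frames are illustrative parameters:

```python
import numpy as np

def detect_speech_start(frames: np.ndarray, low: float, high: float,
                        min_frames: int = 5):
    """frames: (n_frames, frame_len) windowed frames; returns start frame or None."""
    energy = (frames ** 2).sum(axis=1)       # short-time energy per frame
    candidate, run = None, 0
    for i, e in enumerate(energy):
        if candidate is None:
            if e > low:                      # crossed the lower threshold:
                candidate, run = i, 0        # possible entry into a speech segment
        elif e > high:
            run += 1
            if run >= min_frames:            # sustained above the higher threshold:
                return candidate             # confirmed speech start
        elif e < low:
            candidate, run = None, 0         # fell back to noise level: reset
    return None
```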
Machine recognition is then carried out by the speech analysis system; the source text is translated intelligently in 3 stages (source-text analysis, source-to-target transfer, and target-text generation); finally, the streaming media server transmits the audio and video over the network in streaming mode.
The key innovations of the present invention are the machine recognition module and the machine translation module; the distinguishing features of these two kinds of modules are described separately below.
The method of machine recognition by the speech analysis system:
1. Background noise is removed with a two-pass Wiener filter;
2. Garbage audio is removed by Gaussian mixture modeling;
3. Voice starting points are detected effectively with a harmonic detection technique;
4. A cross-word static search space construction method based on WFST is used: the acoustic processing characteristics of the speech recognition field are incorporated into WFST, and the acoustic model, acoustic context, pronunciation dictionary, and language model are tightly combined through WFST and searched on the WFST;
5. The network is optimized with a full bitonic merge sort;
6. Acoustic model adaptation is performed with a discriminative training criterion that minimizes the sentence error rate;
7. Real-time recognition of 32 audio data streams is supported.
The machine translation method composed of source-text analysis, source-to-target transfer, and target-text generation:
i: the transfer stage is combined with the analysis stage while the generation stage is kept independent, establishing a correlated-analysis, independent-generation system;
ii: the analysis stage is kept independent while the transfer stage is combined with the generation stage, establishing an independent-analysis, correlated-generation system;
iii: source-text analysis, transfer, and target-text generation are all kept independent, establishing an independent-analysis, independent-generation system.
The subtitle voice accurate synchronization system provided by the present invention includes: a machine recognition module, which, after the time axis is cut with speech segmentation technology, carries out machine recognition through the speech analysis system; a machine translation module, which translates the source text intelligently on the basis of the speech recognized by machine; and a streaming media live module, which transmits the audio and video over the network in streaming mode with the streaming media server.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present invention are wholly or partly produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another by wired (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (such as a Solid State Disk (SSD)).
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A subtitle voice accurate synchronization method, characterized in that the subtitle voice accurate synchronization method cuts the time axis with speech segmentation technology and carries out machine recognition through a speech analysis system; translates the source text intelligently, in 3 stages: source-text analysis, source-to-target transfer, and target-text generation; and transmits the audio and video over the network in streaming mode with a streaming media server.
2. The subtitle voice accurate synchronization method of claim 1, characterized in that cutting the time axis with speech segmentation technology and carrying out machine recognition through the speech analysis system further comprises: removing background noise with a two-pass Wiener filter; removing garbage audio by Gaussian mixture modeling; detecting voice starting points effectively with a harmonic detection technique;
using a cross-word static search space construction method based on WFST, incorporating the acoustic processing characteristics of the speech recognition field into WFST, tightly combining the acoustic model, acoustic context, pronunciation dictionary, and language model through WFST, and searching on the WFST; optimizing the network with a full bitonic merge sort;
performing acoustic model adaptation with a discriminative training criterion that minimizes the sentence error rate;
supporting real-time recognition of 32 audio data streams.
3. The subtitle voice accurate synchronization method of claim 1, characterized in that, for the 3 stages of source-text analysis, source-to-target transfer, and target-text generation: the transfer stage may be combined with the analysis stage while the generation stage is kept independent, establishing a correlated-analysis, independent-generation system; the analysis stage may be kept independent while the transfer stage is combined with the generation stage, establishing an independent-analysis, correlated-generation system; or source-text analysis, transfer, and target-text generation may all be kept independent, establishing an independent-analysis, independent-generation system.
4. A subtitle voice accurate synchronization system implementing the subtitle voice accurate synchronization method of claim 1, characterized in that the subtitle voice accurate synchronization system includes:
a machine recognition module, which, after the time axis is cut with speech segmentation technology, carries out machine recognition through the speech analysis system;
a machine translation module, which translates the source text intelligently on the basis of the speech recognized by machine;
a streaming media live module, which transmits the audio and video over the network in streaming mode with the streaming media server.
5. The subtitle voice accurate synchronization system of claim 4, characterized in that the machine recognition module includes: feature vector extraction, an acoustic model, a language model, and a decoder.
6. A computer program implementing the subtitle voice accurate synchronization method of any one of claims 1 to 3.
7. An information data processing terminal implementing the subtitle voice accurate synchronization method of any one of claims 1 to 3.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the subtitle voice accurate synchronization method of any one of claims 1 to 3.
CN201810289373.3A 2018-04-03 2018-04-03 Subtitle voice accurate synchronization system and method and information data processing terminal Active CN108597497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810289373.3A CN108597497B (en) 2018-04-03 2018-04-03 Subtitle voice accurate synchronization system and method and information data processing terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810289373.3A CN108597497B (en) 2018-04-03 2018-04-03 Subtitle voice accurate synchronization system and method and information data processing terminal

Publications (2)

Publication Number Publication Date
CN108597497A true CN108597497A (en) 2018-09-28
CN108597497B CN108597497B (en) 2020-09-08

Family

ID=63624291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810289373.3A Active CN108597497B (en) 2018-04-03 2018-04-03 Subtitle voice accurate synchronization system and method and information data processing terminal

Country Status (1)

Country Link
CN (1) CN108597497B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369384A (en) * 2019-12-23 2020-07-03 国网河南省电力公司郑州供电公司 Power transformation operation and maintenance hidden danger overall process control system
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510222A (en) * 2009-02-20 2009-08-19 北京大学 Multilayer index voice document searching method and system thereof
CN103297710A (en) * 2013-06-19 2013-09-11 江苏华音信息科技有限公司 Audio and video recorded broadcast device capable of marking Chinese and foreign language subtitles automatically in real time for Chinese
CN103345922A (en) * 2013-07-05 2013-10-09 张巍 Large-length voice full-automatic segmentation method
CN103971686A (en) * 2013-01-30 2014-08-06 腾讯科技(深圳)有限公司 Method and system for automatically recognizing voice
CN104159152A (en) * 2014-08-26 2014-11-19 中译语通科技(北京)有限公司 Automatic timeline generating method specific to film and television videos
US20150042771A1 (en) * 2013-08-07 2015-02-12 United Video Properties, Inc. Methods and systems for presenting supplemental content in media assets
WO2016097165A1 (en) * 2014-12-19 2016-06-23 Softathome Labelled audio-video stream for synchronizing the components thereof, method and equipment for analyzing the artifacts and synchronization of such a stream
CN106448660A (en) * 2016-10-31 2017-02-22 闽江学院 Natural language fuzzy boundary determining method with introduction of big data analysis
CN106649282A (en) * 2015-10-30 2017-05-10 阿里巴巴集团控股有限公司 Machine translation method and device based on statistics, and electronic equipment
CN106816151A (en) * 2016-12-19 2017-06-09 广东小天才科技有限公司 A kind of captions alignment methods and device


Also Published As

Publication number Publication date
CN108597497B (en) 2020-09-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant