CN201355842Y - Large-scale user-independent and device-independent voice message system - Google Patents

Info

Publication number
CN201355842Y
Authority
CN
China
Prior art keywords
message
subsystem
conversion
computer
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNU2007900000221U
Other languages
Chinese (zh)
Inventor
丹尼尔·迈克尔·道尔顿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Ltd
Original Assignee
SpinVox Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SpinVox Ltd

Abstract

The utility model discloses a large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen. The system comprises (i) computer-implemented subsystems and (ii) a network connecting human operators who provide transcription and quality control. The system is adapted to optimise the efficiency of the human operators by means of three core subsystems: (i) a pre-processing front end that determines an appropriate conversion strategy, (ii) one or more conversion sources, and (iii) a quality control subsystem.

Description

A large-scale, user-independent, device-independent voice message system
Background of the invention
1. Technical field
The utility model relates to a large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen. The challenges facing a large-scale, user-independent voice message system that converts unstructured voice messages into text are worth noting at the outset. First, "large-scale" means that the system should scale to very large numbers, for example 500,000+ users (typically the subscribers of a mobile phone operator), while still allowing efficient and rapid processing: a message is generally useful only if it is received within 2 to 5 minutes of being left. This is far more demanding than most ASR applications. Second, "user-independent" means that users never need to train the system to recognise their own voice or speech patterns (unlike conventional voice dictation systems). Third, "device-independent" means that the system is not tied to receiving input from a particular input device; some prior art systems require input from, for example, a touch-tone telephone. Fourth, "unstructured" means that messages have no predefined structure, unlike responses to voice prompts. Fifth, "voice messages" is a very specific and quite narrow field of application, and it poses challenges different from those faced by many conventional automated speech recognition (ASR) systems. For example, voicemail messages left for mobile phones typically include hesitations, "ers" and "ums". A conventional ASR approach would faithfully convert every utterance, even meaningless sounds. A mindset of accurate or verbatim transcription characterises the approach of most workers in the ASR field, but it is in fact inappropriate for voice messaging. In voice messaging, the problem to be solved is not accurate, verbatim transcription, but conveying the meaning to the recipient in the most helpful way.
Only by meeting all five of these requirements can a successful implementation be achieved.
2. Description of the prior art
Conventional speech-to-text (STT) conversion uses automatic speech recognition (ASR) and has, to date, been used mainly for dictation and command tasks. Using ASR technology to convert voicemail into text is a new application with several characteristics peculiar to the task. Reference may be made to WO2004/095821 (the contents of which are incorporated by reference), which discloses a voicemail system from SpinVox that converts voicemail left for a mobile phone into SMS text and sends it to that phone. Managing voicemail in text form is an attractive option. Reading is usually faster than listening to a message and, once in text form, voicemail can easily be stored and retrieved like email or SMS text. In one embodiment, subscribers to the SpinVox service divert their voicemail to a dedicated SpinVox telephone number. Callers leave a voicemail message for the subscriber as usual. SpinVox then converts the message from speech to text, aiming to capture the full meaning, style and idiom of the message without necessarily converting it word for word. The conversion is carried out with a significant level of human input. The text is then sent to the subscriber by SMS or email. As a result, subscribers can manage voicemail as quickly and easily as they manage text and email messages, and can use client applications to integrate their voicemail, now in searchable and archivable text form, with their other messages.
However, the problem with a system that relies largely on human transcription is that it is costly and difficult to scale to mass-market size, for example a user base of 500,000 or more. It is therefore impractical for a major mobile phone operator to offer such a system to its user base: given the fast response times required, having human operators listen to and transcribe every message in full would be too expensive, and the cost per transcribed message would be prohibitively high. The fundamental technical problem is therefore to design an IT-based system that enables human transcribers to operate very efficiently.
WO2004/095821 envisaged some degree of ASR front-end processing combined with human operators: it is in essence a hybrid system. The present utility model develops this and identifies specific tasks that the IT system can perform to greatly increase overall system efficiency.
Hybrid systems are known in other contexts, but the conventional approach to speech conversion is to eliminate the human element entirely; this is the mindset of those skilled in the ASR field, and in STT in particular. We therefore now consider some of the technical background to STT.
The core technology of speech-to-text (STT) is classification. Classification aims to determine which "class" a given piece of data belongs to. Maximum likelihood estimation (MLE), like many statistical tools, uses an underlying model of the data-generating process, whether a coin toss or the human speech production system. The parameters of the underlying model are estimated so as to maximise the probability that the model generated the data. Classification decisions are then made by comparing features obtained from the test data with the model parameters obtained from the training data for each class. The test data is then classified as belonging to the class with the best match. The likelihood function describes how the probability of the observed data varies with the parameters of the model. If the likelihood function and its derivative are known or can be estimated, the maximum likelihood can be found from the turning points of the likelihood function. Methods for estimating the maximum likelihood include simple gradient descent and the faster Gauss-Newton method. If, however, the likelihood function and its derivative are not available, an algorithm based on the Expectation-Maximization (EM) principle can be used; it starts from an initial estimate and converges to a local maximum of the likelihood function of the observed data.
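The coin-toss case mentioned above can be sketched in a few lines. This is a minimal illustration, not part of the system described: the function names, the learning rate and the observed counts are assumptions chosen for the example. It finds the turning point of the log-likelihood in closed form and then confirms it by following the derivative (gradient ascent on the log-likelihood, which is equivalent to gradient descent on its negative).

```python
import math

def log_likelihood(p, heads, tails):
    """Log-likelihood of a coin with P(heads) = p given the observed counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

def mle_closed_form(heads, tails):
    """Setting the derivative of the log-likelihood to zero gives heads / n."""
    return heads / (heads + tails)

def mle_gradient_ascent(heads, tails, p=0.5, rate=0.001, steps=5000):
    """Climb the log-likelihood by repeatedly stepping along its derivative."""
    for _ in range(steps):
        gradient = heads / p - tails / (1.0 - p)
        p += rate * gradient
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep p strictly inside (0, 1)
    return p

# Hypothetical observation: 7 heads and 3 tails.
heads, tails = 7, 3
print(mle_closed_form(heads, tails))
print(round(mle_gradient_ascent(heads, tails), 4))
```

Both routes agree on the same estimate; the iterative route is the one that generalises to models, such as speech models, where no closed form exists.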
STT uses supervised classification, in which the classes are defined by training data, usually as triphone units, meaning a particular phoneme spoken in the context of the preceding and following phonemes. (Unsupervised classification, in which the classes are derived by the classifier, can be thought of as clustering of the data.) Classification in STT needs to determine not only which triphone class each sound in the speech signal belongs to but, more importantly, which sequence of triphones is most likely. This is usually achieved by modelling speech with a hidden Markov model (HMM), which represents the way the features of speech vary with time. The parameters of the HMM can be determined using the Baum-Welch algorithm, which is a form of EM.
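Finding the most likely sequence of hidden classes in an HMM is commonly done with the Viterbi algorithm. The sketch below is a toy illustration under stated assumptions: it uses two made-up states ("silence" and "speech") and two made-up observation symbols instead of triphone classes and acoustic features, and all probabilities are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical toy model, not taken from the system described.
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}
print(viterbi(["quiet", "loud", "loud", "quiet"], states, start_p, trans_p, emit_p))
```

Note that the final "quiet" frame is still labelled as speech: the transition probabilities make staying in the same state more plausible than switching, which is exactly the time-dependent behaviour an HMM is meant to capture.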
The classification task undertaken by the SpinVox system can be stated in simplified form as: "Of all the possible strings of text that could be used to represent the message, which is most likely given the recorded voicemail speech signal and the properties of the language used in voicemail?" Clearly, this is a very large and extremely complex classification problem.
Automatic speech recognition (ASR) engines have been under development in research laboratories around the world for more than twenty years. Recently, the development of large-vocabulary ASR for continuous speech has been driven by applications including dictation systems and call centre automation, the most prominent examples being "Naturally Speaking" (Nuance) and "How May I Help You" (AT&T). It has become clear that the successful deployment of a voice-based system depends as much on system design as on ASR performance, which may be why ASR-based systems have not yet been taken up by mainstream IT and telecommunications users.
An ASR engine has three main elements.
1. Feature extraction is performed on the input speech signal approximately every 20 ms to extract a representation of the speech that is compact and, as far as possible, free of artefacts such as phase distortion and handset variation. Mel-frequency cepstral coefficients are often chosen, and it is known that linear transformations can be applied to the coefficients before recognition to improve their ability to discriminate between the various sounds of speech.
2. An ASR engine typically uses a set of models based on triphone units, representing all the different speech sounds and their preceding and following transitions. The parameters of these models are learned by the system before use from suitable training examples of speech. The training process estimates the probability of occurrence of each sound, the probabilities of all possible transitions, and a set of grammar rules that constrain the word order and sentence structure of the ASR output.
3. An ASR engine uses a pattern classifier to determine the text most likely to have produced the input speech signal. Hidden Markov model classifiers are usually preferred because they can classify sequences of sounds independently of speaking rate and have a structure well suited to modelling speech.
An ASR engine outputs the most likely text in the sense that the match between the input speech and the corresponding models is optimised. In addition, however, ASR must also take into account how likely the recogniser's output is as text in the target language. As a simple example, "see you at the cinema at eight" is more likely in general English usage than "see you at the cinema add eight", even though analysis of the speech waveform may be more likely to detect "add" than "at". The statistical study of the occurrence of elements of language is called language modelling. ASR usually uses both acoustic modelling, which refers to analysis of the speech waveform, and language modelling; together they significantly improve recognition performance.
The simplest language model is the unigram model, which contains the frequency of occurrence of each word in the vocabulary. Such a model is built by analysing a large body of text to estimate how likely each word is to occur. More sophisticated modelling uses n-gram models, which contain the frequencies of occurrence of sequences of n elements. n=2 (bigram) or n=3 (trigram) is commonly used. Such language models are computationally more expensive but capture language usage more accurately than a unigram model. For example, a bigram word model can indicate that "degrees" is highly likely to be followed by "centigrade" or "fahrenheit" and unlikely to be followed by "centipede" or "foreigner". Research into language modelling is under way worldwide. Issues include improving the intrinsic quality of the models, introducing syntactic structural constraints into the models, and developing computationally efficient ways of adapting language models to different languages and accents.
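The "degrees" example can be reproduced with a very small bigram model. This is a sketch only: the four-sentence corpus is invented for illustration, and add-alpha smoothing is one common choice assumed here so that unseen word pairs do not receive zero probability.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts

def bigram_probability(counts, previous, current, vocabulary_size, alpha=1.0):
    """P(current | previous) with add-alpha smoothing for unseen pairs."""
    followers = counts[previous]
    total = sum(followers.values())
    return (followers[current] + alpha) / (total + alpha * vocabulary_size)

# Hypothetical training text; a real model would use a large corpus.
corpus = [
    "it is twenty degrees centigrade today",
    "about seventy degrees fahrenheit outside",
    "ten degrees centigrade at night",
    "a centipede is not a foreigner",
]
vocabulary = {word for sentence in corpus for word in sentence.lower().split()}
counts = train_bigrams(corpus)

for candidate in ("centigrade", "fahrenheit", "centipede", "foreigner"):
    p = bigram_probability(counts, "degrees", candidate, len(vocabulary))
    print(f"P({candidate} | degrees) = {p:.3f}")
```

A recogniser can multiply such language-model probabilities with acoustic scores, so that an acoustically plausible but linguistically unlikely word loses to a likely one.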
The best large-vocabulary, speaker-independent, continuous-speech ASR systems claim recognition rates above 95%, that is, fewer than one word in 20 is wrong. However, this error rate is still too high to win the user trust needed for large-scale adoption of the technology. In addition, ASR performance drops sharply when the speech contains noise or when the characteristics of the speech do not match well those of the data used to train the recogniser models. Specialised vocabulary or colloquialisms are not recognised well without additional training.
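Recognition rates of this kind are usually reported through the word error rate: the number of word substitutions, insertions and deletions needed to turn the recogniser output into the reference transcript, divided by the reference length. The sketch below assumes whitespace tokenisation and equal cost for each edit type; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)

# One wrong word in seven, using the example sentence from the text.
print(word_error_rate("see you at the cinema at eight",
                      "see you at the cinema add eight"))
```

One wrong word in 20 gives a word error rate of 0.05, which corresponds to the 95% recognition rate quoted above.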
To build and successfully deploy ASR-based voice systems, it is clearly necessary to optimise the technology specifically for the application and to achieve increased reliability and robustness at the system level.
Summary of the utility model
To date, no one has fully explored the practical design requirements of a large-scale, user-independent, hybrid voice message system that can convert unstructured voice messages into text. The key application is converting voicemail sent to a mobile phone into text and email; other applications in which users would rather speak a message than type it on a keyboard (of any kind) are also possible. Examples include instant messaging, where the user speaks a response that is captured as part of an IM thread; speak-a-text, where the user speaks a message that he intends to send as a text message, whether as an originating communication or as a reply to a voice or text message; and speak-a-blog, where the user speaks the words he wants to appear on a blog, and those words are then converted to text and added to the blog. In fact, wherever there is a need for, or a potential benefit from, letting a user speak a message rather than type it directly, and having that message converted to text and displayed on a screen, the large-scale, user-independent hybrid voice message system described in this specification can be used.
To address the above problem, the utility model provides a large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen. The system comprises (i) computer-implemented subsystems and (ii) a network connecting human operators who provide transcription and quality control. The system is adapted to optimise the efficiency of the human operators by including:
three core subsystems, namely (i) a pre-processing front end that determines an appropriate conversion strategy; (ii) one or more conversion sources; and (iii) a quality control subsystem;
wherein the three core subsystems are each connected by the network.
The conversion sources include one or more of the following: one or more ASR engines; a signal processing source; human operators.
The signal processing source optimises audio quality for conversion by performing one or more of the following functions: removing noise, removing known defects, normalising volume/signal energy, and removing silent/empty sections.
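Two of these functions, removing silent sections and normalising volume, can be sketched as follows. This is an illustration under stated assumptions: audio is represented as a plain list of floating-point samples, and the frame size and silence threshold are arbitrary example values, not parameters of the system described.

```python
def remove_silence(samples, frame_size=4, threshold=0.05):
    """Drop frames whose mean absolute amplitude is below the threshold."""
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept

def normalise_peak(samples, target_peak=1.0):
    """Scale the signal so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale in an empty or all-zero signal
    return [s * target_peak / peak for s in samples]

# Hypothetical signal: near-silence, a burst of speech, then digital silence.
signal = [0.0, 0.01, 0.0, -0.01, 0.2, -0.4, 0.3, -0.1, 0.0, 0.0, 0.0, 0.0]
cleaned = normalise_peak(remove_silence(signal))
print(cleaned)
```

Silence is removed before normalisation so that the gain is computed only from the frames that are actually passed on for conversion.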
The human operators carry out random quality assurance, testing converted messages and providing feedback to the pre-processing front end and/or the conversion sources.
The above voice message system may further comprise a computer implemented as a context subsystem, adapted to use context information about a message or part of a message to improve conversion accuracy.
The context information is used to restrict the vocabulary used by the ASR engine, or to refine the search or matching process used by the ASR engine.
The context information is used to select a conversion source or a combination of conversion sources.
The context information includes one or more of: caller ID; recipient ID; whether the caller or recipient is a business or another type of classifiable entity; the caller's language; call pair history; time of call; date of call; geographic reference or location data for the caller or callee; PIM data for the caller or callee; the message type, including whether the message is a voicemail, a spoken text, an instant message, a blog entry, an email, a memo or a note; message length; information discoverable using an online knowledge corpus; presence data; the speech density of the message; the speech quality of the message.
The context subsystem includes, or operates with, a recognition confidence subsystem, which uses the context information to determine automatically the confidence level associated with the conversion of the message or part of the message.
The context subsystem includes, or operates with, a recognition confidence subsystem, which uses the output of one or more ASR engines to determine automatically the confidence level associated with the conversion of the message or part of the message.
The recognition confidence subsystem dynamically weights the outputs of the different ASR engines according to their effectiveness or accuracy.
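One way such dynamic weighting might work is sketched below. Everything here is an assumption made for illustration: the class name, the use of an exponential moving average of past correctness as the weight, and weighted voting as the combination rule are not specified in the text.

```python
class EngineWeights:
    """Track each engine's recent accuracy and use it to weight its votes."""

    def __init__(self, engines, smoothing=0.2):
        self.smoothing = smoothing
        self.accuracy = {name: 0.5 for name in engines}  # neutral starting point

    def record_result(self, engine, was_correct):
        """Update a moving average of correctness, e.g. from operator QA feedback."""
        old = self.accuracy[engine]
        self.accuracy[engine] = (1 - self.smoothing) * old + self.smoothing * float(was_correct)

    def combine(self, hypotheses):
        """hypotheses maps engine name to proposed text; return (best text, confidence)."""
        scores = {}
        for engine, text in hypotheses.items():
            scores[text] = scores.get(text, 0.0) + self.accuracy[engine]
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())

# Hypothetical engines "a", "b" and "c".
weights = EngineWeights(["a", "b", "c"])
outputs = {"a": "at eight", "b": "add eight", "c": "add eight"}
print(weights.combine(outputs))      # equal weights: the majority wins

for _ in range(5):                   # feedback: "a" keeps being right, the others wrong
    weights.record_result("a", True)
    weights.record_result("b", False)
    weights.record_result("c", False)
print(weights.combine(outputs))      # the proven engine now outweighs the other two
```

The returned confidence could then feed the choice of conversion strategy, for example whether a human operator needs to check the message.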
Knowledge of the context information is extracted by one subsystem and passed to a downstream subsystem, which uses the context information to improve conversion performance; the downstream subsystem is a quality monitoring and/or assurance and control subsystem.
The above voice message system may further comprise:
a computer implemented as a call pair subsystem, adapted to use call pair history information to improve conversion accuracy.
The call pair history allows the system to remain user-independent while acquiring, over time and without explicit user training, user-dependent data that improves conversion performance.
The call pair history is associated with a pair of numbers, the numbers including those associated with mobile phones, landline phones, IP addresses, email addresses or unique addresses provided by a network.
The call pair history includes information relating to one or more of: the language or dialect likely to be used; the country called from or the country called to; time zone; time of call; date of call; phrases used; the caller's language; intonation; PIM data.
The above voice message system may further comprise a computer implemented as a dynamic language model subsystem, adapted to build dynamic language models that are one or more of the following: caller dependent; call pair dependent; callee dependent.
Here, a caller is anyone who leaves a voice message, whether or not they intended to make a voice call; a callee is anyone who reads the converted message, whether or not they intended to receive a voice call.
The above voice message system may further comprise a computer implemented as a personal profile subsystem, adapted to build a personal profile of the caller to improve conversion accuracy.
The personal profile includes the caller's words, phrases, grammar or intonation.
The above voice message system may further comprise: a computer implemented as a boundary selection subsystem, adapted to process a message by looking for boundaries between sections of the message that carry different types of content or different types of message.
The computer implemented as the boundary selection subsystem analyses one or more of the following sections: a greeting section; a body section; a goodbye section.
Different conversion strategies are applied to each section, the strategy applied being the one best suited to that section.
Different sections of a message have different quality requirements, and the quality assessment subsystem applies different standards to these different sections.
A speech quality estimator detects boundaries between sections of the message that carry different types of content or different types of message.
A boundary is detected or inferred at a region of the message where the speech density changes.
A boundary is detected or inferred at a pause in the message.
A boundary is inferred to occur at a preset proportion of the message.
The greeting boundary is inferred to lie 15% of the way through the total message length.
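The last three rules can be combined in a short sketch. The 15% figure comes from the text; the function name, the idea of snapping the preset point to a nearby detected pause, and the one-second window are assumptions made for illustration.

```python
def infer_greeting_boundary(duration_s, pauses_s=(), greeting_fraction=0.15,
                            snap_window_s=1.0):
    """Return the inferred end of the greeting, in seconds from the start.

    Start from the preset proportion of the message length, then prefer a
    detected pause if one lies within snap_window_s of that point.
    """
    default = duration_s * greeting_fraction
    nearby = [p for p in pauses_s if abs(p - default) <= snap_window_s]
    if nearby:
        return min(nearby, key=lambda p: abs(p - default))
    return default

# Hypothetical 40-second message with pauses detected at 2.0 s, 5.4 s and 20.0 s.
print(infer_greeting_boundary(40.0))
print(infer_greeting_boundary(40.0, pauses_s=(2.0, 5.4, 20.0)))
```

Once the greeting, body and goodbye sections are separated, each can be sent to the conversion strategy and quality standard best suited to it.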
The above voice message system may further comprise: a computer implemented as a pre-processing front-end subsystem, which determines an appropriate conversion strategy for converting a voice message.
The pre-processing front end optimises audio quality for conversion by performing one or more of the following functions: removing noise; removing known defects; normalising volume/signal energy; removing silent/empty sections; and classifying the message type so that the message can be routed optimally for conversion, or so that it is not converted at all.
The pre-processing front end determines the language used by the caller based on knowledge of one or more of the registration, location and call history of the caller and/or recipient.
The pre-processing front end selects an ASR engine to convert the message or part of the message.
Different conversion sources are used for different parts of the same message.
Different conversion sources are used for different messages.
Human operators are treated as ASR engines.
The pre-processing front end uses, or is connected to, a recognition confidence subsystem to determine automatically the confidence level associated with the conversion of the message or part of the message, and a conversion source is then used according to that confidence level.
The conversion strategy is selected from a set of conversion strategies comprising: (i) messages for which ASR conversion confidence is sufficiently high are checked automatically by the quality assessment subsystem for conformance with quality standards; (ii) messages for which ASR conversion confidence is not high enough are routed to a human operator for checking; (iii) messages for which ASR conversion confidence is very low are flagged as unconvertible, and the user is notified that an unconvertible message has been received.
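The three-way choice in the last paragraph maps naturally onto two thresholds. The sketch below is illustrative only: the threshold values 0.9 and 0.3 and the names used are assumptions, since the text does not state what counts as "sufficiently high" or "very low" confidence.

```python
from enum import Enum

class Strategy(Enum):
    AUTO_QUALITY_CHECK = "checked automatically by the quality assessment subsystem"
    HUMAN_REVIEW = "routed to a human operator for checking"
    UNCONVERTIBLE = "flagged as unconvertible; user notified"

def choose_strategy(confidence, high=0.9, low=0.3):
    """Pick a conversion strategy from the ASR conversion confidence (0 to 1)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Strategy.AUTO_QUALITY_CHECK
    if confidence >= low:
        return Strategy.HUMAN_REVIEW
    return Strategy.UNCONVERTIBLE

for value in (0.95, 0.6, 0.1):
    print(value, "->", choose_strategy(value).value)
```

Because human operators are the costly resource, raising or lowering the two thresholds is a direct way to trade transcription cost against quality.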
The above voice message system may further comprise: a computer implemented as a queue manager subsystem, which intelligently manages loading and calls on resources as required, to ensure that converted-message delivery times meet a predefined standard.
The queue manager subsystem determines what should happen to a voice message at each stage of its processing by the system.
If, at any automatic conversion stage, a confidence interval or other measure indicates that any part of the message is not good enough, the queue manager routes it to the correct human operator for assistance.
The queue manager subsystem makes decisions by computing the trade-off between conversion time and quality.
The queue manager subsystem uses a state machine that can determine, for any given language queue, how best to process a message through the system.
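A very small state machine of this kind might look as follows. This is a sketch under loudly stated assumptions: the state names, the confidence threshold and, in particular, the policy of skipping the human operator when the operator queue would break the delivery deadline are all hypothetical. Only the 5-minute upper bound is taken from the text, which says messages are useful within about 2 to 5 minutes.

```python
DEADLINE_S = 300  # upper end of the 2 to 5 minute delivery window

def next_state(state, confidence=0.0, elapsed_s=0.0, operator_wait_s=0.0,
               min_confidence=0.9):
    """Return the next processing state for one message."""
    if state == "RECEIVED":
        return "ASR"
    if state == "ASR":
        if confidence >= min_confidence:
            return "QUALITY_CHECK"
        # Time/quality trade-off: use a human only if delivery can still be on time.
        if elapsed_s + operator_wait_s <= DEADLINE_S:
            return "HUMAN_OPERATOR"
        return "QUALITY_CHECK"  # too late for a human pass; rely on the quality check
    if state == "HUMAN_OPERATOR":
        return "QUALITY_CHECK"
    if state == "QUALITY_CHECK":
        return "DELIVERED"
    raise ValueError(f"unknown state: {state}")

def run(confidence, operator_wait_s, elapsed_s=20.0):
    """Drive one message through the state machine and return the states visited."""
    state, trace = "RECEIVED", ["RECEIVED"]
    while state != "DELIVERED":
        state = next_state(state, confidence, elapsed_s, operator_wait_s)
        trace.append(state)
    return trace

print(run(confidence=0.95, operator_wait_s=60))
print(run(confidence=0.50, operator_wait_s=60))
print(run(confidence=0.50, operator_wait_s=400))
```

A separate instance of the same machine per language queue would let each queue apply its own thresholds and operator pools.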
The above voice message system may further comprise: a computer implemented as a lattice subsystem, which generates a lattice of possible word or phrase sequences and enables a human operator to guide the conversion subsystem, either by displaying one or more candidate words or phrases from the lattice for the operator to select, or by letting the operator enter one or more characters of a different converted word or phrase so that the conversion subsystem proposes alternative words or phrases.
The lattice subsystem receives input from a subsystem that processes call pair history information; or receives input from a conversion source; or receives input from a context subsystem that has knowledge of the message context; or learns from human operator input which words are likely to correspond to an acoustic pattern.
The human operator need only press a single key to accept a word or phrase.
The lattice subsystem can automatically provide capitalisation and punctuation, and proposes candidate numbers, proper nouns, web addresses, email addresses, physical addresses, location information or other coordinates.
The lattice subsystem automatically distinguishes the important parts of a message from the unimportant parts; the unimportant parts are confirmed by the operator as belonging to a class proposed by the lattice subsystem and are then converted solely by a machine ASR engine.
The human operator can speak the correct word to the conversion system, which then transcribes it automatically.
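The operator interaction described above can be sketched with a simplified lattice. The representation is an assumption made for illustration: each position in the message holds a list of scored alternatives (a real lattice is a graph of word sequences), an empty operator input accepts the top candidate, and typed characters filter the alternatives by prefix.

```python
def candidates(slot, typed=""):
    """Return the words in one lattice slot that start with the typed characters,
    best score first. slot is a list of (word, score) alternatives."""
    matching = [(w, s) for w, s in slot if w.lower().startswith(typed.lower())]
    return [w for w, _ in sorted(matching, key=lambda pair: -pair[1])]

def transcribe(lattice, operator_input):
    """operator_input holds, per slot, the characters typed ('' accepts the top choice)."""
    words = []
    for slot, typed in zip(lattice, operator_input):
        options = candidates(slot, typed)
        # If nothing in the lattice matches, fall back to what the operator typed.
        words.append(options[0] if options else typed)
    return " ".join(words)

# Hypothetical lattice for the example sentence used earlier in the text.
lattice = [
    [("see", 0.9), ("sea", 0.1)],
    [("you", 0.9), ("ewe", 0.1)],
    [("at", 0.8)],
    [("the", 0.9)],
    [("cinema", 0.7), ("senna", 0.2)],
    [("add", 0.5), ("at", 0.4)],
    [("eight", 0.8), ("ate", 0.2)],
]
print(transcribe(lattice, ["", "", "", "", "", "", ""]))
print(transcribe(lattice, ["", "", "", "", "", "at", ""]))
```

In this toy run the operator corrects the whole message with two keystrokes rather than retyping it, which is the efficiency gain the lattice subsystem is aiming for.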
The above voice message system may further comprise: a computer implemented as a search subsystem, which analyses converted messages against an online knowledge corpus.
The online knowledge corpus is the internet, accessed via a search engine; or the online knowledge corpus is a search engine database.
Analysis of the converted message enables a human operator and/or the recognition confidence subsystem to assess the accuracy of the conversion.
Analysis of the converted message enables a human operator and/or an ASR engine to resolve ambiguities in the message.
The above voice message system may further comprise: a computer implemented as a detector subsystem, adapted to detect hang-ups.
The hang-up detector is implemented as part of the pre-processing front end.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect different spoken languages.
The language detector detects changes of language within a message.
The language detector uses input from a subsystem holding call pair history information, where the call pair history records how the language changed in earlier messages.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to estimate speech quality.
The speech quality estimator finds drop-outs, estimates noise levels, computes an overall measure of speech quality, and uses a suitable threshold to reject the lowest-quality messages.
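The four steps of the speech quality estimator can be sketched as below. This is a deliberately crude illustration: the frame size, the drop-out level, the use of the quietest live frame as the noise estimate, the way the overall measure is combined and the 0.5 rejection threshold are all assumptions, not details given in the text.

```python
def frame_energies(samples, frame_size=4):
    """Mean absolute amplitude of each consecutive frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]

def speech_quality(samples, frame_size=4, dropout_level=1e-4):
    """Return (dropout_fraction, noise_level, quality), with quality in [0, 1]."""
    energies = frame_energies(samples, frame_size)
    live = [e for e in energies if e >= dropout_level]
    if not live:
        return 1.0, 0.0, 0.0  # nothing but drop-outs (or no audio at all)
    dropout_fraction = 1.0 - len(live) / len(energies)
    noise_level = min(live)    # quietest live frame approximates background noise
    signal_level = max(live)
    clarity = 1.0 - noise_level / signal_level  # 0 when noise is as loud as speech
    return dropout_fraction, noise_level, clarity * (1.0 - dropout_fraction)

def accept(samples, threshold=0.5):
    """Reject messages whose overall quality falls below the threshold."""
    return speech_quality(samples)[2] >= threshold

# Hypothetical signals: a clean message, and one with drop-outs and loud noise.
clean = [0.01, -0.01, 0.01, -0.01, 0.5, -0.5, 0.5, -0.5, 0.4, -0.4, 0.4, -0.4]
poor = [0.0] * 8 + [0.3, -0.3, 0.3, -0.3, 0.28, -0.28, 0.28, -0.28]
print(accept(clean), accept(poor))
```

Rejecting hopeless audio early keeps both ASR engines and human operators from spending time on messages that cannot be converted reliably.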
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect inadvertent calls.
The inadvertent call detector is implemented as part of the pre-processing front end.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect and convert pre-recorded messages.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect and convert spoken numbers.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect and convert spoken addresses.
The above voice message system may further comprise a computer implemented as a detector subsystem adapted to detect and convert candidate proper nouns, numbers, web addresses, email addresses, physical addresses, location information and coordinates.
In the above voice message system, the message is a voicemail intended for a mobile phone, and the voice message is converted to text and sent to that mobile phone.
The message is a voice message intended for an instant messaging service, and the voice message is converted to text and sent to the instant messaging service for display on a screen.
The message is a voice message intended for a web blog, and the voice message is converted to text and sent to a server for display as part of the web blog.
The message is a voice message intended to be converted to text form and sent as a text message.
The message is a voice message intended to be converted to text form and sent as an email message.
The message is a voice message intended to be converted to text form and sent to the message's originator by email or text as a note or memo.
The utility model also provides a mobile phone network connected to the above voice message system.
Further aspects are given in Appendix III. The utility model contributes to the field of designing large-scale, user-independent, device-independent voice message systems that convert unstructured voice messages into text for display on a screen. As noted above, this field presents the system designer with many challenges different from those of other fields in which ASR has been used in the past.
The large-scale, user-independent, device-independent voice message system provided by the utility model has the following advantages:
First, "large-scale" means that the system should scale to very large numbers, for example 500,000+ users (typically the subscribers of a mobile phone operator), while still allowing efficient and rapid processing: a message is generally useful only if it is received within 2 to 5 minutes of being left. This is far more demanding than most ASR applications. Second, "user-independent" means that users never need to train the system to recognise their own voice or speech patterns (unlike conventional voice dictation systems). Third, "device-independent" means that the system is not tied to receiving input from a particular input device; some prior art systems require input from, for example, a touch-tone telephone. Fourth, "unstructured" means that messages have no predefined structure, unlike responses to voice prompts. Fifth, "voice messages" is a very specific and quite narrow field of application, and it poses challenges different from those faced by many conventional automated speech recognition (ASR) systems.
Description of drawings
The utility model is described with reference to the accompanying drawings, in which Fig. 1 and Fig. 2 are schematic diagrams of the large-scale, user-independent, device-independent voice message system defined by the utility model, which converts unstructured voice messages to text for display on a screen. Fig. 3 and Fig. 4 are examples of how the system presents possible word and phrase choices to a human operator for acceptance or refinement.
Embodiment
The designers of the SpinVox system faced many challenges:
Automatic speech recognition and language models
First, it was clear to the designers that established ASR technology alone is not sufficient to provide reliable STT for voicemail (and other large-scale, user-independent voice message applications). ASR relies on assumptions built from theoretical models of speech and language, including, for example, language models containing word prior probabilities and grammar rules. Many, if not all, of these assumptions and rules are generally invalid for voicemail speech. Factors found in voicemail STT applications that are beyond the capability of standard ASR technology include:
● voice quality is subject to environmental noise, handset and codec variation, and network artefacts, including noise and drop-outs;
● users do not know that they are talking to an ASR system and leave messages in natural, sometimes ill-structured, language;
● the language itself and the accent used in voicemail are unconstrained and unpredictable;
● vocabulary changes rapidly even within the same language; for example, language statistics may shift as a result of major current events.
IT infrastructure
The IT infrastructure design needed to maintain the availability and quality of the SpinVox service places strict demands on computing power, network and storage bandwidth, and server availability. Load on the SpinVox system is subject to unpredictable peaks as well as more predictable cyclic variations.
Unconvertible messages
It is to be expected that a small proportion of messages cannot be converted. These may be empty messages, such as "slam-downs", messages in an unsupported language, or unintentionally dialled calls.
Quality assessment
Assessing quality at each stage of the SpinVox system is a challenge in itself. Signal processing offers a large number of analysis techniques that can be applied to speech signals, ranging from straightforward SNR measurement to more sophisticated techniques, including explicit detection of common artefacts. However, the significance of these direct measurements is not apparent in itself; it has to be assessed, for example, through their impact on the subsequent conversion process. Similarly, ASR confidence can be measured from the output probabilities of alternative recognition hypotheses but, as before, what matters is measuring the effect on the quality of the overall text conversion, and the quality control must be sophisticated enough to achieve this.
User experience and human factors
The value of the system to the consumer depends heavily on how successfully human factors are accommodated in the design. Users will quickly lose confidence in the system if they receive garbled messages or find that the system is not simple to use.
These challenges are addressed in the SpinVox system design, as described below:
System design
The simplified block diagram of the SpinVox system design in Fig. 1 shows the main functional units. At the core is the ASR engine 1. SpinVox draws a clear distinction between ASR and full STT conversion. ASR 1 is the subsystem that generates "raw" text from the input speech signal. It is a key component of STT conversion, but only one of several important subsystems needed to achieve reliable STT conversion. The front-end pre-processing subsystem 2 performs a broad classification of the speech signal, which can be used to determine the conversion strategy in terms of the combination of ASR engine, model set and speech enhancement processing. The quality assessment subsystem 3 measures the quality of the input speech and the ASR confidence, from which the quality control strategy can be determined. The quality control subsystem 4 operates on the ASR output. Its purpose is to produce semantically correct, meaningful and idiomatic text that conveys the message within the constraints of the text format. Knowledge of context built up over time, including caller ID, the recipient and caller-specific language models, can substantially improve conversion quality compared with the raw ASR output. The converted text is finally output from the post-conversion language processing subsystem 5 to the SMS text and email gateways.
Key features
The main features of the approach adopted by SpinVox are:
● Meaningful message conversion
The text conversion captures the message (its meaning, style and idiom) but is not necessarily a word-for-word conversion of the voicemail.
● Turnaround time
The turnaround time for converted messages is guaranteed.
● Reliability
The system never sends a garbled text message. The user is alerted to an unconvertible message, which can be listened to in the conventional way.
● Standard language
Messages are sent in standard language, not in abbreviated "text-speak".
● Wide availability
The system operates entirely at the infrastructure level; apart from call forwarding, it places no requirements on the handset or the network.
● Adaptive operation
The system can optimise performance using embedded quality control strategies, driven by knowledge built up over time from language modelling of voicemail in general and from caller-specific language modelling. In addition, the system can choose from a number of possible speech-to-text conversion strategies based on the characteristics of the voicemail message. Voicemail message data and the corresponding text conversions are analysed continuously so as to update and adapt the SpinVox STT system.
● Quality monitoring
The quality of the speech-to-text conversion can be monitored at each stage, so that quality control, whether by human or automatic agents, can be carried out effectively.
● Language processing
Post-conversion language processing 5 can be performed to improve the quality of the converted message text, remove obvious redundancies and validate elements of the message, such as commonly used greeting structures.
● State-of-the-art ASR
Commercial ASR engines can be used so as to take competitive advantage of state-of-the-art ASR technology. Different ASR engines can be used to process different messages, or even different parts of the same message (with decision unit 2 determining which engine to use). Human operators can themselves be regarded as instances of an ASR engine, suited to some tasks but not to others.
● Stable and secure
The service runs on highly stable and secure Unix servers and can adapt to the demand for different languages as different time zones experience peaks spread across each 24-hour period.
Quality control
SpinVox has developed a detailed understanding of what users expect and want from telephony-based messaging systems. Users have zero tolerance for nonsensical speech-to-text conversion, particularly where there is evidence that the errors were introduced by a machine rather than by human error. Quality control of the converted text is therefore of key importance. Three alternative quality strategies are available, and decision unit 2 selects the best one. (i) Messages for which the ASR conversion confidence is sufficiently high can be checked automatically by quality assessment subsystem 3 for conformance to quality standards. (ii) Messages for which the ASR conversion confidence is not sufficiently high can be routed to a human agent 4 for checking and, if necessary, correction. (iii) Messages for which the ASR conversion confidence is very low are flagged as unconvertible, and the user is informed that an unconvertible message has been received. The user can listen to an unconvertible message, if they wish, with a single key-press. The outcome of these strategies is that the SpinVox system is designed to prefer a failure to convert over a conversion that contains errors. User confidence in the system is therefore protected. SpinVox statistics indicate that a substantial proportion of voicemails are converted successfully.
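By way of illustration only, the three-way choice between quality strategies (i) to (iii) can be sketched as a simple threshold test on the ASR confidence. The numeric thresholds and the names `select_strategy`, `HIGH_CONFIDENCE` and `LOW_CONFIDENCE` are assumptions made for this sketch; the description gives no numeric values and does not state how decision unit 2 is implemented.

```python
from enum import Enum


class Strategy(Enum):
    AUTO_CHECK = "automatic check by quality assessment subsystem 3"
    HUMAN_AGENT = "route to human agent 4 for checking and correction"
    UNCONVERTIBLE = "flag as unconvertible and notify the user"


# Hypothetical thresholds: the source describes the three bands only qualitatively.
HIGH_CONFIDENCE = 0.85
LOW_CONFIDENCE = 0.40


def select_strategy(asr_confidence: float) -> Strategy:
    """Pick one of the three quality strategies from an ASR confidence in [0, 1]."""
    if not 0.0 <= asr_confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if asr_confidence >= HIGH_CONFIDENCE:
        return Strategy.AUTO_CHECK        # strategy (i)
    if asr_confidence >= LOW_CONFIDENCE:
        return Strategy.HUMAN_AGENT       # strategy (ii)
    return Strategy.UNCONVERTIBLE         # strategy (iii): fail rather than send errors
```

Because anything below the lower threshold falls through to the unconvertible case, the sketch fails towards non-conversion rather than towards an erroneous conversion.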
One of the important tools that SpinVox uses to improve the quality of converted messages is knowledge of the language used in voicemail messages (common phrases, common greetings and sign-offs, and so on). From data accumulated over time, statistical language models specific to voicemail speech can be developed and used to guide the STT conversion process. This greatly improves conversion accuracy for non-standard language constructions.
The most apparent feature of SpinVox is that it supplies a service that many users did not realise they needed but soon find they cannot manage without. It is the first real-time system to provide speech-to-text conversion. Its impact for network operators is increased network traffic, for both voice and data, arising from improved call continuity. The operational success of the SpinVox system has been achieved by taking a design approach that puts quality of service first and technology second. The system design is based on a detailed understanding of customers' expectations of the service and, even more importantly from an engineering perspective, on a detailed understanding of the strengths and weaknesses of ASR technology. By exploiting the strengths of ASR and factoring out its weaknesses through stringent quality control, SpinVox is an effective deployment that meets the practical design requirements of a large-scale, user-independent, hybrid unstructured voice message system.
SpinVox has demonstrated success in providing a speech-processing-based service by targeting its conversion technology at a very specific application, voicemail conversion. This suggests that designing a system around a well-defined application is a more productive approach than what appears to be an endless search for ever-increasing improvements in raw performance measures of, for example, ASR engines. This approach opens up the possibility of new applications that draw on SpinVox's technology components and system design know-how.
As system architects and experts, SpinVox has developed significant know-how for speech-based applications, covering speech recognition, telecommunications applications, cellular networks and human factors. The opportunities for growth and development in advanced messaging technologies are likely to point towards the integration of voice and text messaging, thereby facilitating the retrieval, management and archiving of voicemail, with all the advantages of currently popular email and SMS text messaging, including simplicity of operation and self-documentation. This development parallels the convergence of voice and data in other communication systems.
The SpinVox system turns what is, seen from the outside, a speaker-independent problem into a speaker-dependent one, which is a significant insight into how speech works in telephony. Why? Because it exploits the fact that calls, messaging and other communication are socially driven: 80% of voicemail comes from 7-8 people. For SMS the figure is only 5-6, and for IM only 2-3. SpinVox uses "call-pair" history to do several things:
1. build a profile of what a specific caller sounds like each time they call: a speaker-dependent speaker model (how the caller speaks: intonation and so on);
2. build a language model of what that caller says to someone: a speaker-dependent language model (what the caller says: words, grammar, phrases and so on);
3. in 1 and 2 we are really building a language model of how A speaks to B. This is usually more accurate than a model of how A speaks in general. It is specific both to the type of messaging (how you speak in a message) and to how you speak to B (for example, the way a man leaves a message for his mother differs greatly in intonation, grammar, phrases, accent and so on from the way he speaks to his wife);
4. SpinVox thus moves from a general speaker-independent and language-independent model to a speaker-recipient pair model, without any user input or training;
5. SpinVox can use group language (for example, how I and the people I exchange messages with reply to one another) to further refine the relevant words (for example the dictionary), grammar, phrases and so on.
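By way of illustration only, the move from a general model to a caller-recipient pair model can be sketched as word statistics kept at three levels, with back-off from the most specific level that has enough data. The class `CallPairModel`, its methods and the `min_count` parameter are assumptions made for this sketch; a plain unigram frequency stands in for whatever speaker and language models the system actually builds.

```python
from collections import Counter, defaultdict


class CallPairModel:
    """Word statistics at pair (A to B), caller (A) and global level."""

    def __init__(self):
        self.pair_counts = defaultdict(Counter)    # (caller, recipient) -> word counts
        self.caller_counts = defaultdict(Counter)  # caller -> word counts
        self.global_counts = Counter()             # all messages

    def observe(self, caller, recipient, text):
        """Add one converted message to the history of this call pair."""
        words = text.lower().split()
        self.pair_counts[(caller, recipient)].update(words)
        self.caller_counts[caller].update(words)
        self.global_counts.update(words)

    def word_probability(self, caller, recipient, word, min_count=5):
        """Relative frequency of `word`, taken from the most specific level
        that has seen at least `min_count` words (hypothetical back-off rule)."""
        word = word.lower()
        levels = (
            self.pair_counts.get((caller, recipient)),  # how A speaks to B
            self.caller_counts.get(caller),             # how A speaks in general
            self.global_counts,                         # how anyone speaks
        )
        for counts in levels:
            if counts and sum(counts.values()) >= min_count:
                return counts[word] / sum(counts.values())
        return 0.0
```

No user training step appears anywhere in the sketch: the history accumulates as a side effect of messages being converted, which is the point made in item 4.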
Further details of these aspects of the SpinVox Voice Message Conversion System are given in Appendix I below.
Appendix I
SpinVox Voice Message Conversion System
The SpinVox Voice Message Conversion System (VMCS) focuses on one thing: converting spoken messages into meaningful text equivalents. In this it is unique, as are the methods and technologies developed for it.
Concept
A method of converting voice messages into text using multi-stage automatic recognition techniques together with human-assisted quality control and quality assurance techniques and processes. The automated and human elements interact directly with each other to generate live, real-time feedback, which is at the core of the system always being able to learn from live data so as to stay in tune and deliver consistent quality. It is also designed to work within the inherent limits of AI (ASR) and to improve accuracy greatly by using contextual bounds, human guidance and fresh language data from the Internet.
Problem
Traditional approaches to speech conversion have been closely tied to the level of recognition, producing high-quality automatic speech recognition under laboratory conditions, where the input is highly controlled and a high level of accuracy is ensured.
The problem is that, in the real world, speech recognition has many other factors to contend with:
● random speakers: anyone can use it
● noisy input: background noise and poor-quality speakers
● poor and variable transmission quality: lossy compression and poor mobile handset connections
● grammatically incorrect speech, slang or highly localised expressions
● context-sensitive grammar or inferred meaning arising from the unique context between the message originator and the recipient
● context changes within a message (context boundaries), which do not follow normal grammar rules
These are only some of the factors, and all of them change constantly over time, so the real-world input is not a problem that stays fixed over a period of time but one that is constantly evolving.
Solution
The key is to define the problem correctly: converting voice messages into meaningful text equivalents.
This does not mean a perfect, verbatim transcription, but rather rendering the most important pieces of the message in an understandable form. Accuracy in terms of both quality and quantity is measured by a final score, the User Rated Accuracy, on which the SpinVox VMCS must consistently score 97%.
There are two key parts:
● a constant live feedback mechanism for the learning system, driven by humans
● use of contextual information to define each conversion problem better
Using contextual information helps the system to better estimate the likelihood that something was said in a message, given the message's:
● type
● length
● time of day
● geographical location
● caller context: both caller and recipient (call-pair history)
● recent events
● and so on
Certain language structures are known to be most likely to occur in certain specific message types. This is natural language, as described below.
Natural language
When spoken messages and text messages are analysed, regular patterns emerge in people's language, in how they speak and in the order in which they say things: natural language. This varies significantly with message type or context; it will be different, for example, when used to order a pizza.
For example, how people greet one another in voicemail can be defined well by around 35 of the most commonly used expressions: "Hi, it's me", "Hello, it's Daniel", "Hi, how are you?", "Hello mate", "Alright?" and so on. Similarly, farewells can be defined well by commonly used expressions: "OK, bye", "Cheers", "Cheers mate", "Thanks, bye, cheers" and so on.
Clearly, different parts of a spoken message carry implicit meaning, and by using this context as a key we can select the most likely category of what was actually said and so improve recognition accuracy.
Building these into statistically highly relevant models is what we define as our natural language model, one for each language, including the dialects within any language.
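By way of illustration only, selecting the most likely category for a greeting can be sketched as snapping a raw ASR hypothesis onto the closest entry of a small closed list. The phrase list, the function name `snap_to_greeting` and the similarity cutoff are assumptions made for this sketch, and string similarity from Python's `difflib` stands in for the statistical natural language model that the description refers to.

```python
import difflib

# Illustrative phrases only; the description speaks of a few dozen common greetings.
COMMON_GREETINGS = [
    "hi it's me",
    "hello it's daniel",
    "hi how are you",
    "hello mate",
    "alright",
]


def snap_to_greeting(raw_asr_text, cutoff=0.75):
    """Return the closest known greeting, or None if nothing is similar enough.

    `cutoff` is a hypothetical similarity threshold in [0, 1].
    """
    normalized = raw_asr_text.lower().strip()
    matches = difflib.get_close_matches(normalized, COMMON_GREETINGS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Restricting the search to a closed set is what makes a slightly wrong hypothesis recoverable: a near miss still lands on a valid greeting, while text that resembles no greeting is left for the message-body strategy.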
The linguistic context vector
Natural language is usually governed by its context, so when the body of a message is converted, the context of what is said can be used to better estimate what was actually said. For example, a call to an enquiry hotline is likely to contain expressions related to enquiries and the specific names of certain companies, products and prices, whereas a call to a home telephone number is likely to contain friendly expressions related to greetings, such as "Hi", "Give me a call" and so on.
In voice messaging we can use context vectors to better estimate the most likely content and the natural language set that applies to it:
● CLI (or any party identifier) is a very useful context vector
○ the most likely language/dialect, and the local proper nouns most likely to be used, can be learned from the geography of the number
○ whether the number is a known business number can be learned, and the message type thereby better predicted; for example, a call from an 0870 number is commercial, so it is likely to be a business message, whereas a call from the 07 range comes from a personal mobile phone, so the time of day does more to determine whether the message type is business, personal, social or other
○ allows you to capture the caller's number better if it is spoken in the message
○ is a key from which you can build a history and a known dictionary/grammar; for example, a caller who always says "bad apple" in a street accent
○ allows a speaker-dependent recognition system to be built: we can tune the ASR to you as a specific caller and obtain higher recognition accuracy, using your own vocabulary, grammar, idioms, dictionary and general natural language
● Call-pair history: deeper use of CLI (or any party identifier)
○ you can train the system more accurately on the message history of a call pair
○ you can train on the speech of party A (the caller) irrespective of party B (the recipient)
○ you can train on the domain and language that party A uses, taking party B into account
○ you can train on multiple party A and party B relationships, driving the system to higher accuracy and speed
● Time of day, day of week
○ voicemail traffic rate, average message length and content type vary with the time of day in every language market, from a peak period of very business-like messages (8 am to 6 pm), to a more personal period (7 pm to 10 pm), to a highly social period (11 pm to 1 am), to a very functional period (2 am to 6 am). This also varies with the day of the week: Wednesday is the busiest day and carries the highest level of business messages, whereas Saturday and Sunday have a very different message type (mostly personal chat messages), which needs to be treated differently.
● International numbers
○ by analysing the country code (for example 44, 33, 39, 52, 01) we can better determine language and dialect.
● Available customer data
○ the customer's name, address and possible workplace.
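By way of illustration only, a context vector built from the CLI and the time of day can be sketched as follows. The names `ContextVector`, `line_type`, `period_of_day` and `build_context`, the country-code to language table, and the treatment of hours that fall between the bands listed above are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed mapping from country code to a default language tag (illustrative only).
COUNTRY_LANGUAGE = {"44": "en-GB", "33": "fr-FR", "39": "it-IT", "52": "es-MX"}


@dataclass
class ContextVector:
    line_type: str        # "business", "mobile" or "unknown"
    likely_language: str  # language tag, or "unknown"
    period: str           # time-of-day band


def line_type(national_number: str) -> str:
    """Classify a UK national number by prefix, following the 0870 / 07 example."""
    if national_number.startswith("0870"):
        return "business"
    if national_number.startswith("07"):
        return "mobile"
    return "unknown"


def period_of_day(hour: int) -> str:
    """Map an hour (0-23) onto the bands given in the text; gaps are left unclassified."""
    if 8 <= hour < 18:
        return "business"
    if 19 <= hour < 22:
        return "personal"
    if hour >= 23 or hour < 1:
        return "social"
    if 2 <= hour < 6:
        return "functional"
    return "unclassified"


def build_context(country_code: str, national_number: str, received_at: datetime) -> ContextVector:
    """Combine the individual clues into one context vector for a message."""
    return ContextVector(
        line_type=line_type(national_number),
        likely_language=COUNTRY_LANGUAGE.get(country_code, "unknown"),
        period=period_of_day(received_at.hour),
    )
```

A conversion strategy could then be keyed on the resulting vector, for instance by loading a business-oriented language model for a daytime call from an 0870 number.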
Implicit context between parties A and B
Taking this a step further, there are many other important clues that help us better estimate the most likely content of a message, especially those relating to who the two parties are, what the most likely purpose of the message is, and where they are calling from or to.
In voicemail-to-text and spoken text, we know that having the caller's number:
● allows us to better estimate any number left inside the message
● builds a history of known words, expressions, phrases and so on between the two parties
● indicates the likely language (for example, a call from +33 to +33 is most likely to be in French, whereas a call from +33 to +44 has a 50% chance of being in French)
● gives names and their spelling
If you know the history of party A's calls/messages, and the history of its messages to party B, you can build a speaker-dependent profile and achieve a huge improvement in your recogniser and its grammar.
Conversion quality
In solving this problem, it is essential to establish what result is actually required, because this makes a very big difference to how you go about converting voice messages (voicemail, spoken SMS, instant messaging and so on) into text, and to how best to use the conversion resources you have.
When someone leaves us a voice message, its purpose is simply to convey a message; it is not a formal written communication, so a slightly less accurate conversion is acceptable as long as the meaning of the message is correctly preserved.
In addition, because there is an asymmetry, the person leaving the message cannot compare what they said with the converted text. The recipient reads the converted output in the context of the call, with the aim of understanding what the message is; what is required is that the salient message is extracted, not a verbatim (word-for-word) conversion. Indeed, unless otherwise specified, a verbatim conversion is usually regarded as a low-quality message, because it contains a large number of inelegant, unwanted spoken fragments (for example um, er, repetitions, spelled-out words and so on).
Quality here is therefore about extracting the important elements of the message: intelligent conversion.
In the simplest case, there are three elements that provide most of the meaning and are therefore essential to message quality:
1. who it is from: knowing this adds great value to the meaning
2. what the purpose of the message is: for example, call me back as soon as possible, running late, change of plan/time, call me on this number, just saying hello, and so on
3. any specific facts, the most common being:
a. names
b. numbers, telephone numbers
c. times
d. addresses
The other information in the message largely supports the delivery of these elements and usually helps to provide better context for them.
Quality sensitivity within a message
It is also important to recognise that each key part of a message follows different rules when the message is delivered, so we can assign a separate quality measure that we should achieve for each part during conversion.
A message can be divided into:
● greeting (head)
● message (body)
● farewell (tail)
The percentage of messages that contain any body is clearly a function of the length of the deposited message, so we know that short messages (for example, under 7 seconds) typically contain only a greeting and a farewell. Above this length, the likelihood of a meaningful message body grows exponentially. This fact also helps us to better estimate the most likely conversion strategy we should use.
Greetings and farewells
How people greet you can be divided into around 50 commonly used, recognisable greetings (for example: Hi, it's me; Hello, this is X calling from Y; Hello, I'm looking for ...; and so on). Likewise, the farewell element of a message can be divided into a similar number of commonly used, recognisable farewells (for example: Thanks very much; See you later; Thanks; Bye; Cheers; See you; and so on).
Two points determine our conversion quality requirements:
1. Greetings and farewells serve as message protocol and usually carry little of the main message value, so our tolerance for low accuracy is high as long as the result is meaningful.
2. We can divide most greetings and most farewells into around 50 commonly used, recognisable categories each.
The quality requirement for the greeting or farewell is therefore lower than for the content of the message body, which is normally the point of the message or its key facts; for example, "call me on 02079652000".
The message body
The message body naturally has a higher quality requirement, but it too can often be found to contain regular patterns of context-related natural language, so we can likewise apply a degree of classification to help us arrive at the right answer.
A good example is:
"Hi Dan, it's John" - message head (or greeting)
"Could you call me back on 0207965200 when you get this message?" - message body
"Thanks very much. Cheers. Bye." - message tail (or farewell)
In this example, the body is a well-structured piece of voicemail language that the SpinVox conversion system has learned. The message can then be split up, and each part allocated and output correctly.
The context used in this example would be:
● parties A and B are known
● the telephone number is John's CLI, or one seen before in his calls
● message length: under 10 seconds, so very likely a common expression
● time of day: working hours; John does not usually leave detailed messages during working hours, only short, direct ones.
SpinVox Voice Message Conversion System
Having framed our problem correctly and identified some very important characteristics of speech and how it relates to its text equivalent, the SpinVox system (see Fig. 2) was designed to take full advantage of them:
SpinVox Voice Message Conversion System
This diagram shows the three key stages that allow us to optimise our ability to correctly convert voice messages (voicemail, spoken SMS, instant messaging and other voice message processing) into text.
A key concept is that the system uses the term "agent" for any conversion resource, whether machine/computer-based or human.
Pre-processing
This handles two things:
1. optimising audio quality for our conversion system by removing noise, cleaning up known defects, normalising volume/signal energy, removing silent/empty sections and so on.
2. classifying the message type, so that the message can be routed optimally for conversion, or not converted at all.
Message type classification is carried out using a range of "detectors":
● Language
○ for example English: UK/US/Australia/New Zealand/South Africa/Canada, and then the dialects within it (for example, within the UK: South East, London, Birmingham, Glasgow, Northern Ireland and so on)
○ allows us to determine whether we support the language
○ allows us to select which conversion route to use: the QC/QA profile, the TAT rules (SLA), which ASR stage strategy (engines) to load and which post-processing strategy to use.
Methods:
● Statistical language identification
○ Prior art
■ several automatic language identification methods are known
○ SpinVox solution:
■ decisions are based on context: knowledge of the caller's and recipient's registration, location and call history
● Signal-based language identification
○ Problems with the prior art
■ high-accuracy methods require large-vocabulary speech recognition, or at least phone recognition, and are therefore expensive to build and to run
■ a fast method based purely on recordings is needed (labelled only with the language and nothing else)
○ SpinVox solution:
1. automatically cluster the speech data for each language (vector quantisation)
2. combine the cluster centres
3. use a statistical model of the sequence of clusters to find the best match for each language
4. build a model of the relationship between the score differences between models and the expected accuracy
5. combine multiple versions of 1-4 (based on varied training data, feature extraction methods, etc.) until the required accuracy is reached.
● Noise: SNR detector
○ if the total amount of noise is above a certain threshold, correctly detecting the message signal and converting it becomes increasingly difficult. More significantly, if the signal-to-noise ratio falls below a certain level, you can have high confidence that the message cannot be converted.
○ the value to SpinVox users is that, when they are notified that a message could not be converted, the original audio is so poor that more than 87% of the time they simply call or text that person back and continue the "conversation".
● Speech quality estimator
○ for cases where someone's speech quality is probably too poor for any conversion system or agent to use, or where the content is something the user should listen to in person, for example someone singing "Happy Birthday" to them
○ the SpinVox solution includes:
1. finding drop-outs (voice packets lost in transmission) based on zero-crossing counts
2. estimating the noise level
3. computing an overall measure of speech quality and using an appropriate threshold to reject the lowest-quality messages.
● Slam-down ("hang-up") detector
○ someone called but left no meaningful audio content; typically a short message with background noise.
● Unintentional call detector
○ typically caused by the redial button being pressed in someone's pocket, leaving a long, noisy message with no meaningful audio content
● Standard messages
○ pre-recorded messages, very common in the US, from automated dialling systems, service notifications or calls
● Greetings and farewells
○ if a message contains only these, we can convert it correctly using a dedicated ASR segment
● Message length and speech density
○ length allows us to make an initial estimate of the likely message type; for example, a short call is usually simply "Hi, it's X, please call me back", whereas a longer call may contain more complex things to convert
○ speech density allows you to adjust the estimate of the likely message type made from its length, and is a good indicator of type; for example, a low-density short message is probably simply "Hi, it's X, please call me back", whereas a high-density short message will tend to need a higher level of conversion resource, because the complexity of the message is likely to be higher.
Obviously, in the path that preliminary treatment allows us (for example hang up, be not intended to call out, foreign country/unsupported language) classifies message rapidly in some cases, and to the recipient send correct notice (for example " this people calls out; but do not stay message "), save the further use of any converting system resource to preciousness.
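The three steps listed for the speech quality evaluator (drop-out detection from zero-crossing counts, noise-level estimation, and an overall accept/reject decision) could be sketched roughly as follows. This is only an illustrative sketch: the function names, the frame length, the "quietest 10% of frames" noise-floor heuristic and both threshold values are assumptions made here, not details given in the text.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def is_dropout(frame):
    # Assumption: a lost packet shows up as a flat frame, i.e. no zero
    # crossings and (almost) no energy.
    return zero_crossings(frame) == 0 and frame_energy(frame) < 1e-9


def assess_quality(samples, frame_len=160, max_dropout_ratio=0.2, min_snr_db=10.0):
    """Return drop-out ratio, a rough SNR estimate and an accept flag."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    live = [f for f in frames if not is_dropout(f)]
    if not live:
        return {"dropout_ratio": 1.0, "snr_db": None, "accept": False}
    dropout_ratio = 1.0 - len(live) / len(frames)
    energies = sorted(frame_energy(f) for f in live)
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k      # assumed noise floor: quietest frames
    signal = sum(energies[-k:]) / k    # assumed speech level: loudest frames
    snr_db = 10.0 * math.log10(signal / max(noise, 1e-12))
    accept = dropout_ratio <= max_dropout_ratio and snr_db >= min_snr_db
    return {"dropout_ratio": dropout_ratio, "snr_db": snr_db, "accept": accept}
```

A message rejected here would be routed straight to the "could not be converted" notification instead of consuming ASR or agent time.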
Automatic speech recognition (ASR)
This is a dynamic process. The optimal use of conversion resources is decided at the message level.
We take input from the pre-processing stage, both the message classification and the context vectors, and use these to select the optimal conversion strategy. This means that this stage uses the best ASR technology for the particular task. The reason is that different types of ASR are highly suited to specific tasks (for example, one is excellent for greetings, another for telephone numbers, and another for French addresses).
This stage is designed to use a range of conversion agents, whether ASR or human, and distinguishes between them only on the basis of how the conversion logic is configured at the time. As the system learns, this is adjusted, and different strategies, conversion resources and sequences can be used.
This strategy is applied not only at the level of the whole message, but can also be applied within a message.
Head and tail
One strategy is to subdivide the message into its greeting (head), body and goodbye (tail) sections and send each to the piece of ASR that is best for that element of the message. Once they are complete, they are reassembled into a single message.
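The subdivide, convert and reassemble flow described above can be sketched as follows. The segment labels and the recogniser callables are hypothetical stand-ins introduced for illustration; how the message is actually segmented, and which ASR engines are used, is not specified in the text.

```python
def convert_by_segment(segments, recognisers, fallback="body"):
    """Send each labelled segment to its best-suited recogniser and rejoin.

    segments    -- list of (kind, audio) pairs in spoken order,
                   e.g. [("greeting", a1), ("body", a2), ("goodbye", a3)]
    recognisers -- dict mapping a segment kind to a callable audio -> text
                   (hypothetical stand-ins for specialised ASR or agents)
    """
    texts = []
    for kind, audio in segments:
        # Unknown segment kinds fall back to the general-purpose recogniser.
        recognise = recognisers.get(kind, recognisers[fallback])
        text = recognise(audio)
        if text:
            texts.append(text)
    return " ".join(texts)
```

The same dispatch shape would cover the number and address strategies: a segment labelled as a number or an address is simply mapped to a different specialised recogniser.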
Number routing
Another strategy is to separate out any clear elements of the message in which telephone numbers, currencies or other obvious uses of numbers are spoken. These sections are sent to a specific piece of ASR, or to an agent, for the best conversion, and are then recombined with the rest of the converted message.
Address routing
Likewise, any elements in which an address is spoken can be separated out and sent to a specific piece of ASR or to an agent, and also to an address coordinate validator, to ensure that all the converted pieces of the address are real. For example, if you cannot detect the street name but there is a clear postcode, you can fill in the most likely street name. The accuracy of the street name can be improved by processing the address again, this time using your previously estimated street names to restrict the ASR classification variables to a much more limited set, and seeing whether there is a stronger match.
Proper noun routing
ASR is notoriously unreliable for proper nouns. However, by focusing on just this part of the message with more specialised resources, even though they are more expensive computationally, you can estimate the proper noun better.
Post-processing
The ASR stage contains its own dictionaries and grammar, but this is not enough to convert correctly the many complex ways in which we speak. ASR is well suited to word-level conversion, but is limited to very short word sequences when estimating the likelihood of word orderings and basic grammar (n-gram and trellis techniques). One problem is that, mathematically, as you widen the phrase beyond 3 or 4 words, the number of permutations becomes so large that your ability to pick the correct combination falls faster than any gain obtained from the extra words, so this is currently an unreliable strategy.
A good approach is to look at the wider phrases or sentence structures that occur in natural speech. Tackling the problem at this macro level gives you a better way of estimating corrections for the words or parts where ASR confidence is low.
However, this also has its shortcomings. As discussed earlier, human speech contains a lot of noise and artefacts and, because of the close relationship between parties A and B, it tends to rely on a larger body of shared context. Something that is meaningless to one person is quite different for someone with a large set of context, who can draw significant meaning from what looks like a random pile of phrases or unreliable acoustic utterances.
For example, "See you by the tube near Piccadilly, opposite the Trocadero, and I'll have a skinny mocha for you" is baffling unless the listener knows the likely meaning of "tube", has been to London and knows that there is a building called the "Trocadero" very close to "Piccadilly", and knows that there is a Starbucks nearby that sells a coffee called a "mocha", with low fat being "skinny".
Real-world corpora: context check
One solution is to look at very large corpora of English words, proper nouns, phrases, sayings and regular expressions, to check whether your conversion could contain these word sequences.
The problem is that normal speech contains a very large number of such possible combinations, and strictly speaking this approach lacks any real-world context check. How do you know that the combination of the proper nouns Piccadilly, Trocadero, mocha and skinny is valid, let alone a good conversion of your source audio? Only a real-world check is absolute and, unfortunately, by definition only we humans can currently decide whether something has real-world validity; after all, it is we who program the computers and the databases they rely on.
With human-level intelligence, you can check most accurately whether these seemingly unconnected items could possibly occur together in the real world. However, people still lack complete knowledge of everything, which is why a large proportion of Londoners would probably be unsure whether this phrase makes sense given what they know about Piccadilly.
One solution is to use the largest corpora of human knowledge on the planet. Thousands of pages and databases created by human editors are available on the Internet. A simple query of whether any element of a sentence, or of your conversion, is referenced on the Internet gives you a high-quality real-world check of whether this is something people have probably experienced and recorded, and which may therefore be real. Google, MSN and the other major search engines can all return a large number of page hits, so that we have greatly increased confidence that our conversion is actually correct.
In addition, using the Internet we can more often find the correct spelling of words, proper nouns and place names from a phonetic approximation of the spelling, which is what ASR will attempt when it meets a new or unknown word. At present this is done by extremely time-consuming and costly manual programming of the ASR dictionaries.
A further, extremely valuable benefit of this solution is that the Internet is a living system that accurately reflects current language. Language is an evolving and dynamic subject that can change with a single headline, so instead of relying on a limited subset of natural language and constantly updating your ASR dictionaries, you have access to what is probably the most recent and largest natural-language resource on the planet.
Example
SpinVox converts the following audio:
From a British message:
Audio: "The cat sat on Sky when Ronaldo scored against Cacá"
Converted text:
The cat sat on sky when Rownowdo/Ron Al Doh/Ronaldow/Ronahldo scored against Caka/Caca/Caker
Problems:
● "sat on sky" is a grammatical error: in a dictionary context you cannot sit on "sky"
● Rownowdo/Ron Al Doh/Ronaldow/Ronahldo are possible answers for an unusual proper noun
● Caka/Caca/Caker are guesses at a very unusual proper noun
A Google search on these difficult phrase elements shows:
● Sky is a brand: the top-ranked usage is as in "on Sky". It is therefore a proper noun for an object, so "the cat sat on Sky" is grammatically possible and valid.
● From independent spell checks of all the versions, the first name is most likely Ronaldo (Google: "Did you mean Ronaldo?")
● Ronaldo is highly correlated with "Ronaldo scored", because he is a very famous sportsman, and the search returns a large number of exact matches for this phrase.
● The second name is most likely Cacá, because Cacá has the most hits for "scored against Cacá".
● We strengthen our confidence further by searching for "football Cacá" (football being implied by the "Ronaldo scored" context), which returns a large number of highly relevant search results. Given that "Ronaldo scored" returned a large number of successful hits, we are more confident that "Cacá" is the best fit.
● In addition, the real-world nature of Google's data index means that the terms are those in use today, and current terms rank higher than less current ones, which makes speech recognition work in practice for current language and context.
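The hit-count comparison used in this example can be sketched as a small ranking function. The text does not say how the search engine is queried, so the sketch takes the lookup as a caller-supplied function; `hit_count`, `pick_by_hits` and the template string are names introduced here for illustration, and no real search API is implied.

```python
def pick_by_hits(template, candidates, hit_count):
    """Return the candidate whose filled-in phrase gets the most hits.

    template   -- phrase with one '{}' slot, e.g. "scored against {}"
    candidates -- alternative spellings proposed by the ASR stage
    hit_count  -- caller-supplied function: query string -> number of hits
                  (hypothetical; stands in for whatever search back end is used)
    """
    scored = [(hit_count(template.format(c)), c) for c in candidates]
    hits, winner = max(scored)   # highest hit count wins
    return winner, hits
```

In use, each uncertain element of the conversion would be tested in the context of its neighbouring words, as with "scored against ..." above, so that the check covers the phrase rather than the isolated word.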
Queue management device
Queue management device is responsible for:
● Determining what should happen to a voice message at each stage: the conversion strategy
● Managing the decision at each automated stage as to when human assistance is needed
○ If, at any automated conversion stage, the confidence interval or any other measure suggests that part of the message is not good enough, the queue manager directs it to the correct human agent for help.
● Ensuring that our service level agreement with any customer is met, by making sure that any message is converted within the agreed time: the turn-around time (TAT)
○ Typically, TAT averages 3 minutes, with 95% within 10 minutes and 98% within 15 minutes
● Making decisions by calculating the trade-off between conversion time and quality. This is a function of what the SLA allows at that point, particularly when coping with unexpected spikes in traffic or unusual language use, and with performance.
It does this by interacting with all the components, and it is the operational core of the SpinVox VMCS.
Quality control application
Appendix II contains a more detailed description of the lattice method used in the SpinVox quality control application.
As shown in the diagram of the voice message conversion system (VMCS) in Fig. 2, human agents interact with messages at different stages. They do this using the quality control application.
They also use a variant of this tool to spot-check messages at random, to make sure that the system is converting messages correctly; the problem with AI is that it cannot know for certain that it is accurate.
A key step is using humans to "guide" the conversion of each message. This relies on the SpinVox VMCS database, which contains a huge corpus of possible matches, on ASR, and on a human entering a few words, to create a predictive typing solution. In the extreme case, no human is needed and conversion is carried out automatically.
Problem
ASR is only good at word-level matching. To convert a message meaningfully, the phrases, sentences and grammar used in spoken messaging are essential. ASR produces a statistical confidence measure for each word-level match, and for phrases where available. It cannot use context or natural-language rules to complete a meaningful and correct conversion.
Automatic systems are good at spelling and basic grammatical consistency.
Humans are good at meaning, context, natural language and spoken grammar, and at dealing with ambiguous input and making sense of it. Humans tend to be inconsistent in spelling, grammar and speed.
Business problem
Using humans costs money, so they should be used only for what is essential and therefore economically worthwhile.
The SpinVox VMCS uses the concept of the agent conversion ratio (ACR): the ratio of the time an agent actually spends processing a message to the length of the spoken message. Anything that reduces ACR and improves message conversion quality is a commercial driver, because a 1% reduction in ACR leads to a 1% increase in gross profit. In fact the sensitivity may be even higher, because it is not only the direct cost of sales that falls: management overhead, operational effectiveness and scalability of the service all benefit from needing fewer people.
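As a small worked illustration of the ratio just defined, the sketch below computes ACR from two durations. The function name and the example figures in the comment are illustrative assumptions, not values from the text.

```python
def agent_conversion_ratio(agent_seconds, message_seconds):
    """ACR = time the agent spends on a message / length of the spoken message."""
    if message_seconds <= 0:
        raise ValueError("message length must be positive")
    return agent_seconds / message_seconds


# Illustrative figures only: a 30-second message that takes an agent
# 90 seconds to process has an ACR of 3.
```

An ACR below 1 means the agent handles the message faster than real time, which is only possible with speeded-up playback or when most of the message is accepted without editing.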
Solution
Lattice method: use human agents to guide the system in selecting the correct words, phrases, sentences and messages from a predetermined list of the most likely options.
The SpinVox VMCS database holds a rich history of message data as a huge statistical model (dictionaries and grammar with associated context vectors), which can be drawn on in two key ways:
Lattice method
i. The VMCS language model uses context (for example call-pair history, language, time of day and so on; see context vectors) to select the most likely conversion (the proposed conversion) to show to the agent.
ii. As the message plays back, the agent either types a letter to choose an alternative (this can be the first letter of the correct word), or clicks "accept" to accept the proposed section of text and moves on to the next section.
iii. When the agent types a change, the system uses it as input to select a new most likely conversion, and as feedback (learning) so that it is more likely to find the correct match first time on the next occasion.
iv. Whole messages that would normally require the agent to type a considerable number of characters (for example 250) need only a few keystrokes to complete, in real time or faster.
v. Agent output is now constrained to correct spelling, grammar and phrasing, or to rules governing them, which controls quality and gives better message meaning.
This can be presented to the agent in two ways:
1. ASR-assisted proposed conversion
In this case, ASR is first used to predict better which text to propose to the agent. It uses what is actually present in the spoken message audio to reduce the possible conversion options to a minimum, improving both accuracy and agent speed.
a. ASR can be used for the initial proposed conversion
b. ASR can then continue to take the agent's input to refine further the selection of the remainder of the proposed conversion
Prior art: manually correcting a transcript by selecting word alternatives
Problems with the prior art:
● Correction is still time-consuming
● If the user's corrections had been known during decoding, the ASR engine could have made better decisions (later in the utterance)
2. Fully predictive text typing
As in 1 above, but no ASR is used to select the proposed conversion shown to the agent. This differs from standard predictive text editing because it relies on specific history (the VMCS language model and the use of context vectors, for example call-pair history) and works at the phrase level and above.
Prior art: predicting the most frequent word (list of alternatives) given partial manual input
Problems with the prior art:
● The most frequent word is often not the one the user wants
● Prediction covers only one word
In both cases, the SpinVox VMCS language model is trained either entirely by humans, or by a combination of ASR and humans.
In the extreme case, the system is fully trained and always selects the correct proposed conversion first time, needing human assistance only for quality assurance, using random sampling to check that the VMCS is operating correctly on its own.
Appendix II: lattice method
Assorted observations and assumptions
1. Given the large vocabulary and the varying audio quality, it seems that speech recognition accuracy high enough for fully automatic conversion can be reached for only a very small proportion of utterances. Detecting this proportion reliably, that is, deciding that no manual check is needed, is a significant long-term research problem, and may not be a realistic option in the short term.
2. Good operators have a target ACR of 3-4; the average is about 6-8.
3. Correcting an utterance that is 90% correct takes about 1.2. (Source: SpinVox Operational Research 2005)
4. 75% of correction time is spent finding and selecting errors (Wald et al.).
5. Word selection lists (alternatives) reduce listening time (Burke 2006).
6. Errors tend to cluster (Burke 2006).
7. Double-speed playback preserves intelligibility, and users prefer it after brief training (Arons 97).
8. Removing pauses and playing back 50% faster gives a factor of 1/3 of real time (Arons 97).
9. According to Bain et al. (2005), normal typing has an ACR of 6.3, which is equivalent to editing ASR output that is 70% accurate. Shadow transcription is described as "feasible" for live subtitling.
Method
The main aim is to reduce the agent conversion ratio (ACR) by supporting the agent with language technology. This can be achieved in several ways:
1. Let the agent make the decisions that we cannot afford to get wrong, that is, the overall meaning of the message or of individual phrases. The machine can fill in the details.
2. Provide predictions while the agent is typing or editing the utterance. This may both save time and help to avoid spelling mistakes.
3. Provide (simplified) capitalisation and automatic punctuation, so that the agent does not have to deal with them.
Call handling steps
1. The agent listens to the message at high speed (for example 1/2 real time)
2. The agent presses a key to select a category (for example "please call back", "just calling back", ... "general").
3. In some cases the utterance is accepted immediately. This happens when the message follows a simple pattern defined for the message category, the voice quality is good, and the message contains no parts that are important but easily confused.
4. The system proposes a conversion, the agent edits it, and the system continuously (and immediately) updates the proposed utterance using predictions based on the speech recognition results.
5. As soon as the displayed utterance is correct, the agent presses a key to accept it.
An example of call handling step 4 is shown in Fig. 3.
In this example, the agent needs 35 keystrokes to edit an utterance of 17 words and 78 characters.
● 15*<accept_word> (for example the tab key)
● 14*<accept_character> (for example the right arrow key)
● 6* normal typing
● 1*<accept_utterance> (for example the enter key)
Most of these should be very fast, because the same key is pressed several times in a row. Only 6 of them require a normal key to be chosen.
Note that only 6 of the 17 words (35%) were correct in the utterance initially proposed by the system.
Implementation
Processing steps
1. The speech recognition engine (for example HTK) converts the utterance audio file into a lattice (a word hypothesis graph: a directed acyclic graph that represents a large number of possible word sequences).
2. The lattice is rescored to take into account information specific to the telephone number (pair), for example names, phrases that occurred frequently in previous calls, and so on.
3. The lattice is augmented so that the editing stage can search it very quickly (for example, the most likely path to the end of the utterance is computed for each node and arc and stored with it, and a character subtree is added to each node that represents a decision point). "Families", that is, multiple arcs that differ only in their start and end times within a certain range, are merged.
4. When the agent has selected a particular category (step 2 in "Call handling steps"), the corresponding grammar and language model are selected for parsing and dynamic rescoring. When the category is "general", a loosely constrained "grammar" is used.
5. The top-scoring path through the lattice that matches the grammar of the selected category (where applicable) is selected.
6. The result obtained in this way is accepted immediately if:
a. The category is not "general".
b. The difference between its score and the top score of a path under less strict constraints is within a given range. This range can serve as a parameter for dynamically controlling the trade-off between speed and accuracy.
c. According to the grammar used to find the path, the utterance contains no critical parts that are easily confused (for example times).
7. As the user accepts words or characters, the system follows the selected path through the lattice.
8. As characters and words are accepted or typed, their colour or font changes.
9. When the agent types something, the system selects the top-scoring path (again taking into account the current grammar and possibly other factors, such as statistical information) that begins with the typed characters. This new path is then displayed.
10. When the agent types a word that cannot be found in the lattice, it is spell-checked automatically and a correction is offered where appropriate.
11. After the agent presses <accept_utterance>, the text is processed: capitalisation and punctuation are added, spelling mistakes are corrected, number words are replaced with digits, and so on. This uses robust probabilistic parsing, with grammars derived semi-automatically from training data.
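Steps 3, 7 and 9 above (storing the most likely path to the end for each node, and re-selecting the top-scoring path that begins with what the agent has typed) can be sketched on a toy word lattice as follows. The `Lattice` class, its arc format and the plain log-probability scores are assumptions made for illustration; grammar matching, rescoring and the character subtrees are left out.

```python
class Lattice:
    """Toy word lattice: a directed acyclic graph of scored word arcs."""

    def __init__(self, arcs, start, end):
        # arcs: iterable of (src_node, dst_node, word, log_probability)
        self.out = {}
        for src, dst, word, logp in arcs:
            self.out.setdefault(src, []).append((dst, word, logp))
        self.start, self.end = start, end
        self._best = {}   # node -> best (score, words) to the end, or None

    def best_from(self, node):
        """Most likely continuation from node to the end of the utterance."""
        if node == self.end:
            return 0.0, []
        if node not in self._best:
            options = []
            for dst, word, logp in self.out.get(node, []):
                tail = self.best_from(dst)
                if tail is not None:
                    options.append((logp + tail[0], [word] + tail[1]))
            self._best[node] = max(options, key=lambda o: o[0]) if options else None
        return self._best[node]

    def predict(self, typed):
        """Top-scoring complete path whose text starts with the typed string."""
        best = None
        stack = [(self.start, 0.0, [])]
        while stack:
            node, score, words = stack.pop()
            text = " ".join(words)
            if len(text) >= len(typed):
                if not text.startswith(typed):
                    continue
                tail = self.best_from(node)   # reuse the stored best suffix
                if tail is not None:
                    cand = (score + tail[0], words + tail[1])
                    if best is None or cand[0] > best[0]:
                        best = cand
                continue
            if not typed.startswith(text):
                continue                      # this path already contradicts the input
            for dst, word, logp in self.out.get(node, []):
                stack.append((dst, score + logp, words + [word]))
        return best
```

Calling `predict("")` gives the initial proposal; each keystroke calls `predict` again with the longer typed string, and a result of `None` corresponds to the case where the typed text has left the lattice.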
Audio playback
Nodes in the lattice carry timing information, so the system can keep track of which part of the message the agent is editing. The agent can pre-configure how many seconds the system will play. If the agent hesitates, the system replays the utterance from the word before the current node.
Refinement options
Marking important and unimportant parts
Depending on the relevant category and grammar, the parts of the displayed utterance text that are considered critical are highlighted, and particularly unimportant parts (for example greetings) are shaded.
Using phrase classes for unimportant parts
Parts of the message are displayed as phrase classes rather than as words. The agent only needs to confirm the class, and the choice of the individual phrase is left to the ASR engine, because errors in these parts are considered unimportant. For example, the class "HEY" can stand for "hi, hay, hey, hallo, hello", and the earlier example could be displayed as shown in Fig. 4. In this version, the "<accept_word>" key applied to a phrase accepts the whole phrase. Typing a character switches back to word mode, that is, the phrase class marker is replaced by the individual words.
Limiting the predictions displayed
Displaying predictions that are in fact wrong may confuse the agent, so it may be better to display only the parts that the system is relatively sure about.
Alternatively, the relative confidence in the various predictions can be marked in some way with colour ("confidence shading"): for example, uncertain predictions (usually those further from the cursor) are printed in very light grey, and more reliable ones in black and bolder.
Utterance segmentation
Long periods of silence are detected and used to divide the message into paragraphs. The user interface reflects the segmentation, and an additional key is assigned to "<accept_paragraph>". If a word typed by the agent does not extend the current path through the lattice, this allows larger phrases to be confirmed with a single key press and the system to resynchronise.
Keeping the cursor at the left of the screen
A large area in the middle of the screen shows the current phrase in large letters. As editing continues, the text moves (keeping the cursor in the same position). Only a few words are displayed to the left of the cursor. As words leave the middle area, they move to the top area (smaller font, grey). Further phrases are displayed below, also smaller and in grey.
Displaying alternatives for phrases
Always, or after a key press, alternative phrase completions are shown as a drop-down menu to the right of the cursor, allowing selection with the arrow keys. This means the agent does not have to think of the first character of the correct word, which should help with difficult words.
Moving the cursor with the speech
As the message plays back, the word being spoken is highlighted automatically, and the cursor moves to the start of that word.
Playing a highlighted region
The agent can select a region (for example with the left and right mouse buttons), and the system keeps playing the section between the marks until the agent moves on.
Shadow transcription of words or phrases
Words are highlighted as they are played to the agent. Besides typing, the agent can simply speak a word to replace the currently highlighted word (and the rest of the phrase). The system dynamically builds a grammar from the alternatives (words and phrases) in the lattice and uses ASR to select the correct one. This is a technically very difficult option, because ASR has to be used from inside the QC application, and suitable speaker-dependent models have to be trained and selected at run time.
Accuracy considerations
Most likely result
The accuracy of the top-scoring result after the speech recognition step is expected to be fairly low (for example 25%), so the result displayed initially will rarely be correct. Padmanabhan et al. (2002) noted that IBM reported a word error rate of 28% for voice mail.
When the utterance category can be identified (estimated at 20% of cases), the "phrase class" method is used (that is, errors in the exact phrasing of the head and tail are accepted), and there are no difficult parts, or they can be verified using other information (the phone owner, previous calls), the chance of obtaining a completely correct result should be fairly high (for example 70%). A ball-park estimate is that about 10% of utterances in total can be processed with a single key press.
Error correction
It has been observed that speech recognition errors tend to occur in clusters; for example, the average number of consecutive erroneous words is 2 (TODO: see reference). This is usually because of:
● Segmentation errors: the first incorrect word is shorter or longer than the correct word, so the following word must be wrong as well
● The influence of the language model
● Possibly co-articulation modelling
This observation motivates the expectation that correcting one word while marking up an utterance hypothesis will typically correct more than one error.
In general, typing one character reduces the number of competitors for the next word by a factor of about 1/26. Two characters reduce it to 1/676, which should almost certainly rule out all the higher-scoring incorrect words. This motivates a further prediction: on average, an ASR error should take less than one keystroke to correct.
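The arithmetic behind these two figures can be written out as follows, under the simplifying assumption (implicit in the text) that the 26 letters are equally likely as typed characters. Here $N$ is the number of competing candidates for the next word and $N_k$ the number remaining after $k$ typed characters; both symbols are introduced here for illustration.

```latex
\[
  N_k \approx \frac{N}{26^{k}}, \qquad
  N_1 \approx \frac{N}{26}, \qquad
  N_2 \approx \frac{N}{26^{2}} = \frac{N}{676}
\]
```

Since real words are not spread evenly over initial letters, the actual reduction will vary from word to word; the figures are best read as orders of magnitude.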
Best path through the lattice
A very important factor in the success of this system is the percentage of lattices that contain the correct path, even if it has a relatively low score. If the correct path is not in the lattice, at some point the system will be unable to follow a path through the lattice, and it will then have difficulty generating new predictions. The system may need to wait for the agent to type two or three words before it can again find a correct point in the lattice from which to generate further predictions.
The size of the lattice, and hence the chance that it contains the correct utterance, can be controlled through parameters (number of tokens and pruning), and in theory the whole search space can be included. However, this would produce very large lattices that could not be sent to the client within an acceptable time. In addition, we have to cope with the unforeseeable occurrence of out-of-vocabulary words. After several months of operation (and therefore of data collection), a rate of about 95% should be achievable.
If the version described under "Utterance segmentation" above is used, the paragraphs will provide points from which prediction can easily be restarted.
Linguistic post-processing
It may be worth defining a simplified "SpinVox message" syntax for each language. SMS messages are not usually expected to contain fully correct sentences, and rather than trying to add a lot of punctuation (and often getting it wrong), it may be better to use little punctuation but use it consistently.
Capitalisation
This is relatively simple in English, but more difficult in other languages (for example German).
Expected benefits
1. During conversion or editing, the system keeps track of the agent's current position in the utterance, and so controls audio playback better.
2. For a certain proportion of utterances, the agent only needs to confirm the category, and the ACR can be less than one (in theory 1/3 with fast playback and silence removal).
3. A considerable number of messages with higher ASR performance need only a quick check and very few keystrokes to correct, giving an ACR of about 2.
4. Most messages still need a considerable amount of editing. How far these cases benefit from prediction remains to be determined.
5. Handling capitalisation and punctuation automatically will reduce ACR by a small percentage and improve consistency.
Questions/problems
1. When does editing with ASR-driven prediction become more time-consuming than simply typing? To make use of the predictions, the agent needs to read them. If the next few word predictions are correct, simply accepting them will be faster than typing them; but if the next words are wrong, extra time is needed to check them, and that time is completely wasted. On the other hand, the agent has to listen in any case, and may be able to check the predictions while doing so.
Combined prediction methods
It seems promising to use statistical prediction as a back-up for ASR-based prediction.
Because the statistical prediction model is static (not call-dependent) and therefore does not need to be sent to the QC application with each message, it can be more comprehensive than the lattice, which has to be sent with each message, must therefore stay within a certain size limit, and is more likely to be missing some of the hypotheses needed.
Both the statistical and the ASR-based prediction models are represented as graphs. The task of combined prediction involves traversing the two graphs separately and then either selecting the more reliable prediction or combining the two according to some interpolation formula.
This method can be extended to further prediction model graphs, for example one based on the call pair.
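One way to combine the two predictors is sketched below. The text only says "some interpolation formula", so the linear interpolation, the weight value and the function name `combine_predictions` are assumptions chosen for illustration; each model is reduced here to a dictionary of next-word probabilities rather than a graph.

```python
def combine_predictions(asr_probs, stat_probs, asr_weight=0.7):
    """Rank next-word candidates from two models by linear interpolation.

    asr_probs  -- {word: probability} read from the ASR lattice (may be empty
                  when the typed text has left the lattice)
    stat_probs -- {word: probability} from the static n-gram model
    asr_weight -- assumed interpolation weight for the ASR-based model
    """
    if not asr_probs:
        # Statistical prediction acts as the back-up.
        scores = dict(stat_probs)
    else:
        words = set(asr_probs) | set(stat_probs)
        scores = {
            w: asr_weight * asr_probs.get(w, 0.0)
               + (1.0 - asr_weight) * stat_probs.get(w, 0.0)
            for w in words
        }
    # Best candidate first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))
```

Adding a third model, such as one based on the call pair, would mean adding a third weighted term, with the weights summing to one.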
Statistical prediction
These predictions are based on n-gram language models. These models store conditional word-sequence probabilities. For example, a 4-gram model might store the probability that the word "to" follows the three-word context "I am going". These models can be very large, so they must be stored efficiently in a way that also allows predictions to be generated very quickly.
Implementation
An n-gram model is typically stored in a graph structure in which each node represents a context (the words already transcribed) and each outgoing link is labelled with a word and the corresponding conditional probability.
Because there are always words that have never (or only rarely) been seen after some context but are nevertheless needed, the model needs a way to handle, at run time, words that cannot be predicted in the given context. This is achieved by "backing off" to the corresponding shorter context. In our example, if "to" is not found after "I am going", the model will look for "to" after "am going". If it is not found there either, the model will look at the context node "going", and finally at the empty context node, where all words in the vocabulary are present. Backing off is implemented by adding a special link to each context node. This link points to the node with the corresponding shorter context and is labelled with a "back-off penalty", which can be interpreted as the (log) probability mass not distributed among the other links leaving the node.
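The back-off lookup can be sketched as follows. This is a minimal sketch under stated assumptions: the toy `MODEL` dictionary, its probabilities, and the function name `log_prob` are hypothetical and stand in for the graph of context nodes, word links and back-off links.

```python
import math

# Hypothetical toy model: context tuple -> word links (log probabilities)
# plus the back-off penalty (log) attached to that context node.
MODEL = {
    ("i", "am", "going"): {"links": {"home": math.log(0.5)}, "back_off": math.log(0.4)},
    ("am", "going"): {"links": {"home": math.log(0.4)}, "back_off": math.log(0.5)},
    ("going",): {"links": {"to": math.log(0.6)}, "back_off": math.log(0.3)},
    (): {"links": {"to": math.log(0.1), "home": math.log(0.05)}, "back_off": 0.0},
}


def log_prob(word, context):
    """Return log P(word | context), backing off to shorter contexts as needed."""
    context = tuple(context)
    total = 0.0
    while True:
        node = MODEL.get(context)
        if node is not None and word in node["links"]:
            return total + node["links"][word]
        if not context:
            return float("-inf")  # not even present at the empty context node
        if node is not None:
            total += node["back_off"]  # pay the penalty for leaving this context
        context = context[1:]  # drop the oldest word to get the shorter context


score = log_prob("to", ["i", "am", "going"])
```

In this toy model "to" is only found at the context node "going", so the score is the sum of two back-off penalties and one link log probability.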
The overall log probability of "to" after "I am going" could, for example, be calculated as: back_off("I am going") + back_off("am going") + link_prob("to" @ context node "going")
Expansion of the word graph
Searching for the most probable word (link) beginning with a given character string at each user keystroke would be computationally expensive. One way to speed this up relies on expanding the word graph into a character graph, in which the outgoing links at each node are stored in order of decreasing likelihood. Note that the maximum number of outgoing links at each node is the number of characters in the language plus 2, the 2 being for the back-off link and the suffix link. A search through this list will therefore need at most about 100 character comparisons for English; given that the most probable words are tried first, the expected cost drops to about 50 comparisons.
This ignores the cost of searching for words that are not found at the current context node. In that case it is probably best to accept that predictions cannot be generated as quickly, and to use the normal back-off link to search at the backed-off context node. The alternative of storing a back-off link at each character node would require too much memory.
The expansion from the word graph to the character graph can be implemented as follows:
1. For each (word-level) context node:
● Sort all outgoing links by probability (increasing).
● For each link (in turn):
  ● Set a pointer to the current (word) node.
  ● For each character:
    ● If a link labelled with this character already exists, set the pointer to the node that link points to.
    ● Otherwise: add a new link to the node the pointer points to, and create a new node as the link's destination. Set the pointer to this new node.
  ● Add a new link at the pointer target, pointing to the destination of the current word link.
After the expansion, all word links (including their probabilities) can be discarded, except for the back-off links. Note that this allows only the most probable phrase prediction to be found, not a list of less probable predictions. If such a list is needed, the order in which the character expansion was performed must be stored in some way.
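The expansion of a single context node can be sketched as follows. This is an illustrative sketch only: the input format, the `END` label for the word-closing link, and the choice to insert the most probable word first (so that links end up in order of decreasing likelihood) are assumptions made here.

```python
END = ""  # assumed label for the link that closes a word


def expand_context_node(word_links):
    """Expand one word-level context node into a character-level sub-graph.

    word_links: list of (word, probability, destination_id) tuples (hypothetical).
    A character node is an ordered list of (label, target) links; the target is
    a child node, or the word link's destination id for an END link.
    """
    root = []
    # Insert the most probable word first, so that the first link at every
    # node lies on the most probable word passing through that node.
    for word, _prob, dest in sorted(word_links, key=lambda link: -link[1]):
        node = root  # pointer starts at the current (word) node
        for ch in word:
            for label, target in node:
                if label == ch:  # link for this character already exists
                    node = target
                    break
            else:  # no such link: create it together with a new node
                child = []
                node.append((ch, child))
                node = child
        node.append((END, dest))  # link to the word link's destination
    return root


graph = expand_context_node([("to", 0.6, 7), ("today", 0.3, 8), ("home", 0.1, 9)])
```

Here "to" and "today" share the character nodes for "t" and "o"; at the node reached after "to", the word-closing link comes before the link for "d" because "to" is the more probable word.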
Prediction
Given a word node identifier, a character node identifier, the current word string and a character as input, the "predict" method will:
1. Go to character_nodes[character_node_id].
2. Look for the link labelled with the input character (using a linear search over the links, which are sorted by likelihood).
3. If found, follow the link and then, from its destination node, follow the first link leaving each node until some stop condition is reached. At each transition, append the character found on the link to the result string. Return the identifier of the initial destination node and the result string.
Otherwise: use the back-off link from word_nodes[word_node_id] and the current word to look for a prediction at the backed-off node. If the user types quickly, it may not be desirable to look for predictions in real time.
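The "predict" steps can be sketched as follows. This is a simplified sketch, not the described method in full: the table layout, the `END` label, and the stop condition (end of word or a length cap `max_chars`) are assumptions, and the back-off fallback is left to the caller.

```python
END = ""  # assumed label marking the link that closes a word

# Hypothetical character graph for one context with the words "to" (most
# probable), "today" and "home". Each node is an ordered list of
# (label, target id) links, most likely link first.
CHARACTER_NODES = [
    [("t", 1), ("h", 6)],    # 0: start of word
    [("o", 2)],              # 1: "t"
    [(END, 100), ("d", 3)],  # 2: "to"
    [("a", 4)],              # 3: "tod"
    [("y", 5)],              # 4: "toda"
    [(END, 101)],            # 5: "today"
    [("o", 7)],              # 6: "h"
    [("m", 8)],              # 7: "ho"
    [("e", 9)],              # 8: "hom"
    [(END, 102)],            # 9: "home"
]


def predict(character_nodes, char_node_id, typed_char, max_chars=32):
    """Return (destination node id, predicted completion), or None if no link matches."""
    for label, target in character_nodes[char_node_id]:  # linear search
        if label != typed_char:
            continue
        completion = typed_char
        node_id = target
        # Follow the first (most likely) link out of each node until the word ends.
        while len(completion) < max_chars and character_nodes[node_id]:
            next_label, next_id = character_nodes[node_id][0]
            if next_label == END:
                break
            completion += next_label
            node_id = next_id
        return target, completion
    return None  # a caller would now search at the backed-off node (not shown)
```

For example, `predict(CHARACTER_NODES, 0, "t")` follows "t" and then "o" and stops at the word-closing link, while `predict(CHARACTER_NODES, 2, "d")` completes the rest of "today".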
Appendix III
Key concepts
The following concepts are protected. Each of key concepts A-I can be combined with any other key concept in an implementation.
The following also describes other implementation features of the various subsystems and key concepts. These subsystems are not necessarily separate from one another; for example, one subsystem may be part of another subsystem. Nor do subsystems need to be separate in any other way; the code that performs the function of one subsystem may form part of the same software program as the code that performs the function of another subsystem.
Key concept A
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
Three core subsystems, namely (i) a pre-processing front end that determines the appropriate conversion strategy; (ii) one or more conversion sources; and (iii) a quality control subsystem.
Other features:
● The conversion sources include one or more of the following: one or more ASR engines; human operators.
● A signal processing source optimizes audio quality for conversion by performing one or more of the following functions: removing noise, removing known defects, normalizing volume/signal energy, removing silent/empty sections.
● Human operators perform random quality assurance, testing converted messages and providing feedback to the pre-processing front end and/or the conversion sources.
Key concept B
Context vector
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a context subsystem, adapted to use information about the context of a message or part of a message to improve conversion accuracy.
Other features:
● Context information is used to restrict the vocabulary used by the ASR engine or to refine the search or matching process used by the ASR engine.
● Context information is used to select a specific conversion source or combination of conversion sources, for example a specific ASR engine.
● Context information includes one or more of: caller ID; recipient ID; whether the caller or recipient is a business or another classifiable type of entity; caller-specific language; call-pair history; time of call; date of call; geographic reference or other location data of the caller or callee; PIM (personal information manager) data of the caller or callee, including address book and diary; message type, including whether the message is a voicemail, spoken text, instant message, blog entry, e-mail, memo or note; message length; information discoverable using an online knowledge corpus; presence data; speech density of the message; speech quality of the message.
● The context subsystem includes a recognizer confidence subsystem, which uses context information to determine automatically the confidence level associated with the conversion of a particular message or part of a message.
● The context subsystem includes or is connected to a recognizer confidence subsystem, which uses the output of one or more ASR engines to determine automatically the confidence level associated with the conversion of a particular message or part of a message.
■ The recognizer confidence subsystem can dynamically weight the outputs of the different ASR engines according to their likely effectiveness or accuracy.
● Knowledge of the message context is extracted by one subsystem and fed to a downstream subsystem, which uses the context information to improve conversion performance.
■ The downstream subsystem is a quality monitoring and/or assurance and control subsystem.
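The dynamic weighting of ASR engine outputs can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name `weighted_vote`, the per-engine weights, and the use of a weighted vote over whole transcriptions are not taken from the source.

```python
from collections import defaultdict


def weighted_vote(hypotheses, engine_weights):
    """Pick the transcription with the highest weighted confidence.

    hypotheses: dict engine name -> (text, engine confidence in [0, 1]).
    engine_weights: dict engine name -> assumed reliability weight, for example
    derived from context such as language or call-pair history.
    Returns (best text, normalized confidence in [0, 1]).
    """
    scores = defaultdict(float)
    for engine, (text, confidence) in hypotheses.items():
        scores[text] += engine_weights.get(engine, 1.0) * confidence
    total = sum(scores.values())
    if total == 0.0:
        return None, 0.0  # no engine produced a usable hypothesis
    best = max(scores, key=scores.get)
    return best, scores[best] / total


text, confidence = weighted_vote(
    {"a": ("call me back", 0.8), "b": ("call me black", 0.9), "c": ("call me back", 0.5)},
    {"a": 1.0, "b": 0.5, "c": 1.0},
)
```

In this example engine "b" is individually the most confident, but it is down-weighted and outvoted by the two engines that agree with each other.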
Key concept C
Call-pair history
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a call-pair subsystem, adapted to use call-pair history information to improve conversion accuracy.
Other features:
● Call-pair history enables the system to remain user-independent while acquiring, over time and without explicit user training, user-dependent data that improves conversion performance.
● Call-pair history is associated with a pair of numbers, the numbers being associated with a mobile telephone, a landline telephone, an IP address, an e-mail address or a unique address provided by a network.
● Call-pair history includes information relating to one or more of: the language or dialect likely to be used; the country called from or the country called; time zone; time of call; date of call; specific phrases used; caller-specific language; intonation; PIM data (including address book and diary).
● A computer implemented as a dynamic language model subsystem, adapted to build a dynamic language model using one or more of the following: caller dependent; call-pair dependent; callee dependent.
■ The caller is anyone depositing a voice message, irrespective of whether they intended to make a voice call; the callee is anyone who reads the converted message, irrespective of whether they were meant to receive a voice call.
● A computer implemented as a personal profile subsystem, adapted to build a personal profile of the caller to improve conversion accuracy.
■ The personal profile includes the caller's words, phrases, grammar or intonation.
Key concept D
Three-part message classification
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a boundary selection subsystem, adapted to process a message by looking for boundaries between parts of the message that carry different types of content or different types of message.
Other features:
● The computer implemented as a boundary selection subsystem analyzes one or more of the following parts: the greeting part; the body part; the goodbye part.
● Different conversion strategies are applied to each part, the strategy applied being the best one for that part.
● Different parts of the message have different quality requirements, and the quality assessment subsystem applies different standards to these different parts.
● A speech quality estimator detects boundaries between parts of the message that carry different types of content or different types of message.
■ Boundaries are detected or inferred at regions in the message where speech density changes.
■ Boundaries are detected or inferred at pauses in the message.
■ Boundaries are inferred to occur at preset proportions of the message.
● The greeting boundary is inferred to be at about 15% of the whole message length.
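A boundary inference of this kind could be sketched as follows. The 15% default comes from the text above; the function name `greeting_boundary`, the `tolerance_s` window, and the idea of snapping the default position to a nearby detected pause are assumptions made for illustration.

```python
def greeting_boundary(duration_s, pause_times=(), fraction=0.15, tolerance_s=1.0):
    """Estimate where the greeting part ends, in seconds from the start.

    duration_s: total message length in seconds.
    pause_times: times (s) of detected pauses, if any (hypothetical input).
    fraction: preset proportion of the message assumed to be greeting.
    tolerance_s: assumed window within which a pause overrides the default.
    """
    default = fraction * duration_s
    nearby = [t for t in pause_times if abs(t - default) <= tolerance_s]
    if nearby:
        # Prefer a real pause close to the preset proportion.
        return min(nearby, key=lambda t: abs(t - default))
    return default
```

For a 20-second message with no pause information the boundary falls at about 3 seconds; with a pause detected at 2.6 seconds, the sketch returns 2.6 instead.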
Key concept E
Pre-processing front end
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a pre-processing front-end subsystem, which determines the appropriate conversion strategy for converting a voice message.
Other features:
● The pre-processing front end optimizes audio quality for conversion by performing one or more of the following functions: removing noise; removing known defects; normalizing volume/signal energy; removing silent/empty sections. It also classifies the message type so that the message can be routed optimally for conversion, or not converted at all.
● The pre-processing front end determines the language used by the caller based on one or more of the following: knowledge of the caller's and/or recipient's registration, location and call history.
● The pre-processing front end selects a specific ASR engine to convert the message or part of the message.
■ Different conversion sources, for example ASR engines, are used for different parts of the same message.
■ Different conversion sources, for example ASR engines, are used for different messages.
■ Human operators are treated as an ASR engine.
■ The pre-processing front end uses or is connected to a recognizer confidence subsystem to determine automatically the confidence level associated with the conversion of a particular message or part of a message; a specific conversion source, for example an ASR engine, is then used according to that confidence level.
● The conversion strategy is selected from a set of conversion strategies comprising: (i) messages for which ASR conversion confidence is sufficiently high are checked automatically by the quality assessment subsystem for conformance to quality standards; (ii) messages for which ASR conversion confidence is not high enough are routed to a human operator for checking and, if necessary, correction; (iii) messages for which ASR conversion confidence is very low are flagged as unconvertible, and the user is notified that an unconvertible message has been received.
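The three-way strategy selection can be sketched as a simple threshold rule. The thresholds `high` and `low` and the strategy names are hypothetical; the source only distinguishes sufficiently high, not high enough, and very low confidence.

```python
def choose_strategy(asr_confidence, high=0.9, low=0.4):
    """Map an ASR conversion confidence in [0, 1] to a conversion strategy.

    high / low are assumed thresholds, not values from the source.
    """
    if asr_confidence >= high:
        return "automatic_quality_check"    # (i) checked by the quality subsystem
    if asr_confidence >= low:
        return "human_check_and_correct"    # (ii) routed to a human operator
    return "flag_unconvertible"             # (iii) user is notified


strategies = [choose_strategy(c) for c in (0.95, 0.6, 0.1)]
```

In practice the thresholds would presumably be tuned per language queue and per quality standard; the sketch fixes them only to keep the example runnable.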
Key concept F
Queue manager
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a queue manager subsystem, which intelligently manages load and calls in resources as required to ensure that converted-message delivery times meet a predetermined standard.
Other features:
● The queue manager subsystem decides what should happen to a voice message at each processing stage through the system.
■ If, at any automated conversion stage, confidence intervals or other measures indicate that any part of the message is not good enough, the queue manager routes it to the appropriate human operator for assistance.
● The queue manager subsystem makes decisions by computing the trade-off between conversion time and quality.
● The queue manager subsystem uses a state machine that, for any given language queue, can decide how best to process the message through the system.
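One way to picture the trade-off between conversion time and quality is the sketch below. It is purely illustrative: the route names, the expected times and quality scores, and the rule "best quality that still meets the delivery deadline" are assumptions, not the decision logic of the described queue manager.

```python
def pick_route(options, seconds_left, min_quality=0.0):
    """Choose the processing route with the best expected quality that still
    meets the delivery deadline; fall back to the fastest route otherwise.

    options: list of dicts with hypothetical keys "name", "seconds" (expected
    processing time) and "quality" (expected quality in [0, 1]).
    """
    feasible = [
        o for o in options
        if o["seconds"] <= seconds_left and o["quality"] >= min_quality
    ]
    if not feasible:
        # Nothing meets the deadline: limit the delay by taking the fastest route.
        return min(options, key=lambda o: o["seconds"])["name"]
    return max(feasible, key=lambda o: o["quality"])["name"]


ROUTES = [
    {"name": "asr_only", "seconds": 5, "quality": 0.7},
    {"name": "asr_plus_agent", "seconds": 60, "quality": 0.95},
]
```

With two minutes left the sketch picks the route that includes a human agent; with 30 seconds left it falls back to ASR only.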
Key concept G
Lattice
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a lattice subsystem, which generates a lattice of possible word or phrase sequences and enables a human operator to guide a conversion subsystem, either by being shown one or more candidate words or phrases from the lattice and selecting one of them, or by entering one or more characters of a different converted word or phrase, which causes the conversion subsystem to propose alternative words or phrases.
Other features:
● The conversion subsystem receives input from a subsystem that handles call-pair history information.
● The conversion subsystem receives input from the conversion sources.
● The conversion subsystem receives input from a context subsystem that has knowledge of the message context.
● The conversion subsystem learns, from human operator input, the likely words corresponding to acoustic patterns.
● The human operator need only press a single key to accept a word or phrase.
● The conversion subsystem provides capitalization and punctuation automatically.
● The conversion subsystem can propose candidate numbers, proper nouns, web addresses, e-mail addresses, physical addresses, location information or other coordinates.
● The conversion subsystem automatically distinguishes parts of the message that are likely to be important from parts that are likely to be unimportant.
● Unimportant parts of the message are confirmed by the operator as belonging to a category proposed by the conversion subsystem, and are then converted solely by the machine ASR engine.
● The human operator can speak the correct word to the conversion system, which then transcribes it automatically.
Key concept H
Online corpus
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a search subsystem, which analyzes the converted message against an online knowledge corpus.
Other features:
● The online knowledge corpus is the Internet, accessed via a search engine.
● The online knowledge corpus is a search engine database, for example Google.
● Analysis of the converted message enables a human operator and/or the recognizer confidence subsystem to assess the accuracy of the conversion.
● Analysis of the converted message enables a human operator and/or an ASR engine to resolve ambiguities in the message.
Key concept I
Detectors
A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen; the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
A computer implemented as a detector subsystem, adapted to detect hang-ups.
Other features:
● The hang-up detector is implemented as part of the pre-processing front end.
Other detectors that can also be used:
● A computer implemented as a detector subsystem tuned to detect the different languages spoken, for example English, Spanish, French and so on.
● The language detector can detect a change of language part-way through a message.
● The language detector can use input from a subsystem holding call-pair history information, which records how the language changed in previous messages.
● A computer implemented as a detector subsystem adapted to estimate speech quality.
● The speech quality estimator finds drop-outs, estimates the noise level, computes an overall measure of speech quality, and uses an appropriate threshold to reject the lowest-quality messages.
● A computer implemented as a detector subsystem adapted to detect hang-ups.
● The hang-up detector is implemented as part of the pre-processing front end.
● A computer implemented as a detector subsystem adapted to detect inadvertent calls.
● The inadvertent call detector is implemented as part of the pre-processing front end.
● A computer implemented as a detector subsystem adapted to detect and convert pre-recorded messages.
● A computer implemented as a detector subsystem adapted to detect and convert spoken numbers.
● A computer implemented as a detector subsystem adapted to detect and convert spoken addresses.
● A computer implemented as a detector subsystem adapted to detect and convert candidate proper nouns, numbers, web addresses, e-mail addresses, physical addresses, location information and other coordinates.
Message types
● The message is a voicemail intended for a mobile telephone, and the voice message is converted to text and sent to that mobile telephone.
● The message is a voice message intended for an instant messaging service, and the voice message is converted to text and sent to the instant messaging service for display on a screen.
● The message is a voice message intended for a web blog, and the voice message is converted to text and sent to a server for display as part of the web blog.
● The message is a voice message intended to be converted to text format and sent as a text message.
● The message is a voice message intended to be converted to text format and sent as an e-mail message.
● The message is a voice message intended to be converted to text format and sent to the message creator by e-mail or text as a note or memo.
Other elements of the value chain
● A mobile telephone network connected to the system of any preceding claim.
● A mobile telephone that displays messages converted by the system of any preceding claim.
● A computer display that displays messages converted by the system of any preceding claim.
● A method of providing voice messages, comprising the following step: a user sends a voice message to the message system of any preceding claim.

Claims (25)

  1. A large-scale, user-independent, device-independent voice message system that converts unstructured voice messages into text for display on a screen, characterized in that the system comprises (i) computers implemented as subsystems and (ii) a network connection to human operators, who provide transcription and quality control; the system is adapted to optimize the efficiency of the human operators by including:
    three core subsystems, namely (i) a pre-processing front end that determines the appropriate conversion strategy; (ii) one or more conversion sources; and (iii) a quality control subsystem;
    wherein the three core subsystems are each connected via the network.
  2. The system according to claim 1, characterized in that the conversion sources include one or more of the following: one or more ASR engines; a signal processing source; human operators.
  3. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a context subsystem, adapted to use context information about a message or part of a message to improve conversion accuracy.
  4. The system according to claim 3, characterized in that the context subsystem comprises or handles a recognizer confidence subsystem, which uses context information to determine automatically the confidence level associated with the conversion of the message or part of the message.
  5. The system according to claim 4, characterized in that the context subsystem comprises or handles a recognizer confidence subsystem, which uses the output of one or more ASR engines to determine automatically the confidence level associated with the conversion of the message or part of the message.
  6. The system according to claim 3, characterized in that the system further comprises a downstream subsystem, the downstream subsystem being a quality monitoring and/or assurance and control subsystem.
  7. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a call-pair subsystem, adapted to use call-pair history information to improve conversion accuracy.
  8. The system according to claim 7, characterized in that the system further comprises a computer implemented as a dynamic language model subsystem, adapted to build a dynamic language model using one or more of the following: caller dependent; call-pair dependent; callee dependent.
  9. The system according to claim 7, characterized in that the system further comprises a computer implemented as a personal profile subsystem, adapted to build a personal profile of the caller to improve conversion accuracy.
  10. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a boundary selection subsystem, adapted to process a message by looking for boundaries between parts of the message that carry different types of content or different types of message.
  11. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a pre-processing front-end subsystem, which determines the appropriate conversion strategy for converting a voice message.
  12. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a queue manager subsystem, which intelligently manages load and calls in resources as required to ensure that converted-message delivery times meet a predetermined standard.
  13. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a lattice subsystem, which generates a lattice of possible word or phrase sequences and enables a human operator to guide a conversion subsystem, either by being shown one or more candidate words or phrases from the lattice and selecting one of them, or by entering one or more characters of a different converted word or phrase, which causes the conversion subsystem to propose alternative words or phrases.
  14. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a search subsystem, which analyzes the converted message against an online knowledge corpus.
  15. The system according to claim 14, characterized in that the online knowledge corpus is the Internet, accessed via a search engine.
  16. The system according to claim 14, characterized in that the online knowledge corpus is a search engine database.
  17. The system according to claim 1, characterized in that the system further comprises:
    a computer implemented as a detector subsystem, adapted to detect hang-ups.
  18. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem tuned to detect the different languages spoken.
  19. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to estimate speech quality.
  20. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to detect inadvertent calls.
  21. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to detect and convert pre-recorded messages.
  22. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to detect and convert spoken numbers.
  23. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to detect and convert spoken addresses.
  24. The system according to claim 17, characterized in that the system further comprises a computer implemented as a detector subsystem adapted to detect and convert candidate proper nouns, numbers, web addresses, e-mail addresses, physical addresses, location information and coordinates.
  25. A mobile telephone network connected to the system according to any one of claims 1 to 24.
CNU2007900000221U 2006-02-10 2007-02-12 Large-scale user-independent and device-independent voice message system Expired - Lifetime CN201355842Y (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0602682.7 2006-02-10
GB0602682A GB0602682D0 (en) 2006-02-10 2006-02-10 Spinvox speech-to-text conversion system design overview
GB0700377.5 2007-01-09
GB0700376.7 2007-01-09

Publications (1)

Publication Number Publication Date
CN201355842Y true CN201355842Y (en) 2009-12-02

Family

ID=36119847

Family Applications (1)

Application Number Title Priority Date Filing Date
CNU2007900000221U Expired - Lifetime CN201355842Y (en) 2006-02-10 2007-02-12 Large-scale user-independent and device-independent voice message system

Country Status (2)

Country Link
CN (1) CN201355842Y (en)
GB (1) GB0602682D0 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9066216B2 (en) 2010-01-15 2015-06-23 Qualcomm Incorporated Methods and apparatus for providing messaging using voicemail
US9560205B2 (en) 2010-01-15 2017-01-31 Qualcomm Incorporated Methods and apparatus for providing messaging using voicemail
CN102714681B (en) * 2010-01-15 2016-03-16 高通股份有限公司 For the method and apparatus using voice mail to provide message to transmit
CN102714681A (en) * 2010-01-15 2012-10-03 高通伊司库特股份有限公司 Methods and apparatus for providing messaging using voicemail
CN103003875B (en) * 2010-05-18 2015-06-03 沙扎姆娱乐有限公司 Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
CN103003875A (en) * 2010-05-18 2013-03-27 沙扎姆娱乐有限公司 Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
CN102982061A (en) * 2011-07-11 2013-03-20 索尼公司 Information processing apparatus, information processing method, and program
US9824143B2 (en) 2011-07-11 2017-11-21 Sony Corporation Apparatus, method and program to facilitate retrieval of voice messages
CN110287461A (en) * 2019-05-24 2019-09-27 北京百度网讯科技有限公司 Text conversion method, device and storage medium
CN110287461B (en) * 2019-05-24 2023-04-18 北京百度网讯科技有限公司 Text conversion method, device and storage medium
CN111369981A (en) * 2020-03-02 2020-07-03 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
CN111369981B (en) * 2020-03-02 2024-02-23 北京远鉴信息技术有限公司 Dialect region identification method and device, electronic equipment and storage medium
WO2023220516A1 (en) * 2022-05-13 2023-11-16 Sony Interactive Entertainment Inc. Vocal recording and re-creation

Also Published As

Publication number Publication date
GB0602682D0 (en) 2006-03-22

Similar Documents

Publication Publication Date Title
ES2420559T3 (en) A large-scale system, independent of the user and independent of the device for converting the vocal message to text
US8976944B2 (en) Mass-scale, user-independent, device-independent voice messaging system
US8374863B2 (en) Mass-scale, user-independent, device-independent voice messaging system
US7809117B2 (en) Method and system for processing messages within the framework of an integrated message system
Zue et al. JUPITER: a telephone-based conversational interface for weather information
CN109313896B (en) Extensible dynamic class language modeling method, system for generating an utterance transcription, computer-readable medium
US7236932B1 (en) Method of and apparatus for improving productivity of human reviewers of automatically transcribed documents generated by media conversion systems
US20080063155A1 (en) Mass-Scale, User-Independent, Device-Independent Voice Messaging System
US8165887B2 (en) Data-driven voice user interface
CN201355842Y (en) Large-scale user-independent and device-independent voice message system
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
KR100822170B1 (en) Database construction method and system for speech recognition ars service
EP2261818A1 (en) A method for inter-lingual electronic communication
Koumpis Automatic voicemail summarisation for mobile messaging
US20050197839A1 (en) Apparatus, medium, and method for generating record sentence for corpus and apparatus, medium, and method for building corpus using the same
Seneff The use of subword linguistic modeling for multiple tasks

Legal Events

Date Code Title Description
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20091202