CN104517609A

CN104517609A - Voice recognition method and device

Info

Publication number: CN104517609A
Application number: CN201310451614.7A
Authority: CN
Inventors: 陈茂国; 吕梁; 刘帅东
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2013-09-27
Filing date: 2013-09-27
Publication date: 2015-04-15

Abstract

The invention discloses a voice recognition method and device. A media resource server comprises: a reception module for receiving a recognition request, sent by a terminal, to start a session; a session classification processing module for determining from the recognition request that the session is a continuous voice recognition session, continuously receiving the voice stream sent by the terminal, and returning recognition results for the voice stream; and a session termination module for terminating the session after receiving a stop-recognition request sent by the terminal. The method and device use preset parameters to let the terminal and the media resource server agree on the voice recognition scenario, so that multiple voice recognition results can be reported continuously as intermediate recognition results, which improves the continuity of voice recognition.

Description

A speech recognition method and device

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and device.

Background technology

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, and computer science. It is widely used in speech communication systems, voice-controlled telephone exchanges, data query, booking systems, telecom and banking customer service, computer control, and industrial control.

A Media Resource Server generally uses the Media Resource Control Protocol (MRCP) when providing voice services to a terminal. This communication protocol is defined by the IETF in RFC 4463, and its currently defined functions are speech recognition (Speech Recognize), speech synthesis (Speech Synthesis), recording (Record), and speaker verification and identification (Speaker Verification and Identification). MRCP does not define the session connection and is not concerned with how the server connects to the terminal; MRCP messages rely on a control protocol such as the Real-Time Streaming Protocol (RTSP) or the Session Initiation Protocol (SIP). The current version, MRCPv2, uses SIP as its control protocol. The existing MRCPv2 system architecture mainly comprises an MRCP terminal and an MRCP server, wherein:

The MRCP terminal (MRCP Client) controls one or more media resources on the MRCP Server.

The MRCP server (MRCP Server) provides one or more media resources, such as text-to-speech conversion, speech recognition, speaker identification, and recording.

When the MRCP terminal and the MRCP server exchange data, they can do so through the following protocols:

MRCP version 2 (MRCPv2), which runs over a TCP connection, is used to control the media resources of the MRCP Server so as to complete the media resource tasks of the MRCP Client.

The Session Initiation Protocol (SIP) is used for session establishment and session signaling management between the MRCP Server and the MRCP Client, and for exchanging the Session Description Protocol (SDP) information of the terminal and the server, which lays the groundwork for establishing the audio data stream.

The Real-time Transport Protocol (RTP) is used to transmit the audio data stream between the terminal and the server.

The prior-art MRCPv2 protocol defines how the MRCP Client and the MRCP Server complete the speech recognition function through the cooperation of SIP, RTP, and MRCP.

The typical message sequence for one-shot speech recognition in the prior art comprises the following steps:

The MRCP Client sends an INVITE to the MRCP Server to request session establishment, carrying the SDP of the MRCP Client side;

The MRCP Server replies with 200 to indicate that the request was accepted, carrying the SDP of the MRCP Server side;

The MRCP Client then sends an ACK message to confirm receipt of the 200 message; at this point a SIP session has been successfully established;

The MRCP Client sends a RECOGNIZE message to the MRCP Server to request speech recognition, carrying the relevant speech recognition control parameters in the format specified by the MRCP protocol and specifying the grammar file path;

The MRCP Server receives the RECOGNIZE request, compiles the grammar file, and replies with a 200 message to the MRCP Client;

The MRCP Client now begins, according to the previously negotiated SDP, to send an uninterrupted RTP voice stream to the MRCP Server;

The MRCP Server receives the RTP voice stream and, when it detects that the user has started speaking, sends a START-OF-INPUT event;

When the MRCP Server obtains a recognition result according to the grammar file definition, it returns the recognition result in a RECOGNITION-COMPLETE event;

The MRCP Client sends a BYE message to the MRCP Server to end the session;

The MRCP Server sends a 200 message to the MRCP Client to confirm the end of the session.

Through the above flow, the MRCP Client obtains the complete speech recognition capability provided by the MRCP Server.
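The steps above can be summarized as an ordered message list. The following Python sketch is illustrative only: the tuple layout and the helper `results_per_recognize` are assumptions introduced here, not part of MRCPv2 or of the described scheme. It makes explicit that the one-shot flow delivers a single result-bearing event per RECOGNIZE request.

```python
from typing import List, Tuple

# (sender, protocol, message) triples, in the order given in the steps above.
ONE_SHOT_FLOW: List[Tuple[str, str, str]] = [
    ("client", "SIP", "INVITE"),
    ("server", "SIP", "200"),
    ("client", "SIP", "ACK"),
    ("client", "MRCP", "RECOGNIZE"),
    ("server", "MRCP", "200"),
    ("client", "RTP", "voice stream"),
    ("server", "MRCP", "START-OF-INPUT"),
    ("server", "MRCP", "RECOGNITION-COMPLETE"),
    ("client", "SIP", "BYE"),
    ("server", "SIP", "200"),
]


def results_per_recognize(flow: List[Tuple[str, str, str]]) -> float:
    """Return how many RECOGNITION-COMPLETE events arrive per RECOGNIZE request."""
    recognizes = sum(1 for _, _, message in flow if message == "RECOGNIZE")
    results = sum(1 for _, _, message in flow if message == "RECOGNITION-COMPLETE")
    return results / recognizes
```

For the flow listed here, each RECOGNIZE request is answered by exactly one RECOGNITION-COMPLETE event, so obtaining a further result requires a further RECOGNIZE exchange.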

The problem with the above prior-art scheme is that, when the RTP stream is uninterrupted, recognizing the voice stream and returning results one recognition at a time inevitably leaves gaps between successive recognition controls. Some of the RTP stream is missed during those gaps, which affects the accuracy of continuous recognition.

Summary of the invention

The invention provides a speech recognition method and device. The method and device address the problem that, in a continuous recognition scenario with an uninterrupted RTP stream, the prior-art speech recognition method leaves gaps between successive recognition controls, so that some of the RTP stream is missed and the accuracy of continuous recognition is affected.

In a first aspect, the invention provides a Media Resource Server, comprising:

a receiver module, configured to receive a recognition request, sent by a terminal, to start a session;

a session classification processing module, configured to determine from the recognition request that the session is a continuous speech recognition session, to continuously receive the voice stream sent continuously by the terminal, and to feed back recognition results for the voice stream;

a session termination module, configured to terminate the session after receiving a stop-recognition request sent by the terminal.

With reference to the first aspect, in a first possible implementation, the recognition request is the parameter value of a preset parameter in a grammar file; the session classification processing module is then further configured to obtain the grammar file from the received RECOGNIZE message; the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines from the parameter value whether the session is a continuous speech recognition session.

With reference to the first possible implementation, in a second possible implementation, the preset parameter is the Mode attribute; the session classification processing module is then further configured to obtain the first parameter value corresponding to the Mode attribute from the grammar element of the grammar file; the Media Resource Server determines that the session is a continuous speech recognition session after the first parameter value matches a first preset parameter value.

With reference to the first possible implementation, in a third possible implementation, the preset parameter is the root attribute; the session classification processing module is then further configured to obtain the second parameter value corresponding to the root attribute in the grammar file; the Media Resource Server determines that the session is a continuous speech recognition session after the second parameter value matches a second preset parameter value.

With reference to the first aspect, in a fourth possible implementation, the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message; the session classification processing module is then further configured to obtain the third parameter value corresponding to the newly added header field of the RECOGNIZE message received from the terminal; the Media Resource Server determines that the session is a continuous speech recognition session after the third parameter value matches a third preset parameter value.

With reference to the first aspect or any of the first to fourth possible implementations of the first aspect, in a fifth possible implementation, the Media Resource Server continuously receives the real-time transport voice stream sent continuously by the terminal, and the session classification processing module is further configured to perform recognition on the received real-time transport voice stream and to return the recognition result for that voice stream to the terminal through the intermediate recognition result event INTERMEDIATE-RESULT.

With reference to the first aspect, in a sixth possible implementation, the session termination module is further configured to detect, after receiving the stop-recognition request, whether any voice stream remains unrecognized; if so, to recognize the unrecognized voice stream and, after recognizing it, to send the user terminal a response message for the stop-recognition request and terminate the session.

In a second aspect, the present invention further provides a speech recognition method, comprising:

a Media Resource Server receives a recognition request, sent by a terminal, to start a session;

the Media Resource Server determines from the recognition request that the session is a continuous speech recognition session, continuously receives the voice stream sent continuously by the terminal, and feeds back recognition results for the voice stream;

the Media Resource Server terminates the session after receiving a stop-recognition request sent by the terminal.

With reference to the second aspect, in a first possible implementation of the second aspect, the recognition request is the parameter value of a preset parameter in a grammar file; the Media Resource Server determining, from the recognition request sent by the terminal, whether the session is a continuous speech recognition session then comprises:

the Media Resource Server obtains the grammar file from the received RECOGNIZE message;

the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines from the parameter value whether the session is a continuous speech recognition session.

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the preset parameter is the Mode attribute;

the Media Resource Server obtaining the parameter value of the preset parameter in the grammar file and determining from the parameter value whether the session is a continuous speech recognition session comprises:

the Media Resource Server obtains the first parameter value corresponding to the Mode attribute from the grammar element of the grammar file;

the Media Resource Server determines that the session is a continuous speech recognition session after the first parameter value matches a first preset parameter value.

With reference to the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the preset parameter is the root attribute;

the Media Resource Server obtaining the parameter value of the preset parameter in the grammar element of the grammar file and determining from the parameter value whether the session is a continuous speech recognition session comprises:

the Media Resource Server obtains the second parameter value corresponding to the root attribute in the grammar file;

the Media Resource Server determines that the session is a continuous speech recognition session after the second parameter value matches a second preset parameter value.

With reference to the second aspect, in a fourth possible implementation of the second aspect, the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message; determining from the recognition request that the session is a continuous speech recognition session then comprises:

the Media Resource Server obtains the third parameter value corresponding to the newly added header field of the RECOGNIZE message received from the terminal;

the Media Resource Server determines that the session is a continuous speech recognition session after the third parameter value matches a third preset parameter value.

With reference to the second aspect or any of the first to fourth possible implementations of the second aspect, in a fifth possible implementation, continuously receiving the voice stream sent continuously by the terminal and feeding back recognition results for the voice stream comprises:

the Media Resource Server performs recognition on the received real-time transport voice stream and returns the recognition result for that voice stream to the terminal through the intermediate recognition result event INTERMEDIATE-RESULT.

With reference to the second aspect, in a sixth possible implementation, terminating the session comprises:

after receiving the stop-recognition request, the Media Resource Server detects whether any voice stream remains unrecognized; if so, it recognizes the unrecognized voice stream and, after recognizing it, sends the user terminal a response message for the stop-recognition request and terminates the session.

One or two of the above technical solutions have at least the following technical effects:

By introducing the continuous speech recognition session, the scheme provided by the invention supplements the MRCPv2 protocol: a single MRCP control can report multiple recognition events in succession, which solves the missed-recognition problem in continuous speech recognition scenarios. Under a continuous speech recognition session, the method provided by the invention can report successive speech recognition results as intermediate recognition results without issuing the recognition command repeatedly, which improves the continuity of speech recognition and also avoids missed recognition of speech.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of a Media Resource Server according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention;

Fig. 3 is a flowchart of a speech recognition method according to an embodiment of the present invention;

Fig. 4 is a flowchart of the method by which the Media Resource Server and the terminal interact to perform speech recognition in an embodiment of the present invention.

Embodiment

The speech recognition mode provided in the prior art supports one-shot recognition scenarios well. For example, in a telephone ticket-booking voice navigation application, the user says "I want to go to Shanghai"; the Media Resource Server recognizes the speech content and returns the recognition result to the terminal, and the whole recognition process ends. In many scenarios, however, the speech content to be recognized is very dense. For example, when a user is on a call with an agent, the content of their conversation needs to be displayed on screen continuously and in real time, so a recognition result should be returned for every sentence the user says. The service needs to display the full chat transcript in real time, automatically index a keyword knowledge base, and so on, based on the speech of the user and the agent. The existing MRCPv2 protocol cannot support such dense recognition scenarios well.

To address the above speech recognition problem of the prior art, the embodiment of the present invention provides a Media Resource Server.

By introducing a session type different from the prior art, namely the continuous speech recognition session, the scheme provided by the invention supplements the MRCPv2 protocol: a single MRCP control can report multiple recognition events in succession, which solves the missed-recognition problem in continuous speech recognition scenarios.

Various embodiments and aspects of the present invention are described below with reference to the following details, and the accompanying drawings illustrate the various embodiments. The following description and drawings are illustrative of the present invention and should not be construed as limiting it. Numerous specific details are described to provide a thorough understanding of the various embodiments of the invention. In some cases, however, well-known or conventional details are not described, in order to keep the description of the embodiments concise.

Some portions of the following detailed description are presented in terms of algorithms that include operations on data stored in a computer memory. An algorithm is generally a self-consistent sequence of operations leading to a desired result. These operations usually require or involve physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals that can be stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise below, descriptions throughout the specification that use terms such as "processing", "computing", "determining", or "displaying" can refer to the actions and processes of a data processing system or similar device that manipulates data represented as physical (electronic) quantities in the registers and memories of the computer and transforms them into other data similarly represented as physical quantities in the memories or registers of the system (or other such information storage, transmission, or display devices).

The present invention can relate to an apparatus for performing one or more of the operations described in this application. The apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer that is selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a machine-readable (for example, computer-readable) storage medium, or in any type of medium that is suitable for storing electronic instructions and is coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, optical disks, CD-ROMs, and magneto-optical disks), read-only memory (ROM), random access memory (RAM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic cards, or optical cards.

A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer). For example, a machine-readable medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagated in electrical, optical, acoustic, or other form (for example, carrier waves, infrared signals, digital signals).

As shown in Fig. 1, to address the missed-recognition problem that arises when the one-shot recognition mode of the prior art is used, the present invention further provides a Media Resource Server, which comprises:

a receiver module 101, configured to receive a recognition request, sent by a terminal, to start a session;

a session classification processing module 102, configured to determine from the recognition request that the session is a continuous speech recognition session, to continuously receive the voice stream sent continuously by the terminal, and to feed back recognition results for the voice stream;

a session termination module 103, configured to terminate the session after receiving a stop-recognition request sent by the terminal.

In the embodiment of the present invention, because the Media Resource Server generally recognizes speech at the same rate as the terminal reports it, the Media Resource Server can generally terminate the current session immediately after receiving the stop-recognition request sent by the terminal.

To ensure that the session content is complete, however, the present invention also provides an implementation in which, after receiving the stop-recognition request, the Media Resource Server further determines whether it still holds voice stream data of this session that has not been recognized. The session termination module 103 provided by the embodiment of the present invention is therefore further configured to detect, after receiving the stop-recognition request, whether any voice stream remains unrecognized; if so, to recognize the unrecognized voice stream and, after recognizing it, to send the user terminal a response message for the stop-recognition request and terminate the session.
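A minimal Python sketch of this termination behaviour, under stated assumptions: `pending` is a hypothetical queue of received but not yet recognized voice segments, `recognize` is a hypothetical recognizer callback, and the returned dictionary merely stands in for the response to the stop-recognition request. The description does not specify how the late results are delivered; here they are simply returned alongside the response.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Union


def handle_stop(
    pending: Deque[bytes],
    recognize: Callable[[bytes], str],
) -> Dict[str, Union[str, List[str], bool]]:
    """Recognize any buffered audio first, then answer the stop request."""
    late_results: List[str] = []
    # Drain every segment that arrived before the stop request
    # but has not been recognized yet.
    while pending:
        late_results.append(recognize(pending.popleft()))
    # Only after the backlog is empty is the response built and the session closed.
    return {
        "response": "STOP-RESPONSE",  # hypothetical label for the response message
        "late_results": late_results,
        "session_open": False,
    }
```

Draining the queue before answering is what distinguishes this variant from terminating immediately on receipt of the stop-recognition request.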

The Media Resource Server provided by the embodiment of the present invention targets continuous recognition sessions, that is, scenarios in which a large amount of continuous speech needs to be recognized. During the session, the server therefore treats each piece of speech it receives as an intermediate recognition unit and reports its recognition result as an intermediate recognition result. To report intermediate results on the basis of the existing protocol, the session classification processing module 102 is further configured to perform recognition on the received real-time transport voice stream and to return the recognition result for that voice stream to the terminal through the intermediate recognition result event INTERMEDIATE-RESULT.
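The reporting behaviour can be sketched in Python as a generator that emits one event per recognized segment. This is a simplified illustration rather than the MRCP wire format: the segmentation of the stream, the `recognize` callback, and the dictionary shape of the event are all assumptions made for the example; only the event name INTERMEDIATE-RESULT comes from the description.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def intermediate_results(
    segments: Iterable[bytes],
    recognize: Callable[[bytes], Optional[str]],
) -> Iterator[Dict[str, str]]:
    """Yield one INTERMEDIATE-RESULT event for each segment that produces text."""
    for segment in segments:
        text = recognize(segment)
        # Segments with no recognizable speech produce no event.
        if text is not None:
            yield {"event": "INTERMEDIATE-RESULT", "result": text}
```

Because the generator keeps consuming segments after each event, a single recognition request can produce any number of results, in contrast to the one-shot flow.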

In the embodiment of the present invention, to support continuous speech recognition sessions, the terminal and the server need to interact to determine which scenario the current speech recognition belongs to. The Media Resource Server provided by the embodiment of the present invention therefore offers several ways for the server and the terminal to exchange the information that determines the speech recognition session type. The following are the module implementations for several preferred approaches; the session classification processing module 102 may operate as follows:

Mode one: the recognition request is the parameter value of a preset parameter in a grammar file;

The session classification processing module 102 is then further configured to obtain the grammar file from the received RECOGNIZE message; the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines from the parameter value whether the session is a continuous speech recognition session.

Because many parameters in the grammar file can be used to convey this information between the terminal and the server, in the approach of setting a parameter in the grammar file the preferred scheme is for the preset parameter to be the Mode attribute or the root attribute:

A. The preset parameter is the Mode attribute; the session classification processing module 102 is then further configured to obtain the first parameter value corresponding to the Mode attribute from the grammar element of the grammar file; the Media Resource Server determines that the session is a continuous speech recognition session after the first parameter value matches a first preset parameter value.

B. The preset parameter is the root attribute; the session classification processing module 102 is then further configured to obtain the second parameter value corresponding to the root attribute in the grammar file; the Media Resource Server determines that the session is a continuous speech recognition session after the second parameter value matches a second preset parameter value.
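Both branches of mode one amount to reading an attribute of the grammar element and comparing it with a preset value. The Python sketch below illustrates this with the standard-library XML parser. It assumes an XML grammar whose top-level `grammar` element carries lowercase `mode` and `root` attributes. The value `continuous` for the Mode attribute is the one named later in the description, while the preset value for the root attribute is not given and must be supplied by the caller as a hypothetical `root_value`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def is_continuous_session(
    grammar_xml: str,
    mode_value: str = "continuous",
    root_value: Optional[str] = None,
) -> bool:
    """Classify a session from the attributes of the top-level grammar element."""
    grammar = ET.fromstring(grammar_xml)
    # Branch A: compare the Mode attribute with the first preset value.
    if grammar.get("mode") == mode_value:
        return True
    # Branch B: compare the root attribute with the second preset value,
    # but only when the caller has configured one.
    return root_value is not None and grammar.get("root") == root_value
```

Unqualified attributes are not affected by a default XML namespace on the element, so the lookup works whether or not the grammar declares one.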

Mode two: the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message;

The session classification processing module 102 is further configured to obtain the third parameter value corresponding to the newly added header field of the RECOGNIZE message received from the terminal; the Media Resource Server determines that the session is a continuous speech recognition session after the third parameter value matches a third preset parameter value.
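Mode two can be illustrated with a small header parser in Python. The name of the newly added header field and its preset value are not given in this part of the description, so `Recognize-Mode` and `continuous` below are purely hypothetical placeholders. The parser assumes the usual text layout of a start line followed by `Name: value` header lines and a blank line.

```python
from typing import Dict

# Hypothetical name for the newly added header field (not taken from the source).
CONTINUOUS_HEADER = "recognize-mode"


def parse_headers(message: str) -> Dict[str, str]:
    """Parse the header lines that follow the start line of a text message."""
    headers: Dict[str, str] = {}
    for line in message.split("\r\n")[1:]:
        if not line:
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def is_continuous_by_header(message: str, expected: str = "continuous") -> bool:
    """Return True when the added header field matches the preset value."""
    return parse_headers(message).get(CONTINUOUS_HEADER) == expected
```

A message that lacks the header field is classified as a non-continuous session, so terminals that do not send the extension keep the existing one-shot behaviour in this sketch.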

Because in the speech recognition method provided by the embodiment of the present invention the terminal and the Media Resource Server agree on the speech recognition scenario through a preset parameter, a speech recognition system provided by the invention is described below through the interaction between the Media Resource Server and the terminal. The speech recognition system comprises (as shown in Fig. 2):

a terminal 201, which sends a recognition request to the Media Resource Server and, when the session is to end, sends a stop-recognition request to the Media Resource Server to end the session;

a Media Resource Server 202, which determines from the received recognition request whether the session is a continuous speech recognition session and, after determining that the session is a continuous recognition scenario, continuously receives the real-time transport voice stream sent continuously by the terminal and keeps returning intermediate recognition results for that voice stream to the terminal.

The foregoing describes the device that implements the embodiment of the present invention. On the basis of that device, the method provided by the embodiment of the present invention is introduced in detail below:

Embodiment one: the embodiment of the present invention provides a speech recognition method, which is described in detail below with reference to the accompanying drawings (as shown in Fig. 3):

Step 301: the Media Resource Server receives a recognition request, sent by a terminal, to start a session;

In the prior art, each MRCP control involves multiple signaling exchanges, which wastes considerable time and resources; repeated exchanges also degrade the continuity of speech recognition and cause some speech to be missed. For continuous speech recognition, therefore, the terminal in the embodiment of the present invention first notifies the Media Resource Server through the recognition request so that the Media Resource Server can enter the continuous recognition processing mode. The Media Resource Server can then perform multiple consecutive recognitions within one MRCP control and feed back the recognition results as intermediate result events.

Step 302: the Media Resource Server determines from the recognition request that the session is a continuous speech recognition session, continuously receives the voice stream sent continuously by the terminal, and feeds back recognition results for the voice stream;

In this embodiment, after receiving the recognition request sent by the terminal, if the Media Resource Server determines from the recognition request that the subsequent recognition scenario is a continuous recognition scenario, it controls the relevant equipment accordingly to perform continuous recognition.

Step 303: the Media Resource Server terminates the session after receiving the stop-recognition request sent by the terminal.

In the embodiment of the present invention, because recognition is continuous, the MRCP Server generally does not proactively report the end-of-recognition event RECOGNITION-COMPLETE (it is only allowed to report this event when it encounters a serious error that prevents recognition from continuing). Recognition continues until the MRCP Client requests that it stop, at which point the recognition flow ends. The Media Resource Server therefore normally relies on a request from the terminal to decide whether to end the session, and in the method provided by the embodiment of the present invention the session can end in either of the following ways:

A. While continuously receiving the real-time transport voice stream sent continuously by the terminal, the Media Resource Server becomes unable to recognize the voice stream (for example, because a catastrophic failure prevents the Media Resource Server from continuing to recognize the subsequent RTP stream) and sends an end-of-recognition message to the terminal; or

B. The terminal sends the server a stop-recognition request to end the session.
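The two ways of ending the session can be sketched in Python as a loop over incoming items. This is a toy model under stated assumptions: the input is a hypothetical sequence of `(kind, payload)` pairs, where `"audio"` items already carry recognized text, `"fatal"` models an unrecoverable recognizer error, and `"stop"` models the stop-recognition request from the terminal. The output strings are labels, not MRCP messages.

```python
from typing import Iterable, List, Optional, Tuple


def run_continuous_session(
    inputs: Iterable[Tuple[str, Optional[str]]],
) -> List[str]:
    """Return the labels of the messages the server would send, in order."""
    sent: List[str] = []
    for kind, payload in inputs:
        if kind == "audio":
            # Normal case: every recognized utterance is an intermediate result.
            sent.append(f"INTERMEDIATE-RESULT {payload}")
        elif kind == "fatal":
            # Case A: the server ends recognition itself after a serious error.
            sent.append("RECOGNITION-COMPLETE")
            break
        elif kind == "stop":
            # Case B: the terminal asks to stop; the server answers and ends.
            sent.append("STOP-RESPONSE")
            break
    return sent
```

In this model RECOGNITION-COMPLETE appears only on the error path, which mirrors the rule that the server does not normally end a continuous session on its own initiative.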

In this approach, in order to ensure the integrality of session content identification, method of the present invention also provides implementation determination Media Resource Server after receiving stopping and identifying request, also further determine the voice flow whether Media Resource Server also has this session and do not identified, so the specific implementation that described Media Resource Server stops this session can also be:

In the embodiments of the present invention, to support continuous speech recognition sessions, the terminal and the server need to agree, through their interaction, on the scenario of the current speech recognition. The method provided by the embodiments of the present invention therefore offers several ways for the server and the terminal to exchange the speech recognition scenario. The preferred ways are as follows:

Mode one: the recognition request is the parameter value of a preset parameter in the grammar file;

The Media Resource Server determining, according to the recognition request sent by the terminal, whether the session is a continuous speech recognition session comprises:

In this mode, the preferred scheme is that the preset parameter is the mode attribute or the root attribute. The specific implementations are as follows:

First branch of mode one: if the preset parameter is the mode attribute, the Media Resource Server obtaining the parameter value of the preset parameter in the grammar file and determining, according to that parameter value, whether the session is a continuous speech recognition session comprises:

The Media Resource Server obtains the first parameter value, corresponding to the mode attribute, from the grammar element of the grammar file;

In a concrete application scenario, the grammar file (carried in the RECOGNIZE message) defines the technical parameters of speech recognition and the specific content to be recognized. To distinguish single-shot recognition from the continuous recognition scenario, this embodiment uses the mode attribute of the grammar element so that the MRCP Server can handle the two cases differently.

The embodiments of the present invention extend the mode attribute with a new value, continuous, which indicates the continuous speech recognition mode. That is, the first parameter value obtained for the mode attribute is matched against the first preset parameter value continuous; if the match succeeds, the session is determined to be a continuous speech recognition session. Taking a grammar file in XML format as an example, the grammar file for the continuous recognition scenario in the method provided by the embodiments of the present invention is as follows:

<?xml version="1.0" encoding="utf-8"?> / Indicates that this grammar file uses XML version 1.0 and the character encoding it uses. /

<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" mode="continuous" root="service"> / Sets attributes of the grammar, such as version, language, mode, and root grammar rule. /

<one-of>

</one-of>

</rule> / Describes the content of the root grammar rule service. one-of means "one of": when there are multiple items, matching any one of them is sufficient. This example has only one item, which refers to the local rule speech-to-text. /

<one-of>

<item>telecom</item>

<item>banking</item>

</one-of>

</rule> / This section defines the content of the local rule. It has two items, telecom and banking, and matching either one is sufficient. /

</grammar>

The grammar file above defines a rule named service, which refers to the local rule speech-to-text; that local rule covers two items, telecom and banking. By parsing this grammar file, the recognition engine can determine how the voice content is to be recognized and which specific parameters apply during recognition. When the MRCP Server compiles the grammar file and finds that the value of mode is continuous, it automatically enters continuous recognition mode.
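The mode check described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `is_continuous_by_mode` is hypothetical, and the sample grammar is a reconstruction in which the rule ids and the `ruleref` element are assumptions, since the listing in this text does not show the `<rule>` start tags.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Assumed reconstruction of the grammar file; rule ids and the ruleref
# are illustrative guesses, not taken verbatim from the source.
GRAMMAR = """<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" mode="continuous" root="service">
  <rule id="service">
    <one-of>
      <item><ruleref uri="#speech-to-text"/></item>
    </one-of>
  </rule>
  <rule id="speech-to-text">
    <one-of>
      <item>telecom</item>
      <item>banking</item>
    </one-of>
  </rule>
</grammar>"""


def is_continuous_by_mode(grammar_xml: str, preset: str = "continuous") -> bool:
    """Return True if the grammar element's mode attribute equals the preset value."""
    grammar = ET.fromstring(grammar_xml.encode("utf-8"))
    if grammar.tag != f"{{{SRGS_NS}}}grammar":
        return False  # not a grammar element in the expected namespace
    # The "first parameter value" is the value of the mode attribute.
    return grammar.get("mode") == preset


print(is_continuous_by_mode(GRAMMAR))
```

A grammar whose mode attribute is absent or has any other value falls through to single-shot handling in this sketch.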

From the content of the grammar file above, the server determines that the current speech recognition scenario is the continuous recognition scenario. Because the prior art supports only single-shot recognition, in the continuous recognition scenario the server feeds recognition results back to the terminal as follows: the Media Resource Server performs recognition on the received real-time transport voice stream and returns the recognition result for that stream to the terminal through the intermediate recognition result event INTERMEDIATE-RESULT.

In MRCPv2, only three events are defined for speech recognition (recognizer-event): recognizer-event = "START-OF-INPUT" / "RECOGNITION-COMPLETE" / "INTERPRETATION-COMPLETE". To allow recognition results to be reported continuously, the embodiments of the present invention add an intermediate result event, INTERMEDIATE-RESULT. This event can carry recognition result information, so under a single MRCP control the Media Resource Server can report the results of multiple recognitions as intermediate results through this event.

So that intermediate results can be reported on the basis of the existing protocol, the intermediate result event INTERMEDIATE-RESULT provided in the embodiments of the present invention is mainly used to report recognition results and must also comply with the MRCP protocol. The format of the INTERMEDIATE-RESULT event is:

event-line = mrcp-version SP message-length SP event-name SP request-id SP request-state CRLF

Here, event-name is INTERMEDIATE-RESULT and request-state is IN-PROGRESS. The header fields of the event are the same as those of RECOGNITION-COMPLETE, except that the Completion-Cause and Completion-Reason header fields are not included, because in the continuous recognition scenario provided by the embodiments of the present invention, speech recognition ends only when the terminal sends a specific instruction to the Media Resource Server. The event body mainly carries the recognition result in NLSML format, the same as for RECOGNITION-COMPLETE.
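The event format above can be illustrated with a short sketch that serializes an INTERMEDIATE-RESULT event. The function name `build_intermediate_result`, the choice of header fields, and the placeholder body are assumptions made for illustration; only the event-line layout, the IN-PROGRESS state, and the omission of Completion-Cause and Completion-Reason come from the text.

```python
def build_intermediate_result(request_id: int, channel_id: str, nlsml: str) -> bytes:
    """Serialize an INTERMEDIATE-RESULT event (hypothetical helper)."""
    body = nlsml.encode("utf-8")
    # No Completion-Cause or Completion-Reason: the session is still running.
    headers = (
        f"Channel-Identifier:{channel_id}\r\n"
        "Content-Type:application/nlsml+xml\r\n"
        f"Content-Length:{len(body)}\r\n"
        "\r\n"
    ).encode("ascii")

    def event_line(length: int) -> bytes:
        return (
            f"MRCP/2.0 {length} INTERMEDIATE-RESULT {request_id} IN-PROGRESS\r\n"
        ).encode("ascii")

    # message-length counts the whole message, including its own digits,
    # so iterate until the value stops changing.
    length = 0
    while True:
        total = len(event_line(length)) + len(headers) + len(body)
        if total == length:
            break
        length = total
    return event_line(length) + headers + body


message = build_intermediate_result(2, "2ce5baab46401041@speechrecog", "<result/>")
print(message.decode("utf-8"))
```

The loop is needed because the number of digits in message-length affects the total length it reports.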

Second branch of mode one: if the preset parameter is the root attribute, the Media Resource Server obtaining the parameter value of the preset parameter in the grammar file and determining, according to that parameter value, whether the session is a continuous speech recognition session comprises:

In a concrete application environment, besides the agreement through the mode attribute of the grammar element described in embodiment two, the terminal and the Media Resource Server can also agree through the root rule on whether the recognition scenario is a continuous speech recognition session. For example, when the root attribute of the grammar element is a specific character string, recognition is treated as continuous.

The grammar file can take the following form:

<?xml version="1.0" encoding="utf-8"?>

<one-of>

</one-of>

</rule>

<one-of>

<item>telecom</item>

<item>banking</item>

</one-of>

</rule>

</grammar>

In the example above, the second preset parameter value for the root attribute is defined as the specific character string continuous. If the second parameter value that the Media Resource Server obtains for the root attribute from the grammar file is continuous, the recognition is continuous recognition. The intermediate recognition event in the embodiments of the present invention can also be defined in various ways, as long as the event name does not conflict with existing events.
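The root-attribute variant can be sketched in the same way. The grammar below is hypothetical: the listing in this text does not show the `<grammar>` start tag, so its attributes and the rule id are assumptions, as is the function name `is_continuous_by_root`.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar in which the root rule is named "continuous";
# attributes and rule id are assumed for illustration.
ROOT_GRAMMAR = """<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="continuous">
  <rule id="continuous">
    <one-of>
      <item>telecom</item>
      <item>banking</item>
    </one-of>
  </rule>
</grammar>"""


def is_continuous_by_root(grammar_xml: str, preset: str = "continuous") -> bool:
    """Return True if the root attribute equals the second preset parameter value."""
    grammar = ET.fromstring(grammar_xml.encode("utf-8"))
    return grammar.get("root") == preset


print(is_continuous_by_root(ROOT_GRAMMAR))
```

Because the root attribute must also name an existing rule, this convention ties the rule id to the agreed string, which is one reason a deployment might prefer the mode attribute or a header field instead.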

Mode two: the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message. The term third parameter value in this embodiment does not restrict the position or order of this parameter value; it only indicates that this is the parameter value of the newly added header field that the Media Resource Server obtains when parsing the RECOGNIZE message.

In this case, determining in step 302 of embodiment one of the present invention that the session is a continuous speech recognition session according to the recognition request comprises:

In a concrete application environment, where the recognition request is the third parameter value corresponding to the newly added header field of the RECOGNIZE message, the specific steps for determining whether the current session is a continuous speech recognition session comprise:

Besides agreeing on continuous or single-shot recognition through the grammar file, the MRCP Client and the MRCP Server can also convey the mode by extending the header fields of the MRCP request message RECOGNIZE. For example, a new header field Work-Mode can be added. This header field is optional in speech recognition; when it is absent, single-shot recognition is the default. It is defined as follows:

work-mode = "Work-Mode" ":" Serve CRLF

Serve = "once" / "continuous"

When the MRCP Server receives a RECOGNIZE message containing a Work-Mode header field whose value is continuous, it treats this as a request from the MRCP Client to start continuous recognition; if the value is once, single-shot recognition is indicated.
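The header-field check can be sketched as a small parser. This is an illustration under assumptions: the function name `work_mode` is hypothetical, the message is handled as text, and rejecting unknown values with an exception is a design choice of this sketch rather than something the text specifies.

```python
def work_mode(recognize_message: str) -> str:
    """Return "continuous" or "once" for a RECOGNIZE message (hypothetical helper).

    A missing Work-Mode header field defaults to single-shot recognition.
    """
    head = recognize_message.split("\r\n\r\n", 1)[0]
    header_lines = head.split("\r\n")[1:]  # skip the request line
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "work-mode":
            value = value.strip().lower()
            if value not in ("once", "continuous"):
                raise ValueError(f"unsupported Work-Mode value: {value!r}")
            return value
    return "once"


SAMPLE = (
    "MRCP/2.0 290 RECOGNIZE 2\r\n"
    "Channel-Identifier:2ce5baab46401041@speechrecog\r\n"
    "Work-Mode:continuous\r\n"
    "Content-Type:text/uri-list\r\n"
    "\r\n"
    "body"
)
print(work_mode(SAMPLE))
```

Header names are compared case-insensitively here, which matches common practice for text-based protocols of this style.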

The following is an example RECOGNIZE message that uses the newly added header field, as provided in embodiment three, to indicate that the current recognition is continuous:

MRCP/2.0 290 RECOGNIZE 2

Channel-Identifier:2ce5baab46401041@speechrecog

Work-Mode:continuous

Content-Type:text/uri-list

Cancel-If-Queue:false50

No-Input-Timeout:3600000

Recognition-Timeout:3600000

Start-Input-Timers:true

Confidence-Threshold:0.0

Content-Length:33 / The lines above are the message header of the MRCP message, which defines parameters exchanged between the MRCP Client and the MRCP Server, such as the message type, channel identifier, and relevant timeout durations. /

file://C:\tmp\analytics1.grxml / The message body of this MRCP message, which contains the address of the grammar file. The MRCP Server fetches the grammar file from this address and then parses it. /

As shown in Figure 4, because in the speech recognition method provided by the embodiments of the present invention the terminal and the Media Resource Server agree on the speech recognition scenario through a preset parameter, the method is further described below through the concrete interaction flow between the Media Resource Server and the terminal. The method specifically comprises:

Step 401, the terminal sends a recognition request to the Media Resource Server;

Step 402, the Media Resource Server determines, according to the received recognition request, whether the session is a continuous speech recognition session;

In the embodiments of the present invention, the recognition request can be used in the following ways to agree with the server on whether the session is a continuous recognition scenario:

Mode one: the recognition request is the parameter value of a preset parameter in the grammar file. The Media Resource Server determining, according to the recognition request sent by the terminal, whether the session is a continuous speech recognition session comprises:

A, when the preset parameter is the mode attribute, the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines, according to that parameter value, whether the session is a continuous speech recognition session;

B, when the preset parameter is the root attribute, the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines, according to that parameter value, whether the session is a continuous speech recognition session.

Step 403, after determining that the session is a continuous recognition scenario, the Media Resource Server continuously receives the real-time transport voice stream sent by the terminal and continuously returns intermediate recognition results for that stream to the terminal;

Because the method provided by the embodiments of the present invention reports each recognition result as an intermediate recognition result in the continuous recognition scenario, and so that intermediate results can be reported on the basis of the existing protocol, the embodiments of the present invention add an intermediate result event, INTERMEDIATE-RESULT. This event can carry recognition result information, so under a single MRCP control the Media Resource Server can report the results of multiple recognitions as intermediate results through this event.

Step 404, after the terminal sends a stop-recognition request to the Media Resource Server, the session is terminated.
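Steps 401 to 404 can be sketched as a small session object. This is a toy model under stated assumptions: already-transcribed strings stand in for chunks of the real-time voice stream, the vocabulary is taken from the telecom and banking items of the example grammar, and the class name, method names, and the "STOP-RESPONSE" label are hypothetical. Draining unrecognized chunks before acknowledging the stop request follows the behaviour described for session termination.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionSession:
    """Toy model of a recognition session on the Media Resource Server."""
    continuous: bool
    vocabulary: tuple = ("telecom", "banking")
    pending: list = field(default_factory=list)
    active: bool = True

    def on_audio(self, chunk: str) -> None:
        # Step 403: keep accepting the stream while the session is open.
        if self.active:
            self.pending.append(chunk)

    def recognize_pending(self) -> list:
        events = []
        while self.pending and self.active:
            chunk = self.pending.pop(0)
            if chunk not in self.vocabulary:
                continue  # no grammar match, nothing to report
            if self.continuous:
                events.append(("INTERMEDIATE-RESULT", chunk))
            else:
                # Single-shot mode ends after the first result.
                events.append(("RECOGNITION-COMPLETE", chunk))
                self.active = False
                self.pending.clear()
        return events

    def on_stop(self) -> list:
        # Step 404: recognize whatever is still queued, then close the session.
        events = self.recognize_pending()
        self.active = False
        events.append(("STOP-RESPONSE", None))
        return events


session = RecognitionSession(continuous=True)
session.on_audio("telecom")
session.on_audio("noise")
print(session.recognize_pending())
session.on_audio("banking")
print(session.on_stop())
```

In continuous mode the session stays open across any number of results and closes only in `on_stop`, whereas the single-shot branch closes itself after the first match.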

By establishing a continuous speech recognition session, the scheme provided by the present invention supplements the MRCPv2 protocol: under MRCP control, multiple recognition events can be reported continuously, which solves the problem of missed recognition in scenarios where speech is recognized continuously.

The one or more technical schemes in the embodiments of the present application have at least the following technical effects:

On the basis of the original single-shot speech recognition, the present invention establishes a new recognition mode between the terminal and the Media Resource Server, namely the continuous speech recognition session, by adding a new mode type to the grammar file, defining a new root, or defining a new header field, and feeds back what the prior art would return as a single-shot recognition result as an intermediate recognition result. In a continuous speech recognition session, once the MRCP Server has created an RTP channel it can process the real-time transport voice stream without interruption; whenever it matches a recognition result defined in the grammar file, it feeds back an INTERMEDIATE-RESULT event, and by feeding back repeatedly it completes the continuous recognition of speech. The method provided by the present invention can therefore report multiple successive speech recognition results as intermediate recognition results within a continuous speech recognition session without issuing the recognition command repeatedly, which improves the continuity of speech recognition and also avoids missed recognition of speech.

The method of the present invention is not limited to the embodiments described in the detailed description; other embodiments that those skilled in the art derive from the technical scheme of the present invention likewise fall within the scope of technical innovation of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A Media Resource Server, characterized in that the Media Resource Server comprises:

2. The Media Resource Server as claimed in claim 1, characterized in that the recognition request is the parameter value of a preset parameter in a grammar file; the session classification processing module is further configured to obtain the grammar file from the received RECOGNIZE message; and the Media Resource Server obtains the parameter value of the preset parameter in the grammar file and determines, according to that parameter value, whether the session is a continuous speech recognition session.

3. The Media Resource Server as claimed in claim 2, characterized in that the preset parameter is the mode attribute; the session classification processing module is further configured to obtain the first parameter value corresponding to the mode attribute from the grammar element of the grammar file; and the Media Resource Server determines that the session is a continuous speech recognition session after the first parameter value matches a first preset parameter value.

4. The Media Resource Server as claimed in claim 2, characterized in that the preset parameter is the root attribute; the session classification processing module is further configured to obtain the second parameter value corresponding to the root attribute in the grammar file; and the Media Resource Server determines that the session is a continuous speech recognition session after the second parameter value matches a second preset parameter value.

5. The Media Resource Server as claimed in claim 1, characterized in that the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message; the session classification processing module is further configured to obtain the third parameter value corresponding to the newly added header field of the received RECOGNIZE message sent by the terminal; and the Media Resource Server determines that the session is a continuous speech recognition session after the third parameter value matches a third preset parameter value.

6. The Media Resource Server as claimed in any one of claims 1 to 5, characterized in that the Media Resource Server continuously receives the real-time transport voice stream sent by the terminal, and the session classification processing module is further configured to perform recognition on the received real-time transport voice stream and to return the recognition result for that stream to the terminal through the intermediate recognition result event INTERMEDIATE-RESULT.

7. The Media Resource Server as claimed in claim 1, characterized in that the session termination module is further configured to, after receiving the stop-recognition request, detect whether any voice stream remains unrecognized; if so, to recognize the unrecognized voice stream and, after recognizing it, to feed back to the user terminal a response message for the stop-recognition request and terminate the session.

8. A speech recognition method, characterized in that the method comprises:

9. The method as claimed in claim 8, characterized in that the recognition request is the parameter value of a preset parameter in a grammar file; the Media Resource Server determining, according to the recognition request sent by the terminal, whether the session is a continuous speech recognition session comprises:

10. The method as claimed in claim 9, characterized in that the preset parameter is the mode attribute;

11. The method as claimed in claim 9, characterized in that the preset parameter is the root attribute;

12. The method as claimed in claim 8, characterized in that the recognition request is a third parameter value corresponding to a newly added header field of the RECOGNIZE message; determining that the session is a continuous speech recognition session according to the recognition request comprises:

13. The method as claimed in any one of claims 8 to 12, characterized in that continuously receiving the voice stream sent by the terminal and feeding back the recognition result of the voice stream comprises:

14. The method as claimed in claim 8, characterized in that terminating the session comprises: