CN110310632A - Method of speech processing and device and electronic equipment - Google Patents
- Publication number
- CN110310632A CN110310632A CN201910583851.6A CN201910583851A CN110310632A CN 110310632 A CN110310632 A CN 110310632A CN 201910583851 A CN201910583851 A CN 201910583851A CN 110310632 A CN110310632 A CN 110310632A
- Authority
- CN
- China
- Prior art keywords
- voice
- redundancy
- voice messaging
- information
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 60
- 238000012545 processing Methods 0.000 title claims abstract description 53
- 238000001514 detection method Methods 0.000 claims description 65
- 230000004044 response Effects 0.000 claims description 10
- 238000007689 inspection Methods 0.000 claims description 9
- 230000008859 change Effects 0.000 claims description 5
- 238000005516 engineering process Methods 0.000 description 19
- 239000012634 fragment Substances 0.000 description 14
- 210000005036 nerve Anatomy 0.000 description 14
- 238000004590 computer program Methods 0.000 description 13
- 238000010586 diagram Methods 0.000 description 13
- 230000006870 function Effects 0.000 description 10
- 238000004458 analytical method Methods 0.000 description 5
- 230000006399 behavior Effects 0.000 description 4
- 230000008569 process Effects 0.000 description 4
- 238000004891 communication Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000005266 casting Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000003111 delayed effect Effects 0.000 description 1
- 238000000151 deposition Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000005611 electricity Effects 0.000 description 1
- 238000005538 encapsulation Methods 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 230000026781 habituation Effects 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 239000002245 particle Substances 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 239000000758 substrate Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure provides a speech processing method. The method comprises: acquiring voice information; determining whether redundant information is present in the voice information; in a case where redundant information is present in the voice information, removing the redundant information to obtain information to be processed; and determining, according to the information to be processed, intent information for the voice information. The present disclosure further provides a speech processing apparatus and an electronic device.
Description
Technical field
The present disclosure relates to a speech processing method, a speech processing apparatus, and an electronic device.
Background art
With the rapid development of electronic devices, intelligent human-computer interaction technologies such as speech recognition have emerged to improve the user experience. Speech recognition technology listens for a user's voice input, then recognizes and analyzes what is heard to determine the user's voice instruction, so that the electronic device can perform a corresponding operation according to that instruction, realizing intelligent human-computer interaction.
Existing speech recognition technology typically listens continuously and sends the entire voice input heard to the back end of the electronic device for recognition processing in order to determine the user's voice instruction. Because redundant voice input is also recognized under this approach, it can interfere with identification of the correct instruction to some extent. To avoid such interference, existing speech recognition technology may instead stop listening when no user voice input is heard within a predetermined time, or when a redundant voice input from the user (for example "uh", "um", "this" and/or "that") is heard, and then send the voice input heard so far to the back end for recognition processing. However, redundant voice input is often merely a habitual expression of the user and does not mark the end of the voice input, so a scheme that stops listening upon hearing a redundant input inevitably causes valid voice input to be missed, which in turn impairs identification of the correct instruction.
Summary of the invention
One aspect of the present disclosure provides a speech processing method for improving the user experience. The method comprises: acquiring voice information; determining whether redundant information is present in the voice information; in a case where redundant information is present in the voice information, removing the redundant information to obtain information to be processed; and determining, according to the information to be processed, intent information for the voice information.
Optionally, acquiring the voice information comprises: detecting a starting point and a terminating point of a voice input using an endpoint detection model; and collecting the voice information according to the starting point and the terminating point of the voice input.
Optionally, detecting the terminating point of the voice input comprises: in response to detecting the starting point of the voice input, determining whether the detected voice input is redundant voice; in a case where the detected voice input is determined to be redundant voice, changing a parameter of the endpoint detection model to obtain an updated endpoint detection model; and detecting the terminating point of the voice input according to the updated endpoint detection model.
Optionally, the parameter of the endpoint detection model comprises a waiting time for the terminating point, and changing the parameter of the endpoint detection model comprises increasing the waiting time for the terminating point.
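The optional behavior just described (extending the terminating-point waiting time once the detected input is classified as redundant voice) might be sketched as follows. This is an illustrative sketch only; the class name and the millisecond values are assumptions, not values from the disclosure.

```python
# Hedged sketch of an endpoint detector whose terminating-point waiting
# time is increased after a redundant-voice segment is heard.
class EndpointDetector:
    def __init__(self, wait_ms=500, extension_ms=700):
        self.wait_ms = wait_ms            # silence needed to declare a terminating point
        self.extension_ms = extension_ms  # how much to extend after a filler

    def on_segment(self, is_redundant):
        # If the speech heard so far is only a filler ("uh", "um", ...),
        # wait longer so the user can finish the real instruction.
        if is_redundant:
            self.wait_ms += self.extension_ms
        return self.wait_ms

det = EndpointDetector()
print(det.on_segment(is_redundant=True))   # -> 1200
print(det.on_segment(is_redundant=False))  # -> 1200 (unchanged)
```

Extending rather than resetting the wait means a user who says "um... play some music" is not cut off after the filler.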
Optionally, determining whether redundant information is present in the voice information comprises: recognizing the voice information using a first speech recognition model to determine whether redundant voice information is present in the voice information. Determining the intent information matching the voice information comprises: recognizing the information to be processed using a second speech recognition model to obtain text to be processed that matches the information to be processed; and determining, according to the text to be processed, the intent information matching the voice information using a semantic understanding model. Here, the redundant information comprises the redundant voice information.
Optionally, removing the redundant information to obtain the information to be processed comprises: removing the redundant information from the voice information according to a starting point and a terminating point of the redundant information.
Optionally, determining whether redundant information is present in the voice information comprises: recognizing the voice information using the second speech recognition model to obtain speech text matching the voice information; determining whether redundancy text is present in the speech text; and in a case where redundancy text is present in the speech text, determining that redundant information is present in the voice information. Determining the intent information for the voice information comprises: determining, according to the information to be processed, the intent information matching the voice information using the semantic understanding model.
Optionally, removing the redundant information to obtain the information to be processed comprises: removing the redundancy text from the speech text to obtain text to be processed, wherein the information to be processed comprises the text to be processed.
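As a rough illustration of the text-level variant just described, the snippet below scans a recognized transcript for an assumed redundancy vocabulary and strips it to produce the text to be processed. The filler vocabulary and function names are invented for the example and are not part of the disclosure.

```python
import re

# Assumed redundancy vocabulary (English stand-ins for the fillers the
# description mentions, such as "uh" and "um").
FILLERS = ["uh", "um", "er", "that is to say"]
_PATTERN = r"\b(" + "|".join(map(re.escape, FILLERS)) + r")\b"

def find_redundancy_spans(text):
    """Locate (start, end) character spans of redundancy text in the transcript."""
    return [(m.start(), m.end()) for m in re.finditer(_PATTERN, text, re.IGNORECASE)]

def strip_redundancy(text):
    """Delete the redundancy text and re-splice the remainder into continuous text."""
    out = re.sub(_PATTERN, " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

print(find_redundancy_spans("um hi"))              # -> [(0, 2)]
print(strip_redundancy("um play uh some jazz"))    # -> "play some jazz"
```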
Another aspect of the present disclosure provides a speech processing apparatus. The apparatus comprises: an acquisition module for acquiring voice information; a redundant information determining module for determining whether redundant information is present in the voice information; a redundant information removing module for removing the redundant information to obtain information to be processed in a case where redundant information is present in the voice information; and an intent information determining module for determining, according to the information to be processed, intent information for the voice information.
Optionally, the acquisition module comprises: a detection submodule for detecting a starting point and a terminating point of a voice input using an endpoint detection model; and an acquisition submodule for collecting the voice information according to the starting point and the terminating point of the voice input.
Optionally, the detection submodule comprises: a voice determining unit for determining, in response to detecting the starting point of the voice input, whether the detected voice input is redundant voice; a parameter changing unit for changing a parameter of the endpoint detection model to obtain an updated endpoint detection model in a case where the detected voice input is determined to be redundant voice; and a detection unit for detecting the terminating point of the voice input according to the updated endpoint detection model.
Optionally, the parameter of the endpoint detection model comprises a waiting time for the terminating point, and the parameter changing unit is configured to increase the waiting time for the terminating point.
Optionally, the redundant information determining module is configured to recognize the voice information using a first speech recognition model to determine whether redundant voice information is present in the voice information. The intent information determining module comprises: a first recognition submodule for recognizing the information to be processed using a second speech recognition model to obtain text to be processed that matches the information to be processed; and an intent determining submodule for determining, according to the text to be processed, the intent information matching the voice information using a semantic understanding model. Here, the redundant information comprises the redundant voice information.
Optionally, the redundant information removing module is configured to remove the redundant information from the voice information according to a starting point and a terminating point of the redundant information.
Optionally, the redundant information determining module comprises: a second recognition submodule for recognizing the voice information using the second speech recognition model to obtain speech text matching the voice information; a text determining submodule for determining whether redundancy text is present in the speech text; and a redundancy determining submodule for determining that redundant information is present in the voice information in a case where redundancy text is present in the speech text. The intent information determining module is configured to determine, according to the information to be processed, the intent information matching the voice information using the semantic understanding model.
Optionally, the redundant information removing module is configured to remove the redundancy text in the speech text to obtain text to be processed, wherein the information to be processed comprises the text to be processed.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the speech processing method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the speech processing method described above.
Another aspect of the present disclosure provides a computer program comprising computer-executable instructions which, when executed, implement the method described above.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically illustrates an application scenario of the speech processing method and apparatus and the electronic device according to an embodiment of the present disclosure;
Fig. 2 schematically illustrates a flowchart of the speech processing method according to a first exemplary embodiment of the present disclosure;
Fig. 3A schematically illustrates a flowchart of acquiring voice information according to an exemplary embodiment of the present disclosure;
Fig. 3B schematically illustrates a flowchart of detecting the terminating point of a voice input according to an exemplary embodiment of the present disclosure;
Fig. 4 schematically illustrates a flowchart of the speech processing method according to a second exemplary embodiment of the present disclosure;
Fig. 5 schematically illustrates a flowchart of the speech processing method according to a third exemplary embodiment of the present disclosure;
Fig. 6 schematically illustrates an example flowchart of the speech processing method according to an exemplary embodiment of the present disclosure;
Fig. 7 schematically illustrates a structural block diagram of the speech processing apparatus according to an exemplary embodiment of the present disclosure; and
Fig. 8 schematically illustrates a structural block diagram of an electronic device adapted to implement the speech processing method according to an embodiment of the present disclosure.
Detailed description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, numerous specific details are set forth for ease of explanation in order to provide a thorough understanding of the embodiments of the present disclosure. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In addition, descriptions of well-known structures and technologies are omitted in the following description so as not to unnecessarily obscure the concepts of the present disclosure.
The terms used herein are for the purpose of describing specific embodiments only and are not intended to limit the present disclosure. The terms "include", "comprise" and the like used herein indicate the presence of the stated features, steps, operations and/or components, but do not preclude the presence or addition of one or more other features, steps, operations or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted with meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.
Where an expression similar to "at least one of A, B and C, etc." is used, it should in general be interpreted according to the meaning of the expression as commonly understood by those skilled in the art (for example, "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B and C, etc.). Where an expression similar to "at least one of A, B or C, etc." is used, it should likewise be interpreted according to the meaning of the expression as commonly understood by those skilled in the art (for example, "a system having at least one of A, B or C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B and C, etc.).
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some blocks of the block diagrams and/or flowcharts, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts. The technology of the present disclosure may be implemented in the form of hardware and/or software (including firmware, microcode, etc.). In addition, the technology of the present disclosure may take the form of a computer program product on a computer-readable storage medium storing instructions, the computer program product being for use by, or in connection with, an instruction execution system.
Embodiments of the present disclosure provide a speech processing method for improving the user experience, as well as a corresponding apparatus and electronic device. The speech processing method comprises: acquiring voice information; determining whether redundant information is present in the voice information; in a case where redundant information is present in the voice information, removing the redundant information to obtain information to be processed; and determining, according to the information to be processed, intent information for the voice information.
According to the speech processing method of the present disclosure, the redundant information in the voice information can be removed before the intent information for the voice information is determined. When the intent information is determined, the intent information corresponding to the user instruction is determined from the voice from which the redundant information has been removed. This avoids the defect in the prior art that voice instructions are recognized inaccurately because of interference from redundant information, thereby improving the accuracy of recognizing the user's voice instructions and improving the user experience.
Fig. 1 schematically illustrates an application scenario of the speech processing method and apparatus and the electronic device according to an embodiment of the present disclosure. It should be noted that Fig. 1 shows only an example of a scenario to which the embodiments of the present disclosure may be applied, in order to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments or scenarios.
As shown in Fig. 1, the application scenario 100 of the embodiment of the present disclosure includes terminal devices 111, 112 and 113.
The terminal devices 111, 112 and 113 have a voice detection function for detecting voice information in the environment in which they are located and, upon detecting a voice instruction from the user, performing an operation matching the voice instruction in response to it. The terminal devices 111, 112 and 113 may include, but are not limited to, smart home appliances, smart video devices, smart wearable devices, smartphones, tablet computers, laptop portable computers, desktop computers and the like.
According to an embodiment of the present disclosure, the terminal devices 111, 112 and 113 may also have a processing function, for recognizing and processing the detected voice information, determining intent information corresponding to the user's voice information, and responding to the user's voice instruction according to the intent information.
According to an embodiment of the present disclosure, various client applications may be installed on the terminal devices 111, 112 and 113, for example an intelligent voice assistant, music applications, shopping applications, search applications, instant messaging tools, social platform software and the like (by way of example only). When running these client applications, the terminal devices 111, 112 and 113 can determine, in response to voice information input by the user through a client application, the intent information corresponding to that voice information.
As shown in Fig. 1, the application scenario 100 of the embodiment of the present disclosure may further include a network 120 and a server 130. The network 120 provides a medium for communication links between the terminal devices 111, 112, 113 and the server 130, and may include various connection types such as wired links, wireless communication links or fiber-optic cables.
The server 130 may be a server providing various services, for example a back-end management server (by way of example only) that supports the client applications run by users on the terminal devices 111, 112 and 113. The server 130 may also be, for example, a virtual server equipped with a cloud platform. The server can recognize and otherwise process the voice information obtained from the terminal devices 111, 112 and 113, and feed the processing result for the voice information (which may include, for example, intent information) back to the terminal devices 111, 112 and 113, so that the terminal devices 111, 112 and 113 can respond to the user's voice instruction.
It should be noted that the speech processing method provided by the embodiments of the present disclosure can generally be performed by the terminal devices 111, 112, 113 or by the server 130. Correspondingly, the speech processing apparatus provided by the embodiments of the present disclosure can generally be arranged in the terminal devices 111, 112, 113 or in the server 130.
It should be understood that the number and types of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number and type of terminal devices, networks and servers as required by the implementation.
Fig. 2 schematically illustrates a flowchart of the speech processing method according to the first exemplary embodiment of the present disclosure.
As shown in Fig. 2, the speech processing method of the embodiment of the present disclosure includes operations S210 to S240. The speech processing method can be performed by the terminal devices 111, 112, 113 or by the server 130.
In operation S210, voice information is acquired.
According to an embodiment of the present disclosure, the voice information may, for example, be collected in real time by the terminal devices 111, 112 and 113. Alternatively, the voice information may be collected in real time by a voice collection apparatus and sent in real time to the terminal devices 111, 112, 113 or to the server 130.
According to an embodiment of the present disclosure, the voice information may include, for example, voice information corresponding to a voice instruction of the user, so that the terminal devices 111, 112 and 113 can perform an operation matching the voice instruction in response to the voice information, thereby realizing human-computer voice interaction. The voice information may be, for example, an acoustic signal collected in real time, and the present disclosure is not limited in this respect.
According to an embodiment of the present disclosure, a method of acquiring the voice information is detailed in the operation flows described with reference to Figs. 3A and 3B, and is not repeated here.
In operation S220, whether redundant information is present in the voice information is determined.
According to an embodiment of the present disclosure, operation S220 may, for example, determine whether redundant information is present in the voice information by performing recognition processing on the voice information. The redundant information may include, for example, information that does not help determine the user's intent, such as modal or auxiliary particles like "um", "this" or "that", or pet phrases the user habitually uses (such as "it seems" or "like that").
According to an embodiment of the present disclosure, operation S220 may include, for example: first converting the voice information into speech text using speech recognition technology, and then judging whether redundancy text is present in the speech text. This method is detailed in the description of operations S521 to S523 in Fig. 5 and is not elaborated here.
According to an embodiment of the present disclosure, operation S220 may alternatively include, for example: first performing recognition processing on the voice information (for example an acoustic signal) to determine whether the acoustic signal contains a portion matching the pronunciation of the redundancy vocabulary described above. When an acoustic signal matching the pronunciation of the redundancy vocabulary is present, it is determined that redundant information is present in the voice information. This method is detailed in the description of operation S420 in Fig. 4 and is not elaborated here.
In operation S230, in a case where redundant information is present in the voice information, the redundant information is removed to obtain information to be processed.
According to an embodiment of the present disclosure, operation S230 may include, for example: locating, according to the determined redundant information, the position of the redundant information in the voice information or the speech text, and deleting the redundant information from the voice information or the speech text according to that position.
According to an embodiment of the present disclosure, after the redundant information is removed, the voice information or speech text on either side of the redundant information may also be spliced together so that the remaining voice information or speech text is continuous, thereby obtaining the final information to be processed.
It should be noted that, in a case where what is removed in operation S230 is an acoustic signal of the redundant information, the information to be processed is the spliced acoustic signal; in a case where what is removed is redundancy text, the information to be processed is the spliced speech text.
In operation S240, intent information for the voice information is determined according to the information to be processed.
According to an embodiment of the present disclosure, the intent information may include, for example, text information capable of characterizing the user's intent, or a machine language that the terminal device can recognize (such as binary code or a character string). For example, the intent information may include text information or machine language corresponding to user demands such as "play music" or "broadcast the weather forecast".
According to an embodiment of the present disclosure, in a case where the information to be processed is the acoustic signal remaining after the acoustic signal of the redundant information is removed, operation S240 may include: first performing recognition processing on the remaining acoustic signal to obtain speech text matching it, and then performing processing such as feature extraction, syntactic analysis and text clustering on the speech text to obtain the intent information for the voice information.
According to an embodiment of the present disclosure, in a case where the information to be processed is the speech text remaining after the redundant voice text is removed, operation S240 may include: performing processing such as feature extraction, syntactic analysis and text clustering on the remaining speech text to obtain the intent information for the voice information.
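As a toy stand-in for the feature extraction and matching described above, the sketch below compares a bag-of-words representation of the processed text against intent templates using cosine similarity. The templates, intent labels and helper names are invented for illustration and are not the disclosed semantic understanding model.

```python
from collections import Counter
import math

# Invented intent templates for the example.
INTENTS = {
    "PLAY_MUSIC": "play some music",
    "REPORT_WEATHER": "report the weather forecast",
}

def bow(text):
    """Simple feature extraction: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def match_intent(text):
    """Pick the template most similar to the processed text."""
    scores = {name: cosine(bow(text), bow(tpl)) for name, tpl in INTENTS.items()}
    return max(scores, key=scores.get)

print(match_intent("please play music"))  # -> PLAY_MUSIC
```

Because the fillers were already stripped in operation S230, they contribute nothing to the similarity scores here.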
In summary, because the speech processing method of the embodiment of the present disclosure removes the redundant information before determining the intent information for the voice information, the defect of determining the intent information inaccurately because of the presence of redundant information is avoided. The speech processing method of the embodiment of the present disclosure can therefore improve the accuracy of the determined intent information, respond precisely to the user's voice instructions, and improve the user experience.
According to an embodiment of the present disclosure, obtaining the voice information in operation S210 of Fig. 2 may include, for example, determining the starting point and the terminating point of the voice information collected in real time by using an endpoint detection model while the voice information is being collected.
Fig. 3A schematically illustrates a flowchart of obtaining voice information according to an exemplary embodiment of the present disclosure.
As shown in Fig. 3A, the method of obtaining voice information using the endpoint detection model may include the following operations S311 to S312. In operation S311, the starting point of the voice input and the terminating point of the voice input are detected using the endpoint detection model.
According to an embodiment of the present disclosure, the endpoint detection model may include, for example, a model built on voice activity detection (VAD) technology, so as to accurately locate the starting point and the terminating point of speech in the audio. Here, the voice input is the voice stream collected in real time.
According to an embodiment of the present disclosure, the endpoint detection model is constructed, for example, mainly by combining time-domain features and frequency-domain features of the voice input. The time-domain features may include, for example, time-domain energy and energy gradient, and the frequency-domain features may include, for example, fundamental frequency and frequency-domain sub-bands.
According to an embodiment of the present disclosure, detecting the starting point of the voice input may include, for example: determining that the starting point of the voice input is detected when the endpoint detection model detects audio information. Detecting the terminating point of the voice input may include, for example: after the starting point of the voice input has been detected, determining that the terminating point of the voice input is detected if no audio information is detected within a predetermined period of time (for example, within 100 ms); alternatively, determining that the terminating point of the voice input is detected if the detected voice input is redundant voice information.
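The endpointing rule just described (start on the first audible frame, terminate after a fixed silence window) can be sketched as follows. This is an illustrative energy-threshold stand-in for the disclosed time/frequency-domain model; the frame length, threshold and silence window are assumptions, not values from the disclosure.

```python
def detect_endpoints(frames, energy_threshold=0.1, silence_frames=10):
    """Return (start_index, end_index) of the voice segment, or None.

    A frame counts as speech when its mean energy exceeds the threshold;
    the terminating point is declared after `silence_frames` consecutive
    non-speech frames (e.g. 10 frames of 10 ms each, roughly the 100 ms
    window mentioned above).
    """
    start = None
    silent_run = 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            if start is None:
                start = i          # starting point: first audible frame
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= silence_frames:
                return start, i - silence_frames  # terminating point
    return (start, len(frames) - 1) if start is not None else None
```

A production system would replace the energy test with the model's combined time-domain and frequency-domain decision, but the start/stop bookkeeping is the same.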
In operation S312, the voice information is collected according to the starting point of the voice input and the terminating point of the voice input.
According to an embodiment of the present disclosure, operation S312 may include, for example: when the starting point of the voice input is detected, continuously storing the acquired voice stream to the terminal device 111, 112, 113 or the server 130, and stopping storing the voice stream when the terminating point of the voice input is detected. The voice stream finally stored in the terminal device 111, 112, 113 or the server 130 is thus the voice information used for the redundancy judgment and the determination of the intent information.
According to an embodiment of the present disclosure, in order to avoid storing incomplete voice information because the terminating point is determined directly upon detecting redundant voice when the endpoint detection model detects the terminating point of the voice input, the parameters of the endpoint detection model may also be adjusted according to the voice information obtained in real time while the terminating point is being detected. In this way, when the adjusted endpoint detection model detects voice information that meets a termination condition, such as redundancy or silence, it can provide a longer waiting time before deciding whether the terminating point has been detected, thereby guaranteeing the integrity of the obtained voice information.
Fig. 3B schematically illustrates a flowchart of detecting the terminating point of the voice input according to an exemplary embodiment of the present disclosure.
As shown in Fig. 3B, the method of detecting the terminating point of the voice input of the embodiment of the present disclosure may include the following operations S3111 to S3113. In operation S3111, in response to detecting the starting point of the voice input, it is determined whether the detected voice input is redundant voice. In operation S3112, in the case where the detected voice input is determined to be redundant voice, the parameters of the endpoint detection model are changed to obtain an updated endpoint detection model. In operation S3113, the terminating point of the voice input is detected according to the updated endpoint detection model.
According to an embodiment of the present disclosure, operation S3111 may, for example, use a pre-trained first neural network model. After the starting point of the voice input is detected, the voice information obtained in real time is input into the first neural network model, and whether redundant voice exists in the voice information is determined from the output of the first neural network model. The first neural network model may be, for example, a binary classification model, which is not limited in the present disclosure.
According to an embodiment of the present disclosure, in order to prevent a long processing time of the first neural network model from delaying the update of the endpoint detection model, a smaller threshold or a smaller number of layers may be set for the first neural network model to improve its processing efficiency.
According to an embodiment of the present disclosure, the parameters of the endpoint detection model may include, for example, the waiting time for the terminating point (the EOS waiting time), that is, the time to wait before determining the terminating point after silence or redundancy is detected. In order to avoid missing valid voice input that may still be collected after the redundant information, operation S3112 may include: when the voice input is determined to be redundant information, increasing the waiting time for the terminating point, so that a longer waiting time is established for the determination of the terminating point.
Operation S3113 then determines the terminating point of the voice information after the waiting time for the terminating point of the endpoint detection model has been increased, which ensures that the acquisition of the voice information is not terminated prematurely because redundant voice was detected, thereby guaranteeing the integrity of the obtained voice information.
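The EOS-waiting-time adjustment of operations S3111 to S3112 can be sketched as a small state holder. The `is_redundant` check below is a trivial filler-word test standing in for the first neural network model, and the millisecond values are illustrative assumptions.

```python
class EndpointDetector:
    """Sketch of the EOS-waiting-time adjustment described above."""

    def __init__(self, base_eos_wait_ms=100, redundant_eos_wait_ms=400):
        self.base_eos_wait_ms = base_eos_wait_ms
        self.redundant_eos_wait_ms = redundant_eos_wait_ms
        self.eos_wait_ms = base_eos_wait_ms

    def is_redundant(self, transcript_so_far):
        # Stand-in for the first neural network model (a binary classifier).
        fillers = {"um", "uh", "er"}
        words = transcript_so_far.lower().split()
        return bool(words) and words[-1] in fillers

    def on_partial_result(self, transcript_so_far):
        # Operation S3112: if the input heard so far ends in redundant voice,
        # lengthen the wait before the terminating point is declared.
        if self.is_redundant(transcript_so_far):
            self.eos_wait_ms = self.redundant_eos_wait_ms
        else:
            self.eos_wait_ms = self.base_eos_wait_ms
        return self.eos_wait_ms
```

The longer wait after a filler gives the user time to finish the real request ("um... play music") before the stream is cut off.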
Fig. 4 schematically illustrates a flowchart of a speech processing method according to a second exemplary embodiment of the present disclosure.
As shown in Fig. 4, the speech processing method of the embodiment of the present disclosure includes, in addition to operation S210, operations S420 to S430, where operation S420 and operation S430 are specific embodiments of operation S220 and operation S230, respectively. In operation S420, the voice information is recognized using a first speech recognition model to determine whether redundant voice information exists in the voice information. In the case where redundant information exists, operation S430 is executed to remove the redundant information from the voice information according to the starting point of the redundant information and the terminating point of the redundant information. Here, the redundant information is redundant voice information.
According to an embodiment of the present disclosure, the first speech recognition model may include, for example, a second neural network model. Operation S420 may include: inputting the voice information into the second neural network model, which after processing outputs either the probability that each of a plurality of voice segments in the voice information belongs to redundant voice information, or directly outputs, for each of the plurality of voice segments, a result indicating whether or not it is redundant voice information. In the case where the probabilities that the plurality of voice segments belong to redundant voice information are output, if there is a voice segment whose probability of belonging to redundant voice information is greater than a first predetermined probability (for example, 80%), it is determined that redundant voice information exists in the voice information, and the voice segment whose probability is greater than the first predetermined probability is determined to be redundant voice information.
According to an embodiment of the present disclosure, in order to locate the position of each voice segment in the entire voice information, the result corresponding to each of the plurality of voice segments in the output of the second neural network model may further include the start position and the end position of that voice segment in the entire voice information. Operation S430 may then remove, according to the start position and the end position of each voice segment belonging to redundant voice information, that voice segment from the entire voice information, so as to obtain the information to be processed.
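The removal step can be sketched as a splice over the model's per-segment output. The `(start, end, p_redundant)` triples below are a hypothetical shape for the second neural network's output, and the 80% threshold is the first predetermined probability named above.

```python
FIRST_PREDETERMINED_PROBABILITY = 0.8  # the 80% example from the text

def remove_redundant_segments(samples, segments):
    """Splice out voice segments classified as redundant.

    `segments` is assumed to be the model output: (start, end, p_redundant)
    triples indexing into `samples`, with `end` exclusive.
    """
    keep = []
    cursor = 0
    for start, end, p_redundant in sorted(segments):
        if p_redundant > FIRST_PREDETERMINED_PROBABILITY:
            keep.extend(samples[cursor:start])  # audio before the redundant span
            cursor = max(cursor, end)           # skip the redundant span
    keep.extend(samples[cursor:])               # remaining audio
    return keep
```

The result is the spliced acoustic signal used as the information to be processed.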
According to an embodiment of the present disclosure, the second neural network model may also be used, for example, to perform all the operations described for operations S420 to S430, in which case what is output after the processing of the second neural network model is already the information to be processed with the redundant information removed.
It should be noted that the second neural network model differs from the first neural network model mentioned in the foregoing description of Fig. 3B in that the first neural network model can only recognize the voice information, but cannot perform segmentation and/or removal operations on the voice information.
According to an embodiment of the present disclosure, since the information to be processed obtained via operation S430 is still voice information, it first needs to be converted into text information before the user's intent information can be determined. As shown in Fig. 4, operation S240 may include the following operations S441 to S442. In operation S441, the information to be processed is recognized using a second speech recognition model to obtain a text to be processed that matches the information to be processed. In operation S442, the intent information matching the voice information is determined from the text to be processed using a semantic understanding model.
According to an embodiment of the present disclosure, the second speech recognition model may include, for example, a model built on automatic speech recognition (ASR) technology. Operation S441 may specifically include: converting the information to be processed obtained in operation S430 into a speech text using the second speech recognition model.
According to an embodiment of the present disclosure, the semantic understanding model may include, for example, a model built on natural language understanding (NLU) technology. Operation S442 may specifically include: using the semantic understanding model to perform processing such as sentence detection, word segmentation, part-of-speech tagging, syntactic analysis and text classification/clustering on the text to be processed obtained in operation S441, so as to obtain the user's intent information.
In summary, the speech processing method of the embodiment of the present disclosure removes the redundant voice in the voice information before converting the voice information into text using ASR technology, which can improve the accuracy of the speech text recognized by the ASR technology and thus further improve the user experience.
Fig. 5 schematically illustrates a flowchart of a speech processing method according to a third exemplary embodiment of the present disclosure.
As shown in Fig. 5, the speech processing method of the embodiment of the present disclosure includes, in addition to operation S210, operations S521 to S523, where operations S521 to S523 are a specific embodiment of operation S220. In operation S521, the voice information is recognized using the second speech recognition model to obtain a speech text matching the voice information.
According to an embodiment of the present disclosure, operation S521 is similar to operation S441 in Fig. 4, the only difference being that the voice information in operation S521 is voice information from which the redundant information has not been removed, whereas the voice information in operation S441 is voice information from which the redundant information has been removed.
In operation S522, it is judged whether redundant text exists in the speech text.
According to an embodiment of the present disclosure, operation S522 may include: comparing the speech text with a pre-stored redundancy dictionary, and judging whether the speech text includes a redundant word from the redundancy dictionary. If the speech text includes a redundant word, redundant text exists in the speech text; if the speech text does not include any redundant word, no redundant text exists in the speech text.
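The dictionary comparison can be sketched in a few lines. The dictionary entries below are hypothetical English fillers standing in for whatever redundant vocabulary the pre-stored dictionary actually holds.

```python
# Hypothetical contents of the pre-stored redundancy dictionary.
REDUNDANCY_DICTIONARY = {"um", "uh", "er", "like"}

def find_redundant_words(speech_text):
    """Operation S522, dictionary variant: list the redundant words that
    appear in the speech text, in order of appearance."""
    words = speech_text.lower().split()
    return [w for w in words if w in REDUNDANCY_DICTIONARY]

def has_redundant_text(speech_text):
    return bool(find_redundant_words(speech_text))
```

A simple set lookup per word keeps this check cheap enough to run on every recognized utterance.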
According to an embodiment of the present disclosure, whether redundant text exists in the speech text may also be determined, for example, by using a pre-trained third neural network model. Operation S522 may include: inputting the speech text into the third neural network model, which after processing outputs the probability that each of the plurality of words composing the speech text belongs to the redundant vocabulary. When there is a word among the plurality of words whose probability of belonging to the redundant vocabulary is greater than a second predetermined probability (for example, 70%), it can be determined that redundant text exists in the speech text.
In the case where redundant text exists in the speech text, operation S523 can be executed to determine that redundant information exists in the voice information. In order to improve the accuracy of the determined user intent information, the redundant text in the speech text needs to be removed before the user's intent information is determined. Removing the redundant information in the voice information can therefore be realized by operation S530. In operation S530, the redundant text in the speech text is removed to obtain a text to be processed.
According to an embodiment of the present disclosure, operation S530 may include removing the redundant words in the speech text to obtain the text to be processed. According to an embodiment of the present disclosure, in order to locate the position of each text fragment in the entire speech text, the output obtained by the third neural network model may include not only the probability that each word belongs to the redundant vocabulary, but also the byte position of each word in the entire speech text. Operation S530 may then remove from the entire speech text, according to the byte positions of the words whose probability of belonging to the redundant vocabulary is greater than the second predetermined probability, those words, so as to obtain the text to be processed. In this case, the aforementioned information to be processed is the text to be processed.
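The position-based removal can be sketched as a splice over the per-word output. The `(start, end, p_redundant)` triples are a hypothetical shape for the third neural network's output (character positions, end exclusive), and the 70% threshold is the second predetermined probability named above.

```python
SECOND_PREDETERMINED_PROBABILITY = 0.7  # the 70% example from the text

def remove_redundant_text(speech_text, word_results):
    """Operation S530 sketch: remove words whose probability of belonging
    to the redundant vocabulary exceeds the threshold, using their
    positions in the full speech text."""
    kept = []
    cursor = 0
    for start, end, p_redundant in sorted(word_results):
        if p_redundant > SECOND_PREDETERMINED_PROBABILITY:
            kept.append(speech_text[cursor:start])  # text before the word
            cursor = end                            # skip the redundant word
    kept.append(speech_text[cursor:])
    # Collapse any doubled whitespace left behind by the removal.
    return " ".join("".join(kept).split())
```

The normalized remainder is the text to be processed handed to the semantic understanding model.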
After the redundant text is removed, the intent information can be determined. Operation S540 is therefore executed to determine, according to the information to be processed, the intent information matching the voice information using the semantic understanding model. Operation S540 is identical to operation S442 in Fig. 4, and details are not repeated here.
In the case where no redundant text exists in the speech text, it can be determined that no redundant information exists in the voice information, so that the intent information matching the voice information can be determined directly from the speech text. That is, in the case where the judgment result of operation S522 is that no redundant text exists, operation S540 is directly executed to determine the user's intent information. According to an embodiment of the present disclosure, the third neural network differs from the second neural network in that the second neural network is used to process voice information, whereas the third neural network is used to process speech text.
Fig. 6 schematically illustrates an exemplary flowchart of a speech processing method according to an exemplary embodiment of the present disclosure.
As shown in Fig. 6, the speech processing method of the embodiment of the present disclosure first needs to collect the original speech (operation S610). For example, the collected original speech may be "Um, play this Wang Qing Shui of Liu Dehua's". Operation S610 is similar to operation S210, and details are not repeated here.
In order to avoid inaccurate speech recognition caused by the presence of redundant voice in the original speech, the redundant voice needs to be removed. The removal can be performed before the original speech is converted into text, or after the original speech has been converted into text.
When the redundant voice is removed before the original speech is converted into text, operation S620 is executed to remove the redundant information, obtaining voice information corresponding to "play Liu Dehua's Wang Qing Shui". After the redundant voice is removed, operation S630 can be executed to convert the voice information into text using automatic speech recognition technology, obtaining the final recognized text 601, for example the text "play Liu Dehua's Wang Qing Shui". After the final recognized text 601 is obtained, operation S650 can be executed to perform semantic understanding using natural language understanding technology and obtain the user's intent information. Operation S620 can specifically be executed by operations S420 to S430 in Fig. 4, operation S630 can be executed by operation S441 in Fig. 4, and operation S650 can be executed by operation S442 in Fig. 4; details are not repeated here.
When the redundant voice is removed after the original speech has been converted into text, the original speech first needs to be converted into text: operation S630 is executed to convert the original speech into text using automatic speech recognition technology. The redundant text is then removed from the text: operation S640 is executed to remove the redundant text, obtaining the final recognized text 601, for example the text "play Liu Dehua's Wang Qing Shui". After the final recognized text 601 is obtained, operation S650 can be executed to perform semantic understanding using natural language understanding technology and obtain the user's intent information. In this case, operation S630 can specifically be executed by operation S521 in Fig. 5, operation S640 can be executed by operations S522 to S523 and operation S530 in Fig. 5, and operation S650 can be executed by operation S540 in Fig. 5; details are not repeated here.
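The two orderings of Fig. 6 (remove redundancy before ASR versus after ASR) can be sketched side by side. Everything here is an illustrative stand-in: the "audio" is a token list, the ASR step is a join, and the filler set ("um", "this") approximates the redundancy removed in the example utterance above.

```python
FILLERS = {"um", "this"}  # hypothetical redundant vocabulary

def remove_redundant_audio(raw_speech_tokens):
    # Path 1 (operation S620): drop redundant "audio" before ASR.
    return [t for t in raw_speech_tokens if t not in FILLERS]

def asr(tokens):
    # Stand-in for operation S630: token list -> recognized text.
    return " ".join(tokens)

def remove_redundant_text(text):
    # Path 2 (operation S640): drop redundant words after ASR.
    return " ".join(w for w in text.split() if w not in FILLERS)

raw = ["um", "play", "Liu", "Dehua's", "this", "Wang", "Qing", "Shui"]
path1 = asr(remove_redundant_audio(raw))   # remove before conversion to text
path2 = remove_redundant_text(asr(raw))    # remove after conversion to text
```

Under these simplified stand-ins both orderings yield the same final recognized text 601; in practice the choice decides whether the second neural network (audio) or the third neural network (text) does the removal.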
Fig. 7 schematically illustrates a structural block diagram of a speech processing apparatus according to an exemplary embodiment of the present disclosure.
As shown in Fig. 7, the speech processing apparatus 700 of the embodiment of the present disclosure includes an obtaining module 710, a redundant information determining module 720, a redundant information removing module 730 and an intent information determining module 740.
The obtaining module 710 is used to obtain voice information (operation S210).
According to an embodiment of the present disclosure, as shown in Fig. 7, the obtaining module 710 may include a detection submodule 711 and an obtaining submodule 712. The detection submodule is used to detect the starting point of the voice input and the terminating point of the voice input using the endpoint detection model (operation S311). The obtaining submodule is used to collect the voice information according to the starting point of the voice input and the terminating point of the voice input (operation S312).
According to an embodiment of the present disclosure, as shown in Fig. 7, the detection submodule 711 may include a voice determining unit 7111, a parameter changing unit 7112 and a detection unit 7113. The voice determining unit 7111 is used to determine, in response to detecting the starting point of the voice input, whether the detected voice input is redundant voice (operation S3111). The parameter changing unit 7112 is used to change the parameters of the endpoint detection model in the case where the detected voice input is determined to be redundant voice, so as to obtain an updated endpoint detection model (operation S3112). The detection unit 7113 is used to detect the terminating point of the voice input according to the updated endpoint detection model (operation S3113).
According to an embodiment of the present disclosure, the parameters of the endpoint detection model include the waiting time for the terminating point, and the parameter changing unit 7112 is used to increase the waiting time for the terminating point.
The redundant information determining module 720 is used to determine whether redundant information exists in the voice information (operation S220). The redundant information removing module 730 is used to remove the redundant information in the case where redundant information exists in the voice information, so as to obtain the information to be processed (operation S230). The intent information determining module 740 is used to determine the intent information for the voice information according to the information to be processed (operation S240).
According to an embodiment of the present disclosure, the redundant information determining module 720 is used to recognize the voice information using the first speech recognition model and determine whether redundant voice information exists in the voice information (operation S420). The redundant information removing module 730 is used to remove the redundant information from the voice information according to the starting point of the redundant information and the terminating point of the redundant information (operation S430). As shown in Fig. 7, the intent information determining module 740 may include a first recognition submodule 741 and an intent determining submodule 742. The first recognition submodule is used to recognize the information to be processed using the second speech recognition model and obtain a text to be processed matching the information to be processed (operation S441). The intent determining submodule 742 is used to determine the intent information matching the voice information from the text to be processed using the semantic understanding model (operation S442). Here, the redundant information includes redundant voice information.
According to an embodiment of the present disclosure, as shown in Fig. 7, the redundant information determining module 720 may include a second recognition submodule 721, a text determining submodule 722 and a redundancy determining submodule 723. The second recognition submodule 721 is used to recognize the voice information using the second speech recognition model and obtain a speech text matching the voice information (operation S521). The text determining submodule 722 is used to determine whether redundant text exists in the speech text (operation S522). The redundancy determining submodule 723 is used to determine that redundant information exists in the voice information in the case where redundant text exists in the speech text (operation S523). The redundant information removing module 730 is used to remove the redundant text in the speech text and obtain a text to be processed, where the information to be processed includes the text to be processed (operation S530). The intent information determining module 740 is used to determine the intent information matching the voice information according to the information to be processed using the semantic understanding model (operation S540).
Any number of the modules, submodules, units and subunits according to the embodiments of the present disclosure, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules, submodules, units and subunits according to the embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, submodules, units and subunits according to the embodiments of the present disclosure may be at least partially implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, or an application-specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented in any one of the three implementation manners of software, hardware and firmware, or in an appropriate combination of any of them. Alternatively, one or more of the modules, submodules, units and subunits according to the embodiments of the present disclosure may be at least partially implemented as a computer program module which, when run, can perform the corresponding function.
Fig. 8 schematically illustrates a structural block diagram of an electronic device adapted to implement the speech processing method according to an embodiment of the present disclosure.
As shown in Fig. 8, the electronic device 800 includes a processor 810 and a computer-readable storage medium 820. The electronic device 800 can execute the method according to the embodiment of the present disclosure.
Specifically, the processor 810 may include, for example, a general-purpose microprocessor, an instruction set processor and/or a related chipset and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)), and so on. The processor 810 may also include an onboard memory for caching purposes. The processor 810 may be a single processing unit or multiple processing units for executing the different actions of the method flow according to the embodiment of the present disclosure.
The computer-readable storage medium 820 may be, for example, a non-volatile computer-readable storage medium. Specific examples include, but are not limited to: magnetic storage devices, such as magnetic tapes or hard disks (HDDs); optical storage devices, such as compact discs (CD-ROMs); memories, such as random access memories (RAMs) or flash memories; and so on.
The computer-readable storage medium 820 may include a computer program 821, which may include code/computer-executable instructions that, when executed by the processor 810, cause the processor 810 to execute the method according to the embodiment of the present disclosure or any variant thereof.
The computer program 821 may be configured to have, for example, computer program code including computer program modules. For example, in an exemplary embodiment, the code in the computer program 821 may include one or more program modules, for example including module 821A, module 821B, and so on. It should be noted that the division manner and number of the modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation. When these combinations of program modules are executed by the processor 810, the processor 810 executes the method according to the embodiment of the present disclosure or any variant thereof.
According to an embodiment of the present disclosure, at least one of the modules, submodules and units described with reference to Fig. 7 may be implemented as a computer program module described with reference to Fig. 8 which, when executed by the processor 810, can implement the corresponding operations described above.
The present disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist alone without being assembled into the device/apparatus/system. The above computer-readable storage medium carries one or more programs which, when executed, implement the method according to the embodiment of the present disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in connection with an instruction execution system, apparatus or device.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of the systems, methods and computer program products according to the various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, and the above module, program segment or part of code includes one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will understand that the features recorded in the various embodiments and/or claims of the present disclosure can be combined in multiple ways, even if such combinations are not explicitly recorded in the present disclosure. In particular, without departing from the spirit or teaching of the present disclosure, the features recorded in the various embodiments and/or claims of the present disclosure can be combined in multiple ways. All such combinations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail can be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the following claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by the equivalents of the appended claims.
Claims (10)
1. A speech processing method, comprising:
acquiring voice information;
determining whether redundant information exists in the voice information;
in a case where redundant information exists in the voice information, removing the redundant information to obtain information to be processed; and
determining, according to the information to be processed, intent information for the voice information.
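The overall flow of claim 1 can be sketched in a few lines. This is a toy illustration only: the function names and the token-list representation of the voice information are assumptions, not anything the claim prescribes.

```python
def process_speech(voice_info, find_redundancy, extract_intent):
    """Sketch of claim 1: remove redundancy, then determine intent.

    `find_redundancy` and `extract_intent` are hypothetical callables
    standing in for the detection and semantic-understanding models
    that the claim leaves unspecified.
    """
    span = find_redundancy(voice_info)  # None when no redundancy exists
    if span is not None:
        start, end = span
        pending = voice_info[:start] + voice_info[end:]  # strip redundant span
    else:
        pending = voice_info
    return extract_intent(pending)


# Toy usage: treat the voice information as a token list and flag
# filler tokens ("um") as the redundant span.
def find_um(tokens):
    idxs = [i for i, t in enumerate(tokens) if t == "um"]
    return (idxs[0], idxs[-1] + 1) if idxs else None

intent = process_speech(["play", "um", "um", "music"], find_um, " ".join)
print(intent)  # -> play music
```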
2. The method according to claim 1, wherein acquiring the voice information comprises:
detecting a starting point of a voice input and a terminating point of the voice input by using an endpoint detection model; and
acquiring the voice information according to the starting point of the voice input and the terminating point of the voice input.
3. The method according to claim 2, wherein detecting the terminating point of the voice input comprises:
in response to detecting the starting point of the voice input, determining whether the detected voice input is redundant voice;
in a case where it is determined that the detected voice input is redundant voice, changing a parameter of the endpoint detection model to obtain an updated endpoint detection model; and
detecting the terminating point of the voice input according to the updated endpoint detection model.
4. The method according to claim 3, wherein:
the parameter of the endpoint detection model comprises a waiting time for the terminating point; and
changing the parameter of the endpoint detection model comprises increasing the waiting time for the terminating point.
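Claims 2–4 describe an endpoint detector whose waiting time for the terminating point grows once the input looks like redundant voice, so the detector does not cut the user off mid-filler. A simplified frame-based sketch follows; the frame counts, the boolean voiced/silent frames, and the `is_redundant_frame` predicate are illustrative assumptions, not details from the claims.

```python
def detect_end_point(frames, is_redundant_frame, base_wait=10, extended_wait=30):
    """Return the frame index of the terminating point.

    frames: sequence of booleans, True = voiced frame, False = silence.
    is_redundant_frame: hypothetical predicate flagging filler speech.
    Per claim 4, detecting redundant voice increases the waiting time,
    i.e. the number of consecutive silent frames needed to declare the end.
    """
    wait = base_wait
    silent_run = 0
    for i, voiced in enumerate(frames):
        if voiced:
            silent_run = 0
            if is_redundant_frame(i):
                wait = extended_wait  # claim 4: lengthen the endpoint wait
        else:
            silent_run += 1
            if silent_run >= wait:
                return i              # terminating point found
    return len(frames) - 1            # end of stream reached first

speech = [True] * 5 + [False] * 15
print(detect_end_point(speech, lambda i: False))  # -> 14 (ends after 10 silent frames)
print(detect_end_point(speech, lambda i: True))   # -> 19 (wait extended past the pause)
```

With the extended wait, a pause after a filler word is no longer long enough to trigger the terminating point, so the user can finish the utterance.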
5. The method according to claim 1, wherein:
determining whether redundant information exists in the voice information comprises: recognizing the voice information by using a first speech recognition model, to determine whether redundant voice information exists in the voice information;
determining the intent information matching the voice information comprises:
recognizing the information to be processed by using a second speech recognition model, to obtain a text to be processed matching the information to be processed; and
determining, according to the text to be processed, the intent information matching the voice information by using a semantic understanding model,
wherein the redundant information comprises the redundant voice information.
6. The method according to claim 5, wherein removing the redundant information to obtain the information to be processed comprises:
removing the redundant information from the voice information according to a starting point of the redundant information and a terminating point of the redundant information.
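On sampled audio, the removal of claim 6 amounts to slicing out the sample range between the redundancy's starting and terminating points. The sample rate and the list representation of the audio below are assumptions for illustration.

```python
def remove_segment(samples, start_s, end_s, sample_rate=16000):
    """Drop the samples between start_s and end_s (in seconds), per claim 6."""
    start = int(start_s * sample_rate)
    end = int(end_s * sample_rate)
    return samples[:start] + samples[end:]


# 1 second of dummy audio; cut out the span from 0.25 s to 0.5 s.
audio = list(range(16000))
trimmed = remove_segment(audio, 0.25, 0.5)
print(len(trimmed))  # -> 12000
```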
7. The method according to claim 1, wherein:
determining whether redundant information exists in the voice information comprises:
recognizing the voice information by using a second speech recognition model, to obtain a speech text matching the voice information;
determining whether a redundant text exists in the speech text; and
in a case where the redundant text exists in the speech text, determining that redundant information exists in the voice information; and
determining the intent information for the voice information comprises: determining, according to the information to be processed, the intent information matching the voice information by using a semantic understanding model.
8. The method according to claim 7, wherein removing the redundant information to obtain the information to be processed comprises:
removing the redundant text from the speech text to obtain a text to be processed,
wherein the information to be processed comprises the text to be processed.
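In the claim 7/8 variant, redundancy detection runs on the recognized transcript and removal is a text edit. A sketch with a hypothetical filler-word vocabulary (the claims do not specify how redundant text is identified):

```python
import re

FILLERS = {"um", "uh", "er"}  # hypothetical redundant-text vocabulary

def remove_redundant_text(speech_text):
    """Claims 7-8 sketch: flag and strip redundant tokens from the transcript."""
    tokens = re.findall(r"\w+", speech_text.lower())
    kept = [t for t in tokens if t not in FILLERS]
    has_redundancy = len(kept) != len(tokens)
    return " ".join(kept), has_redundancy

print(remove_redundant_text("Um, play uh some music"))  # -> ('play some music', True)
```

The cleaned text is then the "text to be processed" handed to the semantic understanding model.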
9. A speech processing apparatus, comprising:
an acquisition module configured to acquire voice information;
a redundant information determining module configured to determine whether redundant information exists in the voice information;
a redundant information removing module configured to, in a case where redundant information exists in the voice information, remove the redundant information to obtain information to be processed; and
an intent information determining module configured to determine, according to the information to be processed, intent information for the voice information.
10. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910583851.6A CN110310632A (en) | 2019-06-28 | 2019-06-28 | Method of speech processing and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110310632A true CN110310632A (en) | 2019-10-08 |
Family
ID=68079695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910583851.6A Pending CN110310632A (en) | 2019-06-28 | 2019-06-28 | Method of speech processing and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110310632A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090063150A1 (en) * | 2007-08-27 | 2009-03-05 | International Business Machines Corporation | Method for automatically identifying sentence boundaries in noisy conversational data |
CN102567290A (en) * | 2010-12-30 | 2012-07-11 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for expanding short text to be processed |
CN103425744A (en) * | 2013-07-17 | 2013-12-04 | 百度在线网络技术(北京)有限公司 | Method and device used for identifying addressing request in inquiry sequence of user |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
CN105427870A (en) * | 2015-12-23 | 2016-03-23 | 北京奇虎科技有限公司 | Voice recognition method and device aiming at pauses |
CN107195303A (en) * | 2017-06-16 | 2017-09-22 | 北京云知声信息技术有限公司 | Method of speech processing and device |
CN108257616A (en) * | 2017-12-05 | 2018-07-06 | 苏州车萝卜汽车电子科技有限公司 | Interactive detection method and device |
CN109377998A (en) * | 2018-12-11 | 2019-02-22 | 科大讯飞股份有限公司 | A kind of voice interactive method and device |
US20190163691A1 (en) * | 2017-11-30 | 2019-05-30 | CrowdCare Corporation | Intent Based Dynamic Generation of Personalized Content from Dynamic Sources |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111292729A (en) * | 2020-02-06 | 2020-06-16 | 北京声智科技有限公司 | Method and device for processing audio data stream |
WO2021114840A1 (en) * | 2020-05-28 | 2021-06-17 | 平安科技(深圳)有限公司 | Scoring method and apparatus based on semantic analysis, terminal device, and storage medium |
CN113539295A (en) * | 2021-06-10 | 2021-10-22 | 联想(北京)有限公司 | Voice processing method and device |
CN113539295B (en) * | 2021-06-10 | 2024-04-23 | 联想(北京)有限公司 | Voice processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107423363B (en) | Artificial intelligence based word generation method, device, equipment and storage medium | |
CN110047481B (en) | Method and apparatus for speech recognition | |
CN107305541A (en) | Speech recognition text segmentation method and device | |
US10535352B2 (en) | Automated cognitive recording and organization of speech as structured text | |
CN110310632A (en) | Method of speech processing and device and electronic equipment | |
US20180047387A1 (en) | System and method for generating accurate speech transcription from natural speech audio signals | |
CN106649253B (en) | Auxiliary control method and system based on rear verifying | |
US20230068897A1 (en) | On-device personalization of speech synthesis for training of speech recognition model(s) | |
US20210151039A1 (en) | Method and apparatus for speech interaction, and computer storage medium | |
US11783808B2 (en) | Audio content recognition method and apparatus, and device and computer-readable medium | |
CN111916088B (en) | Voice corpus generation method and device and computer readable storage medium | |
CN111462741B (en) | Voice data processing method, device and storage medium | |
CN111951779A (en) | Front-end processing method for speech synthesis and related equipment | |
CN112151015A (en) | Keyword detection method and device, electronic equipment and storage medium | |
CN108877779B (en) | Method and device for detecting voice tail point | |
CN109377990A (en) | A kind of information processing method and electronic equipment | |
CN111428011B (en) | Word recommendation method, device, equipment and storage medium | |
CN110889008B (en) | Music recommendation method and device, computing device and storage medium | |
CN112466287B (en) | Voice segmentation method, device and computer readable storage medium | |
CN113658586A (en) | Training method of voice recognition model, voice interaction method and device | |
JP6322125B2 (en) | Speech recognition apparatus, speech recognition method, and speech recognition program | |
CN112397053B (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN112201225B (en) | Corpus acquisition method and device, readable storage medium and electronic equipment | |
CN112669833A (en) | Voice interaction error correction method and device | |
US11929070B1 (en) | Machine learning label generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191008 |