CN110956955B - Voice interaction method and device - Google Patents

Voice interaction method and device

Info

Publication number
CN110956955B
Authority
CN
China
Prior art keywords
voice
data
position information
recognition processing
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911257073.8A
Other languages
Chinese (zh)
Other versions
CN110956955A (en)
Inventor
吴旭貌
薛少飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911257073.8A priority Critical patent/CN110956955B/en
Publication of CN110956955A publication Critical patent/CN110956955A/en
Application granted granted Critical
Publication of CN110956955B publication Critical patent/CN110956955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice interaction method and device, and relates to the technical field of computers. One embodiment of the method comprises: carrying out first voice recognition processing on received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data to determine a target intention of the voice data; acquiring geographic position information required by second voice recognition processing; performing second voice recognition processing on the voice data according to the geographic position information to obtain second text data of the voice data; and determining whether the target intention is related to a position factor. If the target intention is related to the position factor, the information to be output is determined according to the second text data; otherwise, the information to be output is determined according to the first text data. The method improves the accuracy of voice recognition and the performance and experience of voice interaction products.

Description

Voice interaction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a voice interaction method and device.
Background
In an existing voice interaction system, after a microphone receives the user's voice, the voice is input into an automatic speech recognition (ASR) system built on an acoustic model and a language model to recognize the text spoken by the user; the text information is then processed by a natural language understanding (NLU) system, a dialogue management (DM) system decides the next machine action (the information to be output), and finally a text-to-speech (TTS) system broadcasts the machine's final feedback voice.
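For illustration only, the conventional pipeline described above can be sketched as follows; the class and method names are assumptions introduced for this sketch and are not part of any particular system.

```python
# Minimal sketch of the conventional voice interaction pipeline described above.
# The component classes and method names are illustrative assumptions.

class VoiceInteractionPipeline:
    def __init__(self, asr, nlu, dm, tts):
        self.asr = asr    # automatic speech recognition: audio -> text
        self.nlu = nlu    # natural language understanding: text -> intention
        self.dm = dm      # dialogue management: intention -> next machine action
        self.tts = tts    # text-to-speech: reply text -> audio

    def handle(self, audio):
        text = self.asr.recognize(audio)             # text spoken by the user
        intention = self.nlu.understand(text)        # semantic understanding result
        reply_text = self.dm.next_action(intention)  # information to be output
        return self.tts.synthesize(reply_text)       # final feedback voice
```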
In particular, in a vehicle-mounted voice interaction system, the domain most frequently used by users is navigation. In existing vehicle-mounted voice interaction systems, the language model is constructed from nationwide location information, so it adapts poorly in many local regions and the voice navigation experience suffers. Because the language model is built on nationwide location information, when two cities contain place names that sound the same but are written with different characters, the voice navigation results for the two cities are identical, i.e., the ASR results are the same. The ASR result is therefore necessarily wrong for at least one of the cities, and the erroneous text information strongly affects the subsequent NLU and DM stages, degrading the overall experience. For example, if a user expects the voice interaction system to return a low-frequency POI (point of interest) name in a third-tier city, the address may not be recognized by existing systems.
Further, since the existing ASR system uses a language model built on all POI information nationwide, some low-frequency POI points carry very low weight in that nationwide information; they therefore receive low language-model scores during ASR decoding, and the probability of their appearing in the result is small. In addition, when place names that are homophones, or near-homophones, written with different characters receive comparable scores from the acoustic model, a single nationwide language model inevitably yields the same single POI result regardless of the city, so the texts recognized by the ASR are indistinguishable and the experience of the whole voice interaction system is degraded.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for voice interaction, which can improve accuracy of voice recognition and improve performance and experience of voice interaction.
To achieve the above object, according to an aspect of an embodiment of the present invention, a method of voice interaction is provided.
The voice interaction method of the embodiment of the invention comprises the following steps: carrying out first voice recognition processing on received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data to determine a target intention of the voice data; acquiring geographic position information required by second voice recognition processing; performing second voice recognition processing on the voice data according to the geographic position information to obtain second text data of the voice data; determining whether the target intention is related to a position factor; if the target intention is related to the position factor, determining the information to be output according to the second text data; otherwise, determining the information to be output according to the first text data.
Optionally, before performing the first voice recognition processing on the received voice data, the method further includes: performing voice signal processing on the received voice data; wherein the voice signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing.
Optionally, the step of performing semantic understanding on the first text data and determining the target intention of the voice data includes: calling a semantic understanding model, and analyzing the first text data through the semantic understanding model to determine the target intention of the voice data; the semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
Optionally, the step of determining whether the target intention is related to a position factor comprises: acquiring a preset intention set, wherein the intention set includes at least one intention related to the position factor and/or includes at least a navigation intention; determining whether the target intention belongs to the intention set; if so, the target intention is related to the position factor; otherwise, the target intention is not related to the position factor.
Optionally, the step of acquiring the geographic position information required for the second voice recognition processing includes: acquiring the geographic position information required by the second voice recognition processing according to a preset rule; wherein the preset rule comprises at least one of the following: acquiring current geographic position information and using it as the geographic position information required by the second voice recognition processing; receiving geographic position information input by a user and using it as the geographic position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographic position information as the geographic position information required by the second voice recognition processing.
Optionally, after determining the information to be output, the method further includes: converting the information to be output into voice data to be output; and outputting the voice data to be output.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided an apparatus for voice interaction.
The voice interaction device of the embodiment of the invention comprises:
the first voice recognition processing module is used for performing first voice recognition processing on received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data, and determining a target intention of the voice data;
the second voice recognition processing module is used for acquiring geographic position information required by second voice recognition processing; according to the geographic position information, performing second voice recognition processing on the voice data to obtain second text data of the voice data;
the judging module is used for judging whether the target intention is related to a position factor;
the information to be output determining module is used for determining the information to be output according to the second text data if the target intention is related to the position factor; otherwise, determining the information to be output according to the first text data.
Optionally, the system further comprises a voice signal processing module, configured to perform voice signal processing on the received voice data; wherein the speech signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing.
Optionally, the first voice recognition processing module is further configured to invoke a semantic understanding model, and analyze the first text data through the semantic understanding model to determine the target intention of the voice data; the semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
Optionally, the judging module is further configured to acquire a preset intention set, wherein the intention set includes at least one intention related to the position factor and/or includes at least a navigation intention; determine whether the target intention belongs to the intention set; if so, the target intention is related to the position factor; otherwise, the target intention is not related to the position factor.
Optionally, the second voice recognition processing module is further configured to acquire the geographic position information required for the second voice recognition processing according to a preset rule; wherein the preset rule comprises at least one of the following: acquiring current geographic position information and using it as the geographic position information required by the second voice recognition processing; receiving geographic position information input by a user and using it as the geographic position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographic position information as the geographic position information required by the second voice recognition processing.
Optionally, the system further comprises an output module, configured to convert the information to be output into voice data to be output; and outputting the voice data to be output.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of voice interaction of any of the above.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer readable medium having a computer program stored thereon, wherein the program is configured to implement the method of voice interaction of any one of the above when executed by a processor.
One embodiment of the above invention has the following advantages or benefits: for different requirements, when voice interaction is realized, the corresponding output result can be determined according to the relevance of the target intention to the position factor. The second voice recognition processing takes the geographic position factor into consideration, so that for intentions related to geographic factors a more accurate recognition result can be obtained through the second voice recognition processing; the information the user expects can then be pushed accurately, and the user experience is improved.
Further effects of the above optional features will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a diagram illustrating a method of voice interaction according to a first embodiment of the invention;
FIG. 2 is a diagram illustrating a method of voice interaction according to a second embodiment of the invention;
FIG. 3 is a diagram of a system implemented by a method for voice interaction according to a first embodiment of the invention;
FIG. 4 is a diagram of a system implemented by a method of voice interaction according to a second embodiment of the invention;
FIG. 5 is a schematic diagram of the main modules of an apparatus for voice interaction according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The voice interaction method of the embodiment of the invention comprises the following steps: carrying out first voice recognition processing on the received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data to determine a target intention of the voice data; acquiring geographic position information required by second voice recognition processing; performing second voice recognition processing on the voice data according to the geographic position information to obtain second text data of the voice data; determining whether the target intention is related to a position factor; if the target intention is related to the position factor, determining the information to be output according to the second text data; otherwise, determining the information to be output according to the first text data. The relevance of an intention to the position factor can be configured in advance (or analyzed in real time from historical data). For example, intentions related to the position factor, such as navigation (determining a route to a certain place), search and positioning (finding a nearby supermarket, hospital, and the like), or certain interaction parameter settings (e.g., outputting voice in a local dialect), can be marked in advance as related to geographic factors. The second voice recognition processing takes the geographic position factor into consideration; for example, the model used for the second voice recognition processing is trained on labeled sample data that includes geographic factors. Therefore, for intentions related to geographic factors, a more accurate recognition result can be obtained by the second voice recognition processing.
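As a rough illustration of this flow, not a definitive implementation, the following Python sketch shows the two recognition passes and the selection of the output text; the function names, the intention labels and the recognizer interfaces are all assumptions.

```python
# Illustrative sketch of the overall method; all names and labels are assumptions.

LOCATION_RELATED_INTENTS = {"navigation", "nearby_search", "dialect_setting"}

def voice_interaction(audio, asr_general, asr_geo, nlu, get_geo_info):
    # First voice recognition processing with a general-purpose language model.
    first_text = asr_general.recognize(audio)
    # Semantic understanding of the first text to determine the target intention.
    target_intention = nlu.classify_intent(first_text)
    # Second voice recognition processing, conditioned on geographic position information.
    geo_info = get_geo_info()
    second_text = asr_geo.recognize(audio, location=geo_info)
    # Select the text according to whether the target intention is related to the position factor.
    if target_intention in LOCATION_RELATED_INTENTS:
        return second_text  # information to be output determined from the second text data
    return first_text       # otherwise determined from the first text data
```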
In the embodiment of the present invention, the sequence of the processing procedures is not unique, and the execution sequence may be adjusted according to the actual application requirements. According to the embodiment of the invention, aiming at different requirements, when the voice interaction is realized, the corresponding output result can be determined according to the correlation of the target intention and the position factor. The output result integrates the factors of the geographic position information, so that the information expected by the user can be accurately pushed, and the user experience is improved.
FIG. 1 is a diagram illustrating a method of voice interaction according to a first embodiment of the invention; FIG. 2 is a diagram illustrating a method of voice interaction according to a second embodiment of the invention; FIG. 3 is a diagram of a system implemented by a method for voice interaction according to a first embodiment of the invention; fig. 4 is a schematic diagram of a system implemented by the method of voice interaction according to the second embodiment of the present invention.
As shown in fig. 1, in the first embodiment of the present invention, a process of acquiring the geographical location information required for the second speech recognition processing and performing the second speech recognition processing on the speech data according to the geographical location information is performed, and then a determination is performed as to whether the target intention is related to the location factor. Specifically, as shown in fig. 1, the method for voice interaction according to the embodiment of the present invention mainly includes:
step S101: carrying out first voice recognition processing on the received voice data to obtain first text data of the voice data; and performing semantic understanding on the first text data, and determining a target intention of the voice data.
Step S102: acquiring geographic position information required by second voice recognition processing; and according to the geographic position information, performing second voice recognition processing on the voice data to obtain second text data of the voice data.
Step S103: determining whether the target intention is related to the position factor. If the target intention is related to the position factor, step S104 is executed; otherwise, step S105 is executed.
Step S104: and determining information to be output according to the second text data.
Step S105: and determining information to be output according to the first text data.
According to the first embodiment of the present invention, step S101 and step S102 may be executed concurrently, and after the relevance of the target intention of the voice data to the position factor is determined, the corresponding output information can be determined directly. This embodiment not only improves the accuracy of voice recognition, but also reduces the first-word delay and overall response delay of the whole recognition process, greatly improving the performance and experience of the voice product.
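A minimal sketch of this concurrent variant, assuming hypothetical recognizer, NLU and location-provider objects, might run steps S101 and S102 in parallel as follows.

```python
# Sketch of the first embodiment: steps S101 and S102 executed concurrently.
# The recognizer, NLU and location-provider objects are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def voice_interaction_concurrent(audio, asr_general, asr_geo, nlu, get_geo_info,
                                 location_related_intents):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # S101: first voice recognition processing.
        first_future = pool.submit(asr_general.recognize, audio)
        # S102: acquire geographic position information and run the second recognition.
        second_future = pool.submit(
            lambda: asr_geo.recognize(audio, location=get_geo_info()))
        first_text = first_future.result()
        target_intention = nlu.classify_intent(first_text)
        second_text = second_future.result()
    # S103-S105: select the output text by the intention's relation to the position factor.
    return second_text if target_intention in location_related_intents else first_text
```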
As shown in fig. 2, in the second embodiment of the present invention, the process of acquiring the geographic position information required by the second voice recognition processing and performing the second voice recognition processing on the voice data according to that information is executed only after the target intention is determined to be related to the position factor. In the second embodiment, step S201 and step S203 are not executed concurrently, so fewer resources are required. Specifically, as shown in fig. 2, the method for voice interaction according to the embodiment of the present invention mainly includes:
step S201: carrying out first voice recognition processing on the received voice data to obtain first text data of the voice data; and performing semantic understanding on the first text data, and determining a target intention of the voice data.
Step S202: determining whether the target intention is related to the position factor. If the target intention is related to the position factor, step S203 is executed; otherwise, step S204 is executed.
Step S203: acquiring geographic position information required by second voice recognition processing; and according to the geographic position information, performing second voice recognition processing on the voice data to obtain second text data of the voice data. And determining information to be output according to the second text data.
Step S204: and determining information to be output according to the first text data.
As shown in fig. 3, in an implementation of the first embodiment of the present invention, multiple recognition paths with different language models may be set according to the intention of the voice data, such as the first-path language model and the second-path language model shown in fig. 3. The first-path language model covers all domains of the voice interaction application (such as a vehicle-mounted voice interaction device), i.e., it has no obvious bias toward any particular intention of the voice data. Unlike the first path, the second-path language model focuses on domains that require geographic information, such as navigation. Specifically, the voice information of the user may be received by an audio input device such as a microphone; after passing through the voice signal processing module, the data is processed by the first path and the second path at the same time. The acoustic models of the two paths are the same, but their language models differ greatly: the first-path language model covers all domains of in-vehicle voice interaction without an obvious bias, whereas the second-path language model can specifically recognize voice data carrying certain intentions on the basis of the geographic position information. The geographic position information may come from GPS, from user input, or from statistical analysis of the user's data. The text data output by the first-path ASR is classified by an NLU tool to obtain the intention of the text, and this intention judgment influences the subsequent selection of the result.
Take the navigation domain as an example: if the voice intention is navigation and the output text contains no explicit place, the second-path language model corresponding to the obtained geographic position information is loaded, and the final output result comes from the second-path language model. If the text contains an explicit place and the voice intention is navigation, the second-path language model of the corresponding city is loaded, and the final result likewise comes from the second-path language model. If the intention of the text belongs to a category that does not require geographic information (i.e., is not related to geographic factors), the final output result comes from the first-path language model. Once the ASR text (the text data recognized from the voice data) and the selected path are determined, the text can be passed through the NLU and DM modules to obtain the next action of the machine. The final feedback of the machine can be output by TTS. The embodiment of the invention fully considers the influence of delay: supporting multi-path language models does not affect the concurrency of the system, and the processing efficiency is improved.
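The result-selection logic of this navigation example could look roughly like the following sketch; the intention label, the place-extraction helper and the city-specific recognizer interface are hypothetical.

```python
# Hypothetical sketch of the path-selection logic for the navigation domain;
# extract_place, current_geo_info and the second-path recognizer are assumptions.
def select_recognition_result(intention, first_text, audio,
                              asr_second_path, extract_place, current_geo_info):
    if intention != "navigation":
        # Intentions that need no geographic information keep the first-path result.
        return first_text
    place = extract_place(first_text)             # explicit city/place in the text, if any
    city = place if place else current_geo_info   # otherwise fall back to GPS/user/history
    # Load the second-path language model of that city and decode the audio again.
    return asr_second_path.recognize(audio, city=city)
```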
As shown in fig. 4, in an implementation of the second embodiment of the present invention, the biggest difference from the first embodiment is that the voice data does not need to pass through two ASR modules simultaneously; instead, a judgment module that combines the NLU result with the geographic information decides, based on its feedback, whether a second recognition pass is needed. In this embodiment the two ASR passes are not executed concurrently, so fewer resources are required.
The detailed process is as follows:
1) firstly, voice of a user can be input through equipment such as a microphone, after voice data of the user is received, the voice data passes through a voice signal processing module and then is input into an ASR module, and text data corresponding to the voice data is obtained;
2) the second module is the same judgment module as in the previous flow (the NLU module that determines the intention); it classifies the text by analyzing it to obtain the corresponding text intention. Taking the navigation domain as an example, the decision logic of this judgment module is the same as in the first embodiment. The difference is that in the first embodiment the second-path language model directly recognizes the voice data according to the geographic position information, and the ASR of that second path is a recognition module trained specifically for the corresponding intentions;
3) after the recognized text data is determined, the next action of the machine can be obtained through the NLU and DM modules, and the final feedback of the machine can be output by TTS.
Preferably, voice signal processing is performed on the received voice data before the first voice recognition processing is performed, where the voice signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing. In the process of performing semantic understanding on the first text data and determining the target intention of the voice data, a semantic understanding model is called, and the first text data is analyzed through the semantic understanding model to determine the target intention of the voice data. The semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
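A minimal sketch of such an intention classifier, here using scikit-learn with a naive Bayes model over TF-IDF features, is shown below; the tiny training set and the intention labels are invented for illustration, and any of the other listed algorithms (decision tree, SVM, logistic regression) could be substituted.

```python
# Minimal sketch of an intention classifier of the kind described, using a naive
# Bayes model over TF-IDF features. The training texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["navigate to the people's hospital", "take me to the airport",
               "play some relaxing music", "what is the weather tomorrow"]
train_intents = ["navigation", "navigation", "music", "weather"]

intent_classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
intent_classifier.fit(train_texts, train_intents)

def determine_target_intention(first_text):
    # Returns the target intention predicted for the first text data.
    return intent_classifier.predict([first_text])[0]
```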
In the process of judging whether the target intention is related to the position factor, a preset intention set is acquired; the intention set includes at least one intention related to the position factor and/or includes at least a navigation intention. Whether the target intention belongs to the intention set is then judged; if so, the target intention is related to the position factor; otherwise, the target intention is not related to the position factor.
In the process of acquiring the geographic position information required by the second voice recognition processing, the geographic position information is acquired according to a preset rule, where the preset rule comprises at least one of the following: acquiring current geographic position information and using it as the geographic position information required by the second voice recognition processing; receiving geographic position information input by the user and using it as the geographic position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographic position information as the geographic position information required by the second voice recognition processing. Priorities may be set among these three approaches, and the geographic position information required by the second voice recognition processing is determined according to the set priorities. In the embodiment of the invention, after the information to be output is determined, it is converted into voice data to be output, and the voice data to be output is output.
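The priority-based acquisition of the geographic position information might be sketched as follows, assuming hypothetical sources for the current position, the user input and the historical data.

```python
# Sketch of priority-based acquisition of the geographic position information:
# current position first, then user input, then statistics over historical data.
# The data sources passed in are assumptions.
from collections import Counter

def get_geo_info(read_gps=None, user_input=None, history_locations=None):
    # 1) Current geographical position information (e.g. from GPS), if available.
    if read_gps is not None:
        current = read_gps()
        if current:
            return current
    # 2) Geographical position information explicitly input by the user.
    if user_input:
        return user_input
    # 3) Most frequent location in the user's historical data.
    if history_locations:
        return Counter(history_locations).most_common(1)[0][0]
    return None
```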
Fig. 5 is a schematic diagram of main modules of a voice interaction apparatus according to an embodiment of the present invention, and as shown in fig. 5, the voice interaction apparatus 500 according to the embodiment of the present invention mainly includes a first voice recognition processing module 501, a second voice recognition processing module 502, a judgment module 503, and an information to be output determination module 504.
The first voice recognition processing module 501 is configured to perform first voice recognition processing on the received voice data to obtain first text data of the voice data, and to perform semantic understanding on the first text data to determine a target intention of the voice data. The first voice recognition processing module is also used for calling a semantic understanding model and analyzing the first text data through the semantic understanding model to determine the target intention of the voice data; the semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
The second voice recognition processing module 502 is configured to acquire the geographic position information required for the second voice recognition processing, and to perform second voice recognition processing on the voice data according to the geographic position information to obtain second text data of the voice data. The second voice recognition processing module is also used for acquiring the geographic position information required by the second voice recognition processing according to a preset rule, where the preset rule comprises at least one of the following: acquiring current geographic position information and using it as the geographic position information required by the second voice recognition processing; receiving geographic position information input by the user and using it as the geographic position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographic position information as the geographic position information required by the second voice recognition processing.
The judging module 503 is configured to determine whether the target intention is related to a position factor. The judging module is also used for acquiring a preset intention set; the intention set includes at least one intention related to the position factor and/or includes at least a navigation intention; judging whether the target intention belongs to the intention set; if so, the target intention is related to the position factor; otherwise, the target intention is not related to the position factor.
The information to be output determining module 504 is configured to determine the information to be output according to the second text data if the target intention is related to the position factor; otherwise, to determine the information to be output according to the first text data.
The voice interaction device of the embodiment of the invention also comprises a voice signal processing module for performing voice signal processing on the received voice data, where the voice signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing. The voice interaction device of the embodiment of the invention also comprises an output module for converting the information to be output into voice data to be output and outputting the voice data to be output.
According to the embodiment of the invention, for different requirements, the corresponding output result can be determined during voice interaction according to the relevance of the target intention to the position factor. Because the output result incorporates the geographic position information, the information the user expects can be pushed accurately and the user experience is improved. The relevance of an intention to the position factor can be configured in advance (or analyzed in real time from historical data); for example, intentions related to the position factor, such as navigation, nearby search, or certain interaction parameter settings, can be marked in advance as related to geographic factors, and the second voice recognition processing takes the geographic position factor into consideration, for example by training the model used for the second voice recognition processing on labeled sample data that includes geographic factors. Therefore, for intentions related to geographic factors, a more accurate recognition result can be obtained by the second voice recognition processing.
Fig. 6 shows an exemplary system architecture 600 of a voice interaction method or voice interaction apparatus to which embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal equipment.
It should be noted that the method for voice interaction provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for voice interaction is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprises a first voice recognition processing module, a second voice recognition processing module, a judgment module and a module for determining information to be output. The names of these modules do not constitute a limitation to the module itself in some cases, and for example, the determination module may also be described as a "module that determines whether the target intention is related to a location factor".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: perform first voice recognition processing on the received voice data to obtain first text data of the voice data; perform semantic understanding on the first text data to determine a target intention of the voice data; acquire geographic position information required by second voice recognition processing; perform second voice recognition processing on the voice data according to the geographic position information to obtain second text data of the voice data; determine whether the target intention is related to a position factor; if the target intention is related to the position factor, determine the information to be output according to the second text data; otherwise, determine the information to be output according to the first text data.
According to the embodiment of the invention, aiming at different requirements, when the voice interaction is realized, the corresponding output result can be determined according to the correlation of the target intention and the position factor. The second voice recognition processing is performed by taking the geographic position factor into consideration, so that aiming at the intention related to the geographic factor, a more accurate recognition result can be obtained through the second voice recognition processing, and the user experience is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A method of voice interaction, comprising:
carrying out first voice recognition processing on received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data, and determining a target intention of the voice data;
acquiring geographic position information required by second voice recognition processing; according to the geographic position information, performing second voice recognition processing on the voice data to obtain second text data of the voice data;
determining whether the target intent is associated with a location factor;
if the target intention is related to the position factor, determining the information to be output according to the second text data; otherwise, determining the information to be output according to the first text data.
2. The method of claim 1, wherein prior to performing the first speech recognition process on the received speech data, further comprising:
performing voice signal processing on the received voice data; wherein the voice signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing.
3. The method of claim 1, wherein the semantic understanding of the first text data, and wherein the step of determining the target intent of the speech data comprises:
calling a semantic understanding model, and analyzing and processing the first text data through the semantic understanding model to determine a target intention of the voice data;
the semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
4. The method of claim 1, wherein determining whether the target intent is associated with a location factor comprises:
acquiring a preset intention set, wherein at least one intention related to the position factor is included in the intention set and/or at least a navigation intention is included in the intention set;
determining whether the target intent belongs to the set of intents; if so, the target intent is related to a location factor; otherwise, the target intent is not correlated to a location factor.
5. The method of claim 1, wherein the step of obtaining geographic location information required for the second speech recognition process comprises:
acquiring geographical position information required by second voice recognition processing according to a preset rule;
wherein the preset rule comprises at least one of the following: acquiring current geographical position information and using it as the geographical position information required by the second voice recognition processing; receiving geographical position information input by a user and using it as the geographical position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographical position information as the geographical position information required by the second voice recognition processing.
6. The method according to any one of claims 1 to 5, wherein after determining the information to be output, further comprising:
converting the information to be output into voice data to be output;
and outputting the voice data to be output.
7. An apparatus for voice interaction, comprising:
the first voice recognition processing module is used for performing first voice recognition processing on received voice data to obtain first text data of the voice data; performing semantic understanding on the first text data, and determining a target intention of the voice data;
the second voice recognition processing module is used for acquiring geographic position information required by second voice recognition processing; according to the geographic position information, performing second voice recognition processing on the voice data to obtain second text data of the voice data;
the judging module is used for judging whether the target intention is related to a position factor;
the information to be output determining module is used for determining the information to be output according to the second text data if the target intention is related to the position factor; otherwise, determining the information to be output according to the first text data.
8. The apparatus of claim 7, further comprising a voice signal processing module for performing voice signal processing on the received voice data; wherein the speech signal processing comprises at least one of: echo cancellation processing, noise reduction processing and reverberation removal processing.
9. The apparatus according to claim 7, wherein the first speech recognition processing module is further configured to invoke a semantic understanding model, and analyze the first text data through the semantic understanding model to determine the target intention of the speech data; the semantic understanding model is an intention classifier obtained by training based on a classification algorithm, and the classification algorithm comprises at least one of the following: a naive Bayes algorithm, a decision tree algorithm, a support vector machine (SVM) algorithm, and a logistic regression (LR) method.
10. The apparatus of claim 7, wherein the judging module is further configured to acquire a preset intention set, wherein at least one intention related to the position factor is included in the intention set and/or at least a navigation intention is included in the intention set; determine whether the target intention belongs to the intention set; if so, the target intention is related to the position factor; otherwise, the target intention is not related to the position factor.
11. The apparatus of claim 7, wherein the second speech recognition processing module is further configured to obtain the geographic location information required for the second speech recognition processing according to a preset rule; wherein the preset rule comprises at least one of the following: acquiring current geographical position information and using it as the geographical position information required by the second voice recognition processing; receiving geographical position information input by a user and using it as the geographical position information required by the second voice recognition processing; or performing statistics on the user's historical data and using the resulting geographical position information as the geographical position information required by the second voice recognition processing.
12. The apparatus according to any one of claims 7 to 11, further comprising an output module for converting the information to be output into voice data to be output; and outputting the voice data to be output.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201911257073.8A 2019-12-10 2019-12-10 Voice interaction method and device Active CN110956955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911257073.8A CN110956955B (en) 2019-12-10 2019-12-10 Voice interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911257073.8A CN110956955B (en) 2019-12-10 2019-12-10 Voice interaction method and device

Publications (2)

Publication Number Publication Date
CN110956955A CN110956955A (en) 2020-04-03
CN110956955B (en) 2022-08-05

Family

ID=69980622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911257073.8A Active CN110956955B (en) 2019-12-10 2019-12-10 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN110956955B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4044179A4 (en) * 2020-09-27 2023-01-11 Comac Beijing Aircraft Technology Research Institute On-board information assisting system and method
CN114694645A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Method and device for determining user intention
CN113539270B (en) * 2021-07-22 2024-04-02 阳光保险集团股份有限公司 Position identification method and device, electronic equipment and storage medium
CN114048333B (en) * 2021-11-05 2024-06-04 深圳职业技术学院 Multisource fusion voice interactive indoor positioning method, terminal and storage medium
CN116386603A (en) * 2023-06-01 2023-07-04 蔚来汽车科技(安徽)有限公司 Speech recognition method, device, driving device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102439661A (en) * 2009-03-24 2012-05-02 Atx集团股份有限公司 Service oriented speech recognition for in-vehicle automated interaction
CN103069480A (en) * 2010-06-14 2013-04-24 谷歌公司 Speech and noise models for speech recognition
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN109243461A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium
CN110299136A (en) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 A kind of processing method and its system for speech recognition
CN110349575A (en) * 2019-05-22 2019-10-18 深圳壹账通智能科技有限公司 Method, apparatus, electronic equipment and the storage medium of speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3528242B1 (en) * 2018-02-16 2020-06-17 ABB Schweiz AG Computer system and method for controlling user-machine dialogues

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102439661A (en) * 2009-03-24 2012-05-02 Atx集团股份有限公司 Service oriented speech recognition for in-vehicle automated interaction
CN103069480A (en) * 2010-06-14 2013-04-24 谷歌公司 Speech and noise models for speech recognition
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN110299136A (en) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 A kind of processing method and its system for speech recognition
CN109243461A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium
CN110349575A (en) * 2019-05-22 2019-10-18 深圳壹账通智能科技有限公司 Method, apparatus, electronic equipment and the storage medium of speech recognition

Also Published As

Publication number Publication date
CN110956955A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956955B (en) Voice interaction method and device
CN109961792B (en) Method and apparatus for recognizing speech
CN107833574B (en) Method and apparatus for providing voice service
CN109325091B (en) Method, device, equipment and medium for updating attribute information of interest points
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN110415679A (en) Voice error correction method, device, equipment and storage medium
US11783808B2 (en) Audio content recognition method and apparatus, and device and computer-readable medium
KR20140112360A (en) Vocabulary integration system and method of vocabulary integration in speech recognition
CN109920431B (en) Method and apparatus for outputting information
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN113823282B (en) Voice processing method, system and device
CN112100339A (en) User intention recognition method and device for intelligent voice robot and electronic equipment
CN112017642B (en) Speech recognition method, apparatus, device and computer readable storage medium
JP7182584B2 (en) A method for outputting information of parsing anomalies in speech comprehension
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN115905490B (en) Man-machine interaction dialogue method, device and equipment
CN114299955B (en) Voice interaction method and device, electronic equipment and storage medium
CN114171016B (en) Voice interaction method and device, electronic equipment and storage medium
CN113360590B (en) Method and device for updating interest point information, electronic equipment and storage medium
CN115482823A (en) Audio processing method and device based on automatic speech recognition
CN112289303B (en) Method and device for synthesizing voice data
CN111883126A (en) Data processing mode selection method and device and electronic equipment
CN113066479A (en) Method and device for evaluating model
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant