WO2018188907A1

WO2018188907A1 - Processing speech input

Info

Publication number: WO2018188907A1
Application number: PCT/EP2018/056945
Authority: WO
Inventors: Felix Schwarz; Christian Süss
Original assignee: Bayerische Motoren Werke Aktiengesellschaft
Priority date: 2017-04-12
Filing date: 2018-03-20
Publication date: 2018-10-18
Also published as: DE102017206281A1

Abstract

The invention relates to a method for the improved processing of speech input of a user of a mobile device, particularly of a motor vehicle, comprising the steps of: capturing a speech input; processing the speech input by a first speech processing system; evaluating (230) the result of the processing of the speech input by the first speech processing system; and, depending on the result of the evaluation, creating a data record containing at least data representing the speech input; and transmitting the data record to at least one other speech processing system. (Fig. 1)

Description

Processing a voice input

The present invention relates to a method for processing a voice input and a mobile device, in particular a motor vehicle, for carrying out such a method.

Voice control is one way of making it easier for the driver of a modern motor vehicle to operate the various functions such a vehicle may have. In the processing of voice inputs, two basic possibilities can be distinguished. On the one hand, the voice inputs can be processed in the vehicle, for example by a central control unit of the vehicle. On the other hand, a data connection to a server external to the vehicle can be used, which takes over the processing of the voice input. Both options can also be used in combination.

DE 10 2012 213 668 A1 describes a method for operating a voice-controlled information system for a vehicle. In this case, depending on a voice input of a vehicle user, at least one keyword is determined from a set of predefined keywords. Individual units of the information system can also be arranged outside the vehicle. When determining answers, the current individual equipment of the vehicle can be taken into account.

DE 10 2012 022 630 A1 teaches a method for communication between a driver and a driver assistance system. In this case, keyword identification is provided, which can also access external sources. For example, Internet servers whose databases are kept up to date can be queried.

The capabilities of machine speech processing systems for processing voice input are steadily growing. Nevertheless, situations still occur in which machine speech processing systems reach their technical limits. It may then be desirable to connect the user with a human conversation partner. The applicant offers such a service under the name "Concierge Service".

Starting from the prior art, the object is to improve the processing of a voice input of a user of a mobile device, in particular a motor vehicle.

The object is achieved by a method and a mobile device having the features of the independent claims. Advantageous developments of the invention are the subject of the dependent claims.

The invention is suitable for use with a variety of mobile devices, in particular those that are equipped with their own computing power and can also establish a cellular connection (data connection and/or voice connection). The invention can be used particularly advantageously in motor vehicles, especially passenger cars, motorcycles or trucks. However, the mobile device may also be a portable mobile device, in particular a so-called smartphone. Insofar as the following description of the invention and its embodiments refers to motor vehicles, this is to be understood not as limiting but as explanatory and exemplary.

The method according to the invention for processing a voice input of a user of a mobile device comprises the following method steps. In a first step, a voice input is captured. For this purpose, a microphone of the mobile device can be used in a manner known per se, and the acoustic signal thus captured can be processed further, in particular digitized. A voice input may be any of a variety of utterances of the user. Voice inputs can include, for example, voice commands ("Navigate home", "Increase the volume of the radio", "Call Martin") or questions ("What is the weather at the destination?").

In a further step, the voice input is processed by a first speech processing system. The processing of a voice input may be accomplished in a variety of ways known in the art. As a rule, the processing takes place step by step, whereby first the audio signal representing the voice input is processed (digitized, filtered). Subsequently, a syntactic analysis can be carried out, the result of which may be a text-based reproduction of the spoken words, the meaning of which, however, has not yet been ascertained. In a further step, a semantic analysis of the (now text-based) voice input can take place. For the purposes of the present invention, the term processing of a voice input is to be understood broadly. In particular, the processing of the voice input according to the invention may be a partial processing, for example

- only a signal processing of the acoustic signal, or

- a signal processing of the acoustic signal and a syntactic analysis, or

- a signal processing of the acoustic signal and a syntactic and semantic analysis.

In the next step, the result of the processing of the speech input by the first speech processing system is evaluated. The evaluation may include a separate evaluation for each of the aforementioned processing steps. Likewise, several or all processing steps can be evaluated together. With particular advantage, the evaluation includes an evaluation of the final result of the processing of the speech input. If, for example, signal processing and subsequent syntactic analysis are carried out during the processing of the speech input, so that the result of the processing consists of text-based data, the evaluation of the processing may relate to this text-based data. The term evaluation should be understood to mean that it includes a statement about the quality of the processing of the speech input. The evaluation may relate to the quality of the speech input itself; for example, it may include an indication of a signal-to-noise ratio (SNR) of the captured acoustic signal. The evaluation may also relate to the processing; for example, it may include a statement that the speech input could not be processed syntactically. If a reason for this is found during processing (e.g. the user speaks a language not known to the first speech processing system), the evaluation may include an indication of this reason. Advantageously, the evaluation comprises a measure of quality or of statistical uncertainty, which in particular has a predetermined value range (e.g. from 0 for minimum quality / maximum statistical uncertainty to 1 for maximum quality / minimum statistical uncertainty).
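
The evaluation result described above can be pictured as a small data structure. The following is a minimal sketch in Python, not the applicant's implementation: the class name `Evaluation`, its field names and the rule of combining per-step scores by taking their minimum are assumptions made purely for illustration. Only the 0-to-1 value range, the SNR indication and the failure reason are taken from the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Evaluation:
    """Hypothetical container for the result of the evaluation step."""

    quality: float                         # 0.0 = minimum quality, 1.0 = maximum quality
    snr_db: Optional[float] = None         # signal-to-noise ratio of the captured signal, if known
    failure_reason: Optional[str] = None   # e.g. "unknown_language" (illustrative label)

    def __post_init__(self) -> None:
        # The description suggests a predetermined value range; 0..1 is its example.
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie in the range [0, 1]")


def combine_step_scores(step_scores: Dict[str, float]) -> float:
    """Combine separate per-step scores into one joint measure.

    Taking the minimum is an assumption of this sketch: the weakest
    processing step limits the quality of the final result.
    """
    if not step_scores:
        return 0.0  # no processing result at all
    return min(step_scores.values())
```

A joint score could then be built from separate scores, for example `Evaluation(quality=combine_step_scores({"signal": 0.9, "syntactic": 0.4}), snr_db=3.5)`.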

Depending on the result of the evaluation, a data record is created according to the invention which comprises at least data representing the speech input. For example, the data record may comprise the speech input as a digitized (and preferably compressed) acoustic signal. Alternatively or additionally, the data record may, for example, comprise the result of the syntactic analysis as text-based data.

Furthermore, depending on the result of the evaluation, the data record is transmitted in the last step to at least one further speech processing system.

The invention exploits the fact that a mobile device can have access to several speech processing systems. It is evident that the use of a plurality of speech processing systems improves the processing of the speech input. However, it would be costly always to use all available speech processing systems in parallel for processing the voice input. In contrast, the invention is based on the idea of first using the first speech processing system for processing the voice input and then, if necessary, the at least one further speech processing system. In this way, improved speech processing can be achieved at relatively low cost.

In a preferred embodiment, the first speech processing system is a machine speech processing system located in the mobile device. Such a speech processing system may also be referred to as a local speech processing system. In other words, the local speech processing system is used first; it is available immediately and, in particular, independently of the existence of a cellular connection.

Further advantageous embodiments provide that the further speech processing system is a speech processing system arranged outside the mobile device, wherein transmitting the data record comprises transmitting the data record over a cellular connection. The further speech processing system can, for example, be reachable via the Internet. The data record is then transmitted via a mobile radio unit of the mobile device by means of mobile communication (e.g. WLAN and/or GPRS, UMTS, LTE or the like) to an Internet server, which provides the further speech processing system or forwards the data record to it. The speech processing system located outside the mobile device may also be referred to as an external speech processing system. The advantage of such an external speech processing system is that its computing power is usually greater than that of the local speech processing system. In addition, the external speech processing system can access information for processing the speech input that is not available to the local speech processing system.

The external speech processing system therefore typically offers better speech processing than the local speech processing system. The invention can therefore be configured particularly advantageously by combining the two aforementioned embodiments. First, the speech input is processed by the fast and always available local speech processing system. If the evaluation shows that the voice input could not be processed satisfactorily, the data record is transmitted to the external speech processing system.

The further speech processing system can comprise a machine speech processing system. Alternatively or additionally, the further speech processing system comprises a human participant. This can be, for example, an employee of a call center. In the latter case, the cellular connection can include a voice connection by means of which the user of the mobile device is connected to the call center employee. It can be provided that the mobile device decides, depending on the result of the evaluation, whether a purely machine speech processing system or a speech processing system with a human participant is to be used as the further speech processing system. If, for example, it is determined that the voice input was correctly interpreted by machine but cannot be answered with the locally available information, it makes sense to transmit the data record to a purely machine external speech processing system. If, on the other hand, it is determined that the voice input cannot be understood by a machine speech processing system with sufficient probability, a speech processing system with a human participant can be selected.
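
The selection between a purely machine system and a system with a human participant can be sketched as a simple decision function. This is an illustrative sketch only: the names `Target` and `choose_target` and the two boolean inputs are hypothetical simplifications of the evaluation result, not terms from the application.

```python
from enum import Enum
from typing import Optional


class Target(Enum):
    """Hypothetical labels for the kinds of further speech processing system."""

    EXTERNAL_MACHINE = "purely machine external system"
    HUMAN_PARTICIPANT = "system with a human participant"


def choose_target(understood: bool, answerable_locally: bool) -> Optional[Target]:
    """Pick a further system from two simplified evaluation outcomes.

    understood:         the input was interpreted by machine with sufficient probability
    answerable_locally: the locally available information suffices to answer it
    """
    if not understood:
        # A machine is unlikely to understand the input, so a human is selected.
        return Target.HUMAN_PARTICIPANT
    if not answerable_locally:
        # Understood, but information is missing: an external machine system suffices.
        return Target.EXTERNAL_MACHINE
    return None  # the local result is sufficient; no further system is needed
```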

However, the decision as to whether a purely machine speech processing system or a speech processing system with a human participant is to be used as the further speech processing system does not necessarily have to be made by the mobile device. It may also be provided that the mobile device transmits the data record to the further speech processing system and that the decision whether or not to involve a human participant is made by the further speech processing system.

In a further refinement, the processing of the speech input by the first speech processing system comprises a syntactic and/or semantic analysis of the speech input. A syntactic analysis should be understood to mean a processing of the speech input present as an (possibly already digitized) acoustic signal, the result of which is a correctly structured sequence of individual words. The syntactic analysis can also include detecting the language of the voice input. For example, a correct result of the syntactic analysis could be the text-based data record "navigate home", without its meaning being known. An incorrect result of the syntactic analysis could be, for example: "drive with wind over windows".

A semantic analysis is to be understood as a processing of the speech input (or of the result of the preceding syntactic analysis) whose result reflects the meaning of the speech input. For example, a correct semantic analysis of the voice input "navigate home" could yield a machine-readable navigation command that includes the location parameter "home" as its destination.

It is particularly advantageous in the case of syntactic and/or semantic analysis if the step of evaluating the result of the processing of the speech input by the first speech processing system comprises determining a measure of the quality of the syntactic and/or semantic analysis of the speech input. Preferably, the range of values of the measure is limited on both sides and predetermined. For example, the measure could lie between 0 (minimum quality, the processing has no result at all or the result is entirely unusable) and 1 (maximum quality, the result of the processing is almost certainly correct). The quality measure may be configured as a confidence value that reflects a probability that the result of the processing is correct. It can be provided, for example, that whenever the confidence value of the syntactic analysis falls below a predetermined value (for example 0.5, preferably 0.8, particularly preferably 0.95), the first speech processing system involves the further speech processing system.
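
The threshold rule can be written in a few lines. The function name `needs_further_system` is hypothetical; the default threshold of 0.8 is one of the example values given in the description, and the input check reflects the value range limited on both sides.

```python
def needs_further_system(confidence: float, threshold: float = 0.8) -> bool:
    """Return True if the further speech processing system should be involved.

    confidence: quality measure of the local analysis, expected in [0, 1]
    threshold:  predetermined value; the description names 0.5, 0.8 and 0.95 as examples
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in the range [0, 1]")
    # The rule fires when the confidence value falls below the threshold.
    return confidence < threshold
```

A stricter threshold such as 0.95 hands more inputs to the further system, trading transmission cost against recognition quality.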

Advantageously, the data record comprises an audio file representing the speech input and/or a text file representing the speech input. If, as in the last-mentioned example, the syntactic analysis is unsuccessful, the data record may preferably comprise an audio file representing the speech input. If, on the other hand, speech processing fails only in the semantic analysis (in other words, if the speech input already exists in text form but cannot be interpreted), the data record may preferably comprise a text file representing the speech input. It can also be provided that the data record comprises both an audio file and a text file.
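
The choice of payload depending on where processing failed can be sketched as follows. `DataRecord` and `build_record` are hypothetical names, and the two boolean flags stand in for whatever the evaluation actually reports; the sketch covers only the two preferred cases, not the variant in which both files are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataRecord:
    """Hypothetical data record; at least one field represents the speech input."""

    audio: Optional[bytes] = None  # digitized (possibly compressed) speech input
    text: Optional[str] = None     # result of the syntactic analysis


def build_record(
    audio: bytes,
    transcript: Optional[str],
    syntactic_ok: bool,
    semantic_ok: bool,
) -> Optional[DataRecord]:
    """Create a record only if local processing was not satisfactory."""
    if not syntactic_ok:
        # No usable text exists, so the audio itself has to be passed on.
        return DataRecord(audio=audio)
    if not semantic_ok:
        # The words are known but not their meaning; the text is sufficient.
        return DataRecord(text=transcript)
    return None  # local processing succeeded; nothing needs to be transmitted
```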

More preferably, the data record comprises at least parts of the result of the processing of the speech input by the first speech processing system and/or parts of the result of the evaluation of the result of the processing of the speech input by the first speech processing system.

In other words, it can be provided that even in the case of insufficiently good speech processing by the first speech processing system, the result of this processing is at least partially transmitted to the further speech processing system. Likewise, it may be provided for this case that the evaluation of this result is at least partially transmitted. The data transmitted in this way can be used by the further speech processing system in a variety of ways. Thus, for example, its own speech processing can be improved and/or the result of its own speech processing can be checked. Furthermore, it is conceivable that the further speech processing system performs only the missing parts of the speech processing, so that the result is a "division of labor" between the first and the further speech processing system.

In the exemplary case that the first speech processing system cannot distinguish between several possible destination inputs, the set of possible destination inputs can be transmitted as part of the data record.

In a particularly advantageous embodiment of the invention, a user input confirming the transmission of the data record to the at least one further speech processing system is requested. The data record is transmitted to the at least one further speech processing system depending on the user input. In other words, the data record is transmitted to the further speech processing system only after explicit confirmation by the user. The user input can be requested, for example, acoustically and/or visually, in particular on a display of the mobile device. The user input can be made, for example, by actuating an operating element and/or by voice input.
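
The confirmation gate can be sketched as a function that transmits only after the user agrees. The names `transmit_if_confirmed`, `ask_user` and `send` are hypothetical; how the prompt is presented (acoustically or on a display) and how the data is sent are deliberately left to the two callbacks.

```python
from typing import Callable


def transmit_if_confirmed(
    record: bytes,
    ask_user: Callable[[str], bool],
    send: Callable[[bytes], None],
) -> bool:
    """Transmit the record only after explicit confirmation by the user.

    ask_user: presents a prompt and returns True if the user confirms
    send:     transmits the record to the further speech processing system
    """
    prompt = "Send your voice input to the service center for further processing?"
    if ask_user(prompt):
        send(record)
        return True
    # Without confirmation, the speech input is not transmitted.
    return False
```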

The invention further comprises a mobile device, in particular a motor vehicle, which is set up to carry out the method described above.

Further embodiments of the invention are explained below with reference to exemplary illustrations. In the drawings:

Fig. 1 shows an embodiment of the invention in an exemplary arrangement, and

Fig. 2 shows a flow chart of an embodiment of the method according to the invention.

Fig. 1 shows a schematic representation of a motor vehicle 110, which has a control unit designated as head unit 111. The head unit 111 comprises the first speech processing system 111. It is therefore a local speech processing system 111. Other components of the first speech processing system 111, in particular one or more interior microphones, which may be arranged inside or outside the head unit 111, are not shown in Fig. 1. The head unit 111 is connected via a data bus 113 to a mobile radio unit 112 of the motor vehicle 110. The mobile radio unit 112 is set up to establish a cellular connection 130 via a mobile network (e.g. WLAN, GSM/GPRS/EDGE, UMTS/HSPA, LTE or the like). The cellular connection 130 may include a voice connection and/or a data connection.

Via the cellular connection 130, the motor vehicle 110 can exchange data 140 with a server 121 that can be reached via the Internet 120. The server 121 hosts the further speech processing system 121. It is thus an external speech processing system 121. A call center (not shown in Fig. 1) may also be provided, an employee of which, as a human participant of the further speech processing system 121, can be connected to the user of the motor vehicle 110 by means of a voice connection 130.

Fig. 2 shows an exemplary method sequence according to an embodiment of the invention. In step 210, a voice input is captured by the first speech processing system 111, for which purpose an interior microphone of the motor vehicle 110 can preferably be used. The signal thus captured can first be digitized, i.e. sampled, quantized and, if necessary, filtered.

In step 220, the speech input (now present as a digital signal) is processed. For this purpose, among other things, a syntactic analysis can be carried out, in which the digitized audio signal is converted into a text-based data item. Furthermore, a semantic analysis can be performed, in which the meaning of the speech input is converted, for example, into the form of a machine-readable control command.

In step 230, the result of the processing 220 of the speech input is evaluated. For example, a statistical confidence value representing the statistical certainty of the result of the processing 220 may be determined. If, for example, the sound quality of the speech input is very poor, for instance due to high ambient noise, a low speaking volume or indistinct speech by the user, the syntactic analysis 220 may still produce a result, but its confidence value is low. In other words, there is considerable doubt as to the correctness of the result of the syntactic analysis 220. A semantic analysis could then fail or produce an erroneous result.

Depending on the result of the previous evaluation, a data record 140 is created in step 240. The data record 140 contains the speech input in digitized form, i.e. as a digital audio file. The data record 140 may contain further components, for example the previously determined confidence value.

In step 250, a user input confirming the transmission of the data record 140 to the further speech processing system 121 is requested. For example, the user receives a message saying "Your voice input could not be processed. Press the confirm button to send your voice input to our service center for further processing."

If the user input is made, for example by the user pressing the confirmation key, the data record 140 is transmitted to the further speech processing system 121 in step 260.

The further speech processing system 121 could initially process the speech input by purely machine means. This processing can be more successful than that by the first speech processing system 111 because the further speech processing system 121 can draw on a larger database and/or greater computing power for speech recognition. However, it is also conceivable that the further speech processing system 121 cannot handle the voice input either, for technical reasons (speech recognition fails) or for content-related reasons (the content of the speech input cannot be answered or processed with the available information). For this case, it may be provided that the further speech processing system 121 establishes a voice connection between the user of the motor vehicle 110 and a human participant of the further speech processing system 121. This can be done automatically or after prior confirmation by the user.
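
The sequence of steps 210 to 260 can be summarized in one function. This is a minimal sketch under stated assumptions, not the claimed implementation: the local recognizer, the confirmation prompt and the transmission are passed in as hypothetical callbacks, the threshold of 0.8 is one of the example values from the description, and the handover to a human participant on the server side is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class DataRecord:
    """Hypothetical data record 140: digitized speech input plus confidence value."""

    audio: bytes
    confidence: float


def handle_voice_input(
    audio: bytes,
    local_recognize: Callable[[bytes], Tuple[str, float]],
    confirm: Callable[[], bool],
    transmit: Callable[[DataRecord], str],
    threshold: float = 0.8,
) -> Optional[str]:
    """Run steps 220 to 260; step 210 (capture) is assumed to have produced `audio`."""
    text, confidence = local_recognize(audio)   # step 220: local processing
    if confidence >= threshold:                 # step 230: evaluation
        return text                             # local result is good enough
    record = DataRecord(audio, confidence)      # step 240: create the data record
    if not confirm():                           # step 250: request user confirmation
        return None                             # user declined; nothing is sent
    return transmit(record)                     # step 260: transmit to further system
```

Passing the three collaborators in as callables keeps the sketch independent of any particular recognizer or network stack and makes each branch easy to exercise in isolation.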

List of reference numerals

110 Mobile device, in particular motor vehicle

111 First speech processing system

112 Mobile radio unit

113 Data bus

120 Internet

121 Further speech processing system

130 Cellular connection

140 Data record

210-260 Method steps

Claims

1. Method for processing a voice input of a user of a mobile device (110), in particular a motor vehicle (110), comprising the steps of

- capturing (210) a voice input,

- processing (220) the voice input by a first speech processing system (111),

- evaluating (230) the result of the processing (220) of the voice input by the first speech processing system (111) and, depending on the result of the evaluation (230),

- creating (240) a data record (140) which comprises at least data representing the voice input, and

- transmitting (260) the data record (140) to at least one further speech processing system (121).

2. The method of claim 1, wherein the first speech processing system is a machine speech processing system (111) located in the mobile device.

3. The method according to any one of the preceding claims, wherein the further speech processing system is a speech processing system (121) arranged outside the mobile device, wherein the transmitting (260) of the data record (140) comprises transmitting (260) the data record (140) via a cellular connection (130).

4. The method of claim 3, wherein the further speech processing system comprises a machine speech processing system (121) and/or a human participant.

5. The method according to claim 1, wherein the processing (220) of the speech input by the first speech processing system (111) comprises a syntactic and/or semantic analysis (220) of the speech input.

6. The method of claim 5, wherein the step of evaluating (230) the result of the processing (220) of the speech input by the first speech processing system (111) comprises determining (230) a measure of the quality of the syntactic and/or semantic analysis of the speech input.

7. The method of any one of the preceding claims, wherein the data record (140) comprises an audio file representing the voice input and/or a text file representing the voice input.

8. The method according to any one of the preceding claims, wherein the data record (140) comprises at least

- parts of the result of the processing (220) of the speech input by the first speech processing system (111) and/or

- parts of the result of the evaluation (230) of the result of the processing (220) of the speech input by the first speech processing system (111).

9. The method according to any one of the preceding claims, comprising the steps of

- requesting (250) a user input confirming the transmission of the data record (140) to the at least one further speech processing system (121), and

- transmitting (260) the data record (140) to the at least one further speech processing system (121) as a function of the user input.

10. Mobile device, in particular motor vehicle (110), for carrying out the method according to one of the preceding claims.