WO2022022948A1

WO2022022948A1 - Voice assistance system and method for voice-based support

Info

Publication number: WO2022022948A1
Application number: PCT/EP2021/068566
Authority: WO
Inventors: Ingo Siegert; Norman Weißkirchen; Andreas Wendemuth
Original assignee: Otto-Von-Guericke-Universität Magdeburg
Priority date: 2020-07-29
Filing date: 2021-07-06
Publication date: 2022-02-03
Also published as: DE102020119980B3

Abstract

The invention relates to a voice assistance system for the voice-based support of a user. The voice assistance system is designed, in accordance with the type in question, to recognize voice acoustics and to determine whether or not the recognized voice acoustics were directed to the voice assistance system. If the voice acoustics were directed to the voice assistance system, a support function is performed by the voice assistance system in accordance with the linguistic content of the voice acoustics. The invention also relates to a method and a computer program therefor.

Description

Language assistance system and method for language-based support

The invention relates to a language assistance system for language-based support of a user, the language assistance system being set up generically to recognize speech acoustics and to determine whether the recognized speech acoustics were directed to the language assistance system or not, with a support function being carried out by the language assistance system depending on the language content of the speech acoustics if the speech acoustics were directed to the speech assistance system. The invention also relates to a method and a computer program for this purpose.

There are many different forms of a man-machine interface. One of the most intuitive forms of establishing a human-machine interface is voice-based communication between human and machine using a voice assistance system. According to the generic type, the auditory perceptible sound signals are permanently recorded by an auditory sensor device (recording device) and a corresponding speech acoustic is identified. The linguistic content of the speech acoustics is then extracted by means of speech recognition algorithms and a support function is then selected based on the linguistic content and executed by the assistance system or a downstream (artificial) device.

Such language assistance systems thus offer natural persons the opportunity to make inputs to the assistance system using speech, with the language assistance system understanding the linguistic content of the spoken speech acoustics as an input and reacting accordingly. Such reactions can, for example, in turn be voice output if the previously made input contained a question to the assistance system in the form of voice acoustics. However, such reactions can also be the switching of actuators or the execution of calculations.

There are currently two major problem areas. On the one hand, such speech assistance systems must be able to reliably recognize the linguistic content contained in the speech acoustics. However, natural language is prone to errors and can have a great deal of room for interpretation. Usually only with knowledge of a specific context can it be determined with a high degree of probability what is actually meant. Automatically capturing this linguistic context is a major challenge in terms of algorithms and is often also prone to errors, since non-verbal channels, such as prosodic properties of language, facial expressions or gestures, also play a major role in determining the linguistic context. Due to the ever-increasing performance of modern computer systems and developments in the field of AI, a significant increase in the recognition rate of spoken content of natural speech acoustics has been achieved in recent years, which means that the speech-based human-machine interface has become increasingly important and has found its way into many areas of everyday human life.

The second major problem area is that a voice assistance system has to determine for itself whether spoken speech acoustics are directed at the voice assistance system or not. In the case of natural language communication between people, such a determination is often only made with knowledge of other communication channels, such as eye contact, facial expressions or prosodic properties, which is currently denied to a voice assistance system. Speech assistance systems must therefore determine solely based on the received and recognized speech acoustics whether the speech acoustics are directed to the speech assistance system and consequently a reaction is expected or not.

It is known that language assistance systems recognize certain keywords or command words and use this recognition to conclude that the speech acoustics or the following speech acoustics are directed at them. In the area of home automation in particular, there are a number of such voice assistance systems that react to very special command words and then analyze the voice acoustics with regard to the natural language contained therein and then execute appropriate support functions. Voice control can be implemented with which, for example, the lighting or other things in the house can be controlled.

The disadvantage here, however, is that the recognition of command words is error-prone. This is because the voice assistance system cannot distinguish whether the spoken command word is used in the context of the entire speech acoustics to operate the voice assistance system or whether, in that context, the spoken command word is part of a conversation between two people. It can therefore lead to the voice assistance system incorrectly assuming that the recorded and recognized voice acoustics are directed to the system, although the recognized voice communication is part of an interpersonal voice communication and the voice assistance system itself is not meant at all. This is problematic because the voice assistance system basically receives and processes every auditory sound signal in order to be able to recognize the command word directed to the voice assistance system at the right time.

On the other hand, other circumstances, such as poor voice quality or recording quality, can mean that the voice assistance system does not recognize the recorded voice acoustics as directed and therefore does not carry out a function, despite a spoken command word.

It is therefore the object of the present invention to specify an improved language assistance system and method for language-based support, with which the recognition rate with regard to the response behavior of language assistance systems can be increased.

The object is achieved with the language assistance system according to claim 1 and the method according to claim 8 according to the invention. Advantageous configurations of the invention can be found in the corresponding subclaims.

According to claim 1, a language assistance system for language-based support of a user is proposed that initially has a recording device, in order to be able to record auditory perceptible sound signals. Such a recording device can, for example, have a sound sensor (microphone) that receives an audible sound signal and converts it into a digital signal.

The speech assistance system also has a generic recognition device which, for example by means of a microprocessor-controlled processing unit, recognizes speech acoustics in the recorded auditory sound signals and extracts speech-based information from them. The recognition device is therefore designed in such a way that it carries out a speech analysis and thus, for example using models, extracts from the speech acoustics the speech-based information contained therein. In addition to the linguistic content (i.e. the spoken words of the speech communication contained in the speech acoustics), such language-based information can also contain further language-based information, such as prosodic properties of the speech acoustics. Within the meaning of the present invention, the prosodic properties are understood in particular as those properties of the language that do not relate to the linguistic content, i.e. the spoken words. This includes in particular accents, tonal language, intonation, quantity, tempo, rhythm, pauses in speaking and the like. Speech-based information, which was extracted from the auditory perceptible sound signals of speech acoustics, can therefore contain, in particular, speech content and/or prosodic properties of the speech acoustics or of the spoken content.

The voice assistance system generically also has an activation device which, for example, also uses a microprocessor-controlled computing unit to identify, based on a language model and depending on the extracted language-based information, whether the recognized speech acoustics are directed to the voice assistance system or not. The activation device can thus recognize an intention to activate as a function of the extracted language-based information. In this case, an intention to activate means that the user consciously addresses the language assistance system in order to establish voice communication. Based on this recognition of whether the speech acoustics or speech communication was directed to the speech assistance system or not, the speech assistance system for speech-based support is activated, not activated or deactivated again. If the speech acoustics were recognized as being directed to the speech assistance system, the speech assistance system is consequently activated or put into an activated state by means of the activation device. If, on the other hand, the speech acoustics were recognized as not being addressed to the speech assistance system, the speech assistance system is consequently not activated or, if necessary, deactivated (if it was previously activated) or put into a non-activated state by means of the activation device. The recognition device and the activation device can form a structural unit.

Finally, the language assistance system generically has an assistance device that is set up to select a support function depending on extracted language-based information and to execute it for language-based support of the user, provided that the language assistance system has previously been activated by the activation device.

This means that, for example, a command word was recognized in a voice acoustic by the activation device, which means that the activation device puts the voice assistance system into an activated state and uses the language-based information still contained in the voice acoustic to select the support function. This is usually the case when the speech acoustics contain the instruction to the voice assistance system in addition to the command word. In addition to the command word for recognizing an intention to activate, it is also possible, alternatively or additionally, for an intention to activate to be recognized based on prosodic properties that are contained in the extracted language-based information. It is thus determined on the basis of prosodic properties whether the recorded speech acoustics are directed to the speech assistance system or not.

Alternatively or additionally, the language assistance system can also be set up to recognize the corresponding command word in a first speech acoustics, whereupon the language assistance system is put into the activated state by the activation device, with the corresponding instructions to the language assistance system then being contained in a second speech acoustics received and recognized thereafter, which are then used for the selection and implementation of the support function. In this case, the command word and the instructions to the voice assistance system are contained in different voice acoustics. After the first speech acoustics received by the speech assistance system, the speech assistance system can, if appropriate, issue a confirmation in spoken form to the user to inform the user that the speech assistance system is now waiting for the instructions addressed to the speech assistance system. The voice assistance system is therefore in an activated state for a certain period of time after the first voice acoustic signal and accordingly expects a second voice acoustic signal.
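The two variants described above (command word and instruction in the same speech acoustics, or in two successive ones) can be illustrated with a minimal sketch. It assumes that the linguistic content is already available as text and that the command word stands at the start of the utterance; the function name `parse_utterance` and the default command word are hypothetical and not taken from the application.

```python
from typing import Tuple


def parse_utterance(text: str, wake_word: str = "hello") -> Tuple[bool, str]:
    """Return (activated, instruction) for one recognized utterance.

    Assumption: the command word is the first word of the utterance.
    An empty instruction means the system is activated and waits for a
    second utterance that carries the instruction.
    """
    words = text.strip().split()
    # Compare the first word without trailing punctuation, case-insensitively.
    if words and words[0].lower().strip(",.!?") == wake_word:
        return True, " ".join(words[1:])
    return False, ""
```

An empty instruction string after a recognized command word corresponds to the case in which the system remains activated for a certain period and expects the instruction in a second speech acoustics.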

In addition, the voice assistance system can alternatively or additionally also be directed to use the activation device to recognize, based on prosodic properties of the recognized voice acoustics, whether the recognized voice acoustics is directed to the voice assistance system or not.

Such a generic language assistance system is now further developed according to the invention such that a first speech acoustics and/or the speech-based information extracted from the first speech acoustics is stored in a digital data memory and that an adaptation device is provided which is set up to adapt the speech model as a function of the first speech acoustics previously stored in the data memory and/or the speech-based information extracted from the first speech acoustics if, in the case of the first speech acoustics, there was uncertainty and/or error as to whether the first speech acoustics was directed to the speech assistance system or not, and this uncertainty and/or error is eliminated by a corresponding second speech acoustics detected after the first speech acoustics or by a detected absence of such a second speech acoustics. The first speech acoustics or the language-based information extracted from the first speech acoustics is thus used to adapt the speech model if a corresponding uncertainty and/or error was recognized as to whether the first speech acoustics was directed to the speech assistance system or not. This uncertainty and/or error is recognized by a second speech acoustics following the first speech acoustics or by the absence of such a second speech acoustics.

Uncertainty as to whether a first speech acoustics was directed at the speech assistance system or not usually occurs when the activation device cannot identify with sufficient certainty whether the speech acoustics was directed at the speech assistance system or not. This sufficient level of certainty is represented, for example, via a threshold value (for example a percentage threshold value or a threshold value between 0 and 1). The system can divide the recognition of the intention to activate into at least three areas. The first area applies whenever the system can determine with certainty that an activation intent is present. The second area applies whenever the system is unsure whether the user intended to activate or not (there is uncertainty). The third area applies whenever the system can determine with certainty that an activation intent is not present. Uncertainty reflects the system's point of view.

A first threshold value can thus indicate that the speech acoustics are directed to the speech assistance system with sufficient certainty. In contrast, a second threshold value can indicate that the speech acoustics are not directed to the speech assistance system with sufficient certainty. If neither the first nor the second threshold value is exceeded, the speech assistance system cannot assume with sufficient certainty that the speech acoustics are directed at the speech assistance system or are not directed at the speech assistance system. In most cases, no activation takes place if there is uncertainty.
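The division into three areas can be sketched as follows. This is only an illustration under the assumption that the activation device produces a single confidence score between 0 and 1; the names `Decision` and `classify` and the default threshold values are hypothetical and not specified in the application.

```python
from enum import Enum


class Decision(Enum):
    ADDRESSED = "addressed"          # activation intent present with sufficient certainty
    UNCERTAIN = "uncertain"          # neither threshold reached
    NOT_ADDRESSED = "not_addressed"  # activation intent absent with sufficient certainty


def classify(score: float, upper: float = 0.8, lower: float = 0.3) -> Decision:
    """Map a confidence score in [0, 1] to one of the three areas.

    Assumption: 'upper' plays the role of the first threshold value and
    'lower' the role of the second threshold value.
    """
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if score >= upper:
        return Decision.ADDRESSED
    if score <= lower:
        return Decision.NOT_ADDRESSED
    return Decision.UNCERTAIN
```

In this sketch only `Decision.ADDRESSED` would lead to activation, which matches the statement that uncertainty usually results in no activation.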

Such a threshold value or the plurality of threshold values can be part of the speech model, on the basis of which a decision is made as to whether the recognized speech acoustics are directed to the speech assistance system or not. The language model can thus be adapted, for example, by changing the threshold values of the language model accordingly, in order to generate an improved decision basis in this way.
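One conceivable way of changing the threshold values is sketched below. The update rule is an assumption chosen for illustration only, since the application does not prescribe one; the names `Thresholds`, `adapt` and `rate` are hypothetical. The score of the stored first speech acoustics and the resolved ground truth (addressed or not) are taken as inputs.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    upper: float = 0.8  # at or above: treated as addressed to the system
    lower: float = 0.3  # at or below: treated as not addressed


def adapt(th: Thresholds, score: float, was_addressed: bool,
          rate: float = 0.5) -> Thresholds:
    """Shift a threshold using the stored score of the first utterance.

    Illustrative rule (an assumption, not the claimed method):
    - addressed but not activated: move 'upper' down toward the score
    - not addressed but activated: move 'upper' up toward 1.0
    - not addressed and uncertain: move 'lower' up toward the score
    """
    upper, lower = th.upper, th.lower
    if was_addressed:
        if score < upper:
            upper -= rate * (upper - score)
    elif score >= upper:
        upper += rate * (1.0 - upper)
    elif score > lower:
        lower += rate * (score - lower)
    # Keep the thresholds ordered so the uncertain area never inverts.
    return Thresholds(upper=upper, lower=min(lower, upper))
```

With a learned model, the same stored first speech acoustics could instead serve as an additional labeled training example; the threshold shift shown here is simply the most compact variant to write down.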

A fault exists when the activation device has activated the voice assistance system although the voice acoustics were not directed at the voice assistance system or when the activation device has not activated or deactivated the voice assistance system although the voice acoustics were definitely directed at the voice assistance system. This can be recognized, for example, by the content of the second voice acoustic or by the fact that a second voice acoustic following the first voice acoustic is absent and this absence of a second voice acoustic is thus recognized. Incorrectness is the user's point of view.

Whether there was uncertainty or an error can usually be recognized by at least one second speech acoustics and, if necessary, eliminated. In this case, the speech-based information extracted from the first speech acoustics and previously stored in the digital data memory is used to adapt the speech model for detecting whether the speech acoustics is directed to the speech assistance system or not, so that such uncertainties and/or errors in the response behavior of the assistance system are avoided, or the uncertainty rate and/or error rate is reduced.

It has been shown that such a subsequent adjustment of the speech model, in the event of an existing uncertainty or error as to whether the speech assistance system has been addressed by speech acoustics or not, based on the speech acoustics that were recognized incorrectly or with uncertainty, leads to improved activation behavior of the language assistance system, so that the acceptance of such language assistance systems can be significantly increased.

According to one embodiment, it is provided that the language assistance system is set up to use the adaptation device to adapt the language model as a function of the previously stored first speech acoustics and/or the speech-based information extracted from the first speech acoustics if the second speech acoustics is recognized within a certain period of time after the first speech acoustics or the absence of the second speech acoustics is recognized within the certain period of time.

Accordingly, the second speech acoustics should be received or recognized within a certain period of time after the receipt or recognition of the first speech acoustics in order to establish a contextual connection between the two speech acoustics based on the temporal reference. The first speech acoustics or the language-based information extracted from the first speech acoustics, such as prosodic properties or linguistic content, are then used to adapt the language model. In this case, it can be assumed that the first speech acoustics and the second speech acoustics are contextually connected and thus belong together.

It can also be provided that no further second speech acoustic is recognized within the certain period of time (the absence of a following second speech acoustic is recognized), so that it can be assumed that the recognition of the first speech acoustic was faulty. This is usually the case when the first speech acoustics were erroneously recognized as being directed to the speech assistance system, which the speaker did not intend to do. The speaker now remains silent for a certain period of time and does not produce any further speech acoustics, so that the speech assistance system now assumes that the original first speech acoustics were not directed at the speech assistance system. In this case, the first speech acoustics or the speech-based information extracted from the first speech acoustics is used to adapt the speech model.

According to one embodiment, it is provided that the language assistance system is set up to define or vary the certain period of time depending on at least one acoustic quality criterion of the first speech acoustics -Distance, incorrect activations that have already taken place, number of speakers, etc. It is conceivable that the storage duration of the first speech acoustics or of the speech-based information extracted from the first speech acoustics is linked to this certain period of time, which can be a few seconds, for example (preferably less than 10 seconds, particularly preferably less than 5 seconds), so that this information is deleted from the digital data memory after the certain period of time has elapsed.
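The coupling of the storage duration to the certain period of time can be sketched as a small buffer that discards its content once the window has elapsed. The class and method names (`UtteranceBuffer`, `store`, `get`) are hypothetical, the 5-second default merely picks one value from the preferred range, and timestamps are passed in explicitly so that the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredUtterance:
    features: dict    # extracted speech-based information (content, prosody, ...)
    stored_at: float  # time of storage in seconds


class UtteranceBuffer:
    """Holds the first utterance only for a limited time window."""

    def __init__(self, window_s: float = 5.0) -> None:
        self.window_s = window_s
        self._item: Optional[StoredUtterance] = None

    def store(self, features: dict, now: float) -> None:
        self._item = StoredUtterance(features=features, stored_at=now)

    def get(self, now: float) -> Optional[StoredUtterance]:
        # Delete the stored utterance once the window has elapsed.
        if self._item is not None and now - self._item.stored_at > self.window_s:
            self._item = None
        return self._item
```

A variable window, for example one that depends on the number of earlier incorrect activations, could be modeled by setting `window_s` per stored utterance instead of per buffer.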

According to one embodiment it is provided that the recognition device is set up to extract prosodic properties of the first speech acoustics as part of the speech-based information from the first speech acoustics, the activation device is set up to recognize, based on the language model and depending on the prosodic properties of the first speech acoustics contained in the extracted speech-based information, whether the recognized speech acoustics is directed to the voice assistance system or not, and the adaptation device is set up to adapt the language model with regard to the prosodic recognition as a function of the first speech acoustics previously stored in the data memory, the language-based information extracted from the first speech acoustics and/or the prosodic properties of the first speech acoustics contained in the extracted language-based information.

In this embodiment, the language model is adapted with regard to the prosodic recognition of whether the speech acoustics are directed to the voice assistance system or not, so that the recognition rate can be improved in the future and the uncertainty rate or error rate can be reduced.

According to one embodiment it is provided that the recognition device is set up to extract linguistic content of the first speech acoustics as part of the language-based information from the first speech acoustics, the activation device is set up to recognize, based on the language model and depending on the linguistic content contained in the extracted language-based information, whether the recognized speech acoustics is directed to the language assistance system or not, and the adaptation device is set up to adapt the language model with regard to the recognition of linguistic content as a function of the first speech acoustics previously stored in the data memory, the language-based information extracted from the first speech acoustics and/or the linguistic content of the first speech acoustics contained in the extracted language-based information.

In this embodiment, the language model is adapted with regard to the language-based information, so that, for example, activation intentions (e.g. command words, prosodic properties of the speech acoustics) can be recognized better in the future.

According to one embodiment it is provided that the activation device is set up, if there is uncertainty as to whether the recognized speech acoustics should be recognized as directed to the voice assistance system or not (i.e. whether an activation intention is present or not), to generate an optical, haptic, olfactory and/or acoustic system query and to output it using an output device of the language assistance system. The system query can be an acoustic system query, for example a voice output in the form of voice acoustics, where the voice output contains a query to the user regarding the recognized voice acoustics.

In this embodiment, the language assistance system will first generate a system query when it detects uncertainty, which contains a visual, haptic, olfactory and/or acoustic query in relation to the detected first speech acoustics. The assistance system therefore asks the speaking user what is meant or whether the speech acoustics were actually directed at the assistance system or not.

According to one embodiment, it is provided that the language model is a learned language model based on a machine learning system.

The object is also achieved with the method for speech-based support of a user according to claim 8, in that audible sound signals are recorded by means of a recording device, speech acoustics are recognized in the recorded, audibly perceptible sound signals by means of a recognition device and, if speech acoustics were recognized, speech-based information is extracted from the recognized speech acoustics, an activation device based on a speech model is used, depending on the extracted speech-based information, to determine whether the recognized speech acoustics is directed to the voice assistance system or not, whereby the voice assistance system is activated for language-based support when the voice acoustics were recognized as directed to the voice assistance system, and/or is not activated if the voice acoustics were not recognized as directed to the voice assistance system, and a support function is selected by means of an assistance device depending on extracted language-based information and executed for language-based support of the user when the language assistance system has been activated beforehand, wherein a first speech acoustics and/or the speech-based information extracted from the first speech acoustics is stored in a digital data memory, and the speech model is adapted by means of an adaptation device depending on the first speech acoustics previously stored in the data memory and/or the speech-based information extracted from the first speech acoustics if, in the case of the first speech acoustics, there was uncertainty and/or error as to whether the first speech acoustics was directed at the voice assistance system or not, and this uncertainty and/or error is eliminated by a second speech acoustics detected following the first speech acoustics or by a recognized absence of a corresponding second speech acoustics.

Advantageous refinements of the method can be found in the dependent claims.

The invention is explained in more detail by way of example on the basis of the attached figure, in which:

FIG. 1 shows a schematic representation of the language assistance system according to the invention.

FIG. 1 shows, in a schematically simplified representation, a voice assistance system 10 that is designed by means of a recording device 11 to record acoustically perceptible sound signals. Such a recording device 11 can have a microphone 12, for example, with which the sound signals are recorded and then converted into a digital signal by means of a converter unit.

A recognition device 13 now receives these sound signals, which are present in digital form, as an input and is set up to recognize speech acoustics A, B contained therein based on the auditory perceptible sound signals. The recognition device 13 is thus designed, in a first step, to identify whether the recorded sound signal contains speech acoustics A, B or not.

If the recognition device 13 recognizes that the acoustically perceptible sound signals contain speech acoustics A, B, then the recognition device 13 is also designed to extract speech-based information from the recognized speech acoustics A, B in a manner known per se. Such speech-based information can include prosodic properties of speech acoustics A, B. In addition, the language-based information also contains language content of speech acoustics A, B, i.e. those words and phrases previously spoken by a user 100.

In the exemplary embodiment in FIG. 1, a user 100 has generated a first speech acoustics A and a second speech acoustics B by uttering certain words or sentences. The first speech acoustics A was pronounced at a first time t, while the second speech acoustics B was pronounced at a subsequent second time t+1. Between the first point in time and the second point in time there is a certain period of time, which suggests a contextual connection between the two speech acoustics A and B.

The recognition device 13 has access to a data memory on which a language model 14 is stored. Based on this language model 14, the recognition device 13 is able to recognize whether speech acoustics A, B were recognized in the audible sound signals and is also able to extract the language-based information from the recognized speech acoustics A, B.

The speech-based information is then transmitted to an activation device 15, which also has access to the speech model 14 and is set up to recognize based on this whether the recognized speech acoustics A, B is directed to the speech assistance system 10 or not.

The first speech acoustics A can be such that the speech acoustics A contains a command word and/or a prosodic property, which indicates that the first speech acoustics A or the second speech acoustics B is directed to the speech assistance system 10 (first speech acoustics A and second speech acoustics B can be spoken in a temporal context or staggered in time to wait for a reaction of the system between the first speech acoustics A and second speech acoustics B). The recognition device 13 recognizes that the speech acoustics A contains a command word and/or a prosodic property and, if applicable, which word or which sentence is contained or what is to be expressed by the prosodic property, which is stored in the language-based information. Thus, the recognition device 13 can also determine prosodic properties from the speech acoustics, which then become part of the speech-based information generated by the recognition device 13.

In the activation device 15, based on the language model 14, it is now determined whether the language-based information contains a command word and/or a prosodic property, which indicates that the first speech acoustics A or the second speech acoustics B that follows it is directed to the voice assistance system 10. Based on the language model 14, the activation device 15 can also determine whether a voice communication with the voice assistance system 10 is to be established using the prosodic properties of the speech acoustics A, which are stored in the language-based information. It is conceivable that an intention to activate is determined only on the basis of prosodic properties of the first speech acoustics A.

If the activation device 15 recognizes that the speech acoustics A are such that speech communication with the speech assistance system 10 is to be set up, thus the speech acoustics A and/or B are directed to the speech assistance system 10, then the activation device 15 activates the speech assistance system 10, whereby further processing by an assistance device 16 follows.

The assistance device 16 selects a support function as a function of speech-based information and executes it if the speech assistance system 10 was previously activated. If the voice assistance system 10 was not previously activated by the activation device 15 or if there was uncertainty as to whether the voice communication was directed to the voice assistance system 10, then nothing happens. If there is uncertainty about the user's intention to activate, the system can generate a query in advance to ask the user whether activation is desired or not, i.e. whether at least the first speech acoustics contained an intention to activate or not.

The previously received first speech acoustics A is temporarily stored in a buffer memory 17 for at least a period of time that is suitable for receiving a further speech acoustics B that belongs to the first speech acoustics A contextually.

If, after receiving a second speech acoustics B, which also ran through the recognition device 13 and the activation device 15, it was determined that there was uncertainty and/or error as to whether the first speech acoustics A was addressed to the speech assistance system or not, the speech model 14 is adapted and optimized with the aid of an adaptation device 18 based on the temporarily stored first speech acoustics A in order to avoid such uncertainties and/or errors in the future. By adapting the language model, not only is the activation device 15 improved, but also the way the recognition device 13 works. The language model can be adapted in such a way that parameters of a learned model are adapted, or that threshold values for deciding whether the speech acoustics is directed to the voice assistance system 10 or not are adjusted.

When adapting the language model 14, different cases can be distinguished.

1st case: Incorrect activation

In the first case, which is considered here, the first speech acoustics A is not directed at the speech assistance system and is part of an interpersonal speech communication, for example. However, the activation device 15 incorrectly recognizes that the speech acoustics A is directed to the speech assistance system, as a result of which the speech assistance system 10 is activated.

In the second speech acoustics B, which follows the first speech acoustics A within a certain period of time, the user verbally signals that the preceding first speech acoustics A was not addressed to the speech assistance system 10. This can be done, for example, using keywords such as “stop” or “cancel” contained in the second speech acoustics B.

Based on the second speech acoustics B, the activation device 15 now recognizes that the previous activation of the speech assistance system 10 based on the first speech acoustics A was faulty and deactivates the speech assistance system 10 accordingly. In addition, with the aid of the adaptation device 18 and the previously stored speech acoustics A, or the speech-based information extracted from the speech acoustics A, the language model 14, on the basis of which both the recognition of the language-based information and the activation of the language assistance system 10 take place, is adapted.

That the activation based on the first speech acoustics A was faulty can also be recognized from the fact that no second speech acoustics B was received within the aforementioned period of time; from this the speech assistance system likewise recognizes that the first speech acoustics A was not directed to the speech assistance system (waiting for a "timeout").
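The two ways of recognizing an incorrect activation, a cancelling keyword in the second speech acoustics B or a timeout without any B, can be sketched as follows. The keywords are the examples named in the description; the timeout value, the function name and the use of a text transcript of B are assumptions for illustration.

```python
from typing import Optional

CANCEL_KEYWORDS = {"stop", "cancel"}  # example keywords from the description
TIMEOUT_S = 5.0                       # assumed length of the waiting period


def was_false_activation(follow_up_text: Optional[str], elapsed_s: float,
                         timeout_s: float = TIMEOUT_S) -> bool:
    """Decide whether the activation triggered by A was faulty.

    follow_up_text: transcript of B received within the period, or None
                    if no second speech acoustics has been received.
    elapsed_s:      time in seconds since the activation based on A.
    """
    if follow_up_text is None:
        # No B at all: the activation counts as faulty once the period expires.
        return elapsed_s >= timeout_s
    words = (w.strip(".,!?") for w in follow_up_text.lower().split())
    return any(w in CANCEL_KEYWORDS for w in words)
```

In either case the stored first speech acoustics A would then be passed to the adaptation device 18 with the corrected label "not directed to the system".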

2nd case: Erroneous non-activation

In the second case, the speech acoustics A contains linguistic or prosodic information intended to signal that the speech acoustics A or a subsequent speech acoustics B is directed to the speech assistance system 10. This can be done, for example, by the speech acoustics A containing a command word such as "Hello" or the like, which is to be identified by the recognition device 13 and interpreted by the activation device 15 to mean that the speech acoustics A, or a speech acoustics B following within a certain period of time, is addressed to the speech assistance system 10.

In the case of erroneous non-activation, however, the activation device 15 erroneously fails to activate the speech assistance system, although the speech acoustics A is intended for the speech assistance system 10. In other words, the voice assistance system 10 is not activated even though the person 100 addresses the voice assistance system 10 and wanted or wants to set up voice communication with it.

This erroneous non-activation based on the first speech acoustics A is then recognized from a subsequent speech acoustics B. This can be done in that the second voice acoustics B, recognized within a certain period of time after the first voice acoustics A, is now determined to call for activation of the voice assistance system 10. Owing to the close temporal relationship between the first speech acoustics A and the second speech acoustics B, a contextual connection is established between them, and it is assumed that the first speech acoustics A was already intended to activate the speech assistance system 10.

Now that the subsequent second speech acoustics B has established that the first speech acoustics A should already have led to an activation of the language assistance system 10, the language model 14 is adapted using the temporarily stored first speech acoustics A.
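The inference made in this second case can be illustrated with a short sketch. The record type, the function name and the window of 5 seconds are hypothetical; the description only speaks of "a certain period of time" between the two speech acoustics.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record of one recognized speech acoustics."""
    received_at: float  # time of reception in seconds
    activated: bool     # decision taken by the activation device


def missed_activation(a: Utterance, b: Utterance, window_s: float = 5.0) -> bool:
    """Return True if A should be relabelled as 'directed to the system'.

    This holds when A did not activate the system, B did, and B followed A
    closely enough for a contextual connection to be assumed.
    """
    follows_closely = 0.0 < b.received_at - a.received_at <= window_s
    return (not a.activated) and b.activated and follows_closely
```

If the function returns True, the buffered first speech acoustics A would serve as a positive example when the language model 14 is adapted.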

3rd case: Uncertain activation or non-activation

In the third case, the language assistance system 10, more precisely the activation device 15, cannot determine with sufficient certainty whether, based on the first speech acoustics A, activation or non-activation should take place. Such uncertainty always arises when the threshold values provided for a sufficiently reliable determination of activation or non-activation have not been exceeded and are, for example, within a safety range.
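A three-way decision with an uncertainty band between two thresholds is one possible reading of this passage. In the sketch below, the scalar score, the two threshold values and all names are assumptions for illustration; the description does not specify how the thresholds are represented.

```python
from enum import Enum


class Decision(Enum):
    ACTIVATE = "activate"
    IGNORE = "ignore"
    UNCERTAIN = "uncertain"


def decide(score: float, lower: float = 0.4, upper: float = 0.7) -> Decision:
    """Classify an activation score (assumed to lie between 0 and 1).

    Scores at or above `upper` activate the system, scores at or below
    `lower` are ignored, and anything in between is treated as uncertain.
    """
    if score >= upper:
        return Decision.ACTIVATE
    if score <= lower:
        return Decision.IGNORE
    return Decision.UNCERTAIN
```

An `UNCERTAIN` result is the situation in which the system could issue a query to the user or wait for a second speech acoustics B to resolve the uncertainty.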

If this uncertainty is eliminated by a subsequent second speech acoustics B or if the previous uncertain decision about activation or non-activation is confirmed by the second speech acoustics B, then the speech model 14 is also adapted accordingly by the adaptation device 18.

For example, it is conceivable that the first speech acoustics A contains information indicating that the speech acoustics A or the subsequent speech acoustics B is directed to the speech assistance system 10, but that this cannot be concluded with sufficient certainty. Despite the remaining uncertainty, it is concluded on the basis of this activation information that the language assistance system 10 is to be activated. If necessary, the voice assistance system 10 can also be set up to output a voice query asking the user 100 whether the voice acoustics A was actually directed to the voice assistance system 10. If a temporally related second speech acoustics B now establishes more clearly that the first speech acoustics A was already directed to the speech assistance system 10, which can be recognized, for example, by the speech acoustics B unambiguously containing an instruction to the speech assistance system 10, the language model 14 is adapted accordingly on the basis of the first speech acoustics A, or the language-based information extracted from it, in order to eliminate such uncertainties.

However, the opposite case is also conceivable, in which there is uncertainty as to whether the first speech acoustics A calls for non-activation, and this uncertainty is cleared up by a subsequent second speech acoustics B.

Reference list

10 language assistance system

11 recording device

12 microphone

13 recognition device

14 language model

15 activation device

16 assistance device

17 digital data memory

18 adaptation device

100 user

A first speech acoustics

B second speech acoustics

Claims

Patent Claims:

1. Language assistance system (10) for language-based support of a user (100) with

- a recording device (11) for recording auditory perceptible sound signals,

- a recognition device (13), which is set up to recognize speech acoustics in the recorded, audibly perceptible sound signals and, if speech acoustics has been recognized, to extract speech-based information from the recognized speech acoustics,

- an activation device (15) which is set up, based on a language model (14) and depending on the extracted language-based information, to recognize whether the recognized speech acoustics is directed to the language assistance system (10) or not, and to activate the language assistance system (10) for speech-based support if the speech acoustics was recognized as being directed to the speech assistance system (10), and/or not to activate it if the speech acoustics was not recognized as being directed to the speech assistance system (10), and

- an assistance device (16) which is set up to select a support function as a function of extracted language-based information and to carry it out for language-based support of the user (100) if the language assistance system (10) has previously been activated, characterized in that

- the voice assistance system (10) is set up to store a first voice acoustics (A) and/or the language-based information extracted from the first voice acoustics (A) in a digital data memory (17), and

- the language assistance system (10) has an adaptation device (18) which is set up to adapt the language model (14) as a function of the first speech acoustics (A) previously stored in the data memory and/or the language-based information extracted from the first speech acoustics (A) if, for the first speech acoustics (A), there was uncertainty and/or an error as to whether the first speech acoustics (A) was directed to the voice assistance system (10) or not, and this uncertainty and/or error is eliminated by a corresponding second voice acoustics (B) detected after the first voice acoustics (A) or by the detected absence of such a second voice acoustics (B).

2. Speech assistance system (10) according to one of the preceding claims, characterized in that the speech assistance system (10) is set up to adapt the language model (14) by means of the adaptation device (18) as a function of the previously stored first speech acoustics (A) and/or the speech-based information extracted from the first speech acoustics (A) if the second speech acoustics (B) is recognized within a certain period of time after the first speech acoustics (A) or the absence of the second speech acoustics (B) is recognized within the certain period of time.

3. Speech assistance system (10) according to claim 2, characterized in that the speech assistance system (10) is set up to define or vary the certain period of time depending on at least one acoustic quality criterion of the first speech acoustics (A).

4. Language assistance system (10) according to any one of the preceding claims, characterized in that

- the recognition device (13) is set up to extract prosodic properties of the first speech acoustics (A) as part of the speech-based information from the first speech acoustics (A),

- the activation device (15) is set up to recognize, based on the language model (14) and depending on the prosodic properties of the first speech acoustics (A) contained in the extracted language-based information, whether the recognized speech acoustics is directed to the speech assistance system (10) or not, and

- the adaptation device (18) is set up to adapt the language model (14) with regard to the prosodic recognition as a function of the first speech acoustics (A) previously stored in the data memory, the speech-based information extracted from the first speech acoustics (A) and/or the prosodic properties of the first speech acoustics (A) contained in the extracted speech-based information.

5. Language assistance system (10) according to any one of the preceding claims, characterized in that

- the recognition device (13) is set up to extract linguistic content of the first speech acoustics (A) as part of the language-based information from the first speech acoustics (A),

- the activation device (15) is set up to recognize, based on the language model (14), depending on the language content contained in the extracted language-based information, whether the recognized speech acoustics are directed to the language assistance system (10) or not, and

- the adaptation device (18) is set up to adapt the language model (14) with regard to the recognition of linguistic content as a function of the first speech acoustics (A) previously stored in the data memory, the speech-based information extracted from the first speech acoustics (A) and/or the linguistic content of the first speech acoustics (A) contained in the extracted language-based information.

6. Voice assistance system (10) according to one of the preceding claims, characterized in that the activation device (15) is set up, when there is uncertainty as to whether the recognized first speech acoustics (A) should be recognized as directed to the voice assistance system (10) or not, to generate an optical, haptic, olfactory and/or acoustic system query and to output it by means of an output device of the language assistance system (10).

7. Language assistance system (10) according to one of the preceding claims, characterized in that the language model (14) is a learned language model (14) based on a machine learning system.

8. Method for speech-based support of a user (100) by means of a speech assistance system (10), in which

- audibly perceptible sound signals are recorded by means of a recording device (11),

- speech acoustics is recognized in the recorded, audibly perceptible sound signals by means of a recognition device (13) and, if speech acoustics was recognized, speech-based information is extracted from the recognized speech acoustics,

- an activation device (15) recognizes, based on a language model (14) and as a function of the extracted speech-based information, whether the recognized speech acoustics is directed to the speech assistance system (10) or not, the language assistance system (10) being activated for language-based support if the speech acoustics was recognized as being directed to the language assistance system (10), and/or not being activated if the speech acoustics was not recognized as being directed to the speech assistance system (10), and

- by means of an assistance device (16), a support function is selected as a function of extracted language-based information and executed for language-based support of the user (100) if the language assistance system (10) has previously been activated, characterized in that

- a first speech acoustics (A) and/or the language-based information extracted from the first speech acoustics (A) is stored in a digital data memory (17), and

- the language model (14) is adapted by means of an adaptation device (18) as a function of the first speech acoustics (A) previously stored in the data memory and/or the language-based information extracted from the first speech acoustics (A) if, for the first speech acoustics (A), there was uncertainty and/or an error as to whether the first speech acoustics (A) was directed to the language assistance system (10) or not, and this uncertainty and/or error is eliminated by a corresponding second speech acoustics (B) detected after the first speech acoustics (A) or by recognizing the absence of a corresponding second speech acoustics (B).

9. The method according to claim 8, characterized in that the language model (14) is adapted by means of the adaptation device (18) as a function of the previously stored first speech acoustics (A) and/or the language-based information extracted from the first speech acoustics (A) if the second speech acoustics (B) is recognized within a certain period of time after the first speech acoustics (A) or the absence of the second speech acoustics (B) is recognized within the certain period of time.

10. The method as claimed in claim 9, characterized in that the certain period of time is defined or varied as a function of at least one acoustic quality criterion of the first speech acoustics (A).

11. The method according to any one of claims 8 to 10, characterized in that

- prosodic properties of the first speech acoustics (A) are extracted as part of the speech-based information from the first speech acoustics (A) by means of the recognition device (13),

- by means of the activation device (15), based on the language model (14) and as a function of the prosodic properties of the first speech acoustics (A) contained in the extracted speech-based information, it is recognized whether the recognized first speech acoustics (A) is directed to the voice assistance system (10) or not, and

- by means of the adaptation device (18), the language model (14) is adapted with regard to the prosodic recognition as a function of the first speech acoustics (A) previously stored in the data memory, the speech-based information extracted from the first speech acoustics (A) and/or the prosodic properties of the first speech acoustics (A) contained in the extracted language-based information.

12. The method according to any one of claims 8 to 11, characterized in that

- by means of the recognition device (13), linguistic content of the first speech acoustics (A) is extracted as part of the speech-based information from the first speech acoustics (A),

- by means of the activation device (15), based on the language model (14) and as a function of the linguistic content of the first speech acoustics (A) contained in the extracted speech-based information, it is recognized whether the recognized first speech acoustics (A) is directed to the speech assistance system (10) or not, and

- by means of the adaptation device (18), the language model (14) is adapted with regard to the recognition of linguistic content as a function of the first speech acoustics (A) previously stored in the data memory, the speech-based information extracted from the first speech acoustics (A) and/or the linguistic content of the first speech acoustics (A) contained in the extracted language-based information.

13. Speech assistance system (10) according to one of the preceding claims, characterized in that, when there is uncertainty as to whether the recognized speech acoustics should be recognized as directed to the speech assistance system (10) or not, a speech output in the form of speech acoustics is generated by means of the activation device (15) and output by means of an output device of the speech assistance system (10), the speech output containing a query to the user (100) relating to the recognized speech acoustics.

14. Computer program with program code means, set up to carry out the method according to one of claims 8 to 13, when the computer program is executed on a data processing system.