FI128000B - Speech recognition method and apparatus based on a wake-up word - Google Patents

Speech recognition method and apparatus based on a wake-up word

Info

Publication number
FI128000B
Authority
FI
Finland
Prior art keywords
word
wake
alarm
identified
audio
Prior art date
Application number
FI20156000A
Other languages
Finnish (fi)
Swedish (sv)
Other versions
FI20156000A (en)
Inventor
Tapio Koivuniemi
Tuomas Tuononen
Teijo Kinnunen
Jarkko Koivikko
Original Assignee
Code Q Oy
Priority date
Filing date
Publication date
Application filed by Code Q Oy
Priority to FI20156000A
Publication of FI20156000A
Application granted
Publication of FI128000B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 - Barge in, i.e. overridable guidance for interrupting prompts

Abstract

The present invention discloses a method, device and computer program for speech recognition systems in which a wake-up word is used and its recognition is improved algorithmically. The method first monitors the input audio flow, and when a candidate wake-up word is detected, it is compared with a positive acoustic model and a negative acoustic model of the wake-up word. If the decision is negative, it is checked whether the negatively identified voice sample is close to a previously detected negatively identified voice sample, both received within a given time period. This is performed by comparing the confidence levels of the two received voice samples, which are both detected as potential wake-up words. If the decision for the candidate wake-up word is positive, it is decided that the wake-up word was truly and intentionally said, and the device then enters a command listening mode.

Description

Speech recognition method and apparatus based on a wake-up word
Field of the invention
The present invention relates to speech recognition and to devices capable of receiving human audio commands for triggering different functionalities.
Background of the invention
Speech recognition and devices capable of interpreting human audio commands form a rapidly developing branch of technology which is continuously evolving and improving. The main problem in recognizing human speech and distinguishing specific words from the speech is that ambient noise is a significant disturbance. Differences between human speakers, such as accents, intonation and even the pitch of the speech, also affect the quality of the results of speech recognition methods.
In prior art, modern smartphones offer e.g. a Google search functionality where the smartphone user may provide the phrase to be searched through his/her own voice.
WO 2014/063104 (“Audience Inc.”) discloses a voice activation system using keywords in a vehicle environment. A noise reduction or suppression functionality is applied before the clean speech component is analyzed for keyword detection. The keyword detection is based on the mutual distances between the microphones and the human speaker, the seat position and primary voice frequencies. The microphones may be fixed in at least two locations inside the vehicle or in a wearable device of the human speaker.
US 2015/0106085 (“Lindahl”) discloses a speech recognition wake-up method for a handheld portable device. This method relies on multiple parallel analyses of the speech signal received with multiple differently positioned microphones. Figures 3-4 show three microphones. Lindahl applies a long phrase recognition processor and a short phrase recognition processor.
The company “Code-Q Oy”, the applicant of this application, published a press release on the internet on 25 February 2015, discussing a DialoQ application titled “A mobile assistant helps the mobile phone user”. In this press release, DialoQ is stated to be a speech-controlled application assisting in mobile telephone use, and it has been implemented in Finnish. The application can receive names and telephone numbers in audio form, and SMS messages can also be sent by voice assistance. The application assists the user between different process steps. Furthermore, the application may be provided with important incoming telephone number data, such as for a relative or a visiting nurse of an elderly patient residing at home. Such a telephone or video connection may be set to open automatically, even without any audio indication from the resident. The device has a local speech recognition module, thus requiring no Internet connection during use. The device need not be touched physically in order to use the application.
The currently used techniques have the problem that wrong words can be interpreted as wake-up words because of ambient noise, or an intentionally and even repeatedly said wake-up word does not trigger the desired functionality.
Summary of the invention
The present invention introduces a method for improving recognition quality of a wake-up word usable in an apparatus capable of speech recognition and controlling the apparatus or an external device through human voice commands, wherein the method comprises the step of:
- setting or retaining the apparatus in an active listening mode, where at least one wake-up word and its acoustic model has been predefined in a memory available to the apparatus.
The method is characterized in that the method further comprises the steps of:
- identifying at least one wake-up word from the audio environment during the active listening mode through comparing the received voice samples to a previously defined acoustic model, and detecting a positively identified wake-up word when identification similarity of the received voice sample exceeds a set threshold value; and when the positively identified wake-up word is followed by a recognizable audio command for the apparatus or for controlling the external device in a preset first time period after detecting the positively identified wake-up word,
- outputting the command to the apparatus or to the external device in order to trigger an action by the apparatus or by an external device.
In an embodiment of the invention, the method further comprises:
- determining whether the negatively identified voice sample is substantially close to the acoustic model of a wake-up word, and if it is,
- comparing the negatively identified voice sample with a previously detected negatively identified voice sample, both received during a preset second time period from the audio environment, and in case there is substantial mutual similarity between these two negatively identified voice samples,
- converting the negatively identified latter voice sample into a positively identified wake-up word.
In an embodiment of the invention, the method further comprises:
- deciding whether the detected voice sample is closer to a positive wake-up word model or a negative wake-up word model; and
- making the decision for the positively identified wake-up word in case the detected voice sample is closer to the positive wake-up word model.
In an embodiment of the method of the invention, the positive and negative wake-up word models are gathered cumulatively and/or user-specifically during the functioning of the method, resulting in a user-specific and environmentally adjustable identification of the wake-up words and commands.
In an embodiment of the invention, the negatively identified voice sample, which is determined as substantially close, is saved to a temporary word database.
In an embodiment of the invention, if the negatively identified voice sample is not substantially close to the correct wake-up word,
- the negatively identified voice sample is saved into the negative wake-up word model; and
- the method returns into the active listening mode of the audio environment.
In an embodiment of the invention, the positively identified wake-up word is saved into the positive wake-up word model when the recognizable audio command has been received within the preset first time period.
In an embodiment of the invention, the identifying, determining and comparing steps are based on a confidence level of the wake-up word; and where the confidence level is adjustable continuously based on each received voice sample.
In an embodiment of the invention, if a conversion of the negatively identified latter voice sample has been made into a positively identified wake-up word,
- adjusting the confidence level of the wake-up word and saving both said voice samples into the positive wake-up word model.
In an embodiment of the invention, in the memory of the apparatus there is saved information of at least two different wake-up words and their acoustic models; and the method further comprises the step of:
- gathering the corresponding positive wake-up word models and the negative wake-up word models in the memory separately for each wake-up word.
In an embodiment of the invention, in the memory of the apparatus there is saved information of at least two different wake-up words and their acoustic models; and the method further comprises the step of:
- setting and adjusting confidence levels separately for each wake-up word.
In an embodiment of the invention, the confidence levels can be stored and adjusted separately relating to each different human user of the device.
Corresponding with the above presented method, the inventive idea according to the present invention also comprises an apparatus for improving recognition quality of a wake-up word, the apparatus being capable of speech recognition and controlling the apparatus or an external device through human voice commands, wherein:
- the apparatus is configured to be set or retained in an active listening mode, where at least one wake-up word and its acoustic model has been predefined in a memory available to the apparatus.
The apparatus is characterized in that it further comprises:
- processing means configured to identify at least one wake-up word from the audio environment during the active listening mode through comparing the received voice samples to a previously defined acoustic model, and to detect a positively identified wake-up word when identification similarity of the received voice sample exceeds a set threshold value; and when the positively identified wake-up word is followed by a recognizable audio command for the apparatus or for controlling the external device in a preset first
time period after detecting the positively identified wake-up word, the processing means is further configured to
- output the command to the apparatus or to the external device in order to trigger an action by the apparatus or by an external device.
In an embodiment of the invention, the processing means is further configured to determine whether the negatively identified voice sample is substantially close to the acoustic model of a wake-up word, and if it is, the processing means is further configured to compare the negatively identified voice sample with a previously detected negatively identified voice sample, both received during a preset second time period from the audio environment, and in case there is substantial mutual similarity between these two negatively identified voice samples, the processing means is further configured to convert the negatively identified latter voice sample into a positively identified wake-up word.
In an embodiment of the invention, the apparatus further comprises a positive wake-up word model and a negative wake-up word model, wherein the processing means is further configured to:
- decide whether the detected voice sample is closer to a positive wake-up word model or a negative wake-up word model; and
- make the decision for the positively identified wake-up word in case the detected voice sample is closer to the positive wake-up word model.
In an embodiment of the invention, the apparatus is configured to gather the positive and negative wake-up word models cumulatively and/or user-specifically during the usage of the apparatus resulting in a user-specific and environmentally adjustable identification of the wake-up words and commands.
In an embodiment of the invention, the apparatus comprises a temporary word database into which the negatively identified voice sample, which is determined as substantially close, is saved.
In an embodiment of the invention, if the negatively identified voice sample is not substantially close to the correct wake-up word,
- the negatively identified voice sample is saved into the negative wake-up word model; and
- the apparatus is configured to return into the active listening mode of the audio environment.
In an embodiment of the invention, the processing means are configured to save the positively identified wake-up word into the positive wake-up word model when the recognizable audio command has been received within the preset first time period.
In an embodiment of the invention, the processing means are configured to identify, determine and compare based on a confidence level of the wake-up word; and where the confidence level is configured to be adjustable continuously based on each received voice sample.
In an embodiment of the invention, if a conversion of the negatively identified latter voice sample has been made into a positively identified wake-up word, the processing means is configured to adjust the confidence level of the wake-up word and to save both said voice samples into the positive wake-up word model.
In an embodiment of the invention, in the memory of the apparatus there is saved information of at least two different wake-up words and their acoustic models; and the processing means is further configured to gather the corresponding positive wake-up word models and the negative wake-up word models in the memory separately for each wake-up word.
In an embodiment of the invention, in the memory of the apparatus there is saved information of at least two different wake-up words and their acoustic models; and the processing means is further configured to set and adjust confidence levels separately for each wake-up word.
In an embodiment of the invention, the apparatus is configured to store and adjust confidence levels separately relating to each different human user of the device.
According to yet another aspect of the present invention, the inventive idea also comprises a computer program for improving recognition quality of a wake-up word usable in an apparatus capable of speech recognition and controlling the apparatus or an external device through human voice commands, where the computer program comprises code which is executable by a processing means, and the computer program comprises the step of:
- setting or retaining the apparatus in an active listening mode, where at least one wake-up word and its acoustic model has been predefined in a memory available to the apparatus.
The computer program is characterized in that it further comprises the steps of:
- identifying at least one wake-up word from the audio environment during the active listening mode through comparing the received voice samples to a previously defined acoustic model, and detecting a positively identified wake-up word when identification similarity of the received voice sample exceeds a set threshold value; and when the positively identified wake-up word is followed by a recognizable audio command for the apparatus or for controlling the external device in a preset first time period after detecting the positively identified wake-up word,
- outputting the command to the apparatus or to the external device in order to trigger an action by the apparatus or by an external device.
In an embodiment of the invention, the computer program further comprises the steps of:
- determining whether the negatively identified voice sample is substantially close to the acoustic model of a wake-up word, and if it is,
- comparing the negatively identified voice sample with a previously detected negatively identified voice sample, both received during a preset second time period from the audio environment, and in case there is substantial mutual similarity between these two negatively identified voice samples,
- converting the negatively identified latter voice sample into a positively identified wake-up word.
In an embodiment of the invention, the computer program is embodied in a computer-readable medium.
Brief description of the drawings
Figure 1 illustrates the first embodiment of the speech recognition method,
Figure 2 illustrates the second embodiment of the speech recognition method, with improved wake-up word recognition,
Figure 3 illustrates an exemplary device in use, here a smartphone screen, applying the speech recognition software during an active command-giving session, and
Figure 4 illustrates elements of the speech recognition device in an exemplary configuration.
Detailed description of the invention
The present invention introduces a method for good quality speech recognition, usable for various applications applying human voice commands.
The inventive idea also comprises a device, which is configured to perform the method steps for efficient speech recognition using wake-up words. The device may be a dedicated speech recognition device, or it may be a speech recognition application running e.g. in a smartphone.
Furthermore, the inventive idea also comprises a computer program product and a medium where the computer program may be stored. The computer program product, or software, comprises computer code which will execute the presented method steps when run in a processor of the device or of an external server.
The main element for the speech recognition system is a mobile smartphone in the first embodiment of the invention. There is a microphone in the smartphone which gathers the audio input from the surroundings, including ambient noise.
The method according to the invention can be implemented as a smartphone application. The application can be directly used and its parameter options controlled by the user, who also provides the audio commands. It is notable that the application does not need to be taught with a specific user voice before it can be used. The application is capable of receiving and recognizing commands from several different persons, e.g. in an alternating manner.
The presented method is not restricted to some specifically named language(s); instead it can in principle be realized in any given language, whether a natural or even an artificial one (e.g. Esperanto). The method is thus language-independent.
While many embodiments are discussed with no Internet connection required, i.e. using local recognition analysis within the smartphone, the analysis may well be performed in a server located externally and reached over an Internet connection. In one embodiment, it is possible that a part of the analysis is performed locally and the rest externally. Afterwards, these results are brought together as cumulative analysis results.
The starting point of the invention is to provide a command or order to the device, which can in principle be anything, e.g. “Call to +358401234567” or “Turn the radio on”. In order for the device to achieve an active “command listening mode”, the system uses a specific wake-up word (or several different wake-up words), which may in one embodiment be “Kuule!”, a short Finnish phrase for “Please hear me!”. In one embodiment of the invention, the device responds to the positively recognized wake-up word with a beeping sound or another kind of audio acknowledgement sound. In helping tool applications, e.g. for the elderly and for non-deaf disabled people, an acknowledgement sound greatly enhances the usability of the application.
Figure 1 illustrates the steps of the speech recognition method as a flow chart, in a first, broader embodiment of the invention.
The device, which may be a smartphone in which the speech recognition application is running, may be placed in the vicinity of the user, for example on a table or held in the user’s hand, or set otherwise so that the microphone of the device can capture outside audio signals without physical obstructions attenuating the incoming sounds too much.
The device is set in an active listening mode. The audio flow may include any words, either intentionally pronounced by the smartphone user or caught from farther away from other human speakers not relating to the device interaction in any way. Furthermore, there might be other sounds present in the environment, originating close to or farther away from the microphone. These sounds form the ambient environment, and their intensity may vary significantly.
There are two different acoustic models or storages for words which are created beforehand, and these word storages can be supplied with new data during the usage of the method. In that way, the system is capable of learning from the actions made in the past. The first acoustic model is a positive model of the wake-up word, and the second acoustic model is a negative model of the wake-up word. All the detected words which are followed by a relevant command within a predetermined time period are decided to be positively identified wake-up words. These results are stored into the positive acoustic model storage.
In contrast, for all nearby but still incorrect results, which were initially considered to be a correct wake-up word but which are not followed by a suitable command within the predetermined time period, it can be decided that with a large probability the word was just close to the wake-up word. Such falsely detected words, “the close but wrong results”, are stored in the negative model storage. In an example in the Finnish language, a nearby candidate could be the word “kuula” (= a ball) or “kuulo” (= hearing), if the wake-up word is “Kuule” (= Please hear me).
Referring to the functional blocks in Figure 1, with the initially set wake-up word or words available, the recognition block of the wake-up word 11 continuously tracks the ambient audio flow and tries to catch all words which resemble this word (or one among the several wake-up words). Of course, it is possible that the wake-up word is actually a wake-up sentence comprising a plurality of words in a particular order. For practical reasons, it is however beneficial to keep the sentence short, in order to keep the commanding procedure simple and reliable for the user. All detected words which are close enough to the wake-up word will pass to the next analysis module 12, which makes a decision based on this question: is the analyzed word closer to the words of the positive model or the words of the negative model?
If the result from block 12 is closer to the negative model, the detected word, which is “almost correct” but still an incorrect one, will be saved to the negative model storage 14. This storage can also be called a minus-database (in Figure 1) or a negative word model.
If the word is closer to the positive model in the comparison block 12, the device will wait for the actions of the user 13. Time restrictions are applied here, which means that the device will wait only for an appropriate time period. The threshold value for the waiting time may be set e.g. to five seconds, which means that the command must be given in less than five seconds from the wake-up word detection, in an embodiment of the invention.
If there is no speech or command audible within the predetermined time period 18, the algorithm determines that the detected word is not a correct wake-up word. It is highly likely that the device has picked up a random word resembling the wake-up word, but the word was not intended for the device as a command. After this decision is made in algorithm blocks 15 and 18, the word is thereafter saved to the minus-database 14. The device then returns to the active listening state 11. Block 15 can be called an active command identifying step, or a command listening step.
If a proper command is received and identified during the waiting period through algorithm blocks 15 and 16, a positive result is achieved and the detected wake-up word is then saved to the plus-database 17. A command executing signal is sent forward to the relevant device, but these aspects are not covered in detail in this disclosure.
In one embodiment, the starting moment of the command is relevant regarding the waiting period. The whole command need not be said fully within the predetermined time period. With the exemplary five-second threshold value, if the command starts e.g. at 4.5 s and finishes at 7 s, it will result in a positive decision; the command will be executed and the corresponding wake-up word is added to the plus-database 17.
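The first-embodiment flow of Figure 1 can be summarized as a simple loop. The following is a minimal sketch under stated assumptions: the helper functions (resembles_wake_word, closer_to_positive_model, listen_for_command, execute) and the five-second window are illustrative placeholders, not functions disclosed in the patent.

```python
# Sketch of the Figure 1 flow (blocks 11-18). All helper functions are
# hypothetical stubs standing in for the acoustic front-end and the command
# recognizer, which the patent does not specify at code level.

COMMAND_WINDOW_S = 5.0   # preset first time period (example value from the text)

plus_database = []       # positive wake-up word model storage (block 17)
minus_database = []      # negative wake-up word model storage (block 14)

def resembles_wake_word(sample) -> bool:
    """Block 11: does the voice sample resemble the wake-up word at all?"""
    raise NotImplementedError

def closer_to_positive_model(sample) -> bool:
    """Block 12: is the sample closer to the positive or the negative model?"""
    raise NotImplementedError

def listen_for_command(timeout_s: float):
    """Blocks 15/18: wait up to timeout_s for a recognizable command, else None."""
    raise NotImplementedError

def execute(command) -> None:
    """Block 16: forward the command to the apparatus or the external device."""
    raise NotImplementedError

def active_listening_loop(candidate_samples):
    for sample in candidate_samples:
        if not resembles_wake_word(sample):              # block 11
            continue
        if not closer_to_positive_model(sample):         # block 12, negative branch
            minus_database.append(sample)                # block 14
            continue
        command = listen_for_command(COMMAND_WINDOW_S)   # block 13
        if command is None:                              # block 18: window expired
            minus_database.append(sample)                # block 14
        else:                                            # block 16: command recognized
            plus_database.append(sample)                 # block 17
            execute(command)
```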
In a second embodiment of the invention, an improved wake-up word determination process is introduced. This process is illustrated in Figure 2 as a flow chart.
The blocks 11-18 are functional blocks similar to those already described in connection with Figure 1. The positive decision path from block 12 into block 13 and the data flow after that are the same as in the simpler embodiment of Figure 1. The negative decision route from block 12 into block 22 and the steps after that are additional steps in view of the first embodiment. Furthermore, the wake-up word storages are more clearly shown as word storages or word databases 21, 23 and 27. In practice, the databases comprise statistical models of the wake-up word, a positive one 27 and a negative one 23. Further, there is a temporary storage for a possible wake-up word 24. Database 21 comprises cumulatively both the positive and negative wake-up word models, available for use by the decision block 12.
Some correctly and purposefully said wake-up words may however end up in the negative word databases through erroneous decisions, because e.g. ambient noise or a low volume of the user’s voice can make the decision process very difficult. Also some medical conditions may unintentionally affect the pronunciation quality of the speech so that the correct interpretation and recognition may become very difficult. The second embodiment is a more detailed one, in order to tackle this problem in a better way.
In the second embodiment of the invention, the model databases are gathered in the following way. When the user has said the wake-up word and the correctly identified wake-up word is followed by a recognizable command within the predetermined time period, it can be decided that the wake-up word recognition is positive, and the said wake-up word is saved to the positive model database. On the contrary, if the positively identified wake-up word is not followed by any recognizable command within the predetermined time period, it is decided that the wake-up word recognition is negative, and the said wake-up word is saved to the negative model database. The latter situation means in practice that the correct wake-up word has been caught from regular conversation, or the ambient noise has somehow deteriorated the recognition quality, or the word has been picked from external conversation, for instance.
In practice, the negative model database thus also comprises some correctly applied and intended wake-up words. In this regard, the negative model needs a fine-tuning procedure where the positive results are transferred into the positive model where they belong.
In the present, second embodiment, block 22 of the method chart applies the next step if the first comparison 12 between the said word and the words in the database 21 results in a negative outcome. In step 22, the recognized word is compared with the wake-up word itself. If the recognized word is close to the wake-up word, the recognized word is saved into a temporary saving location for the possible wake-up word 24. If the recognized word is not close to the wake-up word, the word will be saved into the negative word database 23 and the method will return to the initial starting step 11. In this latter case, the system will not create any acknowledgement information, such as an acknowledgement beeping sound, to the user.
Of course, the acknowledgement signal may be something other than a sound or tune, e.g. a green light visible through a LED indicator of the device, or a green screen or light visible on the smartphone screen.
After the saving into the temporary saving database 24, the temporarily saved possible wake-up word is then compared with the earlier received recognized word within a given second time period, in step 25. This is performed in order to find out whether the user has said the wake-up word for the second time. The logic is based on the fact that if the user intentionally says the wake-up word and the device does not respond with an acknowledgement sound, the user knows that the order has not reached the device and he/she will say the wake-up word once again. Repetition is a usual human way of communication when there is a clear indication that the message has not reached the receiver correctly. If the said word is the same within given boundaries, it is close to 100 % certain that this word is the wake-up word. If the comparison is positive in blocks 25, 26, the device will wait for the action (the command) of the user 13. If the comparison is negative in 26, the wake-up word has not been said and the device will return to the initial listening mode, block 11.
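The additional negative-branch logic of Figure 2 (blocks 22-26) can be sketched as below. The similarity helpers, the data representation and the five-second window lifetime of the temporary store are assumptions made for illustration only; the control flow follows the description.

```python
# Sketch of the Figure 2 negative branch (blocks 22-26). The similarity
# helpers and the sample representation are hypothetical stubs.
import time

SECOND_WINDOW_S = 5.0     # preset second time period (example value from the text)

negative_database = []    # block 23
temporary_store = []      # block 24: (timestamp, sample) pairs

def close_to_wake_word_model(sample) -> bool:
    """Block 22: is the rejected sample still near the wake-up word's acoustic model?"""
    raise NotImplementedError

def samples_match(a, b) -> bool:
    """Block 25: are two rejected samples the same word within given boundaries?"""
    raise NotImplementedError

def handle_negative_candidate(sample, now=None):
    """Return True when the sample is converted into a positive wake-up word (block 26)."""
    now = time.monotonic() if now is None else now
    if not close_to_wake_word_model(sample):      # block 22, "not close" branch
        negative_database.append(sample)          # block 23
        return False                              # back to listening (block 11)
    # Block 24: keep the candidate, dropping entries older than the time window.
    temporary_store[:] = [(t, s) for (t, s) in temporary_store
                          if now - t <= SECOND_WINDOW_S]
    for t, earlier in temporary_store:            # block 25: compare with earlier ones
        if samples_match(sample, earlier):
            return True                           # block 26: treat as a positive wake-up word
    temporary_store.append((now, sample))
    return False
```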
As a further feature, we introduce the concept of a confidence level. It means the preset threshold according to which the decision between the positive result and the negative result is made in the comparison steps. The confidence level may be tuned automatically during the process through cognitive control, so that the system continuously learns the way the user uses his/her voice, also because the characteristics of each human voice are very personal and distinctive between different users.
In one embodiment of the invention, the confidence level may be tuned in the situation where, according to the flow chart of Figure 2, the decision flow goes through blocks 11, 12, 22, 24, 25, 26 and 13. In that case, the user tries at first to say the correct wake-up word but the system does not recognize it positively at the first step 12. In that situation, the confidence level may be lowered so that in a subsequent, similar situation with the same user, the logic will make a positive decision already through blocks 11, 12 and 13 of the flow chart.
Generally, by checking what the user does (says) after the assumed wake-up word, it can be deduced whether it really was a wake-up word complemented with a proper command. In case there is a repetition of a word or a group of words, it is highly likely that this is not general conversation but indeed an intended wake-up word or an intended command provided to the device. In case such repeated wake-up words do not trigger the method into step 13, the confidence level will be tuned lower so that the next wake-up word from the same user will trigger the command listening mode of block 13.
The repeated wake-up words need not be consecutive words; the device may also be set so that within a given time period, the repetitive manner of the given words is checked. The time period may be set e.g. to 5 seconds, and the wake-up words need not be said consecutively.
In one embodiment of the invention, the system may apply several different wake-up words or groups of words. For two different wake-up words, such as “Kuule” and “Odota hetki” in the Finnish language (“Listen up, immediately” and “Wait a moment”, respectively), the system can have a specific confidence level and databases for the first wake-up word, and another confidence level and databases for the second wake-up word. In that embodiment, the first wake-up word and its decision logic work independently in relation to the logic relating to the second wake-up word. In such an embodiment, the word databases are in a way duplicated for serving the two wake-up words available for the system.
Of course, there can be even more different wake-up words simultaneously usable within the system. In that case, the databases and confidence levels are multiplied correspondingly in order to create independent decision logic for each wake-up word (or wake-up sentence meaning the group of words).
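One straightforward way to organize this per-wake-word state is a small record per wake-up word or sentence. The field names and the default confidence value in the sketch below are illustrative assumptions, not definitions taken from the patent.

```python
# Sketch of per-wake-word state: each wake-up word (or wake-up sentence) gets
# its own positive/negative model storages and its own adjustable confidence
# level, so the decision logics stay independent of each other.
from dataclasses import dataclass, field

@dataclass
class WakeWordState:
    phrase: str
    confidence_level: float = 4000.0              # example threshold from the text
    positive_model: list = field(default_factory=list)
    negative_model: list = field(default_factory=list)
    temporary_store: list = field(default_factory=list)

# Independent state per wake-up word.
wake_words = {
    "Kuule": WakeWordState("Kuule"),
    "Odota hetki": WakeWordState("Odota hetki"),
}

def adjust_confidence(state: WakeWordState, new_level: float) -> None:
    """Raise or lower the threshold for this wake-up word only."""
    state.confidence_level = new_level
```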
The device may be complemented with an internal or external audio speaker device which can indicate the different recognition steps of the method to the human user while the system is in use. For instance, when the wake-up word is correctly recognized, the device may output a first acknowledgement sound or tune through the speaker device. Furthermore, when an appropriate command is given and recognized correctly, the device may output a second acknowledgement sound or tune to indicate the proper reception of the command back to the human user.
The present invention may be applied in numerous environments and technical solutions. Generally, any field already applying speech recognition, or where speech recognition could possibly enhance the operation in some way in the future, is a target operation area for the invention. For instance, assisting instruments, tools, devices and applications in the healthcare business form an important technical field assisting elderly people and patients with disabilities. Smartphone applications are also a good platform for applying the presented method according to the invention. It is probably currently the easiest way of realizing the invented system in everyday household use, because smartphones are nowadays so widely popular, and their popularity is quickly increasing in the elderly population as well.
Regarding the actual devices in which the presented speech recognition algorithm may be implemented physically, an example of the device screen during the speech recognition procedure can be seen in Figure 3, and a simplified block diagram of the elements required in the device for the speech recognition algorithm can be seen in Figure 4.
Figure 3 shows a smartphone 31 which acts as a speech recognition device in one embodiment. The screen 32 acts as an interface between the user and the smartphone. The screen 32 shows the interaction between the user and the device when the user gives voice commands to the smartphone itself or to another device to be controlled remotely.
The user may at first say something close to the intended wake-up word to trigger the device, but in this example the user pronounces the wake-up word a bit poorly, though he or she intended the correct wake-up word. This is of course very human, and also some medical states (illnesses) may affect speech clarity. The wake-up word in this example is “Listen”, but at first the smartphone speech recognition app interprets the said word as “Lazy”, indicating a false result and thus no triggering into the actual command listening mode. After the word “Lazy” has been interpreted to have been said, the method has actually gone to step 24 of the method chart because the word is close to the actual wake-up word. The device then waits for the next verbal action by the user.
At the next step, the user repeats the word, but in this exemplary situation the word is said in a much clearer fashion. This time, the speech recognition software of the smartphone interprets the said word as “Listen”. A positive result is achieved and the device will enter step 13 of the method chart. There, the device will wait for the actions of the user, meaning the actual command he/she intends to give.
At the next step, the user says “Call Jarkko”, thus ordering the device to start a phone call to a Finnish friend called Jarkko. After this command is said, the device has entered step 16 of the method chart, meaning the identification of the command and proceeding into the actual triggering of the command. The smartphone thus initiates a telephone call procedure to Jarkko, whose number has been saved into the device beforehand. Of course, the command may also be in the form “Call 0401234567”, meaning that the telephone number does not necessarily need to be pre-saved into the device memory or SIM card.
All the time when the smartphone is on and the speech recognition app is activated, the active stand-by (or listening) mode of the device may be shown to the user by showing a symbol on the screen, or by showing a text box such as in the lower part of the screen 32 of Figure 3. This gives the user the important information that the device is ready and functioning, and thus ready to receive commands together with the wake-up word(s).
There might also be a vertical or horizontal bar on the screen indicating, in real time, the voice intensity received through the microphone of the device. From there the user receives valuable feedback that the device is in an active listening mode.
Figure 4 illustrates certain physical functional elements and parts within a speech recognition device, such as a smartphone, required for a proper voice recognition system. This is a simplified block diagram comprising merely the most important parts needed in a working device, also in order to create a pleasant experience for the user. The device is a smartphone 31 which has an antenna 46 allowing wireless access to the base station and also allowing an Internet connection, if desired. The core element is the processing unit, or a processor 41, which handles all internal calculation processes within the device 31. The required software may be locally saved in a memory 42, but of course some relevant software may also originate from cloud services. Also the speech recognition software application (app) 43 may be downloaded through external application providers such as Google Play. The downloaded piece of software 43 is saved into the memory 42 of the device. The essential user interface sensor element is a microphone 45 capturing the audio information around the device 31. It depends on the location of the device whether there is ambient noise which would significantly harm the quality of the speech reception and the recognition. Through placement of the device in close vicinity of the user, and by selecting a peaceful environment (if possible), the speech reception quality through the microphone 45 will be enhanced.
A practical feedback tool for the user is sound information, which may be provided in connection with both successful and non-successful commands. The internal speaker 44 of the smartphone fulfils this task sufficiently well. Of course, the sound information may be replaced or complemented by visual information on the screen (like the ones shown in Fig. 3) or e.g. by a light signal of some form in the case of a dedicated voice recognition device, for instance. The speaker 44 may output beeps or other sounds acknowledging a received command. The beep may also inform the user whether the wake-up word or a command has been positively or negatively received and interpreted. The device speaker 44 may also output speech messages made either artificially or through pre-recorded sound clips. In that case, the speech acknowledgement messages may be repeated commands, or the action which the device is about to take. The acknowledgement message may be formulated in a question form, which the user must confirm or decline e.g. by “Yes” or “No”. This allows a possibility to double-check the given audio command, which will further diminish the risks emerging through activation of an unintended command.
Going back to the algorithm, an alternative manner of performing the algorithm is discussed next. In recognizing the wake-up word in the audio input flow from the vicinity of the speech recognizing device, the basic principle is whether the recognizing algorithm believes it has received the correct wake-up word with a high probability. The probability of the correctness of the wake-up word can be determined with a numerical value which has a theoretical maximum value for 100 % correctness in the recognition. In practice, a threshold value can be defined which is the limit for the “high probability”.
In an example relating to the previous paragraph, the wake-up word may be the Finnish word “Kuule” (engl. “Please listen”). A threshold value of 4000 points can be defined. This is the confidence level of the wake-up word. At first, when the device user says the word “Kuule”, the algorithm in block 12 of Fig. 2 results in a numerical value of 3800 when analyzing the said word. This means that the speech recognizing algorithm will not accept this word as a correct wake-up word. At the same moment, a timer is started and a preset time window is set for it. If during the preset time window the user says the same wake-up word and the device receives another identification of the same wake-up word, e.g. with a value of 3750 points, the algorithm determines that the user has tried to say the correct wake-up word even though the threshold has not been exceeded. This means that the device will listen for the command, and also that the numerical threshold for the wake-up word identification has been set too high for that particular environment and/or particular user situation. In that regard, the system may change the threshold value either permanently or temporarily to an appropriate lower value. Of course, the threshold could also be too low, accepting even wrong words as a wake-up word, and in that case the threshold value could be raised either temporarily or permanently. In the above example, the threshold value could be changed to 3775 points or even 3700 points to ensure that the less clearly pronounced correct wake-up words are recognized correctly. This will enhance user satisfaction considerably and lessen possible frustration, because the wake-up word need not be repeated as much in order for the device to start listening to the actual command.
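The numeric behaviour described above can be traced with a small sketch. The adaptation step chosen here (moving the threshold slightly below the better of the two repeated scores) is one possible policy picked for illustration, not a rule given in the text; the example numbers 4000, 3800 and 3750 come from the paragraph above.

```python
# Sketch of the threshold (confidence level) adaptation described above.
# The exact adaptation policy is an assumption made for illustration.

class WakeWordThreshold:
    def __init__(self, threshold=4000.0, window_s=5.0):
        self.threshold = threshold      # current confidence level
        self.window_s = window_s        # preset time window for a repeat
        self.last_rejected = None       # (timestamp, score) of the last near miss

    def score_candidate(self, score, now):
        """Return True when the candidate is accepted as the wake-up word."""
        if score >= self.threshold:
            return True                               # normal positive detection
        if (self.last_rejected is not None
                and now - self.last_rejected[0] <= self.window_s):
            # Second near miss inside the window: accept it, and lower the
            # threshold slightly below the better of the two scores.
            best = max(score, self.last_rejected[1])
            self.threshold = best - 25.0              # e.g. 3800 -> 3775 points
            self.last_rejected = None
            return True
        self.last_rejected = (now, score)             # remember the first near miss
        return False

# Trace of the worked example: "Kuule" scores 3800, then 3750 within the window.
t = WakeWordThreshold()
print(t.score_candidate(3800, now=0.0))   # False: below 4000, remembered
print(t.score_candidate(3750, now=3.0))   # True: repeated near miss, accepted
print(t.threshold)                        # 3775.0
```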
In one embodiment of the invention, the comparing process is made such that the pronounced word is compared with the usual manner of pronouncing the word in the specified language. It can be said that this reference word or reference group of words is a median or average of all the ways, accents, speech speeds and tones used in saying that particular word or sentence in the given language among native speakers. This median or average manner of speaking a wake-up word forms a previously defined acoustic model of the wake-up word.
The above situation, with the numerical values being lower than the set threshold, might occur because the user has a medical condition which affects the clarity and/or the volume of the pronunciation of the speech. There might also be a situation where several different users give voice commands to the device. Therefore the identification results and the resulting “points” may clearly differ from each other when two different users say the same wake-up word. Therefore, it is useful to use a temporary change in the threshold value in order to adjust the sensitivity of the listening mode of the device for each of the different speakers in a mutually similar fashion. In this case it is assumed that each of the human speakers uses the device for a given period, and the next speaker follows the first speaker after a given time period, meaning that no overlapping instructions are given almost simultaneously to the device by two different users.
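Keeping a separate confidence level per user, as suggested above, can be sketched as follows. Identifying the current speaker by an explicit user_id is an assumption made for illustration; the text only assumes that the speakers use the device one at a time.

```python
# Sketch of per-user confidence levels. How the current speaker is identified
# (here a plain user_id string) is a hypothetical choice.

DEFAULT_THRESHOLD = 4000.0
user_thresholds = {}              # confidence level per human user

def threshold_for(user_id):
    """Threshold used when this speaker's wake-up word candidates are scored."""
    return user_thresholds.get(user_id, DEFAULT_THRESHOLD)

def adjust_for_user(user_id, new_threshold):
    """Store a user-specific confidence level, e.g. after repeated near misses."""
    user_thresholds[user_id] = new_threshold

# Example: a patient with unclear pronunciation gets a lower threshold.
adjust_for_user("patient", 3700.0)
print(threshold_for("patient"), threshold_for("visitor"))   # 3700.0 4000.0
```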
In yet another embodiment of the invention, it is possible to compare two different identified wake-up words or commands with each other. In this way the comparison does not happen between the received voice sample and a reference word model within the device, but between two consecutive or non-consecutive previously voiced words or sentences. By comparing the previous “word samples” with each other, a more reliable estimate of the correct wake-up word can be achieved.
In one embodiment of the invention, we consider the case where several different wake-up words have been set in the memory of the speech recognition device. This is analogous to the situation where there could be a preset group of the most important commands available to e.g. a patient in a nursing home. In that case, the used commands may be used directly as both the wake-up word and the command, both incorporated in the said command. For instance, there could be three commands: “Call the doctor”, “Call my daughter”, and “Turn on/off the radio”. Each of these actions could be triggered directly by saying the command, without any triggering wake-up word. This of course requires that the device is constantly in an active listening mode. This is analogous to the situation where these three commands are the same as three wake-up sentences, and no actual “command” after the wake-up sentence is therefore required in order to initiate a command or control order for an external device. The actual recognition process of the suitable command works in the same way as previously explained in relation to the potential wake-up words.
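Treating each preset command phrase as its own wake-up sentence with a directly attached action could be organized as in the sketch below. The action callbacks and the assumption that the recognizer already returns the matched phrase are illustrative placeholders.

```python
# Sketch of commands that double as wake-up sentences: each phrase maps
# directly to an action, with no separate wake-up word. The callbacks are
# hypothetical stand-ins for calls, device control, etc.

def call_doctor():       print("calling the doctor")
def call_daughter():     print("calling my daughter")
def toggle_radio():      print("toggling the radio")

command_sentences = {
    "Call the doctor":       call_doctor,
    "Call my daughter":      call_daughter,
    "Turn on/off the radio": toggle_radio,
}

def handle_recognized_sentence(sentence: str) -> bool:
    """In constant listening mode, a positively identified command sentence
    triggers its action immediately."""
    action = command_sentences.get(sentence)
    if action is None:
        return False          # not one of the preset command sentences
    action()
    return True
```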
As a generalization, a wake-up word here also means that the wake-up word may comprise several words forming a short sentence or part of a sentence. A wake-up word thus means a sentence of 1...N words, whichever is seen as reasonable for the user and for the environment in which the device is used. Of course, the shorter wake-up sentences comprising 1-3 words are the most reasonable option, in order to ensure a good and smooth user experience in waking up the device before the actual command.
The device according to the invention can of course include many other elements for producing a practical speech recognition device and command unit for e.g. disabled people. Also people with speech difficulties or a speech impairment condition will benefit from the use of the invention, because their pronunciation might not be that clear but it will be sufficiently close to the average pronunciation. For them, and also in a more general fashion, the system can be adapted for different individual users by setting the wake-up word identification threshold values user-specifically. Referring to the above, and especially to the threshold values in the comparing step, the system may be adjusted to include individual threshold values for each individual device user (the patient, for instance). Importantly, the commands may also direct other devices than just the smartphone itself. The applications may include anything from home appliance control methods, like television or radio remote controlling, to adjusting the temperatures at home or wherever the user is staying. One further application of smartphone-specific internal commands through voice could be sending SMSs through voice commands.
Beyond speech recognition as such, the invented method is also suitable for control methods involving speech input, text input, search string input for search services, and setting timings for devices or systems. The technical application environments in which the method can be implemented comprise industrial processes and control methods, hospital environments with healthcare and patient monitoring applications, personal home entertainment devices and home automation control systems, vehicle control systems through voice input, and various possibilities in smartphone applications relating to anything where text or data input could be replaced by human voice input. The application may be realized in regular smartphones, tablets and regular PCs. Search services, map or locationing services applying GPS, gaming applications, and generally data browsing applications are just a few examples of the possible technical applications available for the present invention.
In an embodiment, the location information available e.g. from the smartphone may be used in deciding the environment where the device is used. For instance, the home of the user is an example of a usually quite quiet environment where noise can be created by another guest, or e.g. by television sounds and voices. Another example is an outdoor public venue where the ambient sounds may have large intensities but other human voices may also well be present. A hospital environment is yet another example of a location where the device and method can be applied.
The advantages of the invention are the general quality improvement regarding voice command recognition and command triggering. Impulses which were not intended as a wake-up word will not end up initiating a command. On the other hand, wake-up words which are said intentionally will better initiate the command listening mode without the need for repeating the wake-up word, which would be frustrating for the user. One great advantage is that the method tunes itself user-specifically. This means that the wake-up word can be selected according to the selection of the user, and most beneficially, the tones, pitches and speaking speeds of each user are taken into account in deciding the correct confidence level for each user. The system thus becomes very personal and pleasant to use. It can be said that the system thus applies biometric sensing.
The advantages of a remote voice control device naturally include the possibility to use the device without physically touching it with the hands or any other part of the human body. The smartphone, for instance, could be located on a hospital table, where a disabled patient could create the control commands solely through his/her voice. Many actions and procedures could also be automatized with the help of the used device, such as a smartphone. For instance, an incoming call could be opened automatically within 10 seconds in order to ensure the telephone contact to the user without the user responding to the incoming call. It is also possible to set the listening mode on constantly, and if the incoming audio signal has been weak for a set time period and/or the device has not been activated in the set time period, an alarm may be triggered to the medical staff or to the closest relative(s) taking care of the patient.
One advantage of the invention is that the method is usable also in situations where several different human speakers provide instructions to a single device. The method recognizes different tones and manners of speech, and the steps of the method will work properly regardless of the actual speaker giving the audio commands.
The presented method according to the invention can be realized through a computer program where the method steps are implemented in the form of computer program code and executed by a processor. The computer program may be stored in a computer-readable medium.
All the appropriate method steps in the independent and dependent method claims can be implemented also as an embodiment of the computer program which can further be stored in a computer-readable medium as well.
It is possible to combine at least two of the presented features together in order to create a new embodiment of the apparatus or method according to the invention.
The scope of protection is not restricted merely to the examples mentioned above, but may vary within the scope of the claims.

Claims (25)

PatenttivaatimuksetThe claims 1. Menetelmä puheentunnistukseen kykenevässä laitteessa (31) käytettävän herätyssanan tunnistuksen laadun parantamiseksi ja laitteen (31) tai ulkoisen laitteen ohjaamiseksi ihmisäänikomennoilla, jolloin menetelmä käsittää vaiheen, jossa -asetetaan laite (31) aktiiviseen kuuntelumoodiin tai pidetään laite (31) aktiivisessa kuuntelumoodissa, jolloin on määritelty ennalta ainakin yksi herätyssana ja sen akustinen malli laitteen (31) käytettävissä olevaan muistiin (42);A method for improving the quality of a wake-up word recognition device used in a speech recognition-capable device (31) and controlling the device (31) or an external device by human voice commands, the method comprising: - setting the device (31) into an active listening mode; at least one predetermined alarm word and its acoustic model in the available memory (42) of the device (31); tunnettu siitä, että menetelmä käsittää lisäksi vaiheet, joissacharacterized in that the method further comprises the steps of: - identifioidaan ainakin yksi herätyssana (11) audioympäristöstä aktiivisen kuuntelumoodin aikana vertaamalla vastaanotettuja ääninäytteitä aikaisemmin määriteltyyn akustiseen malliin ja ilmaistaan positiivisesti identifioitu herätyssana, kun vastaanotetun ääninäytteen identifiointivastaavuus ylittää asetetun kynnysarvon, päättämällä (12), onko ilmaistu ääninäyte lähempänä positiivista herätyssanamallia vai negatiivista herätyssanamallia, ja tehdään päätös positiivisesti identifioidusta herätyssanasta siinä tapauksessa, että ilmaistu ääninäyte on lähempänä positiivista herätyssanamallia; ja kun positiivisesti identifioitua herätyssanaa seuraa laitteelle (31) tai ulkoisen laitteen ohjaamiseen tarkoitettu tunnistettavissa oleva audiokomento (16) ennalta asetetussa ensimmäisessä aikajaksossa positiivisesti identifioidun herätyssanan ilmaisemisen jälkeen,identifying at least one wake-up word (11) from the audio environment during active listening mode by comparing the received audio samples with a previously defined acoustic model, and expressing the positively identified alarm when the received a decision on a positively identified alarm word if the detected audio sample is closer to the positive alarm word pattern; and when the positively identified wake-up word is followed by an identifiable audio command (16) for the device (31) or for controlling the external device in a predetermined first time period after the positively identified wake-up word is detected, -annetaan komento laitteelle (31) tai ulkoiselle laitteelle laitteen (31) tai ulkoisen laitteen suorittaman toimenpiteen käynnistämiseksi.providing a command to the device (31) or to an external device to initiate an action performed by the device (31) or the external device. 2. 
Patenttivaatimuksen 1 mukainen menetelmä, tunnettu siitä, että menetelmä lisäksi käsittää, ettäA method according to claim 1, characterized in that the method further comprises: - määritetään (22), onko negatiivisesti identifioitu ääninäyte lähellä herätyssanan akustista mallia, ja jos on,- determining (22) whether the negatively identified sound sample is near the acoustic model of the excitation word and, if so, -verrataan (25) negatiivisesti identifioitua ääninäytettä aikaisemmin ilmaistuun negatiivisesti identifioituun ääninäytteeseen, jotka molemmat vastaanotettiin ennalta asetetun toisen aikajakson aikana audioympäristöstä, ja siinä tapauksessa, että näiden kahden negatiivisesti identifioidun ääninäytteen välillä on keskinäinen vastaavuus,- comparing (25) the negatively identified audio sample with the previously detected negatively identified audio sample, both received from the audio environment during a predetermined second period of time, and in the event that there is a match between the two negatively identified audio samples; - muunnetaan (26) negatiivisesti identifioitu jälkimmäinen ääninäyte positiivisesti identifioiduksi herätyssanaksi.converting (26) the negatively identified second audio sample into a positively identified excitation word. 20156000 prh 26 -03- 201920156000 prh 26 -03-20199 3. Patenttivaatimuksen 1 mukainen menetelmä, tunnettu siitä, että menetelmässä kerätään positiivisia ja negatiivisia herätyssanamalleja kumulatiivisesti ja/tai käyttäjäkohtaisesti menetelmän toiminnan aikana, mistä on seurauksena herätyssanojen ja komentojen käyttäjäkohtainen ja ympäristön mukaan säädettävä identifiointi.A method according to claim 1, characterized in that the method collects positive and negative excitation word patterns cumulatively and / or per user during operation of the method, resulting in user-specific and environmentally adjustable identification of the excitation words and commands. 4. Patenttivaatimuksen 2 mukainen menetelmä, tunnettu siitä, että negatiivisesti identifioitu ääninäyte, jonka määritetään olevan lähellä, tallennetaan (24) väliaikaiseen sanatietokantaan.The method of claim 2, characterized in that the negatively identified audio sample, determined to be near, is stored (24) in a temporary word database. 5. Patenttivaatimuksen 1 mukainen menetelmä, tunnettu siitä, että jos negatiivisesti identifioitu ääninäyte ei ole lähellä oikeaa herätyssanaa,A method according to claim 1, characterized in that if the negatively identified sound sample is not near the correct excitation word, - negatiivisesti identifioitu ääninäyte tallennetaan negatiiviseen herätyssanamalliin (23); ja- storing the negatively identified voice sample in the negative excitation word pattern (23); and - menetelmä palaa audioympäristön aktiiviseen kuuntelumoodiin.the method returns to the active listening mode of the audio environment. 6. Patenttivaatimuksen 1 mukainen menetelmä, tunnettu siitä, että positiivisesti identifioitu herätyssana tallennetaan positiiviseen herätyssanamalliin (21), kun tunnistettavissa oleva audiokomento on vastaanotettu ennalta asetetussa ensimmäisessä aikajaksossa.A method according to claim 1, characterized in that the positively identified alarm word is stored in the positive alarm word pattern (21) after the identifiable audio command has been received in a preset first time period. 7. 
7. A method according to claim 1 or 2, characterized in that the identification, determination and comparison steps are based on a confidence level of the wake-up word; and wherein the confidence level is continuously adjustable based on each received audio sample.

8. A method according to claim 7, characterized in that if a conversion of the latter negatively identified audio sample into a positively identified wake-up word has been performed,
- the confidence level of the wake-up word is adjusted and both of said audio samples are stored in the positive wake-up word model.

9. A method according to claim 1, characterized in that information on at least two different wake-up words and their acoustic models is stored in the memory (42) of the device (31); and the method further comprises the step of
- collecting the corresponding positive wake-up word models and negative wake-up word models in the memory (42) separately for each wake-up word.

10. A method according to claim 7, characterized in that information on at least two different wake-up words and their acoustic models is stored in the memory (42) of the device (31); and the method further comprises the step of
- setting and adjusting the confidence levels separately for each wake-up word.

11. A method according to claim 7, characterized in that the confidence levels can be stored and adjusted separately for each different human user of the device.
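As an illustration only, the sketch below shows one way the per-user, per-wake-word confidence levels of claims 7, 10 and 11 could be stored and adjusted. The default value, the adjustment step and its direction are assumptions; the claims only state that the level is continuously adjusted based on each received audio sample.

class ConfidenceStore:
    """Keeps one confidence level per (user, wake-up word) pair."""

    def __init__(self, default=0.6):
        self.default = default
        self.levels = {}                          # (user, wake_word) -> confidence level

    def get(self, user, wake_word):
        return self.levels.get((user, wake_word), self.default)

    def adjust(self, user, wake_word, accepted, step=0.02):
        # One possible rule: lower the level slightly when a repeated near miss was
        # converted into a wake-up word (claim 8), raise it after a clear rejection.
        level = self.get(user, wake_word)
        level += -step if accepted else step
        self.levels[(user, wake_word)] = min(0.95, max(0.05, level))

For example, store.adjust("alice", "hey device", accepted=True) would make the recognizer slightly more permissive for that user and word on subsequent samples.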
12. A device (31) for improving the quality of wake-up word recognition, the device (31) being capable of speech recognition, and for controlling the device (31) or an external device with human voice commands, wherein
- the device (31) is configured to be set into an active listening mode or to be kept in an active listening mode, wherein at least one wake-up word and its acoustic model have been defined in advance in a memory (42) available to the device (31);
characterized in that the device (31) comprises
- processing means (41) configured to identify at least one wake-up word (11) from the audio environment during the active listening mode by comparing received audio samples with the previously defined acoustic model and to detect a positively identified wake-up word when the identification match of a received audio sample exceeds a set threshold value, wherein the device (31) further comprises a positive wake-up word model and a negative wake-up word model, wherein the processing means (41) are further configured to decide (12) whether the detected audio sample is closer to the positive wake-up word model or to the negative wake-up word model, and to make a decision on a positively identified wake-up word in the case that the detected audio sample is closer to the positive wake-up word model; and when the positively identified wake-up word is followed by an identifiable audio command (16) intended for controlling the device (31) or an external device within a preset first time period after the detection of the positively identified wake-up word, the processing means (41) are further configured
- to give a command to the device (31) or to the external device in order to initiate an action performed by the device (31) or the external device.
13. A device according to claim 12, characterized in that the processing means (41) are further configured
- to determine (22) whether a negatively identified audio sample is close to the acoustic model of the wake-up word, and if it is, the processing means (41) are further configured
- to compare (25) the negatively identified audio sample with a previously detected negatively identified audio sample, both of which were received from the audio environment during a preset second time period, and in the case that there is a mutual match between these two negatively identified audio samples, the processing means (41) are further configured
- to convert (26) the latter negatively identified audio sample into a positively identified wake-up word.

14. A device according to claim 12, characterized in that the device (31) is configured to collect positive and negative wake-up word models cumulatively and/or per user during the use of the device (31), resulting in user-specific and environment-adaptive identification of the wake-up words and commands.

15. A device according to claim 13, characterized in that the device (31) comprises a temporary word database in which the negatively identified audio sample determined to be close is stored (24).

16. A device according to claim 12, characterized in that if the negatively identified audio sample is not close to the correct wake-up word,
- the negatively identified audio sample is stored in the negative wake-up word model (23); and
- the device (31) is configured to return to the active listening mode of the audio environment.
17. A device according to claim 12, characterized in that the processing means (41) are configured to store the positively identified wake-up word in the positive wake-up word model (21) when the identifiable audio command has been received within the preset first time period.

18. A device according to claim 12 or 13, characterized in that the processing means (41) are configured to perform the identification, determination and comparison based on a confidence level of the wake-up word; and wherein the confidence level is configured to be continuously adjustable based on each received audio sample.

19. A device according to claim 18, characterized in that if a conversion of the latter negatively identified audio sample into a positively identified wake-up word has been performed,
- the processing means (41) are configured to adjust the confidence level of the wake-up word and to store both of said audio samples in the positive wake-up word model.

20. A device according to claim 12, characterized in that information on at least two different wake-up words and their acoustic models is stored in the memory (42) of the device (31); and the processing means (41) are further configured
- to collect the corresponding positive wake-up word models and negative wake-up word models in the memory (42) separately for each wake-up word.

21. A device according to claim 18, characterized in that information on at least two different wake-up words and their acoustic models is stored in the memory (42) of the device (31); and the processing means (41) are further configured
- to set and adjust the confidence levels separately for each wake-up word.
22. A device according to claim 18, characterized in that the device (31) is configured to store and adjust the confidence levels separately for each different human user of the device.

23. A computer program for improving the quality of wake-up word recognition used in a device (31) capable of speech recognition and for controlling the device (31) or an external device with human voice commands, the computer program comprising code executable by processing means (41), and the computer program comprising the step of
- setting the device (31) into an active listening mode or keeping the device (31) in an active listening mode, wherein at least one wake-up word and its acoustic model have been defined in advance in a memory (42) available to the device (31);
characterized in that the computer program further comprises the steps of
- identifying at least one wake-up word (11) from the audio environment during the active listening mode by comparing received audio samples with the previously defined acoustic model, and detecting a positively identified wake-up word when the identification match of a received audio sample exceeds a set threshold value, by deciding (12) whether the detected audio sample is closer to a positive wake-up word model or to a negative wake-up word model, and making a decision on a positively identified wake-up word in the case that the detected audio sample is closer to the positive wake-up word model; and when the positively identified wake-up word is followed by an identifiable audio command (16) intended for controlling the device (31) or an external device within a preset first time period after the detection of the positively identified wake-up word,
- giving a command to the device (31) or to the external device in order to initiate an action performed by the device (31) or the external device.
24. A computer program according to claim 23, characterized in that the computer program further comprises the steps of
- determining (22) whether a negatively identified audio sample is close to the acoustic model of the wake-up word, and if it is,
- comparing (25) the negatively identified audio sample with a previously detected negatively identified audio sample, both of which were received from the audio environment during a preset second time period, and in the case that there is a mutual match between these two negatively identified audio samples,
- converting (26) the latter negatively identified audio sample into a positively identified wake-up word.

25. A computer program according to claim 23 or 24, characterized in that the computer program is implemented on a computer-readable medium.
FI20156000A 2015-12-22 2015-12-22 Speech recognition method and apparatus based on a wake-up word FI128000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
FI20156000A FI128000B (en) 2015-12-22 2015-12-22 Speech recognition method and apparatus based on a wake-up word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
FI20156000A FI128000B (en) 2015-12-22 2015-12-22 Speech recognition method and apparatus based on a wake-up word

Publications (2)

Publication Number Publication Date
FI20156000A FI20156000A (en) 2017-06-23
FI128000B 2019-07-15

Family

ID=59285510

Family Applications (1)

Application Number Title Priority Date Filing Date
FI20156000A FI128000B (en) 2015-12-22 2015-12-22 Speech recognition method and apparatus based on a wake-up word

Country Status (1)

Country Link
FI (1) FI128000B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11749267B2 (en) 2020-11-20 2023-09-05 Google Llc Adapting hotword recognition based on personalized negatives

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108320733B (en) * 2017-12-18 2022-01-04 上海科大讯飞信息科技有限公司 Voice data processing method and device, storage medium and electronic equipment
CN109872713A (en) * 2019-03-05 2019-06-11 深圳市友杰智新科技有限公司 A kind of voice awakening method and device
CN110534099B (en) * 2019-09-03 2021-12-14 腾讯科技(深圳)有限公司 Voice wake-up processing method and device, storage medium and electronic equipment
CN113658593B (en) * 2021-08-14 2024-03-12 普强时代(珠海横琴)信息技术有限公司 Wake-up realization method and device based on voice recognition

Also Published As

Publication number Publication date
FI20156000A (en) 2017-06-23

Similar Documents

Publication Publication Date Title
FI128000B (en) Speech recognition method and apparatus based on a wake-up word
AU2018241137B2 (en) Dynamic thresholds for always listening speech trigger
US10438595B2 (en) Speaker identification and unsupervised speaker adaptation techniques
KR101726945B1 (en) Reducing the need for manual start/end-pointing and trigger phrases
JP6251343B2 (en) Hotword detection on multiple devices
EP3353677B1 (en) Device selection for providing a response
CN113095798B (en) Social alerts
CN112074900B (en) Audio analysis for natural language processing
US20180122372A1 (en) Distinguishable open sounds
JP5996603B2 (en) Server, speech control method, speech apparatus, speech system, and program
BR112015018905B1 (en) Voice activation feature operation method, computer readable storage media and electronic device
JP2016126330A (en) Speech recognition device and speech recognition method
JP2010232780A (en) Apparatus, method and program for communication control
US20240005918A1 (en) System For Recognizing and Responding to Environmental Noises
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
JP7140523B2 (en) Nursing care act estimation system
JPWO2017175442A1 (en) Information processing apparatus and information processing method
CN109271480B (en) Voice question searching method and electronic equipment
JP2011221101A (en) Communication device
JP2013257448A (en) Speech recognition device

Legal Events

Date Code Title Description
FG Patent granted

Ref document number: 128000

Country of ref document: FI

Kind code of ref document: B