WO2019224434A2 - Improving embedded voice recognition devices

Info

Publication number: WO2019224434A2
Application number: PCT/FR2019/000081
Authority: WO
Inventors: Joseph DUREAU; Alaa Saade; Alexandre GAULIER; Alice Coucke; Adrien BALL; Théodore BLUCHE; David LEROY; Clément DOUMOURO; Thibault Gisselbrecht; Francesco CALTAGIRONE; Thibaut LAVRIL; Maël PRIMET
Original assignee: Snips, Sas
Priority date: 2018-05-24
Filing date: 2019-05-24
Publication date: 2019-11-28
Also published as: WO2019224434A3; FR3081599A1

Abstract

The invention relates to methods for improving voice recognition, implemented by a stand-alone information processing device, providing for embedded voice recognition via a dedicated human-machine interface, based on the optimisation of the three essential components, which are: an acoustic model for automatic speech recognition (MA) using an artificial neural network, a language model (ML), and a natural language processing engine (NLU) that must be able to function completely autonomously and efficiently even considering the limited storage capacities and computational power that the connected objects and pico-computers provide.

Description

Invention Title: Improved On-Board Voice Recognition Devices

The field of the invention is that of the adaptation of speech recognition techniques, including those allowing the comprehension of voice commands, to the constraints of a connected object, such as for example a digital assistant, providing autonomous processing.

Numerous voice recognition and voice command techniques are already known that rely on the capabilities of cloud computing networks to operate a digital assistant.

Such configurations have definite advantages in computing capacity and remote storage, but they imply that users' personal data, including recordings of their voices, are partially or fully transmitted online and processed remotely. This induces significant risks in terms of security and the protection of personal data, especially with regard to the requirements of the European General Data Protection Regulation (GDPR), known in French as the "RGPD".

The technical problem addressed by the present invention therefore lies in obtaining speech processing and interpretation that is carried out in real time and is of good quality, using only very limited processing and storage capacities, that is to say those that current embedded devices usable in connected objects can provide, such as, by way of example, Raspberry Pi 3 pico-computers.

Such a technical problem is already raised by a 2017 article entitled "Speech Recognition and Understanding on Hardware-Accelerated DSP" (G. Stemmer et al., Proceedings of Interspeech 2017, Show & Tell Contribution, 20/08/2017, pp. 2036-2037), but that article remains very generic and imprecise with regard to the technical means that can be used to solve it.

To do this, the invention proposes to optimize the performance of the three main processes that are usually implemented in devices for speech recognition or for understanding voice commands, of the personal assistant type or other connected objects, namely:

the acoustic model (hereinafter referred to as "MA"),

the language model (hereinafter referred to as "ML"), and

the natural language interpretation engine (commonly identified as "NLU" for "Natural Language Understanding").

The optimization performed at each of these three levels facilitates that of the other processes, so that their cooperation produces a joint technical effect in terms of reducing the need for digital processing capacity and data storage.

As regards the acoustic model in the first place, the invention lies in using an artificial neural network whose small size and shallow depth are compatible with the storage capacities of an onboard computer of the type mentioned above, namely of the order of a few tens of Mbytes. To obtain such compactness while nevertheless ensuring a satisfactory level of speech recognition in real time, it is possible in particular to reduce the number of parameters of the neural network. In the current state of the art, for example, a network sized at approximately 3 million parameters fits in a storage space of approximately 10 Mbytes, whereas solutions that rely on cloud-based resources to perform recognition in real time as well can contain more than one hundred million parameters and have memory requirements of several gigabytes.

It is also possible according to the invention to reduce, in addition to their number, the precision of the parameters of the neural network, in particular by resorting to singular value decomposition techniques, which make it possible to replace a matrix describing the connections of the neural network by the product of smaller matrices.
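The orders of magnitude involved can be checked with a short calculation. The following Python sketch is illustrative only: the 4 bytes per parameter (32-bit floating point), the 1024 x 1024 layer and the rank of 128 are assumptions that do not come from the application, and the function names are hypothetical. It shows how a parameter count translates into storage, and how replacing one weight matrix by the product of two smaller ones reduces that count.

```python
def storage_megabytes(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate storage of a network in megabytes (10**6 bytes).

    bytes_per_parameter=4 assumes 32-bit floats (an assumption, not a
    figure from the application); lower precision shrinks this further.
    """
    return num_parameters * bytes_per_parameter / 1_000_000


def factorized_parameter_count(rows: int, cols: int, rank: int) -> int:
    """Parameters held by two factors (rows x rank) and (rank x cols)
    that replace a single rows x cols weight matrix."""
    return rank * (rows + cols)


# Hypothetical layer size and rank, chosen only for illustration.
full_count = 1024 * 1024
low_rank_count = factorized_parameter_count(1024, 1024, 128)

print(storage_megabytes(3_000_000))      # storage for 3 million parameters
print(full_count, low_rank_count)        # before and after factorization
```

Under these assumptions, 3 million parameters occupy about 12 Mbytes, the same order of magnitude as the figure given for the compact network, and the factorized layer holds a quarter of the parameters of the original one.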

Although the compactness thus obtained for the neural network, and therefore for the acoustic model which rests on it, necessarily reduces the performance of the acoustic model in terms of the fineness of sound detection, the invention makes it possible to compensate for this loss of resolution by implementing, as will be developed below, an optimization of the efficiency of the language model (ML) as well as of the natural language interpretation engine (NLU).

This compactness of the acoustic model embedded in the device does not prevent the acoustic model from having been connected, in an initial phase of its industrialization prior to its use by the end user, to a server containing a very extensive set of voice recordings and a generic lexicon indicating in particular the possible correspondences between sounds and phonemes. It is after having accessed and processed these external resources that the neural network of the acoustic model will have been trained so that it can then carry out sound recognition during operational use.

When, in a subsequent situation of use, the acoustic model does not succeed in identifying a word, the device may be configured, in a variant of the invention, to output in its place a code indicating that it is an unknown word.

Secondly, as regards the language model (ML), it is a central element of the system which, in the context of the present invention, cooperates both with the acoustic model, in order to constitute an automatic speech recognition system (hereinafter referred to as "ASR" for "Automatic Speech Recognition"), and with the natural language interpretation engine (NLU), with which it shares in particular a common data set for their training.

It is indeed the language model (ML) that transforms the predictions produced by the acoustic model into sentences, from which the natural language interpretation engine (NLU) will extract intentions and identify fields that can take different values.

Known voice recognition systems use for their language model very large general purpose lexical databases, which often require several terabytes of storage, as well as significant computing resources to perform the decoding. Conversely, the present invention provides that the data set used to train its language model is a restricted data set containing only a specialized vocabulary previously established according to the scope of application of the onboard device that is the subject of the invention.

To generate such a specialized data set, it is possible to use various more or less automated techniques capable of producing, from a number of queries concerning the intended field of use, a diverse set of words and sentences expressing the same intentions in different ways and using the same fields.

The applicant has in particular described in its European patent application No. 17200837.7 a method which makes it possible to produce, semi-automatically and in a short time, specialized data sets of reduced size which can be used in particular for training a natural language interpretation engine.

The present invention can use data sets produced by said method or by any other means, including manually, provided that the data set obtained is sufficiently specific to the intended application domain and requires only a storage space not exceeding the permanent memory capacity of an onboard computer as described above.

According to the invention, it is also possible to program the voice recognition and command device so that it can enrich its reference data set according to the particular uses and needs of its user, whether manually, on the initiative of the user, who can transmit various data by any means (including voice input), or more automatically, always with the agreement of the user, for example by allowing the device to read the data contained in an e-mail address book. This capacity for personalized enrichment of the reference data of the device is also a technical advantage specific to the invention, since online voice recognition and voice control systems rely by nature on generic reference data that must be usable by all their users without distinction. An autonomous embedded system, on the contrary, can be personalized by each of its users.

To promote the efficiency of the language model (ML), which is trained on such a limited data set, the invention uses a combination of two approaches: on the one hand, a statistical approach (of the "n-gram model" type), which evaluates for each word and phrase the probability of what the next word might be, given the context and the examples available in the training data set; on the other hand, an approach based on the identification of entities, which are categories of information that can take different values, for example the entity "city", which can take values corresponding to different cities (Paris, London, New York, ...) that the speech recognition system can integrate into its reference data set as it is used. Some of the most common entities (such as dates, numbers, temperatures, ...) are already predetermined in the data set provided to the language model.

To reduce the weight of the data to be stored and processed, the language model does not have a pre-existing general graph for all the words in the reference data set. Instead, the device is programmed so that, depending on the words identified by the acoustic model during the voice recognition phase, the relevant sub-graphs are loaded into memory and dynamically combined on demand to enable it to analyze the terms of the sentence or phrase decoded by the acoustic model.
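The combination of a statistical approach with entities can be illustrated by a minimal Python sketch. It is not the applicant's implementation: the bigram order, the training sentences, the `<city>` placeholder syntax and all function names are assumptions made for illustration. Entity values are replaced by a placeholder before counting, so the statistical part only has to learn the probability of the entity, not of each value.

```python
from collections import Counter, defaultdict

# Hypothetical entity catalogue; values can be added at run time.
ENTITY_VALUES = {"city": {"paris", "london"}}


def to_classes(words, entity_values):
    """Replace each known entity value by its placeholder, e.g. 'paris' -> '<city>'."""
    lookup = {value: f"<{entity}>"
              for entity, values in entity_values.items()
              for value in values}
    return [lookup.get(word, word) for word in words]


def train_bigrams(sentences, entity_values):
    """Count word pairs after entity values have been abstracted away."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + to_classes(sentence.lower().split(), entity_values) + ["</s>"]
        for previous, following in zip(tokens, tokens[1:]):
            counts[previous][following] += 1
    return counts


def next_word_probability(counts, previous, following):
    """Maximum-likelihood estimate of P(following | previous)."""
    successors = counts.get(previous, Counter())
    total = sum(successors.values())
    return successors[following] / total if total else 0.0


# Hypothetical specialized training set.
TRAINING = ["weather in paris", "weather in london", "weather in the morning"]
counts = train_bigrams(TRAINING, ENTITY_VALUES)
p_city = next_word_probability(counts, "in", "<city>")  # 2 of the 3 continuations of "in"

# A value added by the user needs no retraining of the statistical part.
ENTITY_VALUES["city"].add("berlin")
berlin_tokens = to_classes(["in", "berlin"], ENTITY_VALUES)
```

In this sketch, a city added later by the user is immediately covered by the statistics already learned for the `<city>` placeholder, which is one way to read the personalized enrichment described earlier.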

To then facilitate the interpretation of its results by the NLU, the language model transcribes the entities, values and queries identified from the training data set into a standardized form that the NLU can understand.

By way of example, the request "raise the bedroom lighting to 70% brightness" will be associated with a specialized intention (as a function of the predetermined data set corresponding to the use of the device), standardized for example in the form "SetLightBrightness". This intention will be completed by the values of the different entities involved in the query (i.e. "room" and "brightness") and expressed for example in a standardized way: "set the (room) [kitchen] lights intensity to (brightness) [65] %".
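As an illustration of how such a standardized form could be consumed downstream, the following Python sketch extracts the entities and their values from an annotated string of the "(entity) [value]" kind shown above. The regular expression, the `ParsedQuery` class and the function name are hypothetical; the application does not specify a parsing mechanism.

```python
import re
from dataclasses import dataclass, field

# Matches "(entity) [value]" pairs; the exact syntax is an assumption
# based on the example string in the description.
SLOT_PATTERN = re.compile(r"\((\w+)\)\s*\[([^\]]+)\]")


@dataclass
class ParsedQuery:
    """An intention name plus the values of the entities it involves."""
    intent: str
    slots: dict = field(default_factory=dict)


def parse_standardized(intent, annotated_text):
    """Collect every (entity) [value] pair of the annotated text into a dict."""
    slots = {entity: value for entity, value in SLOT_PATTERN.findall(annotated_text)}
    return ParsedQuery(intent=intent, slots=slots)


query = parse_standardized(
    "SetLightBrightness",
    "set the (room) [kitchen] lights intensity to (brightness) [65] %",
)
print(query.intent, query.slots)
```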

To also reduce the size of the language model, it is organized into different complementary engines (a lexicon, a statistical model for interpreting queries, and a model for processing entities), which are dynamically called and successively implemented during the interpretation process.

Thirdly, as regards the natural language interpretation engine (NLU), the invention firstly provides that it is trained on the same limited and specialized data set as that used for training the language model.

To reinforce its capacity to identify unknown words, the device can be programmed so that, during the training phase of the NLU, a certain proportion of the code corresponding to an unknown word is randomly injected into the reference data set that it uses, this proportion being chosen to be substantially equal to the rate of occurrence of said code in the results obtained by the acoustic model during its own training phase.
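The injection of unknown-word codes can be sketched as follows in Python. The `<unk>` marker, the function names and the sample transcripts are hypothetical; the sketch only illustrates the idea of measuring the rate at which the code appears in the acoustic model's output and reproducing that rate in the NLU training data.

```python
import random

UNKNOWN_CODE = "<unk>"  # hypothetical marker for an unknown word


def observed_unknown_rate(transcripts):
    """Fraction of words that the acoustic model output as unknown."""
    words = [word for transcript in transcripts for word in transcript]
    return words.count(UNKNOWN_CODE) / len(words) if words else 0.0


def inject_unknown_words(sentences, unknown_rate, seed=0):
    """Randomly replace words by the unknown code with probability unknown_rate."""
    rng = random.Random(seed)  # seeded so that the augmentation is reproducible
    return [
        [UNKNOWN_CODE if rng.random() < unknown_rate else word for word in sentence]
        for sentence in sentences
    ]


# Hypothetical acoustic-model outputs collected during its training phase.
asr_outputs = [["turn", UNKNOWN_CODE, "the", "light"],
               ["set", "the", "kitchen", "light"]]
rate = observed_unknown_rate(asr_outputs)

nlu_training = [["turn", "on", "the", "light"], ["set", "the", "bedroom", "light"]]
augmented = inject_unknown_words(nlu_training, rate)
```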

To determine the intentions expressed by the queries that the language model has extracted from the results of the speech recognition phase performed by the acoustic model and has translated in a standardized manner, the natural language interpretation engine (NLU) can also combine two approaches that are implemented successively:

- a first, deterministic processing of the "regular expression" type, by which the NLU retains the result only if it appears strictly consistent with what its reference data set indicates, and

- in the case where the NLU could not retain the result at the end of the first step, a probabilistic processing of the "conditional random field" type, which makes it possible to extrapolate a result from a probability of detection of an intention.

The natural language interpretation engine (NLU) also processes the entities identified by the language model and assigns them the values corresponding to those that the acoustic model has detected from the command request uttered by the user.
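The succession of the two processing steps can be sketched in Python as below. This is a simplified illustration under stated assumptions: the patterns, keyword lists and intent names are hypothetical, and the second stage is a toy keyword-overlap scorer used as a stand-in, not a conditional random field.

```python
import re

# Stage 1: hypothetical exact patterns derived from the reference data set.
PATTERNS = {
    "SetLightBrightness": re.compile(
        r"set the (?P<room>\w+) lights? to (?P<brightness>\d+) ?%"),
}

# Stage 2: hypothetical keyword lists for the stand-in probabilistic scorer.
KEYWORDS = {
    "SetLightBrightness": {"light", "lights", "brightness", "dim"},
    "GetWeather": {"weather", "rain", "forecast"},
}


def parse_deterministic(text):
    """Return (intent, slots) only when a pattern matches the whole query."""
    for intent, pattern in PATTERNS.items():
        match = pattern.fullmatch(text)
        if match:
            return intent, match.groupdict()
    return None


def parse_probabilistic(text):
    """Toy scorer: share of keyword hits per intent (stand-in for a CRF)."""
    words = set(text.split())
    scores = {intent: len(words & keywords) for intent, keywords in KEYWORDS.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def parse(text):
    """Try the deterministic stage first, then fall back to the probabilistic one."""
    text = text.lower().strip()
    exact = parse_deterministic(text)
    if exact is not None:
        intent, slots = exact
        return {"intent": intent, "slots": slots, "probability": 1.0,
                "stage": "deterministic"}
    intent, probability = parse_probabilistic(text)
    return {"intent": intent, "slots": {}, "probability": probability,
            "stage": "probabilistic"}


exact_result = parse("Set the kitchen lights to 65 %")
fallback_result = parse("could you make the lights brighter")
```

The design point illustrated is the ordering: the inexpensive exact match handles queries that follow the reference data closely, and the probabilistic stage is only reached for the remainder.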

To improve the performance of this detection of intentions, it is also possible according to the invention for the natural language interpretation engine (NLU) to use the probability score of each word given by the language model to refuse the detection of an entity or an intention whose confidence index is too low. The device can then be programmed to ask the user a question by any means (including speech synthesis) in order to resolve the doubt about the element considered too uncertain.
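A minimal Python sketch of such a refusal is given below. The threshold value, the use of the lowest word score as the confidence index, the wording of the question and the function name are all assumptions made for illustration; the application does not fix any of them.

```python
def review_slot(entity, words_with_scores, threshold=0.6):
    """Accept or refuse an entity value from per-word probability scores.

    words_with_scores: list of (word, probability) pairs from the language model.
    The confidence index is taken here as the lowest word score (an assumption).
    """
    confidence = min(score for _, score in words_with_scores)
    value = " ".join(word for word, _ in words_with_scores)
    if confidence < threshold:
        # Refuse the detection and prepare a question for the user,
        # which could be spoken through speech synthesis.
        return {"accepted": False,
                "question": f"Did you mean '{value}' for {entity}?"}
    return {"accepted": True, "entity": entity, "value": value}


confident = review_slot("room", [("kitchen", 0.92)])
doubtful = review_slot("city", [("new", 0.81), ("york", 0.35)])
```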

The invention is presented below.

According to a first aspect, the subject of the invention is an autonomous information processing device that provides embedded voice recognition via a dedicated human-machine interface and comprises at least:

an acoustic model for automatic speech recognition,

a language model (ML), and

a natural language processing engine (NLU),

such that all these means, as well as the data sets and databases that they use for recognition and voice control, such as in particular vocabulary lexicons, are entirely stored locally in the autonomous information processing device, and such that the implementation by the end user of any of the operational phases of recognition and voice control through said device neither involves access to external resources nor includes the transmission to an external server of data from this processing.

According to a second aspect of the invention, the acoustic model of this device is trained in an initial phase on a generic corpus of sounds in order to improve its automatic speech recognition function, and implements an embedded artificial neural network configured so as to be compatible with the reduced storage capacity of said device.

FIG. 1 schematically represents the invention. It shows that the sound uttered by the user of the device is first processed by the acoustic model (MA), which uses a neural network (RN), and that the results of the acoustic model are then processed successively by the language model (ML) and by the natural language interpretation engine (NLU), leading for example to the execution of a command consistent with the request formulated by the user, symbolized here by the illumination of a light source.

Another aspect of the invention relates to the training phase of the language model (ML) and of the natural language processing engine (NLU), which are both trained in a prior phase on the same predetermined and specialized data set, chosen according to the type of use of the information processing device concerned.

The invention also covers the implementation of a method of improving voice recognition by an onboard device as described above and schematized in the aforementioned FIG. 1, by which the language model is programmed to allow, during the analysis of the words identified by the acoustic model, an improved detection of the words that the automatic speech recognition system has not recognized, by implementing the following steps:

- identify, among the words pronounced by the user, each of those that has no exact match in the predetermined vocabulary set,

- for these words, refuse to relate them to other words present in the predetermined vocabulary whose pronunciation is similar,

- replace them with a generic value identifying them as unknown words.
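The three steps above can be sketched in Python as follows. The vocabulary, the `<unk>` marker and the function names are hypothetical. For contrast, the sketch also shows the behaviour that the method refuses, namely snapping an out-of-vocabulary word to the closest known word; spelling similarity from the standard `difflib` module is used here only as a stand-in for similarity of pronunciation.

```python
import difflib

UNKNOWN_CODE = "<unk>"  # hypothetical generic value for unknown words
VOCABULARY = {"turn", "on", "off", "the", "lamp", "kitchen"}  # hypothetical


def nearest_match(word, vocabulary):
    """Behaviour the method refuses: map a word to a similar known word.

    Spelling similarity stands in for pronunciation similarity.
    """
    matches = difflib.get_close_matches(word, sorted(vocabulary), n=1, cutoff=0.6)
    return matches[0] if matches else word


def mark_unknown_words(words, vocabulary):
    """Keep exact vocabulary matches; replace every other word by the generic value."""
    return [word if word in vocabulary else UNKNOWN_CODE for word in words]


heard = ["turn", "on", "the", "lamb"]
forced = [nearest_match(word, VOCABULARY) for word in heard]  # "lamb" becomes "lamp"
strict = mark_unknown_words(heard, VOCABULARY)                # "lamb" becomes "<unk>"
```

Marking the word as unknown leaves the decision to the later stages, instead of silently turning a word the user did say into a command the user did not give.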

In a variant of the same method, the language model is also programmed to distinguish, on the one hand, categories of information (previously referred to as "entities"), which may take different values and, on the other hand, for each of these entities, the different values that it can take. This approach based on the relationship between entities and values is combined with a statistical approach of the "n-gram model" type in order to reinforce voice recognition capabilities.

Another variant allows the end user of a voice recognition device implementing this method to enrich its reference data set according to the user's particular uses and needs, whether manually, the user being able to transmit various data to the device by any means (including voice input), or more automatically, by allowing the device to access data resources that the user owns or has access to.

The voice recognition method above can also be improved by the fact that the language model does not have a pre-existing general graph of all the words in the reference data set but dynamically combines, on demand and according to the words identified by the acoustic model, the relevant sub-graphs enabling it to analyze the terms of the sentence or phrase decoded by the acoustic model. This also ensures the compactness of the data necessary for voice recognition or for understanding voice commands, without loss of performance, so that the processing can be carried out entirely within an embedded device.
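The on-demand combination of sub-graphs can be illustrated with the following Python sketch. It is a toy model under stated assumptions: plain adjacency dictionaries stand in for whatever graph representation the decoder actually uses, and the sub-graph names, trigger words and contents are hypothetical.

```python
# Hypothetical sub-graphs, each mapping a word to the words allowed after it.
SUBGRAPHS = {
    "lights": {"turn": {"on", "off"}, "on": {"the"}, "off": {"the"}},
    "room": {"the": {"kitchen", "bedroom"},
             "kitchen": {"lights"}, "bedroom": {"lights"}},
    "weather": {"weather": {"in"}},
    "city": {"in": {"paris", "london"}},
}

# Hypothetical mapping from a recognized word to the sub-graphs it calls for.
TRIGGERS = {"lights": ["lights", "room"], "weather": ["weather", "city"]}


def compose(recognized_words):
    """Merge only the sub-graphs that the recognized words call for."""
    needed = []
    for word in recognized_words:
        for name in TRIGGERS.get(word, ()):
            if name not in needed:
                needed.append(name)
    combined = {}
    for name in needed:
        for node, successors in SUBGRAPHS[name].items():
            combined.setdefault(node, set()).update(successors)
    return needed, combined


def accepts(graph, words):
    """True if every consecutive pair of words is an arc of the graph."""
    return all(b in graph.get(a, set()) for a, b in zip(words, words[1:]))


sentence = ["turn", "on", "the", "kitchen", "lights"]
needed, combined = compose(sentence)
```

In this sketch the weather and city sub-graphs are never loaded for a lighting request, which is the memory saving that the paragraph above attributes to dynamic combination.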

Another variant of the method provides that the natural language interpretation engine (NLU) is programmed to increase the difficulty of its learning in order to improve its performance, by randomly injecting into the reference data set used for its training a certain proportion of unknown words, this proportion being calculated by taking into account the frequency of the words identified as unknown by the acoustic model (MA) during its own learning.

Another method for improving the interpretation capacity of the intentions and values contained in the voice requests addressed to the voice recognition device consists in carrying out during the operational phase of use of said method the following two steps:

- a deterministic processing by which the NLU retains the result only if it appears strictly consistent with what its reference data set indicates, and

- in the case where the NLU was unable to retain the result at the end of the first step, a probabilistic processing that makes it possible to extrapolate a result from a probability of detection of an intention.

Claims

1. A method for improving speech recognition, implemented by an autonomous information processing device that provides embedded voice recognition via a dedicated human-machine interface and locally comprises at least the following three elements:

- an acoustic model for automatic speech recognition (MA),

- a language model (ML), and

- a natural language processing engine (NLU),

characterized in that the acoustic model is trained on a generic corpus of sounds and uses an artificial neural network stored in the permanent memory of said embedded device, and in that the language model is programmed to allow, during the analysis of the words identified by the acoustic model, an improved detection of the words that the automatic speech recognition system has not recognized, by implementing the following steps:

- identifying, among the words pronounced by the user, each of those that has no exact match in the specialized vocabulary set,

- for these words, refusing to relate them to other words in the specialized vocabulary whose pronunciation is similar,

- replacing them with a generic value identifying them as unknown words.

2. Method according to claim 1, characterized in that the language model is programmed so as to distinguish, on the one hand, entities corresponding to categories of information likely to take different values and, on the other hand, for each of these entities, the different values that it can take, and in that this approach based on the relationship between entities and values is combined with a statistical approach of the "n-gram model" type in order to reinforce the voice recognition capabilities.

3. Method according to one or more of claims 1 to 2, characterized in that the device implementing said method is programmed so that its reference data set can be enriched according to the particular uses and needs of its user, from data entered by said user or retrieved by connection to external data resources to which said user has authorized access.

4. Method according to one or more of claims 1 to 3, characterized in that the language model does not have a pre-existing general graph for all the words of the reference data set but dynamically combines, on demand and according to the words identified by the acoustic model, the relevant sub-graphs allowing it to analyze the terms of the sentence or phrase decoded by the acoustic model.

5. Method according to one or more of claims 1 to 4, characterized in that the natural language interpretation engine (NLU) is programmed to increase the difficulty of its learning in order to improve its performance, by randomly injecting into the reference data set used for its training a certain proportion of unknown words, this proportion being calculated by taking into account the frequency of the words identified as unknown by the acoustic model (MA) during its own learning.

6. Method according to one or more of claims 1 to 5, characterized in that:

- the natural language interpretation engine (NLU) performs, on the standardized results from the language model, a deterministic processing by which the NLU retains the result only if it appears strictly consistent with what its reference data set indicates,

- and then, in the case where the NLU could not retain the result at the end of the first step, a probabilistic processing making it possible to extrapolate a result from a probability of detection of an intention.

7. Method according to one or more of claims 1 to 6, characterized in that it implements the following steps:

- the natural language interpretation engine (NLU) uses the probability score of each word given by the language model to refuse the detection of an entity or an intention whose confidence index is too low,

- in the case of such a refusal, the device is programmed to question the user, in particular but not exclusively by means of speech synthesis, so that said user can resolve the doubt as to the interpretation of the element considered uncertain.

8. An autonomous information processing device that provides embedded voice recognition via a dedicated human-machine interface and locally comprises at least the following three elements:

- an acoustic model for automatic speech recognition (MA),

- a language model (ML), and

- a natural language processing engine (NLU),

characterized in that it implements a method for improving speech recognition according to one or more of claims 1 to 7.