WO2022131954A1 - Dialogue control method and system for understanding natural language in a virtual assistant platform

Info

Publication number: WO2022131954A1
Application number: PCT/RU2020/000730
Authority: WO
Inventors: Станислав Игоревич АШМАНОВ; Павел Сергеевич СУХАЧЕВ; Кирилл Федорович ЗОРКИЙ
Original assignee: Общество с ограниченной ответственностью "Виртуальные Ассистенты"
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2022-06-23
Also published as: RU2759090C1

Abstract

The technical solution relates to dialogue control methods and systems for understanding natural language in a virtual assistant platform. The claimed method is implemented by a processor and includes: receiving from a user a dialogue initiation request containing a unique virtual assistant identifier and a dialogue identifier; conducting a database search for an assistant on the basis of the identifier received; conducting a database search for a dialogue on the basis of the dialogue identifier, wherein if a dialogue is found, a database search is conducted for an active session related to said dialogue, and if a session is not found, a new session is created in the database; receiving from the user a request in the dialogue, said request containing an identifier of the dialogue with an assistant, a text message from the user, and the context of the request; identifying the user's intents by means of machine learning algorithms which receive, on the input side, the virtual assistant identifier and the text message from the user request, resulting in the generation of a list of intents and of the degree of certainty that said intents are present in the user's message; searching for the most suitable rules for generating a response based on the user's intents.

Description

DIALOGUE CONTROL METHOD AND NATURAL LANGUAGE UNDERSTANDING SYSTEM IN VIRTUAL ASSISTANT PLATFORM

FIELD OF TECHNOLOGY

[001] This technical solution generally relates to the field of computing, and in particular to methods for managing dialogue and natural language understanding systems in a virtual assistant platform.

BACKGROUND OF THE INVENTION

[002] To date, information dialogue systems have become widespread and are used in various areas of public life, for example, for organizing automatic knowledge testing, automated user support, for diagnosing diseases, and so on. However, the existing information dialogue systems are designed to solve problems of a narrow profile, that is, they are able to support only a dialogue on a given topic. In addition, most of them do not have the ability to form a response in natural language, give an emotional coloring to the generated response, perform any additional actions, including interacting with other information systems and subsystems. The presence of such capabilities would allow not only a two-way exchange of information, instructions and commands between the user and the system, but also conduct a full-fledged dialogue, giving the user the impression of communicating with a live interlocutor, and also solve the tasks set by the user much more efficiently. Therefore, at the moment, it remains relevant to develop such a method of user communication with an information dialogue system that would expand the possibilities of user interaction with an information dialogue system.

[003] The closest analogue of the claimed invention is an adaptive natural language interface and a method for receiving, interpreting and performing user input in natural language, described in patent US7216080B2 "Natural-language voice-activated personal assistant", copyright holder: INDNER ROBERT D JR Nuance Communications Inc Vlingo Corp, publ. 05/08/2007. The method includes user input of a request, receiving and converting the user request into text, processing the text and generating a response in the form of an output command, converting the output command into an executive command, outputting the executive command to an additional system and/or subsystems for execution.

SUMMARY OF THE INVENTION

[004] The technical problem solved by this technical solution is the implementation of a dialogue control method and a natural language understanding system in a virtual assistant platform.

[005] The technical result achieved in solving the above technical problem is to increase the accuracy of generating responses to the user by a virtual assistant through the use of machine learning algorithms.

[006] The specified technical result is achieved by a dialog control method executed by at least one processor, in which: at least one request to initiate a dialog is received from at least one user, the request containing a unique virtual assistant identifier and a dialog identifier; the database is searched for the virtual assistant based on the assistant identifier obtained in the previous step; the database is searched for a previously initiated dialog based on the dialog identifier obtained in the previous step, and if the dialog is found, the database is searched for the active session associated with the found dialog; if the session is not found, a new session is created in the database for the found dialog, defined by a new unique session identifier, the identifier of the created dialog, and an empty session context; a request in the dialog is received from the user, containing the identifier of the dialog with the virtual assistant received when the dialog was initiated, at least one text message from the user, and the request context; user intents are detected by means of machine learning algorithms that receive the virtual assistant identifier and the text message from the user's request as input, producing a list of intents and the degrees of confidence that these intents are present in the user's message; the most appropriate rule for generating a response is searched for based on the user's intents obtained in the previous step.

[007] In some implementations of the technical solution, the virtual assistant contains a unique assistant identifier and a set of named parameters.

[008] In some embodiments of the technical solution, upon receiving a request from a user in a dialog, occurrences of DL dictionaries in the user's message are searched for as follows:

• all occurrences of text elements of DL dictionary strings are determined using a previously prepared prefix tree;

• all dictionary strings are selected that match the following heuristics:

o the user's message contains all the text elements included in the DL dictionary string,

o there is an order of the found occurrences of text elements that corresponds to the DL dictionary string,

o if no text element is located at the very beginning of the user's message, then the DL dictionary string must start with an asterisk or superstar,

o if no text element is located at the very end of the user's message, then the DL dictionary string must end with an asterisk or superstar.

[009] In some implementations of the technical solution, when intents are detected, those whose classification confidence is below the configuration parameter are discarded.

[0010] In some embodiments of the technical solution, for each intent the classification confidence is multiplied by the configuration parameter and rounded to the accuracy determined by the configuration parameter.

[0011] In some embodiments of the technical solution, after the discovery of intents, they are grouped by the name of the role assigned to them.

[0012] In some embodiments of a technical solution, when searching for a suitable rule for generating an answer for each template-question and intent included in the template of the DL language, the weight of its match with the user's message is determined.

[0013] In some implementations of the technical solution, the weight is a 32-bit unsigned integer.

[0014] In some implementations of the technical solution, templates of the same weight are selected by selecting from the found all question templates and intents with the maximum weight, while the selected templates are removed from the search list.

[0015] In some embodiments of the technical solution, when searching for a suitable rule to generate a response:

• a list of template-answers related to the selected template-questions and intents is compiled;

• the possibility of generating a non-empty response is checked, where an empty response is a response that does not contain any text output element;

• randomization is carried out: one of the previously selected template-responses is chosen at random and designated as the response to the user's message;

• a response is formed according to the selected rule, and for the selected response template all functional elements of the DL language are processed in accordance with their description;

• the dialogue log is recorded, and the generated response is written to the database.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The features and advantages of the present technical solution will become apparent from the following detailed description and the accompanying drawings, in which:

[0017] FIG. 1 shows an implementation of a dialog control method in a virtual assistant platform.

[0018] FIG. 2 shows a variant of the implementation of the interaction between the user and the virtual assistant in the form of a block diagram.

[0019] FIG. 3 shows an example of the implementation of the context of the user's dialogue with the virtual assistant.

[0020] FIG. 4 shows an example of the implementation of a natural language understanding system in the virtual assistant platform.

DETAILED DESCRIPTION OF THE INVENTION

[0021] The terms used in the description of the technical solution and their definitions are discussed in detail below.

[0022] In this invention, the system refers to a computer system, a computer (electronic computer), CNC (numerical control), PLC (programmable logic controller), computerized control systems and any other devices capable of performing a given, well-defined sequence of operations (actions, instructions), centralized and distributed databases, smart contracts.

[0023] A command processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs), a smart contract, an Ethereum virtual machine (EVM), or the like. A command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

[0024] A program is a sequence of instructions intended to be executed by a computer control device or a command processing device.

[0025] NLU (Natural Language Understanding) - natural language understanding or natural language interpretation

[0026] The NLU part of this technical solution consists of components for determining the user's intentions (Intent Classification, abbr. IC) and extracting named entities (Named Entity Recognition, abbr. NER) from the text of the user's message.

[0027] The task of classification (categorization) is a task in which there are many objects divided in some way into classes (categories). A finite set of objects is given for which it is known which classes they belong to. This set is called a sample. The class affiliation of the rest of the objects is unknown. It is required to construct an algorithm capable of classifying (categorizing) an arbitrary object from the initial set. To classify an object means to indicate the number (or name) of the class to which the given object belongs.

[0028] Classification of an object is the number or name of the class produced by the classification algorithm as a result of its application to this particular object.

[0029] Precision and recall are metrics used in evaluating most information extraction algorithms.

[0030] The precision of an algorithm within a class is the proportion of objects that actually belong to the class relative to all objects that the algorithm assigned to that class. The recall of the algorithm is the proportion of objects of the class found by the classifier relative to all objects of this class in the test data set.
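
The two per-class metrics defined above, together with their harmonic mean (commonly called the F-measure or F1 score), can be illustrated with a minimal sketch. The function name and the sample labels are hypothetical and are not taken from the described solution.

```python
def precision_recall_f1(true_labels, predicted_labels, target_class):
    """Per-class precision, recall and their harmonic mean (F1)."""
    pairs = list(zip(true_labels, predicted_labels))
    # Objects assigned to the class that really belong to it.
    true_positives = sum(1 for t, p in pairs if t == target_class and p == target_class)
    # All objects the classifier assigned to the class.
    assigned = sum(1 for _, p in pairs if p == target_class)
    # All objects of the class present in the test data set.
    actual = sum(1 for t, _ in pairs if t == target_class)

    precision = true_positives / assigned if assigned else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: two "deposit" and two "card" utterances.
truth = ["deposit", "deposit", "card", "card"]
guess = ["deposit", "card", "deposit", "deposit"]
precision, recall, f1 = precision_recall_f1(truth, guess, "deposit")
```

Here one of the three utterances assigned to "deposit" is correct, and one of the two real "deposit" utterances is found.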

[0031] The F-measure is a common quality metric for categorization tasks and is the harmonic mean of precision and recall.

[0032] Determination of user intent is the categorization of what the user wanted to know when entering the query text in a request to the chatbot. Utterances are categorized according to a predefined list of possible user intentions. It is a classification algorithm at the level of short texts (user utterances). For example, the user utterance "I want to open a deposit in your bank" can be categorized as the intention "opening deposits".

[0033] The search query intent is the user's intent that he puts into the query; the goal that drives a person when he enters a query into the search box.

[0034] Extraction of named entities is the process of finding the boundaries of, and categorizing, words or phrases in the text of user requests to the chatbot that are related in meaning to one of the predefined categories of interest for possible further processing by the chatbot. It is also a classification algorithm, at the level of individual words and phrases in the user's utterances. For example, in the user utterance “I want to transfer 1000 rubles from a savings account to a debit card”, the following entities can be extracted: “1000 rubles” - “amount of money”, “savings account” - “bank product”, “debit card” - “bank cards”.

[0035] A dialogue is an exchange of related messages between two participants in this process, for example, between two users.

[0036] The platform of virtual assistants (hereinafter referred to as the platform) is a software package that implements the functionality of conducting dialogues between users and virtual assistants in an automated mode.

[0037] Dialogues are conducted in text format in natural language between the user of the platform and dialogue robots (chat bots), called within the framework of the platform - virtual assistants (hereinafter - assistants). A virtual assistant is a software agent that can perform tasks for the user based on information entered by the user, data about his location, as well as information obtained from various Internet resources.

[0038] Within the framework of the technical solution, various text communication channels (such as a widget on a website, instant messengers, etc.) connected to the platform through a server API based on the HTTP protocol can serve as channels (transports) for conducting dialogues.

[0039] This technical solution allows for the simultaneous conduct of multiple dialogues between different users and different assistants.

[0040] Below, a method for controlling a dialogue in a virtual assistant platform will be described in detail, shown as a sequence of steps.

[0041] Step 110: At least one dialog initiation request is received from at least one user, containing a unique virtual assistant identifier and a dialog identifier.

[0042] In this technical solution, the user 210 interacts with the virtual assistant 230 through the user communication device 220, as shown in FIG. 2.

[0043] Each assistant 230 is defined by the following data:

• a unique assistant identifier 230,

• a set of named parameters called assistant context 230.

[0044] Each context parameter is a pair of values:

o a unique parameter name;

o the value of the parameter, which may differ between assistants 230 and can be changed.

[0045] The assistant's knowledge base (hereinafter referred to as the knowledge base) is a set of rules and dictionaries, including:

o descriptions of intentions (intents) in the form of a set of examples for training a neural network and rules in the specialized dialogue description language DL,

o descriptions of selected entities in the form of a set of labeled examples for training a neural network.

[0046] Context parameters of assistant 230 can be defined as system parameters. Such parameters are used in the dialogue algorithm and directly affect the choice of priorities when generating responses to requests from user 210.

[0047] This technical solution uses the following system parameters (the use of such parameters is described later in the current document):

• that_anchor.

[0048] All data of the virtual assistants 230 is stored in the solution database.

[0049] The dialogue between the user 210 and the assistant 230 consists of the following steps:

1. Initiation of a dialogue between the user 210 and the assistant 230,

2. The exchange of messages in text form between the participants in the dialogue.

The dialogue is determined by the following data:

• a unique dialog identifier defined at the dialog initiation stage.

• the unique identifier of the assistant 230 with whom the dialogue is being conducted.

• a unique identifier of the user 210 conducting a dialogue with the assistant 230.

• a unique session identifier, where a session is a segment of the dialogue, determined by the fact that during its duration the time between user messages 210 did not exceed a predetermined value, called the maximum allowable idle time (hereinafter referred to as the idle limit).

[0050] Each dialog may have dialog variables, that is, variables used within the dialog. Their values persist for the duration of the dialog but are reset when the dialog ends. The criterion for terminating the dialog is the expiry of the configured timeout for the user's silence.

[0051] Examples of rules are shown below:

[0052] //rule 1 $ my name is [-**-]

# Nice to meet you, [&1]. [%var1="[&1]"]

//rule 2 $*what is*my*name*

#[if(%var1)]{You are [%var1].}[else]{I don't know your name.}

[0053] In rule 1, the response generation instruction (line #) specifies the command to assign the variable var1 the value that was extracted from the user's replica at the stage of its processing by the rule (line $).

[0054] In rule 2, in the response generation instruction, a check is made that the variable var1 is not empty (has a value). If the value of var1 exists at that particular moment in the dialog, then it is used to generate the response.

[0055] An example of a dialogue using these rules:

• User: my name is Vasya

• Bot (rule 1 worked): Nice to meet you, Vasya.

• User: what is my name?

• Bot (rule 2 worked): You are Vasya.

[0056] The dialogue has ended. If after some time the same user 210 starts a new dialog, then the values of the variables will be "reset", and the bot will not "remember" the username:

• User: what is my name?

• Bot (rule 2 worked, part of the answer in else, because the value of the var1 variable was reset at the end of the last dialogue): I don't know your name.

[0057] A conversation session is defined by the following data:

• a unique session identifier.

• a unique identifier for the dialog to which the session belongs.

• a set of named parameters, called the session context.

[0058] Each context parameter is a pair of values:

o a unique parameter name;

o the value of the parameter, which can be changed during the dialog.

[0059] Session context parameters can be defined as dialog-level parameters. Such parameters retain their values between sessions, while other parameters lose them, as shown in FIG. 3.

[0060] A session in which there is no message from user 210, or for which the time elapsed since the last message from user 210 is less than the idle limit, is called an active session.
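
The definition of an active session can be expressed as a small check. This is a sketch only: the `Session` class, its field names, and the use of wall-clock timestamps are assumptions made for illustration and are not described in the source.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Session:
    # Hypothetical record mirroring the data that defines a session.
    session_id: str
    dialog_id: str
    context: dict = field(default_factory=dict)
    last_user_message_at: Optional[datetime] = None  # None: no user message yet


def is_active(session: Session, now: datetime, idle_limit: timedelta) -> bool:
    """A session is active if it has no user message yet, or if the time
    since the last user message is less than the idle limit."""
    if session.last_user_message_at is None:
        return True
    return now - session.last_user_message_at < idle_limit


now = datetime(2020, 12, 18, 12, 0, 0)
session = Session(session_id="s1", dialog_id="d1")
fresh = is_active(session, now, timedelta(minutes=10))
```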

[0061] All dialogue and session data is stored in the solution database.

[0062] Dialog initiation is the first step in a dialog that defines all the data of the dialog and forms it within the technical solution, the result of which is the provision of a dialog identifier necessary for exchanging messages with an assistant.

[0063] To initiate a dialogue, the following parameters are accepted by the technical solution:

• assistant ID (mandatory parameter),

• dialog ID.

[0064] The dialog ID is transmitted if the initiated dialog is a continuation of a dialog that took place previously between the same user 210 and assistant 230.

[0065] Upon receipt of a request to initiate a dialogue, the following steps are performed.

[0066] Step 120: The virtual assistant database is searched based on the assistant ID obtained in the previous step.

[0067] First, the assistant 230 is looked up: the solution database is searched for the assistant 230 with the received identifier. If the assistant 230 is not found in the database, the technical solution returns error information and completes the initiation.

[0068] Step 130: The database is searched for a previously initiated dialog based on the dialog ID obtained in the previous step.

[0069] Then, a dialog search is performed. This step is carried out only if a dialog ID has been received. The solution database is searched for a previously initiated dialog with the corresponding identifier.

[0070] Each user 210 has its own variables. User variables are variables that store values between dialogs with the same user 210. Variable name format: user_X.

[0071] Template examples are shown below.

//template 1

$ my name is [-**-]

# Nice to meet you, [&1]. [%user_name="[&1]"]

//template 2

$*what is*my*name*

#[if(%user_name)]{You are [%user_name].}[else]{I don't know your name.}

[0072] An example dialog is shown below:

[0073] User: my name is Vasya

[0074] Virtual Assistant (rule 1 worked): Nice to meet you, Vasya.

[0075] User: What's my name?

[0076] Virtual assistant (rule 2 worked, part of the answer under the condition on a non-empty user name variable): You are Vasya.

[0077] The dialogue has ended. A day later, user 210 comes from the same computer to talk to virtual assistant 230, who "understands" that this is the same user 210.

[0078] User: What's my name?

[0079] Virtual assistant (rule 2 worked, part of the answer under the condition on the non-empty variable user_name, because the value of this variable is saved between dialogs): You are Vasya.
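
The difference between dialog variables (reset when the dialog ends) and user variables (kept between dialogs) can be sketched as follows. The `VariableStore` class is hypothetical, and routing by a `user_` name prefix is an assumption based on the variable names in the examples; the source does not describe how the platform stores these variables.

```python
class VariableStore:
    """Illustrative storage for the variables of one user."""

    USER_PREFIX = "user_"  # assumption: user variables are recognised by name

    def __init__(self):
        self.user_vars = {}    # values kept between dialogs
        self.dialog_vars = {}  # values reset when the dialog ends

    def _scope(self, name):
        return self.user_vars if name.startswith(self.USER_PREFIX) else self.dialog_vars

    def set(self, name, value):
        self._scope(name)[name] = value

    def get(self, name):
        return self._scope(name).get(name)

    def end_dialog(self):
        # Only dialog variables are cleared; user variables survive.
        self.dialog_vars.clear()


def answer_name_question(store, variable):
    """Mimics the if/else response of the second rule in the examples."""
    name = store.get(variable)
    return f"You are {name}." if name else "I don't know your name."


store = VariableStore()
store.set("var1", "Vasya")       # dialog variable
store.set("user_name", "Vasya")  # user variable
store.end_dialog()
after_dialog_var = answer_name_question(store, "var1")
after_user_var = answer_name_question(store, "user_name")
```

After `end_dialog()`, the question answered from `var1` falls into the else branch, while the one answered from `user_name` still produces the name.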

[0080] If the dialog is found, the database is searched for the active session associated with the found dialog. If the session is not found, a new session (automatically active) is created in the database for the found dialog, defined by the following data:

• new unique session identifier.

• ID of the created dialog.

• an empty session context.

[0081] The technical solution returns the identifier of the found dialog and completes the initiation.

[0082] The next step is the creation of a dialog. This step is carried out if the previous step was skipped or if the target dialog was not found during it. It consists of the following actions.

[0083] A new dialog is created in the solution database, defined by the following data:

• a new unique dialog identifier.

• received assistant ID.

[0084] For the created dialog, a new session is created in the solution database, defined by the following data:

• new unique session identifier.

• ID of the created dialog.

• an empty session context.

[0085] The technical solution returns the identifier of the created dialog and completes the initiation.
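
The initiation procedure described above (assistant lookup, optional dialog lookup, session lookup or creation, dialog creation) can be summarised in a short sketch. The in-memory dictionary standing in for the solution database, the `active` flag on sessions, and all function and exception names are assumptions made for illustration.

```python
import uuid


class AssistantNotFound(Exception):
    """Raised when the assistant identifier is unknown (hypothetical name)."""


def find_active_session(db, dialog_id):
    # The idle-limit check is reduced to a stored boolean flag here.
    for session_id, session in db["sessions"].items():
        if session["dialog_id"] == dialog_id and session["active"]:
            return session_id
    return None


def create_session(db, dialog_id):
    session_id = str(uuid.uuid4())
    db["sessions"][session_id] = {"dialog_id": dialog_id, "context": {}, "active": True}
    return session_id


def initiate_dialog(db, assistant_id, dialog_id=None):
    # Step 120: look up the assistant; stop with an error if it is unknown.
    if assistant_id not in db["assistants"]:
        raise AssistantNotFound(assistant_id)
    # Step 130: look up the dialog only if an identifier was supplied;
    # otherwise, or if it is not found, create a new dialog.
    if dialog_id is None or dialog_id not in db["dialogs"]:
        dialog_id = str(uuid.uuid4())
        db["dialogs"][dialog_id] = {"assistant_id": assistant_id}
    # Ensure the dialog has an active session with an empty context.
    if find_active_session(db, dialog_id) is None:
        create_session(db, dialog_id)
    return dialog_id


db = {"assistants": {"assistant-1": {}}, "dialogs": {}, "sessions": {}}
new_dialog_id = initiate_dialog(db, "assistant-1")
same_dialog_id = initiate_dialog(db, "assistant-1", new_dialog_id)
```

Calling `initiate_dialog` again with the returned identifier reuses the existing dialog and its active session instead of creating new ones.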

[0086] Message exchange is discussed in more detail below.

[0087] Step 140: a request in the dialog is received from the user, containing the identifier of the dialog with the virtual assistant received when the dialog was initiated, at least one text message from the user, and the request context.

[0088] The user 210 of the technical solution sends a request that contains the following data:

• identifier of the dialogue with the assistant 230, received when the dialogue was initiated.

• text message of user 210.

• a set of named parameters, called the context of the request of user 210 (hereinafter the request context).

[0089] Each context parameter is a pair of values: a unique parameter name and a parameter value.

[0090] The task of the technical solution is to process such a request, generate the most suitable response to it and transfer it to the user.

[0091] As part of this task, the following steps are performed.

[0092] At the first stage, a database is searched for a previously initiated dialogue with the corresponding identifier.

[0093] If the dialog was not found, then the solution returns an error and ends processing the user's request 210.

[0094] Next, the session is obtained.

[0095] This step searches the database for the active session associated with the found dialog. If the session is not found, then a new session (automatically active) is created in the database for the found dialog, defined by the following data:

• new unique session identifier.

• ID of the created dialog.

• an empty session context.

[0096] Next, the session context is updated.

[0097] This step is only performed if the request context is not empty. It updates the session context according to the following algorithm:

• all parameters whose names do not exist in the session context but are present in the request context are added to the session context with the corresponding values from the request context.

• all parameters whose names are present both in the session context and in the request context are assigned the corresponding values from the request context in the session context.

[0098] The search context is then generated. The search context is a set of named parameters obtained from the context of the assistant 230 and the context of the session according to the following algorithm.

[0099] The context of the assistant 230 is copied to the search context. All parameters whose names are not in the search context but are present in the session context are added to the search context with the corresponding values from the session context. All parameters whose names are present in both the search context and the session context are assigned, in the search context, the corresponding values from the session context.
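
Both context algorithms above amount to a dictionary merge in which the later source overrides the earlier one. A minimal sketch, assuming contexts are plain dictionaries (the function names and sample parameters are hypothetical):

```python
def update_session_context(session_context, request_context):
    """Add new request parameters and overwrite existing ones in place.
    Performed only if the request context is not empty."""
    if request_context:
        session_context.update(request_context)
    return session_context


def build_search_context(assistant_context, session_context):
    """Copy the assistant context, then let session values take precedence."""
    search_context = dict(assistant_context)
    search_context.update(session_context)
    return search_context


assistant_context = {"language": "en", "tone": "formal"}
session_context = {"tone": "friendly"}
update_session_context(session_context, {"city": "Moscow"})
search_context = build_search_context(assistant_context, session_context)
```

In this example `tone` takes its value from the session context, `language` is inherited from the assistant context, and `city` comes from the request context; the assistant context itself is left unchanged.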

[00100] Next, occurrences of DL dictionaries are searched for.

[00101] This step searches for all occurrences of DL dictionaries in user message 210 using the following algorithm.

• all occurrences of text elements of strings of DL dictionaries (not containing elements of the DL language) are searched using a prefix tree prepared in advance.

• all dictionary strings are selected that correspond to the following heuristics:

o the message of user 210 contains all text elements included in the DL dictionary string,

o there is an order of the found occurrences of text elements that corresponds to the DL dictionary string,

o if no text element is located at the very beginning of the message of user 210, then the DL dictionary string must start with an asterisk or superstar,

o if no text element is located at the very end of the user's message, then the DL dictionary string must end with an asterisk or superstar.

• all selected dictionary strings are checked for exact match to the user's message.
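
The pre-filtering heuristics above can be sketched as a single predicate. Several simplifications are assumed here and are not part of the source: plain substring search replaces the prefix tree, a dictionary string is treated as text elements separated by `*`, the superstar is not distinguished from an asterisk, and the final exact-match check is left out.

```python
def passes_heuristics(dictionary_string, message):
    """Return True if the message may match the dictionary string."""
    line = dictionary_string.strip()
    elements = [part.strip() for part in line.split("*") if part.strip()]
    if not elements:
        return True  # the string consists of wildcards only

    # Heuristics 1 and 2: every text element occurs, in the required order.
    position = 0
    for element in elements[:-1]:
        index = message.find(element, position)
        if index < 0:
            return False
        position = index + len(element)
    last = elements[-1]
    if message.find(last, position) < 0:
        return False

    # Heuristic 3: text before the first element requires a leading wildcard.
    if not message.startswith(elements[0]) and not line.startswith("*"):
        return False

    # Heuristic 4: text after the last element requires a trailing wildcard.
    ends_with_last = message.endswith(last) and len(message) - len(last) >= position
    if not ends_with_last and not line.endswith("*"):
        return False
    return True


candidates = ["*what is*my*name*", "what is*my*name", "*name*my*"]
message = "tell me what is my name"
selected = [c for c in candidates if passes_heuristics(c, message)]
```

For this message only the first candidate survives: the second lacks a leading wildcard, and the third requires the elements in the wrong order.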

[00102] Next, the set of selected named entities is formed.

[00103] For each DL dictionary whose entries were found in the previous step, the entries are selected for which the number of words in the DL template, after all inline dictionaries have been replaced with their specific values, is the maximum for this dictionary. Such entries are called selected named dictionary entities; their name is the same as the name of the DL dictionary in which they appear.

[00104] The selected named dictionary entities together form the set of selected entities for user message 210.

[00105] Step 150: User intents are detected using machine learning algorithms that receive the virtual assistant ID and the text message from the user's request as input, resulting in a list of intents and the degrees of confidence that these intents are present in the user's message.

[00106] Next, the possible intentions of the user 210 (intents) are detected.

[00107] At this step, the neural network algorithms described below are used to detect possible intentions. They receive the assistant ID 230 and the text message from the request of user 210 as input, and the result of their execution is a list of possible intentions and the degrees of confidence that each intention is actually present in user message 210.

[00108] All intents whose classification confidence is below a configuration parameter (e.g., Threshold) are discarded.

[00109] For each intent, the confidence is multiplied by a special configuration parameter (Multiplier) and rounded with the accuracy determined by the configuration parameter (Accuracy). If confidence exceeds the value of the configuration parameter (Limit), then the confidence value is set to Limit. The resulting value is considered further as the certainty of this intent.
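
The post-processing of intent confidences (discard below Threshold, multiply by Multiplier, round to Accuracy, cap at Limit) can be sketched as below. The function name, the dictionary representation of intents, and the sample parameter values are assumptions; Accuracy is interpreted here as a number of decimal places.

```python
def postprocess_intents(intents, threshold, multiplier, accuracy, limit):
    """Map raw classifier confidences to the certainty used for ranking."""
    result = {}
    for name, confidence in intents.items():
        if confidence < threshold:
            continue  # intent is discarded
        value = round(confidence * multiplier, accuracy)
        result[name] = min(value, limit)  # values above Limit are set to Limit
    return result


raw = {"opening deposits": 0.87, "greeting": 0.2, "money transfer": 0.99}
certainty = postprocess_intents(raw, threshold=0.5, multiplier=100, accuracy=0, limit=95)
```

With these illustrative settings the low-confidence "greeting" intent is dropped and the value for "money transfer" is capped at the limit.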

[00110] Then, named entities are extracted based on the neural network. At this step, the neural network algorithms described below are used to extract named entities, which receive the assistant identifier 230 and a text message from the user request 210 as input, and the result of their execution is a list of selected entities indicating their roles and the degree of confidence that they were highlighted correctly.

[00111] All selected entities are grouped by the name of the role assigned to them. For each name, the entity with the highest confidence score is selected. For each such entity, it is checked whether there is an entity with the same name in the set of selected entities, previously formed. If there is no such entity, then it is added to the set of selected entities.
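
The merging of neural-network entities into the previously formed set of selected entities can be sketched as follows. The dictionary-based data shapes and field names (`role`, `text`, `confidence`) are hypothetical; the source specifies only the grouping and precedence rules.

```python
def merge_entities(selected_entities, neural_entities):
    """Keep the most confident neural entity per role and add it only if
    no entity with the same name was already selected from dictionaries."""
    best_by_role = {}
    for entity in neural_entities:
        role = entity["role"]
        current = best_by_role.get(role)
        if current is None or entity["confidence"] > current["confidence"]:
            best_by_role[role] = entity

    merged = dict(selected_entities)
    for role, entity in best_by_role.items():
        if role not in merged:  # dictionary-based entities take precedence
            merged[role] = entity["text"]
    return merged


from_dictionaries = {"bank product": "savings account"}
from_network = [
    {"role": "amount of money", "text": "1000 rubles", "confidence": 0.95},
    {"role": "amount of money", "text": "1000", "confidence": 0.60},
    {"role": "bank product", "text": "debit card", "confidence": 0.90},
]
entities = merge_entities(from_dictionaries, from_network)
```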

[00112] Step 160: The most appropriate rule is searched for generating a response based on the user's intents obtained in the previous step.

[00113] The next step is to search for the most appropriate rule to generate a response. For each question template and intent included in the DL language template, the weight of its match with the user's message 210 is determined. The weight is a 32-bit unsigned integer determined according to the following principles:

• if the DL template has conditions on the special variable that_anchor, then the 31st bit takes the value 1, otherwise 0.

• for intents, and for template-questions that contain asterisks or superstars, the 30th bit is always 0; for template-questions that have no asterisks or superstars, the 30th bit is 1.

• the 16-bit number located in bits 14 to 29 is defined as follows:

o for intents, it is equal to the number of words in the user's message multiplied by the confidence of the intent. If the value exceeds 2^16 - 1, it is set to 2^16 - 1.

o for template-questions, it is equal to the number of words in the template-question in expanded form (after substituting specific values for DL language elements such as dictionaries or inline dictionaries), including asterisks and superstars. If the value exceeds 2^16 - 1, it is set to 2^16 - 1.

• an 8-bit number, located in bits 0 to 13, is defined as the number of template-conditions in the template. If the number of template-conditions exceeds 2^8 - 1, the value is set to 2^8 - 1.
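
The bit layout of the weight can be illustrated by packing the four components into one integer. The function and parameter names are hypothetical, and truncating the size to an integer (relevant for intents, where it is a product with a confidence) is an assumption.

```python
MAX_16_BIT = 2**16 - 1
MAX_8_BIT = 2**8 - 1


def rule_weight(has_that_anchor, is_exact_template, size, condition_count):
    """Pack the match weight into a 32-bit unsigned integer.

    bit 31     : the template has conditions on that_anchor
    bit 30     : template-question without asterisks or superstars
    bits 14-29 : word-count based size, capped at 2^16 - 1
    bits 0-13  : number of template-conditions, capped at 2^8 - 1
    """
    size_field = min(int(size), MAX_16_BIT)
    condition_field = min(condition_count, MAX_8_BIT)
    return (
        (int(has_that_anchor) << 31)
        | (int(is_exact_template) << 30)
        | (size_field << 14)
        | condition_field
    )


exact_short = rule_weight(False, True, size=1, condition_count=0)
wildcard_long = rule_weight(False, False, size=40, condition_count=3)
anchored = rule_weight(True, False, size=2, condition_count=0)
```

Because the flags occupy the highest bits, comparing weights as plain integers ranks a template with a that_anchor condition above any template without one, and an exact template-question above any wildcard template or intent, regardless of the lower fields.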

[00114] Next, the following selection algorithm is executed.

[00115] In the first step, templates of the same weight are sampled by selecting from all found question templates and intents with the maximum weight, while the selected templates are removed from the search list. If the list is empty, then consider that an empty response is generated and skip the next two steps.

[00116] Next, a sample of all realistically possible template responses is formed.

[00117] A list of template-answers is compiled that relates to the template-questions and intents selected in the previous step. The possibility of generating a non-empty response is checked, where an empty response is a response that does not contain any text output element (text, inline dictionary, dictionary, etc.). If the list is empty, the algorithm returns to the previous step.

[00118] Next is randomization. Randomly choose one of the previously selected template-responses and determine it as a response to the user's message.
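The selection steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: the names `select_response`, `matches` and `answers_for` are hypothetical, and template-answers are modelled as plain strings, so the "non-empty response" check is reduced to testing for non-blank text.

```python
import random


def select_response(matches, answers_for, rng=random):
    """Pick a response following the max-weight / fallback / random scheme.

    matches: list of (weight, template_id) pairs found for the user's message.
    answers_for: callable mapping a template_id to its template-answers.
    """
    remaining = list(matches)
    while remaining:
        # Sample of templates sharing the maximum weight.
        top = max(weight for weight, _ in remaining)
        selected = [tid for weight, tid in remaining if weight == top]
        # The selected templates are removed from the search list.
        remaining = [(w, tid) for w, tid in remaining if w != top]
        # Keep only template-answers able to produce a non-empty response.
        candidates = [a for tid in selected for a in answers_for(tid) if a.strip()]
        if candidates:
            # Randomization among the remaining template-answers.
            return rng.choice(candidates)
        # Otherwise return to the previous step with the next-best weight.
    return ""  # search list exhausted: an empty response is generated


answers = {"a": [""], "b": [], "c": ["hello"]}
reply = select_response([(10, "a"), (10, "b"), (5, "c")], answers.get)
```

In this usage example the two heaviest templates yield no usable answer, so the loop falls back to the lighter template `"c"`.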

[00119] Then a response is generated according to the selected rule. For the selected response template, all functional elements of the DL language are processed in accordance with their description.

[00120] Next, a dialogue log is recorded. The generated response is recorded in the technical solution database.

[00121] The session context is then saved. For each instruction to change the session, change the context of the session in accordance with this instruction. The updated session context is stored in the solution database.

[00122] The last step is the transmission of the response to the user 210. The generated response is transmitted to the user 210.

[00123] The machine learning algorithms used in this technical solution will be described below.

[00124] Modern algorithms for classifying intents and extracting named entities traditionally consist of two parts: extracting features from text in some numeric vector form, and applying a special decision function to these numeric vectors. For the task of classifying intents, the result is a class number from the list of intent classes; for the task of extracting named entities, the result is a set of labels, one for each word of the user's original replica, carrying the numbers of entity names from the list of entities.

[00125] Feature extraction is a type of abstraction, a dimensionality reduction process in which the original set of variables is reduced to more manageable groups (features) for further processing, while remaining sufficient to describe the original data set accurately and completely.

[00126] Text data vectorization is a way of representing selected features through transformation into numerical vectors of some multidimensional space.

[00127] A function that vectorizes text data is called a vectorizer (embedder), and the resulting features corresponding to the original phrase are called a vector (sentence embedding).

[00128] Some of the best vectorizers in terms of quality (measured by the final F-measure of a complete algorithm for classifying intents or extracting named entities on arbitrary data sets) at the moment are ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers).

[00129] ELMo is the faster option (it can process more text replicas per second), built on Bi-LSTM networks (Bidirectional Long Short-Term Memory: bidirectional recurrent artificial neural networks consisting of LSTM layers, that is, layers of long short-term memory). ELMo looks at the whole sentence before assigning each word its embedding. The neural network of such a vectorizer is trained (that is, the coefficients of a complex numerical function are selected and optimized to better approximate the expected behavior of the algorithm) specifically for the task of creating such embeddings. While training on a large set of texts, the model tries to predict the next word in a sentence, as well as the previous word when it passes backwards through the sentence. Pre-training takes place on texts without markup, automatically generating pairs of “part of a sentence” and “next or previous word”. Thus, the model remembers co-occurring words and stable language constructs.

[00130] BERT is a heavier option (it has lower throughput), whose architecture consists of sequence-to-sequence neural network components (models that take sequences of elements such as words, letters, or other symbols as input and return a different sequence of elements) based on the attention mechanism. It is a transformer neural network consisting of a stack of sequence-to-sequence network encoders. Such a model can be trained without any markup by "masking" individual words in sentences. The model has broken several records on a number of NLP tasks. A large number of pre-trained models for different languages, including multilingual ones, are also publicly available for solving various problems.

[00131] In the NLU part of this technical solution, based on quality and performance indicators, a lightweight implementation of BERT (DistilBERT) can be used as the vectorizer for classifying intents, and ELMo as the vectorizer for extracting named entities. No special text preprocessing is applied, since both models work better with source texts: they can take into account parts of speech, word forms, punctuation marks and other additional information that may be lost during preprocessing, and they make their predictions based on knowledge of word combinations and language constructs.

[00132] After the embedding for the entire user replica is obtained, it remains to classify this embedding in vector space. In this technical solution the problem can be solved both by classical machine learning and by shallow neural networks consisting of 2-3 fully connected layers. Algorithms (k-NN, Random Forest, SVM, XGBoost, their compositions, and shallow neural networks) were compared empirically in terms of quality and performance on representative data sets (about 500 intents, at least 50 examples for each intent). As a result, an approach was selected that uses a Feed-Forward Neural Network (FFNN) of three fully connected layers, with an output layer whose size equals the number of required classes and with architecture parameters obtained using Bayesian optimization.

[00133] Bayesian optimization is a global optimization technique for an unknown function. The Neural Network Architecture Search (NNAS) task involves automatic selection of the neural network architecture. It is a task of network morphism, in which the functionality of the neural network is preserved but its architecture is selected differently. The key idea of the proposed method is to explore the architecture search space using the Bayesian Optimization (BO) algorithm. Traditionally, Bayesian optimization consists of a three-step loop: update, generate, and observe. In the context of searching for a neural network architecture, the loop consists of the following steps:

• Update: Gaussian process is trained on existing architectures and their results;

• Generation: a new architecture is generated through the optimization of the acquisition function;

• Observation: record the results of the new architecture.

[00134] This procedure selects the number of layers, the number of neurons in each layer, the type of normalization and activation used, and the probability of turning off neurons in the Dropout layers.

[00135] The final optimal neural network for classifying intents is selected automatically for each data set. In the example with the representative data set considered above, the resulting architecture consists of the following sequence of layers: first fully connected layer, batch normalization, Dropout layer, ELU activation layer, second fully connected layer, batch normalization, Dropout layer, ELU activation layer, third fully connected layer, with a total of 233 neurons in the fully connected layers. In essence, such a neural network is a hybrid of the FFNN network. The output of this neural network is a probability distribution vector over the classes (the number of vector elements is equal to the number of classes).

[00136] As a result, the following sequence of processing user replicas is obtained: replica -> DistilBERT -> embedding -> FFNN -> probability distribution of replica belonging to intents.

[00137] After receiving the embeddings of the words of the user replica, they are further classified according to a given set of labels (tags) of entities. When marking text for training, each word is marked with a special tag - the name of the class to which the word belongs, and the position of the word in relation to the entire marked entity in the form of a prefix:

[00138] B - from the word beginning - the first word in an entity consisting of more than 1 word.

[00139] I - from the word inside - words that are in the middle of an entity consisting of more than 1 word.

[00140] O - from the word out - if the word does not refer to any entity, it is marked with this label.
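As an illustration of the prefix scheme, the following minimal Python sketch groups per-word tags back into entities. The tag spelling with a hyphen (for example `B-city`), the class names and the function name `group_entities` are assumptions made for the example. The sketch also assumes the common convention that a single-word entity is tagged with the B prefix, which the description above does not specify.

```python
def group_entities(words, tags):
    """Collect (entity_class, text) pairs from tags like 'B-city', 'I-city', 'O'."""
    entities = []
    current_class, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, cls = tag.partition("-")
        if prefix == "I" and cls == current_class:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        if current_class is not None:
            # Any other tag closes the open entity.
            entities.append((current_class, " ".join(current_words)))
        if prefix in ("B", "I"):
            # Assumption: a stray "I" without a preceding "B" starts an entity.
            current_class, current_words = cls, [word]
        else:
            current_class, current_words = None, []
    if current_class is not None:
        entities.append((current_class, " ".join(current_words)))
    return entities


words = ["fly", "to", "New", "York", "tomorrow"]
tags = ["O", "O", "B-city", "I-city", "B-date"]
found = group_entities(words, tags)
```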

[00141] For each word, you need to get a tag. This can be done in many ways. A simpler and more obvious way is to use some neural network, the last layer of which will be a fully connected layer of dimension d, where d is the number of possible word labels. Thus, we get the probability that the word has each of the possible labels (and we can choose the most probable of them). However, this is not enough to take into account the mutual dependence of the final tags.

[00142] After the quality metrics of various classical machine learning algorithms (k-NN, Random Forest, SVM, XGBoost, their compositions) and neural network architectures (Bi-LSTM Encoder + CRF, sequence-to-sequence models) were compared, a Bi-LSTM Encoder with an additional CRF (conditional random field) layer was chosen as the classifier. The CRF layer takes into account the relationship of tags with each other so that prefixes are placed correctly. The output of this neural network is a probability distribution vector over the tags for each word (the number of vector elements is equal to the number of tags).
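To show why tag dependencies matter, here is a minimal sketch of Viterbi decoding, the standard way to find the best tag sequence in a linear-chain CRF. It is not the platform's implementation: the per-word scores and the transition scores are hand-set, hypothetical values, whereas a trained CRF layer would learn them.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list (one item per word) of dicts mapping tag -> score.
    transitions: dict mapping (previous_tag, tag) -> score; missing pairs count as 0.
    """
    if not emissions:
        return []
    best = {t: emissions[0][t] for t in tags}
    back = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, t), 0.0))
            step[t] = best[prev] + transitions.get((prev, t), 0.0) + scores[t]
            pointers[t] = prev
        best = step
        back.append(pointers)
    # Walk the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-city", "I-city"]
emissions = [
    {"O": 1.0, "B-city": 0.9, "I-city": 0.0},
    {"O": 0.0, "B-city": 0.2, "I-city": 1.0},
]
# Hypothetical penalty: an "I" tag should not directly follow "O".
transitions = {("O", "I-city"): -10.0}

independent = [max(tags, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions, tags)
```

Choosing the most probable tag for each word independently gives the inconsistent sequence `O, I-city`, while decoding with transition scores gives `B-city, I-city`.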

[00143] As a result, the following sequence of processing user replicas is obtained: replica -> ELMo -> word embeddings -> Bi-LSTM Encoder + CRF -> distribution of probabilities that words of a replica belong to entity tags.

[00144] Vectorization models are not further trained within the platform, but are prepared and delivered by developers for each specific application. Such models are trained in advance, without markup, on large text arrays of target content. For example, to train vectorizers for banking topics, large corpora of the Russian language, such as Wikipedia and news sites, are expanded with logs of chats between bank customers and operators. Then, in the case of the BERT model for use in classification, a one-time additional training can be performed together with the classifier on the target training data (replica-intent pairs), which significantly increases the final quality metrics.

[00145] Classification models for intents and entity tags are trained as part of the operator's interaction with the platform, on operator-tagged user data entered through the platform IDE interface in the sections with “examples”. The platform has the following adjustable training hyperparameters, which can be changed in a text configuration file: learning rate, choice of loss function, optimization metrics, number of training epochs, and batch size. When training is started from the IDE, all examples are collected together with their intent labels and entity tags, and the data is passed through the vectorizers to obtain phrase embeddings for subsequent intent classification and word embeddings for named entity extraction. The data is split into training and validation samples with stratification (the split proportion is also set in the configuration file with the training hyperparameters, and this proportion is observed for each intent and entity tag). Then, for the intent classification problem, the classes are automatically balanced by the number of examples. For this purpose, the training part of the sample is oversampled (the number of examples in classes with few examples is artificially expanded) using the SMOTE (Synthetic Minority Oversampling TEchnique) technique, which grows small classes to the size of the largest one. SMOTE generates synthetic (artificial) data in the embedding space produced by the vectorizer (and not at the level of the text fed into the vectorizer). This increases the sensitivity of the classifier to small classes.
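The idea of oversampling in the embedding space can be sketched in a few lines of plain Python. This is a simplified illustration of the SMOTE principle (interpolating between a sample and one of its nearest neighbours of the same class), not the platform's code. The function name `smote_oversample`, the neighbour count `k` and the fixed random seed are assumptions of the example.

```python
import random


def smote_oversample(embeddings, target_size, k=3, rng=None):
    """Grow one small class to target_size with synthetic embeddings."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    result = [list(v) for v in embeddings]
    if len(embeddings) < 2:
        return result  # nothing to interpolate with

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(result) < target_size:
        base = rng.choice(embeddings)
        # k nearest neighbours of the base point within the same class.
        neighbours = sorted(
            (v for v in embeddings if v is not base),
            key=lambda v: sq_dist(base, v),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between the two points
        result.append([x + gap * (y - x) for x, y in zip(base, other)])
    return result


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = smote_oversample(minority, target_size=10)
```

Here `target_size` plays the role of the size of the largest class. Every synthetic vector lies on a segment between two real embeddings of the same intent.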

[00146] Next is the direct training of the classifier (intents and entities). During training, the learning rate is automatically adjusted, and an early stop can be made in case of fast convergence (so as not to go through all the given epochs in vain), or if the algorithm cannot significantly optimize on the current data set.
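The description does not state the exact rules for adjusting the learning rate or for stopping early. One common scheme, shown here purely as an assumption, halves the learning rate whenever the validation loss stops improving and stops after several epochs without significant progress. All names and default values (`patience`, `min_delta`, `lr_factor`) are hypothetical.

```python
def train_with_early_stopping(run_epoch, max_epochs, lr=1e-3,
                              patience=3, min_delta=1e-4, lr_factor=0.5):
    """run_epoch(lr) trains for one epoch and returns the validation loss."""
    best_loss = float("inf")
    stale = 0      # consecutive epochs without significant improvement
    history = []
    for epoch in range(max_epochs):
        loss = run_epoch(lr)
        history.append((epoch, lr, loss))
        if best_loss - loss > min_delta:
            best_loss, stale = loss, 0
        else:
            stale += 1
            lr *= lr_factor        # reduce the learning rate on a plateau
            if stale >= patience:  # no significant progress: stop early
                break
    return best_loss, history


# Simulated validation losses standing in for a real training epoch.
losses = iter([1.0, 0.5, 0.5, 0.5, 0.5, 0.1])
best, history = train_with_early_stopping(lambda lr: next(losses), max_epochs=6)
```

In the simulated run the loss plateaus at 0.5, so training stops after the fifth epoch instead of using all six.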

[00147] At the end of training, all received model parameters are serialized into model files that can be used to create new instances of intent classifiers and named entity extraction models.

[00148] All components of the NLU part of the technical solution work inside separate services, which are wrappers around neural network modules. They include a separate service with vectorizers (for optimization, the vectorizers, being the heavier neural networks, were moved to a separate service), services for classifying intents, training intent classifiers, extracting named entities, and training models for extracting named entities.

[00149] To initialize runtime services, the path to the files of the corresponding models from which the services are started must be specified. Requests to services must contain user replicas, and service responses are lists of the most likely intents with their probability levels from 0.0 to 1.0, and sets of words that may represent the named entities being searched for. For each such highlighted word, the response of the named entity extraction service contains its position in the text (offset from the beginning of the replica in characters) and a list of entities with their probability levels. The most likely intents and entities can be used in further processing of the user's response by the platform.

[00150] Training services accept, in the body of a request to start training, paths to text files with sets of example replicas marked up by operators in the IDE, and produce model files. If necessary, training can be stopped or its status requested (to find out what stage the process is at); upon completion, the generated model files can be used to initialize new runtime services.

[00151] Referring to FIG. 4, the present technical solution can be implemented as a computer system 400 for implementing a dialog control method and a natural language understanding system in a virtual assistant platform, which includes one or more of the following components:

• a processing component 401 comprising at least one processor 402,

• memory 403,

• multimedia component 405,

• audio component 406,

• input/output (I/O) interface 407,

• sensor component 408,

• communication component 409.

[00152] The processing component 401 mainly manages all operations of the system 400, such as processing user data or a chat request, as well as managing the display, phone calls, data transmission, camera operation, and recording operations of the mobile communication device. The processing component 401 may include one or more processors 402 that execute instructions to complete all or part of the steps of the above methods. In addition, the processing component 401 may include one or more modules that facilitate interaction between the processing component 401 and other components. For example, the processing component 401 may include a multimedia module that facilitates interaction between the multimedia component 405 and the processing component 401.

[00153] The memory 403 is configured to store various types of data to support the operation of the system 400, such as a database with user profiles. Examples of such data include instructions from any application or method, contact data, address book data, messages, images, videos, etc., all of which run on system 400. Memory 403 may be implemented as any type of volatile memory, non-volatile memory, or a combination thereof, e.g., static random access memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory device (ROM), magnetic memory, flash memory, magnetic disk or optical disk, and others, without being limited.

[00154] The multimedia component 405 includes a screen providing an output interface between the system 400, which may be installed on a user's mobile communications device, and the user. In some implementations, the screen may be a liquid crystal display (LCD) or a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input from a user. The touch panel includes one or more touch sensors that sense gestures, touches and swipes on the touch panel. A touch sensor can not only sense the boundary of a touch or swipe gesture, but also determine the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 405 includes one front camera and/or one rear camera. When the system 400 is in an operating mode, such as shooting mode or video mode, the front camera and/or rear camera can receive media data from outside. Each front camera and rear camera can be a fixed-lens optical system or can have a variable focal length or optical zoom.

[00155] The audio component 406 is configured to output and/or input an audio signal. For example, the audio component 406 includes one microphone (MIC) that is configured to receive an external audio signal when the system 400 is in an operating mode, such as call mode, recording mode, and speech recognition mode. The received audio signal may be further stored in the memory 403 or routed through the communication component 409. In some embodiments, the audio component 406 also includes a single speaker configured to output an audio signal.

[00156] An input/output (I/O) interface 407 provides an interface between the processing component 401 and any peripheral interface module. The above peripheral interface module may be a keyboard, steering wheel, button, etc. These buttons may include, but are not limited to, a start button, a volume button, a home button, and a lock button.

[00157] The sensor component 408 includes one or more sensors and is configured to provide various aspects of assessing the state of the system 400. For example, the sensor component 408 can detect the on/off states of the system 400, the relative position of components of the system 400, such as a display and a keypad, the presence or absence of contact between a subject and the system 400, as well as the orientation or acceleration/deceleration and temperature change of the system 400. The sensor component 408 includes a proximity sensor configured to detect the presence of a nearby object when there is no physical contact. The sensor component 408 includes an optical sensor (e.g., a CMOS or CCD image sensor) configured for use in rendering an application. In some embodiments, the sensor component 408 includes an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

[00158] The communication component 409 is configured to facilitate wired or wireless communication between the system 400 and other devices. System 400 may access a wireless network based on a communication standard such as WiFi, 2G, 3G, 5G, or combinations thereof. In one exemplary embodiment, the communication component 409 receives a broadcast signal or broadcast-related information from an external broadcast control system via a broadcast channel. In one embodiment, communication component 409 includes a Near Field Communication (NFC) module to facilitate near field communications. For example, the NFC module may be based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[00159] In an exemplary embodiment, system 400 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), a Digital Signal Processor (DSP), a Programmable Logic Unit (PLU), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and may be configured to implement the dialog control method 100 and natural language understanding system in the virtual assistant platform.

[00160] In an exemplary embodiment, the non-volatile computer-readable medium includes a memory 403 that includes instructions, where the instructions are executed by the processor 402 of the system 400 to implement the dialog control methods described above.

[00161] For example, a non-volatile computer-readable medium can be ROM, random access memory (RAM), compact disc, magnetic tape, floppy disks, optical storage devices, and the like.

[00162] Computing system 400 may include a display interface that transmits graphics, text, and other data from a communications infrastructure (or framebuffer, not shown) for display on the multimedia component 405. Computing system 400 further includes input devices or peripherals. Peripheral devices may include one or more devices for interacting with the user's mobile communications device, such as a keyboard, microphone, wearable device, camera, one or more audio speakers, and other sensors. Peripherals may be external or internal to the user's mobile communications device. The touch screen typically displays graphics and text, and also provides a user interface (such as, but not limited to, a graphical user interface (GUI)) through which a subject may interact with the user's mobile communications device, such as accessing and interacting with applications running on the device.

[00163] The elements of the proposed technical solution are in a functional relationship, and their joint use leads to the creation of a new and unique technical solution. Thus, all blocks are functionally connected.

[00164] All blocks used in the system can be implemented using electronic components used to create digital integrated circuits, which is obvious to a person skilled in the art. Without limitation, microcircuits whose logic is determined during manufacture can be used, as can programmable logic integrated circuits (FPGAs), whose logic is set by programming. Programmers and debugging environments are used for programming, allowing you to set the desired structure of a digital device in the form of a circuit diagram or a program in special hardware description languages: Verilog, VHDL, AHDL, etc. Alternatives to FPGAs include programmable logic controllers (PLCs); basic matrix crystals (BMK), which require a factory production process for programming; and ASICs, specialized custom-made large integrated circuits (LSI), which are significantly more expensive in small-scale and single-piece production.

[00165] Usually, the FPGA chip itself consists of the following components:

• configurable logic blocks that implement the required logic function;

• programmable electronic links between configurable logic blocks;

• programmable input/output blocks that provide communication between the external output of the microcircuit and the internal logic.

[00166] Blocks can also be implemented using read-only memories.

[00167] Thus, the implementation of all used blocks is achieved by standard means based on the classical principles of implementing the fundamentals of computer technology.

[00168] As one of skill in the art will appreciate, aspects of the present technical solution may be implemented as a system, method, or computer program product. Accordingly, various aspects of the present technical solution may be implemented solely as hardware, as software (including application software, etc.), or as an embodiment combining software and hardware aspects, which may be generally referred to as a "module", "system" or "architecture". In addition, aspects of the present technical solution may take the form of a computer program product implemented on one or more computer-readable media having computer-readable program code embodied thereon.

[00169] Any combination of one or more computer-readable media can also be used. The computer-readable storage medium can be, without limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination thereof. More specifically, examples (non-exhaustive list) of a computer-readable storage medium include: an electrical connection using one or more wires, a portable computer diskette; hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or Flash memory), fiber optic connection, compact disc read only memory (CD-ROM), optical storage device, magnetic storage device, or any combination of the above. As used herein, a computer-readable storage medium can be any flexible storage medium that can contain or store a program for use by or in connection with a system, device, apparatus, or in connection with them.

[00170] Program code embedded in a computer-readable medium may be transmitted using any medium, including, without limitation, wireless, wired, fiber optic, infrared, and any other suitable network, or a combination of the foregoing.

[00171] The computer program code for performing the operations for the steps of the present technical solution may be written in any programming language or combination of programming languages, including an object-oriented programming language such as Python, R, Java, Smalltalk, C++, and so on, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer as a separate software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer. In the latter case, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN), a wide area network (WAN), or a connection to an external computer (e.g., via the Internet via ISPs).

[00172] Aspects of the present technical solution have been described in detail with reference to block diagrams, circuit diagrams and/or diagrams of methods, devices (systems), and computer program products in accordance with embodiments of the present technical solution. It should be appreciated that each block from the block diagram and/or diagrams, as well as combinations of blocks from the block diagram and/or diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general purpose computer, special purpose computer, or other data processing device to create a procedure, so that instructions executed by the computer processor or other programmable data processing device create the means to implement the functions/actions indicated in the block or blocks of the block diagram and/or charts.

[00173] These computer program instructions may also be stored on a computer-readable medium that can direct a computer, another programmable data processing device, or other devices to operate in a particular manner, such that the instructions stored on the computer-readable medium create a device including instructions that perform the functions/actions indicated in the block diagram and/or diagram.

Claims

1. A dialog control method executed by at least one processor and including the following steps:

• receive from at least one user at least one request to initiate a dialogue containing a unique virtual assistant identifier and a dialogue identifier;

• perform a database search for a virtual assistant based on the assistant ID obtained in the previous step;

• search the database for a previously initiated dialog based on the dialog ID obtained in the previous step, where

a. if the dialog was found, then search the database for an active session associated with the found dialog;

b. if the session is not found, then a new session is created in the database for the found dialog, which is determined by a new unique session identifier, the identifier of the created dialog, and an empty session context;

• receive from the user a request in the dialog, containing the ID of the dialog with the virtual assistant received when the dialog was initiated, at least one text message of the user, the context of the request;

• detect user intents by means of machine learning algorithms that receive a virtual assistant identifier and a text message from the user's request as input, as a result of which a list of intents and degrees of confidence that these intents are present in the user's message is formed;

• search for the most appropriate rule for generating a response based on the user's intents obtained in the previous step.

2. The method according to claim 1, characterized in that the virtual assistant contains a unique assistant identifier and a set of named parameters.

3. The method according to claim 1, characterized in that upon receiving a request from the user in the dialog, occurrences of DL language dictionaries are searched for in the user's message according to the following method:

• select all dictionary strings that match the following heuristics:

a. the user's message contains all the text elements included in the DL dictionary string;

b. there is an order of found occurrences of text elements corresponding to the DL dictionary string, where:

o if there are no text elements located at the very beginning of the user's message, then the DL dictionary string must have an asterisk or superstar at the beginning;

o if there are no text elements located at the very end of the user's message, then the DL dictionary string must end with an asterisk or superstar.

4. The method according to claim 1, characterized in that when intents are detected, if their classification confidence is lower than the configuration parameter, they are discarded.

5. The method according to claim 1, characterized in that for each intent the classification confidence is multiplied by the configuration parameter and rounded with the accuracy determined by the configuration parameter.

6. The method according to claim 1, characterized in that after the intents are detected, they are grouped by the name of the role assigned to them.

7. The method according to claim 1, characterized in that when searching for a suitable rule for generating a response, for each question template and intent included in the DL language template, the weight of its match with the user's message is determined.

8. The method according to claim 7, characterized in that the weight is a 32-bit unsigned integer.

9. The method according to claim 7, characterized in that templates of the same weight are sampled by selecting from all found question templates and intents those with the maximum weight, while the selected templates are removed from the search list.

10. The method according to claim 1, characterized in that when searching for a suitable rule for generating a response:

• randomization is carried out, in which one of the previously selected template-responses is chosen randomly and determined as the response to the user's message;

• form a response according to the selected rule, and for the selected response template, all functional elements of the DL language are processed in accordance with their description;