WO2022071826A1 - System and method for automating voice call processing

Info

Publication number: WO2022071826A1
Application number: PCT/RU2020/000530
Authority: WO
Inventors: Юрий Юрьевич КОЗИН
Original assignee: Общество С Ограниченной Ответственностью "Колл Инсайт"
Priority date: 2020-09-30
Filing date: 2020-10-08
Publication date: 2022-04-07
Also published as: RU2763691C1

Abstract

The present technical solution relates to the field of computer engineering. A system for the automated processing of client voice calls to company customer care services comprises: an interaction server (CORE) for interacting with a voice application server by means of a Voice XML interpreter and an MRCP client, receiving an audio stream/audio file for transcription, selecting a method of processing the audio stream/audio file from between an automatic speech recognition (ASR) system and an operator service (OSR) module, routing the audio stream/audio file, processing a response from the ASR system and checking the level of confidence in the transcribed text, producing a file of transcribed text and semantic tags, and transmitting client call voice recognition results and identified semantic tags; an OSR server for routing client calls to an operator workstation (AWP), transmitting call processing results, and registering and selecting an operator for processing a call; an operator AWP; an OSR configurator AWP; a semantic service; a logger service; a statistics service which saves information about all stages of a dialogue; and a monitoring specialist AWP. The invention makes it possible to dynamically select a method for processing a client voice call according to transmitted parameters of a voice sample.

Description

SYSTEM AND METHOD FOR AUTOMATION OF PROCESSING OF VOICE INQUIRIES

FIELD OF TECHNOLOGY

This technical solution relates to the field of computer technology, in particular to a system and method for automating the processing of customer voice requests to the company's service departments.

BACKGROUND OF THE INVENTION

The high level of competition and the unstable economic situation in the world require companies to continuously optimize their costs, one of the key items of which is traditionally the cost of customer service. This entails the need to reduce the cost of servicing a customer's contact while maintaining a given level of quality, which in turn creates the need to automate these processes.

Global trends in the development of voice services, the improvement of modern voice channels (Voice over IP (VoIP) protocols and other technologies for transmitting voice information), and the improvement of technologies for analyzing and processing voice information lead to an increase in the number of automation systems for processing customer voice requests and, as a result, to an increase in companies' costs for their development.

A significant number of customer voice automation systems are known from the prior art; some of these solutions are described in applications US2013246053A1, publ. 09/19/2013, and US2011010173A1, publ. 01/13/2011.

The prior art solutions use the following system of objects:

System message: a question, prompt, or other system signal to the client. For example, the question "How can I help you?";

Client response/reaction: a response, command, question, clarification, or other reaction from the client. For example, "Tell me the balance on my card";

Dialogue: a sequence of messages from the system and the responses/reactions of the client. The dialogue may aim to ask the client a question in order to obtain information from the client, or to provide information at the request of the client;

Communication: a continuous sequence of dialogues of one client with the system. The dialogues can be logically interconnected or can cover different thematic blocks.

The key indicator of the quality of automatic and automated systems for servicing customer voice requests is the accuracy of recognizing the customer's response in each dialogue and in the communication as a whole. Recognition accuracy in a communication (P(right)) can be measured as the ratio of the number of correctly recognized client responses in the entire communication (Nright) to the total number of client responses in this communication (N).

P(right) = Nright/N*100%.
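As a minimal illustration of the formula above, the accuracy of one communication could be computed as follows. The function name and the sample counts are hypothetical and are not taken from the application.

```python
def recognition_accuracy(n_right: int, n_total: int) -> float:
    """Return P(right) = Nright / N * 100% for one communication."""
    if n_total <= 0:
        raise ValueError("N must be positive")
    if not 0 <= n_right <= n_total:
        raise ValueError("Nright must be between 0 and N")
    return 100.0 * n_right / n_total


# Hypothetical communication: 9 of 10 client responses recognized correctly.
accuracy = recognition_accuracy(9, 10)
print(f"P(right) = {accuracy:.1f}%")
```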

Recognition accuracy depends on two components: the accuracy of translating sound into text and the accuracy of identifying meaning from the recognized text.

The accuracy of identifying meaning from the recognized text is related to the number of words in the client's response and the syntactic complexity of the client's sentence. There are currently no universal methods for assessing the quality of this indicator.

These limitations lead to the impossibility of achieving a 100% probability level of recognition of client responses in automatic and automated systems and the possibility of full competition with the quality of human speech recognition. Moreover, the more complex the client's response, the greater the likelihood of an error in recognizing the response by the system.

This leads to the following main disadvantages of currently available systems:

1. The service should always take into account that an error could occur during recognition, since the accuracy of speech recognition is less than 100% and on average ranges from 80 to 93%. To solve this problem, existing systems include a step of clarifying (requesting) information from the client, which leads to a lengthening of the service time, an increase in the client's negative attitude towards the system and, as a result, termination of interaction with the system (the client breaks the connection, hangs up, etc.).

2. Simplification of dialogues between the client and the system. As a result, the number of dialogues within communication increases due to the fact that the system asks simpler questions and receives information from the client gradually. This also leads to longer service times and increases the likelihood of misidentification of customer responses.

3. Limiting the list of topics and voice services that can be automated. This leads to a departure from the concept of "human" communication between the client and the system in favor of solving the problem of reducing the probability of incorrect recognition or incorrect extraction of meaning.

The limitations listed above negatively affect the key performance indicators of automatic and automated systems and the company as a whole: a general decrease in the quality of correctly processed customer voice requests due to the exclusion of a person from the system; decrease in the percentage of automation, which is measured as the ratio of the number of clients who received service in the system to the total number of clients who applied to the system; growth in the cost of servicing customer voice requests due to the lengthening of the service time; reduced opportunities for development and self-learning of systems due to the limitation of the automated list of topics and voice services.

SUMMARY OF THE INVENTION

The technical problem to be solved by the claimed technical solution is the creation of a system, method and computer-readable medium for automating the processing of customer voice requests to the company's service departments, which are described in independent claims. Additional embodiments of the present invention are presented in dependent claims.

The technical result consists in the dynamic selection of a method for processing a client's voice request, depending on the transmitted parameters of the voice segment (voice sample), the quality of automatic speech recognition and the availability of the operator. The specified technical result is achieved by routing calls between the ASR system and the OSR service module, taking into account such parameters as the quality of speech recognition, the price/criticality of an error in the customer's business process, the availability of the operator, the category of the client, etc.

In the preferred embodiment, a system for automating customer voice calls to the company's service departments is declared, containing: an interaction server (CORE), which provides:

• interaction with the Voice Applications server via Voice XML interpreter and MRCP client;

• receiving from it an audio stream/audio file for transcription;

• selecting the processing method for the audio stream/audio file, either the automatic speech recognition system (ASR) or the operator service module (OSR), in accordance with the transferred settings;

• audio stream/audio file routing sequentially to the ASR system;

• processing the response from the ASR system and checking the level of trust in the transcribed text: if the level of trust is higher than the minimum set, it transfers the text to the Semantic service to extract semantic tags; if the level of trust is lower than the minimum set, it routes the call to the OSR service module;

• receiving an array of transcribed text and semantic tags from the OSR service module;

• transferring the results of recognition of the client's voice request and selected semantic tags to the Voice Applications server;

an OSR server providing:

• routing of customer requests to the Operator's workstation;

• transferring the results of processing requests to the interaction server;

• registration and selection of an operator to process the request;

an Operator's workstation containing a web interface for processing a client's voice request with pre-configured response templates and providing playback of a sound fragment (Voice Sample) to the Operator;

an OSR Configurator workstation containing a web interface for configuring the operator's workstation and the OSR service module;

a Semantic service that extracts keywords from the transcribed text according to a given grammar transmitted by the interaction server (CORE), based on the configured statistical model;

a Logger service, which logs the results of recognition of clients' voice messages and the selected semantic tags;

a statistics service (Statistics), which saves information about all stages of the dialogue: the date and time of the session start; the date and time of the end of the session; the URL of the audio stream/audio file; the statistics settings of the Voice Applications server, for further use in the Statistics AWP;

an AWP of the Monitoring Specialist, containing a web interface for viewing reports on the operation of the system and controlling the correctness of recognition of voice requests from clients.
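The fields that the statistics service is said to save can be pictured as a simple record. The sketch below is only an illustration: the class name, the field names, the choice of a dialog ID as the key, and the example URL are assumptions, since the application lists the stored items but does not define a storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class DialogStatistics:
    """Hypothetical record for one dialogue session (names are illustrative)."""
    dialog_id: str                      # assumed key; the method passes a dialog ID
    session_start: datetime             # date and time of the session start
    session_end: datetime               # date and time of the end of the session
    audio_url: str                      # URL of the audio stream/audio file
    voice_app_settings: Dict[str, str] = field(default_factory=dict)

    @property
    def duration(self) -> timedelta:
        """Session length, derived from the two stored timestamps."""
        return self.session_end - self.session_start


record = DialogStatistics(
    dialog_id="dlg-001",
    session_start=datetime(2020, 9, 30, 12, 0, 0),
    session_end=datetime(2020, 9, 30, 12, 0, 42),
    audio_url="http://example.invalid/vs/dlg-001.wav",
)
print(record.duration.total_seconds())
```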

In a particular version, the interaction server (CORE), depending on the level of criticality of the dialogue, routes the function of recognizing the client's request to the OSR service module without first contacting the ASR system, in which the transmitted audio segment (VS) is listened to and the choice of the correct text recognition option is noted, after which the OSR service module returns to the interaction server (CORE) an array of transcribed text and semantic tags.

In another particular version, the interaction server (CORE) routes the audio stream/audio file only to the ASR system, processes the response from the ASR system and checks the level of trust in the transcribed text, routes the text to the Semantic service to extract semantic tags if the trust level is above the minimum set, generates a negative response if the trust level is lower than the minimum set, and transfers the results of recognition of the client's voice request and the selected semantic tags to the Voice Applications server.

In another particular variant, the interaction server (CORE) first sends the audio segment (VS) to the ASR system and, after receiving the results of the automatic recognition, to the OSR service module, where the operator listens to the audio segment (VS), checks and supplements the results of the automatic recognition of the client's speech, and either confirms the ASR system data or makes the appropriate adjustments, depending on the quality of the automatic recognition; after that the OSR service module returns to the interaction server (CORE) an array of transcribed text and semantic tags.

In another particular variant, the interaction server (CORE) simultaneously sends an audio segment (VS) to both the ASR system and the OSR service module. If the first response comes from the OSR service module, the result of text recognition and the semantic tags from the OSR service module are transmitted to the Voice Applications server. If the first response comes from the ASR system, the probability of text recognition is additionally checked: if it is greater than the level specified in the system, the interaction server (CORE) sends the result of automatic recognition by the ASR system to the Voice Applications server; if the trust level is less than the level specified in the interaction server (CORE), a response from the OSR service module is awaited.
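The arbitration rule for the simultaneous variant can be sketched as a small decision function over responses in their order of arrival. This is an illustration under assumptions: the `Response` type, the `pick_result` name and the 0.8 threshold are hypothetical, and the application does not say what happens when the confidence exactly equals the threshold (here such a response is treated as not confident enough).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    source: str              # "ASR" or "OSR"
    text: str
    confidence: float = 1.0  # used only for ASR responses


def pick_result(arrivals: List[Response], min_confidence: float) -> Optional[Response]:
    """Walk the responses in arrival order and return the one to forward."""
    for response in arrivals:
        if response.source == "OSR":
            return response  # an operator answer is always accepted
        if response.confidence > min_confidence:
            return response  # a confident ASR answer is accepted at once
        # Low-confidence ASR answer: keep waiting for the operator.
    return None  # nothing acceptable has arrived yet


# ASR answers first but with low confidence, so the operator's answer wins.
chosen = pick_result(
    [Response("ASR", "card balance", 0.55), Response("OSR", "card balance")],
    min_confidence=0.8,
)
print(chosen.source)
```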

In another particular variant, after processing the client's speech and extracting semantic tags, the interaction server (CORE) makes a request through the client's terminal to the customer's IT systems and receives the text for speech synthesis, then passes the received text to the speech synthesis system (TTS) and returns to the client terminal an audio file with a synthesized message according to the information requested by the client.

The claimed solution is also implemented by means of a method for automating voice calls from clients to the company's service departments, comprising the steps of: establishing a connection, using the client terminal via the Media Resource Control Protocol (MRCP), with the Voice Applications server and sending a request containing a dialog identifier (ID) and an audio stream; pre-processing the call using the Voice Applications server and determining the beginning of speech using the Voice Activity Detection (VAD) function and timeouts; transferring the dialog ID and a uniform resource locator (URL) of the audio stream/audio file (VS) to the interaction server (CORE), and also providing interaction with the Customer's systems; receiving, using the interaction server (CORE), from the client terminal the dialog ID and the URL of the audio stream/audio file, and transmitting the audio stream/audio file and text recognition settings to the automatic speech recognition (ASR) system; transcribing and assessing the probability of correct sound recognition using the ASR system; returning, by means of the ASR system, to the interaction server (CORE) an array of transcribed text and a sound recognition confidence level; evaluating, using the interaction server (CORE), the level of confidence in the recognition of the voice segment (VS); at a trust level above the minimum set, transferring the text and the required grammar to the Semantic service to extract semantic tags; extracting semantic tags from the transferred text according to the specified grammar by the Semantic service; at a trust level below the minimum set, routing the call to the OSR service module, listening to the audio segment (VS) using the OSR service module, and recording the choice of the correct text recognition option; using the OSR service module, returning to the interaction server (CORE) an array of transcribed text and semantic tags; transmitting, by means of the interaction server (CORE), to the Voice Applications server an array of transcribed text and semantic tags; using the interaction server (CORE), logging the recognition results in the Logger service; recording and storing, in the statistics service (Statistics), information about all stages of the dialogue: date and time of the beginning of the session; date and time of the end of the session; URL of the audio stream/audio file; settings of the statistics of the Voice Applications server, for further use in the Statistics AWP.
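The core of the method, ASR first with a fallback to the OSR service module, can be sketched as follows. This is a simplified illustration, not the claimed implementation: the function and type names, the stub services, the 0.8 threshold and the example URL are all hypothetical, and a confidence exactly equal to the threshold is assumed to fall back to the operator because the text covers only the "above" and "below" cases.

```python
from typing import Callable, Dict, List, Tuple

AsrFn = Callable[[str], Tuple[str, float]]      # VS URL -> (text, confidence)
SemanticFn = Callable[[str, str], List[str]]    # (text, grammar) -> semantic tags
OsrFn = Callable[[str], Tuple[str, List[str]]]  # VS URL -> (text, semantic tags)


def process_voice_sample(vs_url: str, grammar: str, min_confidence: float,
                         asr: AsrFn, semantic: SemanticFn, osr: OsrFn) -> Dict[str, object]:
    """Route one voice sample: ASR first, operator (OSR) if confidence is too low."""
    text, confidence = asr(vs_url)
    if confidence > min_confidence:
        tags = semantic(text, grammar)   # confident: extract tags automatically
        handled_by = "ASR"
    else:
        text, tags = osr(vs_url)         # not confident: the operator decides
        handled_by = "OSR"
    return {"text": text, "tags": tags, "handled_by": handled_by}


# Stub services standing in for the real ASR, Semantic and OSR components.
def fake_asr(url: str) -> Tuple[str, float]:
    return ("tell me the balance on my card", 0.62)


def fake_semantic(text: str, grammar: str) -> List[str]:
    keywords = set(grammar.split())
    return [word for word in text.split() if word in keywords]


def fake_osr(url: str) -> Tuple[str, List[str]]:
    return ("tell me the balance on my card", ["balance", "card"])


result = process_voice_sample("http://example.invalid/vs/1.wav", "balance card",
                              0.8, fake_asr, fake_semantic, fake_osr)
print(result["handled_by"], result["tags"])
```

Passing the three services as callables keeps the routing decision separate from the transport (MRCP, HTTP, and so on), which the sketch deliberately leaves out.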

In a particular variant, using the interaction server (CORE), the client request recognition function is routed to the OSR service module without first contacting the ASR system; the OSR service module is used to listen to the transmitted audio segment (VS) and record the choice of the correct text recognition option, and then returns to the interaction server (CORE) an array of transcribed text and semantic tags.

In another particular variant, additionally: using the interaction server (CORE), the client request recognition function is routed to the ASR system; the ASR system transcribes the audio and assesses the probability of correct sound recognition; the interaction server (CORE) evaluates the level of confidence in the recognition of the voice segment (VS); at a trust level above the minimum set, the text and the required grammar are transferred to the Semantic service to extract semantic tags; the Semantic service extracts semantic tags from the transferred text according to the specified grammar; the interaction server (CORE) transmits to the Voice Applications server an array of transcribed text and semantic tags; at a trust level below the minimum set, a negative response to the Voice Applications server is generated.

In another particular variant, additionally: using the interaction server (CORE), the audio segment (VS) is sequentially sent to the ASR system, and after receiving the results of automatic recognition, to the OSR service module; using the OSR service module, listening to the audio segment (VS) and checking the results of the client's automatic speech recognition; confirm or correct in the OSR service module the results of automatic audio-to-text transcription using the ASR system; using the OSR service module, they send an array of transcribed text and semantic tags to the interaction server (CORE).

In another particular variant, additionally: the audio segment (VS) is simultaneously sent to both the ASR system and the OSR service module; when a response is received from the ASR system or the OSR service module, the interaction server (CORE) evaluates the order of the received responses as follows: if the first response comes from the OSR service module, then the recognition result and semantic tags of the OSR service module are transmitted to the Voice Applications server; if the first response comes from the ASR system and the transmitted text recognition probability is greater than the level specified in the interaction server (CORE), then the result of automatic recognition by the ASR system is transmitted to the Voice Applications server; if the first response comes from the ASR system and the text recognition probability is less than the level specified in the interaction server (CORE), then a response from the OSR service module is awaited.

In another particular variant, additionally: after processing the client's speech and extracting semantic tags, the interaction server (CORE) is accessed through the Voice Applications server to the customer's IT system and the text for speech synthesis is received; using the interaction server (CORE), the text received from the Customer's IT system is transferred to the speech synthesis system (TTS); using the interaction server (CORE), an audio file with a synthesized message is returned to the client terminal according to the information requested by the client.

The claimed solution is also implemented by a computer-readable medium for automating customer voice calls to the company's service departments, containing processor-executable instructions that cause hardware to interact to perform a method for automating customer voice calls to the company's service departments.

In a private embodiment, the interaction server (CORE) is used to route the function of recognizing the client's request to the OSR service module without first contacting the ASR system; using the OSR service module, the operator listens to the transmitted audio segment (VS) and fixes the choice of the correct text recognition option; using the OSR service module, they return to the interaction server (CORE) an array of transcribed text and semantic tags.

In another particular variant, additionally: using the interaction server (CORE), the client request recognition function is routed to the ASR system; the ASR system transcribes the audio and assesses the probability of correct sound recognition; the interaction server (CORE) evaluates the level of confidence in the recognition of the voice segment (VS); at a trust level above the minimum set, the text and the required grammar are transferred to the Semantic service to extract semantic tags; the Semantic service extracts semantic tags from the transferred text according to the specified grammar; the interaction server (CORE) transmits to the Voice Applications server an array of transcribed text and semantic tags; if the trust level is below the minimum set, a negative response to the Voice Applications server is generated.

In another particular variant, additionally: using the interaction server (CORE), the audio segment (VS) is sequentially sent to the ASR system and, after receiving the results of automatic recognition, to the OSR service module; using the OSR service module, the operator listens to the audio segment (VS) and checks the results of the automatic recognition of the client's speech; the results of automatic audio-to-text transcription by the ASR system are confirmed or corrected in the OSR service module; using the OSR service module, an array of transcribed text and semantic tags is sent to the interaction server (CORE).

In another particular variant, additionally: the audio segment (VS) is simultaneously sent to both the ASR system and the OSR service module; when a response is received from the ASR system or from the OSR service module, the interaction server (CORE) evaluates the order of the received responses as follows: if the first response comes from the OSR service module, then the recognition result and semantic tags of the OSR service module are transmitted to the Voice Applications server; if the first response comes from the ASR system and the transmitted probability of text recognition is greater than the level specified in the interaction server (CORE), then the result of automatic recognition by the ASR system is transmitted to the Voice Applications server; if the response from the ASR system comes first in time and the probability of text recognition is less than the level specified in the interaction server (CORE), then a response from the OSR service module is awaited.

In another particular variant, additionally: after processing the client's speech and extracting semantic tags, the interaction server (CORE) is used to access the Customer's IT system through the Voice Applications server and the text for speech synthesis is received; using the interaction server (CORE), the text received from the Customer's IT system is transferred to the speech synthesis system (TTS); using the interaction server (CORE), an audio file with a synthesized message is returned to the client terminal according to the information requested by the client.

DESCRIPTION OF THE DRAWINGS

The implementation of the invention will be described hereinafter in accordance with the accompanying drawings, which are presented to explain the essence of the invention and in no way limit the scope of the invention. The following drawings are attached to the application:

Fig. 1 illustrates a hardware-software complex for automating customer voice calls to service departments;

Fig. 2-6 illustrate examples of interfaces for interacting with the system, which provide the ability to view lists of customized dialog scripts and generate a new dialog script;

Fig. 7 illustrates an example of an interface for interacting with the system, which presents a flow chart of the process of voice analysis during a real-time call and of determining the choice of the next branch of the dialog script, depending on the voice analysis;

Fig. 8 illustrates an example of a variant of the interface of a monitoring specialist with detailed information on a client's request.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the implementation of the invention, numerous implementation details are provided to provide a clear understanding of the present invention. However, one skilled in the art will appreciate how the present invention can be used, both with and without these implementation details. In other cases, well-known methods, procedures and components have not been described in detail so as not to obscure the features of the present invention.

Furthermore, it will be clear from the foregoing that the invention is not limited to the present implementation. Numerous possible modifications, changes, variations and substitutions that retain the spirit and form of the present invention will be apparent to those skilled in the subject area.

The present invention is directed to providing a system, method and computer-readable medium for automating the processing of customer voice calls to company service departments, which combines an automatic speech recognition (ASR) system and an operator service module (OSR).

The following terms and abbreviations are used in the claimed solution:

NoInput - an event in which the system does not receive a voice command from the client (the client is silent or speaks too quietly);

NoMatch - an event in which the client enters a value that is not determined by the system based on the underlying grammar;

Aggregated dialog - the top level of dialog aggregation in the OSR service, reflects the business logic of combining the topics of customer requests;

Topics - aggregates dialogues with a client with a single reason for contacting;

Dialogue - A message played to the client as part of a conversation. Dialogues related to one issue are combined into topics;

Operator skill - a characteristic of the operator, reflecting his specialization and level of preparedness;

Global response - a pre-configured action option for an operator in response to a client's voice command, allowing the client to be switched to a menu branch outside of the pre-configured dialog logic;

Local response - a pre-configured action option for an operator in response to a client's voice command, allowing the client to be switched to the next step of the dialogue;

AS - automated system;

AWP - automated workplace;

ATS - automatic telephone exchange;

DB - database;

DBMS - database management system;

SW - software;

ASR (Automatic Speech Recognition) - automatic speech recognition;

IVR (Interactive Voice Response) - a system of pre-recorded voice messages that performs the function of routing calls within the contact center;

NFS (Network File System) - protocol for network access to file systems;

TCP/IP - Transmission Control Protocol (TCP), a data transfer control protocol, and Internet Protocol (IP), a protocol that describes the format of a data packet transmitted over a network;

TTS (Text-To-Speech) - speech synthesis.

In the claimed solution, the subscriber's voice processing system replaces the company's service employee (contact center operator, store clerk, restaurant waiter, etc.) with an automatic (with full automation) or automated (when using the OSR service) service. The solution makes maximum use of the capabilities of modern speech recognition technology, which makes it possible to avoid involving the operator by default and to bring in a human only when strictly necessary.

The claimed solution uses the following technological solutions:

1. The recording of the conversation in online mode is processed and the beginning of the client's conversation is detected.

2. The client's voice is recorded and a Voice Sample is created (VS - a fragment of the Client - System dialogue, which contains only the client's answer to the system's question).

3. The VS is passed to the operator for processing via the Operator Speech Recognition (OSR) service module.

Depending on the system settings, the VS is transferred to the operator immediately in online mode, or it can be processed offline if the ASR system did not recognize the client's speech.

In one of the variants of the system operation, the simultaneous use of the ASR system and the OSR service module is provided.

To form the audio segment (VS), the following is performed: the beginning of the client's speech is determined, the client's response is isolated, and the silence at the beginning and end of the speech is cut off. The VS is submitted to the ASR system. The probability of correct recognition of the VS is estimated, and if a response is received with a probability of correct recognition higher than that specified in the system settings, the recognized text is transferred to the Semantic service.
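Cutting off the silence at the beginning and end of the client's response can be illustrated with a very simple amplitude threshold. This is only a sketch under assumptions: the application does not describe how speech boundaries are detected beyond naming the VAD function, so the threshold rule, the function name and the sample values below are hypothetical.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[int], threshold: int) -> List[int]:
    """Drop leading and trailing samples whose absolute amplitude is below threshold."""
    start = 0
    end = len(samples)
    # Skip quiet samples at the beginning.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Skip quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])


# Hypothetical PCM amplitudes: quiet lead-in, speech, quiet tail.
voice_sample = trim_silence([0, 1, 0, 50, -60, 2, 40, 0, 0], threshold=10)
print(voice_sample)
```

Quiet samples inside the response (the value 2 above) are kept, since only the edges are trimmed.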

If the recognition level is less than the specified threshold, the VS is passed to the OSR service module, where the operator recognizes the transmitted VS. A short beep is played in the operator's headphones, and then the VS is played (at the original speed or N times faster, with N equal to 1.3 by default). During playback, the playback position pointer moves in accordance with the VS. After listening to the VS, the operator presses the button/line corresponding to the client's response, or types text/numbers, and sends the result back to the interaction server (CORE), which, in accordance with the specified settings, routes the data to the Voice Applications server.

In one alternative mode of system operation, only the OSR service module is used. In this case, the audio segment (VS) is formed: the beginning of the client's speech is determined, the client's response is isolated, and the silence at the beginning and end of the speech is cut off. The VS is sent to the OSR service module, where the operator recognizes the transmitted VS. A short beep is played in the operator's headphones, and then the VS is played (at the original speed or N times faster, with N equal to 1.3 by default). During playback, the playback position pointer moves in accordance with the VS. After listening to the VS, the operator presses the button/line corresponding to the client's response, or types text/numbers, and sends the result back to the interaction server (CORE), which routes the data to the Voice Applications server.

In the proposed solution, the system for processing customer voice requests to the company's service departments allows you to involve an operator in the following cases:

1. at the dialogue stage, where the automatic speech recognition system cannot recognize the client's words.

2. only in critical steps of the dialogue, when 100% confidence in the correct recognition of the client's response is required (order confirmation, password/codeword verification, etc.).

This approach, unlike the usual Operator-Client dialog, allows one operator to effectively process up to 10-14 calls at the same time, which allows up to an 80% reduction in the costs of maintaining the corresponding company service. Additionally, the transcribed text and selected semantic tags for 100% of the received voice messages (both those processed by the automatic service and those processed by the operator) are stored, which allows the collected data to be used to improve the operation of the Semantic service.

As shown in Fig. 1, the hardware-software complex for automating voice calls from clients to customer services consists of server modules and client modules.

The composition of the server modules:

102. The interaction server (CORE) is responsible for the interaction of all modules and submodules with each other and for the transmission of requests, including to the statistics service. It receives requests from the Voice Application Server via the Media Resource Control Protocol (MRCP), transfers calls to the OSR service, including processing and pre-building JSON forms, and receives and routes recognition results from the OSR service module to the MRCP. It calls the Semantic service to extract meaning from the recognized text.

104. The OSR (Operator Speech Recognition) operator service module is responsible for transmitting requests and receiving responses to the Operator's workstation. Manages agent registration (assignment and removal of Busy status), monitors agent status (break, ready, busy) and routes calls depending on agent skill groups, their busyness and handled call history.

107. The Semantic service is responsible for extracting meaning (keywords) from the recognized text based on a statistical model.

103. The Statistics Service (Statistic) is responsible for storing information about all stages of the dialogue, for further use in the Statistics AWP.
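The operator selection performed by the OSR service module (104) can be sketched as follows. This is only an illustration under stated assumptions: the `Operator` record, the use of the handled-call count as the measure of workload, and the function names are hypothetical; the description only states that routing depends on skill groups, status and call history, and that a BUSY status is returned when no operator is free.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Operator:
    name: str
    status: str             # "ready", "busy" or "break"
    skills: Set[str]        # skill groups the operator belongs to
    handled_calls: int = 0  # assumed load measure taken from call history


def select_operator(operators: List[Operator], skill: str) -> Optional[Operator]:
    """Pick the least-loaded ready operator with the required skill."""
    candidates = [op for op in operators
                  if op.status == "ready" and skill in op.skills]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda op: op.handled_calls)
    chosen.status = "busy"      # Busy status is assigned on selection
    chosen.handled_calls += 1
    return chosen


def route_to_operator(operators: List[Operator], skill: str) -> str:
    """Return the selected operator's name, or BUSY if nobody is free."""
    chosen = select_operator(operators, skill)
    return chosen.name if chosen else "BUSY"


team = [
    Operator("anna", "ready", {"orders"}, handled_calls=3),
    Operator("boris", "ready", {"orders", "billing"}, handled_calls=1),
    Operator("vera", "break", {"orders"}),
]
print(route_to_operator(team, "orders"))
```

Releasing an operator (removing the Busy status once the answer is sent) is omitted to keep the sketch short.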

Client modules:

111. Operator's workstation - the workstation at which requests from the MRCP service are received. When a request is received, the workstation automatically opens the request window and plays an audio clip. The operator can choose the option that corresponds to the audio clip. The operator's response is returned to the interaction server (CORE).

112. AWP Configurator - the workstation of the administrator of the hardware-software complex. It allows configuring the operator's interface (OSR service module), recognition parameters (ASR system), speech synthesis (TTS service) and the web service settings for accessing the Customer's systems.

113. Workstation of Statistics - workplace of a monitoring specialist. Provides reports based on generated statistics.

To ensure the operation of the hardware-software complex, a technical environment is also required, consisting of the following modules:

101. Voice Application Server (Voice XML): supports the main logic of the service and adapts to the specifics of the Customer's work. Responsible for interaction with the Customer's IT systems, call pre-processing, detection of the beginning of speech using the VAD function and timeouts. Transmits audio to the interaction server (CORE).

108. The Voice XML interpreter and the MRCP client transmit requests between the interaction server (CORE) and the Voice Application server.

105. The ASR (Automatic Speech Recognition) system is responsible for interacting with speech recognition servers from various manufacturers, including Nuance ASR, Yandex Speech Kit, etc.

106. The TTS (Text-To-Speech) service is responsible for interacting with speech synthesis servers from various manufacturers, including TTS Nuance, TTS Yandex Speech Kit, etc.

Below is the logic of interaction between the components of the system:

151. The client's voice message through the Customer's IT systems is routed to the Voice Applications server.

152. The Voice Applications server records the beginning of the call in the Statistics service and transfers a request for processing an audio stream/audio file to the CORE server.

The processing request includes: call identifier; URL to the audio stream/audio file; processing type (ASR/ OSR/ ASR+OSR); grammar for affixing semantic tags.

153. The Voice XML interpreter and MRCP client transmit the request to the CORE server.

154. Depending on the transferred settings, the CORE server routes the audio stream/audio file for recognition: to the ASR system (154) (audio stream/audio file and text recognition settings); to the OSR service module (154') (audio stream/audio file, if available, recognized text from the ASR service and the name of the dialog); simultaneously to the OSR service module and the ASR system.

155. The ASR system (155) processes the audio stream/audio file and generates an array of recognized text indicating the level of trust.

155’ The OSR service module (155’), upon receiving a request to process an audio stream/audio file, searches for and prioritizes free operators and routes the audio stream/audio file to the selected employee.

The result of processing an OSR/ASR request is:

- selected semantics from the audio recording;

- audio recording of the dialogue with the client transcribed by the operator.

If there are no free employees, the OSR service module returns a response with the BUSY status to the CORE server.

156. After text recognition by the ASR system or transcription of the record by the operator, the CORE server transmits this information to the Semantic server to extract semantic tags.

If during the processing of an audio recording the operator used preconfigured dialogue responses, then the OSR service module transmits the already selected semantic tags and no call to the Semantic server occurs.

157. The Semantic Server performs the extraction of semantic tags in the transmitted text using the specified grammar.

158. After receiving an array of semantic tags from the Semantic server, the CORE server transmits this information to the Voice Applications server to determine further steps in processing the dialogue with the client.

If semantic tags were not selected or errors occurred while processing the audio stream/audio file, the CORE server sends one of the following event types: No Match, No Input, Error.

159. After the recognition results are transferred to the Voice Applications server via the MRCP protocol, they are also logged in the Logger service.

160. Based on the results of the analysis of the recognized text from the client, the Voice Applications server can make a request to the Customer's IT systems to obtain additional information to respond to the client (balance request, order status, information about the work of branches/stores, etc.).

161. IT systems of the Customer generate the necessary information at the request of the client.

162. If the information is of a dynamic nature and it is necessary to perform speech synthesis for its voicing, then the Voice Applications server sends a request to the TTS service (depending on the TTS contractor chosen by the Customer, the request goes directly or through the MRCP client).

163. According to the given text, the TTS service performs speech synthesis and transfers the created audio file for playback to the customer's IT systems.

164. Upon completion of the processing of a voice call from the client, the Voice Applications server records the end of the dialogue to the Statistic server.
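The routing performed by the CORE server in steps 152-158 can be sketched as follows. The sketch makes several assumptions that are not in the description: the grammar is modeled as a plain keyword-to-tag dictionary standing in for the Semantic service's statistical model, the minimum confidence of 0.7 is arbitrary, a confidence equal to the minimum is treated as sufficient, operator input is treated as fully trusted, and every failure is reported as the No Match event. The ASR and OSR results are passed in as ready values rather than obtained from real services.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProcessingRequest:
    call_id: str
    audio_url: str
    processing_type: str     # "ASR", "OSR" or "ASR+OSR"
    grammar: Dict[str, str]  # hypothetical format: keyword -> semantic tag


@dataclass
class Recognition:
    source: str              # "ASR" or "OSR"
    text: str
    confidence: float = 1.0  # assumption: operator input is fully trusted


def extract_tags(text: str, grammar: Dict[str, str]) -> List[str]:
    """Stand-in for the Semantic service: plain keyword lookup."""
    words = text.lower().split()
    return [tag for keyword, tag in grammar.items() if keyword in words]


def pick_result(request: ProcessingRequest, asr: Optional[Recognition],
                osr: Optional[Recognition],
                min_confidence: float = 0.7) -> Optional[Recognition]:
    """Choose which recognition result the CORE server forwards."""
    if request.processing_type == "OSR":
        return osr                       # operator only, ASR is not consulted
    if asr is not None and asr.confidence >= min_confidence:
        return asr                       # automatic result is trusted
    if request.processing_type == "ASR+OSR":
        return osr                       # low confidence: fall back to operator
    return None                          # ASR only: negative response


def build_reply(request: ProcessingRequest,
                result: Optional[Recognition]) -> dict:
    """Reply sent to the Voice Applications server."""
    tags = extract_tags(result.text, request.grammar) if result else []
    if not tags:
        return {"call_id": request.call_id, "event": "No Match"}
    return {"call_id": request.call_id, "source": result.source,
            "text": result.text, "tags": tags}


req = ProcessingRequest("c1", "http://example.invalid/vs.wav", "ASR+OSR",
                        {"yes": "CONFIRM", "no": "DECLINE"})
reply = build_reply(req, pick_result(req, Recognition("ASR", "yes please", 0.4),
                                     Recognition("OSR", "yes")))
print(reply)
```

In the usage above the ASR confidence of 0.4 is below the assumed minimum, so the operator's transcription is the one forwarded.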

If it is necessary to analyze recognition results and monitor the quality of the service, the monitoring specialist searches and analyzes audio recordings and recognition results using the web interface of the Statistic service.

Figures 2-5 show examples of interfaces for interacting with the system, which provide the ability to view lists of configured dialog scripts and generate a new dialog script.

Dialog Administration

To switch to viewing the configured dialogs for processing by the Operator, select the “Dialogues” section on the main screen.

This window allows you to:

1. View a list of configured dialogs

2. Create a new dialog

3. Modify an existing dialog

To create a new dialog in the system, click the "Create" button at the top of the dialog box. In the dialog box that opens, specify the name of the dialog; in the Promt section, enter the full text of the dialogue that is read to the client in the IVR; select a theme; and click the "Create" button.

To edit a previously created dialog, select the required entry from the list and follow the hyperlink to the dialog viewing window. In the window that opens, click the "Edit" button. A dialog box for editing the dialog parameters opens, in which you can:

Change the name of the dialog

Edit Dialog Description

Select the topic to which this dialog belongs from the list of available topics

To create and edit a list of dialogue responses, select the required entry from the list of dialogues and follow the hyperlink. A dialog box opens allowing you to:

1. View and correct the list of answers within this dialog (add, change or delete);

2. Change the parameters of the dialog (title, description and subject).

To create a response, click the "Create response" button in the upper right part of the dialog box. In the dialog box that opens, you must specify:

the name of the answer in Latin characters;

the response display type. Available options: BUTTON - button, ADDRESS - field for entering an address, TEXT - field for entering text, NUMBER - field for entering a number, DATE - field for selecting a date;

a description of the response that will be displayed to the operator.
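The response parameters listed above map naturally onto a small data structure. The following is a sketch only: the class and field names are hypothetical, and the rule that a name consists of Latin letters, digits and underscores is an assumed reading of "in Latin".

```python
import re
from dataclasses import dataclass
from enum import Enum


class DisplayType(Enum):
    BUTTON = "BUTTON"    # button
    ADDRESS = "ADDRESS"  # field for entering an address
    TEXT = "TEXT"        # field for entering text
    NUMBER = "NUMBER"    # field for entering a number
    DATE = "DATE"        # field for selecting a date


@dataclass
class DialogResponse:
    name: str                  # response name in Latin characters
    display_type: DisplayType
    description: str           # text shown to the operator

    def __post_init__(self) -> None:
        # Assumed naming rule: Latin letter first, then letters, digits, "_".
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", self.name):
            raise ValueError(f"response name must be Latin: {self.name!r}")


confirm = DialogResponse("confirm_order", DisplayType.BUTTON,
                         "Client confirms the order")
print(confirm.display_type.value)
```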

To edit the answer, select the desired entry from the list and click on the "Edit" button.

To delete an answer, select the desired entry from the list and click the "Delete" button.

Figure 6 shows a variant of the interface when a call is received by the operator. When a call comes in, a dialog box opens in which the agent selects the option matching the customer's response.

The dialog box includes:

• Dialog name (601) - this field displays the name of the dialog from which the client's response was transferred to the agent. It appears at the top of the dialog box.

• Remaining time (602) - this field displays the time available for the agent to respond, expressed in milliseconds. If the operator does not answer before the allotted time expires, the dialog box is closed automatically and the BUSY event is recorded in the system.

• Text entry field (603) - designed to enter the client's response to the current dialog.

It is filled in if the dialogue involves an extended customer response (one with many possible answers, such as names, addresses or full names) or a unique one (a comment, password or code word).

• Local buttons (604) - take the dialogue with the client to the next step within the standard, preconfigured route.

The names of the buttons and the transition logic are configured by the system administrator when creating a dialog (Administration of dialogs).

• Global buttons (605) - allow you to control the dialogue with the client according to a non-standard scenario: transferring a call several steps forward or backward; playing a pre-recorded phrase in response to the client's response/comment and returning to the same step of the dialogue; playing a pre-recorded phrase in response to the client's response/comment and ending the dialogue with a fixed event.

Global buttons are customizable within a theme and are the same for all dialogs within a given theme.

• "Send" button - when the button is pressed, the system records the client's response entered by the operator in the text field. The dialog moves to the next step according to the configured route.
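The behaviour of the "Remaining time" field (602) can be sketched as follows. The function names and the use of millisecond timestamps are assumptions, as is treating an answer that arrives exactly when the time runs out as late.

```python
from typing import Optional

BUSY = "BUSY"


def remaining_ms(allotted_ms: int, opened_at_ms: int, now_ms: int) -> int:
    """Value shown in the "Remaining time" field, never below zero."""
    return max(0, allotted_ms - (now_ms - opened_at_ms))


def close_window(answer: Optional[str], allotted_ms: int,
                 opened_at_ms: int, now_ms: int) -> dict:
    """Outcome when the window closes: the answer, or the BUSY event."""
    if answer is None or remaining_ms(allotted_ms, opened_at_ms, now_ms) == 0:
        return {"event": BUSY}
    return {"answer": answer}


# Window opened at t=1000 ms with 5000 ms allotted; operator answers at 3500 ms.
print(remaining_ms(5000, 1000, 3500))
print(close_window("yes", 5000, 1000, 3500))
```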

Figure 7 shows a variant of the interface of the monitoring specialist.

The dialog box includes:

• Area for searching and selecting customer voice messages (701) - in this field, you can search by the parameters time from... to, phone number, termination reason, termination action.

• Search statistics (702) - how many hits were found for the specified parameters, including a breakdown by reason and completion actions.

• List of found dialogs (703) - the list displays information on the number from which the call was made, date and time of the call, duration, termination method, termination action.

• Go to detailed information on the selected case (704)

• Listen to the call (705) - listen to the audio recording of the customer's call.

To select requests, the user can specify the following parameters:

Application - select a specific application from the list configured in the system.

Call time - the period in which the client's call occurred.

Phone number - the phone number from which the client contacted, only for phone calls.

End reason - the reason for ending the conversation with the client.

End action - a recorded action in the system to end a conversation with a client.

After filling in the search parameters for requests, you must click the "Find" button. The system will display:

Case search results statistics: total number of cases matching the entered criteria and broken down by completion status.

A list of hits that meets the entered criteria, indicating:

name of the application;

number from which the call was made (for calls via telephone);

date and time of the hit;

duration of the hit;

method of ending the hit (reason for ending the hit);

termination action.
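The search and the accompanying statistics can be sketched as follows. The record layout and function names are hypothetical; the sketch only mirrors the filter parameters and result columns listed above and does not reflect how the Statistic service actually stores its data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class CallRecord:
    application: str
    phone: str
    started_at: datetime
    duration_s: int
    end_reason: str
    end_action: str


def find_calls(records: List[CallRecord],
               application: Optional[str] = None,
               time_from: Optional[datetime] = None,
               time_to: Optional[datetime] = None,
               phone: Optional[str] = None,
               end_reason: Optional[str] = None,
               end_action: Optional[str] = None) -> List[CallRecord]:
    """Return the records matching every parameter that was filled in."""
    def matches(r: CallRecord) -> bool:
        return ((application is None or r.application == application)
                and (time_from is None or r.started_at >= time_from)
                and (time_to is None or r.started_at <= time_to)
                and (phone is None or r.phone == phone)
                and (end_reason is None or r.end_reason == end_reason)
                and (end_action is None or r.end_action == end_action))
    return [r for r in records if matches(r)]


def search_statistics(found: List[CallRecord]) -> dict:
    """Total number of hits with a breakdown by end reason and end action."""
    return {
        "total": len(found),
        "by_end_reason": dict(Counter(r.end_reason for r in found)),
        "by_end_action": dict(Counter(r.end_action for r in found)),
    }
```

A parameter left as `None` corresponds to a search field the user did not fill in.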

The following operations are available to the user in this interface:

Listen to the entire selected customer call by pressing the Play button in the Audio item

View detailed information on the selected case. To do this, follow the hyperlink in the "Number" item for the selected call.

Figure 8 shows a variant of the interface of the monitoring specialist with detailed information on the client's request.

The dialog box includes:

1. List of dialogs within the selected call.

2. The following information is displayed for each dialog:

a. Date and time of the dialogue;

b. The level of confidence in the recognized text - takes values from 0 to 1, where 1 - a high degree of recognition accuracy, 0 - the text is not recognized;

c. Transcription - contains the recognized text of the client's response within the dialogue;

d. Result - contains a rule describing the further action of the program after recognition;

e. Status - the result of comparing the recognized phrase with the grammar of the application;

f. Who answered - the source of text recognition is indicated: ASR service or operator;

g. Dialogue stage - the name of the dialogue.

According to the selected request, the user can:

View the list of dialogs within the selected case;

Listen to the selected dialogue;

For each dialog, the following information is displayed:

Date and time of the dialogue;

The level of confidence in the recognized text - takes values from 0 to 1, where 1 - a high degree of recognition accuracy, 0 - the text is not recognized;

Transcription - contains the recognized text of the client's response within the dialogue;

Result - contains a rule describing the further action of the program after recognition;

Status - the result of comparing the recognized phrase with the grammar of the application;

Who answered - the source of text recognition is indicated: ASR service or operator;

Dialog stage - the name of the dialog.

These application materials present a preferred embodiment of the claimed technical solution, which should not be construed as limiting other, particular embodiments that do not go beyond the requested scope of legal protection and are obvious to specialists in the relevant field of technology.

Claims

CLAIM

1. A system for automating customer voice calls to the company's service departments, comprising: an interaction server (CORE), configured to:

• receiving from it an audio stream/audio file for transcription;

• selection of the processing method - by the automatic speech recognition system (ASR) or the operator service module (OSR) of the audio stream/audio file in accordance with the transmitted settings;

• audio stream/audio file routing sequentially to the ASR system;

• transferring the results of recognition of the client's voice request and selected semantic tags to the Voice Applications server; an OSR server configured to:

• routing of customer requests to the Operator's workstation;

• transferring the results of processing requests to the interaction server;

• registration and selection of an operator to process the request;

Operator's workstation, containing a web-interface for processing a client's voice request with pre-configured response templates and providing playback of a sound fragment (Voice Sample) to the Operator;

Workstation OSR Configurator containing a web interface for configuring the operator's workstation and the OSR service module; a Semantic service configured to extract keywords from the transcribed text according to a given grammar transmitted by the interaction server (CORE), based on a customized statistical model; a Logger service configured to log the results of recognition of clients' voice calls and the selected semantic tags; a statistics service (Statistics), configured to save information about all stages of the dialogue:

■ date and time of session start; date and time of the end of the session; URL of the audio stream/audio file;

■ settings for statistics of the Voice Applications server, for further use in AWP Statistics;

AWP of the Monitoring Specialist, containing a web-interface for viewing reports on the operation of the system and monitoring the correctness of recognition of voice requests from clients.

2. The system according to claim 1, characterized in that the interaction server (CORE), depending on the level of criticality of the dialogue, routes the function of recognizing the client's request to the OSR service module without first contacting the ASR system, in which the transmitted audio segment (VS) is listened to and the choice of the correct text recognition option is noted, after which the OSR service module returns to the interaction server (CORE) an array of transcribed text and semantic tags.

3. The system according to claim 1, characterized in that the interaction server (CORE) routes the audio stream/audio file only to the ASR system, processes the response from the ASR system and checks the trust level of the transcribed text, routes the text to the Semantic service to extract semantic tags when the trust level is above the minimum set value, generates a negative response when the trust level is below the minimum set value, and transmits the results of recognition of the client's voice request and the selected semantic tags to the Voice Applications server.

4. The system according to claim 1, characterized in that the interaction server (CORE) first sends the audio segment (VS) to the ASR system and, after receiving the results of automatic recognition, to the OSR service module, where the audio segment (VS) is listened to and the results of the automatic recognition of the client's speech are checked and supplemented: depending on the quality of the automatic recognition, the data of the ASR system are confirmed or appropriate adjustments are made, after which the OSR service module returns to the interaction server (CORE) an array of transcribed text and semantic tags.

5. The system according to claim 1, characterized in that the interaction server (CORE) simultaneously sends an audio segment (VS) to both the ASR system and the OSR service module; if the first response comes from the OSR service module, then the result of text recognition and the semantic tags from the OSR service module are transmitted to the Voice Application server; if the first response comes from the ASR system, then the probability of text recognition is additionally checked: if it is greater than the level specified in the system, the interaction server (CORE) transmits to the Voice Applications server the result of automatic recognition by the ASR system, and if the trust level is less than the level set in the interaction server (CORE), a response is awaited from the OSR service module.

6. The system according to claim 1, characterized in that after processing the client's speech and extracting semantic tags, the interaction server (CORE) makes a request through the client's terminal to the customer's IT systems and receives the text for speech synthesis, then hands the received text to the speech synthesis system (TTS) and returns to the client terminal an audio file with a synthesized message according to the information requested by the client.

7. A method for automating voice calls from clients to the company's service departments, comprising the steps of: establishing a connection using the client terminal via the Media Resource Control Protocol (MRCP) with the Voice Applications server and sending a request containing an identifier (ID) of the dialogue and an audio stream; performing pre-processing of the call using the Voice Applications server, determining the beginning of speech using the Voice Activity Detection (VAD) function and timeouts; transferring the dialog ID and a uniform resource locator (URL) of the audio stream/audio file (VS) to the interaction server (CORE), and also providing interaction with the Customer's systems; receiving, using the interaction server (CORE), from the client terminal, the dialog ID and the URL of the audio stream/audio file, and transmitting the audio stream/audio file and text recognition settings to the automatic speech recognition (ASR) system; transcribing and evaluating the probability of correct sound recognition using the ASR system; returning, by means of the ASR system, to the interaction server (CORE) an array of transcribed text and a sound recognition confidence level; evaluating, using the interaction server (CORE), the level of confidence in the recognition of the voice segment (VS); at a trust level above the minimum set, transferring the text and the required grammar to the Semantic service to extract semantic tags; extracting semantic tags from the transferred text according to the specified grammar by the Semantic service; at a trust level below the minimum set, routing the call to the OSR service module, listening to the audio segment (VS) using the OSR service module, and recording the choice of the correct text recognition option; returning, using the OSR service module, to the interaction server (CORE) an array of transcribed text and semantic tags; transmitting, by means of the interaction server (CORE), to the Voice Applications server an array of transcribed text and semantic tags; logging, using the interaction server (CORE), the recognition results in the Logger service; recording and storing information about all stages of the dialogue (date and time of the beginning of the session; date and time of the end of the session; URL of the audio stream/audio file; settings of the statistics of the Voice Applications server) in the statistics service (Statistics) for further use in the Statistics AWP.

8. The method according to claim 7, characterized in that, using the interaction server (CORE), the client call recognition function is routed to the OSR service module without first contacting the ASR system; the transmitted audio segment (VS) is listened to in the OSR service module and the choice of the correct text recognition option is recorded; and, using the OSR service module, an array of transcribed text and semantic tags is returned to the interaction server (CORE).

9. The method according to claim 7, characterized in that additionally: using the interaction server (CORE), the client call recognition function is routed to the ASR system; the probability of correct sound recognition is assessed and the transcription is performed using the ASR system; the level of confidence in the recognition of the voice segment (VS) is evaluated using the interaction server (CORE); at a trust level above the minimum set, the text and the required grammar are transferred to the Semantic service to extract semantic tags; semantic tags are extracted from the transferred text according to the specified grammar by the Semantic service; an array of transcribed text and semantic tags is transmitted by means of the interaction server (CORE) to the Voice Applications server; at a trust level below the minimum set, a negative response is generated to the Voice Applications server.

10. The method according to claim 7, characterized in that additionally: using the interaction server (CORE), the audio segment (VS) is sent sequentially to the ASR system and, after receiving the results of automatic recognition, to the OSR service module; using the OSR service module, the audio segment (VS) is listened to and the results of the automatic recognition of the client's speech are checked; the results of automatic audio-to-text transcription by the ASR system are confirmed or corrected in the OSR service module; using the OSR service module, an array of transcribed text and semantic tags is sent to the interaction server (CORE).

11. The method according to claim 7, characterized in that additionally: the audio segment (VS) is sent simultaneously to both the ASR system and the OSR service module; when a response is received from the ASR system or the OSR service module in the interaction server (CORE), the received responses are evaluated in the following order: if the first response comes from the OSR service module, then the recognition result and semantic tags of the OSR service module are transmitted to the Voice Application server; if the first response comes from the ASR system and the transmitted text recognition probability is greater than the level specified in the interaction server (CORE), then the result of automatic recognition by the ASR system is transmitted to the Voice Application server; if the first response comes from the ASR system and the text recognition probability is less than the level specified in the interaction server (CORE), then a response from the OSR service module is awaited.

12. The method according to claims 7-11, characterized in that additionally: after processing the client's speech and extracting semantic tags, the Customer's IT system is accessed, using the interaction server (CORE), through the Voice Applications server, and text for speech synthesis is received; using the interaction server (CORE), the text received from the Customer's IT system is transferred to the speech synthesis system (TTS); using the interaction server (CORE), an audio file with a synthesized message is returned to the client terminal according to the information requested by the client.

13. A computer-readable medium for automating customer voice calls to the company's service departments, containing processor-executable instructions that cause hardware to interact to perform the method according to any one of claims 7-12.

14. The machine-readable medium according to claim 13, characterized in that the interaction server (CORE) routes the function of recognizing the client's call to the OSR service module without first contacting the ASR system; using the OSR service module, the operator listens to the transmitted audio segment (VS) and fixes the choice of the correct text recognition option; using the OSR service module, they return to the interaction server (CORE) an array of transcribed text and semantic tags.

15. The machine-readable medium according to claim 13, characterized in that additionally: using the interaction server (CORE), the function of recognizing the client's call is routed to the ASR system; the probability of correct sound recognition is assessed and the transcription is performed using the ASR system; the level of confidence in the recognition of the voice segment (VS) is evaluated using the interaction server (CORE); at a trust level above the minimum set, the text and the required grammar are transferred to the Semantic service to extract semantic tags; semantic tags are extracted from the transferred text according to the specified grammar by the Semantic service; an array of transcribed text and semantic tags is transmitted by means of the interaction server (CORE) to the Voice Applications server; if the trust level is below the minimum set, a negative response is generated to the Voice Applications server.

16. The machine-readable medium according to claim 13, characterized in that additionally: using the interaction server (CORE), the audio segment (VS) is sent sequentially to the ASR system and, after receiving the results of automatic recognition, to the OSR service module; using the OSR service module, the operator listens to the audio segment (VS) and checks the results of the automatic recognition of the client's speech; the results of automatic audio-to-text transcription by the ASR system are confirmed or corrected in the OSR service module; using the OSR service module, an array of transcribed text and semantic tags is sent to the interaction server (CORE).

17. The machine-readable medium according to claim 13, characterized in that additionally: the audio segment (VS) is sent simultaneously to both the ASR system and the OSR service module; when a response is received from the ASR system or from the OSR service module, the interaction server (CORE) evaluates the received responses in the following order: if the first response comes from the OSR service module, then the recognition result and semantic tags of the OSR service module are transmitted to the Voice Application server; if the first response comes from the ASR system and the transmitted text recognition probability is greater than the level specified in the interaction server (CORE), then the result of automatic recognition by the ASR system is transmitted to the Voice Application server; if the response from the ASR system comes first in time and the probability of recognizing the text is less than the level specified in the interaction server (CORE), then a response from the OSR service module is awaited.

18. The machine-readable medium according to claims 13-17, characterized in that additionally: after processing the client's speech and extracting semantic tags, the Customer's IT system is accessed, using the interaction server (CORE), through the Voice Applications server, and text for speech synthesis is received; using the interaction server (CORE), the text received from the Customer's IT system is transferred to the speech synthesis system (TTS); using the interaction server (CORE), an audio file with a synthesized message is returned to the client terminal according to the information requested by the client.