RU2638003C1

RU2638003C1 - Method of tasks distribution between servicing robots and means of cyberphysical intelligent space with multimodal user service

Info

Publication number: RU2638003C1
Application number: RU2016146203A
Authority: RU
Inventors: Андрей Леонидович Ронжин; Антон Игоревич Савельев
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2017-12-08

Abstract

FIELD: information technology.

SUBSTANCE: the method of multimodal user service, performed by an application server, includes the steps of: receiving, over a control path, an indication that a gesture was recognized on the basis of video data sent from the client device to the gesture server; sending to the client device a message that contains the gesture recognition result and instructs the client device to update its visual display in accordance with that result; receiving from the client device an indication that the visual display has been updated in accordance with the recognition result, and sending a corresponding message to the gesture server; and sending to the service robot a message with a speech recognition result that instructs the service robot to execute the corresponding program of actions.

EFFECT: increased reliability of interaction between service robots.

3 cl, 5 dwg

Description

The invention relates to cyber-physical intelligent spaces, and in particular to distributed multimodal applications used when users are served by the facilities of such spaces, including robotic ones.

Service robots and client devices (mobile and stationary) are considered as the facilities of a cyber-physical intelligent space (CPIS) (Fig. 1). For greater convenience and ease of use, client devices include a multimodal user interface. Each modality in such an interface is handled by a combination of hardware and software associated with a particular type of information that a person perceives and/or generates.

For example, the visual modality of the user interface can take input from a video camera and produce output on a screen, using the associated hardware and the software that generates the visual display. The voice modality of the user interface can be handled by a microphone, a speaker, and the associated hardware and software, so as to receive and digitize speech and/or output audio data (for example, audio prompts or other audio information).

A multimodal user interface can be implemented in conjunction with an application that runs in a networked (cloud) environment, for example a client-server environment. In that case the user interacts with the multimodal user interface on a client device (for example, a cell phone or a computer), and the client device communicates over a network with one or more other devices or platforms (for example, a server).

The term "server" means a data processing entity, electronic device, or application that performs services for one or more network-connected clients or other servers in response to client-issued or server-issued requests. The term "application server" (AS) means a server adapted to initiate the establishment of the data and control links associated with a distributed multimodal application session, and to control synchronization between the various "views" associated with that session. The term "modality server" means a server adapted to run a server-side application component associated with one or more user interface modalities. The term "voice server" (VS) means a modality server specifically adapted to run a server-side application component associated with the voice modality. The term "gesture server" (GS) means a modality server specifically adapted to run a server-side application component associated with the visual modality, in particular with gestures.
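The relationship between these server roles can be illustrated with a minimal Python sketch. All class, field, and function names here are hypothetical and chosen only for illustration; the patent does not prescribe any data model.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    VOICE = "voice"
    VISUAL = "visual"


@dataclass(frozen=True)
class ModalityServer:
    """A server that runs the server-side component for some modalities."""
    name: str
    modalities: frozenset


@dataclass
class ApplicationServer:
    """Coordinates a session and knows which modality servers take part in it."""
    modality_servers: list

    def server_for(self, modality):
        # Return the first modality server that handles the given modality.
        for server in self.modality_servers:
            if modality in server.modalities:
                return server
        raise LookupError(f"no modality server handles {modality.value}")


# A voice server and a gesture server are both modality servers,
# each specialized for a single modality.
voice_server = ModalityServer("voice server", frozenset({Modality.VOICE}))
gesture_server = ModalityServer("gesture server", frozenset({Modality.VISUAL}))
app_server = ApplicationServer([voice_server, gesture_server])
```

In this sketch, `app_server.server_for(Modality.VISUAL)` returns the gesture server, which mirrors the definition of a gesture server as the modality server for the visual modality.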

However, the limited computing and information resources of service robots make it impossible to equip them with a multimodal user interface. With the development of cyber-physical systems and cloud robotics, and with the adoption of the so-called ecological approach, in which mobile robots are assigned only those specialized functions that stationary devices in the environment cannot perform, the ways in which people interact and work with client devices and service robots have changed fundamentally.

An important aspect of organizing the operation of such cyber-physical intelligent spaces is the distribution of tasks between service robots and client devices in multimodal user service.

The work [Beilei Sun, Xi Li, Bo Wan, Chao Wang, Xuehai Zhou, Xianglan Chen, Definitions of Predictability for Cyber Physical Systems, Journal of Systems Architecture (2016), doi: 10.1016/j.sysarc.2016.01.007] raises the problem of the predictability of cyber-physical systems (CPS). In queuing systems that operate in real time, the accuracy and the period of the forecast become critical. Most often, what is predicted is the task execution time and the resources required to complete the tasks, using the minimum/maximum task execution time metric [L. Thiele and R. Wilhelm. Design for Timing Predictability. Real-Time Systems, 28:157-177, 2004; R. Kirner and P. Puschner. Time-Predictable Computing. In Software Technologies for Embedded and Ubiquitous Systems, pages 23-34. Springer, 2011].

The work [Giovanni Merlino, Stamatis Arkoulis, Salvatore Distefano, Chrysa Papagianni, Antonio Puliafito, Symeon Papavassiliou. Mobile crowdsensing as a service: A platform for applications on top of sensing Clouds. Future Generation Computer Systems 56 (2016) 623-639] discusses the development of a crowdsensing platform that uses users' mobile devices to assess social dynamics and provide personalized services. A new cloud service, SAaaS (Sensing and Actuation as a Service), is proposed that performs sensing and actuation tasks for the user by means of mobile client devices and cloud resources [S. Distefano, G. Merlino, A. Puliafito, Sensing and actuation as a service: A new development for clouds, in: Proceedings of the 2012 IEEE 11th International Symposium on Network Computing and Applications, NCA '12, IEEE Computer Society, Washington, DC, USA, 2012, pp. 272-275]. The tasks addressed by mobile crowdsensing systems are presented in more detail in the survey [R. Ganti, F. Ye, H. Lei, Mobile crowdsensing: current state and future challenges, IEEE Commun. Mag. 49(11) (2011) 32-39].

The work [Byung-Cheol Min, Yongho Kim, Sangjun Lee, Jin-Woo Jung, Eric T. Matson. Finding the optimal location and allocation of relay robots for building a rapid end-to-end wireless communication. Ad Hoc Networks Vol. 39. 2016. pp. 23-44] considers the use of mobile robots to build a cascaded communication network in hard-to-reach places or in rescue operation areas where there is no cellular coverage. A genetic algorithm and particle swarm optimization are used to calculate the coordinates of the individual robots and their peer-to-peer links, taking into account the location of obstacles on the terrain.

Although results exist for particular problems in supporting the operation of cyber-physical systems, there are currently no solutions for distributing tasks between service robots and client devices in multimodal user service, and consequently the important problem of ensuring their interaction remains unsolved.

The closest in technical essence to the claimed method, and the one chosen as the prototype, is the method for implementing distributed multimodal applications (patent RU 2491617, dated 27 August 2013), which is performed by an application server and a client device.

The method for implementing distributed multimodal applications, as performed by the application server, consists of: receiving from the voice server, over the application server/voice server control path between the application server and the voice server, an indication that speech was recognized on the basis of uplink audio data sent from the client device to the voice server over the audio data path between the client device and the voice server, where the uplink audio data represent a fragment of the user's active speech received through the voice modality of the client device and the voice server is separate from the application server, and sending to the client device, over the application server/client control path between the application server and the client device, a message that includes the speech recognition result and instructs the client device to update the visual display to reflect that result; sending a multimodal page to the client device over the application server/client control path, where the multimodal page, when interpreted, causes the client device to render a visual display that includes at least one display element for which input can be received by the client device through the visual modality and the voice modality; receiving from the client device, over the application server/client control path, an indication that the client device has begun interpreting the machine code that causes it to render that visual display, and sending a command to the voice server over the application server/voice server control path so that the voice server begins interpreting the speech dialog associated with the machine code being interpreted by the client device; receiving from the client device, over the application server/client control path, an indication that the client device has updated the visual display in accordance with the recognition result, and sending a message to the voice server over the application server/voice server control path to indicate that the client device has updated the visual display; receiving from the client device, over the application server/client control path, an indication that a client-generated event has occurred that warrants an update of the visual display rendered on the client device, sending information to the client device over the application server/client control path instructing it to update the visual display based on the client-generated event, and sending to the voice server, over the application server/voice server control path, a command that includes information indicating the client-generated event.
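The application-server side of the prototype can be summarized as a small event router. The following Python sketch is illustrative only: the class name, the path labels, and the message dictionaries are assumptions made for this example and are not taken from patent RU 2491617.

```python
from dataclasses import dataclass, field


@dataclass
class PrototypeAppServer:
    # Each outbox entry is (control_path, message); both formats are hypothetical.
    outbox: list = field(default_factory=list)

    def _send(self, path, kind, payload=None):
        self.outbox.append((path, {"kind": kind, "payload": payload}))

    def send_page(self, page):
        # Deliver the multimodal page that the client will interpret.
        self._send("app_server/client", "multimodal_page", page)

    def on_page_interpretation_started(self):
        # The client began interpreting the page: start the speech dialog.
        self._send("app_server/voice_server", "start_dialog")

    def on_speech_recognized(self, result):
        # The voice server reported a result: ask the client to display it.
        self._send("app_server/client", "update_display", result)

    def on_display_updated(self):
        # The client confirmed the update: inform the voice server.
        self._send("app_server/voice_server", "display_updated")

    def on_client_event(self, event):
        # A client-generated event updates the display and is
        # forwarded to the voice server to keep both views in sync.
        self._send("app_server/client", "update_display", event)
        self._send("app_server/voice_server", "client_event", event)
```

Note that in this sketch the application server never touches audio data: it only exchanges control messages, while the audio itself travels on a separate path between the client device and the voice server.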

The method for implementing distributed multimodal applications, as performed by the client device, consists of: rendering a visual display based on the interpretation of machine code that causes the client device to render the visual display, where the visual display includes at least one display element for which input can be received by the client device through the visual modality and the voice modality; receiving, through the voice modality, a signal representing a fragment of the user's active speech; digitizing the signal to generate uplink audio data corresponding to one or more of the display elements; sending the uplink audio data to the voice server over the audio data path between the client device and the voice server; receiving a speech recognition result from the application server over the application server/client control path between the application server and the client device, where the speech recognition result is based on a speech recognition process performed by the voice server on the uplink audio data, the audio data path is separate from the application server/client control path, and the voice server is separate from the application server; updating the one or more display elements of the visual display in accordance with the speech recognition result; receiving a multimodal page from the application server over the application server/client control path, where the multimodal page includes the machine code and the visual display is rendered by interpreting the machine code, in the form of markup, in the multimodal page; receiving downlink audio data from the voice server over the audio data path, where the downlink audio data include an audio prompt, and playing the audio prompt on an audio output device of the client device; receiving user input that warrants an update of the visual display rendered on the client device; and, upon receiving the user input, sending to the application server over the application server/client control path an indication that a client-generated event has occurred, and receiving from the application server over the application server/client control path information instructing the client device to update the visual display based on the client-generated event.
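The client-side steps of the prototype can be sketched in the same spirit. This Python example is a simplified illustration under stated assumptions: the class and field names are hypothetical, "digitizing" is reduced to scaling floating-point samples to 16-bit integers, and the two lists stand in for the separate audio data path and control path.

```python
from dataclasses import dataclass, field


@dataclass
class ClientDevice:
    display: dict = field(default_factory=dict)       # display element -> shown value
    audio_path: list = field(default_factory=list)    # uplink audio to the voice server
    control_path: list = field(default_factory=list)  # messages to the application server

    def render_page(self, element_names):
        # Interpret the multimodal page: create an empty display element per name.
        for name in element_names:
            self.display[name] = ""

    def capture_speech(self, element, samples):
        # Digitize samples in [-1.0, 1.0] to 16-bit integers and send them
        # on the audio path, which is kept separate from the control path.
        pcm = [max(-32768, min(32767, round(s * 32767))) for s in samples]
        self.audio_path.append({"element": element, "pcm": pcm})

    def on_recognition_result(self, element, text):
        # The result arrives from the application server; update the
        # display and acknowledge the update on the control path.
        self.display[element] = text
        self.control_path.append({"kind": "display_updated", "element": element})

    def on_user_input(self, element, text):
        # Report a client-generated event to the application server.
        self.control_path.append(
            {"kind": "client_event", "element": element, "value": text}
        )
```

The design point the sketch tries to make visible is the separation of paths: audio goes directly to the voice server, while recognition results and display acknowledgements pass through the application server.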

The prototype method has the following disadvantages:

1) the application server cannot interact with service robots and, as a result, service robots cannot be used for multimodal user service;

2) client devices do not support recognition of the visual modality, in particular gestures, which become especially important when the user indicates a target;

3) there is no mechanism by which service robots can carry out multimodal user service tasks.

The objective of the invention is to develop a method for distributing tasks between service robots and the facilities of a cyber-physical intelligent space in multimodal user service that reduces the computational complexity of the multimodal user service process when users interact with service robots.

In the claimed method, this objective is achieved as follows. The method for distributing tasks between service robots and the facilities of a cyber-physical intelligent space in multimodal user service, performed by the application server, consists of: receiving from the voice server, over the application server/voice server control path between the application server and the voice server, an indication that speech was recognized on the basis of uplink audio data sent from the client device to the voice server over the audio data path between the client device and the voice server, where the uplink audio data represent a fragment of the user's active speech received through the voice modality of the client device and the voice server is separate from the application server, and sending to the client device, over the application server/client control path between the application server and the client device, a message that includes the speech recognition result and instructs the client device to update the visual display to reflect that result; receiving from the client device, over the application server/client control path, an indication that the client device has updated the visual display in accordance with the recognition result, and sending a message to the voice server over the application server/voice server control path to indicate that the client device has updated the visual display; sending a multimodal page to the client device over the application server/client control path, where the multimodal page, when interpreted, causes the client device to render a visual display that includes at least one display element for which input can be received by the client device through the visual modality and the voice modality; receiving from the client device, over the application server/client control path, an indication that a client-generated event has occurred that warrants an update of the visual display rendered on the client device, sending information to the client device over the application server/client control path instructing it to update the visual display based on the client-generated event, and sending to the voice server, over the application server/voice server control path, a command that includes information indicating the client-generated event. In addition, an indication is received from the gesture server, over the application server/gesture server control path between the application server and the gesture server, that a gesture was recognized on the basis of uplink video data sent from the client device to the gesture server over the video data path between the client device and the gesture server. The uplink video data represent a video fragment of the user's active manipulation received through the visual modality of the client device, and the gesture server is separate from the application server and the voice server. A message is then sent to the client device over the application server/client control path between the application server and the client device; the message includes the gesture recognition result and instructs the client device to update the visual display to reflect that result.
After that, an indication is received from the client device along the control path of the application server / client that the client device has updated the visual display in accordance with the recognition result, and a message is sent to the gesture server along the control path of the application server / gesture server to indicate that the client device updated visual display. A message is sent along the control path of the application server / robot to the service robot, which includes the recognition result for speech and which instructs the service robot to execute the corresponding action program. Then, an indication is received from the service robot along the control path of the application server / robot that the service robot has completed the prescribed program of actions in accordance with the recognition result for speech, and a message is sent to the voice server along the control path of the application server / voice server to indicate that The service robot has completed the prescribed action program. A message is sent to the service robot via the control path of the application / robot server, which includes the result of gesture recognition and which instructs the service robot to execute the corresponding action program. Then, an indication is received from the service robot along the control path of the application server / robot that the service robot has completed the prescribed program of actions in accordance with the result of gesture recognition and send a message to the gesture server along the control path of the application server / gesture server to indicate that the service robot Fulfilled the prescribed action program.
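The message sequence handled by the application server can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class `ApplicationServer`, the path labels such as `as/client` and the message type names are hypothetical, and real transport over the control paths is replaced by an in-memory log.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationServer:
    """Toy application server that logs every message it would send."""

    sent: list = field(default_factory=list)     # (control path, message type, payload)
    pending: dict = field(default_factory=dict)  # modality -> latest recognition result

    def _send(self, path: str, kind: str, payload: str = "") -> None:
        self.sent.append((path, kind, payload))

    def on_recognized(self, modality: str, result: str) -> None:
        """A modality server ("voice" or "gesture") reports a recognition result."""
        self.pending[modality] = result
        self._send("as/client", "update_display", result)

    def on_client_updated(self, modality: str) -> None:
        """The client confirms that it has updated the visual display."""
        self._send(f"as/{modality}_server", "display_updated")
        # Hand the same result to the service robot as an action program request.
        self._send("as/robot", "run_action_program", self.pending[modality])

    def on_robot_done(self, modality: str) -> None:
        """The robot confirms that the prescribed action program has finished."""
        self._send(f"as/{modality}_server", "action_program_done")
        del self.pending[modality]


server = ApplicationServer()
server.on_recognized("gesture", "point_left")
server.on_client_updated("gesture")
server.on_robot_done("gesture")
for message in server.sent:
    print(message)
```

The sketch keeps the ordering described above: the modality server is told about the display update before the robot is instructed, and it is told again once the robot reports completion.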

Also in the claimed method, this problem is solved as follows. In the method of distributing tasks between service robots and the means of a cyberphysical intelligent space during multimodal user service, as performed by the client device, the client device renders a visual display based on the interpretation of machine code that instructs the client device to render the visual display, where the visual display includes at least one display element for which input data can be received by the client device through the visual modality and the voice modality; receives a signal representing a fragment of the user's active speech through the voice modality; digitizes this signal to generate uplink audio data corresponding to one or more display elements of said at least one display element; sends the uplink audio data to the voice server over the audio data path between the client device and the voice server; receives a speech recognition result from the application server over the application server/client control path between the application server and the client device, where the speech recognition result is based on the voice server performing a speech recognition process on the uplink audio data, the audio data path is separate from the application server/client control path, and the voice server is separate from the application server; updates said one or more display elements of the visual display in accordance with the speech recognition result; receives a multimodal page from the application server over the application server/client control path, where the multimodal page includes machine code and the visual display is rendered by interpreting the machine code, in the form of markup, in the multimodal page; receives downlink audio data from the voice server over the audio data path, where the downlink audio data include an audio prompt, and plays the audio prompt on an audio output device of the client device; receives user input that warrants an update of the visual display rendered on the client device; based on receiving the user input, sends to the application server, over the application server/client control path, an indication that a client-generated event has occurred, and receives from the application server, over the application server/client control path, information that instructs the client device to update the visual display based on the client-generated event.

Additionally, the client device receives, through the visual modality, a signal representing a video fragment of the user's active manipulation. It then digitizes this signal to generate uplink video data corresponding to one or more display elements of said at least one display element. It sends the uplink video data to the gesture server over the video data path between the client device and the gesture server. It then receives a gesture recognition result from the application server over the application server/client control path between the application server and the client device. The gesture recognition result is based on the gesture server performing a gesture recognition process on the uplink video data, the video data path is separate from the application server/client control path, and the gesture server is separate from the application server and the voice server. After that, the client device updates said one or more display elements of the visual display in accordance with the gesture recognition result.
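The separation between the data paths and the control path on the client side can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `ClientDevice`, its method names and the 8-bit quantization used as a stand-in for digitization are hypothetical and are not specified in the source.

```python
from dataclasses import dataclass, field


@dataclass
class ClientDevice:
    """Toy client: uplink data goes to a modality server, results come from the application server."""

    display: dict = field(default_factory=dict)        # element id -> shown value
    data_path_out: list = field(default_factory=list)  # (modality server, element id, bytes)

    def capture(self, modality: str, element_id: str, samples: list) -> None:
        """Digitize a captured signal and send it uplink over the matching data path."""
        # Assumption: samples lie in [-1.0, 1.0] and are quantized to 8 bits.
        quantized = bytes(
            round((max(-1.0, min(1.0, s)) + 1.0) * 127.5) for s in samples
        )
        server = {"voice": "voice_server", "visual": "gesture_server"}[modality]
        self.data_path_out.append((server, element_id, quantized))

    def on_control_message(self, element_id: str, recognition_result: str) -> str:
        """Apply a recognition result received over the application server/client control path."""
        self.display[element_id] = recognition_result
        return "display_updated"  # indication sent back to the application server


client = ClientDevice()
client.capture("voice", "city", [0.0, 0.5, -0.5])
print(client.data_path_out[0][0])
print(client.on_control_message("city", "Moscow"), client.display)
```

Note that the recognition result never travels back over the audio or video data path; in this sketch it reaches the client only through `on_control_message`.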

Also in the claimed method, this problem is solved as follows. In the method of distributing tasks between service robots and the means of a cyberphysical intelligent space during multimodal user service, as performed by the service robot, the service robot receives from the application server, over the application server/robot control path, a message that includes the speech recognition result and instructs the service robot to execute the corresponding action program. The service robot then executes the action program corresponding to the received speech recognition result and sends to the application server, over the application server/robot control path, an indication that the service robot has executed the prescribed action program in accordance with the speech recognition result. After that, the service robot receives from the application server, over the application server/robot control path, a message that includes the gesture recognition result and instructs the service robot to execute the corresponding action program, and then executes the action program corresponding to the received gesture recognition result. After executing the action program, the service robot sends to the application server, over the application server/robot control path, an indication that the service robot has executed the prescribed action program in accordance with the gesture recognition result.
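The robot-side behavior can be sketched in Python as a lookup from recognition result to action program, followed by a completion indication. The class `ServiceRobot`, the result strings and the `no_program` status are hypothetical; in particular, the source does not say how a result without a matching action program is handled.

```python
from typing import Callable, Dict, List


class ServiceRobot:
    """Toy service robot that maps recognition results to action programs."""

    def __init__(self, action_programs: Dict[str, Callable[[], None]]) -> None:
        self.action_programs = action_programs

    def on_message(self, modality: str, recognition_result: str) -> dict:
        """Run the matching action program and build the indication for the application server."""
        program = self.action_programs.get(recognition_result)
        if program is None:
            # Assumption: an unknown result is reported rather than silently ignored.
            return {"status": "no_program", "modality": modality, "result": recognition_result}
        program()
        return {"status": "done", "modality": modality, "result": recognition_result}


executed: List[str] = []
robot = ServiceRobot({
    "bring water": lambda: executed.append("drive_to_kitchen; fetch_water"),
    "point_left": lambda: executed.append("turn_left"),
})
print(robot.on_message("voice", "bring water"))
print(robot.on_message("gesture", "point_left"))
```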

The new set of essential features makes it possible to achieve the stated technical result through:

- implementation by the service robots of only those specialized functions that cannot be performed by the client devices of the cyberphysical intelligent space during multimodal user service;

- implementation of speech recognition by means of the voice server;

- implementation of gesture recognition by means of the gesture server;

- use of the client devices only for multimodal user input/information;

- use of the application server to distribute tasks among the client devices, service robots, voice server and gesture server during multimodal user service.

The analysis of the prior art established that there are no analogues characterized by a set of features identical to all the features of the claimed method of distributing tasks between service robots and the means of a cyberphysical intelligent space during multimodal user service. Therefore, the claimed invention meets the patentability condition of "novelty".

A search for known solutions in this and related fields of technology, carried out to identify features coinciding with the features that distinguish the claimed object from the prototype, showed that these features do not follow explicitly from the prior art. Nor does the prior art reveal that the effect of the transformations provided for by the essential features of the claimed invention on achieving the stated technical result was known. Therefore, the claimed invention meets the patentability condition of "inventive step".

The claimed invention is illustrated by the following drawings:

- FIG. 1, which shows a generalized diagram of the cyberphysical intelligent space in accordance with the present invention;

- FIG. 2, which shows a flowchart of the sequence of actions implementing the proposed method.

The claimed method can be implemented in a KFIP (FIG. 1) that includes at least one client device 101, at least one service robot 102, a network 103, a voice server 104 (GS), an application server 105 (SP) and a gesture server 106 (SJ). Various data and control paths are established between these KFIP entities, and the entities implement various communication protocols in order to support a distributed multimodal application session within the KFIP. In an embodiment, a distributed multimodal application session includes the interpretation of machine code (for example, machine code associated with the client-side application component 109 and/or with a group of one or more related multimodal pages 110) by the client device 101. An example implementation of the proposed KFIP is an intelligent space with a multimodal interface (RF utility model patent No. 124017 of 10 January 2013).

A mobile phone (smartphone), a personal navigation device or a computer (including a laptop) can be used as the client device 101. In accordance with the present invention, the client device 101 is capable of executing one or more instances of client middleware 107, a client browser 108 and/or a client-side application component 109. Here, middleware means software that provides an interface between software components and/or applications running on separate data processing entities (for example, clients or servers). In the KFIP embodiment (FIG. 1), the client middleware 107 provides an interface between the client browser 108 and/or the client-side application component 109 and one or more servers (for example, the application server 105, the voice server 104 and the gesture server 106) over the network 103.

The client browser 108 accesses machine code (for example, the multimodal page 110) on the client device 101 in conjunction with the client-side application component 109 and further interprets the machine code. In the KFIP embodiment (FIG. 1), the client browser 108 is adapted to access at least one multimodal page 110 and to interpret the machine code (for example, markup, scripts and other information) within the multimodal page 110. Here, the term "multimodal page" means an information set that represents at least one user-interactive display element that can be visually presented on the client device 101 and for which the user can enter information and/or indicate a selection through any of several modalities (for example, the voice modality and the visual modality).

The multimodal page 110 may include, for example, a web page, a document, a file, a form, a list or another type of information set. When interpreted, the multimodal page 110 may cause the client device 101 to render one or more user-interactive display elements. In the present invention, a "user-interactive display element" may include, for example and without limitation, a text input field, a selectable element (for example, a button) and/or interactive text. Along with one or more user-interactive display elements, the multimodal page 110 may also include other information and/or elements, such as text information, images (for example, static or dynamic images), audio data, video, hyperlinks, metadata and scripts.
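One way to picture a multimodal page is as a set of display elements, each of which declares the modalities through which it accepts input. The Python sketch below is an assumption made for illustration only: the class names `DisplayElement` and `MultimodalPage`, the field names and the element kinds are hypothetical, and the source describes the page as markup rather than as objects.

```python
from dataclasses import dataclass, field


@dataclass
class DisplayElement:
    """A user-interactive display element of a multimodal page."""

    element_id: str
    kind: str  # e.g. "text_input", "button", "interactive_text"
    modalities: frozenset = frozenset({"visual", "voice"})
    value: str = ""


@dataclass
class MultimodalPage:
    """An information set holding display elements keyed by their identifiers."""

    elements: dict = field(default_factory=dict)

    def add(self, element: DisplayElement) -> None:
        self.elements[element.element_id] = element

    def set_value(self, element_id: str, value: str, modality: str) -> None:
        """Accept input for an element only through a modality that the element supports."""
        element = self.elements[element_id]
        if modality not in element.modalities:
            raise ValueError(f"{element_id} does not accept {modality} input")
        element.value = value


page = MultimodalPage()
page.add(DisplayElement("city", "text_input"))
page.set_value("city", "Moscow", "voice")   # spoken input
page.set_value("city", "Kazan", "visual")   # typed or gestured input
print(page.elements["city"].value)
```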

According to the present invention, the multimodal page 110 includes markup that can cause the client browser 108 and/or the client-side application component 109 (or other parsing software) to execute one or more embedded or referenced scripts (for example, JavaScript code). In addition, a script may be adapted to cause the client device 101 to issue an asynchronous request to the application server 105.

The client-side application component 109 and/or the multimodal page 110 may be developed using Asynchronous JavaScript and Extensible Markup Language (XML) techniques, commonly known as AJAX.
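The point of an asynchronous request is that the client keeps running while the application server prepares its answer. The sketch below shows the same pattern in Python with `asyncio` rather than in JavaScript; the coroutine names, the request fields and the simulated latency are assumptions made for illustration, and no real network call is performed.

```python
import asyncio


async def application_server(request: dict) -> dict:
    """Stand-in for the application server: answers after a short simulated delay."""
    await asyncio.sleep(0.01)
    return {"element": request["element"], "value": request["value"].title()}


async def client_session() -> list:
    """Issue an asynchronous request and keep working until the response arrives."""
    log = []
    pending = asyncio.create_task(
        application_server({"element": "city", "value": "moscow"})
    )
    # The request is in flight; the client is free to do other work here.
    log.append("display stays responsive")
    response = await pending
    log.append(f"update {response['element']} -> {response['value']}")
    return log


print(asyncio.run(client_session()))
```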

The client browser 108 includes software that parses the machine code (for example, markup) within the multimodal page 110 and/or interfaces with the client-side application component 109 in a manner that enables the client device 101 to render text, images, video, music and/or other information represented or referenced in the machine code and/or in the client-side application component 109. In various embodiments, the client browser 108 may include an HTML/XHTML browser and/or a commercially available browser (for example, Internet Explorer, Mozilla Firefox, Opera and others).

The client device 101 may communicate with the voice server 104, the application server 105 and the gesture server 106 over one or more networks 103.

The service robot 102 is a mobile robot associated with the distributed multimodal application session. The service robot 102 executes one or more instances of service robot middleware 111 and carries out a given action program using the corresponding action program execution means 112.

The service robot middleware 111 provides an interface between the action program execution means 112 and other servers (for example, the application server 105) via a robot-to-server connection 125 and/or the network 103.

The action program execution means 112 are hardware and software that can be invoked by the service robot middleware 111 in order to control the movement of the service robot and/or of its individual parts (units, elements). In operation, the action program execution means 112 may use an action program library 116 or other service robot control resources. For example, the action program execution means 112 can be implemented as shown in RF utility model patent No. 108172 of 10 December 2011.

The network 103 may include, for example, a packet-switched network and/or a circuit-switched network. The network 103 is intended for the exchange of information between the KFIP system entities using any of a number of wired or wireless communication protocols.

The voice server 104 is a modality server that performs the speech processing associated with a distributed multimodal application session. The voice server 104 runs one or more instances of the voice server middleware 114 and of the speech recognition program 115. The speech recognition program 115 can be regarded as the voice-server-side application component, since it forms the server part of the distributed application.

The voice server middleware 114 interfaces the speech recognition program 115 with other servers (for example, the application server 105) and/or with the client device 101 through the server-to-server connection 122 and/or the network 103, respectively.

The speech recognition program 115 is an application that the voice server middleware 114 can invoke to receive audio data, run a speech recognition algorithm on that audio data to determine a speech recognition result (for example, an indication of the recognized speech), and either return the result or indicate that no result was determined. The speech recognition program 115 may operate with one or more speech libraries 116 or with other speech recognition resources (for example, grammars, n-gram sequences, or statistical language models). For example, the recognition program may implement the speech recognition method based on a two-level morphophonemic prefix graph (Russian Federation patent No. 2597498 of 10 September 2016).
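The invoke-and-return contract described above (return a result, or indicate that none was determined) can be sketched as follows. This is only an illustration and not the method of patent No. 2597498: the `decode` callable stands in for a real acoustic decoder, and `RecognitionResult`, `make_recognizer`, and the phrase-set grammar are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class RecognitionResult:
    recognized: bool             # False means "no result was determined"
    text: Optional[str] = None   # recognized phrase, if any


def make_recognizer(decode: Callable[[bytes], str], grammar: Set[str]):
    """Build a recognizer that accepts only phrases listed in the grammar.

    `decode` is a placeholder for the actual speech decoder (assumption).
    """
    def recognize(audio: bytes) -> RecognitionResult:
        hypothesis = decode(audio).strip().lower()
        if hypothesis in grammar:
            return RecognitionResult(True, hypothesis)
        return RecognitionResult(False)

    return recognize
```

The middleware would call `recognize(audio)` and forward either the text or the "not recognized" indication to the application server.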

In an embodiment, the voice server 104 communicates with the application server 105 over the application server/voice server control path 126. In addition, the voice server 104 and the client device 101 can exchange audio data directly over the voice server/client audio data path 122. In the present invention, audio data is carried over the voice server/client audio data path 122 using a version of the Real-time Transport Protocol/RTP Control Protocol (RTP/RTCP), although other embodiments may use other protocols (for example, the Transmission Control Protocol (TCP) and others).
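For the RTP case, the framing of uplink audio can be sketched as below. The 12-byte fixed header layout follows RFC 3550; the payload type 0 (PCMU), the 160-sample frame size, and the function names are assumptions chosen for illustration, since the description does not fix a codec or packet size.

```python
import struct

RTP_VERSION = 2


def pack_rtp_header(seq, timestamp, ssrc, payload_type=0, marker=False):
    """Build the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6                          # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


def packetize(audio, frame_bytes, ssrc, samples_per_frame):
    """Split digitized audio into RTP packets with increasing sequence numbers."""
    packets = []
    for index, offset in enumerate(range(0, len(audio), frame_bytes)):
        header = pack_rtp_header(seq=index,
                                 timestamp=index * samples_per_frame,
                                 ssrc=ssrc)
        packets.append(header + audio[offset:offset + frame_bytes])
    return packets
```

With 8 kHz PCMU, `packetize(audio, 160, ssrc, 160)` yields one packet per 20 ms of speech.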

The application server 105 is implemented to perform various services for the client device 101, in particular to run one or more instances of the application server middleware 117 and of the application server services 118.

In the present invention, the application server middleware 117 interfaces with the voice server 104 through the server-to-server connection 126, with the gesture server 106 through the server-to-server connection 127, and with the client device 101 and the service robot 102 through the network 108. The application server services 118 include session establishment software adapted to initiate the establishment of the various data and control paths 122, 123, 124, 125 between the application server 105, the client device 101, the service robot 102, the voice server 104, and the gesture server 106 in connection with a multimodal application session. The data and control paths associated with a multimodal application session may include, for example, the application server/client control path 123, the application server/robot control path 125, the application server/voice server control path 126, the application server/gesture server control path 127, the voice server/client audio data path 122, and the gesture server/client video data path 124. According to the proposed method, the client device 101 and the application server 105 can exchange information over the application server/client control path 123, and the client device 101 and the voice server 104 can exchange information over the voice server/client audio data path 122, where at least parts of the control path 123 and of the audio data path 122 are established through one or more networks 103. The application server 105 and the voice server 104 can exchange information over the application server/voice server control path 126, at least part of which is established through the server-to-server connection 126, which is implemented over one or more wired or wireless networks or other intermediate entities. In addition, the application server 105 and the gesture server 106 can exchange information over the application server/gesture server control path 127, at least part of which is established through the server-to-server connection 127, which is likewise implemented over one or more wired or wireless networks or other intermediate entities.

In the present invention, the "application server/client control path" denotes any one or more paths through the network 103 (or some other communication medium) over which messages can be exchanged between an IP address and/or port associated with the client device 101 and an IP address and/or port associated with the application server 105. Likewise, the "application server/voice server control path" denotes any one or more paths between an IP address and/or port associated with the application server 105 and an IP address and/or port associated with the voice server 104. The "application server/gesture server control path" denotes any one or more paths between an IP address and/or port associated with the application server 105 and an IP address and/or port associated with the gesture server 106. In addition, the "application server/robot control path" denotes any one or more paths between an IP address and/or port associated with the application server 105 and an IP address and/or port associated with the service robot 102.

Additionally, the "voice server/client audio data path" denotes any one or more paths through the network 103 (or some other communication medium) over which audio data can be exchanged between an IP address and/or port associated with the voice server 104 and an IP address and/or port associated with the client device 101. The "gesture server/client video data path" denotes any one or more paths through the network 103 (or some other communication medium) over which video data can be exchanged between an IP address and/or port associated with the gesture server 106 and an IP address and/or port associated with the client device 101.
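Since each path is defined by a pair of IP address/port endpoints, the path definitions can be modeled as a small table. The sketch below is illustrative: the addresses (taken from the RFC 5737 documentation range), the ports, and the names `Endpoint`, `Path`, and `paths_for` are assumptions; only the reference numerals and the endpoint pairs come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ip: str
    port: int


@dataclass(frozen=True)
class Path:
    numeral: int   # reference numeral used in the description
    name: str
    a: Endpoint
    b: Endpoint


# Hypothetical addresses and ports, for illustration only.
CLIENT = Endpoint("192.0.2.10", 5000)
ROBOT = Endpoint("192.0.2.20", 5000)
VOICE_SERVER = Endpoint("192.0.2.104", 8081)
APP_SERVER = Endpoint("192.0.2.105", 8080)
GESTURE_SERVER = Endpoint("192.0.2.106", 8082)

ALL_PATHS = [
    Path(122, "voice server/client audio data", VOICE_SERVER, CLIENT),
    Path(123, "application server/client control", APP_SERVER, CLIENT),
    Path(124, "gesture server/client video data", GESTURE_SERVER, CLIENT),
    Path(125, "application server/robot control", APP_SERVER, ROBOT),
    Path(126, "application server/voice server control", APP_SERVER, VOICE_SERVER),
    Path(127, "application server/gesture server control", APP_SERVER, GESTURE_SERVER),
]


def paths_for(endpoint, paths=ALL_PATHS):
    """Return the reference numerals of all paths that terminate at `endpoint`."""
    return sorted(p.numeral for p in paths if endpoint in (p.a, p.b))
```

In this model the application server terminates only control paths, while the audio and video data paths bypass it and run directly between the client and the modality servers.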

The gesture server 106 is a modality server that performs the video processing associated with a distributed multimodal application session. The gesture server 106 runs one or more instances of the gesture server middleware 119 and of the gesture recognition program 120. The gesture recognition program 120 can be regarded as the gesture-server-side application component, since it forms the server part of the distributed application.

The gesture server middleware 119 interfaces the gesture recognition program 120 with other servers (for example, the application server 105) and/or with the client device 101 through the server-to-server connection 124 and/or the network 103, respectively.

The gesture recognition program 120 is an application that the gesture server middleware 119 can invoke to receive video data, run a gesture recognition algorithm on that video data in an attempt to determine a gesture recognition result, and either return the result or indicate that no result was determined. The gesture recognition program 120 may operate with one or more gesture libraries 121 or with other gesture recognition resources. As such a library, for example, the multimedia corpus of audiovisual Russian speech may be used (Certificate of State Registration of Database No. 2011620085 of 28 January 2011).

In an embodiment, the gesture server 106 communicates with the application server 105 over the application server/gesture server control path 127. In addition, the gesture server 106 and the client device 101 can exchange video data directly over the gesture server/client video data path 124. In the present invention, video data is carried over the gesture server/client video data path 124 using a version of the Real-time Transport Protocol/RTP Control Protocol (RTP/RTCP), although other embodiments may use other protocols (for example, the Transmission Control Protocol (TCP) and others).

The voice server 104, the application server 105, and the gesture server 106 are distinct from one another in that they run separate processes and exchange control messages that affect the performance of those processes over the application server/voice server control path 126 and the application server/gesture server control path 127, respectively. In addition, the application server/client control path 123 between the client device 101 and the application server 105 differs from the voice server/client audio data path 122 between the client device 101 and the voice server 104 at least in that the client device 101 addresses the application server 105 and the voice server 104 using different addresses (for example, different IP addresses). It differs from the gesture server/client video data path 124 between the client device 101 and the gesture server 106 at least in that the client device 101 addresses the application server 105 and the gesture server 106 using different addresses.

In addition, the client device 101 may exchange control messages with the application server 105 using a communication protocol different from the protocols used to exchange audio data with the voice server 104 and video data with the gesture server 106. In implementing the present invention, the voice server 104, the application server 105, and the gesture server 106 may run on different hardware, which may be co-located or located separately.

In the presented embodiment of the KFIP, the claimed method of distributing tasks between service robots and KFIP means during multimodal user service is implemented as follows (Fig. 2). Consider a scenario in which a KFIP user arrives at an organization and:

1) the stationary client device in the lobby greets the arriving user (displays a welcome screen) and asks the purpose of the visit (block 201);

2) the user says, "I need to see Andrey Leonidovich Ivanov"; the KFIP means recognize the user's speech and output the recognition result to the client device as a floor plan of the organization's premises showing the positions of the arriving user and of the employee A. L. Ivanov (blocks 202-209);

3) the client device presents audiovisual reference information about the employee A. L. Ivanov (blocks 210-211);

4) by input to the client device (touching the touch screen), the user requests the route to A. L. Ivanov, and the client device displays the route on its screen (blocks 212-217);

5) the client device informs the KFIP user that a service robot will escort him or her to A. L. Ivanov (block 218);

6) the user asks, "This robot?", pointing at the service robot; the KFIP means recognize the gesture, and an image and the technical specifications of the service robot are displayed on the client device (blocks 219-226);

7) the service robot, executing the prescribed action program, escorts the user to A. L. Ivanov (blocks 227-231);

8) on arriving at the destination, the user thanks the service robot with a nod; the KFIP means recognize the gesture, and the service robot returns to its starting point (the lobby of the organization) to perform other tasks (blocks 232-236).
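The correspondence between the eight scenario steps and the flowchart blocks can be kept as a small lookup table. The sketch below is illustrative only: the names `SCENARIO_STEPS` and `step_for_block` and the short step descriptions are assumptions, and only the block ranges come from the list above.

```python
# Each entry: (blocks covered by the step, short description of the step).
# range(a, b) covers blocks a .. b-1.
SCENARIO_STEPS = [
    (range(201, 202), "welcome screen"),
    (range(202, 210), "speech recognition and floor plan display"),
    (range(210, 212), "audiovisual information about the employee"),
    (range(212, 218), "touch input and route display"),
    (range(218, 219), "announcement of the escort robot"),
    (range(219, 227), "pointing gesture recognition and robot information"),
    (range(227, 232), "robot escorts the user"),
    (range(232, 237), "nod recognition and robot return"),
]


def step_for_block(block):
    """Return (step number, description) for a flowchart block number."""
    for number, (blocks, description) in enumerate(SCENARIO_STEPS, start=1):
        if block in blocks:
            return number, description
    raise KeyError(f"block {block} is not part of the scenario")
```

For instance, `step_for_block(205)` resolves to step 2, the speech recognition step.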

The implementation of the claimed method is described in more detail below.

In block 201, a visual display is rendered by interpreting machine code that causes the client device 101 to render the visual display. The visual display includes at least one display element for which input data is receivable by the client device 101 through a visual modality and a voice modality.

In block 202, a signal representing an utterance of the user is received through the voice modality. This signal is digitized (block 203) to generate uplink audio data corresponding to one or more display elements of the at least one display element.
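The digitization of block 203 can be illustrated with a simple quantizer. The description does not specify a sample format, so the 16-bit signed little-endian PCM used here, the input range of -1.0 to 1.0, and the name `quantize_pcm16` are all assumptions.

```python
import struct


def quantize_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.

    Out-of-range samples are clipped before quantization.
    """
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)
```

The resulting byte string is what would be sent as uplink audio data in block 204.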

In block 204, the uplink audio data is sent to the voice server over the audio data path between the client device and the voice server.

In block 205, an indication that speech has been recognized is received from the voice server over the application server/voice server control path between the application server and the voice server. The recognition is based on the uplink audio data sent from the client device to the voice server over the audio data path between them. The uplink audio data represents the user utterance received through the voice modality of the client device, and the voice server is separate from the application server.

In block 206, a message is sent to the client device over the application server/client control path between the application server and the client device. The message includes the speech recognition result and instructs the client device to update the visual display to reflect that result.

In block 207, the speech recognition result is received from the application server over the application server/client control path between the application server and the client device. The result is based on the voice server performing a speech recognition process on the uplink audio data, and the audio data path is separate from the application server/client control path.

In block 208, the one or more display elements of the visual display are updated in accordance with the speech recognition result.

In block 209, an indication that the client device has updated the visual display in accordance with the recognition result is received from the client device over the application server/client control path, and a message is sent to the voice server over the application server/voice server control path to indicate that the client device has updated the visual display.
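The message sequence of blocks 204-209 can be traced with a small in-memory simulation. It is a sketch under stated assumptions: recognition is assumed to succeed, `recognize` is a placeholder for the speech recognition program, and the message dictionary layout and the name `voice_flow` are hypothetical.

```python
def voice_flow(uplink_audio, recognize):
    """Return the ordered message log for blocks 204-209 and the client's display state."""
    log = []

    def send(block, sender, receiver, path, payload):
        log.append({"block": block, "from": sender, "to": receiver,
                    "path": path, "payload": payload})

    # Block 204: audio goes directly to the voice server on the audio data path.
    send(204, "client", "voice server", "audio data path", uplink_audio)
    result = recognize(uplink_audio)  # stand-in for the speech recognition program

    # Block 205: the voice server reports the recognition to the application server.
    send(205, "voice server", "application server",
         "application server/voice server control path", result)
    # Block 206: the application server forwards the result to the client.
    send(206, "application server", "client",
         "application server/client control path", result)

    # Blocks 207-208: the client receives the result and updates its display.
    display = {"recognition_result": result}

    # Block 209: the update is confirmed back along both control paths.
    send(209, "client", "application server",
         "application server/client control path", "display updated")
    send(209, "application server", "voice server",
         "application server/voice server control path", "display updated")
    return log, display
```

The log makes the separation of paths visible: the application server never appears on the audio data path, and the recognition result reaches the client only through the application server.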

In block 210, a multimodal page is sent to the client device over the application server/client control path. When interpreted, the multimodal page causes the client device to render a visual display that includes at least one display element for which input data is receivable by the client device through the visual modality and the voice modality.

In block 211, the multimodal page is received from the application server over the application server/client control path. The multimodal page includes machine code, and the visual display is rendered by interpreting the machine code, which takes the form of markup in the multimodal page.
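Interpreting markup to find the display elements that accept both visual and voice input might look like the sketch below. The description does not define the markup language of the multimodal page, so the `<page>`/`<field>` vocabulary, the `modalities` attribute, and the function name are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical multimodal page; the real markup format is not given in the description.
PAGE = """<page>
  <field id="destination" modalities="visual voice" label="Whom are you visiting?"/>
  <field id="route" modalities="visual" label="Show route"/>
</page>"""


def multimodal_elements(markup):
    """Return ids of display elements that accept both visual and voice input."""
    root = ET.fromstring(markup)
    return [field.get("id") for field in root.iter("field")
            if {"visual", "voice"} <= set(field.get("modalities", "").split())]
```

Here only the `destination` field qualifies as a display element of the kind described for blocks 201 and 210.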

In block 212, user input is received that warrants an update of the visual display rendered on the client device.

On receipt of the user input, in block 213 an indication that a client-generated event has occurred is sent to the application server over the application server/client control path.

In block 214, an indication is received from the client device over the application server/client control path that a client-generated event has occurred that warrants an update of the visual display rendered on the client device.

In block 215, information is sent to the client device over the application server/client control path to instruct the client device 101 to update the visual display on the basis of the client-generated event.

In block 216, a command that includes information identifying the client-generated event is sent to the voice server over the application server/voice server control path.

In block 217, information is received from the application server over the application server/client control path that instructs the client device to update the visual display on the basis of the client-generated event.
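The handling of a client-generated event in blocks 212-217 can be sketched with two cooperating objects. The class names, the message dictionaries, and the in-memory queues that stand in for the control paths are assumptions made for illustration.

```python
class ApplicationServer:
    def __init__(self):
        self.to_client = []        # queue standing in for the application server/client control path
        self.to_voice_server = []  # queue standing in for the application server/voice server control path

    def on_client_event(self, event):
        """Block 214: handle an indication of a client-generated event."""
        # Block 215: instruct the client to update its visual display.
        self.to_client.append({"type": "update_display", "event": event})
        # Block 216: tell the voice server which event occurred.
        self.to_voice_server.append({"type": "client_event", "event": event})


class ClientDevice:
    def __init__(self, server):
        self.server = server
        self.display = {}

    def touch(self, element_id):
        """Blocks 212-213: accept user input and report the event."""
        self.server.on_client_event({"element": element_id, "action": "touch"})

    def poll(self):
        """Block 217: apply the display updates received from the application server."""
        while self.server.to_client:
            message = self.server.to_client.pop(0)
            self.display[message["event"]["element"]] = "updated"
```

Notifying the voice server in block 216 keeps its view of the session consistent with the display even though the input arrived through the visual modality.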

In block 218, downlink audio data is received from the voice server over the audio data path. The downlink audio data includes an audio prompt, which is played on the audio output device of the client device.

At block 219, a signal representing a video fragment of the user's active manipulation is received through the visual modality. This signal is digitized (block 220) to generate uplink video data corresponding to one or more of said at least one display element.

At block 221, the uplink video data is sent to the gesture server over the video data path between the client device and the gesture server.

At block 222, an indication is received from the gesture server, over the application server/gesture server control path between the application server and the gesture server, that a gesture has been recognized based on uplink video data sent from the client device to the gesture server over the video data path between the client device and the gesture server. The uplink video data represents a video fragment of the user's active manipulation received through the visual modality of the client device, and the gesture server is separate from the application server and the voice server.

At block 223, a message is sent to the client device over the application server/client control path between the application server and the client device. The message includes the gesture recognition result and instructs the client device to update the visual display to reflect the recognition result.

At block 224, the gesture recognition result is received from the application server over the application server/client control path between the application server and the client device. The gesture recognition result is based on the gesture server performing a gesture recognition process on the uplink video data. The video data path is separate from the application server/client control path, and the gesture server is separate from the application server and the voice server.

At block 225, said one or more display elements of the visual display are updated in accordance with the gesture recognition result.

At block 226, an indication is received from the client device over the application server/client control path that the client device has updated the visual display in accordance with the recognition result, and a message is sent to the gesture server over the application server/gesture server control path to indicate that the client device has updated the visual display.
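The gesture round trip in blocks 219 to 226 keeps bulk video on its own data path while every recognition result and acknowledgement travels on a control path through the application server. The sketch below traces that message order under stated assumptions: the path labels, the `send` helper, and the toy `recognize_gesture` function are hypothetical, and a real gesture server would run an actual recognition process rather than inspect the first byte.

```python
log = []  # (path, sender, receiver, message kind), in the order sent


def send(path, sender, receiver, kind, payload=None):
    """Record one message on the named path and hand the payload through."""
    log.append((path, sender, receiver, kind))
    return payload


def recognize_gesture(video: bytes) -> str:
    # Hypothetical recognizer used only to make the sketch runnable.
    return "point_left" if video.startswith(b"L") else "unknown"


def gesture_round_trip(video: bytes) -> dict:
    display = {"selected": None}
    # Blocks 219-221: the client digitizes the signal and sends uplink video
    # on the video data path, which bypasses the application server.
    frames = send("video", "client", "gesture_server", "uplink_video", video)
    result = recognize_gesture(frames)
    # Block 222: the gesture server reports recognition on its control path.
    send("as/gesture", "gesture_server", "app_server", "gesture_recognized", result)
    # Blocks 223-224: the application server forwards the result to the client.
    send("as/client", "app_server", "client", "recognition_result", result)
    # Block 225: the client updates the display element.
    display["selected"] = result
    # Block 226: the client confirms, and the application server notifies
    # the gesture server that the display was updated.
    send("as/client", "client", "app_server", "display_updated")
    send("as/gesture", "app_server", "gesture_server", "display_updated")
    return display


final_display = gesture_round_trip(b"L-frame-data")
video_messages = [m for m in log if m[0] == "video"]
print(final_display, len(video_messages))
```

In this trace only one message crosses the video data path, and the application server appears on neither end of it, which is the separation of paths that blocks 222 and 224 describe.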

At block 227, a message is sent to the service robot over the application server/robot control path. The message includes the speech recognition result and instructs the service robot to execute the corresponding action program.

At block 228, a message is received from the application server over the application server/robot control path. The message includes the speech recognition result and instructs the service robot to execute the corresponding action program.

The action program corresponding to the received speech recognition result is executed (block 229).

At block 230, an indication is sent to the application server over the application server/robot control path that the service robot has executed the prescribed action program in accordance with the speech recognition result.

At block 231, an indication is received from the service robot over the application server/robot control path that the service robot has executed the prescribed action program in accordance with the speech recognition result, and a message is sent to the voice server over the application server/voice server control path to indicate that the service robot has executed the prescribed action program.

At block 232, a message is sent to the service robot over the application server/robot control path. The message includes the gesture recognition result and instructs the service robot to execute the corresponding action program.

At block 233, a message is received from the application server over the application server/robot control path. The message includes the gesture recognition result and instructs the service robot to execute the corresponding action program, after which the action program is executed.

The action program corresponding to the received gesture recognition result is executed (block 234).

At block 235, an indication is sent to the application server over the application server/robot control path that the service robot has executed the prescribed action program in accordance with the gesture recognition result.

At block 236, an indication is received from the service robot over the application server/robot control path that the service robot has executed the prescribed action program in accordance with the gesture recognition result, and a message is sent to the gesture server over the application server/gesture server control path to indicate that the service robot has executed the prescribed action program.
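Blocks 227 to 231 (speech) and 232 to 236 (gesture) follow the same pattern: the application server dispatches a recognition result to the service robot, the robot runs the matching action program and reports completion, and the application server notifies whichever server produced the result. A minimal sketch of that pattern follows. The action-program table, message fields, and class names are hypothetical; the description does not say how action programs are stored or selected.

```python
# Hypothetical mapping from (modality, recognition result) to an action program.
ACTION_PROGRAMS = {
    ("speech", "bring water"): ["go_to_kitchen", "pick_up_bottle", "go_to_user"],
    ("gesture", "point_left"): ["turn_left", "move_forward"],
}

# Which recognition server the application server notifies on completion.
RECOGNIZER_FOR = {"speech": "voice_server", "gesture": "gesture_server"}


class ServiceRobot:
    """Illustrative robot that records the steps it has executed."""

    def __init__(self) -> None:
        self.executed = []

    def handle(self, message: dict) -> dict:
        # Blocks 228/233: receive the message from the application server.
        key = (message["modality"], message["result"])
        # Blocks 229/234: execute the corresponding action program
        # (an unknown result maps to an empty program in this sketch).
        program = ACTION_PROGRAMS.get(key, [])
        self.executed.extend(program)
        # Blocks 230/235: report completion to the application server.
        return {"type": "program_done", "modality": message["modality"],
                "steps": len(program)}


def dispatch(robot: ServiceRobot, modality: str, result: str):
    # Blocks 227/232: send the recognition result to the service robot.
    ack = robot.handle({"modality": modality, "result": result})
    # Blocks 231/236: pick the server to notify about the completed program.
    return RECOGNIZER_FOR[ack["modality"]], ack


robot = ServiceRobot()
notified_speech, ack_speech = dispatch(robot, "speech", "bring water")
notified_gesture, ack_gesture = dispatch(robot, "gesture", "point_left")
print(notified_speech, notified_gesture, robot.executed)
```

Routing the completion notice by modality reflects that block 231 addresses the voice server while block 236 addresses the gesture server.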

The sequence of operations presented above supports the conclusion that applying the claimed method for distributing tasks between service robots and means of the cyberphysical intelligent space (KFIP) during multimodal user servicing makes it possible to reduce the computational complexity of multimodal user servicing when users interact with service robots.

Claims

1. A method for distributing tasks between service robots and means of a cyberphysical intelligent space during multimodal user servicing, performed by an application server, comprising: receiving from a voice server, over an application server/voice server control path between the application server and the voice server, an indication that speech has been recognized based on uplink audio data sent from a client device to the voice server over an audio data path between the client device and the voice server, the uplink audio data representing a fragment of the user's active speech received through the voice modality of the client device, the voice server being separate from the application server; sending to the client device, over an application server/client control path between the application server and the client device, a message that includes the speech recognition result and instructs the client device to update the visual display to reflect the recognition result; receiving from the client device, over the application server/client control path, an indication that the client device has updated the visual display in accordance with the recognition result, and sending a message to the voice server over the application server/voice server control path to indicate that the client device has updated the visual display; sending a multimodal page to the client device over the application server/client control path, the multimodal page, when interpreted, instructing the client device to render a visual display that includes at least one display element for which input data is received by the client device through the visual modality and the voice modality; receiving from the client device, over the application server/client control path, an indication that a client-generated event has occurred that serves as the basis for updating the visual display rendered on the client device, sending information to the client device over the application server/client control path to instruct the client device to update the visual display based on the client-generated event, and sending to the voice server, over the application server/voice server control path, a command that includes information indicating the client-generated event; characterized in that the method further comprises: receiving from a gesture server, over an application server/gesture server control path between the application server and the gesture server, an indication that a gesture has been recognized based on uplink video data sent from the client device to the gesture server over a video data path between the client device and the gesture server, the uplink video data representing a video fragment of the user's active manipulation received through the visual modality of the client device, the gesture server being separate from the application server and the voice server; sending to the client device, over the application server/client control path between the application server and the client device, a message that includes the gesture recognition result and instructs the client device to update the visual display to reflect the recognition result; receiving from the client device, over the application server/client control path, an indication that the client device has updated the visual display in accordance with the recognition result, and sending a message to the gesture server over the application server/gesture server control path to indicate that the client device has updated the visual display; sending to a service robot, over an application server/robot control path, a message that includes the speech recognition result and instructs the service robot to execute the corresponding action program; receiving from the service robot, over the application server/robot control path, an indication that the service robot has executed the prescribed action program in accordance with the speech recognition result, and sending a message to the voice server over the application server/voice server control path to indicate that the service robot has executed the prescribed action program; sending to the service robot, over the application server/robot control path, a message that includes the gesture recognition result and instructs the service robot to execute the corresponding action program; and receiving from the service robot, over the application server/robot control path, an indication that the service robot has executed the prescribed action program in accordance with the gesture recognition result, and sending a message to the gesture server over the application server/gesture server control path to indicate that the service robot has executed the prescribed action program.

2. A method for distributing tasks between service robots and means of a cyberphysical intelligent space during multimodal user servicing, performed by a client device, comprising: rendering a visual display based on interpreting machine code that instructs the client device to render the visual display, the visual display including at least one display element for which input data is received by the client device through the visual modality and the voice modality; receiving a signal representing a fragment of the user's active speech through the voice modality; digitizing this signal to generate uplink audio data corresponding to one or more of said at least one display element; sending the uplink audio data to a voice server over an audio data path between the client device and the voice server; receiving a speech recognition result from an application server over an application server/client control path between the application server and the client device, the speech recognition result being based on the voice server performing a speech recognition process on the uplink audio data, the audio data path being separate from the application server/client control path, and the voice server being separate from the application server; updating said one or more display elements of the visual display in accordance with the speech recognition result; receiving a multimodal page from the application server over the application server/client control path, the multimodal page including the machine code, and the visual display being rendered by interpreting the machine code in the form of markup in the multimodal page; receiving downlink audio data from the voice server over the audio data path, the downlink audio data including an audio prompt, and playing the audio prompt on an audio output device of the client device; accepting user input that serves as the basis for updating the visual display rendered on the client device; based on receiving the user input, sending to the application server, over the application server/client control path, an indication that a client-generated event has occurred, and receiving from the application server, over the application server/client control path, information that instructs the client device to update the visual display based on the client-generated event; characterized in that the method further comprises: receiving a signal representing a video fragment of the user's active manipulation through the visual modality; digitizing this signal to generate uplink video data corresponding to one or more of said at least one display element; sending the uplink video data to a gesture server over a video data path between the client device and the gesture server; receiving a gesture recognition result from the application server over the application server/client control path between the application server and the client device, the gesture recognition result being based on the gesture server performing a gesture recognition process on the uplink video data, the video data path being separate from the application server/client control path, and the gesture server being separate from the application server and the voice server; updating said one or more display elements of the visual display in accordance with the gesture recognition result; receiving from the client device, over the application server/client control path, an indication that the visual display has been updated in accordance with the gesture recognition result, and sending a message; and then sending to the service robot, over the application server/robot control path, a message that includes the gesture recognition result and instructs the service robot to execute the corresponding action program.

3. A method for distributing tasks between service robots and means of a cyberphysical intelligent space during multimodal user servicing, performed by a service robot, wherein: the visual display on the client device is updated in accordance with the speech recognition result and a message that includes the speech recognition result is sent to the service robot over the application server/robot control path, after which the service robot receives from the application server, over the application server/robot control path, a message that includes the speech recognition result and instructs the service robot to execute the corresponding action program, executes the action program corresponding to the received speech recognition result, and sends to the application server, over the application server/client control path, an indication that the service robot has executed the prescribed program;

the visual display on the client device is updated in accordance with the gesture recognition result and a message that includes the gesture recognition result is sent to the service robot over the application server/robot control path, after which the service robot receives a message that includes the gesture recognition result and instructs the service robot to execute the corresponding action program, executes the action program corresponding to the received gesture recognition result, and sends to the application server, over the application server/client control path, an indication that the service robot has executed the prescribed program.