WO2023234794A1

WO2023234794A1 - Video conferencing system and method

Info

Publication number: WO2023234794A1
Application number: PCT/RU2022/000184
Authority: WO
Inventors: Иван Евгеньевич ЕГОРОВ; Михаил Миннимухаметович НАСЫРОВ; Александр Николаевич БОЛЬШАКОВ; Даниэл Игоревич СЕРГЕЕВ; Михаил Викторович ФАНДЮШИН; Герман Эдуардович НОВИКОВ; Дмитрий Леонидович БАЛИЕВ
Original assignee: Публичное Акционерное Общество "Сбербанк России"
Priority date: 2022-05-30
Filing date: 2022-06-01
Publication date: 2023-12-07

Abstract

The claimed solution relates to the field of telecommunications and computer technologies. A video conferencing system (100) comprises: client devices (101, 102) capable of connecting to a messaging module; and a video conferencing server (104) containing a messaging module, a streaming module, and a stream recognition module. The messaging module is configured to be capable of registering the connection of client devices to a conference and sending messages from client devices to a shared information space. The streaming module is configured to be capable of establishing a secure connection, receiving a data stream and transmitting same in parallel to client devices and to the stream recognition module. The stream recognition module is configured to be capable of processing a data stream and generating a message, said actions being performed in parallel with the transmission of the data stream to a client device by the streaming module. The solution is directed toward enabling parallel output of a data stream to a client device and recognition of said data stream to generate a message for output to a shared information space.

Description

SYSTEM AND METHOD FOR VIDEO CONFERENCE COMMUNICATION

TECHNICAL FIELD

[0001] The claimed solution relates to the field of telecommunications and computer technology, in particular, to telecommunication technology for the interaction of two or more remote users, in which it is possible for them to exchange audio and video information in real time.

BACKGROUND OF THE ART

[0002] Many different video conferencing systems (VCC) are known from the prior art, from both Russian and foreign developers.

[0003] The best known of them are:

• Zoom

• Cisco Webex

• MS Teams

• Google Meet

• TrueConf

• IVA

• Videomost.

[0004] The main disadvantages of most solutions include the inability to simultaneously work with speech and text communication during a video conference, in particular, the lack of a barrier-free environment for simultaneous communication during a video conference with participants with hearing impairments.

[0005] Patent application US 20120306993 A1 "System and Method for Teleconferencing" (rights holder: Visionary Mobile Corporation, published 12/06/2012) discloses a system for conducting video conferencing over a network for multiple clients using a server as an intermediary. The server receives a real-time video stream from each client. This real-time video stream may be a high-frequency stream representing video from a camera associated with the end user. The server provides any client with a real-time video stream selected by that client from among the video streams received on the server.

[0006] The disadvantage of this solution is that it cannot recognize the video stream in order to generate a message provided to the user. This, in turn, makes it difficult for video conference participants who were distracted for some time during the discussion to get back into its context.

[0007] In addition, in the prior art as a whole, it is difficult for video conference participants who were distracted for some time during the discussion to get back into its context.

SUMMARY OF THE INVENTION

[0008] The claimed invention allows us to solve the technical problem of providing the ability to extract information from a data stream with its subsequent conversion into messages and their integration into a single information space.

[0009] The technical result is to provide functionality for parallel output of a data stream to at least one client device and recognition of this data stream to generate a message for output to a single information space.

[0010] The claimed technical result is achieved through the implementation of a video conferencing system containing:

• at least two client devices configured to connect to the messaging module; and

• a video conferencing server containing at least a messaging module, a stream forwarding module and a stream recognition module, wherein

the messaging module is configured to

- register connections of client devices to the conference,

- send at least one message received from at least one client device to a single information space,

the stream forwarding module is configured to

- establish a secure connection with client devices,

- receive a data stream from at least one client device,

- transmit the data stream in parallel to at least one other client device and to the stream recognition module,

- receive at least one generated message from the stream recognition module and send it to the single information space,

the stream recognition module is configured to process the data stream and generate at least one message based on the processed data stream, wherein processing the data stream and generating the at least one message are carried out in parallel with the transmission of the data stream by the stream forwarding module to at least one other client device.

[0011] In one particular implementation example, the data stream received from at least one client device contains at least an audio and/or video stream.

[0012] In another particular implementation example, processing the audio stream includes converting speech to text (Automatic Speech Recognition).

[0013] In another particular implementation example, video stream processing includes converting gestures into text and/or graphic information (Gesture Recognition).

[0014] In another particular implementation example, video stream processing includes Sign Language Recognition.

[0015] In another particular embodiment, the stream recognition module is further configured to determine the type of the data stream and perform appropriate processing of the data stream depending on the type of the data stream.

[0016] In another particular implementation example, the video conferencing server further comprises an audio stream filtering module configured to suppress noise.

[0017] In another particular embodiment, messages generated based on the processed data stream contain at least text and/or graphic information.

[0018] In another particular implementation example, the video conferencing server further comprises a video stream modification module configured to perform

- background replacement,

- application of filters that improve the visual characteristics of the user’s image,

- application of augmented reality filters (AR filters).

[0019] In another particular implementation example, the stream forwarding module is further configured to receive a command from at least one client device to recognize an audio and/or video stream; and transmit the audio and/or video stream to the stream recognition module after receiving said recognition command.

[0020] In another particular implementation example, the stream recognition module is further configured to

- split the audio stream into fragments in the process of converting speech into text,

- process the fragments obtained as a result of splitting the audio stream,

- generate at least one message based on the processed fragments;

the video conferencing server additionally contains a matching module configured to

- determine in real time whether the stream recognition module is processing the audio stream,

- set the value of a flag identifying the state of the current voice input of the user of the client device,

- receive a request from the messaging module for permission to send a message from another client device to the single information space,

- send the messaging module permission to send a message from another client device to the single information space based on the value of said flag, wherein

• if the flag value indicates the absence of current voice input from the user of the client device, the matching module sends the messaging module permission to send a message from another client device to the single information space;

• if the flag value indicates the presence of current voice input from the user of the client device, the matching module is configured to

o send the stream forwarding module a request to send a generated message containing the already processed fragments to the single information space,

o after the stream forwarding module sends the generated message containing the already processed fragments to the single information space, send the messaging module permission to send a message from another client device to the single information space;

the messaging module is additionally configured to

- send a request to the matching module for permission to send a message from another client device to the single information space,

- send messages from another client device to the single information space in response to receiving permission to send from the matching module.
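
The permission logic of the matching module described above can be illustrated with a minimal sketch. All names in it (MatchingModule, flush_partial, pending_fragments, the "[voice]" prefix) are hypothetical and are not part of the claimed solution; the sketch also assumes a single-threaded server and replaces the stream forwarding module with a simple callback.

```python
from typing import Callable, List


class MatchingModule:
    """Hypothetical sketch of the matching module's permission logic."""

    def __init__(self, flush_partial: Callable[[], None]) -> None:
        # flush_partial stands in for the request to the stream forwarding
        # module to post the fragments that have already been recognized.
        self._flush_partial = flush_partial
        self.voice_input_active = False  # the flag described in the text

    def set_voice_input(self, active: bool) -> None:
        self.voice_input_active = active

    def request_permission(self) -> bool:
        # With active voice input, the partial voice message is posted first,
        # and only then is permission granted to the typed message.
        if self.voice_input_active:
            self._flush_partial()
        return True


chat: List[str] = []  # stands in for the single information space
pending_fragments = ["Hello everyone,", "let us begin"]


def flush_partial() -> None:
    if pending_fragments:
        chat.append("[voice] " + " ".join(pending_fragments))
        pending_fragments.clear()


matcher = MatchingModule(flush_partial)
matcher.set_voice_input(True)
if matcher.request_permission():
    chat.append("Typed message from another participant")
```

In this sketch the typed message always lands after the already recognized part of the speech, which is the ordering the flag-based exchange appears intended to guarantee.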

[0021] In another particular embodiment, the video conferencing server further comprises a common information space management module configured to determine that at least one generated message has been received from the stream forwarding module; and to display, when outputting the at least one generated message, a label indicating that the message was generated from the audio stream.

[0022] In another particular implementation example, the common information space management module is additionally configured to

- save messages posted in the mentioned space,

- provide connected client devices with access to saved messages.

[0023] In another particular implementation example, the common information space management module is additionally configured to provide

- search by the content of messages posted in the mentioned space,

- downloading of the contents of messages posted in the mentioned space,

- control of the display of messages posted in the mentioned space.

[0024] In another particular embodiment, the video conferencing server further includes a translation module configured to translate at least one message received from at least one client device and/or a message generated based on the processed data stream.

[0025] The claimed technical result is also achieved by a video conferencing method implemented using a video conferencing server and comprising the steps of:

• connecting at least two client devices to the video conferencing server,

• registering connections of client devices to the conference,

• establishing a secure connection with client devices,

• receiving a data stream from at least one client device,

• transmitting the data stream in parallel to at least one other client device and to the stream recognition module,

• processing the data stream in the stream recognition module and generating at least one message based on the processed data stream, wherein the processing of the data stream and the generation of the at least one message are carried out in parallel with the transmission of the data stream to at least one other client device,

• receiving the at least one generated message and sending it to a single information space.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] FIG. 1 shows a conceptual diagram of the claimed solution.

[0027] FIG. 2 shows a diagram of the video conferencing server.

[0028] FIG. 3 shows a block diagram of the video conferencing method.

[0029] FIG. 4 shows a general diagram of the computing device.

IMPLEMENTATION OF THE INVENTION

[0030] The concepts and terms necessary to understand the present invention will be described below.

[0031] Videoconferencing (VCC) is a telecommunications technology for the interaction of two or more remote subscribers, in which they can exchange audio and video information in real time, taking into account the transfer of control data.

[0032] XMPP (Extensible Messaging and Presence Protocol) is an open, free-to-use protocol based on XML (Extensible Markup Language) for instant messaging and presence information in near real time.

[0033] RTP (Real-time Transport Protocol) is a network protocol for delivering audio and video over IP networks.

[0034] SRTP (Secure Real-time Transport Protocol) is an extension to the RTP protocol designed for encryption, message authentication, integrity, and protection against tampering of RTP data.

[0035] gRPC (Remote Procedure Calls) is an open source remote procedure call (RPC) system that provides features such as authentication, bidirectional streaming and flow control, blocking or non-blocking bindings, and cancellation and timeouts.

[0036] ASR (Automatic Speech Recognition) is a computer technology for automatic speech-to-text recognition.

[0037] Gesture Recognition is a combination of computer and language technologies, the purpose of which is to interpret human gestures using mathematical algorithms.

[0038] SLR (Sign Language Recognition) is a computer technology for recognizing actions in sign language.

[0039] AR (Augmented Reality) is the result of introducing any sensory data into the visual field in order to supplement information about the environment and change the perception of the environment.

[0040] WebRTC (Web Real Time Communication) is a technology that provides streaming data transfer between browsers or mobile applications using direct peer-to-peer communication.

[0041] BLE (Bluetooth Low Energy) is a low energy wireless technology for exchanging data over short distances over long periods of time.

[0042] FIG. 1 shows a general diagram of the proposed video conferencing system (100). The system (100) can be a hardware-software complex in which each element is located on a separate computer and is combined with the other elements into a single functional whole via a data transmission network (103).

[0043] The system (100) consists of at least two client devices (101, 102) and a video conferencing server (104) connected via a data network (103).

[0044] Client devices (101, 102) may be configured for: generating at least one message through input devices; recording audio and/or video signals through technical means such as microphones and cameras; converting said signals into a data stream; connecting to the video conferencing server (104); transmitting said at least one message and data stream to the video conferencing server (104); receiving from the video conferencing server (104) at least one message and a data stream from other client devices; and outputting the at least one message in a single information space, as well as the data stream, through information output devices.

[0045] Client devices (101, 102) can be implemented on the basis of computing devices modified in software and hardware in such a way as to perform the above functions of client devices, and having technical means for input-output of information, as well as technical means for communication with the video conferencing server (104) via a data network (103). Examples of such computing devices include, but are not limited to: personal computer, smartphone, devices SberBox Top, SberPortal, etc. A more detailed description of the computing device is disclosed below with reference to FIG. 4.

[0046] The data network (103) can be, but is not limited to, the following examples: an internal or external computer network, for example, an Intranet, the Internet, etc.

[0047] The video conferencing server (104) contains a plurality of modules that will be described in detail below with reference to FIG. 2. Modules can be implemented structurally in the form of software and hardware solutions (for example, a system on a chip, microcontrollers, etc.). In addition, said server (104) may be implemented on a computing device, a more detailed description of which is presented below with reference to FIG. 4. Moreover, such a computing device is modified in its hardware and software in such a way as to implement the functions of the mentioned modules.

[0048] FIG. 2 shows a detailed diagram of the video conferencing server (104). The video conferencing server (104) contains the following modules: messaging module (201), stream forwarding module (202), stream recognition module (203), audio stream filtering module (204), video stream modification module (205), matching module (206), single information space management module (207), translation module (208). Interaction between modules is implemented via a data bus.

[0049] The messaging module (201) registers client device connections to the conference. After registration, the messaging module (201) is configured to receive at least one message generated on at least one client device from at least one client device and send said message to a common information space.

[0050] In a particular implementation example, the single information space can be a chat accessible to all client devices that have completed the process of registering a connection to the conference through the messaging module (201). Messages sent to the single information space are displayed on the output devices of client devices, for example, but not limited to, as text messages in the chat. In one particular implementation example, sending at least one message to the single information space includes the messaging module (201) sending said message to all client devices that have completed the registration process for joining the conference.
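
As an illustration of the chat-style information space described above, the following minimal in-memory sketch registers client devices and delivers every message to all registered devices. It is not an XMPP implementation; the class name, method names and the per-client inbox lists are assumptions made only for this example.

```python
from typing import Dict, List


class MessagingModule:
    """Hypothetical in-memory stand-in for the messaging module (201)."""

    def __init__(self) -> None:
        # One inbox per registered client device.
        self.inboxes: Dict[str, List[str]] = {}

    def register(self, client_id: str) -> None:
        self.inboxes.setdefault(client_id, [])

    def send_to_shared_space(self, sender: str, text: str) -> None:
        # Only devices that completed registration may post.
        if sender not in self.inboxes:
            raise PermissionError(f"{sender} is not registered")
        # The message is delivered to every registered device, sender included.
        for inbox in self.inboxes.values():
            inbox.append(f"{sender}: {text}")
```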

[0051] It should be noted that in one of the alternative embodiments of the claimed solution, the messaging module (201) is configured to receive from the stream forwarding module (202) at least one message generated by the stream recognition module (203) and send said message to the single information space.

[0052] In a preferred embodiment of the claimed solution, the messaging module (201) is an XMPP server.

[0053] The stream forwarding module (202) establishes a secure connection with client devices. After establishing a secure connection, the stream forwarding module (202) receives a data stream from at least one client device and transmits the data stream in parallel to at least one other client device and the stream recognition module (203). Next, the stream forwarding module (202) receives at least one generated message from the stream recognition module (203) and sends said message to the single information space.

[0054] In a particular implementation example, an SRTP connection can serve as a secure connection.

[0055] The data stream received from at least one client device contains at least an audio and/or video stream. For example, the client device may not have a camera, and therefore the client device will only transmit an audio stream. If the user of the client device communicates only in sign language, then transmission of the audio stream is not a prerequisite for communication between conference participants and the stream forwarding module (202) can only receive the video stream.

[0056] In one of the particular implementation examples, the stream forwarding module (202) receives a command from at least one client device to recognize an audio and/or video stream and transmits the audio and/or video stream to the stream recognition module (203) after receiving said recognition command. The ability of the stream forwarding module (202) to receive a recognition command allows the claimed solution to use computational and time resources efficiently: either not spending them on recognition of the audio and/or video stream when such recognition is not required, or determining which of the data streams (audio or video) is subject to recognition.

[0057] It should be noted that in one of the alternative embodiments of the claimed solution, the stream forwarding module (202) sends the generated message received from the stream recognition module (203) to the single information space by sending said generated message to the messaging module (201), which in turn sends it to the single information space.

[0058] The stream recognition module (203) processes the data stream received from the stream forwarding module (202), generates at least one message based on the processed data stream, and sends the generated message to the stream forwarding module (202). Processing of the data stream and generation of at least one message occur in parallel with the transmission of the data stream by the stream forwarding module (202) to at least one other client device. Recognizing the data stream in parallel with transferring it to a client device for output allows the claimed solution to maintain the transmission rate of the data stream to the client device while simultaneously providing the data stream recognition function.
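
The idea of forwarding a stream to a client without waiting for recognition can be sketched with asyncio queues. This is only an illustration under stated assumptions: the function names are hypothetical, packets are plain byte strings, recognition is simulated with a short sleep, and asyncio provides cooperative concurrency rather than true hardware parallelism.

```python
import asyncio
from typing import List, Tuple


async def forward(stream: List[bytes], client_q: asyncio.Queue, asr_q: asyncio.Queue) -> None:
    for packet in stream:
        # The same packet is handed to both consumers; the queues are
        # unbounded, so a slow recognizer never delays the client queue.
        await client_q.put(packet)
        await asr_q.put(packet)
    # None marks the end of the stream for both consumers.
    await client_q.put(None)
    await asr_q.put(None)


async def client(q: asyncio.Queue, received: List[bytes]) -> None:
    while (packet := await q.get()) is not None:
        received.append(packet)


async def recognizer(q: asyncio.Queue, messages: List[str]) -> None:
    while (packet := await q.get()) is not None:
        await asyncio.sleep(0.01)  # simulated slow recognition
        messages.append(f"recognized {len(packet)} bytes")


async def run_conference(stream: List[bytes]) -> Tuple[List[bytes], List[str]]:
    client_q: asyncio.Queue = asyncio.Queue()
    asr_q: asyncio.Queue = asyncio.Queue()
    received: List[bytes] = []
    messages: List[str] = []
    await asyncio.gather(
        forward(stream, client_q, asr_q),
        client(client_q, received),
        recognizer(asr_q, messages),
    )
    return received, messages
```

A call such as asyncio.run(run_conference([b"ab", b"cde"])) returns the packets seen by the client and the messages produced by the simulated recognizer.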

[0059] In a preferred embodiment of the claimed solution, depending on the type of data stream, the stream recognition module (203) processes the data stream as follows:

- if the data stream includes an audio stream, then processing of the audio data stream includes speech-to-text (ASR) conversion,

- if the data stream includes a video stream, then processing of the video stream may include both the conversion of gestures into text and/or graphic information (Gesture Recognition) and sign language recognition (SLR).
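
The type-dependent processing listed above amounts to a dispatch table. In the sketch below the three recognizer functions are placeholders that only return labelled strings; their names and the "audio"/"video" keys are assumptions, and real ASR, gesture and sign language recognition are not implemented.

```python
from typing import Callable, Dict, List


def recognize_speech(payload: str) -> str:
    return f"ASR text for {payload}"


def recognize_gestures(payload: str) -> str:
    return f"gesture text for {payload}"


def recognize_sign_language(payload: str) -> str:
    return f"sign language text for {payload}"


# Which placeholder recognizers apply to which kind of stream.
HANDLERS: Dict[str, List[Callable[[str], str]]] = {
    "audio": [recognize_speech],
    "video": [recognize_gestures, recognize_sign_language],
}


def process(stream_types: List[str], payload: str) -> List[str]:
    messages: List[str] = []
    for stream_type in stream_types:
        # Unknown stream types are skipped rather than treated as errors.
        for handler in HANDLERS.get(stream_type, []):
            messages.append(handler(payload))
    return messages
```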

[0060] In a particular embodiment, the stream recognition module (203) is further configured to determine the type of the data stream and perform appropriate processing of the data stream depending on its type. Determining the type involves examining the data stream.

[0061] In general, the stream recognition module (203) is further configured to split the data stream into fragments while processing it, process the resulting fragments, and generate at least one message based on the processed fragments. The fragments can be, but are not limited to, the following examples: speech phrases (for an audio stream) or frames (for a video stream).

[0062] In a preferred embodiment of the invention for the case of speech-to-text (ASR) conversion, the stream recognition module (203) is further configured to split the audio stream into fragments during the speech-to-text conversion process, process the fragments resulting from the splitting of the audio stream, and generate at least one message based on the processed fragments. In this case, splitting the audio stream into fragments includes detecting the end of the fragment by detecting silence in the audio stream, characteristic, for example, of the end of a phrase or a grammatical pause.
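
Silence-based splitting of the kind described above can be sketched as follows. The sketch assumes that the audio stream is a list of amplitude samples in the range -1..1 and that silence is a run of samples below a fixed amplitude threshold; the function name and both parameters are hypothetical, and a production system would more likely use a voice activity detector.

```python
from typing import List


def split_on_silence(
    samples: List[float], threshold: float = 0.05, min_silence: int = 3
) -> List[List[float]]:
    """Split samples into fragments separated by runs of quiet samples."""
    fragments: List[List[float]] = []
    current: List[float] = []
    quiet_run = 0
    for sample in samples:
        if abs(sample) < threshold:
            # Quiet samples are discarded; a long enough run closes the fragment.
            quiet_run += 1
            if quiet_run == min_silence and current:
                fragments.append(current)
                current = []
        else:
            quiet_run = 0
            current.append(sample)
    if current:
        fragments.append(current)
    return fragments
```

Short pauses (fewer than min_silence quiet samples) do not end a fragment, which loosely mirrors the distinction between a pause inside a phrase and the end of the phrase.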

[0063] Processing the fragments obtained as a result of splitting the audio stream by means of the stream recognition module (203) further includes generating a set of hypotheses for each word in the fragment based on the language model. In this case, for each word spoken by the user of the client device, there can be several variants of the recognized text. A hypothesis is a set of pairs consisting of a variant of a recognized word and the probability that this word occurs in the recognized fragment. For example, for the spoken Russian word "лук" (onion), the following set of pairs of similar-sounding words acts as a hypothesis:

лук (onion) - 0.5

луг (meadow) - 0.3

люк (hatch) - 0.2

[0064] Thus, the stream recognition module (203) does not initially settle on a single recognition result for each word, but generates a set of candidate words, each with its probability (where the probabilities sum to 1, or 100%). In the preferred embodiment, the word that has the highest probability is selected from the set of candidates. As an example, but not limited to, a probability estimate may be based on the similarity of the sound of a word to the acoustic ASR models or on the position of a word in a sentence (for example, taking into account a language or grammatical model).

[0065] Said fragment processing further includes the stream recognition module (203) selecting, for each word from the fragment, the most likely hypothesis from the set of hypotheses. Based on the selected words, the stream recognition module (203) generates at least one message containing the recognized fragment of the audio stream. The stream recognition module (203) can additionally clear the generated message of obscene vocabulary. In a particular implementation example, clearing the generated message of obscene vocabulary is carried out by comparing recognized words with a dictionary of obscene vocabulary and replacing detected obscene words with special characters, for example, asterisks, hash marks, or any combinations thereof. In addition, the stream recognition module (203) can restore the punctuation (i.e., place punctuation marks) and capitalization (i.e., place capital letters where necessary according to the rules of the Russian language) of the generated message. As a result, the stream recognition module (203) sends at least one generated message, cleared of obscene vocabulary and with restored punctuation and capitalization, to the stream forwarding module (202).
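
The selection of the most likely variant for each word and the masking of dictionary words can be sketched as below. The function names and the sample vocabulary are hypothetical, the English words are only glosses, and capitalizing the first letter and appending a full stop is a crude stand-in for real punctuation and capitalization restoration.

```python
from typing import Dict, List, Set

# One hypothesis per spoken word: candidate word -> probability.
Hypothesis = Dict[str, float]


def best_word(hypothesis: Hypothesis) -> str:
    # Pick the candidate with the highest probability.
    return max(hypothesis, key=lambda word: hypothesis[word])


def build_message(hypotheses: List[Hypothesis], banned: Set[str]) -> str:
    words: List[str] = []
    for hypothesis in hypotheses:
        word = best_word(hypothesis)
        # Words found in the banned dictionary are replaced with asterisks.
        words.append("*" * len(word) if word.lower() in banned else word)
    text = " ".join(words)
    # Crude stand-in for punctuation and capitalization restoration.
    return text[:1].upper() + text[1:] + "."
```

For instance, build_message([{"onion": 0.5, "meadow": 0.3, "hatch": 0.2}, {"darn": 0.9, "barn": 0.1}, {"soup": 1.0}], {"darn"}) yields "Onion **** soup.".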

[0066] In the preferred embodiment of the invention, speech-to-text (ASR) conversion is implemented based on SmartSpeech, a proprietary technology from SberDevices, which provides high speed and high recognition quality, including for names, complex terms and long words. SmartSpeech uses recent developments in the field of deep learning. Neural networks are trained on large amounts of data using the Christofari supercomputer from Sber (1000+ Tesla V100 GPUs) and run on GPUs for fast operation. Speech recognition uses high-accuracy architectures such as Jasper, QuartzNet and others.

[0067] The pre-trained model for speech recognition used in SmartSpeech was trained on the largest manually labeled dataset of 1240 hours of audio data in the Russian language (Golos speech dataset).

[0068] In one particular implementation example, speech-to-text conversion can be implemented based on, but is not limited to, the following technologies: Microsoft Speech SDK, IBM Embedded ViaVoice, etc.

[0069] In a preferred embodiment of the invention, for the case of converting gestures into text and/or graphic information (Gesture Recognition), the stream recognition module (203) splits the video stream into frames, determines an area of interest (for example, an arm or palm) in at least one frame of the video stream, segments the determined area of interest, identifies key points in the segmented area, and determines the gesture by matching the key points with one of the known gestures from a database of known gestures.

[0070] When a gesture is detected, the stream recognition module (203) determines the type of gesture and, depending on the gesture type, generates at least one corresponding message and sends it to the stream forwarding module (202).

[0071] In one particular implementation example, if the gesture type is defined as a gesture characterizing the user's response, then the stream recognition module (203) generates at least one message containing text and/or graphic information. For example, if the stream recognition module (203) detects a user in a video stream showing a "thumbs up" gesture, then the generated message may contain a thumbs up emoji, the text message "Excellent!" or "Totally support!", or a combination of emoji and text message.

[0072] If the gesture type is determined to be a control gesture, the stream recognition module (203) is further configured to send a message containing the recognized control input to the stream forwarding module (202). The recognized control input includes a control command to the client device. Such control commands may include, but are not limited to: performing various actions to control audio volume, mute/unmute audio, control a video conference camera, and/or access settings for a video conference. Having received said message, the stream forwarding module (202) sends it to the messaging module (201), which extracts the control command from the message and sends it to the client device.
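
The distinction between response gestures and control gestures can be sketched as a lookup followed by routing. The gesture names, the command identifier and the dictionary layout of the result are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical gesture table: name -> (gesture type, payload).
GESTURES: Dict[str, Tuple[str, str]] = {
    "thumbs_up": ("response", "\N{THUMBS UP SIGN} Excellent!"),
    "raise_volume": ("control", "volume_up"),
}


def handle_gesture(name: str) -> Dict[str, str]:
    kind, payload = GESTURES[name]
    if kind == "response":
        # Response gestures become chat messages in the shared space.
        return {"target": "shared_space", "text": payload}
    # Control gestures carry a command for the client device instead.
    return {"target": "client_device", "command": payload}
```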

[0073] For example, if the stream recognition module (203) detects a user in the video stream showing a finger-up gesture, then a control command is sent to the client device to increase the audio volume.

[0074] In a preferred embodiment of the invention, for the case of sign language recognition (Sign Language Recognition), the stream recognition module (203) splits the video stream into frames, determines at least one area of interest (for example, hands) in at least one frame of the video stream, segments the at least one determined area of interest, identifies key points in the at least one segmented area of interest, normalizes the identified key points, and determines at least one gesture by matching the normalized key points with one of the known gestures from a database of known gestures. Next, the stream recognition module (203) creates a matrix of recognized letters or words from the gestures determined in the previous step, generates at least one message based on the created matrix, and sends the at least one message to the stream forwarding module (202).
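
Normalizing key points and matching them against a database of known gestures can be sketched as a nearest-neighbour search. The normalization used here (centering on the centroid and scaling by the largest distance from it) and the summed Euclidean distance are assumptions; the source does not specify either. Frame splitting, segmentation and key point detection are assumed to have been done already.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def normalize(points: List[Point]) -> List[Point]:
    """Make key points independent of position and scale in the frame."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    centered = [(x - cx, y - cy) for x, y in points]
    # Fall back to 1.0 if all points coincide, to avoid division by zero.
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def match_gesture(points: List[Point], database: Dict[str, List[Point]]) -> str:
    """Return the name of the known gesture closest to the given key points."""
    probe = normalize(points)

    def distance(name: str) -> float:
        reference = normalize(database[name])
        return sum(math.dist(p, r) for p, r in zip(probe, reference))

    return min(database, key=distance)
```

Because both the probe and the references are normalized, a hand that is larger or shifted in the frame still matches the same database entry.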

[0075] It should be noted that in one of the alternative embodiments of the claimed solution, some of the functions of the stream recognition module (203) related to pre-processing of the video stream, such as dividing the video stream into frames, determining the area of interest in a frame of the video stream, and segmenting the area of interest, can be implemented on the client device side. In that case, the stream recognition module (203) receives from the stream forwarding module (202) a video stream that has already been pre-processed by the client device.

[0076] The audio stream filtering module (204) performs noise suppression. In a particular embodiment of the claimed solution, before processing the audio stream, the stream recognition module (203) sends the audio stream to the audio stream filtering module (204), which, after performing the noise suppression operation, sends the noise-free audio stream back to the stream recognition module (203).

[0077] As an example, but not limited to, the noise reduction capability in the audio stream filtering module (204) can be implemented using smart noise reduction technology based on NVIDIA's Maxine platform or based on SberDevices' proprietary technology, SberNoise. These solutions use artificial intelligence to automatically remove extraneous noise.

[0078] Integrating the noise reduction function into the claimed solution increases the comfort of communication when a person is in a noisy place during a call, by removing extraneous noise and harsh sounds from the audio. For example, if in the background of a conversation a child is crying, a dog is barking, or construction noise is coming from the window, the proposed solution uses an AI model to suppress these sounds, leaving only the voice of the speaker.

[0079] The video stream modification module (205) performs background replacement, applies filters that improve the visual characteristics of the user's image, as well as augmented reality filters (AR filters).

[0080] The need to replace the background displayed behind the user of the client device during video conferencing may arise due to the user being in a non-traditional place of work, for example, in a home office, on vacation, etc., in addition, the current background of the user's video stream may distract other video conference participants.

[0081] The background replacement process is carried out by the video stream modification module (205) as follows:

- receiving a video stream from the stream forwarding module (202);

- identifying the background part of the video stream and the image of the video conferencing participant;

- performing background segmentation (BGS);

- replacing the segmented background with at least one of: any image selected by the user, a background template, or a blurred image of the current background;

- sending a video stream containing the replaced background to the stream forwarding module (202) for further transmission to client devices.
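
The replacement step above can be illustrated with a minimal sketch. It assumes that segmentation has already produced a per-pixel mask of the participant; the names `replace_background`, `person_mask` and `new_background` are hypothetical and do not come from the application, and frames are modeled as nested lists of RGB tuples rather than real video buffers.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]
Mask = List[List[bool]]  # True where the pixel belongs to the participant


def replace_background(frame: Frame, person_mask: Mask, new_background: Frame) -> Frame:
    """Keep participant pixels and take all other pixels from new_background."""
    if not (len(frame) == len(person_mask) == len(new_background)):
        raise ValueError("frame, mask and background must have the same height")
    result: Frame = []
    for row, mask_row, bg_row in zip(frame, person_mask, new_background):
        if not (len(row) == len(mask_row) == len(bg_row)):
            raise ValueError("frame, mask and background must have the same width")
        result.append(
            [px if is_person else bg for px, is_person, bg in zip(row, mask_row, bg_row)]
        )
    return result


# Tiny 2x2 example: the participant occupies the main diagonal.
frame = [[(1, 1, 1), (2, 2, 2)], [(3, 3, 3), (4, 4, 4)]]
mask = [[True, False], [False, True]]
blue = [[(0, 0, 255)] * 2 for _ in range(2)]
composited = replace_background(frame, mask, blue)
```

The same compositing applies whether `new_background` is a user-selected image, a template, or a blurred copy of the current frame; only the source of the background pixels changes.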

[0082] In the claimed solution, improving the visual characteristics of the user's image means smoothing the skin, evening out the complexion (for example, removing dark circles under the eyes) and correcting errors in the phone camera lens. This process is called beautification.

[0083] The process of improving the visual characteristics of the user's image (beautification) is carried out by the video stream modification module (205) as follows:

- receiving at least one frame of the video stream from the stream forwarding module (202);

- identifying a person in at least one frame of the video stream;

- segmenting parts of the face (for example, the eyes and/or mouth) by skin tone;

- identifying at least one defective area, which differs in brightness by a threshold value from areas of skin tone without defects;

- smoothing the brightness of at least one defective area to create a smoothed brightness;

- generating at least one improved frame of the video stream, including an improved version of the face, in which the original brightness of at least one defective area is replaced by the smoothed brightness; and

- sending a video stream containing the improved frames to the stream forwarding module (202) for further transmission to client devices.
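
The defect detection and smoothing steps above can be sketched as follows. This is only an illustration under stated assumptions: skin pixels are reduced to a one-dimensional list of brightness values, the defect-free reference brightness is taken as the median, and a defective value is replaced by the mean of its non-defective neighbours. The function names, the `radius` parameter and the choice of median are hypothetical and are not specified in the application.

```python
from statistics import median
from typing import List


def find_defects(brightness: List[float], threshold: float) -> List[bool]:
    """Mark values that differ from the reference brightness by at least threshold."""
    reference = median(brightness)
    return [abs(value - reference) >= threshold for value in brightness]


def smooth_defects(brightness: List[float], threshold: float, radius: int = 1) -> List[float]:
    """Replace each defective value with the mean of nearby non-defective values."""
    defects = find_defects(brightness, threshold)
    reference = median(brightness)
    smoothed = list(brightness)
    for i, is_defect in enumerate(defects):
        if not is_defect:
            continue
        lo, hi = max(0, i - radius), min(len(brightness), i + radius + 1)
        clean = [brightness[j] for j in range(lo, hi) if not defects[j]]
        # Fall back to the reference brightness if every neighbour is defective too.
        smoothed[i] = sum(clean) / len(clean) if clean else reference
    return smoothed


skin = [100.0, 102.0, 60.0, 101.0, 99.0]  # one dark spot at index 2
improved = smooth_defects(skin, threshold=20.0)
```

Only the value flagged as defective changes; the surrounding skin brightness is left untouched, which matches the idea of replacing the original brightness of the defective area with a smoothed one.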

[0084] In the claimed solution, AR filters refer to computer-generated visual effects designed to be superimposed on real-life images by adding an additional layer to the foreground or background of the real-life image.

[0085] The process of applying AR filters is carried out by the video stream modification module (205) as follows:

- receiving at least one frame of the video stream from the stream forwarding module (202);

- identifying at least one region of interest (for example, a face) in at least one frame of the video stream;

- determining the coordinates of landmarks of at least one region of interest (for example, facial landmarks, wherein facial landmarks include at least the nose, eyes, mouth and facial outline);

- superimposing at least one AR element on at least one corresponding region of interest (for example, superimposing an AR element "glasses" on the face region, specifically at the coordinates corresponding to the eyes), taking into account the determined coordinates of the landmarks and the size of the at least one region of interest;

- generating at least one modified frame of the video stream with an AR filter applied, in which the AR element is superimposed on the corresponding region of interest; and

- sending a video stream containing the modified frames to the stream forwarding module (202) for further transmission to client devices.
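
As an illustration of how landmark coordinates can drive the placement of an AR element, the sketch below positions a "glasses" overlay from two eye landmarks. It assumes that landmark detection is done elsewhere; `Placement`, `place_glasses` and the `width_scale` factor are hypothetical names and values chosen for the example, not part of the application.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Placement:
    center_x: float   # where the middle of the overlay goes
    center_y: float
    width: float      # overlay width in pixels
    angle_deg: float  # rotation that follows the tilt of the head


def place_glasses(left_eye: Point, right_eye: Point, width_scale: float = 2.0) -> Placement:
    """Derive overlay position, size and rotation from the two eye landmarks."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    eye_distance = math.hypot(rx - lx, ry - ly)
    if eye_distance == 0:
        raise ValueError("eye landmarks must not coincide")
    return Placement(
        center_x=(lx + rx) / 2,
        center_y=(ly + ry) / 2,
        width=eye_distance * width_scale,
        angle_deg=math.degrees(math.atan2(ry - ly, rx - lx)),
    )


level_face = place_glasses((100.0, 120.0), (160.0, 120.0))
```

Recomputing the placement on every frame from fresh landmark coordinates is one simple way to make the overlay follow head movement.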

[0086] In one of the particular implementation examples, the process of applying AR filters additionally includes tracking the movement of landmarks of at least one area of interest (for example, moving the head to the sides, changing facial expressions (blinking, smiling), moving hands, etc.) and adding at least one AR element to at least one frame of the video stream based on changes in the motion of landmarks of the at least one region of interest.

[0087] It should be noted that in one of the alternative embodiments of the claimed solution, the video stream modification module (205) can be implemented on the client device side. Thus, the stream forwarding module (202) will receive an already modified video stream from client devices.

[0088] However, in a preferred embodiment of the claimed solution, the video stream modification module (205) is implemented on the side of the video conferencing server (104). This implementation allows the stream recognition module (203) to process the original, rather than the modified, video stream, which increases the efficiency and accuracy of processing when converting gestures into text and/or graphic information (Gesture Recognition) and when recognizing sign language (Sign Language Recognition), since the modified video stream may contain visual effects that degrade the quality of processing (recognition).

[0089] In general, the negotiation module (206) is designed to coordinate the sending of messages to the common information space. The negotiation module (206) determines in real time whether the stream recognition module (203) is processing a data stream, in particular an audio stream. This can be implemented, for example, by sending requests to the stream recognition module (203) at a certain interval, each of which returns to the negotiation module (206) a response describing the state of the stream recognition module (203) with respect to data stream processing. For example, this state may indicate whether voice input from the client device user, or sign language input, is currently being processed by the stream recognition module (203).

[0090] The negotiation module (206) then sets the value of a flag identifying the state of the client device user's current voice input. In another embodiment of the claimed solution, the value of the flag identifies the state of the current input using sign language.

[0091] Next, the negotiation module (206) receives from the messaging module (201) a request for permission to send to the common information space a message from a client device other than the one performing the current voice or sign language input. The negotiation module (206) may then send to the messaging module (201) permission to send the message from that other client device to the common information space, based on the value of the flag.

[0092] In a particular implementation example, the negotiation module (206) is configured to send to the stream forwarding module (202) a request to send to the common information space a generated message containing already processed fragments, if the value of the flag identifies the presence of current voice input or sign language input from the client device user. The processed fragments can be, for example, text messages obtained by converting speech to text or sign language to text.

[0093] The common information space control module (207) determines that at least one generated message has been received from the stream forwarding module (202) and, when outputting the at least one generated message to the common information space, displays a label identifying that the message was generated from an audio stream or a video stream. Such a label allows the user to determine unambiguously what type of message a specific message sent to the common information space is. In a particular implementation example, the label may differ between a message generated from an audio stream (i.e., a message resulting from recognition of an audio stream by converting speech to text) and a message generated from a video stream (i.e., a message resulting from recognition of a video stream by converting sign language into text).

[0094] The common information space control module (207) further stores messages located in the common information space and gives connected client devices access to the stored messages. This allows the user of the client device to concentrate on communication during a conference rather than on taking notes, since unrestricted access is provided to the entire set of messages in the common information space.

[0095] In addition, the common information space control module (207) allows the full text of the conversation (i.e., the entire set of messages sent to the common information space) to be downloaded at any time during the conference to any client device whose connection to the conference has been registered by the messaging module (201). By way of example, but not limitation, the client device sends a request to download the full text of the conversation to the common information space control module (207). In response to this request, the common information space control module (207) assembles the entire set of messages sent to the common information space, for example, in the form of a text file, and sends it to the client device.

[0096] In one of the particular implementation examples, the common information space control module (207) provides control over the display of messages located in the common information space. At the request of the user of the client device, the common information space can display, without limitation: only messages sent directly to the common information space; only messages generated by the stream recognition module (203), which are text transcripts of an audio or video stream; or all messages sent to the common information space.

[0097] The common information space control module (207) also searches the contents of messages posted in the common information space, which speeds up the process of finding necessary information among the entire set of messages generated by participants during the conference.
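
The display control and search functions described above can be sketched as simple filters over the stored messages. The sketch assumes that each stored message records whether it was sent directly or generated by recognition; `ChatMessage`, `Origin`, `DisplayMode`, `visible_messages` and `search_messages` are hypothetical names, and case-insensitive substring matching is only one possible search strategy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Origin(Enum):
    DIRECT = "direct"          # typed and sent by a participant
    RECOGNIZED = "recognized"  # generated from an audio or video stream


class DisplayMode(Enum):
    DIRECT_ONLY = "direct_only"
    RECOGNIZED_ONLY = "recognized_only"
    ALL = "all"


@dataclass(frozen=True)
class ChatMessage:
    sender: str
    text: str
    origin: Origin


def visible_messages(messages: List[ChatMessage], mode: DisplayMode) -> List[ChatMessage]:
    """Return the messages that should be displayed for the chosen mode."""
    if mode is DisplayMode.ALL:
        return list(messages)
    wanted = Origin.DIRECT if mode is DisplayMode.DIRECT_ONLY else Origin.RECOGNIZED
    return [m for m in messages if m.origin is wanted]


def search_messages(messages: List[ChatMessage], query: str) -> List[ChatMessage]:
    """Case-insensitive substring search over message contents."""
    needle = query.casefold()
    return [m for m in messages if needle in m.text.casefold()]


history = [
    ChatMessage("user1", "Revenue grew last quarter", Origin.RECOGNIZED),
    ChatMessage("user2", "By how much?", Origin.DIRECT),
    ChatMessage("user1", "By ten percent", Origin.RECOGNIZED),
]
transcripts = visible_messages(history, DisplayMode.RECOGNIZED_ONLY)
hits = search_messages(history, "REVENUE")
```

Because both functions return the original message order, the same stored set can back the chat view, the filtered views and the downloadable transcript.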

[0098] The translation module (208) translates at least one message received from at least one client device via the messaging module (201) and/or a message generated based on the processed data stream by the stream recognition module (203).

[0099] The translation process is carried out by the translation module (208) as follows:

- receiving a request from at least one client device to translate at least one message, wherein said request contains an indication of the desired translation language and an indication of at least one message that needs to be translated;

- translating at least one message specified in the request into the required translation language;

- sending at least one translated message to the client device via the messaging module (201).

[0100] FIG. 3 shows a method (300) for performing video conferencing. The method (300) is carried out using a video conferencing server (104) and at least two client devices (101, 102). In the first step (301), at least two client devices (101, 102) are connected to the video conferencing server (104). In a particular implementation example, the at least two client devices (101, 102) are connected to the messaging module (201) included in the video conferencing server (104). Connection to the video conferencing server (104) (in a particular case, to the messaging module (201)) can be realized by the client device receiving a link to join the conference and following that link, for example, but not limited to, using WebRTC technology.

[0101] After connecting at least two client devices (101, 102), at step (302) the connections of said client devices (101, 102) to the conference are registered via the messaging module (201). In a particular implementation example, at step (302), the messaging module (201) may additionally send notifications to client devices about newly connected client devices.

[0102] At step (303), a secure connection is established with client devices through the stream forwarding module (202). In a particular implementation example, an SRTP connection is established with client devices. Subsequently, the established secure connection is used for two-way transmission of data streams between at least one client device and the video conferencing server (104) (in a particular case, the stream forwarding module (202), which is part of the video conferencing server (104)), as well as between client devices through the video conferencing server (104) (in a particular case, through the stream forwarding module (202)).

[0103] After establishing a secure connection, at step (304), a data stream from at least one client device is received through the stream forwarding module (202). At step (305), the received data stream is transmitted in parallel by the stream forwarding module (202) to at least one other client device and to the stream recognition module (203).
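
A minimal sketch of the parallel transmission at step (305) is given below. It uses Python's `asyncio` to send each chunk to a client and to the recognizer concurrently, so that neither path waits for the other; this is cooperative concurrency on one thread and only stands in for whatever parallelism a real server would use. The functions `send_to_client`, `send_to_recognizer`, `forward` and `demo` are hypothetical, and a list and a queue stand in for the network connection and the recognition module.

```python
import asyncio
from typing import List, Tuple


async def send_to_client(chunk: bytes, delivered: List[bytes]) -> None:
    """Stand-in for transmitting a chunk to another client device."""
    await asyncio.sleep(0)  # yields control, as real network I/O would
    delivered.append(chunk)


async def send_to_recognizer(chunk: bytes, queue: asyncio.Queue) -> None:
    """Stand-in for handing the same chunk to the stream recognition module."""
    await queue.put(chunk)


async def forward(chunk: bytes, delivered: List[bytes], queue: asyncio.Queue) -> None:
    """Send one chunk down both paths concurrently."""
    await asyncio.gather(
        send_to_client(chunk, delivered),
        send_to_recognizer(chunk, queue),
    )


async def demo() -> Tuple[List[bytes], List[bytes]]:
    delivered: List[bytes] = []
    queue: asyncio.Queue = asyncio.Queue()  # created inside the running loop
    for chunk in (b"frame-1", b"frame-2"):
        await forward(chunk, delivered, queue)
    recognized = [queue.get_nowait() for _ in range(queue.qsize())]
    return delivered, recognized


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both destinations receive every chunk in the original order, and delivery to the client does not depend on recognition having finished.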

[0104] In a particular implementation example, at step (305), transmitting a data stream to the stream recognition module (203) includes the stream forwarding module (202) sending a gRPC request to the stream recognition module (203). Said gRPC request may contain, but is not limited to:

- information for authentication in the stream recognition module (203) (for example, placed in the gRPC request header);

- recognition parameters, such as data stream type;

- data stream for recognition in the stream recognition module (203).

[0105] After receiving the data stream, at step (306), the data stream is processed by the stream recognition module (203) and at least one message is generated based on the processed data stream, wherein the processing of the data stream and generation of the at least one message is carried out in parallel with transmitting the data stream by the stream forwarding module (202) to at least one other client device.

[0106] In a particular implementation example, at step (306), processing the data stream by the stream recognition module (203) includes receiving a gRPC request from the stream forwarding module (202), determining the type of the data stream, and performing appropriate processing of the data stream depending on its type. The type of the data stream can be determined, for example, by reading it from the recognition parameters contained in the gRPC request received from the stream forwarding module (202).
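
The request contents and type-based processing described above can be sketched without any gRPC dependency. The example below models the request as a plain data class and dispatches on the declared stream type; it does not use the real gRPC API. `RecognitionRequest`, its field names, the `"audio"`/`"video"` type strings and the placeholder recognizers are all hypothetical.

```python
import hmac
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RecognitionRequest:
    auth_token: str   # authentication information (e.g. from a request header)
    stream_type: str  # recognition parameter: "audio" or "video"
    payload: bytes    # the data stream to recognize


def recognize_speech(payload: bytes) -> str:
    """Placeholder for speech-to-text processing of an audio stream."""
    return f"speech-to-text result for {len(payload)} bytes"


def recognize_sign_language(payload: bytes) -> str:
    """Placeholder for sign language recognition on a video stream."""
    return f"sign-language result for {len(payload)} bytes"


HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "audio": recognize_speech,
    "video": recognize_sign_language,
}


def handle_request(request: RecognitionRequest, expected_token: str) -> str:
    """Authenticate, read the stream type from the request, and dispatch."""
    if not hmac.compare_digest(request.auth_token.encode(), expected_token.encode()):
        raise PermissionError("authentication failed")
    handler = HANDLERS.get(request.stream_type)
    if handler is None:
        raise ValueError(f"unsupported stream type: {request.stream_type!r}")
    return handler(request.payload)


result = handle_request(RecognitionRequest("secret", "audio", b"\x00" * 4), "secret")
```

Keeping the handlers in a lookup table means a new stream type can be supported by registering one more function, without touching the dispatch logic.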

[0107] After generating at least one message based on the processed data stream, the stream recognition module (203) transmits the at least one generated message to the stream forwarding module (202). In a particular implementation example, transmitting the at least one generated message to the stream forwarding module (202) includes the stream recognition module (203) sending a gRPC response to the stream forwarding module (202). Said gRPC response contains the recognition results obtained from processing the data stream by the stream recognition module (203).

[0108] At step (307), the stream forwarding module (202) receives at least one generated message from the stream recognition module (203) and sends it to the common information space. In a particular implementation example, at least one generated message may contain, but is not limited to:

- client device identifier;

- time stamp identifying the beginning of message generation;

- time stamp identifying the end of message generation;

- message containing recognition results;

- recognition session identifier for a specific client device.
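
The fields listed above map naturally onto a small record type. The sketch below is one possible representation, assuming JSON as the wire format and numeric timestamps; the class name `GeneratedMessage`, its field names and the `from_json` helper are hypothetical and are not defined in the application.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GeneratedMessage:
    client_id: str      # client device identifier
    started_at: float   # time stamp: beginning of message generation
    finished_at: float  # time stamp: end of message generation
    text: str           # recognition results
    session_id: str     # recognition session for this client device

    def __post_init__(self) -> None:
        if self.finished_at < self.started_at:
            raise ValueError("finished_at must not precede started_at")

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


def from_json(raw: str) -> GeneratedMessage:
    """Rebuild a message from its JSON form."""
    return GeneratedMessage(**json.loads(raw))


message = GeneratedMessage("device-101", 10.0, 12.5, "Revenue grew last quarter", "session-1")
restored = from_json(message.to_json())
```

Carrying both time stamps lets a consumer order messages by when the speech started rather than by when recognition happened to finish.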

[0109] In one of the particular examples of implementation of the claimed solution, at step (307), the stream forwarding module (202) sends a generated message received from the stream recognition module (203) to the common information space by sending said generated message to the messaging module (201), which in turn sends it to the common information space.

[0110] One of the alternative embodiments of the claimed invention addresses the problem of messages being output to the common information space in the wrong order. Consider the case where the user of the first client device transmits a data stream (for example, reads a report) and users of other client devices receive this data stream (i.e., hear and/or see the first user's report) through the stream forwarding module (202). Since the stream forwarding module (202) transmits the data stream in parallel to the client devices and the stream recognition module (203), users of other client devices can receive the data stream before the stream recognition module (203) generates a message based on the processed data stream and that message is displayed in the common information space (hereinafter referred to as the chat). As a result, the user of the second client device, having heard part of the report, may ask a question about what he or she heard, for example, by composing a message and sending it to the chat. Thus, a message from the user of the second client device may be output to the chat before the message from the user of the first client device has been generated from the data stream (e.g., before the first user's report has been recognized and converted from an audio stream to a text message). This results in incorrect ordering of messages in the common information space and, as a consequence, loss of the logical, structural and cause-and-effect relationships between the messages in the common information space.

[0111] To eliminate this problem, the claimed solution is implemented as follows.

[0112] At step (306), the stream recognition module (203) additionally splits the text into fragments during speech-to-text conversion and generates at least one message based on the resulting fragments. The negotiation module (206) then determines in real time whether the stream recognition module (203) is processing a data stream (e.g., an audio stream) and sets the value of a flag identifying the state of the client device user's current input (e.g., voice input or sign language input). By way of example, but not limitation, flag=1 is set if the stream recognition module (203) is currently processing a data stream, and flag=0 is set if the stream recognition module (203) is not currently processing a data stream, i.e., no conference participant is performing voice or sign language input that needs to be recognized by the stream recognition module (203).

[0113] Next, the messaging module (201) receives at least one message from another client device (for example, from the user of the second client device) and sends a request to the negotiation module (206) for permission to send the message from the other client device to the common information space. The negotiation module (206) receives this request from the messaging module (201) and decides whether to allow sending based on the value of the set flag.

[0114] If the flag value identifies no current input (voice or sign language) from the client device user (e.g., flag=0), the negotiation module (206) sends to the messaging module (201) permission to send a message to the common information space from another client device. The messaging module (201) then receives said permission from the negotiation module (206) and sends the message from the other client device to the common information space.

[0115] If the flag value identifies the presence of current input (voice or sign language) from the client device user (e.g., flag=1), the negotiation module (206) sends a request to the stream forwarding module (202) to send to the common information space a generated message containing the already processed fragments. Having received this request, the stream forwarding module (202) sends a request to the stream recognition module (203) to generate a message containing the already processed fragments. In response, the stream recognition module (203) generates a message containing the already processed fragments and sends it to the stream forwarding module (202), which sends it to the common information space. The negotiation module (206) then sends permission to the messaging module (201) to send the message from the other client device to the common information space. The messaging module (201) receives said permission from the negotiation module (206) and sends the message from the other client device to the common information space. Thus, the logical, structural and cause-and-effect relationships between the messages in the common information space are preserved.
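
The ordering mechanism described in the preceding paragraphs can be sketched as follows. The sketch collapses the stream forwarding and messaging modules into direct calls and keeps everything in memory, so it illustrates only the flag logic: when recognition is in progress, the already processed fragments are posted before the other participant's message is admitted. The classes `StreamRecognizer` and `NegotiationModule`, the function `post_chat_message` and the `speaker` label are hypothetical.

```python
from typing import List, Optional, Tuple

Chat = List[Tuple[str, str]]  # (author, text) in display order


class StreamRecognizer:
    """Collects recognized text fragments while a participant is speaking."""

    def __init__(self) -> None:
        self.processing = False
        self._fragments: List[str] = []

    def start_input(self) -> None:
        self.processing = True

    def add_fragment(self, text: str) -> None:
        self._fragments.append(text)

    def flush(self) -> Optional[str]:
        """Return a message built from the fragments processed so far, if any."""
        if not self._fragments:
            return None
        message = " ".join(self._fragments)
        self._fragments.clear()
        return message

    def end_input(self) -> Optional[str]:
        self.processing = False
        return self.flush()


class NegotiationModule:
    def __init__(self, recognizer: StreamRecognizer, chat: Chat, speaker: str) -> None:
        self.recognizer = recognizer
        self.chat = chat
        self.speaker = speaker

    def flag(self) -> int:
        """1 while a data stream is being recognized, otherwise 0."""
        return 1 if self.recognizer.processing else 0

    def grant_permission(self) -> bool:
        """Post pending fragments first if input is in progress, then allow sending."""
        if self.flag() == 1:
            partial = self.recognizer.flush()
            if partial is not None:
                self.chat.append((self.speaker, partial))
        return True


def post_chat_message(negotiation: NegotiationModule, chat: Chat, author: str, text: str) -> None:
    if negotiation.grant_permission():
        chat.append((author, text))


def demo() -> Chat:
    chat: Chat = []
    recognizer = StreamRecognizer()
    negotiation = NegotiationModule(recognizer, chat, speaker="user1 (speech)")
    recognizer.start_input()
    recognizer.add_fragment("Revenue grew")
    recognizer.add_fragment("last quarter")
    post_chat_message(negotiation, chat, "user2", "By how much?")  # arrives mid-speech
    recognizer.add_fragment("by ten percent")
    final = recognizer.end_input()
    if final is not None:
        chat.append(("user1 (speech)", final))
    return chat
```

In the demo, the second user's question lands after the part of the report it refers to and before the part spoken afterwards, which is the ordering the embodiment aims to preserve.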

[0116] In one alternative embodiment of the claimed solution, the video conferencing system (100) is configured to seamlessly switch between client devices during video conferencing. Said seamless switching is implemented using at least one client device that has gone through the process of registering to join the conference via the messaging module (201), as follows:

- activate the function of transferring the conference to another client device and search for a client device with BLE support by scanning nearby devices;

- obtain scan results containing at least one BLE-enabled client device;

- check the distance to the found client device by comparing the received BLE signal strength indicator (BLE RSSI) with a threshold value;

- identify the found client device as being located next to the client device connected to the conference if the BLE RSSI is less than a threshold value;

- establish a BLE connection between the found client device and the client device connected to the conference;

- start a conference on the found client device by transferring data between the mentioned client devices via a BLE connection.

[0117] FIG. 4 shows a general view of a computing device (400), on the basis of which elements of the claimed system (100), such as client devices (101, 102) and a video conferencing server (104), can be implemented. In a particular implementation example, each of the modules (201, 202, 203, 204, 205, 206, 207, 208) included in the video conferencing server (104) can also be implemented on the basis of a computing device (400).

[0118] In general, a computing device (400) contains one or more processors (401) connected by a common data exchange bus, memory means such as RAM (402) and ROM (403), input/output interfaces (404), input/output devices (405), and network communication means (406).

[0119] The processor (401) (or multiple processors, or a multi-core processor) may be selected from a variety of devices commonly used today, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. A graphics processor, for example, from Nvidia, AMD, Graphcore, etc., can also be used as the processor (401).

[0120] RAM (402) is a random access memory and is designed to store machine-readable instructions executable by the processor (401) to perform the necessary logical data processing operations. RAM (402) typically contains executable operating system instructions and associated software components (applications, software modules, etc.).

[0121] ROM (403) is one or more permanent storage devices, such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0122] Various types of I/O interfaces (404) are used to organize the operation of the components of the device (400) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they may include, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0123] To provide user interaction with the computing device (400), various I/O information devices (405) are used, for example, a keyboard, a display (monitor), a touch display, a touch pad, a joystick, a mouse, a light pen, a stylus, touch panel, trackball, speakers, microphone, augmented reality tools, optical sensors, tablet, light indicators, projector, camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.

[0124] The network communication means (406) enable the device (400) to transmit data via an internal or external computer network, for example, an intranet, the Internet, a LAN, etc. One or more of the following can be used as the means (406), without limitation: Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.

[0125] Additionally, the device (400) can also use satellite navigation tools, for example, GPS, GLONASS, BeiDou, Galileo.

[0126] The submitted application materials disclose preferred examples of implementation of a technical solution and should not be interpreted as limiting other, particular examples of its implementation that do not go beyond the scope of the requested legal protection, which are obvious to specialists in the relevant field of technology.

Claims

CLAIMS

1. Video conferencing system containing:

- register connections of client devices to the conference,

- establish a secure connection with client devices,

- receive a data stream from at least one client device,

- transmit a data stream in parallel mode to at least one other client device and a stream recognition module,

2. The system according to claim 1, characterized in that the data stream received from at least one client device contains at least an audio and/or video stream.

3. The system according to claim 2, characterized in that the processing of the audio stream includes converting speech into text (Automatic Speech Recognition).

4. The system according to claim 2, characterized in that the processing of the video stream includes the conversion of gestures into text and/or graphic information (Gesture Recognition).

5. The system according to claim 2, characterized in that the processing of the video stream includes sign language recognition (Sign Language Recognition).

6. The system according to any one of claims 1 to 5, characterized in that the stream recognition module is additionally configured to determine the type of data stream and carry out appropriate processing of the data stream depending on the type of the data stream.

7. The system according to claim 6, characterized in that the video conferencing server additionally contains an audio stream filtering module configured to suppress noise.

8. The system according to claim 1, characterized in that the messages generated based on the processed data stream contain at least text and/or graphic information.

9. The system according to any one of claims 1-5, characterized in that the video conferencing server additionally contains a video stream modification module configured to

- background replacement,

- application of augmented reality filters (AR filters).

10. The system according to any one of claims 2-5, characterized in that the stream forwarding module is additionally configured to receive a command from at least one client device to recognize an audio and/or video stream; and transmit the audio and/or video stream to the stream recognition module after receiving said recognition command.

11. The system according to claims 2-3, characterized in that the stream recognition module is additionally configured to

- process fragments obtained as a result of splitting the audio stream,

- generate at least one message based on the processed fragments;

the video conferencing server additionally contains a negotiation module configured to

• if the flag value identifies the absence of current voice input from the client device user, send permission to the messaging module to send a message from another client device to the common information space;

• if the flag value identifies the presence of current voice input from the client device user, send to the stream forwarding module a request to send a generated message containing already processed fragments to the common information space and, after the stream forwarding module sends the generated message containing the already processed fragments to the common information space, send to the messaging module permission to send a message from another client device to the common information space;

the messaging module is additionally configured to

- send a request to the negotiation module for permission to send a message from another client device to the common information space,

- send messages from another client device to the common information space in response to receiving permission to send from the negotiation module.

12. The system according to claim 1, characterized in that the video conferencing server further comprises a common information space control module configured to determine that at least one generated message has been received from the stream forwarding module; and display, when outputting the at least one generated message, a label identifying generation of the generated message from the audio stream.

13. The system according to claim 1, characterized in that the common information space control module is additionally configured to

- save messages posted in the mentioned space,

- provide connected client devices with access to saved messages.

14. The system according to claim 1, characterized in that the common information space control module is additionally configured to implement

- search by the content of messages posted in the mentioned space,

- downloading the contents of messages posted in the mentioned space,

- control the display of messages posted in the mentioned space.

15. The system according to claim 1, characterized in that the video conferencing server further comprises a translation module configured to translate at least one message received from at least one client device and/or a message generated based on the processed data stream.

16. A method for implementing video conferencing, implemented using a video conferencing server and containing the steps of:

• connect at least two client devices to the video conferencing server,

• register connections of client devices to the conference,

• establish a secure connection with client devices,

• receive a data stream from at least one client device,
