WO2023066023A1 - Gesture-based communication method and apparatus, storage medium, and electronic apparatus - Google Patents

Gesture-based communication method and apparatus, storage medium, and electronic apparatus Download PDF

Info

Publication number
WO2023066023A1
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
stream
service
target
data channel
Prior art date
Application number
PCT/CN2022/123487
Other languages
French (fr)
Chinese (zh)
Inventor
Chen Xiaoli
Zhang Lu
Wang Mengxiao
Chen Shilin
Fang Yanwei
Original Assignee
ZTE Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation
Publication of WO2023066023A1 publication Critical patent/WO2023066023A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures

Definitions

  • The present disclosure relates to the field of communications, and in particular to a gesture-based communication method and apparatus, a storage medium, and an electronic apparatus.
  • Gestures are widely used in daily life. Gesture users, such as deaf-mute people, face great obstacles in communicating with hearing people: their gestures form a communication language (sign language) that is extremely difficult for non-professionals and hearing people to recognize accurately. When deaf-mute users dial public service numbers (119, 110, 120, etc.), the service personnel cannot directly understand what they want to express; when deaf-mute users participate in online teaching, they cannot interact with teachers in real time in a simple way; and deaf-mute users cannot communicate normally with hearing users over the phone. The gestures (sign language) of the deaf therefore need to be recognized, translated, and delivered within the communication. Gesture users also exist in specific application scenarios, such as military sign language and sign language for special industries, which likewise need to be recognized and translated.
  • In the related art, gesture recognition relies on specific equipment, such as wearable gloves. These devices are expensive and are only suitable for interaction within a limited range; they impose limitations in time and space and do not provide direct, natural interaction and communication. There are also vision-based gesture recognition schemes that rely on specific collectors, such as somatosensory sensors, to collect and analyze gesture data. For basic phone calls these schemes rely on the terminal equipment and place high demands on terminal processing, so they are neither economical nor convenient; information and data updates are not timely, and the communication experience is poor.
  • Embodiments of the present disclosure provide a gesture communication method, device, storage medium, and electronic device, so as to at least solve the technical problem in the related art that gesture communication mainly depends on specific equipment, resulting in high cost.
  • In an embodiment, a gesture communication method is provided, including: when a first terminal and a second terminal make a video call or an audio call, acquiring a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures recognized in video frames collected by the first terminal; creating the gesture recognition service in response to the first request; during the video call or audio call, acquiring a group of gestures recognized in a group of video frames collected by the first terminal; performing, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures; and sending the target semantics to the second terminal.
  • In an embodiment, a gesture communication apparatus is provided, including: a first acquisition module, configured to acquire, when a first terminal and a second terminal make a video call or an audio call, a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures recognized in video frames collected by the first terminal; a first creation module, configured to create the gesture recognition service in response to the first request; a second acquisition module, configured to acquire, during the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal; a recognition module, configured to perform, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures; and a first sending module, configured to send the target semantics to the second terminal.
  • In an embodiment, a computer-readable storage medium is provided, storing a computer program, where the computer program, when executed by a processor, performs the steps in any one of the above method embodiments.
  • In an embodiment, an electronic apparatus is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes, through the computer program, the steps in any one of the above method embodiments.
  • FIG. 1 is a block diagram of the hardware structure of a mobile terminal for a gesture communication method according to an embodiment of the present disclosure;
  • FIG. 2 is a flowchart of a gesture communication method according to an embodiment of the present disclosure;
  • FIG. 3 is a diagram of a gesture communication system structure and media path according to a specific embodiment of the present disclosure;
  • FIG. 4 is a first example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
  • FIG. 5 is a second example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
  • FIG. 6 is a third example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
  • FIG. 7 is a fourth example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
  • FIG. 8 is a fifth example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
  • FIG. 9 is a structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure;
  • FIG. 10 is a first preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure;
  • FIG. 11 is a second preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure.
  • FIG. 1 is a block diagram of a mobile terminal hardware structure of a gesture communication method according to an embodiment of the present disclosure.
  • The mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data. In an exemplary embodiment, the above mobile terminal may further include a transmission device 106 and an input/output device 108 configured for communication.
  • the structure shown in FIG. 1 is only for illustration, and it does not limit the structure of the above mobile terminal.
  • The mobile terminal may also include more or fewer components than those shown in FIG. 1.
  • The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the gesture communication method in the embodiments of the present disclosure; the processor 102 implements the above method by running the computer program stored in the memory 104.
  • the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory that is remotely located relative to the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Transmission device 106 is configured to receive or transmit data via a network.
  • the specific example of the above network may include a wireless network provided by the communication provider of the mobile terminal.
  • the transmission device 106 includes a network interface controller (NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module, which is configured to communicate with the Internet in a wireless manner.
  • FIG. 2 is a flowchart of a gesture communication method according to an embodiment of the present disclosure. As shown in FIG. 2 , the process includes the following steps:
  • Step S2002: when the first terminal and the second terminal make a video call or an audio call, obtain a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on the gestures recognized in the video frames collected by the first terminal;
  • Step S2004: create the gesture recognition service in response to the first request;
  • Step S2006: during the video call or audio call, acquire a group of gestures identified in a group of video frames collected by the first terminal;
  • Step S2008: through the gesture recognition service, perform semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain the target semantics represented by the group of gestures;
  • Step S2010: send the target semantics to the second terminal.
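The five steps above can be sketched as a minimal network-side handler. This is only an illustrative sketch: the class and function names (`GestureRecognitionService`, `handle_gesture_call`) and the toy gesture lexicon are assumptions invented for the example, not the patent's actual implementation.

```python
# Minimal sketch of the network-side flow in steps S2002-S2010; all class
# and function names here are hypothetical illustrations.

class GestureRecognitionService:
    """Created on the network side in response to the first request (S2004)."""

    # Toy gesture-to-semantics table standing in for a real recognizer.
    LEXICON = {"open_palm": "hello", "fist_to_chest": "help"}

    def recognize(self, gestures):
        # S2008: map each recognized gesture to semantics and join the result.
        return " ".join(self.LEXICON.get(g, "<unknown>") for g in gestures)


def handle_gesture_call(first_request, gestures, send_to_second_terminal):
    # S2002: the first request asks the network to create the service.
    if first_request != "create_gesture_recognition_service":
        raise ValueError("unexpected request")
    service = GestureRecognitionService()           # S2004: create the service
    target_semantics = service.recognize(gestures)  # S2006 + S2008
    send_to_second_terminal(target_semantics)       # S2010: deliver semantics
    return target_semantics
```

The key point the sketch captures is that recognition happens on the network side, so the terminals only exchange ordinary call media plus the resulting semantics.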
  • Through the above steps, a communication terminal can request the network-side device to create a gesture recognition service during a video call or audio call, and the gesture recognition service created on the network side can perform semantic recognition on the gestures recognized in the video frames collected by the communication terminal, without requiring specific equipment on the communication terminal to complete gesture semantic recognition locally. This solves the technical problem in the related art that gesture communication mainly depends on specific equipment and therefore incurs high cost, achieves the technical effect of reducing the cost of gesture communication, and further improves the user experience.
  • The executor of the above steps may be the network side or a network-side device, for example, a network device including a service control node, an application control node, and a media server, or a network device having the functions of a service control node, an application control node, and a media server. The executor may also be another processing device or processing unit with similar processing capabilities, but is not limited thereto.
  • The following description takes the network side performing the above operations as an example (this is only exemplary; in actual operation, other devices or modules may also perform the above operations):
  • The network side obtains the first request sent by the first terminal or the second terminal. The first request is used to request creation of a gesture recognition service for recognizing gestures collected by the first terminal during a video call or audio call, specifically for recognizing a group of gestures identified in a group of video frames collected by the first terminal. Of course, in practical applications, if the second terminal uses gestures to communicate, the first request can be used to request recognition of the gestures collected by the second terminal.
  • After receiving the first request, the network creates the gesture recognition service, which is used to recognize the above gestures. During the video call or audio call, the network obtains a group of gestures identified in a group of video frames collected by the first terminal: the video frame images collected by the first terminal can be obtained, a group of gestures can be recognized from the frame images, and the gesture recognition service created above then performs semantic recognition on the group of gestures to obtain the target semantics represented by the group of gestures, which are sent to the second terminal.
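The two stages described above (identifying gestures in the collected frame images, then recognizing their semantics) can be illustrated with a toy pipeline. The frame encoding, gesture labels, and lexicon below are invented for the sketch and are not part of the patent.

```python
# Illustrative two-stage pipeline: first identify a gesture in each collected
# video frame, then run semantic recognition over the identified group.

def identify_gestures(video_frames):
    # Stand-in for per-frame gesture detection on the network side.
    return [frame["gesture"] for frame in video_frames if "gesture" in frame]

def semantic_recognition(gestures, lexicon):
    # Map the group of gestures to the target semantics they represent.
    return " ".join(lexicon[g] for g in gestures if g in lexicon)

frames = [{"gesture": "wave"}, {"gesture": "point_self"}, {"blank": True}]
lexicon = {"wave": "hello", "point_self": "I"}
target_semantics = semantic_recognition(identify_gestures(frames), lexicon)
```

Separating detection from semantic recognition mirrors the text's distinction between "gestures identified in video frames" and the "target semantics" the recognition service produces from them.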
  • In this way, gesture communication in video or audio calls is realized, avoiding the problem in the related art that gesture communication must rely on specific devices or can only be realized in video calls. This solves the problem in the related art that gesture communication mainly depends on specific devices, resulting in high cost and poor experience, and achieves the effects of broadening the application range of gesture communication and improving the user experience.
  • In an exemplary embodiment, the method further includes: acquiring a second request sent by the first terminal or the second terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel is a channel that the first terminal or the second terminal is allowed to use. Acquiring the first request sent by the terminal includes: acquiring the first request transmitted by the first terminal or the second terminal on the target data channel.
  • That is, the second request sent by the first terminal or the second terminal can be obtained in order to create the target data channel. Usually, the second request is initiated by a terminal that supports the use of the target data channel: at least one of the first terminal and the second terminal supports its use, or both terminals do. The above first request is transmitted by the first terminal or the second terminal through the target data channel.
  • In an exemplary embodiment, acquiring the second request sent by the first terminal or the second terminal includes: acquiring the second request sent by the first terminal or the second terminal to the media server through the access control entity (SBC/P-CSCF), the session control entity (I/S-CSCF), and the service control node. Creating the target data channel in response to the second request includes: creating the target data channel through the media server in response to the second request, where the target data channel is used to transmit data between the first terminal or the second terminal and the media server.
  • That is, the second request is sent by the first terminal or the second terminal to the media server through the access control entity SBC/P-CSCF, the session control entity I/S-CSCF, and the service control node, and in response to the second request, the media server creates the target data channel, which is used to transmit data between the first terminal or the second terminal and the media server.
  • In this way, the purpose of establishing a dedicated data channel between the terminal and the media server is achieved.
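The relay path and channel creation above can be sketched as follows. Entity behaviour is reduced to simple forwarding, and the channel structure is an assumption made for illustration; real IMS signalling (SIP/SDP) is far richer.

```python
# Hypothetical sketch of the second request traversing the control entities
# named above before the media server creates the dedicated data channel.

def relay_request(request, hops=("SBC/P-CSCF", "I/S-CSCF", "service control node")):
    """Forward the request through each control entity, recording the path."""
    path = []
    for hop in hops:
        path.append(hop)  # each entity forwards the request unchanged
    return request, path

class MediaServer:
    def __init__(self):
        self.channels = {}

    def create_data_channel(self, terminal_id):
        # The target data channel carries data between the terminal and the
        # media server (the "dedicated data channel" in the text above).
        channel = {"endpoints": (terminal_id, "media_server"), "open": True}
        self.channels[terminal_id] = channel
        return channel

request, path = relay_request("create_target_data_channel")
server = MediaServer()
channel = server.create_data_channel("first_terminal")
```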
  • In an exemplary embodiment, acquiring the first request transmitted by the first terminal or the second terminal on the target data channel includes: acquiring the first request transmitted by the first terminal or the second terminal to the application control node on the target data channel. Creating the gesture recognition service in response to the first request includes: sending, by the application control node, a first instruction to the service control node, where the first instruction is used to instruct the service control node to send a second instruction to the media server, and the second instruction is used to instruct the media server to create the gesture recognition service; and, in response to the second instruction, creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service.
  • That is, the network side acquires the first request transmitted by the first terminal or the second terminal to the application control node on the target data channel; in response to the first request, the application control node sends a first instruction to the service control node to instruct it to send a second instruction to the media server; the second instruction instructs the media server to create the gesture recognition service; and in response to the second instruction, the gesture recognition service is created by the media server, or the media server instructs a third-party service component to create it. In this way, the purpose of creating the gesture recognition service is achieved.
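The two-instruction chain above (application control node → service control node → media server, with optional delegation to a third-party component) can be sketched as a call chain. All names and the dictionary result are illustrative assumptions.

```python
# Sketch of the instruction chain described above; names are hypothetical.

def media_server(second_instruction, third_party=None):
    # The media server creates the service itself, or delegates creation to
    # a third-party service component when one is configured.
    if third_party is not None:
        return third_party(second_instruction)
    return {"service": second_instruction, "created_by": "media_server"}

def service_control_node(first_instruction):
    # On the first instruction, relay a second instruction to the media server.
    return media_server(first_instruction)

def application_control_node(first_request):
    # The first request arrives on the target data channel; the application
    # control node issues the first instruction to the service control node.
    return service_control_node(first_request)

result = application_control_node("create_gesture_recognition_service")
```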
  • In an exemplary embodiment, the method further includes: sending a third instruction to the media server through the service control node, where the third instruction is used to request creation of a mixed media service, the mixed media service is used to process the video stream, audio stream, and data stream in the video call, or the audio stream and data stream in the audio call, and the data stream is a data stream representing the target semantics; and, in response to the third instruction, creating the mixed media service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service.
  • That is, the service control node may request the media server to create the mixed media service, and the media server then creates it, or the media server may instruct a third-party service component to create it.
  • In an exemplary embodiment, performing, through the gesture recognition service, semantic recognition on a group of gestures recognized in a group of video frames collected by the first terminal to obtain the target semantics represented by the group of gestures includes: performing semantic recognition on the group of gestures through the gesture recognition service to obtain one or more semantics, where each of the semantics is expressed by one or more gestures in the group of gestures; and generating, based on the one or more semantics, the target semantics corresponding to the group of gestures.
  • That is, semantic recognition is performed on the group of gestures identified in the video frame images collected by the first terminal to obtain one or more semantics, and then, based on the one or more semantics, the complete target semantics of the group of gestures are generated. In this way, the purpose of converting the gestures obtained from a terminal that communicates by gestures into target semantics is achieved.
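Since each semantic may be expressed by one or more gestures, assembling the target semantics amounts to segmenting the gesture group into known phrases and concatenating their meanings. The phrase table and the greedy longest-match strategy below are assumptions made for this toy illustration.

```python
# Toy illustration: generate the target semantics from one or more per-phrase
# semantics, each expressed by one or more gestures.

PHRASES = {
    ("point_self",): "I",
    ("fist", "to_chest"): "need help",
}

def recognize_semantics(gestures):
    semantics, i = [], 0
    while i < len(gestures):
        # Greedily match the longest known gesture sequence as one semantic.
        for length in (2, 1):
            key = tuple(gestures[i:i + length])
            if key in PHRASES:
                semantics.append(PHRASES[key])
                i += length
                break
        else:
            i += 1  # skip a gesture with no known semantics
    return semantics

def target_semantics(gestures):
    # Concatenate the one-or-more semantics into the complete target semantics.
    return " ".join(recognize_semantics(gestures))
```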
  • In an exemplary embodiment, sending the target semantics to the second terminal includes: when the target semantics are formed by concatenating the one or more semantics, sending each of the semantics included in the target semantics to the second terminal synchronously with the corresponding video frames in the group of video frames; or, when the target semantics are represented by a data stream corresponding to the group of video frames and the data stream is a text stream and an audio stream, synchronously synthesizing the text stream with the corresponding video frames in the group of video frames to obtain a target video stream, and synchronously sending the target video stream and the audio stream to the second terminal.
  • That is, each semantic included in the target semantics is sent to the second terminal synchronously with the corresponding video frames in the group of video frames. When the second terminal supports the use of the target data channel, the data stream representing the target semantics is sent to the second terminal through the target data channel synchronously with the video stream formed by the video frames; or, when the second terminal does not support the use of the target data channel, the text stream included in the data stream representing the target semantics is synchronously synthesized with the video frames to obtain the target video stream, which is then sent to the second terminal synchronously with the audio stream.
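The delivery branch above depends only on whether the receiving terminal supports the target data channel. A minimal sketch, assuming toy stream representations and a caption-style overlay as the "synthesis" step:

```python
# Sketch of the delivery decision; structures and the caption overlay are
# illustrative assumptions, not the patent's actual media processing.

def deliver_semantics(second_supports_channel, text_stream, audio_stream, video_frames):
    if second_supports_channel:
        # Send the data stream on the target data channel, synchronized with
        # the video stream formed by the frames and with the audio stream.
        return {"data_channel": text_stream,
                "video": video_frames,
                "audio": audio_stream}
    # Otherwise synthesize the text stream into the video frames (e.g. as
    # captions) to obtain the target video stream, sent with the audio.
    target_video = [f"{frame}+caption:{text}"
                    for frame, text in zip(video_frames, text_stream)]
    return {"video": target_video, "audio": audio_stream}
```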
  • In an exemplary embodiment, the method further includes: when the video call is made between the first terminal and the second terminal, and both the first terminal and the second terminal support the use of the target data channel, acquiring the second request sent by the first terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel includes a first target data channel and a second target data channel, the first target data channel is a data channel between the first terminal and the media server, and the second target data channel is a data channel between the second terminal and the media server. Acquiring the first request sent by the first terminal or the second terminal includes: acquiring the first request transmitted by the first terminal on the first target data channel. Creating the gesture recognition service in response to the first request includes: sending a target instruction to the media server through the service control node in response to the first request, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, and the mixed media service is used to process the video stream, audio stream, and data stream in the video call.
  • That is, when both the first terminal and the second terminal support the use of the target data channel, semantic recognition is performed on the group of gestures recognized in the first group of video frame images collected by the first terminal to obtain the target semantics. The first data stream representing the target semantics may include a text stream and a voice stream, that is, the gestures are converted into voice, text, or the like. Through the mixed media service and the gesture recognition service provided by the media server, the first video stream, the first audio stream, and the first data stream are synchronized and then sent to the second terminal, with the first data stream sent to the second terminal through the second target data channel (also called a dedicated data channel). The second terminal uses a non-gesture communication method, that is, normal video or voice communication; through the media server and/or a third-party service component, the voice frames of the second terminal are converted into a gesture stream and a target text stream, and the gesture stream, the target text stream, and the video frames and audio frames collected by the second terminal are synchronously sent to the first terminal, with the gesture stream and the target text stream sent through the first target data channel (also called a dedicated data channel). Through this embodiment, when both the first terminal and the second terminal support the use of the target data channel, the purpose of one end communicating interactively by gestures is achieved, with the gestures converted into a data stream and sent through the target data channel.
  • In an exemplary embodiment, the method further includes: when the video call is made between the first terminal and the second terminal, and the first terminal supports the use of the target data channel while the second terminal does not, acquiring the second request sent by the first terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel is a data channel between the first terminal and the media server. Acquiring the first request sent by the first terminal or the second terminal includes: acquiring the first request transmitted by the first terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: sending a target instruction to the media server through the service control node in response to the first request, where the target instruction is used to request creation of a mixed media service, a synthesis service, and the gesture recognition service, the mixed media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and creating the mixed media service, the synthesis service, and the gesture recognition service.
  • That is, semantic recognition is performed on the group of gestures identified in the second group of video frame images collected by the first terminal to obtain the target semantics. The first data stream representing the target semantics may include a first text stream and a voice stream, that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the first text stream representing the target semantics is synthesized, through the synthesis service provided by the media server, with the video stream formed by the second group of video frames to obtain a second video stream; then, through the mixed media service, the second audio stream included in the data stream representing the target semantics is synchronized with the second video stream, and the synchronized second video stream and second audio stream are sent to the second terminal. The second terminal uses a non-gesture communication method, that is, normal video or voice communication, and through the media server and/or a third-party service component, the voice frames of the second terminal are converted into a gesture stream and a target text stream. Through this embodiment, when the first terminal supports the use of the target data channel and the second terminal does not, the purpose of one end communicating interactively by gestures is achieved, with the gestures converted into a text stream that is synthesized with the video stream and then sent synchronously with the audio stream.
  • In an exemplary embodiment, the method further includes: when the video call is made between the first terminal and the second terminal, and the first terminal does not support the use of the target data channel while the second terminal does, acquiring the second request sent by the second terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel is a data channel between the second terminal and the media server. Acquiring the first request sent by the first terminal or the second terminal includes: acquiring the first request transmitted by the second terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: sending a target instruction to the media server through the service control node in response to the first request, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and creating the mixed media service and the gesture recognition service.
  • That is, semantic recognition is performed on the group of gestures identified in the third group of video frame images collected by the first terminal to obtain the target semantics. The third data stream representing the target semantics may include a text stream and a voice stream, that is, the gestures are converted into voice, text, or the like. Through the mixed media service provided by the media server, the third video stream, the third audio stream, and the third data stream are synchronized and then sent to the second terminal, with the third data stream sent on the target data channel. The second terminal uses a non-gesture communication method, that is, normal video or voice communication; through the media server and/or a third-party service component, the voice frames of the second terminal are converted into a gesture stream and a target text stream, and then, through the synthesis service provided by the media server, the gesture stream, the target text stream, and the video frames collected by the second terminal are synthesized to obtain a target video stream, which is sent to the first terminal synchronously with the audio frames collected by the second terminal. Through this embodiment, the purpose of one end communicating interactively by gestures is achieved, with the gestures converted into a text stream and sent through the target data channel.
  • the method further includes: conducting the audio call between the first terminal and the second terminal, and both the first terminal and the second terminal support the use of target data
  • a channel obtain the second request sent by the first terminal, where the second request is used to request to create a target data channel; in response to the second request, create the target data channel, where the The target data channel includes a first target data channel and a second target data channel, the first target data channel is a data channel between the first terminal and the media server, and the second target data channel is the first target data channel A data channel between the second terminal and the media server;
• Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the first terminal on the first target data channel.
• Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed-media service and the gesture recognition service, the mixed-media service being used to process the audio stream and the data stream.
• The gestures identified in the fourth group of video frame images collected by the first terminal are semantically recognized to obtain the target semantics.
• The first data stream used to represent the target semantics may include a text stream and a voice stream; that is, the gestures are converted into voice, text, or the like.
• The mixed-media service and gesture recognition service provided by the media server synchronize the audio stream formed by the fourth group of audio frames collected by the first terminal with the first data stream, and then send them to the second terminal; the first data stream is sent to the second terminal through the second target data channel (also called a dedicated data channel).
• The second terminal uses a non-gesture communication method, that is, it communicates in normal voice mode. The media server and/or a third-party service component converts the voice frames of the second terminal into a gesture stream and a target text stream, and sends the gesture stream and the target text stream to the first terminal in synchronization with the video frames and/or audio frames.
• When both the first terminal and the second terminal support the use of the target data channel, one end can communicate interactively using gestures, and the gestures are converted into a data stream and sent through the target data channel.
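The second request that sets up the dedicated data channel can be pictured as an SDP offer adding a data-channel media line next to the existing audio line, in the style of RFC 8841 (`m=application ... webrtc-datachannel`). The sketch below is a minimal, hypothetical offer builder; the addresses and ports are placeholders, not values from the disclosure.

```python
# Illustrative builder for an SDP offer that requests a dedicated data channel
# alongside an audio stream. Session parameters are placeholders.

def build_datachannel_offer(audio_port=49170, dc_port=10000):
    lines = [
        "v=0",
        "o=- 0 0 IN IP4 192.0.2.1",
        "s=-",
        "c=IN IP4 192.0.2.1",
        "t=0 0",
        f"m=audio {audio_port} RTP/AVP 0",  # existing audio call media line
        # Dedicated data channel media line (RFC 8841 style):
        f"m=application {dc_port} UDP/DTLS/SCTP webrtc-datachannel",
        "a=sctp-port:5000",                 # SCTP port for the data channel
    ]
    return "\r\n".join(lines) + "\r\n"
```

Such an offer would travel in the Invite that the first terminal sends toward the service control node.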
• In an exemplary embodiment, the method further includes: when the audio call is conducted between the first terminal and the second terminal, the first terminal supports the use of a target data channel, and the second terminal does not support the use of the target data channel, obtaining a second request sent by the first terminal, where the second request is used to request creation of a target data channel; in response to the second request, creating the target data channel, where the target data channel is a data channel between the first terminal and a media server. Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the first terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of the gesture recognition service; and creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service. During the video call or audio call, a group of recognized gestures is obtained.
• When the first terminal supports the use of the target data channel and the second terminal does not, after the gesture recognition service is created, the group of gestures identified in the fifth group of video frame images collected by the first terminal is semantically recognized to obtain the target semantics.
• The data stream used to represent the target semantics may include a text stream and a voice stream; that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the fifth audio stream representing the target voice is sent to the second terminal.
• In this embodiment, the second terminal uses a non-gesture communication mode, that is, it communicates in normal voice mode. The media server and/or third-party service component converts the voice frames of the second terminal into a gesture stream and a target text stream, and sends the gesture stream and the target text stream to the first terminal through the target data channel (also called a dedicated data channel), in synchronization with the audio stream collected by the second terminal.
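One way to picture the capability-dependent behavior in these embodiments is a small routing decision: terminals with the target data channel receive text/gesture data on the dedicated channel, while terminals without it receive only synthesized media. The stream names in this Python sketch are illustrative assumptions, not identifiers from the disclosure.

```python
def plan_streams(terminal_supports_data_channel, is_gesture_user):
    """Choose which streams the media server delivers to one terminal.

    Mirrors the cases above: a data-channel-capable terminal can receive
    text/gesture data on the dedicated channel; a legacy terminal must
    receive everything inside ordinary media streams.
    """
    if terminal_supports_data_channel:
        media = ["audio", "video"]
        # A non-gesture user additionally receives a generated gesture stream.
        data = ["text_stream"] if is_gesture_user else ["text_stream", "gesture_stream"]
    else:
        # No data channel: synthesize text/gestures into the media itself.
        media = ["audio", "synthesized_video_with_text_overlay"]
        data = []
    return {"media": media, "data_channel": data}
```

For example, `plan_streams(True, False)` models the non-gesture type-1 terminal, while `plan_streams(False, False)` models the legacy type-2 terminal.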
  • Fig. 3 is a gesture communication system structure and media path diagram according to a specific embodiment of the present disclosure. As shown in Fig. 3, the system includes:
• S101 terminal (type 1): a new type of terminal, equivalent to the aforementioned terminal that supports the target data channel (hereinafter "type 1"). It supports real-time audio and video stream channels as well as a dedicated real-time data stream channel (the dedicated data channel, corresponding to the aforementioned target data channel). In the present disclosure, this terminal interacts with network-side entities through the dedicated data channel to provide end users with a new service experience; it receives network-side data streams through the dedicated channel and receives audio and video streams through the audio/video stream channel. The terminal may be an independent application program or a dedicated terminal device.
• S102 terminal (type 2): a traditional terminal, equivalent to the aforementioned terminal that does not support the target data channel (hereinafter "type 2"); it supports only real-time audio and video stream channels. The terminal interacts with the SBC/P-CSCF entity on the network side to provide a service experience for end users, and receives audio and video streams through the audio/video stream channels.
• SBC/P-CSCF: provides signaling and media access for terminals, supports audio/video stream channels and data stream channels, and forwards audio/video streams and data streams.
• I/S-CSCF (Interrogating/Serving Call Session Control Function): provides registration authentication, session control, call routing, and similar functions for multiple types of terminals in the IMS network.
• S105 Service Control Node: as the signaling control network element of the gesture communication system, it undertakes IMS call management and is responsible for call control. As the service-provider network element of gesture communication, it can invoke related services through the service bus and provide communication and service capabilities to other applications; the service invokes and controls the forwarding of various media data streams, including real-time audio/video media stream forwarding and data stream forwarding.
• The application control node can call the media server and third-party service components, apply for resources, and realize gesture recognition and translation to voice, gesture-stream animation generation, synthesis of audio/video media streams, and integration of data streams into media streams, and notify the service results.
• The service control node can exist independently, or it can be co-located with the application control node.
• S106 Application Control Node: implements various business service logics. Specific functions include but are not limited to: (1) determining the media stream and data stream categories to be sent according to the application form of the terminal (version number, device type, specific label, etc.), and converting them into real-time media streams; (2) sending application control requests to the service control node, and calling third-party service components and the media server to realize image processing, gesture recognition, conversion, and synthesis; (3) invoking, through the service bus, the various services provided by the media server and reporting the service results.
• The application control node can exist independently, or it can be co-located with the service control node.
• S107 Media Server: provides various media services. Specific functions include but are not limited to: (1) image recognition, for example recognizing images and gestures through feature data comparison; (2) real-time media stream generation services, such as converting voice clips into corresponding RTP media streams; (3) real-time gesture stream generation, which automatically generates gesture-stream video for recognized gestures; (4) synthesis service, which synthesizes existing and generated media streams and gesture streams and outputs them as real-time audio/video streams, combining video streams, gesture streams, and text streams into the video stream; (5) real-time audio/video stream forwarding, which anchors, processes, and forwards the audio/video streams of the current call; (6) data stream forwarding service, which forwards gesture streams, text streams, and other data streams through a dedicated data channel, and establishes a dedicated channel for forwarding the synthesized integrated data stream; (7) exposing its services so that the service control node and the application control node can call them through the service bus; (8) mixed-media service, which supports the processing of audio and video streams.
• S108 Third-party service component: can be called by the service control node and the application control node, and provides gesture language translation, audio-to-text conversion services, and the like.
• S109 HSS: provides user service data and other related content.
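The synthesis service described for the media server (function (4): combining video, gesture, and text streams into one output video stream) can be sketched abstractly as a per-frame overlay merge. This toy model ignores codecs and real pixel data; the data structures are assumptions made for illustration only.

```python
def synthesize(video_frames, text_stream, gesture_frames=None):
    """Toy model of the media server's synthesis service: attach text (and
    optionally gesture-animation) overlays to each output video frame.

    Real servers operate on decoded pixel data; here frames are opaque tokens.
    """
    out = []
    for i, frame in enumerate(video_frames):
        overlay = {"text": text_stream[i] if i < len(text_stream) else ""}
        if gesture_frames:
            # Loop the generated gesture animation over the video timeline.
            overlay["gesture"] = gesture_frames[i % len(gesture_frames)]
        out.append({"frame": frame, "overlay": overlay})
    return out
```

The synthesized result would then be encoded and forwarded on the ordinary real-time audio/video stream channel.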
• User UE A carries the terminal identifier to initiate an audio or video call request to the IMS network and calls UE B; an audio or video call is established with UE B through the SBC/P-CSCF, I/S-CSCF, service control node, and other network elements.
• The terminal (type 1) is a new type of terminal that has real-time audio/video stream channels as well as a dedicated channel for real-time data streams.
• The terminal (type 2) is a traditional terminal that supports only real-time audio/video stream channels.
• The terminal (type 1) user that supports the data stream channel applies to the "media server", through the "SBC/P-CSCF", "I/S-CSCF", and "service control node", to create data channel resources.
  • the terminal (type 1, dedicated data channel) initiates a gesture recognition conversion request to the "application control node" through the data channel;
  • the "application control node” instructs the “service control node” to create gesture recognition resources
  • the "service control node” instructs the "media server” to create a mixed media service, which requires gesture recognition related services;
  • the “media server” applies for the gesture recognition service from the "third-party service component", and the mixed media service is created successfully.
  • the "service control node” invites UE A and UE B to join the conference respectively through Reinvite; applies to the "media server” for UE A and UE B membership resources;
  • UE A and UE B media are anchored to the "media server";
  • the "service control node” applies to the “media server” for processing such as gesture recognition, gesture translation business types and synthesis;
  • the "media server” applies to the "third-party service component” for services such as gesture recognition, gesture translation, speech-to-text, text-to-speech, gesture stream generation, voice stream generation, gesture stream, voice stream, text stream, video stream synthesis, forwarding, etc.
  • “Media Server” and “Third Party Service Components” perform corresponding services;
• The media server sends different stream information (synthesized and non-synthesized) to the different terminal types of UE A and UE B, including voice streams, video streams, gesture streams, text streams, etc.;
  • the "media server” returns operation responses such as gesture recognition, gesture stream, text stream, voice stream, etc. to the "service control node".
• Gesture user: terminal type 1, with a dedicated data channel.
• Non-gesture user: terminal type 1, with a dedicated data channel.
• Fig. 4 is a first example of a gesture communication method according to a specific embodiment of the present disclosure. In this embodiment, the gesture user UE A of the terminal (type 1) calls the non-gesture user UE B of the terminal (type 1), taking a video call as an example for illustration:
• Step S201: The gesture user UE A of the terminal (type 1) carries the terminal identifier to initiate a video call to the SBC/P-CSCF and calls the non-gesture user UE B; the Invite carries the SDP information for the terminal's video and audio.
  • Step S202 SBC/P-CSCF transparently transmits the Invite call information to I/S-CSCF;
  • Step S203 The I/S-CSCF finds the service control node corresponding to the user, and sends call information to it;
  • Steps S204-S206 make a video call to the non-gesture user UE B of the terminal (type 1);
• Steps S207-S218: UE B sends a 200 OK message carrying the terminal identifier and answers by going off-hook; UE A returns an ACK message; UE A and UE B establish a video call.
• Steps S219-S229: UE A applies for the creation of data channel resources. UE A needs gesture recognition and sends an Invite request carrying the dedicated data channel SDP, which reaches the "service control node" through the SBC/P-CSCF and I/S-CSCF; the "service control node" applies to the "media server" to create a data channel for UE A; the "media server" returns to the "service control node" that the creation of the data channel is complete.
  • Step S230 UE A initiates a gesture recognition conversion request through the data channel
  • Step S231 the "application control node” instructs the “service control node” to create gesture recognition resources
  • Step S232 the "service control node” instructs the "media server” to create a mixed media service, which needs to use the gesture recognition service;
  • Step S233 the "media server” applies to the "third-party service component” for a gesture recognition service
  • Step S234 the "media server” returns to the "service control node” the success of creating the mixed media service
• Steps S235-S246: The "service control node" invites UE B to join the conference and applies for mixed media resources for UE B; the "service control node" sends a Reinvite message carrying SDP to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B. The media of UE B is anchored to the media server.
• Steps S247-S258: The "service control node" invites UE A to join the conference and applies for mixed media resources for UE A; the "service control node" sends a Reinvite message carrying SDP to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A. The media of UE A is anchored to the media server.
  • Step S259 The "service control node” applies to the “media server” for gesture translation service types and synthesis processing;
• Step S260: The "media server" applies to the "third-party service component" for services such as voice-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, the synthesis service, real-time audio/video stream forwarding, and data stream forwarding.
• Steps S261-S264: The "media server" sends the gesture stream, text stream, voice stream, and video stream media information to UE A. The media stream information can be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S265 The "media server” applies to the "third-party service component” for gesture translation, synthesis and forwarding services;
• Steps S266-S268: The "media server" sends the voice stream, text stream, and video stream media information to UE B. The media stream information can be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S269 The "media server” returns operation responses such as gesture recognition, gesture stream, text stream, voice stream, etc. to the "service control node".
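The service-creation handshake in steps S230-S234 is essentially a chain of delegated requests: data channel, to application control node, to service control node, to media server, to third-party component. The classes below are stand-ins for those network elements, sketched under the assumption that each element simply forwards the request and aggregates the result; none of these class names or methods come from the disclosure.

```python
# Toy model of the mixed-media-service creation chain (steps S230-S234).

class ThirdPartyComponent:
    def create_gesture_recognition(self):
        return "gesture-recognition-ready"          # S233

class MediaServer:
    def __init__(self, third_party):
        self.third_party = third_party
    def create_mixed_media_service(self):           # S232
        status = self.third_party.create_gesture_recognition()
        return {"mixed_media": "created", "gesture_recognition": status}

class ServiceControlNode:
    def __init__(self, media_server):
        self.media_server = media_server
    def create_gesture_resources(self):             # S231
        return self.media_server.create_gesture_resources_impl() \
            if False else self.media_server.create_mixed_media_service()

class ApplicationControlNode:
    def __init__(self, service_control):
        self.service_control = service_control
    def on_gesture_request(self):                   # S230, via the data channel
        return self.service_control.create_gesture_resources()

result = ApplicationControlNode(
    ServiceControlNode(MediaServer(ThirdPartyComponent()))).on_gesture_request()
```

Step S234's "creation success" response corresponds to the dictionary returned back up the chain.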
• Non-gesture user: terminal type 2, without a dedicated data channel.
• Gesture user: terminal type 1, with a dedicated data channel.
• Fig. 5 is a second example of a gesture communication method according to a specific embodiment of the present disclosure. In this embodiment, the non-gesture user UE A uses terminal type 2 (no dedicated data channel) and the gesture user UE B uses terminal type 1 (with a dedicated data channel):
• Step S301: The non-gesture user UE A of the terminal (type 2) carries the terminal identifier to initiate a video call to the SBC/P-CSCF and calls the gesture user UE B; the Invite carries the SDP information for the terminal's video and audio.
  • Step S302 SBC/P-CSCF transparently transmits the Invite call information to I/S-CSCF;
  • Step S303 The I/S-CSCF finds the service control node corresponding to the user, and sends call information to it;
  • Steps S304-S306 video call to the gesture user UE B of the terminal (type 1);
• Steps S307-S318: UE B sends a 200 OK message carrying the terminal identifier and answers by going off-hook; UE A returns an ACK message; UE A and UE B establish a video call.
• Steps S319-S329: UE B applies for the creation of data channel resources. UE B needs gesture recognition and sends an Invite request carrying the dedicated data channel SDP, which reaches the "service control node" through the SBC/P-CSCF and I/S-CSCF; the "service control node" applies to the "media server" to create a data channel for UE B; the "media server" returns to the "service control node" that the creation of the data channel is complete.
  • Step S330 UE B initiates a gesture recognition conversion request through the data channel
  • Step S331 the "application control node” instructs the “service control node” to create gesture recognition resources
  • Step S332 the "service control node” instructs the "media server” to create a mixed media service, which needs to use the gesture recognition service;
  • Step S333 the "media server” applies to the "third-party service component” for a gesture recognition service
• Step S334: The "media server" returns to the "service control node" that the mixed-media service was created successfully.
• Steps S335-S346: The "service control node" invites UE A to join the conference and applies for mixed media resources for UE A; the "service control node" sends a Reinvite message carrying SDP to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A. The media of UE A is anchored to the media server.
• Steps S347-S358: The "service control node" invites UE B to join the conference and applies for mixed media resources for UE B; the "service control node" sends a Reinvite message carrying SDP to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B. The media of UE B is anchored to the media server.
  • Step S359 The "service control node” applies to the “media server” for gesture translation service types and synthesis processing;
• Step S360: The "media server" applies to the "third-party service component" for gesture translation, synthesis and forwarding services, voice-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, the synthesis service, real-time audio/video stream forwarding, data stream forwarding, and so on.
• Steps S361-S362: The "media server" sends to UE A the real-time voice stream converted from gestures and the media stream information including the video stream synthesized from video and text. The media stream information can be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S363 The "media server” applies to the "third-party service component” for gesture stream generation, translation, synthesis and forwarding services;
• Steps S364-S367: The "media server" sends the gesture stream, voice stream, text stream, and video stream media information to UE B. The media stream information can be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S368 The "media server” returns operation responses such as gesture recognition, gesture stream, text stream, voice stream, etc. to the "service control node".
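The reverse direction used in this scenario (converting the non-gesture user's speech into a gesture stream plus text stream for the gesture user) can be sketched as a text-to-gesture lookup after speech recognition. The phrase table and gesture identifiers below are invented for illustration; a real system would use a full sign-language generation model.

```python
def voice_to_gesture_stream(voice_text):
    """Illustrative speech-to-gesture conversion: speech-recognized text is
    mapped to a sequence of gesture animation identifiers, and the original
    text is kept as the accompanying text stream."""
    TEXT_TO_GESTURE = {          # hypothetical phrase-to-animation table
        "hello": ["wave"],
        "thank you": ["flat-hand-from-chin"],
    }
    gestures = []
    for phrase, animation in TEXT_TO_GESTURE.items():
        if phrase in voice_text.lower():
            gestures.extend(animation)
    return {"gesture_stream": gestures, "text_stream": voice_text}
```

The media server would render the gesture identifiers into a gesture-stream video and deliver it, with the text stream, over the gesture user's dedicated data channel.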
• Gesture user: terminal type 2, without a dedicated data channel.
• Non-gesture user: terminal type 1, with a dedicated data channel.
• Fig. 6 is a third example of a gesture communication method according to a specific embodiment of the present disclosure. In this embodiment, the gesture user UE A uses terminal type 2 (no dedicated data channel) and the non-gesture user UE B uses terminal type 1 (with a dedicated data channel):
• Step S401: The gesture user UE A of the terminal (type 2) carries the terminal identifier to initiate a video call to the SBC/P-CSCF and calls the non-gesture user UE B; the Invite carries the SDP information for the terminal's video and audio.
  • Step S402 SBC/P-CSCF transparently transmits the Invite call information to I/S-CSCF;
  • Step S403 The I/S-CSCF finds the service control node corresponding to the user, and sends call information to it;
  • Steps S404-S406 make a video call to the non-gesture user UE B of the terminal (type 1);
• Steps S407-S418: UE B sends a 200 OK message carrying the terminal identifier and answers by going off-hook; UE A returns an ACK message; UE A and UE B establish a video call.
• Steps S419-S429: UE B applies for the creation of data channel resources. UE B needs gesture recognition and sends an Invite request carrying the dedicated data channel SDP, which reaches the "service control node" through the SBC/P-CSCF and I/S-CSCF; the "service control node" applies to the "media server" to create a data channel for UE B; the "media server" returns to the "service control node" that the creation of the data channel is complete.
  • Step S430 UE B initiates a gesture recognition conversion request through the data channel
  • Step S431 the "application control node” instructs the “service control node” to create gesture recognition resources
  • Step S432 the "service control node” instructs the "media server” to create a mixed media service, which needs to use the gesture recognition service;
  • Step S433 the "media server” applies to the "third-party service component” for a gesture recognition service
  • Step S434 the "media server” returns to the "service control node” the success of creating the mixed media service
• Steps S435-S446: The "service control node" invites UE A to join the conference and applies for mixed media resources for UE A; the "service control node" sends a Reinvite message carrying SDP to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A. The media of UE A is anchored to the media server.
• Steps S447-S458: The "service control node" invites UE B to join the conference and applies for mixed media resources for UE B; the "service control node" sends a Reinvite message carrying SDP to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B. The media of UE B is anchored to the media server.
  • Step S459 The "service control node” applies to the “media server” for gesture translation service types and synthesis processing;
• Step S460: The "media server" applies to the "third-party service component" for gesture translation, gesture stream generation, synthesis and forwarding services, speech-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, the synthesis service, real-time audio/video stream forwarding, data stream forwarding, and so on.
• Steps S461-S462: The "media server" sends to UE A the real-time voice stream converted from gestures and the media stream information containing the video stream synthesized from video and text. The media stream information can be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S463 The "media server” applies to the "third-party service component” for gesture stream generation, translation, synthesis and forwarding services;
• Steps S464-S466: The "media server" sends the voice stream, text stream, and video stream media information to UE B. The media stream information can be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S467 The "media server” returns operation responses such as gesture recognition, gesture stream, text stream, voice stream, etc. to the "service control node".
• Gesture user: terminal type 1, with a dedicated data channel.
• Non-gesture user: terminal type 1, with a dedicated data channel.
• Fig. 7 is a fourth example of a gesture communication method according to a specific embodiment of the present disclosure. As shown in Fig. 7, in this embodiment, the gesture user UE A of the terminal (type 1) calls the non-gesture user UE B of the terminal (type 1), taking an audio call as an example for illustration:
• Step S501: The gesture user UE A of the terminal (type 1) carries the terminal identifier to initiate an audio call to the SBC/P-CSCF and calls the non-gesture user UE B; the Invite carries the SDP information for the terminal's audio.
  • Step S502 SBC/P-CSCF transparently transmits the Invite call information to I/S-CSCF;
  • Step S503 The I/S-CSCF finds the service control node corresponding to the user, and sends call information to it;
  • Steps S504-S506 making an audio call to the non-gesture user UE B of the terminal (type 1);
• Steps S507-S518: UE B sends a 200 OK message carrying the terminal identifier and answers by going off-hook; UE A returns an ACK message; UE A and UE B establish an audio call.
• Steps S519-S529: UE A activates the gesture recognition application, opens the camera, and applies for the creation of data channel resources. UE A needs gesture recognition and sends an Invite request carrying the dedicated data channel SDP, which reaches the "service control node" through the SBC/P-CSCF and I/S-CSCF; the "service control node" applies to the "media server" to create a data channel for UE A; the "media server" returns to the "service control node" that the creation of the data channel is complete; the gesture recognition application collects gesture data.
  • Step S530 UE A initiates a gesture recognition conversion request through the data channel
  • Step S531 the "application control node” instructs the “service control node” to create gesture recognition resources
  • Step S532 the "service control node” instructs the "media server” to create a mixed media service, which needs to use the gesture recognition service;
  • Step S533 The "media server” applies to the "third-party service component” for a gesture recognition service
  • Step S534 the "media server” returns to the "service control node” the success of creating the mixed media service
• Steps S535-S546: The "service control node" invites UE B to join the conference and applies for mixed media resources for UE B; the "service control node" sends a Reinvite message carrying SDP to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B. The media of UE B is anchored to the media server.
• Steps S547-S558: The "service control node" invites UE A to join the conference and applies for mixed media resources for UE A; the "service control node" sends a Reinvite message carrying SDP to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A. The media of UE A is anchored to the media server.
  • Step S559 The "service control node” applies to the “media server” for gesture translation service types and synthesis processing;
  • Step S560 The "media server” applies to the "third-party service component" for voice-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, synthesis service, real-time audio stream forwarding, Services such as data flow forwarding;
• Steps S561-S563: The "media server" sends the gesture stream, text stream, and voice stream media information to UE A. The media stream information can be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S564 The "media server” applies to the "third-party service component” for gesture translation stream synthesis and forwarding service;
• Steps S565-S566: The "media server" sends the voice stream and text stream media information to UE B. The media stream information can be sent from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal; it can also be sent from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S567 The "media server” returns operation responses such as gesture recognition, gesture stream, text stream, voice stream, etc. to the "service control node".
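On the terminal side of this audio-call case, the gesture recognition application samples camera frames and ships conversion requests over the dedicated data channel while the audio call continues unchanged. The batching sketch below is hypothetical; `send_on_data_channel` is an assumed transport hook, not an API from the disclosure.

```python
def collect_and_send(frames, send_on_data_channel, batch_size=4):
    """Batch captured camera frames into gesture-recognition conversion
    requests and ship each batch over the dedicated data channel.

    frames:               opaque captured frames from the gesture application
    send_on_data_channel: callable taking one request dict (assumed hook)
    """
    requests = []
    for i in range(0, len(frames), batch_size):
        request = {
            "type": "gesture-recognition",      # conversion request (cf. S530)
            "frames": frames[i:i + batch_size],
        }
        send_on_data_channel(request)
        requests.append(request)
    return requests
```

The network side would answer each request with the recognized text/voice streams described in steps S561-S566.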
• Non-gesture user: terminal type 2, without a dedicated data channel.
• Gesture user: terminal type 1, with a dedicated data channel.
• Fig. 8 is a fifth example of a gesture communication method according to a specific embodiment of the present disclosure. In this embodiment, the non-gesture user UE A uses terminal type 2 (no dedicated data channel) and the gesture user UE B uses terminal type 1 (with a dedicated data channel):
• Step S601: The non-gesture user UE A of the terminal (type 2) carries the terminal identifier to initiate an audio call to the SBC/P-CSCF and calls the gesture user UE B; the Invite carries the SDP information for the terminal's audio.
  • Step S602 SBC/P-CSCF transparently transmits the Invite call information to I/S-CSCF;
  • Step S603 The I/S-CSCF finds the service control node corresponding to the user, and sends call information to it;
  • Steps S604-S606 audio call to the gesture user UE B of the terminal (type 1);
  • Steps S607-S618 UE B sends a 200OK message carrying the terminal ID, and answers by off-hook; UE A returns an ACK message; UE A and UE B establish an audio call;
  • Steps S619-S629 UE B activates the gesture recognition application, turns on the camera, and applies for creating a data channel resource; UE B needs gesture recognition, sends an Invite request carrying a dedicated data channel SDP data channel, and passes through SBC/P-CSCF, I/S -CSCF, reach "service control node”; “service control node” applies to “media server” to create UE B data channel; “media server” returns data channel creation to "service control node”; gesture recognition application will collect gesture data ;
  • Step S630: UE B initiates a gesture recognition conversion request through the data channel.
  • Step S631: The "application control node" instructs the "service control node" to create gesture recognition resources.
  • Step S632: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service.
  • Step S633: The "media server" applies to the "third-party service component" for the gesture recognition service.
  • Step S634: The "media server" returns to the "service control node" that creation of the mixed media service succeeded.
  • Steps S635-S646: The "service control node" invites UE A to join the conference and applies for mixed media resources for UE A. The "service control node" sends a Reinvite message carrying the SDP to UE A; UE A returns a 200 OK message carrying its SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server.
  • Steps S647-S658: The "service control node" invites UE B to join the conference and applies for mixed media resources for UE B. The "service control node" sends a Reinvite message carrying the SDP to UE B; UE B returns a 200 OK message carrying its SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server.
  • Step S659: The "service control node" applies to the "media server" for the gesture translation service types and synthesis processing.
  • Step S660: The "media server" applies to the "third-party service component" for gesture translation and forwarding services: voice-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, synthesis services, real-time audio stream forwarding, data stream forwarding, and so on.
  • Step S661: The "media server" sends to UE A the media stream information of the real-time voice stream converted from the gestures. The media stream information can travel from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S662: The "media server" applies to the "third-party service component" for gesture stream generation, translation, synthesis, and forwarding services.
  • Steps S663-S665: The "media server" sends the media stream information of the gesture stream, voice stream, and text stream to UE B. The media stream information can travel from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "service control node" and "application control node" to the SBC/P-CSCF and then to the terminal.
  • Step S666: The "media server" returns operation responses for gesture recognition, the gesture stream, the text stream, the voice stream, etc. to the "service control node".
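Condensed, the Fig. 8 flow falls into the following step groups. The grouping and its labels are our summary of the steps above, not the patent's own terminology.

```python
# Condensed, illustrative ordering of the Fig. 8 flow (audio call between
# a type-2 non-gesture user and a type-1 gesture user). Step groups only.

FIG8_PHASES = [
    ("S601-S618", "establish audio call between UE A and UE B"),
    ("S619-S629", "UE B creates dedicated data channel, starts gesture capture"),
    ("S630-S634", "gesture recognition requested; mixed media service created"),
    ("S635-S658", "UE A and UE B re-invited; media anchored on media server"),
    ("S659-S660", "gesture translation, synthesis and forwarding applied for"),
    ("S661-S666", "converted streams delivered to UE A and UE B"),
]

def phase_of(step: int) -> str:
    """Map a step number (e.g. 630) to its phase description."""
    for rng, desc in FIG8_PHASES:
        lo, hi = (int(s.lstrip("S")) for s in rng.split("-"))
        if lo <= step <= hi:
            return desc
    raise ValueError(f"unknown step S{step}")
```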
  • The goals that can be achieved include: 1) transmitting gesture information over a dedicated data channel; 2) reducing the requirements on the terminal by performing gesture recognition on the network side: the terminal only needs to be an ordinary device with an integrated collection capability, such as a common mobile phone, which the gesture recognition application can instruct, when an IMS call is established, to collect gestures as required; the collected gesture-related information is transmitted through the dedicated channel to initiate a gesture recognition request to the gesture recognition application server; 3) providing comprehensive services on the platform side, including gesture recognition, analysis, and synthesis, with service information transmitted through the dedicated channel; 4) supporting two-way conversion between sign language and voice/video: the platform recognizes, analyzes, and processes the data, synthesizes it, and after processing and rendering produces the translated text, standard sign-language video, and the original voice/video stream; 5) supporting conversion of communication content between different terminal types: the information flows between terminals are converted so that gesture communication is achieved between different types of terminals.
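Goal 4) above, two-way conversion between sign language and voice/video, can be pictured as a dispatch table. The converter names and the string-tagged outputs below are purely illustrative stand-ins for the platform-side services; the patent assigns the actual work to the media server and third-party service components.

```python
# Illustrative dispatch for two-way sign/voice conversion (goal 4).
# Each converter is a placeholder that tags its payload.

CONVERTERS = {
    ("sign", "voice"): lambda p: f"voice<{p}>",
    ("sign", "text"): lambda p: f"text<{p}>",
    ("voice", "sign"): lambda p: f"sign_video<{p}>",
    ("voice", "text"): lambda p: f"text<{p}>",
}

def convert(src: str, dst: str, payload: str) -> str:
    """Dispatch a conversion; unsupported directions raise ValueError."""
    try:
        return CONVERTERS[(src, dst)](payload)
    except KeyError:
        raise ValueError(f"unsupported conversion {src}->{dst}")
```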
  • the terminal type supporting the data channel can be an independent application program or
  • The achievable effects include: (1) Real-time interaction: user communication is economical, convenient, usable, and effective. The system uses dedicated channels of 5G and 6G networks and a network-side mixed media mode to transmit multiple service streams simultaneously, realizing gesture communication that is economical, convenient, and rich in experience. Communication no longer relies on wearable devices with special features; traditional gesture recognition that depends on wearable devices is expensive and only suitable for interactions within a limited range. (2) Rich services: the platform side provides comprehensive services and can connect to third-party service components, allowing service expansion; interactive and immersive calls can be provided under the new architecture. (3) Good security: data between the terminal and the network is transmitted through encrypted channels to prevent information leakage. (4) Support for conversion of communication content between different terminal types: the platform side identifies different types of terminals and converts the information flows between them, realizing gesture communication between different terminal types.
  • The specific beneficial effects include at least: 1) When a gesture user using terminal type 1 makes a video call with a non-gesture user (using terminal type 1 or 2), where the call may be established by the gesture user dialing the non-gesture user or by the non-gesture user dialing the gesture user, the gesture user, or the non-gesture user using terminal type 1, can apply for gesture recognition conversion. The gesture user can receive and see the standard gesture stream video, text, original voice, and original video converted from the other party's voice; the non-gesture user can hear and see the voice, text, and original call video converted from the gesture user's gestures. When the non-gesture user uses terminal type 1, the non-gesture user receives the voice stream, text stream, and original video stream; when the non-gesture user uses terminal type 2, what the non-gesture user receives is voice 2) When the gesture user uses terminal type 2 and the non-gesture user (using terminal
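As we read case 1) above, the streams delivered depend on the recipient's role and terminal type. The mapping below records only the combinations the text states explicitly; the terminal-type-2 entry is cut off in the source, so only the voice stream is listed for it and it is marked partial.

```python
# Sketch of the stream sets delivered in case 1), as stated in the text.
# Key: (recipient role, recipient terminal type) -> streams received.

DELIVERED = {
    ("gesture_user", 1): {"gesture_stream_video", "text", "original_voice",
                          "original_video"},
    ("non_gesture_user", 1): {"voice_stream", "text_stream",
                              "original_video_stream"},
    ("non_gesture_user", 2): {"voice_stream"},  # source text truncated here
}

def streams_for(role: str, terminal_type: int) -> set:
    """Look up the stream set a recipient receives in case 1)."""
    return DELIVERED[(role, terminal_type)]
```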
  • The emergence of fifth-generation communication technology provides users with mobile networks of higher bandwidth, lower latency, and wider coverage, and can support more applications such as webcasting, virtual reality, and 4K video.
  • 5G technology will face five main application scenarios: 1) ultra-high-speed scenarios, providing ultra-fast data network access for future mobile broadband users; 2) support for large-scale crowds, providing a high-quality mobile broadband experience in areas or occasions with high population density; 3) the best experience anytime and anywhere, ensuring that users can still enjoy high-quality services while mobile; 4) ultra-reliable real-time connections, ensuring that new applications and use cases meet strict standards for delay and reliability; 5) ubiquitous machine-type communication, ensuring efficient handling of a large number of diverse device communications, including machine-type devices and sensors.
  • 3GPP Third Generation Partnership Project
  • IMS IP Multimedia Subsystem, Internet Protocol Multimedia Subsystem
  • Data Channel Data Channel
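Where the flows above create a dedicated data channel, the channel is negotiated via SDP in the Invite. The patent does not specify the SDP contents; the sketch below is a hypothetical offer modeled on SDP negotiation for SCTP data channels (RFC 8841 and RFC 8864), with a placeholder origin address and an assumed stream label.

```python
# Hypothetical sketch of the dedicated data-channel SDP that a terminal's
# Invite might carry. Attribute names follow RFC 8841 (sctp-port) and
# RFC 8864 (dcmap); the patent itself leaves the SDP unspecified.

def build_datachannel_sdp(port: int = 10000) -> str:
    lines = [
        "v=0",
        "o=- 0 0 IN IP4 198.51.100.1",   # placeholder origin address
        "s=-",
        "t=0 0",
        f"m=application {port} UDP/DTLS/SCTP webrtc-datachannel",
        "a=sctp-port:5000",
        'a=dcmap:1 label="gesture-recognition"',  # assumed stream label
    ]
    return "\r\n".join(lines) + "\r\n"
```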
  • A system and method for realizing gesture communication using a dedicated data channel and a mixed media approach are provided, applicable to 5G and 6G networks. The following problems of gesture recognition or gesture translation in the related art can be avoided: 1) most existing collection functions are provided by specific wearable devices on the terminal side; these devices are expensive, only suitable for interactions within a limited range, and subject to time and space constraints, so they are not economical, convenient, or usable, and do not allow direct, natural interaction and communication; 2) some system functions such as gesture recognition, translation, and synthesis are provided by the terminal side, which places high requirements on the terminal; since gesture recognition, translation, and synthesis are not provided by the network side, information updates are not timely; 3) conversion between different terminal types cannot be realized; 4) some technologies require both communication parties to be in a video call to realize gesture communication, with the platform side packaging the gesture content and sending it back to the terminal, which forwards it to the terminal on the other side; they cannot realize gesture communication during a voice call.
  • The terminal can turn on the camera through the "gesture recognition application" on the terminal side. During a call, the terminal can open a menu containing the gesture recognition function and initiate a gesture recognition request; the terminal receives the video, gesture, and text information sent over the data channel, and these contents are displayed synchronously on the local mobile phone.
  • FIG. 9 is a structural block diagram of a gesture communication device according to an embodiment of the present disclosure. As shown in FIG. 9 , the device includes:
  • The first acquiring module 902 is configured to acquire a first request sent by the first terminal or the second terminal when the first terminal and the second terminal make a video call or an audio call, where the first request is used to request recognition of gestures collected by the first terminal during the video call or audio call;
  • the first creation module 904 is configured to create a gesture recognition service in response to the first request, where the gesture recognition service is used to recognize the gesture collected by the first terminal;
  • the second obtaining module 906 is configured to obtain a group of gestures identified in a group of video frames collected by the first terminal during the video call or audio call;
  • the recognition module 908 is configured to perform semantic recognition on a group of gestures recognized in a group of video frames collected by the first terminal through the gesture recognition service, and obtain the target semantics represented by the group of gestures;
  • the first sending module 910 is configured to send the target semantics to the second terminal.
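A minimal sketch of how modules 902-910 chain together is given below. The dictionary recognizer is a stand-in for the network-side gesture recognition service; the class and method names are ours, not the patent's.

```python
# Minimal sketch of the module pipeline in Fig. 9 (modules 902-910).
# A dict maps each recognized gesture to its semantics, standing in for
# the real network-side recognition service.

class GestureCommunicationDevice:
    def __init__(self, recognizer):
        self.recognizer = recognizer      # stands in for module 908
        self.service_created = False

    def create_service(self, request):    # modules 902 / 904
        if request == "gesture-recognition":
            self.service_created = True

    def process(self, gestures):          # modules 906 / 908
        assert self.service_created, "service must be created first"
        return " ".join(self.recognizer[g] for g in gestures)

    def send(self, semantics, outbox):    # module 910
        outbox.append(semantics)

recognizer = {"wave": "hello", "point_self": "I", "thumbs_up": "agree"}
dev = GestureCommunicationDevice(recognizer)
dev.create_service("gesture-recognition")
outbox = []
dev.send(dev.process(["wave", "point_self", "thumbs_up"]), outbox)
```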
  • The above-mentioned device further includes: a third obtaining module 1002 and a second creating module 1004, as shown in FIG. 10, which is a preferred structural block diagram of a gesture communication device according to an embodiment of the present disclosure.
  • the third obtaining module 1002 is configured to obtain a second request sent by the first terminal or the second terminal, wherein the second request is used to request to create a target data channel;
  • The second creating module 1004 is configured to create the target data channel in response to the second request, where the target data channel is a channel that the first terminal or the second terminal is allowed to use;
  • The first obtaining module 902 includes: a first acquiring unit configured to acquire the first request transmitted by the first terminal or the second terminal on the target data channel.
  • The third obtaining module 1002 includes: a second obtaining unit configured to obtain the second request sent by the first terminal or the second terminal to the media server via the access control entity SBC/P-CSCF, the session control entity I/S-CSCF, and the service control node;
  • The above-mentioned second creation module 1004 includes: a first creation unit configured to create the target data channel through the media server in response to the second request, where the target data channel is used to transmit data between the first terminal or the second terminal and the media server.
  • The above-mentioned first acquiring unit includes: a first acquiring subunit configured to acquire the first request transmitted by the first terminal or the second terminal to the application control node on the target data channel;
  • The first creation module 904 includes: a first processing unit configured to send a first instruction to the service control node through the application control node, where the first instruction is used to instruct the service control node to send a second instruction to the media server, the second instruction being used to instruct the media server to create the gesture recognition service; and a second creation unit configured to, in response to the second instruction, create the gesture recognition service through the media server, or instruct a third-party service component through the media server to create the gesture recognition service.
  • The above-mentioned device further includes: a second sending module 1102 and a third creating module 1104, as shown in FIG. 11, which is a preferred structural block diagram of a gesture communication device according to an embodiment of the present disclosure.
  • The second sending module 1102 is configured to send a third instruction to the media server through the service control node, where the third instruction is used to request creation of a mixed media service; the mixed media service is used to process the video stream, audio stream, and data stream in the video call, or to process the audio stream and data stream in the audio call, the data stream being a data stream representing the target semantics;
  • The third creation module 1104 is configured to create the mixed media service through the media server in response to the third instruction, or to instruct a third-party service component through the media server to create the mixed media service.
  • The recognition module 908 includes: a first recognition unit configured to perform, through the gesture recognition service, semantic recognition on the group of gestures recognized in a group of video frames collected by the first terminal to obtain one or more semantics, where each of the semantics is expressed by one or more gestures in the group of gestures; and a generation unit configured to generate, based on the one or more semantics, the target semantics corresponding to the group of gestures.
  • The above-mentioned first sending module 910 includes: a first sending unit configured to send each of the semantics included in the target semantics to the second terminal synchronously with the corresponding video frames in the group of video frames; or a synthesis unit configured to, when the target semantics is represented by a data stream corresponding to the group of video frames and the data stream is a text stream and an audio stream, synchronously synthesize the text stream with the corresponding video frames in the group of video frames to obtain a target video stream; and a second sending unit configured to send the target video stream and the audio stream to the second terminal synchronously.
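The synthesis unit's pairing of the text stream with the corresponding video frames can be sketched as timestamp matching. The timestamp and segment formats below are assumptions for illustration, not from the patent.

```python
# Illustrative synchronization of a text stream with video frames:
# each frame carries the text segment active at its timestamp, so the
# text can be rendered into the target video stream.

def synthesize(frames, text_segments):
    """frames: list of (timestamp, frame_id);
    text_segments: list of (start, end, text).
    Returns a list of (frame_id, caption-or-None)."""
    out = []
    for ts, fid in frames:
        caption = None
        for start, end, text in text_segments:
            if start <= ts < end:
                caption = text
                break
        out.append((fid, caption))
    return out

result = synthesize(
    [(0.0, "f0"), (0.5, "f1"), (1.2, "f2")],
    [(0.0, 1.0, "hello")],
)
```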
  • The above device further includes: a fourth acquisition module configured to, when the first terminal and the second terminal make the video call and both terminals support use of the target data channel, obtain the second request sent by the first terminal, where the second request is used to request creation of the target data channel; and a fourth creation module configured to create the target data channel in response to the second request, where the target data channel includes a first target data channel and a second target data channel, the first target data channel being the data channel between the first terminal and the media server and the second target data channel being the data channel between the second terminal and the media server. The first acquiring module 902 includes: a third acquiring unit configured to acquire the first request transmitted on the first target data channel. The above-mentioned first creation module 904 includes: a second processing unit configured to send a target instruction to the media server through a service control node in response to the first request, where the target instruction is used to request creation of a mixed media service and the gesture recognition service;
  • the above-mentioned device further includes: a first processing module configured to form the first group of video frames through the mixed media service The first video stream, the first audio stream formed by the first group of audio frames, and the first data stream used to represent the target semantics are synchronized to obtain the synchronized first video stream, the first audio stream and the first data stream;
  • the first sending module 910 includes: a third sending unit configured to send the synchronized first video stream, the first audio stream and the first data stream to The second terminal, wherein the synchronized first data stream is sent on the second target data channel.
  • The above device further includes: a fifth acquisition module configured to, when the first terminal and the second terminal make the video call, the first terminal supports use of the target data channel, and the second terminal does not support use of the target data channel, obtain a second request sent by the first terminal, where the second request is used to request creation of a target data channel;
  • The fifth creating module is configured to create the target data channel in response to the second request, where the target data channel is a data channel between the first terminal and the media server;
  • The first obtaining module 902 includes: a fifth obtaining unit configured to obtain the first request transmitted by the first terminal on the target data channel;
  • The above-mentioned first creation module 904 includes: a third processing unit configured to send a target instruction to the media server through a service control node in response to the first request, where the target instruction is used to request creation of a mixed media service, a synthesis service, and the gesture recognition service; the mixed media service is used for the video stream, audio stream, and
  • the second audio stream included in the data stream used to represent the target semantics is synchronized with the second video stream to obtain the synchronized second video stream and second audio stream, where the data stream includes the first text stream;
  • The first sending module 910 includes: a fourth sending unit configured to send the synchronized second video stream and second audio stream to the second terminal.
  • The above device further includes: a sixth acquisition module configured to, when the first terminal and the second terminal make the video call, the first terminal does not support use of the target data channel, and the second terminal supports use of the target data channel, obtain a second request sent by the second terminal, where the second request is used to request creation of a target data channel; and a sixth creation module configured to create the target data channel in response to the second request, where the target data channel is a data channel between the second terminal and the media server. The above-mentioned first obtaining module 902 includes: a seventh obtaining unit configured to obtain the first request transmitted by the second terminal on the target data channel. The above-mentioned first creation module 904 includes: a fourth processing unit configured to send a target instruction to the media server through a service control node in response to the first request, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, and the mixed media service is used for the video stream and audio stream in the video call and
  • The above device further includes: a seventh acquisition module configured to, when the first terminal and the second terminal make the audio call and both terminals support use of the target data channel, obtain the second request sent by the first terminal, where the second request is used to request creation of the target data channel; and a seventh creation module configured to create the target data channel in response to the second request, where the target data channel includes a first target data channel and a second target data channel, the first target data channel being the data channel between the first terminal and the media server and the second target data channel being the data channel between the second terminal and the media server. The first acquisition module 902 includes: a ninth acquisition unit configured to acquire the first request transmitted on the first target data channel. The above-mentioned first creation module 904 includes: a fifth processing unit configured to send a target instruction to the media server through a service control node in response to the first request, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, and
  • The above device further includes: an eighth acquisition module configured to, when the first terminal and the second terminal make the audio call, the first terminal supports use of the target data channel, and the second terminal does not support use of the target data channel, obtain a second request sent by the first terminal, where the second request is used to request creation of a target data channel; and an eighth creating module configured to create the target data channel in response to the second request, where the target data channel is a data channel between the first terminal and the media server;
  • The first obtaining module 902 includes: an eleventh acquisition unit configured to acquire the first request transmitted by the first terminal on the target data channel;
  • The first creation module 904 includes: a sixth processing unit configured to send a target instruction to the media server through the service control node in response to the first request, where the target instruction is used to request creation of the gesture recognition service; and a seventh creation unit configured to create the gesture recognition service through the media server, or to instruct the third-party service component through the media server to create the gesture recognition service;
  • The above-mentioned first sending module 910 includes: a seventh sending unit configured to send the fifth audio stream representing the target semantics to the second terminal.
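For the audio-call case just described, the recognized target semantics ends up as an audio stream (the "fifth audio stream") delivered to the second terminal. The sketch below uses a placeholder for the network-side text-to-speech step; no real TTS API is implied.

```python
# Sketch of the audio-call path: recognized target semantics is rendered
# as an audio stream for the second terminal. The encoding is a
# placeholder standing in for network-side text-to-speech.

def semantics_to_audio_stream(semantics: str) -> bytes:
    """Stand-in for text-to-speech: tag the recognized text as an
    audio payload."""
    return b"AUDIO:" + semantics.encode("utf-8")

def deliver_audio_call(gesture_semantics: str, send):
    """Send the synthesized audio stream via the provided transport."""
    send(semantics_to_audio_stream(gesture_semantics))

captured = []
deliver_audio_call("hello", captured.append)
```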
  • The above-mentioned modules can be implemented in software or hardware. In the latter case, this may be achieved in the following manner, although implementations are not limited to it: the above-mentioned modules are all located in the same processor, or the above-mentioned modules are distributed, in any combination, across different processors.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when running.
  • The above-mentioned computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disk, or other media that can store a computer program.
  • ROM read-only memory
  • RAM random access memory
  • Embodiments of the present disclosure also provide an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the electronic device may further include a transmission device and an input and output device, wherein the transmission device is connected to the processor, and the input and output device is connected to the processor.
  • Each module or step of the present disclosure described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network composed of multiple computing devices. They can be implemented in program code executable by a computing device, and thus can be stored in a storage device to be executed by a computing device; in some cases, the steps shown or described can be executed in a different order, or they can be fabricated as individual integrated circuit modules, or multiple modules or steps among them can be fabricated as a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present disclosure provide a gesture-based communication method and apparatus, a storage medium, and an electronic apparatus. The method comprises: when a first terminal and a second terminal make a video call or an audio call, acquiring a first request sent by the first terminal or the second terminal, the first request being used for requesting creation of a gesture recognition service; in response to the first request, creating the gesture recognition service; acquiring a group of gestures recognized from a group of video frames collected by the first terminal; performing, by means of the gesture recognition service, semantic recognition on the group of gestures recognized from the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures; and sending the target semantics to the second terminal.
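The five steps of the abstract can be compressed into one illustrative function. The request string, the recognizer mapping, and the inbox are stand-ins, since the patent performs these steps on network-side nodes rather than in terminal code.

```python
# Compressed sketch of the five method steps in the abstract:
# acquire request -> create service -> acquire gestures ->
# recognize semantics -> send to second terminal.

def gesture_call(first_request, gestures, recognizer, second_terminal_inbox):
    # Steps 1-2: acquire the first request and create the recognition service
    service = recognizer if first_request == "create-gesture-recognition" else None
    if service is None:
        raise ValueError("no gesture recognition service requested")
    # Steps 3-4: take the recognized gestures and perform semantic recognition
    target_semantics = " ".join(service[g] for g in gestures)
    # Step 5: send the target semantics to the second terminal
    second_terminal_inbox.append(target_semantics)
    return target_semantics
```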

Description

Gesture communication method, device, storage medium and electronic device

Technical Field
The present disclosure relates to the communication field, and in particular, to a gesture communication method, device, storage medium and electronic device.
Background Art
Gestures are often used in daily life. Gesture users, such as deaf-mute people, face great obstacles when communicating with hearing people. Their gestures, as a communication language (sign language), are extremely difficult to understand, and non-professionals can hardly recognize a deaf-mute person's gestures accurately: when deaf-mute users dial various public service numbers (119, 110, 120, etc.), public service personnel cannot directly understand what the deaf-mute users want to express; when deaf-mute users participate in online teaching, they cannot interact with teachers in real time in a simple way; and deaf-mute users cannot communicate directly and normally with hearing users on the phone. This requires recognition and translation of the gestures (sign language) of deaf-mute people, and transmission of the resulting communication. Gesture users in certain specific application scenarios, such as military sign language or sign language dedicated to special industries, also need corresponding recognition and translation.
At present, however, most gesture recognition relies on specific equipment such as wearable gloves. These devices are expensive, only suitable for interaction within a limited range, and often subject to time and space constraints, so they do not provide direct, natural interaction and communication. Some vision-based gesture recognition relies on specific collectors, such as somatosensory sensors, to collect and analyze gesture data for basic phone calls; it depends on terminal equipment and places high processing requirements on the terminal, is not economical or convenient, does not update information and data in a timely manner, and offers a poor communication experience.
No effective solution has yet been proposed for the technical problem in the related art that gesture communication mainly depends on specific devices and therefore incurs high cost.
Summary of the Invention
Embodiments of the present disclosure provide a gesture communication method, device, storage medium, and electronic device, so as to at least solve the technical problem in the related art that gesture communication mainly depends on specific devices, resulting in high cost.
According to one aspect of the embodiments of the present disclosure, a gesture communication method is provided, including: when a first terminal and a second terminal make a video call or an audio call, acquiring a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures recognized in video frames collected by the first terminal; creating the gesture recognition service in response to the first request; during the video call or audio call, acquiring a group of gestures recognized in a group of video frames collected by the first terminal; performing, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures; and sending the target semantics to the second terminal.
According to another aspect of the embodiments of the present disclosure, a gesture communication apparatus is further provided, including: a first acquisition module, configured to acquire, when a first terminal and a second terminal are conducting a video call or an audio call, a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures recognized in video frames collected by the first terminal; a first creation module, configured to create the gesture recognition service in response to the first request; a second acquisition module, configured to acquire, during the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal; a recognition module, configured to perform, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures; and a first sending module, configured to send the target semantics to the second terminal.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is further provided, in which a computer program is stored, where the computer program, when executed by a processor, implements the steps in any one of the above method embodiments.
According to yet another aspect of the embodiments of the present disclosure, an electronic apparatus is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes, through the computer program, the steps in any one of the above method embodiments.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present disclosure and constitute a part of the present application; the exemplary embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation on it. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal for a gesture communication method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a gesture communication method according to an embodiment of the present disclosure;
FIG. 3 is a diagram of the system structure and media paths of a gesture communication system according to a specific embodiment of the present disclosure;
FIG. 4 is a first example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
FIG. 5 is a second example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
FIG. 6 is a third example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
FIG. 7 is a fourth example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
FIG. 8 is a fifth example diagram of a gesture communication method according to a specific embodiment of the present disclosure;
FIG. 9 is a structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure;
FIG. 10 is a first preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure;
FIG. 11 is a second preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the expressly listed steps or units, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The method embodiments provided in the embodiments of this application can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking running on a mobile terminal as an example, FIG. 1 is a block diagram of the hardware structure of a mobile terminal for a gesture communication method according to an embodiment of the present disclosure. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data. In an exemplary embodiment, the mobile terminal may further include a transmission device 106 configured for communication and an input/output device 108. A person of ordinary skill in the art can understand that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 may be configured to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the gesture communication method in the embodiments of the present disclosure. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely relative to the processor 102, and such remote memory may be connected to the mobile terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is configured to receive or send data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is configured to communicate with the Internet wirelessly.
This embodiment provides a gesture communication method. FIG. 2 is a flowchart of a gesture communication method according to an embodiment of the present disclosure. As shown in FIG. 2, the process includes the following steps:
Step S2002: when a first terminal and a second terminal are conducting a video call or an audio call, acquire a first request sent by the first terminal or the second terminal, where the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures recognized in video frames collected by the first terminal.
Step S2004: create the gesture recognition service in response to the first request.
Step S2006: during the video call or audio call, acquire a group of gestures recognized in a group of video frames collected by the first terminal.
Step S2008: through the gesture recognition service, perform semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain target semantics represented by the group of gestures.
Step S2010: send the target semantics to the second terminal.
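Steps S2002 to S2010 can be sketched in code. This is an illustrative simplification only: the class, its methods, and the gesture-to-meaning lookup table are hypothetical stand-ins (a real deployment would use a trained recognition model on the network side), not part of the disclosure.

```python
class NetworkSide:
    """Sketch of steps S2002-S2010 as performed by the network side."""

    def __init__(self):
        self.service_created = False
        # Hypothetical gesture-to-meaning lookup standing in for a real model.
        self.lexicon = {"wave": "hello", "thumbs_up": "yes", "flat_palm": "stop"}

    def handle_first_request(self, request):
        # S2002/S2004: a request from either terminal creates the service.
        if request.get("type") == "create_gesture_recognition":
            self.service_created = True

    def recognize(self, gestures):
        # S2006/S2008: map each gesture recognized in the video frames
        # to the semantics it expresses.
        assert self.service_created, "gesture recognition service not created"
        return [self.lexicon.get(g, "?") for g in gestures]

    def send_to_peer(self, semantics):
        # S2010: deliver the target semantics to the second terminal.
        return " ".join(semantics)

net = NetworkSide()
net.handle_first_request({"type": "create_gesture_recognition"})
msg = net.send_to_peer(net.recognize(["wave", "thumbs_up"]))
```

The point of the sketch is the division of labor: recognition state lives on the network side, so neither terminal needs special hardware.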
Through the above steps, a communication terminal can request a network-side device to create a gesture recognition service during a video call or audio call, and the gesture recognition service created by the network-side device can perform semantic recognition on the gestures recognized in the video frames collected by the communication terminal, without the terminal having to complete gesture semantic recognition through a specific device attached to it. This solves the technical problem in the related art that gesture communication mainly depends on specific devices, resulting in high cost, achieves the technical effect of reducing the cost of gesture communication, and further improves the user experience.
The above steps may be executed by the network side or by a network-side device, for example, a network device including a service control node, an application control node, and a media server, or another network device having the functions of a service control node, an application control node, and a media server. The above steps may also be executed by other processing devices or processing units with similar processing capabilities, but are not limited thereto. The following description takes the network side executing the above operations as an example (this is merely illustrative; in actual operation, other devices or modules may also execute the above operations):
In the above embodiment, when the first terminal and the second terminal are conducting a video call or an audio call, the network side acquires a first request sent by the first terminal or the second terminal. The first request is used to request creation of a gesture recognition service for recognizing gestures collected by the first terminal during the video call or audio call; specifically, it requests recognition of a group of gestures identified in a group of video frames collected by the first terminal. Of course, in practical applications, if it is the second terminal that communicates using gestures, the first request may be used to request recognition of the gestures collected by the second terminal. After receiving the first request, the network side creates the gesture recognition service, which is used to recognize the above gestures. During the video or audio call, the network side acquires a group of gestures identified in a group of video frames collected by the first terminal; in practice, it may acquire the video frame images collected by the first terminal, identify a group of gestures from the frame images, perform semantic recognition on that group of gestures through the gesture recognition service created above to obtain the target semantics represented by the group of gestures, and then send the target semantics to the second terminal.
By recognizing the gestures identified in the video frame images collected from the first terminal to obtain the target semantics they represent, and sending the target semantics to the second terminal, gesture communication within a video or audio call is achieved. This avoids the problems in the related art of having to rely on specific devices, or of gesture communication being possible only within a video call; solves the problems of high cost and poor experience caused by that reliance; and achieves the effects of broadening the application range of gesture communication and improving the user experience.
In an optional embodiment, the method further includes: acquiring a second request sent by the first terminal or the second terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel is a channel that the first terminal or the second terminal is allowed to use. The acquiring of the first request sent by the first terminal or the second terminal includes: acquiring the first request transmitted by the first terminal or the second terminal on the target data channel. In this embodiment, during the video call or audio call between the first terminal and the second terminal, a second request sent by the first terminal or the second terminal may be acquired so as to create the target data channel. In practical applications, the second request is usually initiated by a terminal that supports use of the target data channel; at least one of the first terminal and the second terminal supports use of the target data channel, and both terminals may support it. The above first request is transmitted by the first terminal or the second terminal through the target data channel. This embodiment achieves the purposes of creating a data channel and of transmitting the first request through that data channel.
In an optional embodiment, the acquiring of the second request sent by the first terminal or the second terminal includes: acquiring the second request sent by the first terminal or the second terminal to a media server through an access control entity (SBC/P-CSCF), a session control entity (I/S-CSCF), and a service control node. The creating of the target data channel in response to the second request includes: creating the target data channel through the media server in response to the second request, where the target data channel is used to transmit data between the first terminal or the second terminal and the media server. In this embodiment, the second request is sent by the first terminal or the second terminal to the media server through the access control entity SBC/P-CSCF, the session control entity I/S-CSCF, and the service control node, and, in response to the second request, the target data channel is created through the media server and is used to transmit data between the first terminal or the second terminal and the media server. This embodiment achieves the purpose of establishing a dedicated data channel between a terminal and the media server.
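The signaling path above can be sketched as a simple routing chain. This is a hedged illustration: the hop names are taken from the disclosure, but the function, the request shape, and the channel record are invented here for clarity and do not reflect actual SIP/SDP message formats.

```python
def route_second_request(request):
    """Route the second request hop by hop to the media server, which
    then creates the target data channel (illustrative only)."""
    path = ["SBC/P-CSCF", "I/S-CSCF", "service control node", "media server"]
    # The media server, as the final hop, creates the dedicated data channel
    # between the requesting terminal and itself.
    channel = {
        "endpoint_a": request["terminal"],
        "endpoint_b": "media server",
        "purpose": "gesture data",
    }
    return path, channel

path, channel = route_second_request({"terminal": "first terminal"})
```

The design point is that the terminal never talks to the media server directly; every hop in the chain can apply its own admission and session control before the channel is created.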
In an optional embodiment, the acquiring of the first request transmitted by the first terminal or the second terminal on the target data channel includes: acquiring the first request transmitted by the first terminal or the second terminal to an application control node on the target data channel. The creating of the gesture recognition service in response to the first request includes: issuing, by the application control node, a first instruction to the service control node, where the first instruction is used to instruct the service control node to issue a second instruction to the media server, and the second instruction is used to instruct the media server to create the gesture recognition service; and, in response to the second instruction, creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service.
In this embodiment, acquiring the first request means acquiring the first request transmitted by the first terminal or the second terminal to the application control node on the target data channel. In response to the first request, the application control node issues a first instruction to instruct the service control node to issue a second instruction to the media server, the second instruction instructing the media server to create the gesture recognition service; in response to the second instruction, the gesture recognition service is created through the media server, or the media server instructs a third-party service component to create it. This embodiment achieves the purpose of creating the gesture recognition service.
In an optional embodiment, the method further includes: sending a third instruction to the media server through the service control node, where the third instruction is used to request creation of a mixed media service, the mixed media service is used to process the video stream, audio stream, and data stream in the video call, or to process the audio stream and data stream in the audio call, and the data stream is a data stream representing the target semantics; and, in response to the third instruction, creating the mixed media service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service. In this embodiment, the service control node may request the media server to create the mixed media service, which is then created by the media server, or the media server may instruct a third-party service component to create it. This embodiment achieves the purpose of creating a mixed media service and prepares for the processing of the related audio/video streams and data streams in the subsequent gesture communication process.
In an optional embodiment, the performing, through the gesture recognition service, of semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain the target semantics represented by the group of gestures includes: performing, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal to obtain one or more semantics, where each of the semantics is the semantics expressed by one or more gestures in the group; and generating, based on the one or more semantics, the target semantics corresponding to the group of gestures. In this embodiment, through the gesture recognition service, semantic recognition is performed on a group of gestures identified in the video frame images collected by the first terminal to obtain one or more semantics, and then, based on the one or more semantics, the complete target semantics corresponding to the group of gestures is generated. This embodiment achieves the purpose of converting gestures obtained from a terminal that communicates by gestures into target semantics.
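The aggregation step described above can be sketched minimally. The join rule (ordered concatenation with spaces) and the handling of unrecognized gestures are assumptions made for illustration; the disclosure only requires that the target semantics be generated from the one or more per-gesture semantics.

```python
def build_target_semantics(per_gesture_semantics):
    """Assemble the complete target semantics from the ordered list of
    semantics recognized for the individual gestures (sketch only)."""
    # Empty entries (gestures with no recognized meaning) are skipped.
    return " ".join(s for s in per_gesture_semantics if s)

target = build_target_semantics(["nice", "to", "meet", "you"])
```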
In an optional embodiment, the sending of the target semantics to the second terminal includes: when the target semantics is the semantics formed by concatenating the one or more semantics, sending each semantics included in the target semantics to the second terminal in synchronization with the corresponding video frame in the group of video frames; or, when the target semantics is represented by a data stream corresponding to the group of video frames and the data stream is a text stream and an audio stream, synchronously compositing the text stream with the corresponding video frames in the group of video frames to obtain a target video stream, and sending the target video stream to the second terminal in synchronization with the audio stream.
In this embodiment, each semantics included in the target semantics is sent to the second terminal in synchronization with the corresponding video frame in the group of video frames. For example, when the second terminal also supports use of the target data channel, the data stream representing the target semantics may be sent to the second terminal through the target data channel in synchronization with the video stream formed by the video frames; or, when the second terminal does not support use of the target data channel, the text stream included in the data stream representing the target semantics is synchronously composited with the video frames to obtain a target video stream, which is then sent to the second terminal in synchronization with the audio stream. Thus, when the second terminal supports the target data channel, the data stream is transmitted through the target data channel and sent in synchronization with the video stream; when the second terminal does not support use of the target data channel, the text stream included in the data stream is composited with the video frames and then sent in synchronization with the audio stream.
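The two delivery paths described above can be sketched as a single branch on the peer's capability. The frame and stream structures here are simplified string stand-ins, not actual media formats, and the `caption` compositing is only a schematic of burning text into video frames.

```python
def deliver(target_text, video_frames, audio_frames, peer_supports_data_channel):
    """Choose between data-channel delivery and text-on-video compositing
    (illustrative sketch of the two paths in the embodiment above)."""
    if peer_supports_data_channel:
        # Path 1: send the data stream on the data channel, one entry per
        # video frame so the two streams stay synchronized.
        return {"video": video_frames, "audio": audio_frames,
                "data_channel": [target_text] * len(video_frames)}
    # Path 2: composite the text stream into each video frame
    # (subtitle-style) and send only synchronized video and audio.
    composited = [f"{frame}+caption:{target_text}" for frame in video_frames]
    return {"video": composited, "audio": audio_frames}

out = deliver("hello", ["f1", "f2"], ["a1", "a2"], peer_supports_data_channel=False)
```

The compositing path is what lets a legacy terminal with no data-channel support still receive the recognized semantics as on-screen text.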
In an optional embodiment, the method further includes: when the first terminal and the second terminal are conducting the video call and both the first terminal and the second terminal support use of a target data channel, acquiring a second request sent by the first terminal, where the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, where the target data channel includes a first target data channel and a second target data channel, the first target data channel being a data channel between the first terminal and a media server, and the second target data channel being a data channel between the second terminal and the media server. The acquiring of the first request sent by the first terminal or the second terminal includes: acquiring the first request transmitted by the first terminal on the first target data channel. The creating of the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through the service control node, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service and the gesture recognition service. The acquiring, during the video call or audio call, of the group of gestures recognized in the group of video frames collected by the first terminal includes: during the video call, acquiring a first group of video frames and a corresponding first group of audio frames collected by the first terminal, and a first group of gestures recognized in the first group of video frames. After the target semantics is obtained, the method further includes: performing, through the mixed media service, synchronization processing on a first video stream formed by the first group of video frames, a first audio stream formed by the first group of audio frames, and a first data stream representing the target semantics, to obtain the synchronized first video stream, first audio stream, and first data stream. The sending of the target semantics to the second terminal includes: sending the synchronized first video stream, first audio stream, and first data stream to the second terminal, where the synchronized first data stream is sent on the second target data channel.
In this embodiment, when both the first terminal and the second terminal support use of the target data channel, after the gesture recognition service is created, semantic recognition is performed on the group of gestures identified in the acquired first group of video frame images collected by the first terminal to obtain the target semantics. The first data stream representing the target semantics may include a text stream and a voice stream, that is, the gestures are converted into voice or text. After the semantics is recognized, the first video stream, the first audio stream, and the first data stream are synchronized through the mixed media service and the gesture recognition service provided by the media server and then sent to the second terminal, with the first data stream sent to the second terminal through the second target data channel (also called a dedicated data channel). In this embodiment, the second terminal communicates in a non-gesture manner, that is, by normal video or voice: the voice frames of the second terminal are converted into a gesture stream and a target text stream through the media server and/or a third-party service component, and the gesture stream and target text stream are sent to the first terminal through the first target data channel (also called a dedicated data channel) in synchronization with the video frames and audio frames collected by the second terminal. Through this embodiment, when both the first terminal and the second terminal support use of the target data channel, one end can communicate interactively using gestures, and the gestures are converted into a data stream and then sent through the target data channel.
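The synchronization performed by the mixed media service can be sketched as timestamp alignment across the three streams. This is an illustrative simplification under an assumed `(timestamp, payload)` stream representation; real media synchronization (e.g., RTP timestamps and RTCP sender reports) is considerably more involved.

```python
def synchronize(video, audio, data):
    """Align the video, audio, and data streams on common timestamps
    (sketch of the mixed media service's synchronization processing)."""
    # Each stream is a list of (timestamp, payload) pairs; keep only the
    # timestamps present in all three streams so the emitted payloads
    # are frame-aligned triples.
    common = (set(t for t, _ in video)
              & set(t for t, _ in audio)
              & set(t for t, _ in data))
    pick = lambda stream: [p for t, p in sorted(stream) if t in common]
    return pick(video), pick(audio), pick(data)

v, a, d = synchronize(
    [(1, "v1"), (2, "v2")],
    [(1, "a1"), (2, "a2")],
    [(2, "d2")],          # the data stream may be sparser than the media
)
```

After alignment, the video and audio streams go out on the ordinary media paths while the data stream is carried on the receiving terminal's dedicated data channel.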
In an optional embodiment, the method further includes: in a case where the first terminal and the second terminal conduct the video call, the first terminal supports the use of a target data channel, and the second terminal does not, obtaining a second request sent by the first terminal, where the second request is used to request creation of a target data channel; and, in response to the second request, creating the target data channel, where the target data channel is a data channel between the first terminal and a media server. Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the first terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed media service, a synthesis service and the gesture recognition service, the mixed media service is used to process the video stream, audio stream and data stream in the video call, and the data stream represents the target semantics; and creating the mixed media service, the synthesis service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create them. Obtaining, in the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal includes: in the video call, obtaining a second group of video frames and a corresponding second group of audio frames collected by the first terminal, and a second group of gestures recognized in the second group of video frames. After the target semantics are obtained, the method further includes: synthesizing, by the synthesis service, a first text stream representing the target semantics with the video stream formed by the second group of video frames to obtain a second video stream; and synchronizing, by the mixed media service, a second audio stream included in the data stream representing the target semantics with the second video stream, to obtain the synchronized second video stream and second audio stream, where the data stream includes the first text stream. Sending the target semantics to the second terminal includes: sending the synchronized second video stream and second audio stream to the second terminal.

In this embodiment, when the first terminal supports the use of the target data channel and the second terminal does not, after the mixed media service, the synthesis service and the gesture recognition service are created through the media server, semantic recognition is performed on the group of gestures recognized in the second group of video frames collected by the first terminal, to obtain the target semantics. The data stream representing the target semantics may include a first text stream and a voice stream; that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the synthesis service provided by the media server synthesizes the first text stream with the video stream formed by the second group of video frames to obtain the second video stream, and the mixed media service then synchronizes the second audio stream included in the data stream with the second video stream; the synchronized second video stream and second audio stream are sent to the second terminal. For the second terminal, non-gesture communication is used, that is, normal video or voice communication; the voice frames of the second terminal are converted by the media server and/or a third-party service component into a gesture stream and a target text stream, which are sent to the first terminal over the target data channel (also called a dedicated data channel), synchronized with the video frames and audio frames collected by the second terminal. Through this embodiment, when only the first terminal supports the target data channel, interactive communication in which one end uses gestures is achieved, and the gestures are converted into a text stream that is synthesized with the video stream and then sent synchronously with the audio stream.
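The synthesis step described above (burning the recognized-gesture text stream into the video for a terminal without a data channel) might look roughly as follows. The frame/cue representation is an invented simplification, not the patent's actual media processing: frames are `(timestamp, label)` tuples and "compositing" just attaches the latest subtitle text.

```python
def synthesize(video_frames, text_by_ts):
    """video_frames: list of (ts_ms, frame_label) in ascending ts order.
    text_by_ts: dict mapping cue timestamp -> subtitle text.
    Returns composited frames (ts, frame, text), where text is the most
    recent cue at or before each frame's timestamp."""
    out, current = [], ""
    cues = sorted(text_by_ts.items())
    for ts, frame in video_frames:
        for cue_ts, text in cues:
            if cue_ts <= ts:
                current = text  # latest cue wins
        out.append((ts, frame, current))
    return out
```

In a real deployment this would render glyphs onto decoded frames before re-encoding; the sketch only shows the timing relationship between the text stream and the video stream.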
In an optional embodiment, the method further includes: in a case where the first terminal and the second terminal conduct the video call, the first terminal does not support the use of a target data channel, and the second terminal does, obtaining a second request sent by the second terminal, where the second request is used to request creation of a target data channel; and, in response to the second request, creating the target data channel, where the target data channel is a data channel between the second terminal and a media server. Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the second terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process the video stream, audio stream and data stream in the video call, and the data stream represents the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create them. Obtaining, in the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal includes: in the video call, obtaining a third group of video frames and a corresponding third group of audio frames collected by the first terminal, and a third group of gestures recognized in the third group of video frames. After the target semantics are obtained, the method further includes: synchronizing, by the mixed media service, a third video stream formed by the third group of video frames, a third audio stream formed by the third group of audio frames, and a third data stream representing the target semantics, to obtain the synchronized third video stream, third audio stream and third data stream. Sending the target semantics to the second terminal includes: sending the synchronized third video stream, third audio stream and third data stream to the second terminal, where the synchronized third data stream is sent on the target data channel.

In this embodiment, when the first terminal does not support the use of the target data channel and the second terminal does, after the mixed media service and the gesture recognition service are created through the media server, semantic recognition is performed on the group of gestures recognized in the third group of video frames collected by the first terminal, to obtain the target semantics. The third data stream representing the target semantics may include a text stream and a voice stream; that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the mixed media service provided by the media server synchronizes the third video stream, the third audio stream and the third data stream, which are then sent to the second terminal, the third data stream being sent on the target data channel. For the second terminal, non-gesture communication is used, that is, normal video or voice communication; the voice frames of the second terminal are converted by the media server and/or a third-party service component into a gesture stream and a target text stream, and the synthesis service provided by the media server then synthesizes the gesture stream, the target text stream and the video frames collected by the second terminal into a target video stream, which is sent to the first terminal synchronized with the audio frames collected by the second terminal. Through this embodiment, when only the second terminal supports the target data channel, interactive communication in which one end uses gestures is achieved, and the gestures are converted into a text stream and sent over the target data channel.
In an optional embodiment, the method further includes: in a case where the first terminal and the second terminal conduct the audio call and both support the use of a target data channel, obtaining a second request sent by the first terminal, where the second request is used to request creation of a target data channel; and, in response to the second request, creating the target data channel, where the target data channel includes a first target data channel and a second target data channel, the first target data channel being a data channel between the first terminal and a media server, and the second target data channel being a data channel between the second terminal and the media server. Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the first terminal on the first target data channel. Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process the audio stream and data stream in the audio call, and the data stream represents the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create them. Obtaining, in the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal includes: in the audio call, obtaining a fourth group of video frames and a corresponding fourth group of audio frames collected by the first terminal, and a fourth group of gestures recognized in the fourth group of video frames. After the target semantics are obtained, the method further includes: synchronizing, by the mixed media service, a second text stream representing the target semantics with a fourth audio stream formed by the fourth group of audio frames, to obtain the synchronized second text stream and fourth audio stream, where the data stream includes the second text stream. Sending the target semantics to the second terminal includes: sending the synchronized second text stream and fourth audio stream to the second terminal, where the synchronized second text stream is sent on the second target data channel.

In this embodiment, when both the first terminal and the second terminal support the use of the target data channel, after the gesture recognition service is created, semantic recognition is performed on the group of gestures recognized in the fourth group of video frames collected by the first terminal, to obtain the target semantics. The data stream representing the target semantics may include a text stream and a voice stream; that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the mixed media service and the gesture recognition service provided by the media server synchronize the audio stream formed by the fourth group of audio frames with the data stream, which are then sent to the second terminal, the data stream being sent over the second target data channel (also called a dedicated data channel). For the second terminal, non-gesture communication is used, that is, normal voice communication; the voice frames of the second terminal are converted by the media server and/or a third-party service component into a gesture stream and a target text stream, which are sent to the first terminal over the first target data channel (also called a dedicated data channel), synchronized with the video frames and/or audio frames collected by the second terminal. Through this embodiment, when both terminals support the target data channel, interactive communication in which one end uses gestures is achieved, and the gestures are converted into a data stream and sent over the target data channel.
In an optional embodiment, the method further includes: in a case where the first terminal and the second terminal conduct the audio call, the first terminal supports the use of a target data channel, and the second terminal does not, obtaining a second request sent by the first terminal, where the second request is used to request creation of a target data channel; and, in response to the second request, creating the target data channel, where the target data channel is a data channel between the first terminal and a media server. Obtaining the first request sent by the first terminal or the second terminal includes: obtaining the first request transmitted by the first terminal on the target data channel. Creating the gesture recognition service in response to the first request includes: in response to the first request, sending a target instruction to the media server through a service control node, where the target instruction is used to request creation of the gesture recognition service; and creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service. Obtaining, in the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal includes: in the audio call, obtaining a fifth group of video frames and a corresponding fifth group of audio frames collected by the first terminal, and a fifth group of gestures recognized in the fifth group of video frames. Sending the target semantics to the second terminal includes: sending a fifth audio stream representing the target semantics to the second terminal.

In this embodiment, when the first terminal supports the use of the target data channel and the second terminal does not, after the gesture recognition service is created, semantic recognition is performed on the group of gestures recognized in the fifth group of video frames collected by the first terminal, to obtain the target semantics. The data stream representing the target semantics may include a text stream and a voice stream; that is, the gestures are converted into voice, text, or the like. After the semantics are recognized, the fifth audio stream representing the target speech is sent to the second terminal. For the second terminal, non-gesture communication is used, that is, normal voice communication; the voice frames of the second terminal are converted by the media server and/or a third-party service component into a gesture stream and a target text stream, which are sent to the first terminal over the target data channel (also called a dedicated data channel), synchronized with the audio stream collected by the second terminal. Through this embodiment, when the first terminal supports the target data channel and the second terminal does not, interactive communication in which one end uses gestures is achieved, and the gestures are converted into an audio stream before being sent.
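A hedged sketch of the gesture-to-speech path in this embodiment: a lookup-based translator followed by a stub text-to-speech step. The gesture lexicon entries and the `pcm<...>` chunk format are invented placeholders standing in for the real recognition and TTS services on the media server and third-party component.

```python
# Illustrative gesture lexicon; real sign-language translation is far richer.
GESTURE_LEXICON = {"wave": "hello", "thumbs_up": "yes", "flat_palm": "stop"}

def gestures_to_text(gestures):
    """Translate a recognized gesture sequence into target-semantics text,
    marking unrecognized gestures with '?'."""
    return " ".join(GESTURE_LEXICON.get(g, "?") for g in gestures)

def text_to_audio_stream(text):
    """Stub TTS: emit one placeholder 'audio chunk' per word. A real media
    server would synthesize encoded speech frames here."""
    return [f"pcm<{w}>" for w in text.split()]
```

For example, the recognized sequence `["wave", "thumbs_up"]` would become the text "hello yes" and then a two-chunk audio stream sent to the second terminal.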
Obviously, the embodiments described above are only some, rather than all, of the embodiments of the present disclosure. The present disclosure is described in detail below with reference to specific embodiments:
Fig. 3 is a diagram of the structure and media paths of a gesture communication system according to a specific embodiment of the present disclosure. As shown in Fig. 3, the system includes:
S101 terminal (type 1): a new terminal type. Type 1 corresponds to the aforementioned terminal that supports the target data channel (hereinafter referred to as "type 1"); it supports real-time audio/video stream channels as well as a dedicated channel for real-time data streams (the dedicated data channel, corresponding to the aforementioned target data channel). In the present disclosure, the terminal interacts with network-side entities through the dedicated data channel to provide end users with a new service experience, receiving network-side data streams through the dedicated channel and audio/video streams through the audio/video stream channel. In the present disclosure, this terminal type may be an independent application or a dedicated terminal device;
S102 terminal (type 2): a traditional terminal. Type 2 corresponds to the aforementioned terminal that does not support the target data channel (hereinafter referred to as "type 2") and supports only real-time audio/video stream channels. The terminal interacts with the "SBC/P-CSCF" network-side entity to provide end users with a service experience, receiving audio/video streams through the audio/video stream channel;
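The downlink-selection rule implied by the two terminal types can be sketched as follows: a type-1 terminal receives text and gesture data on the dedicated data channel alongside plain audio/video, while a type-2 terminal receives everything synthesized into the media streams. The stream labels are illustrative assumptions, not defined by the disclosure.

```python
def plan_downlink(terminal_type: int) -> dict:
    """Decide which streams go on which channel for a given terminal type.
    1 = supports the dedicated data channel; 2 = audio/video only."""
    if terminal_type == 1:
        return {"media": ["audio", "video"],
                "data_channel": ["text", "gesture"]}
    if terminal_type == 2:
        # No data channel: text/gestures must be burned into the video.
        return {"media": ["audio", "video+burned_text"],
                "data_channel": []}
    raise ValueError("unknown terminal type")
```

This is roughly the decision the application control node makes when it chooses between sending a real-time data stream and converting it into a media stream.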
S103 access control entity (SBC/P-CSCF): provides signaling and media access for terminals, supports audio/video stream channels and data stream channels, and forwards audio/video streams and data streams;
S104 session control entity (I/S-CSCF): the Interrogating/Serving Call Session Control Function, which provides registration authentication, session control, call routing and other basic IMS network functions for multiple terminal types, and triggers calls to the "service control node";
S105 service control node (Service Control Node): as the signaling control network element of the gesture communication system, it carries the IMS call management capability and is responsible for controlling calls; as the service-providing network element for gesture communication, it can invoke related services through the service bus, provide communication and service capabilities to other applications, and invoke services and control the forwarding of various media data streams, including real-time audio/video stream forwarding for calls and data stream forwarding;
Specific enhancements include, but are not limited to:
(1) managing audio/video calls and data stream channel calls, including but not limited to call establishment, transparent media transmission, media path redirection, call teardown, call event reporting, service invocation and service result notification;
(2) opening communication capabilities and services to the outside, processing service requests from applications and converting them into concrete control operations. For example, through the open interface provided by the service control node, the application control node can invoke the media server and third-party service components and request resources, implementing gesture recognition and translation to speech, gesture stream animation generation, and the synthesis of audio/video media streams and data streams into an integrated media stream, and notifying the service results;
(3) invoking and controlling, through the service bus, the various services provided by the media server, including but not limited to the creation, modification and deletion of data channels; the request, modification and deletion of audio/video media resources; and the request, modification and deletion of gesture recognition and translation capabilities;
For the present application, the service control node may exist independently or be co-deployed with the application control node;
S106 application control node (Application Control Node): implements the service logic of various services. Specific enhancements include, but are not limited to: (1) determining the media stream and data stream types to be sent according to the application form of the terminal (version number, device type, specific tags, etc.), for example whether to send a real-time data stream or convert it into a real-time media stream for delivery; (2) sending application control requests to the service control node, and invoking third-party service components and the media server to implement image processing, gesture recognition, conversion and synthesis; (3) invoking, through the service bus, the various services provided by the media server and reporting the service results;
It should be noted that the application control node may exist independently or be co-deployed with the service control node.
S107 media server (Media Server): provides various media services. Specific functions include, but are not limited to: (1) image recognition, such as recognizing images and gestures through feature data comparison; (2) real-time media stream generation, such as converting voice segments into corresponding RTP media streams; (3) real-time gesture stream generation, automatically generating a gesture stream video for recognized gestures; (4) a synthesis service that synthesizes existing and generated media streams and gesture streams for output (into the real-time audio/video stream), combining the video stream, gesture stream and text stream into a single video stream; (5) real-time audio/video stream forwarding, anchoring, processing and forwarding the audio/video streams of the current call; (6) a data stream forwarding service, forwarding gesture streams, text streams and other data streams through dedicated data channels, and establishing a dedicated channel for the synthesized integrated data stream; (7) allowing the service control node and the application control node to invoke, through the service bus, the various services it provides; (8) a mixed media service that supports processing audio/video streams and data streams in one mixed medium; (9) establishing dedicated data channels and securely transmitting gesture information through encryption.
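The service-bus pattern just described (control nodes invoking named media-server services, with a fall-back to a third-party component when a service is not available locally) could be modeled as below. The class and service names are invented for illustration; they are not APIs defined by the disclosure.

```python
class ServiceBus:
    """Toy service bus: named services registered by one element, with an
    optional fallback element consulted when a service is missing."""
    def __init__(self):
        self._services = {}
        self._fallback = None

    def register(self, name, fn):
        self._services[name] = fn

    def set_fallback(self, other):
        self._fallback = other

    def invoke(self, name, *args):
        if name in self._services:
            return self._services[name](*args)
        if self._fallback is not None:
            return self._fallback.invoke(name, *args)
        raise KeyError(f"no such service: {name}")
```

For example, a media server might serve gesture recognition itself while delegating gesture translation to the third-party component:

```python
media_server, third_party = ServiceBus(), ServiceBus()
media_server.set_fallback(third_party)
media_server.register("gesture_recognition",
                      lambda frame: "wave" if "hand" in frame else None)
third_party.register("gesture_translation", lambda g: {"wave": "hello"}[g])
```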
S108 third-party service component: can be invoked by the service control node and the application control node to provide gesture language translation, audio-text conversion and other services.
S109 HSS: provides user service data and related content.
现对本公开实施例的整体技术方案流程大致说明如下:The overall technical solution process of the embodiment of the present disclosure is roughly described as follows:
1)用户UE A携带终端标识向IMS网络发起音频或视频呼叫请求,呼叫UE B。经过SBC/P-CSCF,I/SCSCF,服务控制节点等网元,与UE B建立音频或者视频通话;1) User UE A carries the terminal ID to initiate an audio or video call request to the IMS network, and calls UE B. Establish an audio or video call with UE B through SBC/P-CSCF, I/SCSCF, service control node and other network elements;
UE A,UE B可以分别是不同的终端类型:终端(类型1)是一种新型的终端类型,它有实时音视频流通道,也有实时的数据流专用的通道;终端(类型2)是传统的终端,只支持实时音视频流通道;UE A and UE B can be different terminal types: terminal (type 1) is a new type of terminal, it has real-time audio and video stream channels, and also has a dedicated channel for real-time data streams; terminal (type 2) is a traditional The terminal only supports real-time audio and video streaming channels;
2) After the video or audio call is established, the user of a terminal (type 1) that supports the data stream channel applies to the "media server", via the "SBC/P-CSCF", "I/S-CSCF", and "service control node", to create data channel resources;
3) The "media server" returns the successfully created data channel resources;
4) The terminal (type 1, with a dedicated data channel) initiates a gesture recognition and conversion request to the "application control node" through the data channel;
the "application control node" instructs the "service control node" to create gesture recognition resources;
the "service control node" instructs the "media server" to create a mixed media service, which requires gesture-recognition-related services;
the "media server" applies to the "third-party service component" for the gesture recognition service, and the mixed media service is created successfully.
5) The "service control node" invites UE A and UE B to join the session via Reinvite, and applies to the "media server" for session resources for UE A and UE B;
6) The media of UE A and UE B is anchored to the "media server";
7) The "service control node" applies to the "media server" for gesture recognition, gesture translation service types, synthesis, and other processing;
8) The "media server" applies to the "third-party service component" for services such as gesture recognition, gesture translation, speech-to-text, text-to-speech, gesture stream generation, voice stream generation, synthesis and forwarding of gesture, voice, text, and video streams; the "media server" and the "third-party service component" execute the corresponding services;
9) The "media server" sends different stream information (synthesized and non-synthesized) to UE A and UE B according to their terminal types, including voice streams, video streams, gesture streams, text streams, and so on;
10) The "media server" returns operation responses, such as those for gesture recognition and for the gesture, text, and voice streams, to the "service control node".
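The per-terminal stream selection in step 9 can be sketched as a function of terminal type and user role. The mapping below is inferred from the five embodiments that follow and is only a simplified illustration; the stream names are labels, not protocol identifiers.

```python
# Simplified mapping, inferred from the five embodiments below, of which
# streams the "media server" delivers to one participant in step 9.
# The stream names are illustrative labels, not protocol identifiers.

def streams_for(terminal_type: int, is_gesture_user: bool, video_call: bool) -> set:
    if terminal_type == 1:
        # A dedicated data channel exists, so text (and, for a gesture
        # user, the gesture stream) can be delivered alongside the media.
        streams = {"voice", "text"}
        if is_gesture_user:
            streams.add("gesture")
        if video_call:
            streams.add("video")
        return streams
    # Type 2: no data channel, so everything must ride the audio/video
    # streams; in a video call the text is synthesized into the video.
    return {"voice", "video+text"} if video_call else {"voice"}

# Embodiment one: gesture user, type 1 terminal, video call.
assert streams_for(1, True, True) == {"gesture", "text", "voice", "video"}
# Embodiment five: non-gesture user, type 2 terminal, audio call.
assert streams_for(2, False, False) == {"voice"}
```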
Specific embodiment one: a video call between a gesture user (terminal type 1, with a dedicated data channel) and a non-gesture user (terminal type 1, with a dedicated data channel):
Figure 4 is a first example diagram of the gesture communication method according to a specific embodiment of the present disclosure. As shown in Figure 4, this embodiment is described by taking as an example a gesture user UE A, on a terminal (type 1), who places a video call to a non-gesture user UE B, also on a terminal (type 1):
Step S201: The gesture user UE A of the terminal (type 1), carrying its terminal identifier, initiates a video call to the SBC/P-CSCF to call the non-gesture user UE B. The Invite carries the SDP information for the terminal's video and audio;
Step S202: The SBC/P-CSCF transparently forwards the Invite call information to the I/S-CSCF;
Step S203: The I/S-CSCF finds the service control node corresponding to the user and sends the call information to it;
Steps S204–S206: The video call reaches the non-gesture user UE B of the terminal (type 1);
Steps S207–S218: UE B sends a 200 OK message carrying its terminal identifier and answers off-hook; UE A returns an ACK message; UE A and UE B establish the video call;
Steps S219–S229: UE A applies to create data channel resources. Needing gesture recognition, UE A sends an Invite request whose SDP carries the dedicated data channel, which passes through the SBC/P-CSCF and I/S-CSCF to reach the "service control node"; the "service control node" applies to the "media server" to create the UE A data channel; the "media server" reports to the "service control node" that the data channel has been created;
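Steps S219–S229 hinge on the Invite carrying a data-channel description in its SDP. The sketch below assembles such an offer under the assumption that the dedicated data channel is offered in the WebRTC data-channel form of RFC 8841; the disclosure itself does not fix the SDP syntax, and all addresses and ports are placeholders.

```python
# Hedged sketch of the SDP body such an Invite might carry. The
# "webrtc-datachannel" application m-line follows RFC 8841; the disclosure
# itself does not fix the SDP syntax, and all addresses/ports are placeholders.

def build_offer(session_ip: str, with_data_channel: bool) -> str:
    lines = [
        "v=0",
        f"o=UE_A 0 0 IN IP4 {session_ip}",
        "s=gesture-call",
        f"c=IN IP4 {session_ip}",
        "t=0 0",
        "m=audio 49170 RTP/AVP 0",          # existing audio stream
        "m=video 51372 RTP/AVP 96",         # existing video stream
        "a=rtpmap:96 H264/90000",
    ]
    if with_data_channel:
        # The dedicated data channel for gesture/text streams.
        lines += [
            "m=application 50000 UDP/DTLS/SCTP webrtc-datachannel",
            "a=sctp-port:5000",
        ]
    return "\r\n".join(lines) + "\r\n"

offer = build_offer("198.51.100.1", with_data_channel=True)
assert "webrtc-datachannel" in offer
```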
Step S230: UE A initiates a gesture recognition and conversion request through the data channel;
Step S231: The "application control node" instructs the "service control node" to create gesture recognition resources;
Step S232: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service;
Step S233: The "media server" applies to the "third-party service component" for the gesture recognition service;
Step S234: The "media server" reports to the "service control node" that the mixed media service has been created successfully;
Steps S235–S246: The "service control node" invites UE B to join the session and applies for mixed media resources for UE B; the "service control node" sends a Reinvite carrying an SDP message to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server;
Steps S247–S258: The "service control node" invites UE A to join the session and applies for mixed media resources for UE A; the "service control node" sends a Reinvite carrying an SDP message to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server;
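The Reinvite/200 OK/ACK exchanges used in steps S235–S258 to re-anchor each party's media can be sketched as a per-leg, happy-path state machine; the `Leg` class and its string-valued SDP payloads are illustrative assumptions.

```python
# Happy-path sketch of one Reinvite/200 OK/ACK leg; the Leg class and its
# string-valued SDP payloads are illustrative assumptions.

class Leg:
    def __init__(self, name: str):
        self.name = name
        self.state = "confirmed"          # an established call leg

    def reinvite(self, sdp_offer: str):
        # The service control node sends a Reinvite carrying an SDP offer;
        # the UE answers with 200 OK carrying its own SDP.
        self.state = "renegotiating"
        return "200 OK", f"answer-to-{sdp_offer}"

    def ack(self):
        # ACK completes the renegotiation; media is now re-anchored.
        self.state = "confirmed"

leg_b = Leg("UE B")
status, answer = leg_b.reinvite("mixed-media-offer")
leg_b.ack()
assert (status, answer, leg_b.state) == ("200 OK", "answer-to-mixed-media-offer", "confirmed")
```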
Step S259: The "service control node" applies to the "media server" for gesture translation service types and synthesis processing;
Step S260: The "media server" applies to the "third-party service component" for services such as speech-to-text processing of terminal data, gesture image recognition based on extracted feature data, real-time gesture stream generation, real-time media stream generation, synthesis, real-time audio/video stream forwarding, and data stream forwarding;
Steps S261–S264: The "media server" sends UE A the media stream information of the gesture stream, text stream, voice stream, and video stream; this media stream information may travel from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal;
Step S265: The "media server" applies to the "third-party service component" for the gesture translation, synthesis, and forwarding service;
Steps S266–S268: The "media server" sends UE B the media stream information of the voice stream, text stream, and video stream; this media stream information may take either of the two paths described above;
Step S269: The "media server" returns operation responses, such as those for gesture recognition and for the gesture, text, and voice streams, to the "service control node".
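Embodiment one's two conversion directions can be summarized with stub converters: UE A's gestures become a voice stream and text for UE B, while UE B's speech (already converted to text) drives a generated gesture stream for UE A. The lookup tables and `tts()` notation stand in for the recognition, translation, text-to-speech, and gesture stream generation services and are pure illustration.

```python
# Stub converters standing in for the recognition, translation, text-to-speech,
# and gesture stream generation services; the lookup tables and tts() notation
# are pure illustration.

GESTURE_TO_TEXT = {"wave": "hello"}   # gesture recognition + translation (assumed)
TEXT_TO_GESTURE = {"hello": "wave"}   # drives real-time gesture stream generation

def toward_non_gesture_user(gesture: str) -> dict:
    """UE A's gesture becomes a voice stream and a text overlay for UE B."""
    text = GESTURE_TO_TEXT[gesture]
    return {"voice": f"tts({text})", "text": text}

def toward_gesture_user(speech_text: str) -> dict:
    """UE B's speech (already converted to text) becomes a gesture stream
    and a text stream for UE A."""
    return {"gesture": TEXT_TO_GESTURE[speech_text], "text": speech_text}

assert toward_non_gesture_user("wave") == {"voice": "tts(hello)", "text": "hello"}
assert toward_gesture_user("hello") == {"gesture": "wave", "text": "hello"}
```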
Specific embodiment two: a video call between a non-gesture user (terminal type 2, without a dedicated data channel) and a gesture user (terminal type 1, with a dedicated data channel):
Figure 5 is a second example diagram of the gesture communication method according to a specific embodiment of the present disclosure. As shown in Figure 5, this embodiment is described by taking as an example a video call between a non-gesture user UE A (terminal type 2, without a dedicated data channel) and a gesture user UE B (terminal type 1, with a dedicated data channel):
Step S301: The non-gesture user UE A of the terminal (type 2), carrying its terminal identifier, initiates a video call to the SBC/P-CSCF to call the gesture user UE B. The Invite carries the SDP information for the terminal's video and audio;
Step S302: The SBC/P-CSCF transparently forwards the Invite call information to the I/S-CSCF;
Step S303: The I/S-CSCF finds the service control node corresponding to the user and sends the call information to it;
Steps S304–S306: The video call reaches the gesture user UE B of the terminal (type 1);
Steps S307–S318: UE B sends a 200 OK message carrying its terminal identifier and answers off-hook; UE A returns an ACK message; UE A and UE B establish the video call;
Steps S319–S329: UE B applies to create data channel resources. Needing gesture recognition, UE B sends an Invite request whose SDP carries the dedicated data channel, which passes through the SBC/P-CSCF and I/S-CSCF to reach the "service control node"; the "service control node" applies to the "media server" to create the UE B data channel; the "media server" reports to the "service control node" that the data channel has been created;
Step S330: UE B initiates a gesture recognition and conversion request through the data channel;
Step S331: The "application control node" instructs the "service control node" to create gesture recognition resources;
Step S332: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service;
Step S333: The "media server" applies to the "third-party service component" for the gesture recognition service;
Step S334: The "media server" reports to the "service control node" that the mixed media service has been created successfully;
Steps S335–S346: The "service control node" invites UE A to join the session and applies for mixed media resources for UE A; the "service control node" sends a Reinvite carrying an SDP message to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server;
Steps S347–S358: The "service control node" invites UE B to join the session and applies for mixed media resources for UE B; the "service control node" sends a Reinvite carrying an SDP message to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server;
Step S359: The "service control node" applies to the "media server" for gesture translation service types and synthesis processing;
Step S360: The "media server" applies to the "third-party service component" for the gesture translation, synthesis, and forwarding service, together with speech-to-text processing of terminal data, gesture image recognition based on extracted feature data, real-time gesture stream generation, real-time media stream generation, synthesis, real-time audio/video stream forwarding, data stream forwarding, and other services;
Steps S361–S362: The "media server" sends UE A the real-time voice stream converted from the gestures and the media stream information of a video stream into which the video and text have been synthesized; this media stream information may travel from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal;
Step S363: The "media server" applies to the "third-party service component" for the gesture stream generation, translation, synthesis, and forwarding service;
Steps S364–S367: The "media server" sends UE B the media stream information of the gesture stream, voice stream, text stream, and video stream; this media stream information may take either of the two paths described above;
Step S368: The "media server" returns operation responses, such as those for gesture recognition and for the gesture, text, and voice streams, to the "service control node".
Specific embodiment three: a video call between a gesture user (terminal type 2, without a dedicated data channel) and a non-gesture user (terminal type 1, with a dedicated data channel):
Figure 6 is a third example diagram of the gesture communication method according to a specific embodiment of the present disclosure. As shown in Figure 6, this embodiment is described by taking as an example a video call between a gesture user UE A (terminal type 2, without a dedicated data channel) and a non-gesture user UE B (terminal type 1, with a dedicated data channel):
Step S401: The gesture user UE A of the terminal (type 2), carrying its terminal identifier, initiates a video call to the SBC/P-CSCF to call the non-gesture user UE B. The Invite carries the SDP information for the terminal's video and audio;
Step S402: The SBC/P-CSCF transparently forwards the Invite call information to the I/S-CSCF;
Step S403: The I/S-CSCF finds the service control node corresponding to the user and sends the call information to it;
Steps S404–S406: The video call reaches the non-gesture user UE B of the terminal (type 1);
Steps S407–S418: UE B sends a 200 OK message carrying its terminal identifier and answers off-hook; UE A returns an ACK message; UE A and UE B establish the video call;
Steps S419–S429: UE B applies to create data channel resources. Needing gesture recognition, UE B sends an Invite request whose SDP carries the dedicated data channel, which passes through the SBC/P-CSCF and I/S-CSCF to reach the "service control node"; the "service control node" applies to the "media server" to create the UE B data channel; the "media server" reports to the "service control node" that the data channel has been created;
Step S430: UE B initiates a gesture recognition and conversion request through the data channel;
Step S431: The "application control node" instructs the "service control node" to create gesture recognition resources;
Step S432: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service;
Step S433: The "media server" applies to the "third-party service component" for the gesture recognition service;
Step S434: The "media server" reports to the "service control node" that the mixed media service has been created successfully;
Steps S435–S446: The "service control node" invites UE A to join the session and applies for mixed media resources for UE A; the "service control node" sends a Reinvite carrying an SDP message to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server;
Steps S447–S458: The "service control node" invites UE B to join the session and applies for mixed media resources for UE B; the "service control node" sends a Reinvite carrying an SDP message to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server;
Step S459: The "service control node" applies to the "media server" for gesture translation service types and synthesis processing;
Step S460: The "media server" applies to the "third-party service component" for the gesture translation, gesture stream generation, synthesis, and forwarding service, together with speech-to-text processing of terminal data, gesture image recognition based on extracted feature data, real-time gesture stream generation, real-time media stream generation, synthesis, real-time audio/video stream forwarding, data stream forwarding, and other services;
Steps S461–S462: The "media server" sends UE A the real-time voice stream converted from the gestures and the media stream information of a video stream into which the video and text have been synthesized; this media stream information may travel from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal;
Step S463: The "media server" applies to the "third-party service component" for the gesture stream generation, translation, synthesis, and forwarding service;
Steps S464–S466: The "media server" sends UE B the media stream information of the voice stream, text stream, and video stream; this media stream information may take either of the two paths described above;
Step S467: The "media server" returns operation responses, such as those for gesture recognition and for the gesture, text, and voice streams, to the "service control node".
Specific embodiment four: an audio call between a gesture user (terminal type 1, with a dedicated data channel) and a non-gesture user (terminal type 1, with a dedicated data channel):
Figure 7 is a fourth example diagram of the gesture communication method according to a specific embodiment of the present disclosure. As shown in Figure 7, this embodiment is described by taking as an example a gesture user UE A, on a terminal (type 1), who places an audio call to a non-gesture user UE B, also on a terminal (type 1):
Step S501: The gesture user UE A of the terminal (type 1), carrying its terminal identifier, initiates an audio call to the SBC/P-CSCF to call the non-gesture user UE B. The Invite carries the SDP information for the terminal's audio;
Step S502: The SBC/P-CSCF transparently forwards the Invite call information to the I/S-CSCF;
Step S503: The I/S-CSCF finds the service control node corresponding to the user and sends the call information to it;
Steps S504–S506: The audio call reaches the non-gesture user UE B of the terminal (type 1);
Steps S507–S518: UE B sends a 200 OK message carrying its terminal identifier and answers off-hook; UE A returns an ACK message; UE A and UE B establish the audio call;
Steps S519–S529: UE A starts the gesture recognition application, turns on the camera, and applies to create data channel resources. Needing gesture recognition, UE A sends an Invite request whose SDP carries the dedicated data channel, which passes through the SBC/P-CSCF and I/S-CSCF to reach the "service control node"; the "service control node" applies to the "media server" to create the UE A data channel; the "media server" reports to the "service control node" that the data channel has been created; the gesture recognition application then collects gesture data;
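The gesture data collected in steps S519–S529 must be carried over the dedicated data channel in some serialized form. The JSON framing below is an assumption for illustration, since the disclosure only requires that gesture information travel over the dedicated channel (optionally encrypted), not this exact layout.

```python
import json

# Assumed JSON framing for gesture data on the dedicated data channel;
# the disclosure only requires that gesture information travel over the
# dedicated channel (optionally encrypted), not this exact layout.

def frame_gesture_sample(call_id: str, features: list) -> bytes:
    msg = {
        "call_id": call_id,
        "type": "gesture_features",
        "features": features,   # feature data for server-side comparison
    }
    return json.dumps(msg).encode("utf-8")

frame = frame_gesture_sample("call-1", [0.12, 0.87])
decoded = json.loads(frame)
assert decoded["type"] == "gesture_features" and decoded["features"] == [0.12, 0.87]
```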
Step S530: UE A initiates a gesture recognition and conversion request through the data channel;
Step S531: The "application control node" instructs the "service control node" to create gesture recognition resources;
Step S532: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service;
Step S533: The "media server" applies to the "third-party service component" for the gesture recognition service;
Step S534: The "media server" reports to the "service control node" that the mixed media service has been created successfully;
Steps S535–S546: The "service control node" invites UE B to join the session and applies for mixed media resources for UE B; the "service control node" sends a Reinvite carrying an SDP message to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server;
Steps S547–S558: The "service control node" invites UE A to join the session and applies for mixed media resources for UE A; the "service control node" sends a Reinvite carrying an SDP message to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server;
Step S559: The "service control node" applies to the "media server" for gesture translation service types and synthesis processing;
Step S560: The "media server" applies to the "third-party service component" for services such as speech-to-text processing of terminal data, gesture image recognition based on extracted feature data, real-time gesture stream generation, real-time media stream generation, synthesis, real-time audio stream forwarding, and data stream forwarding;
Steps S561–S563: The "media server" sends UE A the media stream information of the gesture stream, text stream, and voice stream; this media stream information may travel from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal;
Step S564: The "media server" applies to the "third-party service component" for the gesture translation stream synthesis and forwarding service;
Steps S565–S566: The "media server" sends UE B the media stream information of the voice stream and text stream; this media stream information may travel from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal;
Step S567: The "media server" returns operation responses, such as those for gesture recognition and for the gesture, text, and voice streams, to the "service control node".
Specific embodiment five: an audio call between a non-gesture user (terminal type 2, without a dedicated data channel) and a gesture user (terminal type 1, with a dedicated data channel):
Figure 8 is a fifth example diagram of the gesture communication method according to a specific embodiment of the present disclosure. As shown in Figure 8, this embodiment is described by taking as an example an audio call between a non-gesture user UE A (terminal type 2, without a dedicated data channel) and a gesture user UE B (terminal type 1, with a dedicated data channel):
Step S601: The non-gesture user UE A of the terminal (type 2), carrying its terminal identifier, initiates an audio call to the SBC/P-CSCF to call the gesture user UE B. The Invite carries the SDP information for the terminal's audio;
Step S602: The SBC/P-CSCF transparently forwards the Invite call information to the I/S-CSCF;
Step S603: The I/S-CSCF finds the service control node corresponding to the user and sends the call information to it;
Steps S604–S606: The audio call reaches the gesture user UE B of the terminal (type 1);
Steps S607–S618: UE B sends a 200 OK message carrying its terminal identifier and answers off-hook; UE A returns an ACK message; UE A and UE B establish the audio call;
Steps S619–S629: UE B starts the gesture recognition application, turns on the camera, and applies to create data channel resources. Needing gesture recognition, UE B sends an Invite request whose SDP carries the dedicated data channel, which passes through the SBC/P-CSCF and I/S-CSCF to reach the "service control node"; the "service control node" applies to the "media server" to create the UE B data channel; the "media server" reports to the "service control node" that the data channel has been created; the gesture recognition application then collects gesture data;
Step S630: UE B initiates a gesture recognition and conversion request through the data channel;
Step S631: The "application control node" instructs the "service control node" to create gesture recognition resources;
Step S632: The "service control node" instructs the "media server" to create a mixed media service, which needs to use the gesture recognition service;
Step S633: The "media server" applies to the "third-party service component" for the gesture recognition service;
Step S634: The "media server" reports to the "service control node" that the mixed media service has been created successfully;
Steps S635–S646: The "service control node" invites UE A to join the session and applies for mixed media resources for UE A; the "service control node" sends a Reinvite carrying an SDP message to UE A; UE A returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE A; the media of UE A is anchored to the media server;
Steps S647–S658: The "service control node" invites UE B to join the session and applies for mixed media resources for UE B; the "service control node" sends a Reinvite carrying an SDP message to UE B; UE B returns a 200 OK message carrying SDP information; the "service control node" applies to the "media server" for the mixed media resources required by UE B; the media of UE B is anchored to the media server;
步骤S659:“服务控制节点”向“媒体服务器”申请手势翻译业务种类及合成处理;Step S659: The "service control node" applies to the "media server" for gesture translation service types and synthesis processing;
步骤S660:“媒体服务器”向“第三方服务组件”申请手势翻译转发服务,对终端数据的语音转文字处理,提取特征数据的手势图像识别,实时手势流生成,实时媒体流生成,合成服务,实时音频流转发,数据流转发等;Step S660: The "media server" applies to the "third-party service component" for gesture translation and forwarding services, voice-to-text processing of terminal data, gesture image recognition for feature data extraction, real-time gesture stream generation, real-time media stream generation, synthesis services, Real-time audio stream forwarding, data stream forwarding, etc.;
步骤S661:“媒体服务器”向UE A发送手势转换成的实时语音流的媒体流信息;该媒体流信息可以是“媒体服务器”经过“应用控制节点”到SBC/PCSCF再到终端;也可以是“媒体服务器”经过“服务控制节点”、“应用控制节点”到SBC/PCSCF再到终端;Step S661: The "media server" sends to UE A the media stream information of the real-time voice stream converted from the gesture; the media stream information can be from the "media server" to the SBC/PCSCF through the "application control node" and then to the terminal; it can also be The "media server" passes through the "service control node" and "application control node" to the SBC/PCSCF and then to the terminal;
Step S662: the "media server" applies to the "third-party service component" for gesture stream generation, translation, synthesis, and forwarding services.
Steps S663 to S665: the "media server" sends the media stream information of the gesture stream, voice stream, and text stream to UE B; the media stream information may travel from the "media server" through the "application control node" to the SBC/P-CSCF and then to the terminal, or from the "media server" through the "service control node" and the "application control node" to the SBC/P-CSCF and then to the terminal.
Step S666: the "media server" returns operation responses, such as those for gesture recognition and for the gesture stream, text stream, and voice stream, to the "service control node".
Through the above embodiments, the achievable objectives include: 1) transmitting gesture information by using a dedicated data channel; 2) reducing the requirements on the terminal by performing gesture recognition on the network side, so that the terminal only needs to be a capture device with a camera, such as an ordinary mobile phone; when an IMS call is established, the gesture recognition application can instruct the terminal to collect gestures as required, the collected gesture-related information is transmitted through the dedicated channel, and a gesture recognition request is initiated to the gesture recognition application server; 3) providing comprehensive services on the platform side, including gesture recognition, analysis, and synthesis, and transmitting service information through the dedicated channel; 4) supporting two-way conversion between sign language and voice/video: sign-language-related gesture information is recognized, analyzed, processed, and synthesized, and, after processing and rendering, is combined into the transcribed text, a standard sign-language video, and the original voice/video stream; 5) supporting the conversion of communication content between different terminal types: by identifying different types of terminals, the platform side converts the information flows between different terminals, thereby realizing gesture communication between different types of terminals. A terminal type supporting the data channel may be an independent application program or a dedicated terminal device.
Through the embodiments of the present application, the achievable effects include the following. (1) Real-time interaction: user communication is economical, convenient, highly usable, and effective. The system uses dedicated channels of 5G and 6G networks and transmits multiple service streams simultaneously through a network-side mixed-media mode, providing a system and method for gesture communication that enables economical, convenient, and rich-experience communication between gesture users and non-gesture users, without relying on special wearable devices. Traditional gesture recognition that relies on wearable devices uses expensive equipment, is only suitable for interaction within a certain range, is often subject to time and space constraints, has poor usability, and does not provide direct, natural interaction and communication. (2) Good scalability: the platform side provides comprehensive services, can connect to third-party service components for service expansion, and can provide interactive and immersive calls under the new architecture. (3) Good security: by using dedicated channels of 5G and 6G networks and IMS calls, data between the terminal and the network is transmitted through encrypted channels, preventing information leakage. (4) Support for converting communication content between different terminal types: by identifying different types of terminals, the platform side converts the information flows between different terminals, realizing gesture communication between different types of terminals.

The specific beneficial effects include at least the following. 1) When a gesture user using terminal type 1 makes a video call with a non-gesture user (using terminal type 1 or 2), where the call may be a video call placed by the gesture user to the non-gesture user or a video call placed by the non-gesture user to the gesture user, either the gesture user or a non-gesture user using terminal type 1 may apply for gesture recognition and conversion. The gesture user can receive and see the standard gesture stream video, text, original voice, and original video converted from the voice of the non-gesture user at the other end; the non-gesture user can hear and see the voice and text converted from the gesture user's gestures, together with the original call video. When the non-gesture user uses terminal type 1, the non-gesture user receives, sees, and hears the voice stream, text stream, and original video stream; when the non-gesture user uses terminal type 2, the non-gesture user receives and hears the voice stream and sees a video stream synthesized from the video and text. 2) When a gesture user using terminal type 2 makes a video call with a non-gesture user (using terminal type 1), where the call may be placed by either party, the non-gesture user may also apply for gesture conversion. The gesture user can see and hear the video stream and voice stream converted from the non-gesture user's voice and synthesized from the gestures, text, and original video; the non-gesture user can see and hear the voice, text, and original call video converted from the gesture user's gestures. 3) When a gesture user using terminal type 1 makes an audio call with a non-gesture user (using terminal type 1 or 2), where the call may be placed by either party, either the gesture user or a non-gesture user using terminal type 1 may apply for gesture recognition and conversion; when the gesture user applies for gesture recognition and conversion, the gesture recognition application is enabled and the camera is turned on. The gesture user can receive and see the standard gesture stream, text, and original voice converted from the voice of the non-gesture user at the other end; the non-gesture user can hear and see the voice stream and text converted from the gesture user's gestures. When the non-gesture user uses terminal type 1, the non-gesture user receives, sees, and hears the voice stream and text stream; when the non-gesture user uses terminal type 2, the non-gesture user receives and hears the voice stream.
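The per-terminal-type behavior enumerated above amounts to a dispatch rule on the platform side. The following is a hypothetical sketch of that rule (the function and stream names are invented for illustration, not defined by the patent), for the direction from a gesture user toward a non-gesture user; terminal type 1 supports the data channel and terminal type 2 does not:

```python
# Toy dispatch logic: which converted streams are sent toward a
# non-gesture user whose peer is a gesture user, per the summary above.

def streams_for_non_gesture_user(call: str, terminal_type: int) -> set:
    """Return the set of streams delivered to the non-gesture user."""
    if call == "video":
        if terminal_type == 1:
            # Voice, text, and the original video travel separately;
            # the text stream rides the data channel.
            return {"voice_stream", "text_stream", "original_video_stream"}
        # Type 2 has no data channel: text is synthesized into the video.
        return {"voice_stream", "video_with_text_overlay"}
    if call == "audio":
        if terminal_type == 1:
            return {"voice_stream", "text_stream"}
        return {"voice_stream"}
    raise ValueError(f"unknown call type: {call}")
```

The opposite direction (toward the gesture user) would follow an analogous table, delivering the standard gesture stream, text, and original media.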
The emergence of fifth-generation (5G) communication technology provides users with mobile networks of higher bandwidth, lower latency, and wider coverage, and can support more applications such as live streaming, virtual reality, and 4K video. 5G technology targets five main application scenarios: 1) ultra-high-speed scenarios, providing extremely fast data network access for future mobile broadband users; 2) support for large crowds, providing a high-quality mobile broadband experience in areas or situations with high population density; 3) the best experience anytime and anywhere, ensuring that users still enjoy high-quality service while on the move; 4) ultra-reliable real-time connections, ensuring that new applications and use cases meet strict standards for latency and reliability; 5) ubiquitous machine-to-machine communication, ensuring efficient handling of communication from a large number of diverse devices, including machine-type devices and sensors.
The above applications place higher requirements on the communication system in 5G networks. 3GPP (Third Generation Partnership Project) Release 16 introduced the IMS (IP Multimedia Subsystem) data channel mechanism. By exploiting the high bandwidth and low latency of 5G networks, the data channel can provide users, on top of audio and video, with additional information such as pictures, text, location, business cards, actions, expressions, and animations, enabling high-definition, visual, novel interactive, and immersive service experiences.
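The additional information types listed above travel as structured messages over the data channel. A toy sketch of one possible framing is shown below; the message schema is invented for illustration and is not defined by 3GPP or by the patent:

```python
# Hypothetical serialization of one data channel message carrying the
# supplementary content types named above (picture/text/location/...).
import json

def make_dc_message(kind: str, payload: dict) -> bytes:
    """Serialize one data channel message as length-agnostic JSON bytes."""
    allowed = {"picture", "text", "location", "card", "action",
               "expression", "animation"}
    if kind not in allowed:
        raise ValueError(f"unsupported kind: {kind}")
    return json.dumps({"kind": kind, "payload": payload}).encode("utf-8")

msg = make_dc_message("text", {"body": "hello"})
```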
In the embodiments of the present application, a system and method are provided for realizing gesture communication in a mixed-media manner by using a dedicated data channel, applicable to 5G and 6G networks. The following problems of gesture recognition or gesture translation in the related art can be avoided: 1) many existing implementations use specific wearable devices on the terminal side to provide the capture function; these devices are expensive, are only suitable for interaction within a certain range, and are subject to time and space constraints, so they are not economical, convenient, or usable enough, and do not provide direct, natural interaction and communication; 2) some implementations provide system functions such as gesture recognition, translation, and synthesis on the terminal side, which places high requirements on the terminal; gesture recognition, translation, and synthesis are not provided on the network side, and information updates are not timely; 3) conversion between different terminal types cannot be realized; 4) some techniques require both communication parties to be in a video call to realize gesture communication, and require the platform side to package the gesture content and send it back to the terminal, which then sends it to the terminal on the other side; gesture communication during a voice call cannot be realized.
The user interface involved in the embodiments of the present application is briefly described as follows: during an audio call, the terminal can turn on the camera through the terminal-side "gesture recognition application"; during the call, the terminal can find a menu containing the gesture recognition function and can initiate a gesture recognition request; and the terminal receives the video, gesture, and text information sent over the data channel, and these contents are presented synchronously on the local handset.
In this embodiment, a gesture communication apparatus is further provided. FIG. 9 is a structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure. As shown in FIG. 9, the apparatus includes:
a first acquiring module 902, configured to acquire, when a first terminal and a second terminal are in a video call or an audio call, a first request sent by the first terminal or the second terminal, where the first request is used to request recognition of gestures collected by the first terminal during the video call or audio call;
a first creating module 904, configured to create, in response to the first request, a gesture recognition service, where the gesture recognition service is used to recognize the gestures collected by the first terminal;
a second acquiring module 906, configured to acquire, during the video call or audio call, a group of gestures recognized in a group of video frames collected by the first terminal;
a recognition module 908, configured to perform, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal, to obtain the target semantics represented by the group of gestures; and
a first sending module 910, configured to send the target semantics to the second terminal.
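The core of this module pipeline is the mapping performed by the recognition module 908 from a group of gestures to target semantics. The sketch below illustrates that step only; the gesture lexicon and all names are hypothetical stand-ins, whereas a real deployment would invoke the network-side gesture recognition service:

```python
# Minimal stand-in for recognition module 908: map a group of gestures
# to target semantics, preferring a phrase-level entry and otherwise
# concatenating per-gesture semantics. The lexicon is invented.

GESTURE_LEXICON = {
    ("wave",): "hello",
    ("thumb_up",): "good",
    ("point_self", "wave"): "I say hello",
}

def recognize_target_semantics(gesture_group: tuple) -> str:
    """Return the target semantics represented by the group of gestures."""
    if gesture_group in GESTURE_LEXICON:
        return GESTURE_LEXICON[gesture_group]
    parts = [GESTURE_LEXICON.get((g,), "<unknown>") for g in gesture_group]
    return " ".join(parts)
```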
In an optional embodiment, the above apparatus further includes a third acquiring module 1002 and a second creating module 1004, as shown in FIG. 10, which is a first preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure. The third acquiring module 1002 is configured to acquire a second request sent by the first terminal or the second terminal, where the second request is used to request creation of a target data channel; the second creating module 1004 is configured to create, in response to the second request, the target data channel, where the target data channel is a channel that the first terminal or the second terminal is allowed to use; and the above first acquiring module 902 includes a first acquiring unit, configured to acquire the first request transmitted by the first terminal or the second terminal on the target data channel.
In an optional embodiment, the above third acquiring module 1002 includes a second acquiring unit, configured to acquire the second request sent by the first terminal or the second terminal to a media server through an access control entity SBC/P-CSCF, a session control entity I/S-CSCF, and a service control node; and the above second creating module 1004 includes a first creating unit, configured to create, in response to the second request, the target data channel through the media server, where the target data channel is used to transmit data between the first terminal or the second terminal and the media server.
In an optional embodiment, the above first acquiring unit includes a first acquiring subunit, configured to acquire the first request transmitted by the first terminal or the second terminal to an application control node on the target data channel; and the above first creating module 904 includes: a first processing unit, configured such that the application control node sends a first instruction to the service control node, where the first instruction is used to instruct the service control node to send a second instruction to the media server, and the second instruction is used to instruct the media server to create the gesture recognition service; and a second creating unit, configured to create, in response to the second instruction, the gesture recognition service through the media server, or instruct, through the media server, a third-party service component to create the gesture recognition service.
In an optional embodiment, the above apparatus further includes a second sending module 1102 and a third creating module 1104, as shown in FIG. 11, which is a second preferred structural block diagram of a gesture communication apparatus according to an embodiment of the present disclosure. The second sending module 1102 is configured to send a third instruction to the media server through the service control node, where the third instruction is used to request creation of a mixed-media service, the mixed-media service is used to process the video stream, audio stream, and data stream in the video call, or to process the audio stream and data stream in the audio call, and the data stream is a data stream representing the target semantics; and the third creating module 1104 is configured to create, in response to the third instruction, the mixed-media service through the media server, or instruct, through the media server, a third-party service component to create the mixed-media service.
In an optional embodiment, the above recognition module 908 includes: a first recognition unit, configured to perform, through the gesture recognition service, semantic recognition on the group of gestures recognized in the group of video frames collected by the first terminal, to obtain one or more semantics, where each of the semantics is the semantics expressed by one or more gestures in the group of gestures; and a generating unit, configured to generate, based on the one or more semantics, the target semantics corresponding to the group of gestures.
In an optional embodiment, the above first sending module 910 includes: a first sending unit, configured to send, when the target semantics is a semantics concatenated from the one or more semantics, each of the semantics included in the target semantics to the second terminal synchronously with the corresponding video frames in the group of video frames; or a synthesizing unit, configured to synthesize, when the target semantics is represented by a data stream corresponding to the group of video frames and the data stream includes a text stream and an audio stream, the text stream synchronously with the corresponding video frames in the group of video frames to obtain a target video stream; and a second sending unit, configured to send the target video stream to the second terminal synchronously with the audio stream.
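The synchronous synthesis performed by the synthesizing unit amounts to aligning each text fragment with the video frames it belongs to. A simplified sketch of such timestamp alignment is given below; the data structures and function name are illustrative assumptions, not the patent's implementation:

```python
# Toy timestamp alignment: pair each text fragment with the video frame
# timestamps that fall inside the fragment's interval, yielding the
# overlay plan for the 'target video stream'.

def align_text_to_frames(text_stream, frame_timestamps):
    """text_stream: list of (start_ms, end_ms, text) fragments.
    Returns a list of (text, frames_to_overlay_on) pairs."""
    aligned = []
    for start, end, text in text_stream:
        frames = [t for t in frame_timestamps if start <= t < end]
        aligned.append((text, frames))
    return aligned
```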
In an optional embodiment, the above apparatus further includes: a fourth acquiring module, configured to acquire, when the first terminal and the second terminal are in the video call and both the first terminal and the second terminal support use of a target data channel, a second request sent by the first terminal, where the second request is used to request creation of the target data channel; and a fourth creating module, configured to create, in response to the second request, the target data channel, where the target data channel includes a first target data channel and a second target data channel, the first target data channel is a data channel between the first terminal and a media server, and the second target data channel is a data channel between the second terminal and the media server. The above first acquiring module 902 includes a third acquiring unit, configured to acquire the first request transmitted by the first terminal on the first target data channel. The above first creating module 904 includes: a second processing unit, configured to send, in response to the first request, a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed-media service and the gesture recognition service, the mixed-media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and a third creating unit, configured to create the mixed-media service and the gesture recognition service through the media server, or instruct, through the media server, a third-party service component to create the mixed-media service and the gesture recognition service. The above second acquiring module 906 includes a fourth acquiring unit, configured to acquire, during the video call, a first group of video frames and a corresponding first group of audio frames collected by the first terminal, and a first group of gestures recognized in the first group of video frames. The above apparatus further includes a first processing module, configured to synchronize, through the mixed-media service, a first video stream formed by the first group of video frames, a first audio stream formed by the first group of audio frames, and a first data stream used to represent the target semantics, to obtain the synchronized first video stream, first audio stream, and first data stream. The above first sending module 910 includes a third sending unit, configured to send the synchronized first video stream, first audio stream, and first data stream to the second terminal, where the synchronized first data stream is sent on the second target data channel.
In an optional embodiment, the above apparatus further includes: a fifth acquiring module, configured to acquire, when the first terminal and the second terminal are in the video call, the first terminal supports use of a target data channel, and the second terminal does not support use of the target data channel, a second request sent by the first terminal, where the second request is used to request creation of the target data channel; and a fifth creating module, configured to create, in response to the second request, the target data channel, where the target data channel is a data channel between the first terminal and a media server. The above first acquiring module 902 includes a fifth acquiring unit, configured to acquire the first request transmitted by the first terminal on the target data channel. The above first creating module 904 includes: a third processing unit, configured to send, in response to the first request, a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed-media service, a synthesis service, and the gesture recognition service, the mixed-media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and a fourth creating unit, configured to create the mixed-media service, the synthesis service, and the gesture recognition service through the media server, or instruct, through the media server, a third-party service component to create the mixed-media service, the synthesis service, and the gesture recognition service. The above second acquiring module 906 includes a sixth acquiring unit, configured to acquire, during the video call, a second group of video frames and a corresponding second group of audio frames collected by the first terminal, and a second group of gestures recognized in the second group of video frames. The above apparatus further includes a second processing module, configured to synthesize, through the synthesis service, a first text stream used to represent the target semantics with a video stream formed by the second group of video frames to obtain a second video stream, and to synchronize, through the mixed-media service, a second audio stream included in the data stream used to represent the target semantics with the second video stream, to obtain the synchronized second video stream and second audio stream, where the data stream includes the first text stream. The above first sending module 910 includes a fourth sending unit, configured to send the synchronized second video stream and second audio stream to the second terminal.
In an optional embodiment, the above apparatus further includes: a sixth acquiring module, configured to acquire, when the first terminal and the second terminal are in the video call, the first terminal does not support use of a target data channel, and the second terminal supports use of the target data channel, a second request sent by the second terminal, where the second request is used to request creation of the target data channel; and a sixth creating module, configured to create, in response to the second request, the target data channel, where the target data channel is a data channel between the second terminal and a media server. The above first acquiring module 902 includes a seventh acquiring unit, configured to acquire the first request transmitted by the second terminal on the target data channel. The above first creating module 904 includes: a fourth processing unit, configured to send, in response to the first request, a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed-media service and the gesture recognition service, the mixed-media service is used to process the video stream, audio stream, and data stream in the video call, and the data stream is a data stream representing the target semantics; and a fifth creating unit, configured to create the mixed-media service and the gesture recognition service through the media server, or instruct, through the media server, a third-party service component to create the mixed-media service and the gesture recognition service. The above second acquiring module 906 includes an eighth acquiring unit, configured to acquire, during the video call, a third group of video frames and a corresponding third group of audio frames collected by the first terminal, and a third group of gestures recognized in the third group of video frames. The above apparatus further includes a third processing module, configured to synchronize, through the mixed-media service, a third video stream formed by the third group of video frames, a third audio stream formed by the third group of audio frames, and a third data stream used to represent the target semantics, to obtain the synchronized third video stream, third audio stream, and third data stream. The above first sending module 910 includes a fifth sending unit, configured to send the synchronized third video stream, third audio stream, and third data stream to the second terminal, where the synchronized third data stream is sent on the target data channel.
In an optional embodiment, the above apparatus further includes: a seventh acquiring module, configured to acquire, when the first terminal and the second terminal are in the audio call and both the first terminal and the second terminal support use of a target data channel, a second request sent by the first terminal, where the second request is used to request creation of the target data channel; and a seventh creating module, configured to create, in response to the second request, the target data channel, where the target data channel includes a first target data channel and a second target data channel, the first target data channel is a data channel between the first terminal and a media server, and the second target data channel is a data channel between the second terminal and the media server. The above first acquiring module 902 includes a ninth acquiring unit, configured to acquire the first request transmitted by the first terminal on the first target data channel. The above first creating module 904 includes: a fifth processing unit, configured to send, in response to the first request, a target instruction to the media server through a service control node, where the target instruction is used to request creation of a mixed-media service and the gesture recognition service, the mixed-media service is used to process the audio stream and data stream in the audio call, and the data stream is a data stream representing the target semantics; and a sixth creating unit, configured to create the mixed-media service and the gesture recognition service through the media server, or instruct, through the media server, a third-party service component to create the mixed-media service and the gesture recognition service. The above second acquiring module 906 includes a tenth acquiring unit, configured to acquire, during the audio call, a fourth group of video frames and a corresponding fourth group of audio frames collected by the first terminal, and a fourth group of gestures recognized in the fourth group of video frames. The above apparatus further includes a fourth processing module, configured to synchronize, through the mixed-media service, a second text stream used to represent the target semantics and a fourth audio stream formed by the fourth group of audio frames, to obtain the synchronized second text stream and fourth audio stream, where the data stream includes the second text stream. The above first sending module 910 includes a sixth sending unit, configured to send the synchronized second text stream and fourth audio stream to the second terminal, where the synchronized second text stream is sent on the second target data channel.
In an optional embodiment, the apparatus further includes: an eighth acquisition module, configured to acquire, when the first terminal and the second terminal are conducting the audio call and the first terminal supports the use of a target data channel while the second terminal does not, a second request sent by the first terminal, where the second request is used to request creation of the target data channel; and an eighth creating module, configured to create the target data channel in response to the second request, where the target data channel is a data channel between the first terminal and a media server. The first acquisition module 902 includes: an eleventh acquisition unit, configured to acquire the first request transmitted by the first terminal on the target data channel. The first creation module 904 includes: a sixth processing unit, configured to send, in response to the first request, a target instruction to the media server through a service control node, where the target instruction is used to request creation of the gesture recognition service; and a seventh creating unit, configured to create the gesture recognition service through the media server, or to instruct, through the media server, a third-party service component to create the gesture recognition service. The second acquisition module 906 includes: a twelfth acquisition unit, configured to acquire, during the audio call, a fifth group of video frames and a corresponding fifth group of audio frames collected by the first terminal, as well as a fifth group of gestures identified in the fifth group of video frames. The first sending module 910 includes: a seventh sending unit, configured to send a fifth audio stream used to represent the target semantics to the second terminal.
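By way of a non-normative illustration of the synchronization performed by the processing modules above, the pairing of a gesture-semantics text stream with audio frames by timestamp could be sketched as follows. All class and field names here (`AudioFrame`, `TextChunk`, `ts_ms`, the matching window) are illustrative assumptions and are not defined in this disclosure:

```python
# Illustrative sketch: align a recognized-semantics text stream with audio
# frames by capture timestamp, as a mixed media service might do before
# forwarding the text on the data channel and the audio on the media plane.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AudioFrame:
    ts_ms: int       # capture timestamp of the frame, in milliseconds
    payload: bytes   # encoded audio data

@dataclass
class TextChunk:
    ts_ms: int       # timestamp of the gesture the text was derived from
    text: str        # recognized semantics, e.g. "hello"

def synchronize(text_stream: List[TextChunk],
                audio_stream: List[AudioFrame],
                window_ms: int = 200) -> List[Tuple[AudioFrame, str]]:
    """Pair each audio frame with the text chunks whose timestamps fall
    within window_ms of the frame, so both can be emitted together."""
    out = []
    for frame in audio_stream:
        matched = [c.text for c in text_stream
                   if abs(c.ts_ms - frame.ts_ms) <= window_ms]
        out.append((frame, " ".join(matched)))
    return out
```

In an actual deployment the pairing would be driven by the media clock of the call rather than a fixed window; the sketch only shows the shape of the synchronization step.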
It should be noted that the above modules may be implemented in software or hardware. In the latter case, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor, or the above modules are distributed, in any combination, across different processors.
Embodiments of the present disclosure further provide a computer-readable storage medium storing a computer program, where the computer program is configured to perform the steps in any one of the above method embodiments when run.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Embodiments of the present disclosure further provide an electronic apparatus including a memory and a processor, where the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, both of which are connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.
Obviously, those skilled in the art will appreciate that the modules or steps of the present disclosure described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of multiple computing devices. They may be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from that described here; alternatively, they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present disclosure is not limited to any specific combination of hardware and software.
The above descriptions are merely preferred embodiments of the present disclosure and are not intended to limit it; for those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the principles of the present disclosure shall fall within its protection scope.

Claims (15)

  1. A gesture-based communication method, comprising:
    acquiring, when a first terminal and a second terminal are conducting a video call or an audio call, a first request sent by the first terminal or the second terminal, wherein the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures identified in video frames collected by the first terminal;
    creating the gesture recognition service in response to the first request;
    acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal;
    performing, through the gesture recognition service, semantic recognition on the group of gestures identified in the group of video frames collected by the first terminal, to obtain target semantics represented by the group of gestures; and
    sending the target semantics to the second terminal.
  2. The method according to claim 1, wherein:
    the method further comprises: acquiring a second request sent by the first terminal or the second terminal, wherein the second request is used to request creation of a target data channel; and creating the target data channel in response to the second request, wherein the target data channel is a channel that the first terminal or the second terminal is permitted to use; and
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the first terminal or the second terminal on the target data channel.
  3. The method according to claim 2, wherein:
    the acquiring the second request sent by the first terminal or the second terminal comprises: acquiring the second request sent by the first terminal or the second terminal to a media server via an access control entity SBC/P-CSCF, a session control entity I/S-CSCF, and a service control node; and
    the creating the target data channel in response to the second request comprises: creating the target data channel through the media server in response to the second request, wherein the target data channel is used to transmit data between the first terminal or the second terminal and the media server.
  4. The method according to claim 3, wherein:
    the acquiring the first request transmitted by the first terminal or the second terminal on the target data channel comprises: acquiring the first request transmitted by the first terminal or the second terminal to an application control node on the target data channel; and
    the creating the gesture recognition service in response to the first request comprises: sending, by the application control node, a first instruction to the service control node, wherein the first instruction is used to instruct the service control node to send a second instruction to the media server, and the second instruction is used to instruct the media server to create the gesture recognition service; and, in response to the second instruction, creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service.
  5. The method according to claim 1, further comprising:
    sending a third instruction to a media server through a service control node, wherein the third instruction is used to request creation of a mixed media service, the mixed media service is used to process a video stream, an audio stream, and a data stream in the video call, or to process an audio stream and a data stream in the audio call, and the data stream represents the target semantics; and, in response to the third instruction, creating the mixed media service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service.
  6. The method according to claim 1, wherein the performing, through the gesture recognition service, semantic recognition on the group of gestures identified in the group of video frames collected by the first terminal, to obtain the target semantics represented by the group of gestures, comprises:
    performing, through the gesture recognition service, semantic recognition on the group of gestures identified in the group of video frames collected by the first terminal, to obtain one or more semantics, wherein each of the semantics is the meaning expressed by one or more gestures in the group of gestures; and
    generating, based on the one or more semantics, the target semantics corresponding to the group of gestures.
  7. The method according to claim 6, wherein the sending the target semantics to the second terminal comprises:
    when the target semantics is a semantics spliced from the one or more semantics, sending each semantics included in the target semantics to the second terminal in synchronization with the corresponding video frame in the group of video frames; or
    when the target semantics is represented by a data stream corresponding to the group of video frames, and the data stream comprises a text stream and an audio stream, synchronously combining the text stream with the corresponding video frames in the group of video frames to obtain a target video stream, and sending the target video stream to the second terminal in synchronization with the audio stream.
  8. The method according to claim 1, wherein:
    the method further comprises: acquiring, when the first terminal and the second terminal are conducting the video call and both the first terminal and the second terminal support the use of a target data channel, a second request sent by the first terminal, wherein the second request is used to request creation of the target data channel; and creating the target data channel in response to the second request, wherein the target data channel comprises a first target data channel between the first terminal and a media server and a second target data channel between the second terminal and the media server;
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the first terminal on the first target data channel;
    the creating the gesture recognition service in response to the first request comprises: sending, in response to the first request, a target instruction to the media server through a service control node, wherein the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process a video stream, an audio stream, and a data stream in the video call, and the data stream represents the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service and the gesture recognition service;
    the acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal comprises: acquiring, during the video call, a first group of video frames and a corresponding first group of audio frames collected by the first terminal, and a first group of gestures identified in the first group of video frames;
    after the target semantics is obtained, the method further comprises: synchronizing, through the mixed media service, a first video stream formed by the first group of video frames, a first audio stream formed by the first group of audio frames, and a first data stream used to represent the target semantics, to obtain the synchronized first video stream, first audio stream, and first data stream; and
    the sending the target semantics to the second terminal comprises: sending the synchronized first video stream, first audio stream, and first data stream to the second terminal, wherein the synchronized first data stream is sent on the second target data channel.
  9. The method according to claim 1, wherein:
    the method further comprises: acquiring, when the first terminal and the second terminal are conducting the video call and the first terminal supports the use of a target data channel while the second terminal does not, a second request sent by the first terminal, wherein the second request is used to request creation of the target data channel; and creating the target data channel in response to the second request, wherein the target data channel is a data channel between the first terminal and a media server;
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the first terminal on the target data channel;
    the creating the gesture recognition service in response to the first request comprises: sending, in response to the first request, a target instruction to the media server through a service control node, wherein the target instruction is used to request creation of a mixed media service, a composition service, and the gesture recognition service, the mixed media service is used to process a video stream, an audio stream, and a data stream in the video call, and the data stream represents the target semantics; and creating the mixed media service, the composition service, and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service, the composition service, and the gesture recognition service;
    the acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal comprises: acquiring, during the video call, a second group of video frames and a corresponding second group of audio frames collected by the first terminal, and a second group of gestures identified in the second group of video frames;
    after the target semantics is obtained, the method further comprises: combining, through the composition service, a first text stream used to represent the target semantics with the video stream formed by the second group of video frames to obtain a second video stream; and synchronizing, through the mixed media service, a second audio stream included in the data stream used to represent the target semantics with the second video stream, to obtain the synchronized second video stream and second audio stream, wherein the data stream comprises the first text stream; and
    the sending the target semantics to the second terminal comprises: sending the synchronized second video stream and second audio stream to the second terminal.
  10. The method according to claim 1, wherein:
    the method further comprises: acquiring, when the first terminal and the second terminal are conducting the video call and the first terminal does not support the use of a target data channel while the second terminal does, a second request sent by the second terminal, wherein the second request is used to request creation of the target data channel; and creating the target data channel in response to the second request, wherein the target data channel is a data channel between the second terminal and a media server;
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the second terminal on the target data channel;
    the creating the gesture recognition service in response to the first request comprises: sending, in response to the first request, a target instruction to the media server through a service control node, wherein the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process a video stream, an audio stream, and a data stream in the video call, and the data stream represents the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service and the gesture recognition service;
    the acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal comprises: acquiring, during the video call, a third group of video frames and a corresponding third group of audio frames collected by the first terminal, and a third group of gestures identified in the third group of video frames;
    after the target semantics is obtained, the method further comprises: synchronizing, through the mixed media service, a third video stream formed by the third group of video frames, a third audio stream formed by the third group of audio frames, and a third data stream used to represent the target semantics, to obtain the synchronized third video stream, third audio stream, and third data stream; and
    the sending the target semantics to the second terminal comprises: sending the synchronized third video stream, third audio stream, and third data stream to the second terminal, wherein the synchronized third data stream is sent on the target data channel.
  11. The method according to claim 1, wherein:
    the method further comprises: acquiring, when the first terminal and the second terminal are conducting the audio call and both the first terminal and the second terminal support the use of a target data channel, a second request sent by the first terminal, wherein the second request is used to request creation of the target data channel; and creating the target data channel in response to the second request, wherein the target data channel comprises a first target data channel between the first terminal and a media server and a second target data channel between the second terminal and the media server;
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the first terminal on the first target data channel;
    the creating the gesture recognition service in response to the first request comprises: sending, in response to the first request, a target instruction to the media server through a service control node, wherein the target instruction is used to request creation of a mixed media service and the gesture recognition service, the mixed media service is used to process an audio stream and a data stream in the audio call, and the data stream represents the target semantics; and creating the mixed media service and the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the mixed media service and the gesture recognition service;
    the acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal comprises: acquiring, during the audio call, a fourth group of video frames and a corresponding fourth group of audio frames collected by the first terminal, and a fourth group of gestures identified in the fourth group of video frames;
    after the target semantics is obtained, the method further comprises: synchronizing, through the mixed media service, a second text stream used to represent the target semantics and a fourth audio stream formed by the fourth group of audio frames, to obtain the synchronized second text stream and fourth audio stream, wherein the data stream comprises the second text stream; and
    the sending the target semantics to the second terminal comprises: sending the synchronized second text stream and fourth audio stream to the second terminal, wherein the synchronized second text stream is sent on the second target data channel.
  12. The method according to claim 1, wherein:
    the method further comprises: acquiring, when the first terminal and the second terminal are conducting the audio call and the first terminal supports the use of a target data channel while the second terminal does not, a second request sent by the first terminal, wherein the second request is used to request creation of the target data channel; and creating the target data channel in response to the second request, wherein the target data channel is a data channel between the first terminal and a media server;
    the acquiring the first request sent by the first terminal or the second terminal comprises: acquiring the first request transmitted by the first terminal on the target data channel;
    the creating the gesture recognition service in response to the first request comprises: sending, in response to the first request, a target instruction to the media server through a service control node, wherein the target instruction is used to request creation of the gesture recognition service; and creating the gesture recognition service through the media server, or instructing, through the media server, a third-party service component to create the gesture recognition service;
    the acquiring, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal comprises: acquiring, during the audio call, a fifth group of video frames and a corresponding fifth group of audio frames collected by the first terminal, and a fifth group of gestures identified in the fifth group of video frames; and
    the sending the target semantics to the second terminal comprises: sending a fifth audio stream used to represent the target semantics to the second terminal.
  13. A gesture-based communication apparatus, comprising:
    a first acquisition module, configured to acquire, when a first terminal and a second terminal are conducting a video call or an audio call, a first request sent by the first terminal or the second terminal, wherein the first request is used to request creation of a gesture recognition service, and the gesture recognition service is used to perform semantic recognition on gestures identified in video frames collected by the first terminal;
    a first creation module, configured to create the gesture recognition service in response to the first request;
    a second acquisition module, configured to acquire, during the video call or audio call, a group of gestures identified in a group of video frames collected by the first terminal;
    a recognition module, configured to perform, through the gesture recognition service, semantic recognition on the group of gestures identified in the group of video frames collected by the first terminal, to obtain target semantics represented by the group of gestures; and
    a first sending module, configured to send the target semantics to the second terminal.
  14. A computer-readable storage medium, comprising a stored program, wherein, when the program is executed by a processor, the method according to any one of claims 1 to 12 is implemented.
  15. An electronic apparatus, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute the method according to any one of claims 1 to 12 through the computer program.
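The apparatus of claim 13 enumerates five modules forming a pipeline: obtain the first request, create the service, obtain the recognized gestures, map them to target semantics, and send the semantics to the second terminal. The sketch below illustrates that pipeline under stated assumptions; the class name, the `GESTURE_SEMANTICS` lookup table, and every method name are invented for illustration, since the claim specifies only the modules' functions.

```python
# Stand-in mapping from a recognized gesture group to target semantics;
# a real service would perform model-based semantic recognition.
GESTURE_SEMANTICS = {("wave", "point"): "hello"}


class GestureCommunicationApparatus:
    def __init__(self):
        self.service_created = False
        self.sent = []

    # First obtaining module: receive the first request sent by the
    # first terminal or the second terminal during the call.
    def obtain_first_request(self, request: str) -> bool:
        return request == "create_gesture_recognition_service"

    # First creation module: create the gesture recognition service
    # in response to the first request.
    def create_service(self, request: str) -> None:
        if self.obtain_first_request(request):
            self.service_created = True

    # Second obtaining module + recognition module: take the group of
    # gestures recognized in the collected video frames and perform
    # semantic recognition on it.
    def recognize(self, gestures) -> str:
        assert self.service_created, "gesture recognition service not created"
        return GESTURE_SEMANTICS.get(tuple(gestures), "<unknown>")

    # First sending module: send the target semantics to the second terminal.
    def send_to_second_terminal(self, semantics: str) -> None:
        self.sent.append(semantics)


app = GestureCommunicationApparatus()
app.create_service("create_gesture_recognition_service")
semantics = app.recognize(["wave", "point"])
app.send_to_second_terminal(semantics)
print(semantics)  # hello
print(app.sent)   # ['hello']
```

Unrecognized gesture groups fall through to a placeholder result here, standing in for whatever fallback a deployed service would apply.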
PCT/CN2022/123487 2021-10-20 2022-09-30 Gesture-based communication method and apparatus, storage medium, and electronic apparatus WO2023066023A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111218290.3A CN113660449B (en) 2021-10-20 2021-10-20 Gesture communication method and device, storage medium and electronic device
CN202111218290.3 2021-10-20

Publications (1)

Publication Number Publication Date
WO2023066023A1 true WO2023066023A1 (en) 2023-04-27

Family

ID=78484250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123487 WO2023066023A1 (en) 2021-10-20 2022-09-30 Gesture-based communication method and apparatus, storage medium, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN113660449B (en)
WO (1) WO2023066023A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113660449B (en) * 2021-10-20 2022-03-01 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device
CN116719419B (en) * 2023-08-09 2023-11-03 世优(北京)科技有限公司 Intelligent interaction method and system for meta universe

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984496A (en) * 2012-12-21 2013-03-20 华为技术有限公司 Processing method, device and system of video and audio information in video conference
CN105100482A (en) * 2015-07-30 2015-11-25 努比亚技术有限公司 Mobile terminal and system for realizing sign language identification, and conversation realization method of the mobile terminal
CN106254960A (en) * 2016-08-30 2016-12-21 福州瑞芯微电子股份有限公司 A kind of video call method for communication disorders and system
US10176366B1 (en) * 2017-11-01 2019-01-08 Sorenson Ip Holdings Llc Video relay service, communication system, and related methods for performing artificial intelligence sign language translation services in a video relay service environment
US20200117888A1 (en) * 2018-10-11 2020-04-16 Chris Talbot Interactive sign language response system and method
CN113660449A (en) * 2021-10-20 2021-11-16 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070065A (en) * 2019-04-30 2019-07-30 李冠津 The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
KR102212298B1 (en) * 2020-11-09 2021-02-05 주식회사 라젠 Platform system for providing video communication between non disabled and hearing impaired based on artificial intelligence


Also Published As

Publication number Publication date
CN113660449B (en) 2022-03-01
CN113660449A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
WO2023066023A1 (en) Gesture-based communication method and apparatus, storage medium, and electronic apparatus
CN106331581B (en) Method and device for communication between mobile terminal and video network terminal
US9024997B2 (en) Virtual presence via mobile
US7996540B2 (en) Method and system for replacing media stream in a communication process of a terminal
CN101924772B (en) Communication system and method supporting cross-network and cross-terminal realization of multimedia session merging
CN101677388A (en) Visual communication system, terminal gateway, video gateway and visual communication method
US7142643B2 (en) Method and system for unifying phonebook for varied hearing disabilities
RU2504090C2 (en) Method, apparatus and system for making video call
JP2006217592A (en) Video call method for providing image via third display device
CN110475094B (en) Video conference processing method and device and readable storage medium
WO2012075937A1 (en) Video call method and videophone
CN108881149B (en) Access method and system of video telephone equipment
CN108574689B (en) Method and device for video call
KR20120018708A (en) Method and system for providing multimedia content during communication service
CN112543301A (en) Intelligent conference system based on IMS and implementation method thereof
CN113923470A (en) Live stream processing method and device
CN112714131A (en) Cross-platform microphone connecting method and device, storage medium and electronic equipment
WO2014012384A1 (en) Communication data transmitting method, system and receiving device
WO2023005524A1 (en) Order payment method and apparatus, and storage medium, device and system
CN106230915A (en) A kind of method and system realizing function machine intelligent communication
CN101568007B (en) Video information processing method and system based on 3G video calling center
CN102045535B (en) Device, system and method for user to select customer service representative by video
US8588394B2 (en) Content switch for enhancing directory assistance
JP2008042386A (en) Communication terminal device
CN216930179U (en) Video transmission device for Mini video processing card

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882634

Country of ref document: EP

Kind code of ref document: A1