CN116110387A - Multi-modal information transmission method, device, equipment and storage medium - Google Patents

Multi-modal information transmission method, device, equipment and storage medium

Info

Publication number
CN116110387A
Authority
CN
China
Prior art keywords: information, mode, modal, target audio, description information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211634179.7A
Other languages
Chinese (zh)
Inventor
车云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Beijing Technologies Co Ltd
Original Assignee
Cloudminds Beijing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Beijing Technologies Co Ltd filed Critical Cloudminds Beijing Technologies Co Ltd
Priority to CN202211634179.7A
Publication of CN116110387A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

An embodiment of the present application provides a multi-modal information transmission method, apparatus, device, and storage medium. In the method, target audio to be transmitted and multi-modal description information are obtained, and the multi-modal description information is encoded according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; the multi-modal code is fused with the target audio to obtain fusion information; and the fusion information is sent to a receiving device. Because the multi-modal code is fused with the target audio, the two can be transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the transmitted target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.

Description

Multi-modal information transmission method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a multi-modal information transmission method, apparatus, device, and storage medium.
Background
With the development of robot-related technologies, robots can transmit multi-modal information and interact with users in various forms based on it. For example, a robot may collect multi-modal information related to a user, such as visual information, tactile information, and audio information, and may control various actuators based on that information to display expressions and actions to the user, output corresponding speech, and so on. These processes require multi-modal information to be transmitted between different components. In the prior art, different types of information wait for each other during transmission because of their differing time delays, so the robot's multiple forms of interaction fall out of sync when it interacts with a user.
Disclosure of Invention
Aspects of the present application provide a multi-modal information transmission method, apparatus, device, and storage medium, so as to solve the technical problem that different types of information wait for each other during transmission due to differing time delays.
An embodiment of the present application provides a multi-modal information transmission method applicable to a sending device, the method including: acquiring target audio to be transmitted and multi-modal description information; encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; fusing the multi-modal code with the target audio to obtain fusion information; and sending the fusion information to a receiving device.
Further optionally, encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain the multi-modal code, includes: determining at least one multi-modal information fragment from the multi-modal description information, where any multi-modal information fragment includes at least one of action, expression, and visual description information; and serialization-encoding the multi-modal description information according to the at least one multi-modal information fragment and the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, to obtain the multi-modal code.
Further optionally, the method further includes: when serialization-encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, adding a modal identifier to the encoding header of each of the at least one multi-modal information fragment, where the modal identifier of any multi-modal information fragment marks the position of that fragment's encoding result within the fusion information.
Further optionally, the audio start-stop time on the target audio corresponding to the multi-modal description information is determined according to the acquisition times of the target audio and of the multi-modal description information; or it is determined according to the output time of the target audio and the output time of the multi-modal description information.
Further optionally, fusing the multi-modal code with the target audio to obtain fusion information includes: using a first transmission channel and a second transmission channel of a UAC (USB Audio Class) device as the transmission channels of the multi-modal code and the target audio, respectively; and channel-mixing the first transmission channel and the second transmission channel to obtain a mixed channel for transmitting the fusion information. Sending the fusion information to the receiving device includes: writing the fusion information in the mixed channel to a protocol interface corresponding to a preset transmission protocol, so that the receiving device reads the fusion information through the protocol interface.
An embodiment of the present application further provides a multi-modal information transmission method applicable to a receiving device, the method including: receiving fusion information sent by a sending device, where the fusion information is obtained by fusing a multi-modal code with target audio, and the multi-modal code is obtained by encoding multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information; and decoding the fusion information to obtain the multi-modal code and the target audio.
Further optionally, after the multi-modal code and the target audio are obtained, the method further includes: generating a multi-modal interaction instruction according to the multi-modal code; and using a multi-modal interaction component to output the target audio and the multi-modal interaction instruction according to the audio start-stop time on the target audio corresponding to the multi-modal interaction instruction.
Further optionally, decoding the fusion information to obtain the multi-modal code includes: identifying at least one modal identifier in the fusion information, where any modal identifier is located at the encoding header of its corresponding multi-modal information fragment; determining at least one encoding-header position in the fusion information according to the at least one modal identifier; and decoding at least one multi-modal information fragment from the fusion information according to the at least one encoding-header position, where any multi-modal information fragment includes at least one of action, expression, and visual description information.
An embodiment of the present application further provides a robot, including a sending component and a receiving component. The sending component is configured to: acquire target audio to be transmitted and multi-modal description information; encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; fuse the multi-modal code with the target audio to obtain fusion information; and send the fusion information to the receiving component. The receiving component is configured to receive the fusion information sent by the sending component and decode it to obtain the multi-modal code and the target audio.
An embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions so as to perform the steps of the multi-modal information transmission method.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the steps of the multi-modal information transmission method.
In the embodiments of the present application, target audio to be transmitted and multi-modal description information can be obtained, and the multi-modal description information is encoded according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; the multi-modal code is fused with the target audio to obtain fusion information; and the fusion information is sent to the receiving device. Because the multi-modal code is fused with the target audio, the two can be transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the transmitted target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
Fig. 1 is a flowchart of a multi-modal information transmission method applicable to a sending device according to an exemplary embodiment of the present application;
Fig. 2 is a schematic diagram of a serialization encoding process according to an exemplary embodiment of the present application;
Fig. 3 is a flowchart of a multi-modal information transmission method applicable to a receiving device according to an exemplary embodiment of the present application;
Fig. 4 is a schematic structural diagram of a robot according to an exemplary embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without creative effort fall within the scope of the present disclosure.
In the prior art, different types of information wait for each other during transmission because of their differing time delays, so a robot's multiple forms of interaction fall out of sync when it interacts with a user. In response to this technical problem, some embodiments of the present application provide a solution, which is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a multi-modal information transmission method according to an exemplary embodiment of the present application. The method is applicable to a sending device and, as shown in fig. 1, includes:
Step 11: acquire target audio to be transmitted and multi-modal description information.
Step 12: encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code.
Step 13: fuse the multi-modal code with the target audio to obtain fusion information.
Step 14: send the fusion information to the receiving device.
In this embodiment, the sending device may be implemented as a cloud server or a local server, as a robot or a processor on a robot, or as a mobile electronic device such as a smart phone, a tablet computer, or another terminal device, which is not limited in this embodiment.
In scenario A, the target audio may be voice interaction information issued by a user, such as a passage spoken by the user. The multi-modal description information is a multi-modality data description of the user's interaction information; for example, it includes at least one of visual description information, tactile description information, action description information, and expression description information. For instance, visual description information may describe the user's facial features, action description information may describe the user's handshake action, and expression description information may describe the user's smiling expression. The sending device may collect the target audio and the multi-modal description information through various sensors, such as an audio sensor and an image sensor.
In scenario B, the target audio may instead be voice interaction information for interacting with the user, and the multi-modal description information is a multi-modality data description of the interaction information used to interact with the user; for example, it includes at least one of tactile description information, action description information, and expression description information. For instance, the action description information may be a hand-waving action directed at the user, and the expression description information may be a smiling expression directed at the user. The sending device may compute the target audio and the multi-modal description information with its own AI (Artificial Intelligence) capability, based on multi-modal information collected by sensors such as an audio sensor and an image sensor, or may obtain them from a server or a robot, which is not limited in this embodiment.
After obtaining the target audio and the multi-modal description information, the sending device may encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code. In scenario A, this audio start-stop time may be preset, set randomly, or determined according to the acquisition times of the target audio and of the multi-modal description information: when sensors such as an audio sensor and an image sensor collect the target audio and the multi-modal description information, each carries its own acquisition time, from which the sending device can derive the audio start-stop time on the target audio corresponding to the multi-modal description information. In scenario B, the audio start-stop time may likewise be preset, set randomly, or determined according to the output time of the target audio and the output time of the multi-modal description information: the sending device may use its AI capability to compute and output the target audio and the multi-modal description information from the multi-modal sensor data, each with its own output time, and then determine the audio start-stop time on the target audio corresponding to the multi-modal description information from those output times.
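As an illustrative sketch (not part of the disclosure), the mapping from acquisition timestamps to an audio start-stop time in scenario A might look as follows in Python; all names and the shared-clock assumption are hypothetical:

```python
# Hypothetical sketch: derive the audio start-stop time of a piece of
# multi-modal description information from acquisition timestamps,
# assuming the audio and the other sensors share one time base (ms).

def audio_start_stop(audio_capture_start_ms: int,
                     info_capture_start_ms: int,
                     info_capture_end_ms: int) -> tuple[int, int]:
    # Shift the description's capture interval onto the audio timeline,
    # clamping at 0 in case the description began before the audio.
    start = max(info_capture_start_ms - audio_capture_start_ms, 0)
    stop = max(info_capture_end_ms - audio_capture_start_ms, 0)
    return start, stop

# Audio capture began at t=10000 ms; a handshake was observed between
# t=10000 ms and t=13000 ms, so it maps to 0-3000 ms on the target audio.
print(audio_start_stop(10_000, 10_000, 13_000))  # (0, 3000)
```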
The sending device may encode the multi-modal description information according to its corresponding audio start-stop time on the target audio by means of JSON (JavaScript Object Notation) serialization, to obtain the multi-modal code. Serialization is the process of converting a data structure or object into a byte sequence, i.e., converting a structured object into a form that can be stored or transmitted; because the multi-modal code is serialized, it is more convenient to store and transmit. Any coding segment in the multi-modal code may carry a length, and any coding segment includes description fields and field values, so the time correspondence between the multi-modal description information and the target audio can be represented within the multi-modal code. For example, as shown in fig. 2, a coding segment may contain the description fields and field values "time: 3000", "motion: shakehands", "emotion: smile", and "mouth: nihao", expressing, for instance, that the interval 0-3000 ms of the target audio corresponds to shakehands, smile, and nihao.
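A minimal Python sketch of this serialization step is given below; the field names follow the fig. 2 example ("time", "motion", "emotion", "mouth"), but the exact schema is an assumption, not mandated by the disclosure:

```python
import json

# Hypothetical coding segment mirroring fig. 2; field names are assumed.
segment = {
    "time": 3000,            # end of the audio interval (ms) this segment covers
    "motion": "shakehands",  # action description
    "emotion": "smile",      # expression description
    "mouth": "nihao",        # mouth-shape / speech description
}

# Serialize to a compact byte string suitable for fusion with the audio.
encoded = json.dumps(segment, separators=(",", ":")).encode("utf-8")
print(encoded)
# b'{"time":3000,"motion":"shakehands","emotion":"smile","mouth":"nihao"}'
```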
After the multi-modal code is obtained, the sending device may fuse it with the target audio to obtain fusion information, for example by multi-channel mixing. Multi-channel mixing includes at least the two alternative approaches of cross coding and parallel coding, which is not limited in this embodiment. In other words, the sending device may cross-encode, parallel-encode, or otherwise jointly encode the multi-modal code and the target audio in order to fuse them.
After the fusion information is obtained, the sending device may send it to the receiving device, and the receiving device may decode it to obtain the multi-modal code and the target audio.
In this embodiment, target audio to be transmitted and multi-modal description information can be obtained, and the multi-modal description information is encoded according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; the multi-modal code is fused with the target audio to obtain fusion information; and the fusion information is sent to the receiving device. Because the multi-modal code is fused with the target audio, the two can be transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the transmitted target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.
In some alternative embodiments, the sending device may transmit the obtained target audio and multi-modal description information in a UAC (USB Audio Class) transmission manner. Two transmission channels (i.e., two audio channels) may be used: a first transmission channel carries the multi-modal code, and a second transmission channel carries the target audio.
During transmission, the multi-modal code may be fused with the target audio. When fusing them, the first transmission channel corresponding to the multi-modal code may be mixed with the second transmission channel corresponding to the target audio, for example by PCM (Pulse Code Modulation) channel mixing.
Optionally, during channel mixing, the multi-modal code and the target audio in the different transmission channels may be mixed and encoded by cross coding or parallel coding, to obtain the fusion information. The fusion information produced by multi-channel mixed coding can be regarded as information transmittable over a single channel, and can therefore be carried by any transmission protocol, for example the ALSA (Advanced Linux Sound Architecture) protocol or any other standard protocol. After channel mixing is completed, an ALSA interface or another standard interface may be used to open a UAC multi-channel port, configure the sound card playback parameters, and write the fusion information. That is, the fusion information is ultimately transmitted to the receiving end over one channel (i.e., one protocol interface), and the receiving end obtains the fusion information by reading the ALSA interface or other standard interface.
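The channel-mixing idea can be sketched as simple frame interleaving of two logical channels into one two-channel PCM stream, which a standard audio interface can then write as ordinary audio; the sample format and zero padding below are assumptions made for illustration:

```python
import struct

# Hypothetical sketch: interleave target-audio samples (channel 1) with
# multi-modal code bytes (channel 2) into one two-channel 16-bit PCM stream.

def fuse(audio_samples: list[int], code_bytes: bytes) -> bytes:
    # Carry each code byte as a 16-bit "sample" so both channels share one
    # sample format; pad the shorter channel with zeros (blank data).
    code_samples = list(code_bytes)
    n = max(len(audio_samples), len(code_samples))
    audio_samples = audio_samples + [0] * (n - len(audio_samples))
    code_samples = code_samples + [0] * (n - len(code_samples))
    frames = bytearray()
    for a, c in zip(audio_samples, code_samples):
        frames += struct.pack("<hh", a, c)  # one frame = [audio, code]
    return bytes(frames)

fused = fuse([100, -200, 300], b"\x01\x02")
print(len(fused))  # 3 frames x 2 channels x 2 bytes = 12 bytes
```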
Based on this implementation, any transmission protocol can be used to transmit the audio signal and the multi-modal signal; no new transmission protocol, and no additional inter-device protocol or driver, needs to be introduced, which greatly improves transmission efficiency and reduces development cost.
In some optional embodiments, "encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain the multi-modal code" in the foregoing embodiments may be implemented by the following steps:
Step 121: determine at least one multi-modal information fragment from the multi-modal description information, where any multi-modal information fragment includes at least one of action, expression, and visual description information.
Step 122: serialization-encode the multi-modal description information according to the at least one multi-modal information fragment and the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, to obtain the multi-modal code.
The sending device may divide the descriptions of the multiple modalities in the multi-modal description information to obtain at least one multi-modal information fragment, so that each fragment includes at least one of action, expression, and visual description information, as shown in fig. 2. The division may follow a preset division rule, may be random, or may be computed by the device's AI (Artificial Intelligence) capability, which is not limited in this embodiment. The length of each resulting multi-modal information fragment may match the duration of the corresponding audio segment of the target audio; if the fragment corresponding to an audio start-stop interval has no description information, or lacks part of it, the fragment may be padded with magic words or all-zero blank data.
Based on the above steps, after obtaining the at least one multi-modal information fragment, the sending device may serialization-encode the multi-modal description information according to the at least one multi-modal information fragment and its corresponding audio start-stop time on the target audio, to obtain the multi-modal code.
The sending device may JSON-serialize the description information in any multi-modal information fragment together with its corresponding audio start-stop time on the target audio, to obtain a coding segment. For example, if the multi-modal description information is abcdefg, encoding it may yield T0abcT1defgT2, where abc and defg are two multi-modal information fragments, the audio start-stop time of fragment abc is T0-T1, and that of fragment defg is T1-T2.
For another example, as shown in fig. 2, the first coding segment in the multi-modal code may represent the time correspondence between at least one piece of description information and 0-3000 ms.
In this way, the sending device can combine the coding segments produced by serialization to obtain the multi-modal code.
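A sketch of the time-delimited layout used in the T0abcT1defgT2 example might look as follows; the boundary times and fragment payloads are invented for illustration:

```python
# Hypothetical sketch of time-delimited fragment serialization: fragment i
# covers boundaries_ms[i]..boundaries_ms[i+1] on the target audio.

def serialize(fragments: list[bytes], boundaries_ms: list[int]) -> bytes:
    assert len(boundaries_ms) == len(fragments) + 1
    parts = []
    for t, frag in zip(boundaries_ms, fragments):
        parts.append(f"T{t}".encode() + frag)       # boundary time, then payload
    parts.append(f"T{boundaries_ms[-1]}".encode())  # closing boundary
    return b"".join(parts)

# Two fragments covering 0-1000 ms and 1000-2000 ms of the target audio.
print(serialize([b"abc", b"defg"], [0, 1000, 2000]))  # b'T0abcT1000defgT2000'
```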
In some alternative embodiments, the sending device may cross-encode the multi-modal code with the target audio to fuse the two and obtain the fusion information.
The cross coding scheme and its advantageous effects are described in detail below.
With cross coding, the multi-modal code and the target audio are spread across multiple codewords, which improves error-correction capability, effectively mitigates the impact of information loss, and thus improves the integrity of signal transmission. For example, suppose the error-free message formed by splicing the multi-modal code and the target audio in normal order is aaaabbbbccccddddeeeeffffgggg, with each codeword repeated four times. Without cross coding, a burst of information loss caused by a transmission error may leave the target device with aaaabbbbccc____deeeeffffgggg: codeword c has 1 of its 4 symbols changed and can be corrected, but codeword d has 3 symbols changed and cannot be decoded correctly. In this embodiment, the multi-modal code and the target audio are instead arranged in an interleaved order according to a preset rule, so the error-free message is encoded as abcdefgabcdefgabcdefgabcdefg. If the same transmission error occurs, the receiving device receives abcdefgabcd____bcdefgabcdefg, which de-interleaves to aa_abbbbccccdddde_eef_ffg_gg: each group of codewords has at most one symbol changed, so a one-bit error-correcting code can decode all of them correctly. On this basis, when the communication environment is poor, cross coding the multi-modal code with the target audio allows lost information to be recovered by inference on the one hand, and on the other hand greatly reduces the probability that the multi-modal code and the target audio are lost simultaneously; when one is lost, the other can serve as supplementary information, further improving the integrity of transmission.
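The interleaving argument above can be reproduced with a few lines of Python; the burst position is chosen arbitrarily for illustration:

```python
# Sketch of cross coding (interleaving): symbols of a 4x repetition code are
# sent column-wise, so a burst error touches at most one copy per codeword.

def interleave(msg: str, reps: int) -> str:
    words = [msg[i:i + reps] for i in range(0, len(msg), reps)]
    return "".join("".join(col) for col in zip(*words))

def deinterleave(msg: str, reps: int) -> str:
    n_words = len(msg) // reps
    rows = [msg[i:i + n_words] for i in range(0, len(msg), n_words)]
    return "".join("".join(col) for col in zip(*rows))

sequential = "aaaabbbbccccddddeeeeffffgggg"
sent = interleave(sequential, 4)       # 'abcdefgabcdefgabcdefgabcdefg'
hit = sent[:11] + "____" + sent[15:]   # a 4-symbol burst error in transit
print(deinterleave(hit, 4))            # 'aa_abbbbccccdddde_eef_ffg_gg'
# Each codeword loses at most one of its four symbols, so majority
# decoding recovers the whole message.
```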
In some optional embodiments, when the multi-modal description information is serialization-encoded according to the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, a modal identifier may be added to the encoding header of each multi-modal information fragment. The modal identifier of any fragment marks the position of that fragment's encoding result within the fusion information. A modal identifier may be a magic word or another marker, such as a specific number or character.
For example, if the multi-modal description information is gfedcba, comprising the four multi-modal information fragments gf, e, dcb, and a, then when it is serialization-encoded, a modal identifier may be added to the encoding header of each of the four fragments, yielding (m)gf(m)e(m)dcb(m)a, where m is the modal identifier. In this way, the added modal identifiers mark the positions of the fragments' encoding results within the fusion information.
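One way such framing could be realized is sketched below; the magic-word value and the length prefix are assumptions added for illustration, not details fixed by the disclosure:

```python
# Hypothetical framing: prepend a magic word to each fragment's encoding so
# the decoder can locate fragment boundaries inside the fused stream.
MAGIC = b"\xca\xfe"  # assumed modal identifier (magic word)

def frame(fragments: list[bytes]) -> bytes:
    out = bytearray()
    for frag in fragments:
        # A length prefix lets the decoder skip payload bytes, so stray
        # MAGIC-like bytes inside a payload are never misread as headers.
        out += MAGIC + len(frag).to_bytes(2, "big") + frag
    return bytes(out)

print(frame([b"gf", b"e", b"dcb", b"a"]).hex())
# cafe0002 6766 cafe0001 65 cafe0003 646362 cafe0001 61 (spaces added here)
```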
An embodiment of the present application further provides a multi-modal information transmission method applicable to a receiving device. As shown in fig. 3, the method includes:
Step 31: receive fusion information sent by a sending device, where the fusion information is obtained by fusing a multi-modal code with target audio, and the multi-modal code is obtained by encoding multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information.
Step 32: decode the fusion information to obtain the multi-modal code and the target audio.
In this embodiment, the receiving device may be a robot body, or an execution component/actuator on a robot.
When decoding the fusion information, the receiving device may use a decoding scheme corresponding to the encoding scheme of the sending device, thereby obtaining the multi-modal code and the target audio.
In this embodiment, the receiving device can receive the fusion information sent by the sending device and decode it to obtain the multi-modal code and the target audio, where the fusion information is obtained by fusing the multi-modal code with the target audio, and the multi-modal code is obtained by encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information. In this way, the multi-modal code and the target audio are transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the decoded target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.
Optionally, in scenario A, the target audio is voice interaction information issued by the user, such as a passage spoken by the user. The multi-modal description information is a multi-modality data description of the user's interaction information; for example, it includes at least one of visual description information, tactile description information, action description information, and expression description information. For instance, visual description information may describe the user's facial features, action description information may describe the user's handshake action, and expression description information may describe the user's smiling expression. The sending device may collect the target audio and the multi-modal description information through various sensors, such as an audio sensor and an image sensor. In this scenario, after obtaining the multi-modal code and the target audio, the receiving device may generate a multi-modal interaction instruction according to the multi-modal code, and use a multi-modal interaction component to output the target audio and the multi-modal interaction instruction according to the audio start-stop time on the target audio corresponding to the instruction. The receiving device may identify and analyze the user interaction information represented by the multi-modal code to determine the multi-modal interaction instruction that best matches it.
Because the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the generated multi-modal interaction instruction has a time correspondence with the target audio, and the receiving device can use the multi-modal interaction component to output the target audio and the multi-modal interaction instruction according to that audio start-stop time. From the user's perspective, each multi-modal interaction instruction is issued in step with the target audio being played: for example, the robot can wave its hand in sync while saying "hello", or say "let's go" in sync while lifting its legs to walk.
In some alternative embodiments, when the receiving device decodes the fusion information to obtain the multi-modal code, it may identify at least one modal identifier in the fusion information, each located at the encoding header of its corresponding multi-modal information fragment. The receiving device may determine at least one encoding-header position in the fusion information from the at least one modal identifier, and decode at least one multi-modal information fragment from the fusion information according to those positions; any multi-modal information fragment includes at least one of action, expression, and visual description information. For example, assuming the fusion information is encoded as (m1)zxc(m2)v(m3)bn, the receiving device can identify the modal identifiers m1, m2, and m3, determine the encoding-header positions from them, and decode the fragments zxc, v, and bn accordingly.
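The decoding side of the same hypothetical framing might look like this; MAGIC and the 2-byte length prefix match the assumed layout sketched earlier:

```python
# Hypothetical decoder: scan the fused payload for each modal identifier
# (magic word) and slice out the fragment that follows its encoding header.
MAGIC = b"\xca\xfe"

def deframe(payload: bytes) -> list[bytes]:
    fragments, pos = [], 0
    while True:
        pos = payload.find(MAGIC, pos)  # next encoding-header position
        if pos < 0:
            break
        length = int.from_bytes(payload[pos + 2:pos + 4], "big")
        start = pos + 4
        fragments.append(payload[start:start + length])
        pos = start + length            # skip past the payload just read
    return fragments

# Payload framing the fragments zxc, v, bn from the example above.
payload = (MAGIC + b"\x00\x03" + b"zxc"
           + MAGIC + b"\x00\x01" + b"v"
           + MAGIC + b"\x00\x02" + b"bn")
print(deframe(payload))  # [b'zxc', b'v', b'bn']
```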
The multi-modal information transmission method provided in the foregoing embodiments is further described below in the context of an actual scenario.
In an actual scenario, the sending device may acquire the target audio and the multi-modal description information to be transmitted through several sub-channels (i.e., several audio channels) of a USB Audio Class transmission channel (UAC channel); the multi-modal description information is obtained by serializing the original multi-modal information collected by the sensors. The sending device may then encode the multi-modal description information according to its corresponding audio start-stop time on the target audio to obtain the multi-modal code, and perform PCM (Pulse Code Modulation) multi-channel mixing on the multi-modal code and the target audio so as to fuse them into fusion information. PCM multi-channel mixing, i.e., PCM mixing, can be understood as cross coding the target audio and the multi-modal code, which belong to different channels, so as to merge them into one channel and obtain the fusion information.
Thereafter, the sending device may use the ALSA (Advanced Linux Sound Architecture) interface or another standard interface to open the recording sound card and configure the sound card playback parameters, and then send the fusion information to the receiving device through the recording sound card over a single channel of the UAC channel.
The receiving device may open the UAC capture sound card and automatically configure recording parameters consistent with those of the recording sound card in order to receive the fusion information, use the ALSA protocol to capture the fusion information sent by the sending device, and decode it into the target audio and the multi-modal code. The receiving device may then perform the corresponding recognition and analysis based on the multi-modal code to obtain multi-modal interaction instructions.
Based on the above steps, the receiving device can use the multi-modal interaction component to output the target audio, and output the multi-modal interaction instructions according to the audio start-stop times of the multi-modal description information.
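As a final illustrative sketch, the synchronized-output step could be approximated as a small scheduler that dispatches each instruction when audio playback reaches its start time; the instruction values and the print stand-in for real actuator calls are assumptions:

```python
import time

# Hypothetical scheduler: while the target audio plays, dispatch each
# multi-modal interaction instruction at its start time on the audio timeline.
instructions = [
    (0, "emotion:smile"),
    (0, "motion:shakehands"),
    (3000, "motion:wave"),
]

def run(instructions: list[tuple[int, str]], audio_duration_ms: int) -> None:
    t0 = time.monotonic()
    pending = sorted(instructions)  # ordered by start time (ms)
    while pending or (time.monotonic() - t0) * 1000 < audio_duration_ms:
        now_ms = (time.monotonic() - t0) * 1000
        while pending and pending[0][0] <= now_ms:
            _, cmd = pending.pop(0)
            print(f"{now_ms:7.1f} ms -> {cmd}")  # stand-in for an actuator call
        time.sleep(0.01)

run(instructions, audio_duration_ms=5000)
```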
It should be noted that the execution subject of each step of the methods provided in the above embodiments may be the same device, or the methods may be performed by different devices. For example, the execution subject of steps 11 to 14 may be device A; alternatively, the execution subject of steps 11 and 12 may be device A, and that of steps 13 and 14 may be device B; and so on.
In addition, some of the flows described in the above embodiments and drawings include multiple operations appearing in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or in parallel. Sequence numbers such as 11 and 12 merely distinguish the operations and do not by themselves represent any order of execution. The flows may also include more or fewer operations, which may be performed sequentially or in parallel.
It should also be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, and the like; they do not imply an order, nor do they require that the "first" and "second" be of different types.
Fig. 4 is a schematic structural diagram of a robot according to an exemplary embodiment of the present application. As shown in fig. 4, the robot includes a sending component 401 and a receiving component 402.
The sending component 401 is configured to: acquire target audio to be transmitted and multi-modal description information; encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; fuse the multi-modal code with the target audio to obtain fusion information; and send the fusion information to the receiving component.
The receiving component 402 is configured to receive the fusion information sent by the sending component and decode it to obtain the multi-modal code and the target audio.
Further, as shown in fig. 4, the robot also includes an acquisition component 403, a multi-modal interaction component 404, a communication component 405, and other components. Fig. 4 schematically shows only some of the components, which does not mean that the robot includes only the components shown in fig. 4. The acquisition component 403 may include various sensors, such as an audio sensor and an image sensor. The communication component 405 is configured to facilitate wired or wireless communication between the device in which it resides and other devices. That device may access a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system over a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
Further optionally, in scenario A, the target audio is voice interaction information issued by the user, such as a passage spoken by the user. The multi-modal description information is a multi-modality data description of the user's interaction information; for example, it includes at least one of visual description information, tactile description information, action description information, and expression description information. For instance, visual description information may describe the user's facial features, action description information may describe the user's handshake action, and expression description information may describe the user's smiling expression. The receiving component 402 may collect the target audio and the multi-modal description information through the acquisition component 403. In this scenario, after obtaining the multi-modal code and the target audio, the receiving component 402 may further generate a multi-modal interaction instruction according to the multi-modal code, and use the multi-modal interaction component 404 to output the target audio and the multi-modal interaction instruction according to the audio start-stop time on the target audio corresponding to the instruction. The receiving component 402 may identify and analyze the user interaction information represented by the multi-modal code to determine the multi-modal interaction instruction that best matches it.
Because the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the generated multi-modal interaction instruction has a time correspondence with the target audio, and the receiving component 402 can use the multi-modal interaction component to output the target audio and the multi-modal interaction instruction according to that audio start-stop time. From the user's perspective, each multi-modal interaction instruction is issued in step with the target audio being played: for example, the robot can wave its hand in sync while saying "hello", or say "let's go" in sync while lifting its legs to walk.
In this embodiment, the sending component of the robot can obtain the target audio to be transmitted and the multi-modal description information, encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information to obtain a multi-modal code, fuse the multi-modal code with the target audio to obtain fusion information, and send the fusion information to the receiving component, which can decode it to obtain the multi-modal code and the target audio. Because the multi-modal code is fused with the target audio, the two can be transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the transmitted target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.
Fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application. As shown in fig. 5, the device includes a memory 501, a processor 502, and a communication component 503.
The memory 501 is configured to store a computer program, and may be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, video, and so on.
In some alternative embodiments, the processor 502, coupled to the memory 501, is configured to execute the computer program in the memory 501 so as to: acquire target audio to be transmitted and multi-modal description information; encode the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; fuse the multi-modal code with the target audio to obtain fusion information; and send the fusion information to a receiving device.
Further optionally, when encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information to obtain the multi-modal code, the processor 502 is specifically configured to: determine at least one multi-modal information fragment from the multi-modal description information, where any multi-modal information fragment includes at least one of action, expression, and visual description information; and serialization-encode the multi-modal description information according to the at least one multi-modal information fragment and the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, to obtain the multi-modal code.
Further optionally, the processor 502 is further configured to: when serialization-encoding the multi-modal description information according to the audio start-stop time on the target audio corresponding to the at least one multi-modal information fragment, add a modal identifier to the encoding header of each of the at least one multi-modal information fragment, where the modal identifier of any multi-modal information fragment marks the position of that fragment's encoding result within the fusion information.
Further optionally, the audio start-stop time on the target audio corresponding to the multi-modal description information is determined according to the acquisition times of the target audio and of the multi-modal description information; or it is determined according to the output time of the target audio and the output time of the multi-modal description information.
Further optionally, when fusing the multi-modal code with the target audio to obtain the fusion information, the processor 502 is specifically configured to: use a first transmission channel and a second transmission channel of a UAC (USB Audio Class) device as the transmission channels of the multi-modal code and the target audio, respectively; and channel-mix the first transmission channel and the second transmission channel to obtain a mixed channel for transmitting the fusion information. Sending the fusion information to the receiving device includes: writing the fusion information in the mixed channel to a protocol interface corresponding to a preset transmission protocol, so that the receiving device reads the fusion information through the protocol interface.
Further, as shown in fig. 5, the electronic device also includes an acquisition component 504, a multi-modal interaction component 505, and other components. Fig. 5 schematically shows only some of the components, which does not mean that the electronic device includes only the components shown in fig. 5. The communication component 503 is configured to facilitate wired or wireless communication between the device in which it resides and other devices. That device may access a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system over a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In some alternative embodiments, the processor 502, coupled to the memory 501, is configured to execute the computer program in the memory 501 so as to: receive fusion information sent by a sending device, where the fusion information is obtained by fusing a multi-modal code with target audio, and the multi-modal code is obtained by encoding multi-modal description information according to the audio start-stop time on the target audio corresponding to the multi-modal description information; and decode the fusion information to obtain the multi-modal code and the target audio.
Further optionally, after obtaining the multi-modal code and the target audio, the processor 502 is further configured to: generate a multi-modal interaction instruction according to the multi-modal code; and output the target audio and the multi-modal interaction instruction according to the audio start-stop time on the target audio corresponding to the multi-modal interaction instruction, using the multi-modal interaction component 505.
Further optionally, when decoding the fusion information to obtain the multi-modal code, the processor 502 is specifically configured to: identify at least one modal identifier in the fusion information, where any modal identifier is located at the encoding header of its corresponding multi-modal information fragment; determine at least one encoding-header position in the fusion information according to the at least one modal identifier; and decode at least one multi-modal information fragment from the fusion information according to the at least one encoding-header position, where any multi-modal information fragment includes at least one of action, expression, and visual description information.
In this embodiment, target audio to be transmitted and multi-modal description information can be obtained, and the multi-modal description information is encoded according to the audio start-stop time on the target audio corresponding to the multi-modal description information, to obtain a multi-modal code; the multi-modal code is fused with the target audio to obtain fusion information; and the fusion information is sent to the receiving device. Because the multi-modal code is fused with the target audio, the two can be transmitted synchronously, which solves the technical problem that different types of information wait for each other in transmission due to differing time delays. Moreover, since the multi-modal code is produced according to the audio start-stop time on the target audio corresponding to the multi-modal description information, the transmitted target audio and the multi-modal code retain their informational correspondence, which facilitates further information processing.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (11)

1. A multi-mode information transmission method, adapted for a sending device, comprising:
acquiring target audio and multi-modal description information to be transmitted;
encoding the multi-modal description information according to the audio start-stop time corresponding to the multi-modal description information on the target audio to obtain a multi-modal code;
fusing the multi-modal code with the target audio to obtain fusion information; and
sending the fusion information to a receiving device.
2. The method of claim 1, wherein encoding the multi-modal description information according to the audio start-stop time corresponding to the multi-modal description information on the target audio to obtain the multi-modal code comprises:
determining at least one multi-modal information segment according to the multi-modal description information, wherein any multi-modal information segment comprises at least one of action, expression, and visual description information; and
carrying out serialization coding on the multi-modal description information according to the at least one multi-modal information segment and the audio start-stop time corresponding to the at least one multi-modal information segment on the target audio, to obtain the multi-modal code.
3. The method according to claim 2, wherein the method further comprises:
when the multi-modal description information is subjected to serialization coding according to the audio start-stop time corresponding to the at least one multi-modal information segment on the target audio, respectively adding modal identifiers to the coding heads of the at least one multi-modal information segment, wherein the modal identifier of any multi-modal information segment is used for marking the position of the coding result of that segment in the fusion information.
4. A method according to any one of claims 1-3, wherein the audio start-stop time corresponding to the multi-modal description information on the target audio is determined according to the acquisition times of the target audio and the multi-modal description information; or,
the audio start-stop time corresponding to the multi-modal description information on the target audio is determined according to the output times of the target audio and the multi-modal description information.
5. A method according to any one of claims 1-3, wherein fusing the multi-modal code with the target audio to obtain the fusion information comprises:
using a first transmission channel and a second transmission channel of the USB Audio Class (UAC) as the transmission channels of the multi-modal code and the target audio, respectively; and
performing channel mixing on the first transmission channel and the second transmission channel to obtain a mixing channel for transmitting the fusion information;
and sending the fusion information to the receiving device comprises:
writing the fusion information in the mixing channel into a protocol interface corresponding to a preset transmission protocol, so that the receiving device reads the fusion information through the protocol interface.
6. A multi-mode information transmission method, adapted for a receiving device, comprising:
receiving fusion information sent by a sending device, wherein the fusion information is obtained by fusing a multi-modal code with target audio, and the multi-modal code is obtained by encoding multi-modal description information according to the audio start-stop time corresponding to the multi-modal description information on the target audio; and
decoding the fusion information to obtain the multi-modal code and the target audio.
7. The method of claim 6, further comprising, after obtaining the multi-modal code and the target audio:
generating a multi-modal interaction instruction according to the multi-modal code; and
outputting the target audio and the multi-modal interaction instruction by a multi-modal interaction component according to the audio start-stop time corresponding to the multi-modal interaction instruction on the target audio.
8. The method according to claim 6 or 7, wherein decoding the fusion information to obtain the multi-modal code comprises:
identifying at least one modal identifier from the fusion information, wherein any modal identifier is located at the coding head of its corresponding multi-modal information segment;
determining at least one coding head position from the fusion information according to the at least one modal identifier; and
decoding at least one multi-modal information segment from the fusion information according to the at least one coding head position, wherein any multi-modal information segment includes at least one of action, expression, and visual description information.
9. A robot, comprising: a sending component and a receiving component;
wherein the sending component is configured to: acquire target audio and multi-modal description information to be transmitted; encode the multi-modal description information according to the audio start-stop time corresponding to the multi-modal description information on the target audio to obtain a multi-modal code; fuse the multi-modal code with the target audio to obtain fusion information; and send the fusion information to the receiving component;
and the receiving component is configured to: receive the fusion information sent by the sending component, and decode the fusion information to obtain the multi-modal code and the target audio.
10. An electronic device, comprising: a memory and a processor; wherein the memory is configured to store one or more computer instructions, and the processor is configured to execute the one or more computer instructions to perform the steps of the method of any one of claims 1-5 or 6-8.
11. A computer readable storage medium storing a computer program, which when executed by a processor causes the processor to carry out the steps of the method of any one of claims 1-5 or 6-8.
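For illustration only, the channel mixing of claim 5 can be pictured as stereo interleaving: the target audio occupies the first UAC transmission channel, the multi-modal code occupies the second, and the mixing channel carries the interleaved frames. The NumPy sketch below makes this concrete; the int16 sample format and the equal-rate framing of the code channel are assumptions, and writing the result to the protocol interface of a preset transmission protocol is left abstract.

```python
import numpy as np

def mix_channels(target_audio: np.ndarray, multimodal_code: np.ndarray) -> np.ndarray:
    """Mix the first transmission channel (target audio) and the second
    (multi-modal code) into one interleaved stream of stereo frames."""
    n = max(len(target_audio), len(multimodal_code))
    frames = np.zeros((n, 2), dtype=np.int16)
    frames[:len(target_audio), 0] = target_audio        # channel 1: target audio
    frames[:len(multimodal_code), 1] = multimodal_code  # channel 2: multi-modal code
    return frames.reshape(-1)  # L/R interleaved, ready for the protocol interface

def split_channels(mixed: np.ndarray):
    """Receiving side: de-interleave the mixing channel back into the
    target audio and the multi-modal code."""
    frames = mixed.reshape(-1, 2)
    return frames[:, 0], frames[:, 1]
```

Because both channels travel in one mixed stream, the receiving device recovers them in lockstep, which is what removes the mutual waiting between different types of information.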
CN202211634179.7A 2022-12-19 2022-12-19 Multi-mode information transmission method, device, equipment and storage medium Pending CN116110387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211634179.7A CN116110387A (en) 2022-12-19 2022-12-19 Multi-mode information transmission method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116110387A (en) 2023-05-12

Family

ID=86257223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211634179.7A Pending CN116110387A (en) 2022-12-19 2022-12-19 Multi-mode information transmission method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116110387A (en)

Similar Documents

Publication Publication Date Title
US8576107B2 (en) Data transmission apparatus and method thereof and data reception apparatus and method thereof
CN112217831B (en) Information interaction method, device and equipment for Internet of things equipment
CN114448563B (en) Semantic code transmission method and electronic equipment
CN108847248B (en) Bluetooth device audio processing method, system, readable storage medium and Bluetooth device
CN103166681A (en) Information transmission method and system in short-distance scene
US10142035B2 (en) Information transmission method, apparatus and system
US20110191550A1 (en) Apparatus and method for processing data
WO2024051823A1 (en) Method for managing reception information and back-end device
CN118349528B (en) Self-adaptive compression method, system and storage medium based on file attribute
CN111601154B (en) Video processing method and related equipment
KR101527185B1 (en) RFID system and method for transmitting large data of passive RFID
CN110838298A (en) Method, device and equipment for processing multi-channel audio data and storage medium
CN116110387A (en) Multi-mode information transmission method, device, equipment and storage medium
CN113784094B (en) Video data processing method, gateway, terminal device and storage medium
CN116055762A (en) Video synthesis method and device, electronic equipment and storage medium
CN114495112B (en) Method and device for processing text in image, readable medium and electronic equipment
CN110866377A (en) Text content conversion method and device
CN112151069B (en) Voice data processing method, device, computer equipment and storage medium
CN110517045B (en) Block chain data processing method, device, equipment and storage medium
CN110247666B (en) System and method for hardware parallel compression
CN114039969A (en) Data transmission method and device
CN110365858B (en) Information transmission method, device, equipment, system and storage medium
CN114360545A (en) Voice recognition and audio/video processing method, device, system and storage medium
CN114189554B (en) Information interaction method, device and readable storage medium
CN114495477A (en) Infrared learning method, device and system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination