WO2019214370A1 - Multimedia information transmission method and apparatus, and terminal - Google Patents

Multimedia information transmission method and apparatus, and terminal

Info

Publication number
WO2019214370A1
WO2019214370A1 (PCT/CN2019/080876)
Authority
WO
WIPO (PCT)
Prior art keywords
description information
multimedia data
content
data
information
Prior art date
Application number
PCT/CN2019/080876
Other languages
English (en)
French (fr)
Inventor
沈灿
林亚
李加周
孙健
Original Assignee
ZTE Corporation (中兴通讯股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Priority to EP19800178.6A (published as EP3792731A4)
Publication of WO2019214370A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/014Hand-worn input/output arrangements, e.g. data gloves
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/016Input arrangements with force or tactile feedback as computer generated output to the user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40Support for services or applications
    • H04L65/401Support for services or applications wherein the services involve a main real-time session and one or more additional parallel real-time or time sensitive sessions, e.g. white board sharing or spawning of a subconference
    • H04L65/4015Support for services or applications wherein the services involve a main real-time session and one or more additional parallel real-time or time sensitive sessions, e.g. white board sharing or spawning of a subconference where at least one of the additional parallel sessions is real time or time sensitive, e.g. white board sharing, collaboration or spawning of a subconference
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/70Media network packetisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/75Media network packet handling
    • H04L65/762Media network packet handling at the source 
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/23614Multiplexing of additional data and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/4104Peripherals receiving signals from specially adapted client devices
    • H04N21/4131Peripherals receiving signals from specially adapted client devices home appliance, e.g. lighting, air conditioning system, metering devices
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43079Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of additional data with content streams on multiple devices
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/6437Real-time Transport Protocol [RTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/85406Content authoring involving a specific file format, e.g. MP4 format
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Definitions

  • the present application relates to the field of communications, for example, to a method, device, and terminal for transmitting multimedia information.
  • a panoramic video is formed by multi-camera acquisition and splicing technology, multi-channel sound is collected through multiple microphones, and the terminal plays the content back by progressive downloading, so that the receiving end can see the panoramic video from every viewing angle.
  • panoramic video communication brings a better user experience by transmitting video images of multiple viewing angles.
  • the embodiment of the present application provides a method, a device, and a terminal for transmitting multimedia information.
  • a method for transmitting multimedia information is provided, including: acquiring multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and transmitting the multimedia data and the description information from the first device to a second device.
  • another method for transmitting multimedia information is provided, including: receiving, on a second device, multimedia data and description information sent by a first device, where the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; parsing the multimedia data and the description information to obtain a first content and a second content, respectively; and presenting the second content when the first content is played.
  • a multimedia information transmission apparatus is provided, including: an acquisition module, configured to acquire multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment in which the first device records the multimedia data; and a transmission module, configured to transmit the multimedia data and the description information from the first device to a second device.
  • another apparatus for transmitting multimedia information is provided, including: a receiving module, configured to receive, on a second device, multimedia data and description information sent by a first device, where the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; a parsing module, configured to parse the multimedia data and the description information to obtain a first content and a second content, respectively; and an output module, configured to present the second content when the first content is played.
  • a terminal is provided, including a first device and a second device, where the first device includes: an obtaining module, configured to acquire multimedia data and description information, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and a transmission module, configured to transmit the multimedia data and the description information from the first device to the second device.
  • the second device includes: a receiving module, configured to receive the multimedia data and the description information sent by the first device; a parsing module, configured to parse the multimedia data and the description information to obtain a first content and a second content; and an output module, configured to present the second content when the first content is played.
  • a storage medium is provided, in which a computer program is stored, wherein the computer program is configured to execute the steps of any one of the method embodiments described above.
  • an electronic device is provided, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
  • FIG. 1 is a network architecture diagram of an embodiment of the present application.
  • FIG. 2 is a flowchart of transmission of multimedia information according to an embodiment of the present application.
  • FIG. 3 is a structural block diagram of a multimedia information transmission apparatus according to an embodiment of the present application.
  • FIG. 4 is a structural block diagram of another multimedia information transmission apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the transmitting end of Example 1 of the present application.
  • FIG. 6 is a schematic diagram of packaging description information in the RTP protocol according to this embodiment.
  • FIG. 7 is a schematic diagram showing the structure of an information content package in the embodiment.
  • FIG. 8 is a schematic diagram of a receiving end of Example 1 of the present application.
  • FIG. 1 is a network architecture diagram of the embodiment of the present application.
  • the network architecture includes a first device and a second device, where the first device and the second device interact with each other.
  • FIG. 2 is a flowchart of transmitting multimedia information according to an embodiment of the present application. As shown in FIG. 2, the process includes step S202 and step S204.
  • step S202 the multimedia data and the description information are acquired in the first device, where the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
  • step S204 the multimedia data and the description information are transmitted from the first device to the second device.
  • through the above steps, the description information, which describes the multimedia data and the environment in which the first device records it, is transmitted to the second device together with the multimedia data, so that the second device can present the content related to the description information while the multimedia content is played. This avoids the situation in the related art where the description information cannot be transmitted when transmitting multimedia information, realizes the fusion of multimedia content and environment, realizes somatosensory interaction between the communicating parties, and presents an immersive experience.
  • the execution body of the foregoing steps may be a terminal, such as a mobile phone or a virtual reality (VR) terminal, but is not limited thereto.
  • the multimedia data in this embodiment includes audio data, video data, and the like.
  • transmitting the multimedia data and the description information from the first device to the second device may be performed in, but is not limited to, the following manners.
  • the multimedia data may be transmitted in a first channel and the description information in a second channel, where the first channel may be a multimedia data transmission channel; or the multimedia data and the description information may be stored in a first file, and, by reading the first file, the multimedia data and the description information are obtained to implement non-real-time transmission to the second device. The first file is optional: a non-real-time application (such as an encoder or a camera) can store the data into a file, while a real-time communication service does not need to store it into a file. To realize a recording function, a non-real-time VR video can be played back: when playing, the first file is read, and the multimedia data and the description information are transmitted through the first channel and the second channel. The first channel and the second channel are established between the first device and the second device.
  • the second channel may be an independent channel, or may be merged into the first channel for transmission, in which case the second channel is part of the first channel.
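  • as an illustration of the channel arrangement above, a minimal sketch follows, assuming UDP transport and example addresses (the embodiment does not fix a transport for the channels):

```python
# Minimal sketch of a first (media) channel and a second (description) channel.
# The addresses, ports, and use of UDP are assumptions for illustration only.
import socket

MEDIA_ADDR = ("192.0.2.10", 5004)        # first channel: multimedia data
DESCRIPTION_ADDR = ("192.0.2.10", 5006)  # second channel: description information

media_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
desc_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_multimedia(packet: bytes) -> None:
    media_sock.sendto(packet, MEDIA_ADDR)

def send_description(packet: bytes) -> None:
    desc_sock.sendto(packet, DESCRIPTION_ADDR)
```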
  • before transmitting the multimedia data and the description information from the first device to the second device, the method further includes: compressing and encoding the multimedia data and packaging it into a first data packet, and packaging the description information into a second data packet.
  • the description information of this embodiment includes at least one item; by way of example, the description information includes a timestamp, a duration, coordinates, identification information of an object, and description content.
  • the timestamp is used to describe the recording time of the multimedia data.
  • the duration is used to describe the duration of the multimedia data from the initial time to the current time.
  • the coordinates are used to describe the recording location of the multimedia data.
  • the identification information of the object is used to identify an object in the picture corresponding to the multimedia data; the object may be a person, a scene, or the like.
  • the description content is used to describe at least one of: the environment in the picture corresponding to the multimedia data, and data obtained by analyzing the multimedia data.
  • the description content includes at least one of: text content after speech recognition of the audio data, the language of the audio data, the intonation of the audio data, the emotion of an object in the video image, the physical characteristics of an object in the video image, the motion of an object in the video image, the force exerted by an object in the video image, the wind force in the environment of the picture corresponding to the video image, the wind direction in the picture corresponding to the video image, the temperature in the environment of the picture corresponding to the multimedia data, the taste in the environment of the picture corresponding to the video image, and the tactile sensation of an object in the video image, wherein the multimedia data includes video data and audio data.
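  • a minimal sketch of one piece of description information as a data structure follows; the field names and types are assumptions, since the text fixes the field list (timestamp, duration, coordinates, object identification, description content) but not a concrete schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class DescriptionInfo:
    timestamp: int        # recording time of the multimedia data (media clock units)
    duration: int         # duration from the initial time to the current time, in ms
    coordinates: Tuple[float, float, float]  # recording location
    object_id: int        # identifies a person or scene object in the picture
    content: Dict[str, str] = field(default_factory=dict)  # e.g. {"temperature": "23C"}

info = DescriptionInfo(
    timestamp=90000, duration=500, coordinates=(1.0, 2.0, 0.0),
    object_id=7, content={"emotion": "smiling", "wind_direction": "NE"},
)
```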
  • the solution of this embodiment can be applied to different scenarios using different transmission protocols, including: transmitting the multimedia data and the description information from the first device to the second device using the Real-time Transport Protocol (RTP); transmitting the description information from the first device to the second device using the Session Initiation Protocol (SIP); transmitting the description information from the first device to the second device using the Real Time Streaming Protocol (RTSP); and transmitting the multimedia data and the description information from the first device to the second device using a custom transmission protocol.
  • acquiring the multimedia data and the description information in the first device comprises step S11 and step S12.
  • step S11 data is collected by a plurality of sensors, which respectively collect the multimedia data and the raw data of the description information; the plurality of sensors include at least one of the following: a motion sensor, an environmental sensor, a laser radar, a millimeter-wave radar, a fragrance sensor, and a sensing glove.
  • the sensors (such as a camera or a microphone) may be disposed in the first device, or may be externally connected to the first device, with the collected data processed by the first device.
  • step S12 the data is analyzed and processed in the first device, and the multimedia data and the description information are extracted.
  • the description information is obtained by analyzing and processing the raw data collected by the plurality of sensors, for example, obtaining temperature-related description information from a temperature sensor; the raw audio and video data may also be analyzed and processed, for example, analyzing the expressions and emotions of persons in the video image to obtain related description information.
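  • a sketch of steps S11 and S12 follows, with assumed sensor names, raw-data shapes, and analysis (averaging a temperature reading, summarizing glove samples); it illustrates the reduction of raw sensor data to description information, not the patent's actual analysis:

```python
def extract_description(raw: dict) -> dict:
    """Reduce raw sensor samples to description information (illustrative only)."""
    desc = {}
    if "temperature_sensor" in raw:
        samples = raw["temperature_sensor"]
        desc["temperature"] = round(sum(samples) / len(samples), 1)
    if "sensing_glove" in raw:
        samples = raw["sensing_glove"]
        # e.g. amplitude and force of a handshake gesture
        desc["handshake"] = {"amplitude": max(samples), "force": sum(samples)}
    return desc

print(extract_description({"temperature_sensor": [22.9, 23.1],
                           "sensing_glove": [0.2, 0.8]}))
```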
  • in this embodiment, another method for transmitting multimedia information running on the above network architecture is provided, including step S302, step S304, and step S306.
  • step S302 the multimedia data and the description information sent by the first device are received on the second device, where the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
  • step S304 the multimedia data and the description information are parsed to obtain the first content and the second content, respectively.
  • step S306 when the first content is played, the second content is presented.
  • receiving the multimedia data and the description information sent by the first device on the second device includes: receiving, in a first channel on the second device, the multimedia data sent by the first device, and receiving, in a second channel, the description information sent by the first device; wherein the first channel and the second channel are established between the first device and the second device.
  • the multimedia data and the description information may also be obtained by reading the first file, and then transmitting the multimedia data and the description information by using the first channel and the second channel to implement non-real-time transmission.
  • presenting the second content when the first content is played includes: playing the first content on at least one third device, and presenting the second content on at least one fourth device.
  • the first content is multimedia content, including video content and audio content, and can be played through a display screen or a speaker; the second content is presented or simulated by a corresponding presentation terminal. For example, a timestamp or duration is displayed on the display screen, temperature is simulated by a refrigeration device or a heating device, taste is presented by releasing a specific scent, and force is presented by a driving device.
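  • the mapping from second-content types to presentation peripherals can be sketched as follows; the handler functions are hypothetical stand-ins for real peripheral drivers:

```python
# Route each kind of parsed second content to the peripheral named in the text:
# timestamps to the display, temperature to a heating/cooling device, taste to
# a scent release, force to a driving device. Handlers are assumptions.
PRESENTERS = {
    "timestamp": lambda v: print(f"display shows: {v}"),
    "temperature": lambda v: print(f"thermal device target: {v} C"),
    "taste": lambda v: print(f"scent device releases: {v}"),
    "force": lambda v: print(f"driving device applies: {v} N"),
}

def present_second_content(second_content: dict) -> None:
    for kind, value in second_content.items():
        handler = PRESENTERS.get(kind)
        if handler:
            handler(value)

present_second_content({"temperature": 23, "taste": "perfume"})
```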
  • the method according to the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course, by hardware.
  • the part of the technical solution of the present application that is essential or that contributes to the related art may be embodied in the form of a software product stored in a storage medium, such as a read-only memory/random access memory (ROM/RAM), a magnetic disk, or an optical disk, including a plurality of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present application.
  • a device for transmitting multimedia information is provided, and the device is configured to implement the foregoing embodiments and example embodiments, and details are not described herein.
  • the term "module" may refer to software, hardware, or a combination of the two that implements a predetermined function.
  • the devices described in the following embodiments may be implemented in software, although an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 3 is a structural block diagram of a multimedia information transmission apparatus according to an embodiment of the present application. As shown in FIG. 3, the apparatus is applied to a first device, where the apparatus includes an acquisition module 30 and a transmission module 32.
  • the obtaining module 30 is configured to acquire multimedia data and description information in the first device, where the description information is used to describe an environment of the first device when recording the multimedia data and the multimedia data.
  • the transmission module 32 is configured to transmit the multimedia data and the description information from the first device to the second device.
  • FIG. 4 is a structural block diagram of another multimedia information transmission apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus is applied to a second device, and the apparatus includes a receiving module 40, a parsing module 42, and an output module 44.
  • the receiving module 40 is configured to receive the multimedia data and the description information sent by the first device on the second device, where the description information is used to describe an environment of the first device when recording the multimedia data and the multimedia data.
  • the parsing module 42 is configured to parse the multimedia data and the description information to obtain the first content and the second content, respectively.
  • the output module 44 is configured to present the second content when the first content is played.
  • the embodiment further provides a terminal, which combines the functional modules included in the first device and the second device, and can implement the functions of the first device and the second device.
  • each of the above modules may be implemented by software or hardware.
  • the above modules may be implemented by, but are not limited to, the following forms: the modules are all located in the same processor, or the modules are located in different processors in any combination.
  • the video communication method provided in this embodiment can provide an immersive experience.
  • the application provides an immersive interactive video communication method, where the method includes: the generating end, corresponding to the first device, analyzes and processes the raw data collected by a plurality of sensors, cameras, and microphones, extracts the description information and packs it, compresses and encodes the raw audio and video data, transmits the description information in a description channel, and transmits the audio and video encoded data in a media channel; the receiving end, corresponding to the second device, receives the description information and the audio and video encoded data and, after unpacking, decodes the audio and video encoded data, analyzes and processes the description data, plays the audio and video synchronously, and presents the action and the environment through a presentation unit.
  • the description information and the audio and video encoded data may be stored in the file, and the receiving end extracts the audio and video encoded data and the description information from the file for presentation.
  • the plurality of sensors include: a motion sensor, an environmental sensor, a laser radar, and a millimeter wave radar, etc.
  • the camera may be an array of at least one camera
  • the microphone may be an array of at least one microphone.
  • the description information is information obtained by analyzing and processing the raw data collected by the plurality of sensors, together with information obtained by analyzing and processing the raw audio and video data; the description information includes: a timestamp, a duration, coordinates, a corresponding person or object number, and specific content.
  • the specific content of the description information includes but is not limited to: text content after speech recognition, language, intonation, emotion, physical characteristics, movement, strength, wind force, wind direction, temperature, and taste.
  • the timestamp refers to the time when the description information occurs.
  • the duration gives the duration of the description information.
  • the coordinates give the position at which the description information was collected, and the corresponding person or object number gives the object to which the description information applies.
  • since the audio and video data also carry timestamps, the receiving end realizes synchronous presentation of the description information and the audio and video according to the description information timestamp and the audio and video timestamps, and the receiving end can present the description information at the same coordinates, at the same time, and on the same object as at the transmitting end, according to the description information.
  • the description channel may be extended in an RTP, SIP or RTSP transmission protocol, or may be other transmission protocols.
  • the present application also provides a system and apparatus for immersive interactive video communication, including: a transmitting device and a receiving device.
  • the sending device includes the following processing steps: the acquisition unit collects raw data through a plurality of sensors, a camera, and a microphone; the description information generating unit analyzes and processes the raw data and extracts the description information; the encoding unit performs video compression encoding on the raw video data and audio compression encoding on the raw audio data; the packing unit packs the description information, and separately packs the video encoded data and the audio encoded data; the transmitting unit transmits the description information in the description channel, and transmits the video encoded data and the audio encoded data in the media channel; if the data is not transmitted over the network, the storage unit stores the description information and the audio and video encoded data into a file.
  • the acquisition unit collects raw data through a camera, a microphone, and a plurality of sensors, the plurality of sensors including: a motion sensor, an environmental sensor, a laser radar, a millimeter-wave radar, and the like; the camera may be an array of at least one camera, and the microphone may be an array of at least one microphone.
  • the description information generating unit extracts, from the raw data of the plurality of sensors and the raw audio and video data, description information of actions, the environment, and the audio and video; the description information includes: a timestamp, a duration, coordinates, a corresponding person or object number, specific content, and the like.
  • the specific content in the description information includes but is not limited to: text content after speech recognition, language, intonation, emotion, physical characteristics, motion, strength, wind, wind direction, temperature, and taste.
  • the timestamp refers to the time when the description information occurs.
  • the duration gives the duration of the description information.
  • the coordinates give the position at which the description information was collected, and the corresponding person or object number gives the object to which the description information applies.
  • the encoding unit compresses and encodes the collected raw audio and video data.
  • the packing unit packs the audio and video encoded data to form audio and video encoded data packets, and packs the description information to form description information data packets.
  • the description information and the audio and video encoded data packets are each packaged according to their respective formats. This embodiment provides one possible packaging format, but the specific packaging format of the audio, video, and description information is not limited in this application.
  • the transmitting unit sends the description information packets in the description channel, and sends the audio and video encoded data packets in the media channel.
  • the description channel may be extended in an RTP or SIP transmission protocol, or may be a separately established transmission channel.
  • Storage unit stores audio and video encoded data and description information into a file.
  • the receiving device includes the following processing steps: the receiving unit receives the description packets and the audio and video encoded data packets from the description channel and the media channel, or reads the description information and the audio and video encoded data from the file; the unpacking unit unpacks the audio and video encoded data packets and the description information packets.
  • the receiving unit receives the audio and video encoded data packets from the media channel, and receives the description information data packets from the description channel, where the description channel may be extended in the RTP or SIP transmission protocol, or may be a separately established transmission channel; alternatively, the audio and video encoded data packets and the description information data packets are read from the file.
  • the unpacking unit parses the audio and video encoded data packets and the description information data packets to obtain the speech encoded data, the video encoded data, and the description information.
  • the description information includes: a timestamp, a duration, coordinates, a corresponding person or object number, and specific content, and the specific content in the description information includes but is not limited to: text content after speech recognition, language, intonation, emotion, physical characteristics, movement, strength, wind force, wind direction, temperature, and taste.
  • the decoding unit decodes the speech encoded data and the video encoded data to obtain playable raw audio and video data.
  • the description information processing unit analyzes and processes the description information and restores the different types of description information for the audio and video.
  • the presentation unit presents the audio and video according to the timestamps corresponding to the audio and video data, and presents the description information according to the timestamp of the description information, realizing synchronous presentation of the description information and the audio and video; it controls the length of the presentation according to the duration of the description information, and presents the description information at different locations and on different objects based on the description information coordinates and objects.
  • the first four examples, in combination with different application scenarios, give implementations of transmitting the description information through different methods, such as the transport protocol RTP, file storage in MP4, and the signaling control protocols SIP and RTSP. The fifth example gives an implementation method for extracting the description information, and the sixth example gives an implementation method for synchronous presentation of audio and video with environment and action behavior.
  • Example 1 includes a transmitting device and a receiving device, which first establish a call connection through the SIP protocol or the RTSP protocol.
  • FIG. 5 is a schematic diagram of the transmitting end of Example 1 of the present application, including steps 1 to 5.
  • step 1 raw data is collected by motion and environmental sensors, including: gloves, clothing, hats, shoes, and temperature, taste, and wind sensors.
  • the video raw data is collected by the camera array, and the audio raw data is collected by the microphone array.
  • step 2 the data collected in step 1 is processed and analyzed: the video is spliced, noise is eliminated, objects of interest in the video are identified, and their features are extracted; the raw audio data is processed, noise and echo are eliminated, and the speech coordinates are extracted; the speech is transformed into text, and the intonation and emotion of the speech are obtained; and description information such as action and environment is extracted.
  • the description information includes: a timestamp, a duration, coordinates, a corresponding person or object number, specific content, and the like; the specific content in the description information includes but is not limited to: text content after speech recognition, language, intonation, emotion, physical characteristics, motion, strength, wind force, wind direction, temperature, and taste.
  • each piece of description information further includes: a description information type, a description information number, a termination flag, and the like. When description information with a given description information number first appears, the termination flag is set to 1; when description information with the same number appears for the last time, the termination flag is set to 0.
  • Table 1 gives the content name, content code, and description of each item; the description information can be expanded to include more content.
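  • the termination-flag rule can be sketched as follows; how intermediate occurrences of the same number are flagged is not stated in the text, so that branch is an assumption:

```python
def termination_flag(first_appearance: bool, last_appearance: bool) -> int:
    """Termination flag per the rule above (intermediate case is an assumption)."""
    if first_appearance:
        return 1  # first occurrence of this description information number
    if last_appearance:
        return 0  # last occurrence closes this description information
    return 1      # assumption: keep the flag at 1 until the final occurrence
```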
  • step 3 the video is compression-encoded, with H.265 as the encoding algorithm, and the audio is compression-encoded, with Adaptive Multi-Rate Wideband (AMR-WB) as the encoding algorithm, obtaining the audio and video encoded data.
  • step 4 the video encoded data is packetized and encapsulated according to RFC 7798 (Request For Comments 7798), and the speech encoded data is packetized and encapsulated according to RFC 4867.
  • the audio and video description information is RTP-packaged and encapsulated, and the RTP timestamp information enables synchronous presentation of the description information with the audio and video streams.
  • each RTP packet contains the description information corresponding to the same time slice; the specific packaging format is shown in FIG. 6.
  • FIG. 6 is a schematic diagram of packaging description information in the RTP protocol in this embodiment.
  • the fields of the RTP header are encapsulated according to RFC 3550, and the timestamp field of the description packet uses a clock frequency consistent with that of the audio and video media packets, to facilitate synchronization processing when the receiver presents the data.
  • the RTP payload includes at least one piece of description information, and each piece of description data corresponds to an indication information block in the RTP payload header, consisting of an F indicator bit and a description information length. The F indicator bit indicates whether the current data is the last description information: if it is, the F indicator bit is 1; if it is not, the F indicator bit is 0. Description information length 1 indicates the length of the first description information, and description information length N indicates the length of the Nth description information, in bytes.
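  • a sketch of this payload layout follows; a 16-bit indication block (1-bit F plus a 15-bit byte length) interleaved with the entries is an assumption, since the text fixes the fields but not their widths or exact placement:

```python
import struct

def pack_description_payload(entries: list) -> bytes:
    """Pack description entries, each preceded by its F bit and byte length."""
    payload = b""
    for i, entry in enumerate(entries):
        f_bit = 1 if i == len(entries) - 1 else 0  # F = 1 marks the last entry
        payload += struct.pack("!H", (f_bit << 15) | (len(entry) & 0x7FFF))
        payload += entry
    return payload

payload = pack_description_payload([b"temperature=23C", b"emotion=smiling"])
```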
  • FIG. 7 is a schematic diagram of the information content encapsulation structure in the embodiment.
  • the description information may include a plurality of content items, and each content item may further include: a content code, a G indicator bit, a content length, and a content value; see Table 1 for the content codes.
  • the G indicator bit indicates whether the current content is the last content information: if the current content is the last content information, the G indicator bit is 1; if not, the G indicator bit is 0. If a content value is too long, the same description information can be split and transmitted in multiple RTP packets: the G indicator bit is set to 0, and the next RTP packet continues to transmit the same content value; when the G indicator bit is set to 1, no subsequent packet carries the same content.
  • the content length refers to the length of the content value in units of bytes.
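  • a sketch of the content encapsulation and of splitting an over-long content value across RTP packets with the G bit follows; the field widths and the content code value are assumptions (Table 1, which assigns the codes, is not reproduced here):

```python
import struct

def pack_content(code: int, value: bytes, is_last: bool) -> bytes:
    """Content item: content code, G bit, content length (bytes), content value."""
    g_bit = 1 if is_last else 0
    return struct.pack("!BBH", code, g_bit, len(value)) + value

def split_long_value(code: int, value: bytes, max_chunk: int) -> list:
    """G = 0 means the next RTP packet continues the same value; G = 1 closes it."""
    chunks = [value[i:i + max_chunk] for i in range(0, len(value), max_chunk)]
    return [pack_content(code, chunk, i == len(chunks) - 1)
            for i, chunk in enumerate(chunks)]

pieces = split_long_value(code=0x21, value=b"a very long content value", max_chunk=8)
```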
  • information such as the PT (payload type) value of the RTP header corresponding to the description information stream can be declared in the SDP.
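  • an illustrative SDP fragment for such a declaration follows; the PT value, port, and encoding name are assumptions, as the text does not give a concrete SDP example:

```python
# Hypothetical SDP lines declaring a dynamic payload type for the
# description information stream; none of these values come from the patent.
SDP_FRAGMENT = """\
m=application 5006 RTP/AVP 109
a=rtpmap:109 media-description/90000
"""
print(SDP_FRAGMENT)
```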
  • step 5 the packaged description information and the audio and video encoded data packet are transmitted.
  • FIG. 8 is a schematic diagram of the receiving end of Example 1 of the present application, including steps 1 to 5.
  • step 1 network data packets are respectively received from the corresponding ports to obtain a description packet and an audio and video encoded data packet.
  • step 2 the package contents are parsed according to the corresponding format to obtain description information and audio and video encoded data.
  • step 3 the audio and video encoded data is decoded to obtain playable audio and video data.
  • step 4 the description information is analyzed and processed to restore different types of description information for controlling different peripherals for presentation, for example, the taste information is used to control the taste generation device to synthesize the specified taste.
  • the audio and video are analyzed and processed to restore audio and video data of different viewing angles.
  • step 5 according to the timestamps corresponding to the audio and video data, the audio and video data are played synchronously, and the description information is presented according to its timestamp, so that the description information and the audio and video are presented synchronously. At the same time, the length of the presentation is controlled according to the duration of the description information, and, based on the description information coordinates and objects, peripherals are controlled according to the description information to present the corresponding action behavior and environment at different positions and on different objects. According to the description information, subtitles in different languages can be inserted into the played video, the viewing angle can be switched according to the orientation of the voice, and the like.
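  • a sketch of this synchronized playout follows, with assumed frame and entry shapes; play_frame and control_peripheral stand in for the real playout and peripheral control functions:

```python
def present_synchronized(av_frames, descriptions, play_frame, control_peripheral):
    """Play A/V frames in timestamp order; fire each description entry when due."""
    pending = sorted(descriptions, key=lambda d: d["timestamp"])
    for frame in av_frames:  # assumed to be already ordered by timestamp
        while pending and pending[0]["timestamp"] <= frame["timestamp"]:
            d = pending.pop(0)
            control_peripheral(d["type"], d["content"], d["duration"])
        play_frame(frame)

present_synchronized(
    av_frames=[{"timestamp": 0}, {"timestamp": 3600}],
    descriptions=[{"timestamp": 1800, "type": "taste",
                   "content": "perfume", "duration": 500}],
    play_frame=lambda f: print("play frame at", f["timestamp"]),
    control_peripheral=lambda t, c, d: print("present", t, c, "for", d, "ms"),
)
```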
  • In Example 2, step 1: the encoder collects the description data and the audio and video data, and analyzes and encodes them; refer to steps 1 to 3 of Example 1 for the detailed steps.
  • step 2 the encoder stores the description information and the audio and video encoded data in an MP4 file.
  • an audio track and a video track are established for the audio and video data, and a text track is created for the description information.
  • the description information data is placed in a media data box (mdat box); the description information data includes the coordinates, the corresponding person or object number, the duration, the description information type, the description information number, the termination flag, and the specific content.
  • the timestamp information of the description information is carried in the file header through the duration field.
  • an executable file format is provided here as an example; the file format stored by the server is not limited to MP4.
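  • a sketch of serializing one description record for the mdat box follows, in the field order listed above; the binary widths are assumptions, since the text fixes the field list but not a layout:

```python
import struct

def pack_mdat_record(coords, object_no, duration, dtype, number,
                     term_flag, content: bytes) -> bytes:
    """coords (3 floats), object number, duration, type, number, termination flag."""
    header = struct.pack("!3fHIBHB", *coords, object_no, duration,
                         dtype, number, term_flag)
    return header + struct.pack("!H", len(content)) + content

record = pack_mdat_record(coords=(1.0, 2.0, 0.0), object_no=7, duration=500,
                          dtype=3, number=12, term_flag=1, content=b"taste=perfume")
```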
  • step 3 the MP4 file is transferred to the streaming server.
  • step 4 the player sends a Hypertext Transfer Protocol (HTTP) GET request to the streaming server, downloads the moov atom of the MP4 file, and parses the index information.
  • step 5 the player sends a GET request with a Range field, specifying that the MP4 file at a particular location is played.
  • step 6 the player reads the audio and video encoded content and the description information in the mdat box through the index information.
  • the audio and video frames are decoded and analyzed; the description information is analyzed and processed to restore different types of description information.
  • step 7 the player plays the audio and video data synchronously and, at the same time, indexes the description information with the same timestamp in the MP4 file, realizing synchronous presentation of the description information and the audio and video. According to the person or object number and the coordinates in the description information, the corresponding action is applied to the specified object; the length of the presentation is controlled according to the duration of the description information; and, based on the description information coordinates and objects, peripherals are controlled according to the description information to present the corresponding action behavior and environment.
  • In Example 3, step 1: the player and the camera establish a connection via SIP signaling.
  • step 2 the camera collects the description data and the audio and video data, and analyzes and encodes them; refer to steps 1 to 4 of Example 1 for the detailed steps.
  • step 3 the camera transmits an RTP packet of audio and video data to the player.
  • step 4 the camera places the description information, as text, in a SIP extended MESSAGE method and sends it to the player in time.
  • the content of the description information text described above is composed of the description information coordinates, the corresponding person or object number, the timestamp, the duration, the description information type, the description information number, the termination flag, and the specific content.
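  • a sketch of composing that text body follows, assuming a simple "key: value" line format; the embodiment fixes which fields the text contains, not how they are delimited:

```python
def build_sip_message_body(desc: dict) -> str:
    """Serialize the description fields in the order listed above (format assumed)."""
    order = ["coordinates", "object_no", "timestamp", "duration",
             "type", "number", "termination_flag", "content"]
    return "\r\n".join(f"{k}: {desc[k]}" for k in order)

body = build_sip_message_body({
    "coordinates": "1.0,2.0,0.0", "object_no": 7, "timestamp": 90000,
    "duration": 500, "type": "taste", "number": 12,
    "termination_flag": 1, "content": "perfume,concentration=0.3",
})
```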
  • step 5 the player receives the RTP packet of the audio and video, parses the RTP packet, and decodes and analyzes the audio and video frames.
  • step 6 the player receives the SIP MESSAGE, parses the description information, and analyzes it to restore the different types of description information.
  • step 7 the player plays the audio and video data synchronously and, at the same time, finds the description information with the corresponding timestamp among the description information received in step 6, realizing synchronous presentation of the description information and the audio and video; meanwhile, peripheral sensors are controlled according to the description information to present the corresponding action behavior.
  • In Example 4, terminal A and terminal B each include a sending module and a receiving module.
  • a call is established between the terminal A and the terminal B through the SIP protocol.
  • the interactive communication steps are as follows:
  • the sending module of terminal A analyzes and processes the raw data collected by various sensors, cameras, and microphones, extracts the description information and packs it, compresses and encodes the raw audio and video data, transmits the description information in the description channel, and transmits the audio and video encoded data in the media channel;
  • the receiving module of terminal B receives the description information and the audio and video encoded data; after unpacking, it decodes the audio and video encoded data, analyzes and processes the description data, plays the audio and video synchronously through the presentation unit, and synchronously presents the action and the environment.
  • likewise, the sending module of terminal B analyzes and processes the raw data collected by various sensors, cameras, and microphones, extracts the description information and packs it, compresses and encodes the raw audio and video data, transmits the description information in the description channel, and transmits the audio and video encoded data in the media channel;
  • the receiving module of terminal A receives the description information and the audio and video encoded data; after unpacking, it decodes the audio and video encoded data, analyzes and processes the description data, plays the audio and video synchronously through the presentation unit, and synchronously presents the action and the environment.
  • for example, the sending module of terminal A collects data such as the amplitude and strength of its user's handshake action and transmits it to the receiving module of terminal B through the description channel; the receiving module presents the handshake action synchronously in real time through the glove of terminal B, and the user of terminal B performs the same handshake action in response.
  • the sending module of terminal B then collects the data of that handshake action and transmits it to the receiving module of terminal A through the description channel; the receiving module presents the handshake action synchronously in real time through the glove of terminal A. Real-time interaction between the two terminals is thereby realized, just like face-to-face communication.
  • the sending module of terminal A collects the concentration and coordinates of a perfume scent and transmits them to the receiving module of terminal B through the description channel; the receiving module presents the same scent at the same concentration through the scent presentation device of terminal B. When the termination flag of the scent description information transmitted by terminal A is 0, the scent presentation ends.
  • Example 5 provides an implementation method for extracting and generating description information in bidirectional real-time video communication, including steps 1 to 5.
  • step 1 the data collected by various peripheral sensors is extracted, and related information is recorded by time.
  • timestamp 1, sensing glove: handshake action amplitude, velocity, duration, corresponding person's number, and coordinates.
  • timestamp 2, fragrance sensor: scent concentration, fragrance type, duration, and coordinates.
  • step 2 the video content is analyzed, the information of interest is extracted, and recorded by time.
  • timestamp 1: the corresponding person's number, coordinates, facial expression and emotion, and duration.
  • timestamp 2: the number, coordinates, and action behavior of a person or thing (such as a warning line, etc.).
  • in step 3, speech recognition is performed on the audio content, information of interest is extracted, and it is recorded by time, for example:
  • timestamp 1: number of the corresponding person, coordinates, intonation, language, and the translated speech content.
  • in step 4, all the information recorded in steps 1 to 3 above is analyzed and merged by time.
  • in step 5, the merged information is uniformly defined according to the format specified for the description information, for example: timestamp, description information number, description information type, coordinates, number of the corresponding person or object, duration, termination flag, and specific content.
  • a method of synchronously presenting description information and audio and video content is provided, including steps 1 to 3.
  • in step 1, a reference track is selected as the alignment reference for synchronization; the playback speed of the other media tracks or of the description information is adjusted (sped up or slowed down) to follow the reference track. Because listeners are more sensitive to variable-speed playback of speech frames than viewers are to small shifts in the display time of video frames or description information, speech is selected as the reference.
  • in step 2, the starting points are aligned: from the absolute playing time of the first speech frame, such as its NTP time, and the timestamp of the first speech frame, the timestamps of the video frames and of the description information aligned with it are calculated.
  • in step 3, the audio and video frames are played synchronously, and the description information with the corresponding timestamps is analyzed and processed. The duration of the description information controls the length of the presentation; according to the coordinates and object of the description information, the peripheral devices are controlled at the corresponding positions and objects to present the corresponding action behaviors and environment, for example, the amplitude and force of a handshake are presented through the sensing glove, and a fragrance of the corresponding type and concentration is generated by the fragrance generating device within the time period indicated by the description information.
  • the embodiments of the present application realize immersive interactive video communication and improve the user experience.
  • an immersive video experience can be better realized in video communication: while the audio and video are presented, the actions and the environment are presented in complete synchronization, so that audio, video, and environment are integrated with each other, realizing the somatosensory interaction of the communicating parties and presenting an immersive experience. Two-party or multi-party real-time communication is supported, and content can also be stored and then distributed to support various video services.
  • Embodiments of the present application also provide a storage medium having a computer program stored therein, wherein the computer program is configured to execute, when run, the steps of any one of the method embodiments described above.
  • the storage medium may be configured to store a computer program configured to perform steps S1 and S2.
  • in step S1, the multimedia data and the description information are acquired in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
  • in step S2, the multimedia data and the description information are transmitted from the first device to the second device.
  • the foregoing storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disc.
  • Embodiments of the present application also provide an electronic device including a memory and a processor, the memory having a computer program stored therein, and the processor being configured to run the computer program to perform the steps of any of the above method embodiments.
  • the electronic device may further include a transmission device and an input/output device, both connected to the processor.
  • the above processor may be configured to perform steps S1 and S2 through a computer program.
  • in step S1, the multimedia data and the description information are acquired in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data;
  • in step S2, the multimedia data and the description information are transmitted from the first device to the second device.
  • the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network formed by multiple computing devices. They may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some instances the steps shown or described may be performed in a different order than here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present application is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Automation & Control Theory (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application provides a multimedia information transmission method and apparatus, and a terminal. The method includes: acquiring multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and transmitting the multimedia data and the description information from the first device to a second device.

Description

Multimedia information transmission method and apparatus, and terminal
This application claims priority to the Chinese patent application No. 201810444330.8 filed with the China Patent Office on May 10, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications, for example, to a multimedia information transmission method and apparatus, and a terminal.
Background
With the continuous development of network bandwidth, the ever-increasing computing speed of processors, and the continuous development of sensor technology, virtual reality (VR) technology has begun to be applied, and people's expectations for the video communication experience keep rising. In addition to presenting three-dimensional (3D) video and 3D audio, more information related to the environment, behaviors, and actions is to be acquired and presented, with the requirement of an immersive, on-the-scene experience.
In the related art, panoramic video is formed through multi-camera capture and stitching, multi-channel sound is captured through multiple microphones, and the terminal plays the content after progressive download, so that the receiving end can see panoramic video from every viewing angle. Compared with traditional video communication, panoramic video communication brings a better user experience by transmitting video pictures of multiple viewing angles. However, the following situation also exists: for other information such as behaviors, actions, and the environment, the receiving end cannot synchronously present the original sound, picture, actions, and environment according to the information of the originating end, which directly affects the user experience.
No effective solution to the above situation in the related art has yet been found.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the protection scope of the claims.
The embodiments of the present application provide a multimedia information transmission method and apparatus, and a terminal.
According to an embodiment of the present application, a multimedia information transmission method is provided, including: acquiring multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and transmitting the multimedia data and the description information from the first device to a second device.
According to an embodiment of the present application, another multimedia information transmission method is provided, including: receiving, on a second device, multimedia data and description information sent by a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; parsing the multimedia data and the description information to obtain first content and second content respectively; and presenting the second content while playing the first content.
According to another embodiment of the present application, a multimedia information transmission apparatus is provided, including: an acquisition module configured to acquire multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and a transmission module configured to transmit the multimedia data and the description information from the first device to a second device.
According to another embodiment of the present application, another multimedia information transmission apparatus is provided, including: a receiving module configured to receive, on a second device, multimedia data and description information sent by a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; a parsing module configured to parse the multimedia data and the description information to obtain first content and second content respectively; and an output module configured to present the second content while playing the first content.
According to still another embodiment of the present application, a terminal is provided, including a first device and a second device, wherein the first device includes: an acquisition module configured to acquire multimedia data and description information, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and a transmission module configured to transmit the multimedia data and the description information from the first device to the second device; and the second device includes: a receiving module configured to receive the multimedia data and the description information sent by the first device; a parsing module configured to parse the multimedia data and the description information to obtain first content and second content respectively; and an output module configured to present the second content while playing the first content.
According to still another embodiment of the present application, a storage medium is further provided, the storage medium storing a computer program, wherein the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
According to still another embodiment of the present application, an electronic device is further provided, including a memory and a processor, the memory storing a computer program, and the processor being configured to run the computer program to execute the steps in any one of the above method embodiments.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an undue limitation of it. In the drawings:
FIG. 1 is a network architecture diagram of an embodiment of the present application;
FIG. 2 is a flowchart of multimedia information transmission according to an embodiment of the present application;
FIG. 3 is a structural block diagram of a multimedia information transmission apparatus according to an embodiment of the present application;
FIG. 4 is a structural block diagram of another multimedia information transmission apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the sending end of Example 1 of the present application;
FIG. 6 is a schematic diagram of packing description information in the RTP protocol in this embodiment;
FIG. 7 is a schematic diagram of the encapsulation structure of the description information content in this embodiment;
FIG. 8 is a schematic diagram of the receiving end of Example 1 of the present application.
Detailed Description
The present application will be described in detail below with reference to the drawings and in combination with embodiments. It should be noted that, where there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other.
It should be noted that the terms "first", "second", and the like in the specification, claims, and above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
Embodiment 1
The embodiments of the present application may run on the network architecture shown in FIG. 1. FIG. 1 is a network architecture diagram of an embodiment of the present application. As shown in FIG. 1, the network architecture includes a first device and a second device, where the first device and the second device interact with each other.
This embodiment provides a multimedia information transmission method running on the above network architecture. FIG. 2 is a flowchart of multimedia information transmission according to an embodiment of the present application. As shown in FIG. 2, the flow includes steps S202 and S204.
In step S202, multimedia data and description information are acquired in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
In step S204, the multimedia data and the description information are transmitted from the first device to the second device.
Through the above steps, when the multimedia data is transmitted, the description information describing the multimedia data and the environment of the first device when recording it is also transmitted to the second device, so that the second device can present the content related to the description information while playing the multimedia content. This avoids the situation in the related art where description information cannot be transmitted together with multimedia information, realizes the mutual integration of multimedia content and environment, realizes the somatosensory interaction of the communicating parties, and presents an immersive experience.
In an embodiment, the execution body of the above steps may be a terminal, such as a mobile phone or a virtual reality (VR) terminal, but is not limited thereto.
The multimedia data in this embodiment includes audio data, video data, and the like.
In an embodiment, transmitting the multimedia data and the description information from the first device to the second device may be, but is not limited to being, performed in the following manners.
The multimedia data is transmitted to the second device in a first channel, and the description information is transmitted to the second device in a second channel; the first channel may be a multimedia data transmission channel. Alternatively, the multimedia data and the description information are stored in a first file, and the multimedia data and the description information are obtained by reading the first file, realizing non-real-time transmission to the second device. The first file is optional: non-real-time applications such as encoders and cameras can store the data as a file, while real-time communication services do not need to store to a file. The file serves to implement a recording function, so that non-real-time VR video can be played; when playing, the first file is read, and then the multimedia data and the description information are transmitted using the first channel and the second channel. The first channel and the second channel are established between the first device and the second device.
In an embodiment, the second channel is an independent channel, or the second channel may also be merged into the first channel for transmission, the second channel being a tunnel within the first channel.
In an embodiment, transmitting the multimedia data and the description information from the first device to the second device may be, but is not limited to being, one of the following manners:
transmitting the multimedia data to the second device in real time so that the multimedia data is displayed on the second device in real time, and sending the description information to the second device in real time; or transmitting the multimedia data and the description information to the second device in real time so that the multimedia data and the description information are displayed on the second device in real time.
In an embodiment, before transmitting the multimedia data and the description information from the first device to the second device, the method further includes: compressing and encoding the multimedia data and packing it into a first data packet, and packing the description information into a second data packet.
In an embodiment, the description information of this embodiment includes at least one item. By way of example, the description information includes a timestamp, a duration, coordinates, identification information of an object, and description content.
The timestamp is used to describe the recording time of the multimedia data.
The duration is used to describe the duration of the multimedia data from the initial time to the current time.
The coordinates are used to describe the recording position of the multimedia data.
The identification information of the object is used to identify an object in the picture corresponding to the multimedia data; the object may be a person, scenery, or the like.
The description content is used to describe at least one of the environment in the picture corresponding to the multimedia data and the information obtained by analyzing the multimedia data.
For example, the description content includes at least one of the following: the text content of speech-recognized audio data, the language of the audio data, the intonation of the audio data, the emotion of an object in the video image, the physical features of an object in the video image, the action of an object in the video image, the force of an object in the video image, the wind force in the environment of the picture corresponding to the video image, the wind direction in the picture corresponding to the video image, the temperature in the environment of the picture corresponding to the multimedia data, the taste in the environment of the picture corresponding to the video image, the smell in the environment of the picture corresponding to the video image, and the touch of an object in the video image, wherein the multimedia data includes video data and audio data.
The solution of this embodiment can be applied in different scenarios using different transmission protocols, including: transmitting the multimedia data and the description information from the first device to the second device using the Real-time Transport Protocol (RTP); transmitting the description information from the first device to the second device using the Session Initiation Protocol (SIP); transmitting the description information from the first device to the second device using the Real Time Streaming Protocol (RTSP); and transmitting the multimedia data and the description information from the first device to the second device using a custom transport protocol.
In an embodiment, acquiring the multimedia data and the description information in the first device includes steps S11 and S12.
In step S11, data is collected through multiple sensors. The multiple sensors respectively collect the raw data of the multimedia data and of the description information, and include at least one of the following: a motion sensor, an environment sensor, a lidar, a millimeter-wave radar, a scent sensor, and a sensing glove. Sensors (such as cameras and microphones) may be built into the first device or externally connected to it, with the first device aggregating and processing the data.
In step S12, the data is analyzed and processed in the first device, and the multimedia data and the description information are extracted. The description information is obtained by analyzing and processing the raw data collected by the various sensors, for example, temperature-related description information is obtained through a temperature sensor; it may also be obtained by analyzing and processing the raw audio and video data, for example, emotion-related description information obtained by analyzing the facial expressions of persons in the video picture.
This embodiment also provides another multimedia information transmission method running on the above network architecture, on the second device at the receiving end. As shown in FIG. 2, the flow includes steps S302, S304, and S306.
In step S302, the multimedia data and the description information sent by the first device are received on the second device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
In step S304, the multimedia data and the description information are parsed to obtain first content and second content respectively.
In step S306, the second content is presented while the first content is played.
In an embodiment, receiving on the second device the multimedia data and the description information sent by the first device includes: receiving, in a first channel of the second device, the multimedia data sent by the first device, and receiving, in a second channel, the description information sent by the first device, wherein the first channel and the second channel are established between the first device and the second device. The multimedia data and the description information may also be obtained by reading the first file and then transmitted using the first channel and the second channel, realizing non-real-time transmission.
In this embodiment, presenting the second content while playing the first content includes: playing the first content on at least one third device and presenting the second content on at least one fourth device. The first content is multimedia content, including video content, audio content, and the like, and can be played through a display screen or a loudspeaker; the second content is presented or simulated through a corresponding presentation terminal, for example, a timestamp or duration is displayed on a display screen, temperature is simulated by a cooling or heating device, a smell is presented by releasing a specific scent, and force is presented by a driving device.
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium, such as a read-only memory/random-access memory (ROM/RAM), a magnetic disk, or an optical disc, and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
Embodiment 2
This embodiment further provides a multimedia information transmission apparatus, which is configured to implement the above embodiments and exemplary implementations; what has already been described will not be repeated. As used below, the term "module" may be a combination of at least one of software and hardware that implements a predetermined function. Although the apparatus described in the following embodiments may be implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and conceived.
FIG. 3 is a structural block diagram of a multimedia information transmission apparatus according to an embodiment of the present application. As shown in FIG. 3, applied in the first device, the apparatus includes an acquisition module 30 and a transmission module 32.
The acquisition module 30 is configured to acquire multimedia data and description information in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
The transmission module 32 is configured to transmit the multimedia data and the description information from the first device to the second device.
FIG. 4 is a structural block diagram of another multimedia information transmission apparatus according to an embodiment of the present application. As shown in FIG. 4, applied in the second device, the apparatus includes a receiving module 40, a parsing module 42, and an output module 44.
The receiving module 40 is configured to receive, on the second device, the multimedia data and the description information sent by the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
The parsing module 42 is configured to parse the multimedia data and the description information to obtain first content and second content respectively.
The output module 44 is configured to present the second content while playing the first content.
This embodiment also provides a terminal that combines the functional modules included in the above first device and second device and can implement the functions of both.
It should be noted that each of the above modules may be implemented by software or hardware. For the latter, this may be achieved in the following way, but is not limited thereto: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
Embodiment 3
This embodiment explains and illustrates the solution of the present application in detail with specific examples, and may serve as a supplement or extension of the present application.
The video communication method provided by this embodiment can bring an immersive experience. The present application provides an immersive interactive video communication method, which includes: the originating end, corresponding to the first device, analyzes and processes the raw data collected by the various sensors, cameras, and microphones, extracts and packs the description information, compresses, encodes, and packs the raw audio and video data, transmits the description information in the description channel, and transmits the encoded audio and video data in the media channel; the receiving end, corresponding to the second device, receives the description information and the encoded audio and video data, and after unpacking, decodes the encoded audio and video data, analyzes and processes the description data, and, through the presentation unit, synchronously plays the audio and video and presents the actions and the environment.
In the above process, if the data is not transmitted in the media channel or the description channel, the description information and the encoded audio and video data may be stored in a file, and the receiving end extracts the encoded audio and video data and the description information from the file for presentation.
The various sensors include motion sensors, environment sensors, lidar, millimeter-wave radar, and the like; the camera may be an array of at least one camera, and the microphone may be an array of at least one microphone.
The description information is information obtained by analyzing and processing the raw data collected by the various sensors, and information obtained by analyzing and processing the raw audio and video data. The description information includes: a timestamp, a duration, coordinates, the number of the corresponding person or object, and specific content. The specific content in the description information includes, but is not limited to: the text content after speech recognition, language, intonation, emotion, physical features, action, force, wind force, wind direction, temperature, and taste. The timestamp refers to the time at which the description information occurred, the duration gives how long the description information lasts, the coordinates give the position at which the description information was collected, and the number of the corresponding person or object gives the object from which the description information originates. Since the audio and video data also carry timestamps when packed, the receiving end realizes synchronized presentation of the description information and the audio and video according to the description information timestamps and the audio/video timestamps, and can present the description information at the same coordinates, time, and object as the sending end.
The description channel may be an extension within the RTP, SIP, or RTSP transport protocols, or another transport protocol.
The present application also provides an immersive interactive video communication system and apparatus, including a sending apparatus and a receiving apparatus.
The sending apparatus performs the following processing steps: the collection unit collects raw data through various sensors, cameras, and microphones; the description information generation unit analyzes and processes the raw data and extracts the description information; the encoding unit performs video compression encoding on the raw video data and audio compression encoding on the raw audio data; the packing unit packs the description information and also packs the encoded video data and the encoded audio data separately; the sending unit transmits the description information in the description channel and the encoded video and audio data in the media channel; if the data is not transmitted over the network, the storage unit stores the description information and the encoded audio and video data in a file.
Collection unit: collects raw data through cameras, microphones, and various sensors, the sensors including motion sensors, environment sensors, lidar, millimeter-wave radar, and the like; the camera may be an array of at least one camera, and the microphone may be an array of at least one microphone.
Description information generation unit: extracts action, environment, and audio/video description information from the raw data collected by the various sensors and from the raw audio and video data. The description information includes: a timestamp, a duration, coordinates, the number of the corresponding person or object, and specific content. The specific content in the description information includes, but is not limited to: the text content after speech recognition, language, intonation, emotion, physical features, action, force, wind force, wind direction, temperature, and taste. The timestamp refers to the time at which the description information occurred, the duration gives how long the description information lasts, the coordinates give the position at which the description information was collected, and the number of the corresponding person or object gives the object from which the description information originates.
Encoding unit: compresses and encodes the collected raw audio and video data.
Packing unit: packs the encoded audio and video data into audio/video encoded data packets, and packs the description information into description information packets. The description information and the audio/video encoded data packets are each packed and encapsulated in their own formats; the examples provide possible packing formats, but the specific packing format of the audio/video description information is not limited by the present application.
Sending unit: sends the description information packets in the description channel and the audio/video encoded data packets in the media channel. The description channel may be an extension within the RTP or SIP transport protocols, or a separately established transmission channel.
Storage unit: stores the encoded audio and video data and the description information in a file.
The receiving apparatus performs the following processing steps: the receiving unit receives the description information packets and the audio/video encoded data packets from the description channel and the media channel, or reads the description information and the encoded audio and video data from a file; the unpacking unit unpacks the audio frame encoded data packets, the video encoded data packets, and the description information packets; the decoding unit decodes the encoded audio data and the encoded video data to obtain the raw audio and video data; the description information processing unit processes the description information and processes the raw audio and video data; and the presentation unit synchronously plays the processed raw audio and video data and synchronously presents the actions and the environment.
Receiving unit: receives the audio/video encoded data packets from the media channel and the description information packets from the description channel; the description channel may be an extension within the RTP or SIP transport protocols, or a separately established transmission channel. Alternatively, it reads the audio/video encoded data packets and the description information packets from a file.
Unpacking unit: parses the audio/video encoded data packets and the description information packets to obtain the encoded speech data, the encoded video data, and the description information. The description information includes: a timestamp, a duration, coordinates, the number of the corresponding person or object, and specific content. The specific content in the description information includes, but is not limited to: the text content after speech recognition, language, intonation, emotion, physical features, action, force, wind force, wind direction, temperature, and taste.
Decoding unit: decodes the encoded speech data and the encoded video data to obtain playable raw audio and video data.
Description information processing unit: analyzes and processes the audio/video description information according to the description information, restoring the different types of audio/video description information.
Presentation unit: presents the audio and video according to the timestamps corresponding to the audio and video data, and presents the description information according to the description information timestamps, realizing synchronized presentation of the description information and the audio and video. At the same time, it controls the length of presentation according to the duration of the description information, and presents the description information at different positions and on different objects according to the description information coordinates and objects.
This embodiment further includes the following examples:
The first four examples, in combination with different application scenarios, give methods of implementing the description information in different ways: through the transport protocol RTP, MP4 file storage, and the signaling control protocols SIP and RTSP. The fifth example gives an implementation of extracting description information, and the sixth example gives an implementation of synchronously presenting audio/video with the environment and action behaviors.
Example 1: live streaming scenario
This includes a sending apparatus and a receiving apparatus; the sending apparatus and the receiving apparatus first establish a call connection through the SIP protocol or the RTSP protocol.
The sending apparatus is shown in FIG. 5, a schematic diagram of the sending end of Example 1 of the present application, and operates in steps 1 to 5.
In step 1, raw data is collected through motion, environment, and other sensors, the sensors including gloves, clothing, hats, shoes, and temperature, taste, and wind sensors. Raw video data is collected with a camera array, and raw audio data is collected with a microphone array.
In step 2, the data collected in step 1 is processed and analyzed: the video is stitched and denoised, objects of interest in the video are identified, and their features and other information are extracted; the raw speech data is processed to remove noise and echo, and the coordinates of the speech, the speech-to-text conversion, the intonation, and the emotion of the speech are extracted; action and environment description information is extracted from the motion, environment, and other sensors. The description information includes: a timestamp, a duration, coordinates, the number of the corresponding person or object, and specific content. The specific content in the description information includes, but is not limited to: the text content after speech recognition, language, intonation, emotion, physical features, action, force, wind force, wind direction, temperature, and taste. The timestamp refers to the time at which the description information occurred, the duration gives how long the description information lasts, the coordinates give the position at which the description information was collected, and the number of the corresponding person or object gives the object from which the description information originates. The description information is defined and quantized into levels; each piece of description information further includes: a description information type, a description information number, and a termination flag. When description information with a given description information number appears for the first time, the termination flag is set to 1; when description information with the same description information number appears for the last time, the termination flag is set to 0.
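As a non-limiting illustration, one piece of description information as enumerated above could be modeled as follows; the field types and the sample values are assumptions, and the content codes of Table 1 below are published only as an image, so no specific codes are reproduced here.

```python
from dataclasses import dataclass

@dataclass
class DescriptionInfo:
    timestamp: int          # time at which the described event occurred
    number: int             # description information number
    info_type: int          # description information type (a content code per Table 1)
    coords: tuple           # position at which the information was collected
    object_id: int          # number of the corresponding person or object
    duration_ms: int        # how long the information lasts
    termination_flag: int   # 1 on the first occurrence of a number, 0 on the last
    content: bytes          # the specific content, e.g. a quantized force level

handshake = DescriptionInfo(
    timestamp=90000, number=1, info_type=1, coords=(0.5, 1.2, 0.0),
    object_id=7, duration_ms=1500, termination_flag=1, content=b"\x03",
)
print(handshake)
```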
Table 1 gives the content names, content codes, and explanations of the description information; the description information can be extended to include more content.
Table 1
(The body of Table 1 appears only as an image in the original publication and is not reproduced here.)
In step 3, the video is compression-encoded using the H.265 encoding algorithm, and the audio is compression-encoded using Adaptive Multi-Rate Wideband (AMR-WB), yielding the encoded audio and encoded video data.
In step 4, the encoded video data is packed and encapsulated according to RFC 7798 (Request for Comments 7798), and the encoded speech data is packed and encapsulated according to RFC 4867.
The audio/video description information is packed and encapsulated in RTP. The RTP timestamp information realizes the synchronized presentation of the description information and the audio/video streams; each RTP packet contains the description information corresponding to the same time slice. The specific packing format is shown in FIG. 6, a schematic diagram of packing description information in the RTP protocol in this embodiment. The fields of the RTP header are encapsulated according to RFC 3550, and the timestamp field of the description information packets is kept consistent with the clock frequency of the audio and video media packets respectively, to facilitate synchronization at the receiving end during presentation. The RTP payload contains at least one piece of description information, and each piece of description data corresponds to one indication block in the RTP payload header, comprising an F bit and a description information length. The F bit indicates whether the current data is the last piece of description information: if it is, the F bit is 1; if not, the F bit is 0. The length of description information 1 indicates the length of the first piece of description information, and the length of description information N indicates the length of the N-th piece of description information, in bytes.
FIG. 7 is a schematic diagram of the encapsulation structure of the description information content in this embodiment. As shown in FIG. 7, a piece of description information may include multiple content items, and each content item may in turn include: a content code, a G bit, a content length, and a content value. For the content codes, see Table 1. The G bit indicates whether the current content item is the last one: if it is, the G bit is 1; if not, the G bit is 0. If a content value is too long, the same piece of description information may be split across multiple RTP packets for transmission: with the G bit set to 0, the next RTP packet continues to transmit the same content value; when the G bit is set to 1, no subsequent packets carry the same content. The content length is the length of the content value, in bytes.
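A hedged sketch of this two-level encapsulation follows; since FIG. 6 and FIG. 7 are not reproduced in this text, the field widths chosen here (a 1-byte content code, a 1-byte flag, and 2-byte length fields) are assumptions for illustration only.

```python
import struct

def pack_content_items(items):
    """items: list of (content_code, content_value) pairs making up one
    piece of description information (the FIG. 7 layer)."""
    out = bytearray()
    for i, (code, value) in enumerate(items):
        g_bit = 1 if i == len(items) - 1 else 0   # 1 marks the last item
        out += struct.pack("!BBH", code, g_bit, len(value))
        out += value
    return bytes(out)

def pack_rtp_payload(descriptions):
    """descriptions: already-serialized description infos; prefix each with
    its indication block (F bit + length), as in FIG. 6."""
    out = bytearray()
    for i, desc in enumerate(descriptions):
        f_bit = 1 if i == len(descriptions) - 1 else 0
        out += struct.pack("!BH", f_bit, len(desc))
        out += desc
    return bytes(out)

desc = pack_content_items([(0x01, b"\x05"), (0x02, b"rose")])
payload = pack_rtp_payload([desc])
print(payload.hex())
```

The RTP fixed header itself would be prepended per RFC 3550 and is omitted here.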
In addition, information such as the PT value of the RTP header corresponding to the audio/video description information stream can be described in the SDP.
In step 5, the packed and encapsulated description information and audio/video encoded data packets are sent.
The receiving end is shown in FIG. 8, a schematic diagram of the receiving end of Example 1 of the present application, and operates in steps 1 to 5.
In step 1, network data packets are received from the corresponding ports, yielding the description information packets and the audio/video encoded data packets.
In step 2, the packet contents are parsed according to the corresponding formats, yielding the description information and the encoded audio/video data.
In step 3, the encoded audio/video frame data is decoded, yielding playable audio/video data.
In step 4, the description information is analyzed and processed to restore the different types of description information, which are used to control different peripherals for presentation; for example, smell information is used to control a smell generating device to synthesize the specified smell. The audio and video are analyzed and processed to restore audio/video data of different viewing angles.
In step 5, the audio/video data is played synchronously according to its corresponding timestamps, and the description information is presented according to the description information timestamps, realizing synchronized presentation of the description information and the audio/video. At the same time, the length of presentation is controlled according to the duration of the description information, and, at the different positions and objects given by the description information coordinates and objects, the peripherals are controlled according to the description information to present the corresponding action behaviors and environment. According to the description information, subtitles in different languages can be inserted into the played video, the viewing angle can be switched according to the direction of the speech, and so on.
Example 2: video-on-demand scenario
This includes an encoder, a streaming media server, and a player, and operates in steps 1 to 7.
In step 1, the encoder collects the description data and the audio/video data, and analyzes, processes, and encodes them. For the detailed steps, refer to steps 1 to 3 of Example 1.
In step 2, the encoder stores the description information and the encoded audio/video data in an MP4 file.
An audio track and a video track are established for the audio and video data respectively, and a text track is established for the description information.
The description information data is placed in the media data box (mdat box); it includes the coordinates, the number of the corresponding person or object, the duration, the description information type, the description information number, the termination flag, and the specific content.
The timestamp information of the description information is placed in the file header through the duration field.
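The following simplified Python sketch illustrates, under stated assumptions, how description samples and their durations might be prepared for such a text track; the JSON sample encoding and the length prefix are illustrative choices only, and a real muxer would additionally write the moov/trak/stbl boxes (for example, stts carrying the per-sample durations).

```python
import json
import struct

# Illustrative description samples destined for the text track.
samples = [
    {"coords": [0.5, 1.2], "object_id": 7, "duration": 1500,
     "type": 1, "number": 1, "terminated": 0, "content": "handshake"},
]

mdat_payload = bytearray()
durations = []
for s in samples:
    body = json.dumps(s).encode("utf-8")                  # text-track sample body
    mdat_payload += struct.pack("!H", len(body)) + body   # length-prefixed sample
    durations.append(s["duration"])                       # recorded in the header

print(len(mdat_payload), durations)
```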
It should be noted that an implementable file format is provided here; the file format stored by the server is not limited to MP4.
In step 3, the MP4 file is transferred to the streaming media server.
In step 4, the player sends a Hypertext Transfer Protocol GET (HTTP GET) request to the streaming media server, downloads the moov atom in the MP4 file, and parses the index information.
In step 5, the player sends a GET request with a Range field, specifying playback of a particular portion of the MP4 file.
In step 6, the player reads the encoded audio/video content and the description information in the mdat box via the index information; the audio/video frames are decoded and analyzed, and the description information is analyzed and processed to restore the different types of description information.
In step 7, the player plays the audio/video data synchronously and at the same time indexes the description information with the same timestamp in the MP4 file, realizing synchronized presentation of the description information and the audio/video. According to the number and coordinates of the person or object in the description information, the corresponding action is applied to the specified object; at the same time, the length of presentation is controlled according to the duration of the description information, and, at the different positions and objects given by the description information coordinates and objects, the peripherals are controlled according to the description information to present the corresponding action behaviors and environment.
Example 3: real-time monitoring scenario
This includes a camera and a player, and operates in steps 1 to 7.
In step 1, the player and the camera establish a connection through SIP signaling.
In step 2, the camera collects the description data and the audio/video data, and analyzes, processes, and encodes them. For the detailed steps, refer to steps 1 to 4 of Example 1.
In step 3, the camera sends the RTP packets of the audio/video data to the player.
In step 4, the camera places the description information in text form in the SIP-extended MESSAGE method and sends it to the player in a timely manner. For example:
CSeq: 1 MESSAGE
Content-Type: text/plain
Content-Length: 200
(description information text content)
The above description information text content consists of the description information coordinates, the number of the corresponding person or object, the timestamp, the duration, the description information type, the description information number, the termination flag, and the specific content. It is encapsulated in a text format, specifically in the form "name code: content value".
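A minimal sketch of composing such a MESSAGE body follows; the field names are illustrative assumptions rendered in the "name code: content value" text format described above.

```python
# One description record serialized as the text body of a SIP MESSAGE.
record = {
    "coords": "0.5,1.2", "object_id": "7", "timestamp": "90000",
    "duration": "1500", "type": "1", "number": "1",
    "terminated": "0", "content": "handshake amplitude 3",
}

body = "\r\n".join(f"{name}: {value}" for name, value in record.items())
message = (
    "CSeq: 1 MESSAGE\r\n"
    "Content-Type: text/plain\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
    f"{body}\r\n"
)
print(message)
```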
In step 5, the player receives the RTP packets of the audio/video, parses them, and decodes and analyzes the audio/video frames.
In step 6, the player receives the SIP MESSAGE, parses out the description information, and analyzes and processes it to restore the different types of description information.
In step 7, the player plays the audio/video data synchronously and at the same time looks up, among the description information received in step 6, the description information with the corresponding timestamp, realizing synchronized presentation of the description information and the audio/video; at the same time, the peripheral sensors are controlled according to the description information to present the corresponding action behaviors.
Example 4: real-time interactive communication scenario
This includes terminal A and terminal B, each containing a sending module and a receiving module. A call is first established between terminal A and terminal B through the SIP protocol; the interactive communication steps are as follows:
The sending module of terminal A analyzes and processes the raw data collected by the various sensors, cameras, and microphones, extracts and packs the description information, compresses, encodes, and packs the raw audio and video data, transmits the description information in the description channel, and transmits the encoded audio and video data in the media channel; the receiving module of terminal B receives the description information and the encoded audio and video data, and after unpacking, decodes the encoded audio and video data, analyzes and processes the description data, and, through the presentation unit, synchronously plays the audio and video and synchronously presents the actions and the environment.
At the same time, the sending module of terminal B analyzes and processes the raw data collected by the various sensors, cameras, and microphones, extracts and packs the description information, compresses, encodes, and packs the raw audio and video data, transmits the description information in the description channel, and transmits the encoded audio and video data in the media channel; the receiving module of terminal A receives the description information and the encoded audio and video data, and after unpacking, decodes the encoded audio and video data, analyzes and processes the description data, and, through the presentation unit, synchronously plays the audio and video and synchronously presents the actions and the environment.
When the user of terminal A shakes hands with the user of terminal B through the glove, the sending module of terminal A collects data such as the amplitude and force of the user's handshake action and transmits it through the description channel to the receiving module of terminal B, which synchronously presents the handshake action in real time through the glove of terminal B; the user of terminal B makes the same handshake action, and after the sending module of terminal B collects the data of the handshake action, it transmits it through the description channel to the receiving module of terminal A, which synchronously presents the handshake action in real time through the glove of terminal A. Real-time interaction between the two terminals is thus realized, just like face-to-face real-time communication.
At the same time, the sending module of terminal A collects the concentration and coordinates of the perfume smell and transmits them through the description channel to the receiving module of terminal B, which presents the same smell and concentration through the smell presentation device of terminal B; when the termination flag of the smell description information transmitted by terminal A is 0, the smell presentation ends.
Example 5:
An implementation method for extracting and generating description information in bidirectional real-time video communication is provided, including steps 1 to 5.
In step 1, the data collected by the various peripheral sensors is extracted, and the related information is recorded by time. For example: timestamp 1, sensing glove: handshake amplitude, force, duration, number of the corresponding person, and coordinates; timestamp 2, scent sensor: smell concentration, fragrance type, duration, and coordinates.
In step 2, the video content is analyzed, information of interest is extracted, and it is recorded by time. For example: timestamp 1, number of the corresponding person, coordinates, facial expression and emotion, and duration; timestamp 2, number of the corresponding person or object, coordinates, and action behavior (such as crossing a warning line).
In step 3, speech recognition is performed on the audio content, information of interest is extracted, and it is recorded by time. For example: timestamp 1, number of the corresponding person, coordinates, intonation, language, and the translated speech content.
In step 4, all the information recorded in steps 1 to 3 above is analyzed and merged by time.
In step 5, the analyzed and merged information is uniformly defined according to the format specified for the description information, for example: timestamp, description information number, description information type, coordinates, number of the corresponding person or object, duration, termination flag, and specific content.
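A minimal Python sketch of steps 4 and 5, merging the per-source records by timestamp into uniformly formatted description information, is shown below; the record layouts are the illustrative ones from steps 1 to 3.

```python
import heapq

# Per-source records as (timestamp, source, data), each list sorted by time.
glove = [(100, "glove", {"amplitude": 3, "force": 2, "person": 7,
                         "duration": 1500})]
video = [(100, "video", {"person": 7, "emotion": "smile"}),
         (200, "video", {"object": 9, "action": "crosses warning line"})]
audio = [(150, "audio", {"person": 7, "language": "zh", "text": "hello"})]

merged = []
for seq_no, (ts, source, data) in enumerate(heapq.merge(glove, video, audio)):
    merged.append({
        "timestamp": ts,
        "number": seq_no,                     # description information number
        "type": source,                       # description information type
        "duration": data.pop("duration", 0),
        "content": data,                      # the specific content
        "terminated": 0,
    })

for item in merged:
    print(item)
```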
Example 6:
A method of synchronously presenting description information and audio/video content is provided, including steps 1 to 3.
In step 1, a reference track is selected as the alignment basis for synchronization. When a media track is chosen as the reference, the playback speed of the other media tracks or of the description information is adjusted (sped up or slowed down) under the influence of the reference track. Considering human visual and auditory characteristics, during audio/video playback people are more sensitive to variable-speed playback of speech frames, while small changes in the display time of video frames or description information are hard to notice; therefore, speech is selected as the reference.
In step 2, the starting points are aligned. From the absolute playing time of the first speech frame, such as its NTP time, and the timestamp of the first speech frame, the timestamps of the video frames and of the description information aligned with it are calculated.
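The alignment computation can be illustrated as follows; the clock rates (16 kHz for AMR-WB audio, 90 kHz for video, with the description information assumed to share the video clock) and the reference NTP values are assumptions for illustration.

```python
AUDIO_CLOCK = 16000   # AMR-WB RTP clock rate (assumption)
VIDEO_CLOCK = 90000   # common video RTP clock rate (assumption)

def aligned_ts(ntp_speech_first, track_ntp_ref, track_ts_ref, track_clock):
    """Timestamp on another track whose play time coincides with the
    absolute play time of the first speech frame."""
    return track_ts_ref + round((ntp_speech_first - track_ntp_ref) * track_clock)

ntp_speech_first = 1000.0                  # absolute play time of 1st speech frame
video_ntp_ref, video_ts_ref = 999.8, 0     # video track reference mapping
desc_ntp_ref, desc_ts_ref = 999.9, 0       # description stream reference mapping

video_start = aligned_ts(ntp_speech_first, video_ntp_ref, video_ts_ref, VIDEO_CLOCK)
desc_start = aligned_ts(ntp_speech_first, desc_ntp_ref, desc_ts_ref, VIDEO_CLOCK)
print(video_start, desc_start)  # 18000 and 9000: 0.2 s and 0.1 s into those tracks
```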
In step 3, the audio and video frames are played synchronously, and the description information with the corresponding timestamps is analyzed and processed. The length of presentation is controlled according to the duration of the description information, and, at the different positions and objects given by the description information coordinates and objects, the peripherals are controlled according to the description information to present the corresponding action behaviors and environment; for example, the amplitude and force of a handshake are presented through the sensing glove, and a fragrance of the corresponding type and concentration is generated by the fragrance generating device within the time period indicated by the description information.
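As a sketch of this duration-controlled presentation, the following Python fragment starts each description item on its peripheral when its timestamp falls due and stops it after its duration; the playback clock and the peripheral calls are placeholders, not part of the specification.

```python
import heapq

# Schedule a start and a stop event for each description item.
items = [
    {"ts": 1000, "duration": 1500, "type": "handshake", "coords": (0, 1)},
    {"ts": 2000, "duration": 3000, "type": "fragrance", "coords": (2, 2)},
]
events = []  # heap of (due_time_ms, action, type, item)
for it in items:
    heapq.heappush(events, (it["ts"], "start", it["type"], it))
    heapq.heappush(events, (it["ts"] + it["duration"], "stop", it["type"], it))

playback_clock_ms = 6000  # pretend the media clock has advanced this far
while events and events[0][0] <= playback_clock_ms:
    due, action, _, it = heapq.heappop(events)
    print(f"t={due} ms: {action} {it['type']} at {it['coords']}")
```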
In summary, the embodiments of the present application realize immersive interactive video communication and improve the user experience.
Through the solution of this embodiment, an immersive video experience can be better realized in video communication: while the audio and video are presented, the actions and the environment are also presented in complete synchronization, realizing the mutual integration of audio, video, and environment, the somatosensory interaction of the communicating parties, and an immersive experience. At the same time, two-party or multi-party real-time communication is supported, and content can also be stored and then distributed, supporting various video services.
Embodiment 4
The embodiments of the present application also provide a storage medium storing a computer program, wherein the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
In an embodiment, the above storage medium may be configured to store a computer program configured to perform steps S1 and S2.
In step S1, multimedia data and description information are acquired in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data.
In step S2, the multimedia data and the description information are transmitted from the first device to the second device.
In an embodiment, the above storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present application also provide an electronic device including a memory and a processor, the memory storing a computer program, and the processor being configured to run the computer program to execute the steps in any one of the above method embodiments.
In an embodiment, the above electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
In this embodiment, the above processor may be configured to perform steps S1 and S2 through a computer program.
In step S1, multimedia data and description information are acquired in the first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data;
In step S2, the multimedia data and the description information are transmitted from the first device to the second device.
In an embodiment, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which will not be repeated here.
Those skilled in the art should understand that the above modules or steps of the present application may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices; for example, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from that here, or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

Claims (18)

  1. A multimedia information transmission method, comprising:
    acquiring multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and
    transmitting the multimedia data and the description information from the first device to a second device.
  2. The method according to claim 1, wherein transmitting the multimedia data and the description information from the first device to the second device comprises:
    transmitting the multimedia data to the second device in a first channel, and transmitting the description information to the second device in a second channel;
    wherein the first channel and the second channel are established between the first device and the second device.
  3. The method according to claim 2, wherein the second channel is an independent channel or a tunnel within the first channel.
  4. The method according to claim 1, wherein transmitting the multimedia data and the description information from the first device to the second device comprises one of the following:
    transmitting the multimedia data to the second device in real time so that the multimedia data is displayed on the second device in real time, and sending the description information to the second device in real time; or
    transmitting the multimedia data and the description information to the second device in real time so that the multimedia data and the description information are displayed on the second device in real time.
  5. The method according to claim 1, before transmitting the multimedia data and the description information from the first device to the second device, the method further comprising:
    compressing and encoding the multimedia data and packing it into a first data packet, and packing the description information into a second data packet.
  6. The method according to claim 1, wherein the description information comprises at least one of the following:
    a timestamp, wherein the timestamp is used to describe the recording time of the multimedia data;
    a duration, wherein the duration is used to describe the duration of the multimedia data from the initial time to the current time;
    coordinates, wherein the coordinates are used to describe the recording position of the multimedia data;
    identification information of an object, wherein the identification information of the object is used to identify an object in the picture corresponding to the multimedia data;
    description content, wherein the description content is used to describe at least one of the environment in the picture corresponding to the multimedia data and information obtained by analyzing the multimedia data.
  7. The method according to claim 6, wherein the description content comprises at least one of the following:
    the text content of speech-recognized audio data, the language of the audio data, the intonation of the audio data, the emotion of an object in a video image, the physical features of an object in a video image, the action of an object in a video image, the force of an object in a video image, the wind force in the environment of the picture corresponding to a video image, the wind direction in the picture corresponding to a video image, the temperature in the environment of the picture corresponding to the multimedia data, the taste in the environment of the picture corresponding to a video image, the smell in the environment of the picture corresponding to a video image, and the touch of an object in a video image, wherein the multimedia data comprises video data and audio data.
  8. The method according to claim 1, wherein transmitting the multimedia data and the description information from the first device to the second device comprises one of the following:
    transmitting the multimedia data and the description information from the first device to the second device using the Real-time Transport Protocol (RTP);
    transmitting the description information from the first device to the second device using the Session Initiation Protocol (SIP);
    transmitting the description information from the first device to the second device using the Real Time Streaming Protocol (RTSP);
    transmitting the multimedia data and the description information from the first device to the second device using a custom transport protocol.
  9. The method according to claim 1, wherein acquiring the multimedia data and the description information in the first device comprises:
    collecting data through multiple sensors;
    analyzing and processing the data in the first device, and extracting the multimedia data and the description information.
  10. The method according to claim 9, wherein the multiple sensors comprise at least one of the following: a motion sensor, an environment sensor, a lidar, a millimeter-wave radar, a scent sensor, and a sensing glove.
  11. A multimedia information transmission method, comprising:
    receiving, on a second device, multimedia data and description information sent by a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data;
    parsing the multimedia data and the description information to obtain first content and second content respectively; and
    presenting the second content while playing the first content.
  12. The method according to claim 11, wherein receiving on the second device the multimedia data and the description information sent by the first device comprises:
    receiving, in a first channel of the second device, the multimedia data sent by the first device, and receiving, in a second channel, the description information sent by the first device;
    wherein the first channel and the second channel are established between the first device and the second device.
  13. The method according to claim 11, wherein presenting the second content while playing the first content comprises:
    playing the first content on at least one third device, and presenting the second content on at least one fourth device.
  14. A multimedia information transmission apparatus, comprising:
    an acquisition module configured to acquire multimedia data and description information in a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and
    a transmission module configured to transmit the multimedia data and the description information from the first device to a second device.
  15. A multimedia information transmission apparatus, comprising:
    a receiving module configured to receive, on a second device, multimedia data and description information sent by a first device, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data;
    a parsing module configured to parse the multimedia data and the description information to obtain first content and second content respectively; and
    an output module configured to present the second content while playing the first content.
  16. A terminal, comprising a first device and a second device, wherein
    the first device comprises:
    an acquisition module configured to acquire multimedia data and description information, wherein the description information is used to describe the multimedia data and the environment of the first device when recording the multimedia data; and
    a transmission module configured to transmit the multimedia data and the description information from the first device to the second device; and
    the second device comprises:
    a receiving module configured to receive the multimedia data and the description information sent by the first device;
    a parsing module configured to parse the multimedia data and the description information to obtain first content and second content respectively; and
    an output module configured to present the second content while playing the first content.
  17. A storage medium storing a computer program, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 13.
  18. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 13.
PCT/CN2019/080876 2018-05-10 2019-04-01 Multimedia information transmission method and apparatus, and terminal WO2019214370A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP19800178.6A EP3792731A4 (en) 2018-05-10 2019-04-01 METHOD AND APPARATUS FOR TRANSMISSION OF MULTIMEDIA INFORMATION, AND TERMINAL

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810444330.8A 2018-05-10 2018-05-10 Multimedia information transmission method and apparatus, and terminal
CN201810444330.8 2018-05-10

Publications (1)

Publication Number Publication Date
WO2019214370A1 true WO2019214370A1 (zh) 2019-11-14

Family

ID=68467232

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/080876 WO2019214370A1 (zh) 2018-05-10 2019-04-01 多媒体信息的传输方法及装置、终端

Country Status (3)

Country Link
EP (1) EP3792731A4 (zh)
CN (1) CN110475159A (zh)
WO (1) WO2019214370A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111327941A (zh) * 2020-03-10 2020-06-23 腾讯科技(深圳)有限公司 Offline video playback method, apparatus, device, and medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021227580A1 (zh) * 2020-05-15 2021-11-18 Oppo广东移动通信有限公司 Information processing method, encoder, decoder, and storage medium device
CN111708902A (zh) * 2020-06-04 2020-09-25 南京晓庄学院 Multimedia data collection method
CN113473163B (zh) * 2021-05-24 2023-04-07 康键信息技术(深圳)有限公司 Data transmission method, apparatus, device, and storage medium in a live network broadcast process
CN114697303B (zh) * 2022-03-16 2023-11-03 北京金山云网络技术有限公司 Multimedia data processing method and apparatus, electronic device, and storage medium
CN114710568B (zh) * 2022-04-28 2023-12-01 中移(杭州)信息技术有限公司 Audio and video data communication method, device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150352437A1 (en) * 2014-06-09 2015-12-10 Bandai Namco Games Inc. Display control method for head mounted display (hmd) and image generation device
CN106775528A (zh) * 2016-12-12 2017-05-31 合肥华耀广告传媒有限公司 Virtual reality tourism system
CN107305435A (zh) * 2016-04-18 2017-10-31 迪斯尼企业公司 System and method for linking and interacting between augmented reality and virtual reality environments
CN107895330A (zh) * 2017-11-28 2018-04-10 特斯联(北京)科技有限公司 Tourist service platform for scene construction oriented to smart tourism

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1881667A1 (en) * 2006-07-17 2008-01-23 Motorola, Inc., A Corporation of the State of Delaware; Apparatus and method for presenting an event during a broadcast
CN103488291B (zh) * 2013-09-09 2017-05-24 北京诺亦腾科技有限公司 Immersive virtual reality system based on motion capture
CN104159203A (zh) * 2014-08-29 2014-11-19 蓝信工场(北京)科技有限公司 Method and apparatus for mixed arrangement of multimedia information in a chat dialog box
CN106411821A (zh) * 2015-07-30 2017-02-15 北京奇虎科技有限公司 Method and apparatus for receiving multimedia information based on a contact list
CN105898211A (zh) * 2015-12-21 2016-08-24 乐视致新电子科技(天津)有限公司 Method and apparatus for processing multimedia information
CN105760141B (zh) * 2016-04-05 2023-05-09 中兴通讯股份有限公司 Method for realizing multi-dimensional control, intelligent terminal, and controller
CN105929941B (zh) * 2016-04-13 2021-02-05 Oppo广东移动通信有限公司 Information processing method and apparatus, and terminal device
CN106302427B (zh) * 2016-08-09 2019-11-29 深圳市摩登世纪科技有限公司 Sharing method and apparatus in a virtual reality environment
CN107645651A (zh) * 2017-10-12 2018-01-30 北京临近空间飞艇技术开发有限公司 Augmented reality remote guidance method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150352437A1 (en) * 2014-06-09 2015-12-10 Bandai Namco Games Inc. Display control method for head mounted display (hmd) and image generation device
CN107305435A (zh) * 2016-04-18 2017-10-31 迪斯尼企业公司 System and method for linking and interacting between augmented reality and virtual reality environments
CN106775528A (zh) * 2016-12-12 2017-05-31 合肥华耀广告传媒有限公司 Virtual reality tourism system
CN107895330A (zh) * 2017-11-28 2018-04-10 特斯联(北京)科技有限公司 Tourist service platform for scene construction oriented to smart tourism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3792731A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111327941A (zh) * 2020-03-10 2020-06-23 腾讯科技(深圳)有限公司 Offline video playback method, apparatus, device, and medium

Also Published As

Publication number Publication date
EP3792731A4 (en) 2022-01-05
EP3792731A1 (en) 2021-03-17
CN110475159A (zh) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2019214370A1 (zh) Multimedia information transmission method and apparatus, and terminal
US11128893B2 (en) Live streaming method and system, server, and storage medium
JP6639602B2 (ja) Offline haptic conversion system
US20200314460A1 (en) Video stream processing method, computer device, and storage medium
CN110868600B (zh) Target-tracking video stream pushing method, display method, apparatus, and storage medium
WO2019205872A1 (zh) Video stream processing method and apparatus, computer device, and storage medium
CN105991962B (zh) Connection method, information display method, apparatus, and system
TW200418328A (en) Instant video conferencing method, system and storage medium implemented in web game using A/V synchronization technology
CN112533014B (zh) Method, apparatus, and device for processing and displaying target item information in live video streaming
CN114040255A (zh) Live-broadcast subtitle generation method, system, device, and storage medium
CN114554277B (zh) Multimedia processing method and apparatus, server, and computer-readable storage medium
US20230045876A1 (en) Video Playing Method, Apparatus, and System, and Computer Storage Medium
CN110602523A (zh) VR panoramic live-broadcast multimedia processing and synthesis system and method
CN110139128B (zh) Information processing method, interceptor, electronic device, and storage medium
WO2022116822A1 (zh) Data processing method and apparatus for immersive media, and computer-readable storage medium
US20110276662A1 (en) Method of constructing multimedia streaming file format, and method and apparatus for servicing multimedia streaming using the multimedia streaming file format
CN108124183A (zh) Method for synchronously acquiring audio and video for one-to-many audio/video streaming
WO2024179335A1 (zh) Media file processing method, apparatus, and device
US11689776B2 (en) Information processing apparatus, information processing apparatus, and program
CN115102932B (zh) Data processing method, apparatus, device, storage medium, and product for point cloud media
CN113873275B (zh) Video media data transmission method and apparatus
CN114554243B (zh) Data processing method, apparatus, device, and storage medium for point cloud media
WO2024160068A1 (zh) Haptic media data processing method and related device
CN113141536B (zh) Method and apparatus for adding a video cover, electronic device, and storage medium
KR102273795B1 (ko) System for video synchronization processing and control method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19800178

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019800178

Country of ref document: EP

Effective date: 20201210