CN105893452B - Method and device for presenting multimedia information - Google Patents


Info

Publication number
CN105893452B
CN105893452B
Authority
CN
China
Prior art keywords
information
model
time model
dimensional space
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610044505.7A
Other languages
Chinese (zh)
Other versions
CN105893452A
Inventor
冯歆鹏
周骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunshan Zhaoguan Electronic Technology Co., Ltd.
Shaoxing Zhaoguan Electronic Technology Co., Ltd.
Original Assignee
Kunshan Zhaoguan Electronic Technology Co ltd
Shaoxing Zhaoguan Electronic Technology Co ltd
NextVPU Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunshan Zhaoguan Electronic Technology Co ltd, Shaoxing Zhaoguan Electronic Technology Co ltd, NextVPU Shanghai Co Ltd filed Critical Kunshan Zhaoguan Electronic Technology Co ltd
Priority to CN201610044505.7A
Publication of CN105893452A
Priority to PCT/CN2016/111308
Priority to EP16886131.8A
Priority to JP2018534829A
Priority to US15/410,891
Application granted
Publication of CN105893452B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/438Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method and a device for presenting multimedia information. The method comprises: receiving a four-dimensional spatio-temporal model for characterizing representation information, the model having attributes capable of characterizing, in digitized form, representation information that changes over time; decoding the four-dimensional spatio-temporal model to obtain a decoded model; and presenting, according to the decoded model, the representation information characterized by the four-dimensional spatio-temporal model. Because the four-dimensional spatio-temporal model can characterize time-varying representation information in digitized form, this scheme alleviates, to a certain extent, the time delay in the presented representation information, and thereby overcomes, to a certain extent, the long-delay defect of the prior art.

Description

Method and device for presenting multimedia information
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for presenting multimedia information.
Background
With the development of communication technology, people's needs have shifted from voice alone to combined video and audio communication. Video communication services that transmit voice, data, and video have therefore become increasingly important in the communication field, and are ever more widely applied in video conferencing, remote video medical care, remote video education, and the like.
VR (Virtual Reality) technology is a highly realistic human-computer interaction technology that can simulate human visual, auditory, tactile, and other sensory behaviors. With this technology, a person can be placed in a computer-created environment, move freely and "interact and converse" in it in a natural manner through the senses, language, and gestures, observe surrounding objects at will, see the objects, hear sounds, and feel forces, producing a sense of being fully present.
However, current methods that process collected multimedia information with VR technology cannot process multimedia information collected in real time, so there is a time delay between the moment the multimedia information is presented and the moment of the real scene that the multimedia information represents.
In summary, current methods for presenting multimedia information suffer from long time delays.
Disclosure of Invention
The present invention has been made in view of the above problems, and aims to provide a method and an apparatus for presenting multimedia information that overcome, or at least partially solve, the above problems.
According to a first aspect of the present invention, there is provided a method of presenting multimedia information, comprising: receiving a four-dimensional spatiotemporal model for characterizing the appearance information, the four-dimensional spatiotemporal model having properties capable of characterizing the appearance information over time in a digitized form, the appearance information comprising electromagnetic field spectral information capable of being observed with the naked eye and/or capable of being collected with a device for characterizing an object; decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model; and according to the decoded four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model.
Optionally, in the method for presenting multimedia information according to the above embodiment of the present invention, before presenting the representation information characterized by the four-dimensional spatio-temporal model, the method further includes:
fusing the four-dimensional space-time model with a first space-time model to obtain a target four-dimensional space-time model, wherein the first space-time model is used for characterizing the appearance information of objects at the place where the device presenting the multimedia information is located;
presenting the representation information represented by the four-dimensional space-time model, specifically comprising:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
Optionally, in the method for presenting multimedia information according to any of the above embodiments of the present invention, before presenting the representation information characterized by the four-dimensional spatio-temporal model, the method further includes:
fusing the four-dimensional space-time model with a first space-time model and a second space-time model of the local terminal to obtain a target four-dimensional space-time model, wherein the first space-time model is used for characterizing the appearance information of objects at the place where the multimedia information is presented, and the second space-time model is used for characterizing the appearance information of a virtual object;
presenting the representation information represented by the four-dimensional space-time model, specifically comprising:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
Optionally, in the method for presenting multimedia information according to any of the above embodiments of the present invention, the representation information further includes sound field information that can be sensed by ears and/or collected by a device, and the four-dimensional spatio-temporal model is further used for characterizing the sound field information of an object corresponding to the representation information;
the method further comprises the following steps:
and playing the sound field information represented by the four-dimensional space-time model.
Optionally, in the method for presenting multimedia information according to any of the above embodiments of the present invention, before presenting the representation information characterized by the four-dimensional spatio-temporal model, the method further includes:
determining front orientation information of the device presenting the multimedia information;
presenting the appearance information represented by the four-dimensional space-time model, comprising:
and presenting the representation information represented by the four-dimensional space-time model according to the front orientation information.
Optionally, in the method for presenting multimedia information according to any of the above embodiments of the present invention, the method further includes:
determining front orientation information and target multimedia information of a device presenting the multimedia information;
and feeding back the front orientation information and the target multimedia information to a device for sending a four-dimensional space-time model.
According to a second aspect of the present invention, there is provided an apparatus for presenting multimedia information, comprising: a receiving unit for receiving a four-dimensional spatio-temporal model for characterizing the representation information, the four-dimensional spatio-temporal model having properties capable of characterizing the representation information over time in a digitized form, the representation information comprising electromagnetic field spectral information capable of being observed with the naked eye and/or capable of being collected with a device for characterizing an object; the four-dimensional space-time model processing unit is used for decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model; and the presentation unit is used for playing the representation information represented by the four-dimensional space-time model according to the decoded four-dimensional space-time model.
Optionally, in the apparatus for presenting multimedia information according to the above embodiment of the present invention, the apparatus further includes a model fusion unit, configured to fuse the four-dimensional spatiotemporal model with the first spatiotemporal model to obtain a target four-dimensional spatiotemporal model, where the first spatiotemporal model is used to represent the representation information of a place where the apparatus for presenting multimedia information is located;
when presenting the representation information represented by the four-dimensional spatio-temporal model, the presentation unit is specifically configured to:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
Optionally, in the apparatus for presenting multimedia information according to any of the above embodiments of the present invention, the apparatus further includes a model fusion unit, which fuses the four-dimensional spatio-temporal model with a first spatio-temporal model and a second spatio-temporal model of the apparatus for presenting multimedia information to obtain a target four-dimensional spatio-temporal model, where the first spatio-temporal model is used to represent appearance information of an object in a location where the apparatus for presenting multimedia information is located, and the second spatio-temporal model is used to represent appearance information of a virtual object;
when presenting the representation information represented by the four-dimensional space-time model, the presentation unit is specifically configured to:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
Optionally, in the apparatus for presenting multimedia information according to any of the above embodiments of the present invention, the representation information further includes sound field information that can be sensed by ears and/or collected by devices, and the four-dimensional spatio-temporal model is further used for characterizing the sound field information of an object corresponding to the representation information;
the device also comprises a playing unit used for playing the sound field information represented by the four-dimensional space-time model.
Optionally, in an apparatus for presenting multimedia information according to any of the above embodiments of the present invention, the apparatus further includes a processing unit, configured to determine front orientation information of the apparatus for presenting multimedia information;
when presenting the representation information represented by the four-dimensional spatio-temporal model, the presentation unit is specifically configured to:
and presenting the appearance information characterized by the four-dimensional space-time model according to the front orientation information.
Optionally, in an apparatus for presenting multimedia information according to any of the above embodiments of the present invention, the apparatus further includes a processing unit, configured to determine front orientation information and target multimedia information of the apparatus presenting the multimedia information;
the device also comprises a feedback unit for feeding back the front orientation information and the target multimedia information to a device for sending the four-dimensional space-time model.
The embodiment of the invention provides a method and a device for presenting multimedia information. The method comprises: receiving a four-dimensional spatio-temporal model for characterizing representation information, the model having attributes capable of characterizing, in digitized form, representation information that changes over time; decoding the four-dimensional spatio-temporal model to obtain a decoded model; and presenting, according to the decoded model, the representation information characterized by the four-dimensional spatio-temporal model. Because the four-dimensional spatio-temporal model can characterize time-varying representation information in digitized form, this scheme alleviates, to a certain extent, the time delay in the presented representation information, and thereby overcomes, to a certain extent, the long-delay defect of the prior art.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1A is a schematic flow chart of a method of presenting multimedia information according to an embodiment of the invention;
FIG. 1B is a schematic flow chart diagram of a method of presenting multimedia information, in accordance with an embodiment of the present invention;
FIG. 2A is a schematic diagram of an apparatus for presenting multimedia information according to an embodiment of the present invention;
FIG. 2B is a schematic flow chart of building a four-dimensional spatio-temporal model according to an embodiment of the present invention;
FIG. 2C is another schematic flow diagram for building a four-dimensional spatiotemporal model, according to an embodiment of the present invention;
FIG. 2D is a diagram illustrating an apparatus for processing multimedia information according to an embodiment of the present invention;
FIG. 2E is a schematic diagram of an acquisition unit according to an embodiment of the invention;
FIG. 2F is another schematic diagram of an acquisition unit according to an embodiment of the invention;
FIG. 2G is a top view of an acquisition unit according to an embodiment of the invention;
FIG. 2H is a side view of an acquisition unit according to an embodiment of the invention;
FIG. 3A is a schematic diagram of a scenario provided in accordance with an embodiment of the present invention;
FIG. 3B is a schematic diagram of another scenario provided in accordance with an embodiment of the present invention;
FIG. 3C is a schematic view of another scenario provided in accordance with an embodiment of the present invention;
FIG. 3D is a schematic diagram of another scenario provided in accordance with an embodiment of the present invention;
FIG. 3E is a schematic diagram of another scenario provided in accordance with an embodiment of the present invention;
fig. 3F is a schematic view of another scenario provided in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method and apparatus for processing multimedia information provided by the present invention can be applied to, but are not limited to, the following scenarios:
Real-time communication scenes: for example, party A transmits collected information about itself and its surrounding environment to party B in real time, and party B roams in party A's environment and interacts with party A; for another example, parties A and B each collect themselves and their surrounding environments and transmit them to each other in real time, roaming and interacting in an environment that may be the physical environment of either party or any third-party environment;
Remote observation and monitoring scenes;
Work scenes: for example, telecommuting by one or more people, telepresence meetings, telepresence collaboration, assisting a customer in solving a problem, or telepresence training;
Education scenes: for example, attending a virtual classroom in person and interacting with a teacher in a virtual environment;
Medical scenes: for example, telemedicine and interaction with doctors in a virtual environment;
Business scenes: for example, remote shopping with interaction with merchants in a virtual environment, or a full-body fitting mirror;
Sports scenes: for example, one or more people competing in a sprint against a sprint champion in a virtual environment;
Entertainment scenes: for example, single- or multi-player games in a virtual space, attending a live television broadcast as if in person, or interacting with movie characters;
Personal life scenes: for example, recording and showing a four-dimensional diary, remotely visiting a museum, remotely accompanying family members or pets, or remote adult applications;
It can also be applied to the following scenarios:
Virtual reality or augmented reality content generation scenes: including movie, television, game, and video content production;
Making a four-dimensional historical record of a specific time and place.
Fig. 1A shows a schematic flow chart of a method for presenting multimedia information according to the present invention, which includes the following specific processes:
step 100: receiving a four-dimensional spatiotemporal model for characterizing the appearance information, the four-dimensional spatiotemporal model having properties capable of characterizing the appearance information over time in a digitized form, the appearance information including electromagnetic field spectral information capable of being observed with the naked eye and/or capable of being collected with a device for characterizing an object.
In this embodiment of the present invention, the electromagnetic spectrum information described in step 100 may be transmitted by an object, may also be reflected by the object, or may also be refracted by the object, which is not specifically limited herein.
In an embodiment of the present invention, the electromagnetic spectrum information described in step 100 may include at least one of radio wave information, infrared information, visible light information, ultraviolet information, X-ray information, and gamma ray information, wherein the visible light information may include laser light.
In the embodiment of the invention, the object corresponding to the representation information can be an object with any view field size and angle indoors and/or outdoors.
In the embodiment of the invention, the four-dimensional space-time model at least comprises the following attributes in content:
spatial position attribute: the coordinate of each point on the object at any moment in a coordinate system which does not change along with time can be referred to;
appearance attribute: may refer to the texture and spectral features (e.g., color) of the surface of the object, the geometric characteristics (e.g., normal, curvature, smoothness, etc.) of the surface of the object at any time;
a sound attribute;
motion attribute: the motion velocity vector and the acceleration vector of each point on the object at any moment can be referred to; alternatively, it may also refer to angular velocity vector, angular acceleration vector of each portion on the object that can be regarded as a rigid body;
other attributes: may refer to at least one of the category, identity, material, relationship, etc. of the object, which can be inferred from the appearance information and the change of the appearance information with time.
Formally, four-dimensional spatio-temporal models exist in storage media in digitized data form that can be stored, rendered, retrieved, edited, transmitted, encrypted, and used for more advanced intelligent applications.
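As a concrete illustration of the attributes listed above, the sketch below represents the four-dimensional spatio-temporal model as a digitized data structure that can be stored, retrieved, and edited. All class and field names are illustrative assumptions, not definitions from the patent:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PointSample:
    position: Tuple[float, float, float]      # spatial position attribute
    color: Tuple[float, float, float]         # appearance attribute (spectral)
    normal: Tuple[float, float, float]        # appearance attribute (geometric)
    velocity: Tuple[float, float, float]      # motion attribute
    acceleration: Tuple[float, float, float]  # motion attribute

@dataclass
class Frame:
    timestamp: float              # the time dimension of the 4D model
    points: List[PointSample]
    sound_field: bytes = b""      # sound attribute, raw samples here
    labels: dict = field(default_factory=dict)  # other attributes (category, identity, ...)

@dataclass
class FourDModel:
    frames: List[Frame]

    def at(self, t: float) -> Frame:
        # Retrieve the frame closest in time to t (nearest-neighbour lookup).
        return min(self.frames, key=lambda f: abs(f.timestamp - t))
```

A real implementation would interpolate between frames rather than snap to the nearest one; the lookup here only illustrates that the model is indexed by time.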
Step 110: and decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model.
In the embodiment of the present invention, further, the four-dimensional spatio-temporal model received in step 100 may be compressed, and at this time, the four-dimensional spatio-temporal model may be decompressed.
Further, to improve the security of transmission, the received four-dimensional spatio-temporal model may be encrypted, and at this time, the received four-dimensional spatio-temporal model is decrypted.
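A minimal sketch of the decoding in step 110, assuming the sender serializes, compresses, and then encrypts the model, so the receiver decrypts first and then decompresses. The XOR cipher stands in for a real encryption scheme and is for illustration only:

```python
import zlib

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Placeholder for a real cipher; XOR is symmetric, so the same
    # function both encrypts and decrypts.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def encode_model(raw: bytes, key: bytes) -> bytes:
    # Sender side: compress first, then encrypt.
    return xor_cipher(zlib.compress(raw), key)

def decode_model(payload: bytes, key: bytes) -> bytes:
    # Receiver side (step 110): decrypt first, then decompress --
    # the inverse order of the sender.
    return zlib.decompress(xor_cipher(payload, key))
```

The point is the ordering: because the sender compresses before encrypting, the receiver must invert the steps in reverse order.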
Step 120: and according to the decoded four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model.
In this embodiment of the present invention, the scene at the local end of the device presenting the multimedia information may further be presented; therefore, before presenting the representation information represented by the four-dimensional spatio-temporal model, the method further includes the following operations:
fusing the four-dimensional space-time model with a first space-time model to obtain a target four-dimensional space-time model, wherein the first space-time model is used for characterizing the appearance information of objects at the place where the device presenting the multimedia information is located;
at this time, when the representation information represented by the four-dimensional spatio-temporal model is presented, optionally, the following method may be adopted:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
For example, the scene corresponding to the representation information represented by the four-dimensional spatio-temporal model is a scene at sea, and the scene corresponding to the representation information represented by the first spatio-temporal model is a scene at a desk; in this case, the presented scene may be a fused scene in which the sea lies in front of the desk.
Furthermore, the system can also detect, track, and identify human bodies and objects, so that a real physical area can be superimposed onto the virtual area. For example, an observer wearing a VR headset faces a grassland while the room the observer is in has a wall; based on object detection, the information of the real physical wall can be superimposed onto the grassland seen in the VR headset, presenting a translucent wall in the grassland. For another example, with hand detection, the gesture of a real hand may be detected and a virtual hand superimposed into the four-dimensional model. That is, some virtual scenes may also be fused; therefore, before the representation information represented by the four-dimensional spatio-temporal model is presented, the method further includes the following operations:
fusing the four-dimensional space-time model with a first space-time model and a second space-time model of the local terminal to obtain a target four-dimensional space-time model, wherein the first space-time model is used for characterizing the appearance information of objects at the place where the device presenting the multimedia information is located, and the second space-time model is used for characterizing the appearance information of a virtual object;
at this time, when the representation information represented by the four-dimensional spatio-temporal model is presented, optionally, the following manner may be adopted:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
For example, the scene corresponding to the representation information represented by the four-dimensional space-time model is a scene at sea, and the scene corresponding to the representation information represented by the first space-time model is a scene at a desk; the presented scene can then be a fused scene with the sea in front of the desk. Further, suppose a flower is to be placed on the presented desk, but no flower is actually on the desk. The flower can then be characterized by the second space-time model, and the four-dimensional space-time model is fused with the first and second space-time models at the local end to obtain the target four-dimensional space-time model. In this case, the presented scene can be one in which the sea lies in front of the desk and a flower stands on the desk.
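The fusion in the examples above can be sketched as follows, representing each space-time model as a plain mapping from object names to appearance information. The dict-based representation and all names are assumptions for illustration only:

```python
def fuse_models(four_d: dict, first: dict, second: dict = None) -> dict:
    # Fuse the received 4D model with the local first (real scene) model
    # and an optional second (virtual objects) model into a target model.
    target = dict(four_d)       # remote scene, e.g. the sea
    target.update(first)        # local scene, e.g. the desk
    if second:
        target.update(second)   # virtual objects, e.g. the flower
    return target

remote = {"sea": "waves"}
local = {"desk": "wood"}
virtual = {"flower": "rose"}
target = fuse_models(remote, local, virtual)
# target now holds the sea, the desk, and the virtual flower together
```

A real fusion would align the models in a common coordinate system and resolve occlusions; the sketch only shows that the target model is the combination of all three sources.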
In the embodiment of the invention, a scene can have not only pictures but also sounds. Therefore, the representation information further includes sound field information that can be sensed by human ears and/or collected by a device, and the four-dimensional space-time model is also used for characterizing the sound field information of the object corresponding to the representation information. At this time, the method further includes the following operation:
and playing the sound field information represented by the four-dimensional space-time model.
In the embodiment of the present invention, in order to improve the similarity between the scene corresponding to the presented representation information and the real scene, the front orientation information of the device presenting the multimedia information is taken into account when presenting the representation information represented by the four-dimensional spatio-temporal model. Therefore, before presenting that representation information, the method further includes the following operations:
determining front facing information of a device presenting multimedia information;
at this time, when the representation information represented by the four-dimensional spatio-temporal model is presented, optionally, the following method may be adopted:
and presenting the representation information represented by the four-dimensional space-time model according to the front orientation information.
In the embodiment of the present invention, when determining the front orientation information of the device that presents multimedia information, optionally, the following method may be adopted:
and performing attitude calculation on data from the inertial navigation sensors associated with the device presenting the multimedia information to obtain the front orientation information of that device.
The inertial navigation sensors can be any one of, or any combination of, a gyroscope, a magnetometer, and an accelerometer.
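As an illustration of such an attitude calculation, the sketch below fuses gyroscope and accelerometer readings with a simple complementary filter to estimate the pitch component of the front orientation. The filter and its coefficient are illustrative assumptions; the patent does not specify a particular fusion algorithm:

```python
import math

def accel_to_pitch(ax: float, ay: float, az: float) -> float:
    # Pitch angle implied by the gravity vector the accelerometer measures.
    return math.atan2(-ax, math.sqrt(ay * ay + az * az))

def complementary_filter(pitch: float, gyro_rate: float,
                         accel_pitch: float, dt: float,
                         alpha: float = 0.98) -> float:
    # Integrate the gyroscope rate for short-term accuracy and blend in the
    # accelerometer estimate to correct long-term drift.
    return alpha * (pitch + gyro_rate * dt) + (1 - alpha) * accel_pitch
```

In practice one such filter would run per axis at the sensor sampling rate, with the magnetometer supplying the yaw reference.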
In the embodiment of the present invention, the accuracy of the part the observer is interested in can be selectively improved; to this end, the method further includes the following operations:
determining front orientation information and target multimedia information of a device presenting the multimedia information;
and feeding back the front orientation information and the target multimedia information to a device for sending a four-dimensional space-time model.
For example, the scene corresponding to the presentation information includes a beach, a person, and a sailing boat. If the eyes of the user holding the device presenting the multimedia information are gazing at the person, the person is taken as the target multimedia information. Thus, when collecting appearance information, the device sending the four-dimensional space-time model may collect only the appearance information of the person and omit that of the sailing boat.
In the embodiment of the present invention, the target multimedia information may be determined by tracking the eyes of the user with a camera of the device presenting the multimedia information.
It should be noted that the first spatio-temporal model and the second spatio-temporal model described in the embodiments of the present invention may be pre-established or established in real time by the device presenting the multimedia information, or may be pre-established or established in real time by another device and then sent to the device presenting the multimedia information; this is not limited herein.
In some scenarios, only the representation information represented by the four-dimensional spatio-temporal model is presented. For example, in a telecommuting or telecommunication scenario, the device presenting the multimedia information only needs to reproduce the "real remote" scene sent by the device sending the four-dimensional spatio-temporal model, so only the representation information represented by that model is presented. In other scenarios, the representation information represented by the first space-time model or the second space-time model may be presented in addition, and virtual props may be added at the presenting end: the device presenting the multimedia information can not only reproduce the scene sent by the device sending the four-dimensional space-time model, but also add virtual props to it, for example a whiteboard drawn in mid-air at any time, or game props such as a "lightning bolt" thrown by hand to hit a stone in the scene.
In the embodiment of the present invention, further, the first annotation information and/or the second annotation information may also be presented.
In the embodiment of the present invention, four-dimensional space-time models respectively transmitted by a plurality of devices may also be received. For example, if the scene corresponding to the appearance information represented by a first four-dimensional space-time model transmitted by a first transmitting end is a temple, and the scene corresponding to the appearance information represented by a second four-dimensional space-time model transmitted by a second transmitting end is the Eiffel Tower, the temple and the Eiffel Tower may be presented side by side.
The invention provides a detailed flow for presenting a four-dimensional space-time model, as shown in figure 1B: the four-dimensional space-time model is fused with a first space-time model and a second space-time model to obtain a target four-dimensional space-time model; the front orientation information of the device presenting the multimedia information and the target multimedia information are determined; the representation information represented by the four-dimensional space-time model is presented according to the front orientation information and the target four-dimensional space-time model; and the front orientation information and the target multimedia information are fed back to the device sending the four-dimensional space-time model.
In the embodiment of the invention, a method for presenting multimedia information is disclosed: receiving a four-dimensional spatiotemporal model for characterizing the representation information, the four-dimensional spatiotemporal model having attributes capable of characterizing the representation information over time in a digitized form; decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model; and presenting the representation information represented by the four-dimensional space-time model according to the decoded four-dimensional space-time model, wherein the four-dimensional space-time model has the attribute of representing the representation information which changes along with time in a digital form, so that the problem of time delay of the presented representation information is solved to a certain extent by the scheme, and the defect of time delay in the prior art is overcome to a certain extent.
Referring to fig. 2A, an embodiment of the present invention further provides an apparatus for presenting multimedia information, including:
a receiving unit 20 for receiving a four-dimensional spatio-temporal model for characterizing the representation information, the four-dimensional spatio-temporal model having properties capable of characterizing the representation information over time in a digitized form, the representation information comprising electromagnetic field spectral information capable of being observed with the naked eye and/or capable of being collected with a device for characterizing an object;
a four-dimensional space-time model processing unit 21, configured to perform decoding processing on the four-dimensional space-time model to obtain a decoded four-dimensional space-time model;
and the presentation unit 22 is used for playing the representation information represented by the four-dimensional space-time model according to the decoded four-dimensional space-time model.
In the embodiment of the present invention, further, the four-dimensional spatio-temporal model received by the receiving unit 20 may have been compressed; in that case, the four-dimensional spatio-temporal model is also decompressed.
Further, in order to improve the security of transmission, the four-dimensional spatio-temporal model received by the receiving unit 20 may have been encrypted; in that case, the received four-dimensional spatio-temporal model is decrypted.
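The receiver thus undoes the sender's steps in reverse: the sender encodes, then compresses, then encrypts; the receiver decrypts, then decompresses, then decodes. The toy sketch below shows only this ordering, using zlib for compression and a XOR stand-in for the cipher; the patent does not prescribe any particular algorithm, and a real deployment would use a proper authenticated cipher such as AES-GCM.

```python
import zlib

KEY = 0x5A  # hypothetical demo key; a real system would negotiate real keys

def xor_cipher(data: bytes) -> bytes:
    # Toy stand-in for encryption/decryption (XOR is its own inverse).
    return bytes(b ^ KEY for b in data)

def send_model(model_bytes: bytes) -> bytes:
    # Sender side: encode (raw bytes here) -> compress -> encrypt.
    return xor_cipher(zlib.compress(model_bytes))

def receive_model(payload: bytes) -> bytes:
    # Receiver side (units 20/21): decrypt -> decompress -> decode,
    # i.e. the sender's operations applied in reverse order.
    return zlib.decompress(xor_cipher(payload))
```

A round trip through `send_model` and `receive_model` reproduces the original model bytes exactly.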
In the embodiment of the present invention, further, a scene at the end of the device for presenting multimedia information may be presented, so that the device further includes a model fusion unit 23, configured to fuse the four-dimensional spatio-temporal model with the first spatio-temporal model to obtain a target four-dimensional spatio-temporal model, where the first spatio-temporal model is used to represent appearance information of a place where the device for presenting multimedia information is located;
at this time, when the presenting unit 22 presents the representation information represented by the four-dimensional spatio-temporal model, optionally, the following method may also be adopted:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
For example, the scene corresponding to the representation information represented by the four-dimensional spatio-temporal model is a scene at sea, and the scene corresponding to the representation information represented by the first spatio-temporal model is a scene at a desk; in this case, the scene presented by the presentation unit 22 may be a merged scene in which the area in front of the desk is at sea.
Furthermore, the system can also detect, track, and identify human bodies and objects, so that a real physical area can be superimposed onto a virtual area. For example, if an observer wearing a VR headset faces a grassland while the room the observer is in has a wall, object detection can superimpose the real physical wall onto the grassland shown in the headset, presenting a translucent wall in the grassland. As another example, hand detection can capture the gestures of a real hand and then superimpose virtual hands into the four-dimensional model; that is, virtual scenes may also be fused in. The apparatus therefore further includes a model fusion unit 23 configured to fuse the four-dimensional spatio-temporal model with a first spatio-temporal model and a second spatio-temporal model of the apparatus presenting the multimedia information, so as to obtain a target four-dimensional spatio-temporal model, where the first spatio-temporal model is used to represent the representation information of the place where the apparatus presenting the multimedia information is located, and the second spatio-temporal model is used to represent the representation information of a virtual object;
at this time, when the presentation unit 22 presents the representation information represented by the four-dimensional spatio-temporal model, optionally, the following method may be adopted:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
For example, the scene corresponding to the representation information represented by the four-dimensional space-time model is a scene at sea, and the scene corresponding to the representation information represented by the first space-time model is a scene at a desk; the scene presented by the presentation unit 22 may then be a fused scene in which the area in front of the desk is at sea. Suppose further that a flower is desired on the presented desk, but no flower is actually present there. The flower can then be represented by the second space-time model, and the four-dimensional space-time model is fused with the first space-time model and the second space-time model at the local end to obtain the target four-dimensional space-time model; in this case, the scene presented by the presentation unit 22 may be one in which the area in front of the desk is at sea and a flower stands on the desk.
In the embodiment of the invention, the scene can have not only pictures but also sounds, therefore, the representation information also comprises sound field information which can be sensed by ears and/or collected by equipment, and the four-dimensional space-time model is also used for representing the sound field information of an object corresponding to the representation information;
at this time, the apparatus further includes a playing unit 24 for playing the sound field information characterized by the four-dimensional space-time model.
In the embodiment of the present invention, in order to improve the similarity between the scene corresponding to the presented representation information and the real scene, when presenting the representation information represented by the four-dimensional spatio-temporal model, the presenting unit 22 refers to the front orientation information of the apparatus for presenting multimedia information, and therefore, the apparatus further includes a processing unit 25 for determining the front orientation information of the apparatus for presenting multimedia information;
when the presentation unit 22 presents the representation information represented by the four-dimensional spatio-temporal model, optionally, the following method may be adopted:
and presenting the representation information represented by the four-dimensional space-time model according to the front orientation information.
In this embodiment of the present invention, when the processing unit 25 determines the front orientation information of the device presenting multimedia information, optionally, the following manner may be adopted:
and performing attitude calculation on an inertial navigation unit associated with the device presenting the multimedia information, to obtain the front orientation information of the device presenting the multimedia information.
The inertial navigation unit can be a gyroscope, a magnetometer, an accelerometer, or any combination of these.
In the embodiment of the present invention, the fidelity of the part of the scene that interests the observer can be selectively improved; to that end, the apparatus further includes a processing unit 25 for determining the front orientation information and the target multimedia information of the apparatus presenting the multimedia information;
the apparatus further comprises a feedback unit 26 for feeding back the front orientation information and the target multimedia information to the apparatus for transmitting the four-dimensional spatio-temporal model.
For example, suppose the scene corresponding to the representation information includes a beach, a person, and a sailing boat. If the eyes of the user holding the device presenting the multimedia information are gazing at the person, the person is regarded as the target multimedia information. The apparatus for transmitting the four-dimensional space-time model may then acquire only the appearance information of the person, omitting the appearance information of the sailing boat.
In the embodiment of the present invention, when the processing unit 25 determines the target multimedia information, the determination may be based on tracking the eyes of the user with a camera of the device presenting the multimedia information.
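One plausible way to turn an estimated gaze direction into target multimedia information is to pick the candidate object whose direction from the viewer makes the smallest angle with the gaze ray. The sketch below assumes the viewer sits at the origin; all names are illustrative, not from the patent.

```python
import math

def pick_gaze_target(gaze_dir, objects):
    """Return the name of the object whose direction from the viewer
    makes the smallest angle with the gaze ray.

    gaze_dir: (x, y, z) gaze direction vector
    objects:  dict mapping object name -> (x, y, z) position
    """
    def angle(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        # Clamp to guard against floating-point drift outside [-1, 1].
        return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))
    return min(objects, key=lambda name: angle(gaze_dir, objects[name]))
```

With the beach example above, a gaze straight ahead at the person selects the person as the target multimedia information rather than the sailing boat off to the side.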
It should be noted that the first spatio-temporal model and the second spatio-temporal model described in the embodiments of the present invention may be pre-established or established in real time by the device presenting the multimedia information, or may be pre-established or established in real time by another device and then sent to the device presenting the multimedia information; this is not limited herein.
In some scenarios, the presentation unit 22 may present only the representation information represented by the four-dimensional spatio-temporal model. For example, in a telecommuting or telecommunication scenario, the device presenting the multimedia information only needs to reproduce the "real remote" scene sent by the device sending the four-dimensional spatio-temporal model, so only the representation information represented by that model is presented. In other scenarios, the presentation unit 22 may additionally present the representation information represented by the first space-time model or the second space-time model, and virtual props may be added at the presenting end: the device presenting the multimedia information can not only reproduce the scene sent by the device sending the four-dimensional space-time model, but also add virtual props to it, for example a whiteboard drawn in mid-air with one hand, or game props such as a "lightning bolt" thrown by hand to hit a stone in the scene.
In this embodiment of the present invention, the receiving unit 20 may also receive four-dimensional space-time models respectively sent by multiple devices. For example, if the scene corresponding to the appearance information represented by a first four-dimensional space-time model sent by a first sending end is a temple, and the scene corresponding to the appearance information represented by a second four-dimensional space-time model sent by a second sending end is the Eiffel Tower, the temple and the Eiffel Tower may be presented side by side.
In the embodiment of the invention, a device for presenting multimedia information is disclosed: a receiving unit 20 for receiving a four-dimensional spatio-temporal model for characterizing the representation information, the four-dimensional spatio-temporal model having properties capable of characterizing the representation information over time in a digitized form, the representation information comprising electromagnetic field spectral information capable of being observed with the naked eye and/or capable of being collected with a device for characterizing an object; a four-dimensional space-time model processing unit 21, configured to perform decoding processing on the four-dimensional space-time model to obtain a decoded four-dimensional space-time model; and the presentation unit 22 is configured to play the representation information represented by the four-dimensional spatio-temporal model according to the decoded four-dimensional spatio-temporal model, in this scheme, the four-dimensional spatio-temporal model has an attribute capable of representing the representation information changing with time in a digitized form, so that the problem that the presented representation information has a time delay is solved to a certain extent, and therefore, the defect of time delay in the prior art is solved.
Referring to fig. 2B, in the embodiment of the present invention, the received four-dimensional spatio-temporal model may be established as follows:
step 200: acquiring appearance information, wherein the appearance information comprises electromagnetic field spectrum information which can be observed by naked eyes and/or collected by equipment and is used for characterizing an object;
step 210: establishing a four-dimensional space-time model for representing the representation information according to the acquired representation information, wherein the four-dimensional space-time model has the attribute of representing the representation information which changes along with time in a digital form;
step 220: and carrying out coding processing on the established four-dimensional space-time model, and sending the four-dimensional space-time model after the coding processing.
The electromagnetic field spectrum information described in the embodiments of the present invention may be transmitted by an object, may also be reflected by an object, or may also be refracted by an object, and is not particularly limited herein.
The electromagnetic field spectrum information described in the embodiments of the present invention may include at least one of radio wave information, infrared ray information, visible light information, ultraviolet ray information, X-ray information, and gamma ray information, wherein the visible light information may include laser light.
In the embodiment of the invention, the object corresponding to the representation information can be an object with any view field size and angle indoors and/or outdoors.
In the embodiment of the invention, the representation information may be acquired at a rate of 24 to 120 frames per second.
In the embodiment of the present invention, the obtained representation information may be representation information obtained at different spatial points and/or different time points.
In the embodiment of the invention, the four-dimensional space-time model at least comprises the following attributes in content:
spatial position attribute: the coordinate of each point on the object at any moment in a coordinate system which does not change along with time can be referred to;
appearance attribute: may refer to the texture and spectral features (e.g., color) of the surface of the object, and to the geometric characteristics (e.g., normal, curvature, smoothness) of the surface of the object at any moment;
a sound attribute;
motion attribute: the motion velocity vector and the acceleration vector of each point on the object at any moment can be referred to; alternatively, it may also refer to angular velocity vector, angular acceleration vector of each portion on the object that can be regarded as a rigid body;
other attributes: may refer to at least one of the category, identity, material, relationship, etc. of the object, which can be inferred from the appearance information and the change of the appearance information with time.
Formally, four-dimensional spatio-temporal models exist in storage media in digitized data form that can be stored, rendered, retrieved, edited, transmitted, encrypted, and used for more advanced intelligent applications.
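The attribute list above could be digitized in a structure along the following lines; the type and field names are illustrative assumptions of mine, not taken from the patent, and a production model would use a far more compact binary layout.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class SpaceTimeSample:
    """One time sample of the four-dimensional space-time model."""
    timestamp: float
    positions: List[Vec3] = field(default_factory=list)   # spatial position attribute
    colors: List[Vec3] = field(default_factory=list)      # appearance: spectral features
    normals: List[Vec3] = field(default_factory=list)     # appearance: surface geometry
    velocities: List[Vec3] = field(default_factory=list)  # motion attribute
    sound_field: bytes = b""                              # sound attribute
    semantics: Dict[str, Any] = field(default_factory=dict)  # other attributes

@dataclass
class FourDSpaceTimeModel:
    samples: List[SpaceTimeSample] = field(default_factory=list)

    def at(self, t: float) -> SpaceTimeSample:
        # Nearest-sample lookup in time; a real model might interpolate
        # between samples rather than snapping to the closest one.
        return min(self.samples, key=lambda s: abs(s.timestamp - t))
```

Storing one `SpaceTimeSample` per captured moment gives the model its "changes with time" attribute in digitized form.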
In the embodiment of the invention, after the four-dimensional space-time model is established, the four-dimensional space-time model can be further modified, enhanced and optimized.
In the embodiment of the present invention, when a four-dimensional spatio-temporal model for representing the representation information is established according to the obtained representation information, optionally, the following method may be adopted:
processing the representation information to obtain first labeling information;
according to the first labeling information and the appearance information, obtaining first point cloud information including geometric information and second point cloud information including texture information;
fusing the first point cloud information and the second point cloud information to obtain target point cloud information;
obtaining visual information according to the target point cloud information;
acquiring a space model according to the visual information, and fusing the space models aiming at different moments;
and obtaining a four-dimensional space-time model according to the space model, the first labeling information and the second labeling information obtained by fusion.
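The steps above chain into a single pipeline. The sketch below wires them together with caller-supplied stubs for each stage; all names and signatures are hypothetical, and the second-labeling step described later is omitted for brevity.

```python
def build_four_d_model(representation_info, *, annotate, to_geometry_cloud,
                       to_texture_cloud, fuse_clouds, to_visual,
                       to_space_model, fuse_over_time):
    """Chain the model-building steps; every stage is a caller stub."""
    first_labels = annotate(representation_info)                 # first labeling information
    geom = to_geometry_cloud(representation_info, first_labels)  # first point cloud (geometry)
    tex = to_texture_cloud(representation_info, first_labels)    # second point cloud (texture)
    target_cloud = fuse_clouds(geom, tex)                        # target point cloud information
    visual = to_visual(target_cloud)                             # visual information
    space_models = [to_space_model(visual)]                      # space model per moment
    return fuse_over_time(space_models, first_labels)            # four-dimensional model
```

Keeping each stage injectable makes it easy to swap in real implementations (or test doubles) for labeling, point-cloud generation, and fusion independently.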
In practical applications, the representation information may include sound field information in addition to electromagnetic field spectrum information that is observed with the naked eye and/or collected with a device for characterizing the object, in which case, before obtaining the spatial model from the visual information, the method further includes the following operations:
calculating sound field information of an object corresponding to the representation information according to the representation information, wherein the representation information also comprises sound field information which can be sensed by ears and/or collected by equipment;
at this time, when obtaining the spatial model from the visual information, optionally, the following manner may be adopted:
and fusing the visual information and the acoustic field information to obtain a space model.
The sound field information described in the embodiment of the present invention not only refers to the audio information itself, but also includes implicit sound source spatial position information, and may include acquirable sound wave information and/or ultrasonic wave information.
In the embodiment of the present invention, after the first point cloud information and the second point cloud information are fused to obtain the target point cloud information and before the visual information is obtained, the method further includes the following operations:
processing the target point cloud information to obtain second labeling information;
at this time, when obtaining the visual information according to the target point cloud information, optionally, the following method may be adopted:
and obtaining visual information according to the second labeling information and the target point cloud information.
In the embodiment of the present invention, when obtaining the visual information according to the second labeling information and the target point cloud information, optionally, the following method may be adopted:
performing geometric vertex position optimization and normal calculation on target point cloud information to obtain a first result;
performing surface fitting and triangular gridding processing on the first result to obtain a second result;
and obtaining visual information according to the second result.
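As a minimal illustration of the normal-calculation step on the triangulated surface, the sketch below computes the unit normal of one mesh triangle by the right-hand rule; a production system would additionally average face normals into per-vertex normals and feed them into the geometric-vertex optimization. The function name is my own.

```python
import math

def triangle_normal(p0, p1, p2):
    """Unit normal of one triangle of the gridded surface, by the
    right-hand rule over the vertex order (p0 -> p1 -> p2)."""
    ux, uy, uz = (p1[i] - p0[i] for i in range(3))
    vx, vy, vz = (p2[i] - p0[i] for i in range(3))
    # Cross product of the two edge vectors gives the face normal.
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)
```

For a triangle lying in the xy-plane with counter-clockwise vertex order, the normal points along +z.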
In the embodiment of the present invention, when the representation information is processed to obtain the first labeling information, optionally, the following manner may be adopted:
and performing digital image processing and analysis on the representation information to obtain the first labeling information.
In the embodiment of the present invention, when performing digital image processing and analysis on the representation information, optionally, the following method may be adopted:
and performing segmentation, detection, tracking, and recognition on the representation information.
In the embodiment of the invention, the steps of segmentation, detection, tracking, and recognition have no fixed order; for example, the representation information can be segmented first and then detected, or detected first and then segmented. To improve the accuracy of the obtained first labeling information, segmentation, detection, tracking, and recognition can be performed several times in a loop: after one pass, at least one further round is performed on the current result, which improves the precision.
In the embodiment of the present invention, the segmentation may refer to segmenting the image into foreground and background, for example into sky, ground, and the like; the detection may refer to detecting pedestrians or license plates; the tracking may refer to tracking the arm movements of people; and the recognition may refer to recognizing vehicles.
In the embodiment of the present invention, when obtaining the first point cloud information including the geometric information according to the first annotation information and the appearance information, optionally, the following method may be adopted:
processing the representation information according to the first labeling information to obtain coordinate information of the object corresponding to the representation information;
and generating first point cloud information comprising geometric information according to the coordinate information.
In the embodiment of the present invention, the coordinate information of the object corresponding to the representation information may correspond to different coordinate systems at different times, and at this time, in order to improve the accuracy of the obtained first point cloud information, after the coordinate information of the object corresponding to the representation information is obtained, the coordinate information of the object in different local coordinate systems at different times may be fused to the same coordinate system, and then the first point cloud information including the geometric information may be generated according to the coordinate information fused to the same coordinate system.
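A hedged sketch of fusing coordinates from different local coordinate systems into one common system: each local cloud is mapped into the world frame by a rigid transform, here simplified to a rotation about the z-axis plus a translation. A full implementation would use a general 3D rotation (a rotation matrix or quaternion); names are illustrative.

```python
import math

def to_world(points, yaw, translation):
    """Map points from a local frame into the shared world frame:
    rotate about the z-axis by `yaw`, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz)
            for x, y, z in points]

def fuse_frames(clouds):
    """Fuse several locally-expressed point clouds, each tagged with its
    own pose (points, yaw, translation), into one world-frame cloud."""
    fused = []
    for points, yaw, translation in clouds:
        fused.extend(to_world(points, yaw, translation))
    return fused
```

Once every cloud shares one coordinate system, the first point cloud information can be generated from the fused coordinates as the text describes.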
In the embodiment of the present invention, when obtaining the second point cloud information including the texture information according to the first labeling information and the appearance information, optionally, the following method may be adopted:
and extracting information from the representation information, point by point and/or by image synthesis, according to the first labeling information, to obtain second point cloud information including texture information.
In the embodiment of the present invention, when obtaining the visual information according to the second labeling information and the target point cloud information, optionally, the following method may be adopted:
calculating surface normal information of the object according to the second labeling information and the target point cloud information;
and obtaining visual information according to the surface normal information.
The invention provides a detailed process for establishing a four-dimensional space-time model, shown in figure 2C: first labeling information and sound field information are obtained from the representation information; first point cloud information and second point cloud information are obtained from the representation information and the first labeling information; the first point cloud information and the second point cloud information are fused to obtain target point cloud information; second labeling information is obtained from the target point cloud information; geometric vertex position optimization and normal calculation are performed on the target point cloud information to obtain a first result; surface fitting and triangular gridding are performed on the first result to obtain a second result; visual information is obtained from the second result and the second labeling information; the visual information and the sound field information are fused to obtain a space model; the space models are fused to obtain a fused space model; and the fused space model, the first labeling information, and the second labeling information are processed to obtain the four-dimensional space-time model.
Referring to fig. 2D, an embodiment of the invention further provides an apparatus for processing multimedia information, including:
an obtaining unit 2100 for obtaining representation information including electromagnetic field spectrum information that can be observed with the naked eye and/or collected with a device for characterizing an object;
a model establishing unit 2200, configured to establish a four-dimensional spatio-temporal model for representing the representation information according to the obtained representation information, where the four-dimensional spatio-temporal model has an attribute capable of representing the representation information changing with time in a digitized form;
the processing unit 2300 is used for encoding the established four-dimensional space-time model;
a sending unit 2400, configured to send the four-dimensional spatio-temporal model after the coding process.
In this embodiment of the present invention, the electromagnetic spectrum information acquired by the acquiring unit 2100 may be transmitted by an object, reflected by an object, or refracted by an object, and is not limited in this embodiment.
In an embodiment of the present invention, the electromagnetic field spectrum information described by the obtaining unit 2100 may include at least one of radio wave information, infrared information, visible light information, ultraviolet information, X-ray information, and gamma ray information, wherein the visible light information may include laser light.
In the embodiment of the invention, the object corresponding to the representation information can be an object with any view field size and angle indoors and/or outdoors.
In the embodiment of the present invention, the obtaining unit 2100 may acquire the representation information at a rate of 24 to 120 frames per second.
In this embodiment of the present invention, the appearance information acquired by the acquiring unit 2100 may be appearance information acquired at different spatial points and different time points.
In the embodiment of the invention, the four-dimensional space-time model at least comprises the following attributes in content:
spatial position attribute: the coordinate of each point on the object at any moment in a coordinate system which does not change along with time can be referred to;
appearance attribute: may refer to the texture and spectral features (e.g., color) of the surface of the object, and to the geometric characteristics (e.g., normal, curvature, smoothness) of the surface of the object at any moment;
a sound attribute;
motion attribute: the motion velocity vector and the acceleration vector of each point on the object at any moment can be referred to; alternatively, it may also refer to angular velocity vector, angular acceleration vector of each portion on the object that can be regarded as a rigid body;
other attributes: may refer to at least one of the category, identity, material, interrelationship, etc. of the object, all of which may be inferred from the representation information and the change in the representation over time.
Formally, four-dimensional spatio-temporal models exist in storage media in digitized data form that can be stored, rendered, retrieved, edited, transmitted, encrypted, and used for more advanced intelligent applications.
In the embodiment of the present invention, after the model establishing unit 2200 establishes the four-dimensional spatio-temporal model, the four-dimensional spatio-temporal model may be further modified, enhanced, and optimized.
In practical applications, the representation information may include sound field information in addition to the electromagnetic field spectrum information, observable by the naked eye and/or collectable by a device, that characterizes the object. In this case, in an embodiment of the present invention, the apparatus may further include a sound field information calculating unit 2500, configured to calculate, according to the representation information, the sound field information of the object corresponding to the representation information, where the representation information further includes sound field information perceptible to the ear and/or collectable by a device.
When the model building unit 2200 builds the four-dimensional spatio-temporal model for representing the representation information according to the representation information, it specifically:
establishes, according to the representation information and the sound field information, a four-dimensional spatio-temporal model for representing both the representation information and the sound field information.
The sound field information described in the embodiment of the present invention refers not only to the audio information itself but also to the implicit spatial position information of the sound source, and may include acquirable sound wave information and/or ultrasonic wave information.
In this embodiment of the present invention, optionally, the model establishing unit 2200 includes a first labeling information generating unit 2200A, a point cloud information generating unit 2200B, a point cloud information fusing unit 2200C, a visual information generating unit 2200D, and a four-dimensional spatiotemporal model generating unit 2200E, where:
the first annotation information generating unit 2200A is configured to process the representation information to obtain first annotation information;
the point cloud information generating unit 2200B is configured to obtain, according to the first annotation information and the representation information, first point cloud information including geometric information and second point cloud information including texture information;
the point cloud information fusion unit 2200C is configured to fuse the first point cloud information and the second point cloud information to obtain target point cloud information;
the visual information generating unit 2200D is configured to obtain visual information according to the target point cloud information;
the four-dimensional spatiotemporal model generating unit 2200E is configured to obtain a spatial model according to the visual information, fuse spatial models at different times, and obtain the four-dimensional spatiotemporal model according to the fused spatial model, the first labeling information, and the second labeling information.
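As a rough sketch of how units 2200A through 2200E hand data to one another, the following Python stub pipeline mirrors the sequence above. Every function body is a trivial placeholder standing in for the actual segmentation, reconstruction, and fusion algorithms, which are not specified here:

```python
# Each stage below is a trivial stand-in (not the patented implementation)
# so that the A-to-E control flow can run end to end.
def annotate(frames):                      # unit 2200A: first annotation information
    return [{"frame": i} for i in range(len(frames))]

def make_clouds(frames, labels):           # unit 2200B: geometry + texture clouds
    geom = [(float(i), 0.0, 0.0) for i in range(len(frames))]
    tex = [(0.5, 0.5, 0.5)] * len(frames)
    return geom, tex

def fuse_clouds(geom, tex):                # unit 2200C: target point cloud
    return list(zip(geom, tex))

def to_visual(cloud, second_labels):       # unit 2200D: visual information
    return {"mesh": cloud, "labels": second_labels}

def build_4d_model(frames_by_time):        # unit 2200E: fuse per-time space models
    models = []
    for t, frames in sorted(frames_by_time.items()):
        labels1 = annotate(frames)
        geom, tex = make_clouds(frames, labels1)
        target = fuse_clouds(geom, tex)
        labels2 = [{"point": i} for i in range(len(target))]  # second annotation
        models.append((t, to_visual(target, labels2)))
    return models                          # time-ordered sequence of space models
```

The returned list of per-time space models, together with the two sets of annotation information, is what the text above calls the four-dimensional spatio-temporal model.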
In this embodiment of the present invention, further, the apparatus further includes a sound field information calculating unit 2500, configured to calculate, according to the representation information, sound field information of an object corresponding to the representation information, where the representation information further includes sound field information that can be sensed by ears and/or collected by a device;
when the four-dimensional spatio-temporal model generating unit 2200E obtains a spatial model according to the visual information, optionally, the following manner may be adopted:
and fusing the visual information and the sound field information to obtain the space model.
In this embodiment of the present invention, optionally, the point cloud information generating unit 2200B is further configured to process the target point cloud information to obtain second labeling information;
when the visual information generating unit 2200D obtains the visual information according to the target point cloud information, optionally, the following manner may be adopted:
and obtaining the visual information according to the second labeling information and the target point cloud information.
In this embodiment of the present invention, further, the visual information generating unit 2200D is further configured to:
performing geometric vertex position optimization and normal calculation on the target point cloud information to obtain a first result;
performing surface fitting and triangular gridding processing on the first result to obtain a second result;
and obtaining the visual information according to the second result.
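The "normal calculation" step above is commonly performed by fitting a plane to each point's local neighbourhood; the sketch below takes the eigenvector of the smallest eigenvalue of the neighbourhood covariance as the normal. This is a standard technique assumed here for illustration, not the specific method fixed by the patent, and the brute-force neighbour search is for clarity only:

```python
import numpy as np

def estimate_normals(points, k=8):
    """Estimate a per-point surface normal as the smallest-eigenvalue
    eigenvector of the k-nearest-neighbour covariance matrix."""
    pts = np.asarray(points, dtype=float)
    normals = []
    for p in pts:
        dists = np.linalg.norm(pts - p, axis=1)
        nbrs = pts[np.argsort(dists)[:k]]          # brute-force k-NN
        centered = nbrs - nbrs.mean(axis=0)
        cov = np.cov(centered.T)                   # 3x3 covariance
        eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
        normals.append(eigvecs[:, 0])              # direction of least variance
    return np.array(normals)
```

For points sampled from a flat patch, the recovered normals point along the patch's perpendicular (up to sign), which is the input the later surface fitting and triangular meshing steps would consume.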
In this embodiment of the present invention, when the first annotation information generating unit 2200A processes the representation information to obtain the first annotation information, the following manner may optionally be adopted:
performing digital image processing and analysis on the representation information to obtain the first annotation information.
In this embodiment of the present invention, when the first annotation information generating unit 2200A performs digital image processing and analysis on the representation information, it performs processing such as segmentation, detection, tracking, and recognition on the representation information.
In the embodiment of the present invention, the steps of segmentation, detection, tracking, and recognition have no fixed order. For example, the representation information may be segmented first and then detected, or detected first and then segmented. To improve the accuracy of the obtained first annotation information, segmentation, detection, tracking, and recognition may be performed several times in a loop: after one pass of segmentation, detection, tracking, and recognition, at least one further round may be performed on the current result, which improves precision.
In the embodiment of the present invention, segmentation may refer to segmenting an image into foreground and background (for example, into sky, ground, and the like); detection may refer to detecting pedestrians or license plates; tracking may refer to tracking the arm movements of a person; and recognition may refer to recognizing vehicles.
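A minimal sketch of the looped segmentation/detection/tracking/recognition pass described above. The step functions are hypothetical placeholders that merely record how often they ran; real implementations would apply the actual image analysis algorithms:

```python
# Hypothetical stand-ins: each "step" records how often it ran, standing in
# for real segmentation / detection / tracking / recognition algorithms.
def _make_step(name):
    def step(state):
        state[name] = state.get(name, 0) + 1
        return state
    return step

segment, detect, track, recognize = map(
    _make_step, ("segment", "detect", "track", "recognize"))

def analyze(image, rounds=2):
    """Run the four steps repeatedly over the current result; the order is
    flexible (segment-then-detect or detect-then-segment), and extra rounds
    refine the first annotation information."""
    state = {"image": image}
    for _ in range(rounds):
        for step in (segment, detect, track, recognize):
            state = step(state)
    return state
```

Each round consumes the previous round's result, which is how repeating the loop can raise the precision of the obtained annotations.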
In this embodiment of the present invention, when the point cloud information generating unit 2200B obtains the first point cloud information including geometric information according to the first annotation information and the representation information, optionally, the following manner may be adopted:
processing the representation information according to the first labeling information to obtain coordinate information of an object corresponding to the representation information;
and generating first point cloud information comprising the geometric information according to the coordinate information.
In the embodiment of the present invention, the coordinate information of the object corresponding to the representation information may correspond to different coordinate systems at different times. In this case, to improve the accuracy of the obtained first point cloud information, after obtaining the coordinate information of the object corresponding to the representation information, the point cloud information generating unit 2200B may further fuse the coordinate information of the object, given in different local coordinate systems at different times, into the same coordinate system, and then generate the first point cloud information including the geometric information from the coordinate information so fused.
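The fusion of per-time local coordinate systems into one common coordinate system can be sketched as applying a rigid transform (rotation R plus translation t, assumed known from calibration or registration) to each capture before concatenating the points. The function names are illustrative:

```python
import numpy as np

def to_common_frame(local_points, R, t):
    """Map points from a local (per-time) coordinate system into the common
    frame: p_common = R @ p_local + t. R (3x3) and t (3-vector) are the
    assumed-known pose of that local coordinate system."""
    P = np.asarray(local_points, dtype=float)
    return P @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)

def fuse_coordinate_info(captures):
    """captures: iterable of (points, R, t) taken at different times in
    different local frames; returns one cloud in the same coordinate system,
    from which the first point cloud information would be generated."""
    return np.vstack([to_common_frame(p, R, t) for p, R, t in captures])
```

With all coordinates expressed in one frame, generating the first point cloud information reduces to assembling geometric attributes over the fused cloud.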
In this embodiment of the present invention, optionally, when the point cloud information generating unit 2200B obtains the second point cloud information including texture information according to the first annotation information and the appearance information, the following manner may be adopted:
and extracting information from the representation information according to the first annotation information, in a point-by-point and/or image synthesis manner, to obtain the second point cloud information including texture information.
In this embodiment of the present invention, when the visual information generating unit 2200D obtains the visual information according to the second annotation information and the target point cloud information, the following manner may be adopted:
calculating normal information of the object surface according to the second labeling information and the target point cloud information;
and obtaining visual information according to the surface normal information.
In the embodiment of the present invention, after the processing unit 2300 encodes the established four-dimensional spatio-temporal model, it compresses the encoded four-dimensional spatio-temporal model, and the transmitting unit 2400 transmits the compressed four-dimensional spatio-temporal model.
Further, to improve transmission security, the processing unit 2300 may encrypt the encoded four-dimensional spatio-temporal model before the transmitting unit 2400 transmits it, or may encrypt the compressed four-dimensional spatio-temporal model before the compressed model is transmitted.
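The encode-compress-encrypt ordering described above can be sketched as follows. zlib stands in for the unspecified compression scheme, and a toy SHA-256 XOR keystream stands in for the unspecified cipher; a real system would use an established cipher such as AES:

```python
import hashlib
import zlib

def _keystream(key: bytes, n: int) -> bytes:
    # Toy SHA-256 counter-mode keystream: illustration only, not a real cipher.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def pack_model(encoded_model: bytes, key: bytes) -> bytes:
    """Compress the already-encoded model, then encrypt it
    (encode -> compress -> encrypt -> transmit)."""
    compressed = zlib.compress(encoded_model)
    stream = _keystream(key, len(compressed))
    return bytes(a ^ b for a, b in zip(compressed, stream))

def unpack_model(blob: bytes, key: bytes) -> bytes:
    """Reverse the pipeline on the receiving side: decrypt, then decompress."""
    stream = _keystream(key, len(blob))
    compressed = bytes(a ^ b for a, b in zip(blob, stream))
    return zlib.decompress(compressed)
```

The receiving apparatus applies the inverse steps in the opposite order before decoding the model for presentation.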
In this embodiment of the present invention, the obtaining unit 2100 may be shaped as any one of a cylinder, a rectangular parallelepiped, a prism, a ring, a sphere, and a hemisphere, and includes at least one camera, which may be a color camera, a depth camera, or an infrared camera.
Further, the obtaining unit 2100 may further include at least one microphone, as shown in fig. 2E and 2F, where fig. 2G is a top view of fig. 2E or 2F, and fig. 2H is a side view of fig. 2E or 2F.
Optionally, the obtaining unit 2100 includes 8 pairs of color cameras and 8 microphones, where: the top has 1 pair of color cameras, each with a 180-degree field of view; the side has 6 pairs of color cameras, each with a 70-degree field of view; the bottom has 1 pair of color cameras, each with a 180-degree field of view; and there is 1 microphone in the middle of each pair of cameras.
Alternatively, the obtaining unit 2100 may also be in the form of:
the top has 1 color camera or 1 pair of color cameras, with a field of view of 45 to 180 degrees; the side has 2 to 8 pairs of color cameras, each with a field of view of 45 to 180 degrees; the bottom has 1 color camera or 1 pair of color cameras, with a field of view of 45 to 180 degrees; there is 1 microphone, or 1 microphone in the middle of each pair of cameras; optionally, the number of microphones is between 1 and 8.
In the embodiment of the present invention, optionally, the top camera may also be one or any combination of a stereo camera, a multi-focal-length camera, a structured light camera, a time of flight (ToF) camera, and a light field camera group.
In the embodiment of the present invention, the side camera may be one of a stereo camera, a multi-focal-length camera, a structured light camera, a time-of-flight (ToF) camera, and a light field camera group, or any combination thereof.
For example, the obtaining unit 2100 is cylindrical, with six pairs of binocular cameras on the side surface of the cylinder, each camera having a 70-degree field of view, and one pair of binocular cameras on each of the top and bottom surfaces, each camera having a 180-degree field of view, so that full three-dimensional panoramic field-of-view coverage can be achieved; all cameras are calibrated in advance and their parameter matrices obtained. The obtaining unit 2100 may also have eight built-in microphones.
In the embodiment of the present invention, the color camera may be composed of an optical lens, a light-sensing device (image sensor), and an ISP (Image Signal Processor).
A VPU (Vision Processing Unit) chip may include the model building unit 2200 and the processing unit 2300. The cameras may be connected to the VPU chip via MIPI (Mobile Industry Processor Interface), and one VPU chip processes the data transmitted from two pairs of cameras, so that one cylinder contains four VPU chips.
In this embodiment of the present invention, the model building unit 2200 may include a processor, a graphics card, memory, video memory, flash memory, a hard disk, wireless and wired transmission modules, and a plurality of bus interface chips.
The following provides a scenario in which embodiments of the present invention are applicable.
The scenario shown in FIG. 3A is: party A is in a first scene and party B is in a second scene; through the method provided by the embodiment of the invention, the environment around A is remotely presented in front of B in real time, and B can interact with A.
Further, the multimedia information processing apparatus may first store the four-dimensional spatiotemporal model in a storage device, and the apparatus held by B that receives and processes the four-dimensional spatiotemporal model may then acquire the model from the storage device, as shown in FIG. 3B; in this case the scene "seen" by B may be the same as in the case shown in FIG. 3A.
When the multimedia information processing apparatus stores the four-dimensional spatiotemporal model in the storage device, A may also hold an apparatus capable of receiving and processing the four-dimensional spatiotemporal model, acquire the model from the storage device, and thereby perceive the first scene where A was located at a past point in time, as shown in FIG. 3C.
The scenario shown in FIG. 3D is: party A is in a first scene and party B is in a second scene; through the embodiment of the invention, the environment around A is remotely presented in front of B in real time, and B can interact with A. A and B realize bidirectional real-time remote presence and mixed reality through the embodiment of the invention: A perceives the superposition of the first scene where A is located and B, while B perceives the first scene where A is located. It should be noted that A and B can each choose among multiple scenes to perceive: both sides may choose to see the first scene where A is located, the second scene where B is located, or a third scene; A and B may see the same real or virtual scene, or different real or virtual scenes.
The scenario shown in FIG. 3E is: remote working is realized through the embodiment provided by the invention.
The scenario shown in FIG. 3F is: through the embodiment provided by the invention, A and B perceive a virtual environment and can further interact with it, as if personally on the scene.
The methods and apparatus provided herein are not inherently related to any particular computer, virtual machine system, or other apparatus. Various general-purpose systems may also be used with the teachings based on this disclosure. The structure required to construct such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be appreciated that a variety of programming languages may be used to implement the teachings of the invention described herein, and the above description of specific languages is provided to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of an embodiment may be adaptively changed and disposed in one or more apparatuses other than the embodiment. Several modules of embodiments may be combined into one module or unit or assembly and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-assemblies. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or modules are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various apparatus embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the modules in an apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (12)

1. A method of presenting multimedia information, comprising:
receiving a four-dimensional spatio-temporal model for representing representation information, the four-dimensional spatio-temporal model including a spatial position attribute and a time attribute and being capable of representing, in digitized form, the representation information varying with time, the representation information including electromagnetic field spectrum information, observable by the naked eye and/or collectable by a device, for characterizing an object, the electromagnetic field spectrum information including at least one of radio wave information emitted, reflected or refracted by the object, infrared information, visible light information, ultraviolet information, X-ray information, and gamma ray information;
decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model;
according to the decoded four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model;
wherein, the step of establishing the four-dimensional space-time model comprises the following steps:
processing the appearance information to obtain first labeling information;
obtaining first point cloud information including geometric information and second point cloud information including texture information according to the first labeling information and the appearance information;
fusing the first point cloud information and the second point cloud information to obtain target point cloud information;
processing the target point cloud information to obtain second labeling information;
obtaining visual information according to the target point cloud information;
acquiring a space model according to the visual information, and fusing the space models aiming at different moments;
obtaining a four-dimensional space-time model according to the space model, the first labeling information and the second labeling information obtained by fusion;
obtaining first point cloud information including geometric information according to the first labeling information and the appearance information comprises:
processing the representation information according to the first labeling information to obtain coordinate information of the object corresponding to the representation information;
and fusing the coordinate information of the object under different local coordinate systems at different moments into the same coordinate system, and then generating the first point cloud information including geometric information according to the coordinate information fused into the same coordinate system.
2. The method of claim 1, prior to presenting the appearance information characterized by the four-dimensional spatio-temporal model, further comprising:
fusing the four-dimensional space-time model with a first space-time model to obtain a target four-dimensional space-time model, wherein the first space-time model is used for representing the appearance information of an object presenting the place where the multimedia information is located;
presenting the representation information represented by the four-dimensional space-time model, specifically comprising:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
3. The method of claim 1, prior to presenting the appearance information characterized by the four-dimensional spatio-temporal model, the method further comprising:
fusing the four-dimensional space-time model with a first space-time model and a second space-time model of a local terminal to obtain a target four-dimensional space-time model, wherein the first space-time model is used for representing the appearance information of an object in a place where multimedia information is present, and the second space-time model is used for representing the appearance information of a virtual object;
presenting the representation information represented by the four-dimensional space-time model, specifically comprising:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
4. The method of claim 1, the representation information further comprising sound field information perceptible to the ear and/or collectable by the device, the four-dimensional spatio-temporal model further being used to characterize sound field information of objects corresponding to the representation information;
the method further comprises the following steps:
and playing the sound field information represented by the four-dimensional space-time model.
5. The method of claim 1, prior to presenting the appearance information characterized by the four-dimensional spatio-temporal model, further comprising:
determining front facing information of a device presenting multimedia information;
presenting the appearance information represented by the four-dimensional space-time model, comprising:
and presenting the representation information represented by the four-dimensional space-time model according to the front orientation information.
6. The method of any one of claims 1-5, further comprising:
determining front orientation information and target multimedia information of a device presenting the multimedia information;
and feeding back the front orientation information and the target multimedia information to a device for sending a four-dimensional space-time model.
7. An apparatus for presenting multimedia information, comprising:
a receiving unit, configured to receive a four-dimensional spatio-temporal model for representing appearance information, the four-dimensional spatio-temporal model including a spatial position attribute and a time attribute, the appearance information being capable of representing the appearance information varying with time in a digitized form, the appearance information including electromagnetic field spectrum information capable of being observed with the naked eye and/or being collected with a device for representing an object, the electromagnetic field spectrum information including at least one of radio wave information emitted, reflected or refracted by the object, infrared ray information, visible light information, ultraviolet ray information, X-ray information, and gamma ray information;
the four-dimensional space-time model processing unit is used for decoding the four-dimensional space-time model to obtain a decoded four-dimensional space-time model;
the presentation unit is used for playing the representation information represented by the four-dimensional space-time model according to the decoded four-dimensional space-time model;
the model establishing unit is used for processing the representation information to obtain first labeling information; obtaining first point cloud information including geometric information and second point cloud information including texture information according to the first labeling information and the appearance information; fusing the first point cloud information and the second point cloud information to obtain target point cloud information; processing the target point cloud information to obtain second labeling information; obtaining visual information according to the target point cloud information; acquiring a space model according to the visual information, and fusing the space models aiming at different moments; obtaining a four-dimensional space-time model according to the space model, the first labeling information and the second labeling information obtained by fusion;
processing the representation information according to the first labeling information to obtain coordinate information of an object corresponding to the representation information; and fusing coordinate information of the object under different local coordinate systems at different moments into the same coordinate system, and generating first point cloud information comprising geometric information according to the coordinate information fused into the same coordinate system.
8. The apparatus according to claim 7, further comprising a model fusion unit for fusing the four-dimensional spatio-temporal model with a first spatio-temporal model to obtain a target four-dimensional spatio-temporal model, wherein the first spatio-temporal model is used to represent the appearance information of the place where the apparatus for presenting multimedia information is located;
when the presentation unit presents the representation information represented by the four-dimensional spatio-temporal model, the presentation unit specifically comprises:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model and the representation information represented by the first space-time model.
9. The apparatus according to claim 7, further comprising a model fusion unit for fusing the four-dimensional spatio-temporal model with a first spatio-temporal model and a second spatio-temporal model of the apparatus for presenting multimedia information to obtain a target four-dimensional spatio-temporal model, wherein the first spatio-temporal model is used for representing the appearance information of the place where the apparatus for presenting multimedia information is located, and the second spatio-temporal model is used for representing the appearance information of a virtual object;
when the presentation unit presents the representation information represented by the four-dimensional space-time model, the method specifically comprises the following steps:
and according to the target four-dimensional space-time model, presenting the representation information represented by the four-dimensional space-time model, the representation information represented by the first space-time model and the representation information represented by the second space-time model.
10. The apparatus of claim 7, the representation information further comprising sound field information perceptible to the ear and/or collectable by a device, the four-dimensional spatio-temporal model further being used to characterize sound field information of an object corresponding to the representation information;
the device also comprises a playing unit used for playing the sound field information represented by the four-dimensional space-time model.
11. The apparatus of claim 7, further comprising a processing unit for determining front facing information for an apparatus presenting multimedia information;
when the presentation unit presents the representation information represented by the four-dimensional spatio-temporal model, the presentation unit specifically comprises:
and presenting the representation information represented by the four-dimensional space-time model according to the front orientation information.
12. The apparatus of any of claims 7-11, further comprising a processing unit for determining front orientation information and target multimedia information of the apparatus for presenting multimedia information;
the device also comprises a feedback unit for feeding back the front orientation information and the target multimedia information to a device for sending the four-dimensional space-time model.
CN201610044505.7A 2016-01-22 2016-01-22 Method and device for presenting multimedia information Active CN105893452B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201610044505.7A CN105893452B (en) 2016-01-22 2016-01-22 Method and device for presenting multimedia information
PCT/CN2016/111308 WO2017124871A1 (en) 2016-01-22 2016-12-21 Method and apparatus for presenting multimedia information
EP16886131.8A EP3385869B1 (en) 2016-01-22 2016-12-21 Method and apparatus for presenting multimedia information
JP2018534829A JP6680886B2 (en) 2016-01-22 2016-12-21 Method and apparatus for displaying multimedia information
US15/410,891 US10325408B2 (en) 2016-01-22 2017-01-20 Method and device for presenting multimedia information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610044505.7A CN105893452B (en) 2016-01-22 2016-01-22 Method and device for presenting multimedia information

Publications (2)

Publication Number Publication Date
CN105893452A CN105893452A (en) 2016-08-24
CN105893452B true CN105893452B (en) 2020-04-17

Family

ID=57013674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610044505.7A Active CN105893452B (en) 2016-01-22 2016-01-22 Method and device for presenting multimedia information

Country Status (1)

Country Link
CN (1) CN105893452B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017124871A1 (en) * 2016-01-22 2017-07-27 上海肇观电子科技有限公司 Method and apparatus for presenting multimedia information
CN108536780B (en) * 2018-03-29 2020-04-03 清华大学 Cross-modal object material retrieval method based on tactile texture features
CN111760265B (en) * 2020-06-24 2024-03-22 抖音视界有限公司 Operation control method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810353A (en) * 2014-03-09 2014-05-21 Yang Zhi Real scene mapping system and method in virtual reality
CN104183014A (en) * 2014-08-13 2014-12-03 Zhejiang University An information labeling method having high fusion degree and oriented to city augmented reality

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009094800A1 (en) * 2008-01-23 2009-08-06 Shixian Chu A method and apparatus for information visualized expression, and visualized human computer interactive expression interface thereof
KR102516124B1 (en) * 2013-03-11 2023-03-29 Magic Leap, Inc. System and method for augmented and virtual reality

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810353A (en) * 2014-03-09 2014-05-21 Yang Zhi Real scene mapping system and method in virtual reality
CN104183014A (en) * 2014-08-13 2014-12-03 Zhejiang University An information labeling method having high fusion degree and oriented to city augmented reality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Key Technologies of Four-Dimensional Virtual City System Construction; Zhang Haiming; China Master's Theses Full-text Database, Information Science and Technology; 2009-05-12; pp. 19, 38 *

Also Published As

Publication number Publication date
CN105893452A (en) 2016-08-24

Similar Documents

Publication Publication Date Title
US10163049B2 (en) Inconspicuous tag for generating augmented reality experiences
CN105894571B (en) Method and device for processing multimedia information
TWI647593B (en) System and method for providing simulated environment
US11151791B2 (en) R-snap for production of augmented realities
WO2018095317A1 (en) Data processing method, device, and apparatus
JP2023513980A (en) Synthesis of shots of the speaker on the screen
CN105653020A (en) Time traveling method and apparatus and glasses or helmet using same
CN105893452B (en) Method and device for presenting multimedia information
JP6359704B2 (en) A method for supplying information associated with an event to a person
EP3665656B1 (en) Three-dimensional video processing
US20190295324A1 (en) Optimized content sharing interaction using a mixed reality environment
CN105894581B (en) Method and device for presenting multimedia information
US20210058611A1 (en) Multiviewing virtual reality user interface
WO2017124871A1 (en) Method and apparatus for presenting multimedia information
US11907434B2 (en) Information processing apparatus, information processing system, and information processing method
CN112272817B (en) Method and apparatus for providing audio content in immersive reality
CN116582660A (en) Video processing method and device for augmented reality and computer equipment
CN112272817A (en) Method and apparatus for providing audio content in immersive reality

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160909

Address after: Room 516, Zone A, Building 1, 3000 Longdong Avenue, Pudong New Area, Shanghai 201203

Applicant after: Shanghai Zhao Ming Electronic Technology Co., Ltd.

Address before: Room 3, No. 78, Lane 1101, Anshan Road, Yangpu District, Shanghai 200092

Applicant before: Feng Xinpeng

Applicant before: Zhou Ji

TA01 Transfer of patent application right

Effective date of registration: 20180914

Address after: 201203 Room 516, A 1, 3000 Longdong Avenue, Pudong New Area, Shanghai.

Applicant after: Shanghai Zhao Ming Electronic Technology Co., Ltd.

Applicant after: Kunshan Zhaoguan Electronic Technology Co., Ltd.

Applicant after: Shaoxing Zhaoguan Electronic Technology Co., Ltd.

Address before: 201203 Room 516, A 1, 3000 Longdong Avenue, Pudong New Area, Shanghai.

Applicant before: Shanghai Zhao Ming Electronic Technology Co., Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant