CN114363557A - Semantic fidelity-oriented virtual conference method and three-dimensional virtual conference system - Google Patents


Info

Publication number
CN114363557A
CN114363557A (application CN202210207225.9A)
Authority
CN
China
Prior art keywords
semantic
level
library
hierarchical
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210207225.9A
Other languages
Chinese (zh)
Other versions
CN114363557B (en)
Inventor
高大化
杨旻曦
石光明
刘丹华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority application: CN202210207225.9A
Publication of CN114363557A
Application granted
Publication of CN114363557B
Current legal status: Active


Classifications

    • H04N 7/157: Conference systems defining a virtual conference space and using avatars or agents
    • G06F 16/322: Information retrieval; indexing structures: trees
    • G06F 18/25: Pattern recognition: fusion techniques
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013: Eye tracking input arrangements
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F 40/30: Semantic analysis of natural language data
    • G06N 5/022: Knowledge engineering; knowledge acquisition
    • G06N 5/027: Knowledge representation: frames
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • H04L 12/1813: Arrangements for broadcast or conference services, e.g. computer conferences, chat rooms

Abstract

The invention relates to a semantic fidelity-oriented virtual conference method and a three-dimensional virtual conference system. The method comprises: constructing a tree semantic architecture from semantic concepts; setting a default signal for each semantic concept at every level of the tree semantic architecture; extracting the semantic description of the scene in which each participant is located from multi-modal signals and updating the private hierarchical semantic library; updating the public hierarchical semantic library from the private hierarchical semantic libraries; the server sending the hierarchical semantic description composed of all contents of the public hierarchical semantic library to all user sides, each of which updates its corresponding private hierarchical semantic library accordingly; generating the virtual conference scene from each participant's viewing angle at the start of the conference; and generating the virtual conference scene from each participant's viewing angle during the conference. The method solves the problems of coarse avatar expressions and movements, unrealistic conference scenes, and difficult interaction between people and the scene; it increases the scale of the virtual conference and improves the participants' sense of realism.

Description

Semantic fidelity-oriented virtual conference method and three-dimensional virtual conference system
Technical Field
The invention belongs to the technical field of virtual conferences, and particularly relates to a semantic fidelity-oriented virtual conference method and a three-dimensional virtual conference system.
Background
Driven by economic globalization and the COVID-19 pandemic, teleconferencing technology has been widely adopted in government, schools, companies, and other settings, and is gradually replacing face-to-face conversation as the mainstream conference medium. Commonly used remote conference systems include telephone conference systems, video conference systems, and multimedia platform sharing systems. Compared with face-to-face conversation, however, existing teleconferencing systems give participants insufficient realism and immersion and offer limited means of communication. With progress in augmented reality and virtual reality technology, virtual conference technology has emerged. It combines video communication with virtual or augmented reality: by creating a virtual meeting place in which participants appear as avatars, it provides a conference service with stronger interaction and immersion.
Existing virtual conference techniques fall into two main types. The first is high-definition video transmission. The user side records whole-body video and voice of a participant with multi-angle high-definition cameras, captures the participant's viewing angle from head pose, and transmits these to the server over a communication network; the server processes the multi-angle video streams sent by all terminals, builds a high-definition three-dimensional model of each user via a three-dimensional reconstruction algorithm, renders the virtual meeting room, and records and sends the video seen from each user's viewing angle according to that user's head pose; finally, the user side receives and plays the video. This scheme has two problems: 1) for the human visual system to perceive realism, virtual reality video needs to reach 21K resolution at 60 frames per second, so a single stream requires 2 Gbps of bandwidth, while the theoretical bandwidth of 5G is only 10 Gbps, which cannot support a normal conference; 2) existing algorithms that generate three-dimensional models from high-definition images are time-consuming and labor-intensive, and existing servers can hardly meet the real-time computing-power requirements of a large-scale conference.
Because the first scheme is impractical, the industry proposed a second scheme that extracts the main features of the user signal. A user avatar is built on the server in advance. Once the conference begins, the user side captures the participant's motion pose with pose sensors, collects voice with audio sensors, and transmits them to the server over a communication network; the server then drives each avatar according to the user's pose, renders the virtual meeting room, and records and sends the video from each user's viewing angle according to that user's head pose; finally, the user side receives and plays the video. This scheme relieves bandwidth pressure by extracting and transmitting only the user's pose, and relieves computing pressure by driving a preset avatar instead of performing three-dimensional reconstruction. However, a preset avatar cannot reflect important features of the participants such as appearance and expression, so the conference lacks realism and may even repel participants through the uncanny valley effect.
In short, the first scheme cannot meet the bandwidth and real-time computing-power requirements of a normal conference, while the second scheme's preset avatars cannot reflect participants' appearance and expressions, yielding insufficient realism and even an uncanny valley effect.
In a conference scenario, communication requires neither complete transmission of the video signal nor scrutiny of every detail in three-dimensional rendering; what matters is conveying the semantic information carried by the video signal. For example, in an interactive video conference, what both parties need is the meaning conveyed by facial expressions and body movements, not information such as the other party's environment or clothing texture.
The defects of the existing schemes therefore stem from the absence of semantic fidelity, that is, of semantic description, semantic extraction from signals, semantic transmission and reproduction, and perception of and feedback on user intention.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a semantic fidelity oriented virtual conference method and a three-dimensional virtual conference system. The technical problem to be solved by the invention is realized by the following technical scheme:
one embodiment of the present invention provides a semantic fidelity-oriented virtual conference method, which includes:
step 1, constructing a tree-shaped semantic architecture with not less than two levels according to each semantic concept at least comprising participants, wherein nodes of each level of the tree-shaped semantic architecture consist of semantic concepts and attributes thereof, and the bottommost layer of the tree-shaped semantic architecture is a signal without attributes;
step 2, setting a corresponding default signal for each semantic concept of different levels in the tree semantic framework;
step 3, according to a multi-modal signal of a scene where a participant is located, extracting semantic descriptions of the scene where the participant is located from the multi-modal signal in a hierarchical manner and correspondingly updating private-level semantic libraries, wherein each private-level semantic library corresponds to a user side and is used for storing data of the user side;
step 4, updating a public level semantic library by using the private level semantic library, wherein the public level semantic library is used for storing all data so as to realize data sharing;
step 5, the server side sends the hierarchical semantic description composed of all the contents in the public hierarchical semantic library to all the user sides by using a communication link, and updates the corresponding private hierarchical semantic library at each user side according to the hierarchical semantic description composed of all the contents;
step 6, starting the conference, and generating a virtual conference scene in the participant visual angle by the user side according to the private level semantic library updated in the step 5 and displaying the virtual conference scene;
and 7, after the step 6, generating a virtual conference scene in the view angle of the participant in the conference process and displaying the virtual conference scene.
In an embodiment of the present invention, a top root node of the tree-shaped semantic framework is the semantic concept and its attribute, and according to a sequence from high to low, a semantic concept of a next level of the tree-shaped semantic framework is decomposed from a semantic concept of a previous level, and the decomposed semantic concept and its attribute are used as a node of the next level.
In one embodiment of the present invention, the step 3 comprises:
step 3.1, obtaining multi-mode signals of a scene where the participant is located;
and 3.2, extracting semantic descriptions of scenes where the participants are located from the multi-modal signals in a hierarchical manner based on the tree semantic architecture, and correspondingly updating the private-level semantic library.
In one embodiment of the invention, said step 3.2 comprises:
step 3.21, identifying the participants;
step 3.22, instantiating a tree-shaped semantic framework of a predefined participant to obtain a first semantic object, identifying semantic concepts of the first semantic object from high to low, recording attributes of the semantic concepts, directly recording original signals of the semantic concepts at the bottommost layer of the tree-shaped semantic framework, and updating the private level semantic library by using the first semantic object;
3.23, identifying an object which interacts with the participant and predefines a tree semantic architecture;
and 3.24, instantiating a tree semantic architecture of the predefined object to obtain a second semantic object, identifying semantic concepts of the second semantic object from high to low, recording attributes of the semantic concepts, directly recording original signals of the semantic concepts in the tree semantic architecture, and updating the private level semantic library by using the second semantic object.
In one embodiment of the present invention, the step 4 comprises:
step 4.1, setting an initial semantic level;
step 4.2, intercepting the part which is not lower than the initial semantic level in each semantic object in the private level semantic library as basic semantic level description;
4.3, the user side sends the basic semantic hierarchy description to a server side through a communication link;
and 4.4, the server side updates the public level semantic library by using the received basic semantic level description.
In one embodiment of the present invention, the step 6 comprises:
6.1, representing the semantic objects in the private level semantic library by using a default signal at the bottommost layer;
step 6.2, arranging all semantic objects represented by the default signals in a virtual meeting place;
6.3, obtaining the postures of the participants of the user side according to the private level semantic library, and generating a virtual conference scene in the view angle of the participants;
and 6.4, displaying the virtual conference scene in the view angle of the participant through a display module of the user side.
In one embodiment of the present invention, the step 7 comprises:
step 7.1, obtaining multi-mode signals of a scene where the participant is located;
step 7.2, extracting semantic descriptions of scenes where the participants are located from the multi-modal signals in a hierarchical manner based on the tree semantic architecture, and correspondingly updating the private-level semantic library;
7.3, analyzing the query intention or broadcast intention of the participants by utilizing a private level semantic library;
7.4, receiving the query intention of other participants from the server side through the communication link;
step 7.5, forming a hierarchical semantic description by the queried or broadcast semantic object and the semantic objects at higher levels through a communication link, sending the hierarchical semantic description to a server, and updating the public hierarchical semantic library by using the received hierarchical semantic description;
7.6, the server side sends the broadcast hierarchical semantic descriptions to all user sides over the communication link and sends queried hierarchical semantic descriptions to user sides according to their query intentions; each user side updates its private hierarchical semantic library with the broadcast descriptions of all user sides and the descriptions it queried;
7.7, the user side generates a virtual conference scene in the participant visual angle according to the private level semantic library updated in the step 7.6;
and 7.8, displaying the virtual conference scene in the view angle of the participant through a display module of the user side.
On the other hand, another embodiment of the present invention further provides a three-dimensional virtual conference system oriented to semantic fidelity, which is used to implement the virtual conference method oriented to semantic fidelity according to any of the above embodiments, and the three-dimensional virtual conference system includes: the system comprises a server and at least two user sides, wherein the server is connected with all the user sides through communication links.
In one embodiment of the present invention, the server includes:
the server-side communication module is used for transmitting hierarchical semantic description and user intention with at least two user sides based on the communication link, wherein the user intention comprises query intention and/or broadcast intention;
and the public level semantic library module is used for summarizing and storing the level semantic descriptions sent by all the user sides.
In an embodiment of the present invention, the user side includes:
the multi-modal sensor module is used for collecting multi-modal signals from a scene where the participant is located and at least comprises a visual sensor and an auditory sensor;
the hierarchical perception module is used for hierarchically extracting hierarchical semantic description of a scene where the participant is located from the multi-modal signal;
the private level semantic library module is used for storing a corresponding semantic object in a conference scene where the user side is located;
the user intention module is used for analyzing the user intention from the private level semantic library and sending or receiving level semantic description according to the user intention;
the user side communication module is used for carrying out hierarchical semantic description and user intention transmission with the server side based on the communication link;
and the display module is used for obtaining the postures of the participants according to the private level semantic library and generating and displaying the virtual meeting room scene under the visual angle of the participants.
Compared with the prior art, the invention has the beneficial effects that:
the invention aims to provide a three-dimensional virtual conference method facing semantic fidelity, which can be used for extracting, transmitting and reproducing semantics in signals in a targeted manner by actively perceiving user intention or passively receiving feedback in a conference process through predefining the semantics in a conference scene so as to realize a virtual conference facing the semantic fidelity, so that the contradiction between communication bandwidth, computing power and user experience in the prior art is solved.
The three-dimensional virtual conference method oriented to semantic fidelity solves the prior-art problems of a small number of participants, coarse avatar expressions and movements, unrealistic conference scenes, and difficult interaction between people and the scene; it increases the scale of the virtual conference and improves the participants' sense of realism.
Drawings
Fig. 1 is a schematic flowchart of a virtual conference method oriented to semantic fidelity according to an embodiment of the present invention;
fig. 2 is a three-dimensional virtual conference system oriented to semantic fidelity according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.
Embodiment 1
Referring to fig. 1, fig. 1 is a schematic flowchart of a virtual conference method oriented to semantic fidelity according to an embodiment of the present invention. The embodiment of the invention provides a semantic fidelity-oriented virtual conference method, which comprises the following steps:
step 1, constructing a tree semantic architecture with not less than two levels according to each semantic concept at least comprising participants, wherein nodes of each level of the tree semantic architecture consist of the semantic concepts and attributes thereof, and the bottommost layer of the tree semantic architecture is a signal without the attributes.
In this embodiment, the semantic concept refers to an object type to be presented in a virtual meeting scene, for example: participants, tables, chairs, etc. One type often corresponds to multiple entities, such as more than one participant to a meeting, but they all belong to the semantic concept of "participant".
Specifically, semantic concepts are hierarchical: their meaning needs to be described at multiple levels, and as the semantic level decreases, the description becomes finer and each level contains more semantic concepts. High-level semantic objects are therefore treated as parent nodes of a tree and low-level semantic objects as child nodes, forming a hierarchical semantic concept described by a tree structure, i.e., the tree semantic architecture. For example, for the semantic concept "participant": the topmost semantic concept is a symbol, such as the word "participant"; one level down, lower-level semantic concepts such as "head", "hand", "leg", and "torso" are child nodes of "participant"; another level down, concepts such as "eyes" and "nose" are child nodes of "head"; and so on until, at the lowest level, a signal-fidelity-level description is given, e.g., the "participant" is transmitted as-is as pictures, sounds, and other signals.
In addition, a semantic concept is a type corresponding to a class of things, and the class of things has a characteristic of diversity, which is an attribute of the semantic concept. For example, the "participants" have the attributes of height, weight, sex, age, etc., and all the participants have these characteristics, but the values are different.
Furthermore, the top root node of the tree semantic architecture is a semantic concept and its attributes; proceeding from high to low, each next-level semantic concept is decomposed from a previous-level one, and the decomposed semantic concepts and their attributes become nodes of the next level, with the concept before decomposition as their parent node. That is, starting from the topmost semantic concept, each high-level concept is decomposed in turn into no fewer than two lower-level concepts until no further decomposition is possible. The bottommost layer of the tree semantic architecture is a signal without attributes, a child node of all semantic objects of the layer above; in other words, the bottommost layer of the tree structure of any instance of a semantic concept is a signal-level description of that instance, such as the current images and sounds of participant Zhang San. The signal of an entity is unique and therefore has no attributes. The significance of the signal level is that even the finest semantics cannot describe every detail of an instance; when a user needs to examine an instance closely, its signal must be transmitted completely and without distortion.
The semantic decomposition can be designed by human experts, e.g., decomposing "participant" into "head", "hand", "leg", and "torso", which suits higher-level semantic concepts that language can describe; it can also be learned by data statistics, e.g., obtaining the various textures on clothing by statistical clustering over a data set, which suits low-level semantic concepts that language cannot describe.
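As a concrete illustration, the tree semantic architecture described above could be represented by a simple node structure; this is a hypothetical sketch, and the class and field names are ours rather than the patent's:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticNode:
    """One node of the tree semantic architecture: a semantic concept plus
    its attributes; the bottommost layer carries only a raw signal."""
    concept: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    signal: object = None  # filled only at the signal (bottom) level

# fragment of the "participant" architecture from the example above
head = SemanticNode("head", children=[SemanticNode("eyes"), SemanticNode("nose")])
participant = SemanticNode(
    "participant",
    attributes={"height": None, "weight": None, "sex": None, "age": None},
    children=[head, SemanticNode("hand"), SemanticNode("leg"), SemanticNode("torso")],
)
```

Expert-designed and statistically learned decompositions would both populate `children`; only the construction of the tree differs.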
And 2, setting a default signal corresponding to each semantic concept of different levels in the tree semantic framework.
In this embodiment, the most representative signal of a semantic concept is its default signal: most users can identify the corresponding topmost semantic concept from it. For example, for the semantic concept "participant", a three-dimensional model of a person of average face and build wearing a suit may be used as the default signal; all users would take it to represent "participant" rather than some other semantic concept such as "table".
And 3, extracting semantic descriptions of scenes where the participants are located from the multi-modal signals in a hierarchical manner and correspondingly updating private-level semantic libraries according to the multi-modal signals of the scenes where the participants are located, wherein each private-level semantic library corresponds to a user side and is used for storing data of the user side.
In this embodiment, the private hierarchical semantic library is private to each user, and is used for storing user data and participating in reconstruction of a virtual conference scene under an individual view angle.
Before the conference is started, a communication link for connecting the server side and not less than two user sides is constructed, and after the communication link is constructed, the conference is started.
In a specific embodiment, step 3 may specifically include:
and 3.1, acquiring multi-mode signals of the scene where the participant is located.
In particular, multimodal signals of the scene in which the participant is located are collected using sensors, for example, of the visual type and/or of the auditory type.
And 3.2, based on the tree semantic architecture, hierarchically extracting the semantic description of the scene in which the participant is located from the multi-modal signal and updating the private hierarchical semantic library accordingly. The semantic description is the communication information formed by all semantic instances in the scene; after encoding and decoding by the communication module, it is the content transmitted on the channel, and the semantic instances play the role that words play in speech or text.
Specifically, hierarchical extraction proceeds as follows: first determine which instances of semantic concepts are present in the scene; then, for each instance, instantiate the tree semantic architecture description of the corresponding semantic concept, i.e., identify, in order from high to low, each semantic concept and its attributes in the predefined tree semantic architecture and store them in the instance.
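The top-down assignment can be sketched as a breadth-first walk over the instance tree. The nested-dict layout and the `recognize` callback below are our own stand-ins for the pattern-recognition step the patent leaves open:

```python
from collections import deque

def extract_hierarchically(instance, recognize):
    """Fill in a semantic object's attributes level by level, from high to low.
    `instance` is a nested dict {"concept", "attributes", "children"};
    `recognize(concept, attribute)` is any pattern-recognition callback."""
    queue = deque([instance])
    while queue:
        node = queue.popleft()
        for name in node["attributes"]:
            node["attributes"][name] = recognize(node["concept"], name)
        queue.extend(node["children"])
    return instance

# toy run with a recognizer that just labels each slot
obj = {"concept": "participant", "attributes": {"age": None},
       "children": [{"concept": "head", "attributes": {"pose": None}, "children": []}]}
extract_hierarchically(obj, lambda c, a: f"{c}.{a}")
```

Breadth-first order guarantees that higher-level concepts are recognized before the lower-level concepts they contain, as the patent requires.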
In a specific embodiment, step 3.2 may specifically include:
and 3.21, identifying the participants.
Specifically, a pattern-recognition method may be used to find instances of the semantic concept "participant" in the signal, such as template matching against the concept or a deep-network-based object recognizer.
And 3.22, instantiating a tree semantic architecture of the predefined participant to obtain a first semantic object, identifying semantic concepts of the first semantic object from high to low, recording attributes of the semantic concepts, directly recording original signals of the semantic concepts at the bottommost layer of the tree semantic architecture, and updating the private level semantic library by using the first semantic object.
In particular, a semantic concept is an abstract class and a semantic object is a concrete object instance. And after a certain semantic concept is identified, filling the attribute corresponding to each low-level semantic concept in the tree semantic framework and the value of the signal at the bottommost layer with placeholders according to the definition, so that the low-level semantic concepts are sequentially assigned in the hierarchical extraction process.
The original signal is the signal belonging to the entity in the scene collected by the sensor. Such as the images and sounds of the participants three.
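The top-down instantiation with placeholders described in steps 3.21 and 3.22 can be sketched as follows. This is a minimal illustration only: the concept hierarchy, attribute names, and assigned values below are assumptions for the example, not definitions taken from the patent.

```python
# Sketch of instantiating a predefined tree semantic architecture
# top-down, filling every attribute and bottom-level signal with a
# placeholder to be assigned during hierarchical extraction.

PLACEHOLDER = None  # value not yet assigned by a recognizer

# Illustrative predefined architecture: each semantic concept has
# attributes and lower-level child concepts; a leaf holds a raw signal.
ARCHITECTURE = {
    "participant": {
        "attributes": ["identity", "pose"],
        "children": {
            "head": {
                "attributes": ["expression", "orientation"],
                "children": {
                    # bottom level: the raw signal itself, no attributes
                    "face_signal": {"attributes": [], "children": {}},
                },
            },
        },
    },
}

def instantiate(concept, architecture):
    """Create a semantic object for `concept`: every attribute of every
    lower-level concept, and the bottom-level signal slot, start as
    placeholders, then get assigned from high level to low level."""
    node = architecture[concept]
    return {
        "concept": concept,
        "attributes": {a: PLACEHOLDER for a in node["attributes"]},
        "signal": PLACEHOLDER,  # recorded directly at the bottom level
        "children": {c: instantiate(c, node["children"])
                     for c in node["children"]},
    }

obj = instantiate("participant", ARCHITECTURE)
obj["attributes"]["identity"] = "participant-3"               # high level first
obj["children"]["head"]["attributes"]["expression"] = "smiling"  # then lower levels
```

Attributes not yet reached by the extraction pass (here, the head's orientation) simply remain placeholders until a recognizer assigns them.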
Step 3.23: identify objects that interact with the participants and for which a tree semantic architecture is predefined.
Specifically, pair every instance in the scene with every participant, and evaluate the degree of interaction of each pair using a pattern recognition method, including but not limited to template matching and deep network models. Compare the evaluation score with a preset threshold: interaction is considered to occur when the score is above the threshold, and not to occur otherwise. Methods of setting the threshold include, but are not limited to, manual setting and values learned from data statistics.
Here, objects are instances of semantic concepts other than "participant" for which a tree semantic architecture is predefined, such as tables, chairs, and stools.
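The pairwise test in step 3.23 reduces to scoring every (participant, object) combination and thresholding the score. In the sketch below a simple inverse-distance score stands in for whatever pattern recognition method (template matching, deep network model) a real system would use; the positions and the threshold value are illustrative assumptions.

```python
import math

def interaction_score(participant_pos, object_pos):
    """Stand-in interaction degree: closer means more interaction.
    A real system would use template matching or a deep network."""
    d = math.dist(participant_pos, object_pos)
    return 1.0 / (1.0 + d)

def detect_interactions(participants, objects, threshold=0.5):
    """Pair every object with every participant; keep the pairs whose
    evaluated interaction degree exceeds the preset threshold."""
    pairs = []
    for pname, ppos in participants.items():
        for oname, opos in objects.items():
            if interaction_score(ppos, opos) > threshold:
                pairs.append((pname, oname))
    return pairs

participants = {"p1": (0.0, 0.0, 0.0)}
objects = {"chair": (0.5, 0.0, 0.0), "table": (4.0, 0.0, 0.0)}
print(detect_interactions(participants, objects))  # [('p1', 'chair')]
```

The threshold itself could equally be set manually or learned from statistics of labeled interaction data, as the step allows.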
Step 3.24: instantiate the predefined tree semantic architecture of the object to obtain a second semantic object, identify the semantic concepts of the second semantic object from high level to low level and record their attributes, directly record the raw signals in the tree semantic architecture, and update the private-level semantic library with the second semantic object.
Step 4: update the public-level semantic library with the private-level semantic libraries, where the public-level semantic library stores all the data so as to realize data sharing.
Specifically, the public-level semantic library is shared and used for data exchange; once there are many participants, their data cannot all be stored in any single private-level semantic library, so a public-level semantic library on the server side is required.
In a specific embodiment, step 4 may specifically include:
Step 4.1: set an initial semantic level.
Specifically, the initial semantic level, that is, the initially configured level, needs to be set in view of conditions such as bandwidth: when the bandwidth is high, a lower initial semantic level can be used, which describes the meeting place more precisely. Each semantic concept may adopt a different initial semantic level. For example, during a meeting a user usually pays more attention to the facial expressions of the other participants than to the texture of the chairs they sit on, so a lower initial semantic level may be set for "participant".
Step 4.2: intercept, from each semantic object in the private-level semantic library, the part whose level is not lower than the initial semantic level, as the basic semantic-level description.
Specifically, the number of steps from the root node to a node in the tree semantic architecture is that node's level number. The level number is compared with the initial semantic level, and the part of each semantic object in the private-level semantic library that is not lower than the initial semantic level is intercepted; the intercepted part is the basic semantic-level description. For example, if the initial semantic level is set to level 5, the part of every semantic object within the first five levels is intercepted as the basic semantic-level description.
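On one reading of step 4.2, intercepting the basic semantic-level description amounts to pruning a semantic object's instance tree below a depth cutoff. The sketch below assumes the nested-dict object representation, with the root at level 1 and deeper nodes at higher level numbers; the per-concept level values are illustrative.

```python
def truncate(semantic_object, initial_level, level=1):
    """Keep only the part of a semantic object whose level number
    (steps from the root, root = level 1) is within the initial
    semantic level; the result is the basic semantic-level description."""
    kept = {
        "concept": semantic_object["concept"],
        "attributes": dict(semantic_object["attributes"]),
        "children": {},
    }
    if level < initial_level:  # descend only while still above the cutoff
        kept["children"] = {
            name: truncate(child, initial_level, level + 1)
            for name, child in semantic_object["children"].items()
        }
    return kept

# Per-concept initial levels, as step 4.1 allows: a deeper (finer)
# description for participants than for furniture. Values are assumed.
INITIAL_LEVELS = {"participant": 5, "chair": 2}

chair_obj = {
    "concept": "chair", "attributes": {"pose": (0, 0, 0)},
    "children": {"texture": {"concept": "texture", "attributes": {},
                             "children": {"signal": {"concept": "signal",
                                                     "attributes": {},
                                                     "children": {}}}}},
}
basic = truncate(chair_obj, INITIAL_LEVELS["chair"])
```

With the chair's cutoff at level 2, the texture node survives but its bottom-level signal is pruned; the signal would only be sent later, on query or broadcast.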
Step 4.3: the user side sends the basic semantic-level description to the server side over the communication link.
Step 4.4: the server side updates the public-level semantic library with the received basic semantic-level descriptions.
Step 5: the server side sends the hierarchical semantic description composed of the entire contents of the public-level semantic library to all user sides over the communication link, and each user side updates its private-level semantic library according to this hierarchical semantic description.
That is, each user side receives all semantic instances currently held by the server and updates its private-level semantic library with the hierarchical semantic description composed of the entire contents of the public-level semantic library. At the outset, a private-level semantic library contains only the participant and scene captured by its own sensors, with no other participants, so the virtual conference scene cannot yet be constructed. This step therefore equips every user side with the semantic instances needed to build the basic conference scene.
Step 6: the conference starts, and the user side generates and displays the virtual conference scene from the participant's viewing angle according to the private-level semantic library updated in step 5.
Step 6.1: represent the semantic objects in the private-level semantic library by the bottom-level default signals.
Step 6.2: arrange all the semantic objects represented by default signals in the virtual meeting place.
Specifically, the attributes of a semantic object include spatial pose information, for example a three-dimensional position plus a three-dimensional orientation. The user side can place the signal of the semantic object in the virtual meeting place according to this spatial pose.
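Placing a default signal in the virtual meeting place by its spatial pose (step 6.2) is a rigid transform of the object's local geometry. The sketch below uses only the yaw component of the three-dimensional orientation for brevity; the pose representation and the sample geometry are assumptions for illustration.

```python
import math

def place(vertices, pose):
    """Transform an object's local-frame vertices into the virtual
    meeting place using its spatial pose attribute: a 3-D position
    (x, y, z) plus, in this sketch, only the yaw angle of the 3-D
    orientation."""
    x, y, z, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    placed = []
    for vx, vy, vz in vertices:
        placed.append((x + c * vx - s * vy,   # rotate about the vertical
                       y + s * vx + c * vy,   # axis, then translate
                       z + vz))
    return placed

chair = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]        # local-frame vertices
print(place(chair, (2.0, 3.0, 0.0, math.pi / 2)))  # 90-degree yaw
```

A full implementation would apply the complete three-dimensional orientation (e.g. a rotation matrix or quaternion) in the same pattern.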
Step 6.3: obtain the pose of the user side's own participant from the private-level semantic library, and generate the virtual conference scene from that participant's viewing angle.
Specifically, during hierarchical semantic extraction, the spatial pose information of semantic objects such as each participant's head is extracted and recorded by a pattern recognition method. The user side obtains its own participant's pose by querying the private-level semantic library.
Methods for generating the virtual conference scene from the participant's viewing angle include, but are not limited to, a virtual physics engine and an end-to-end generative adversarial network.
Step 6.4: display the virtual conference scene from the participant's viewing angle through the display module of the user side.
Step 7: after step 6, generate and display the virtual conference scene from the participant's viewing angle during the conference.
Step 7.1: acquire the multi-modal signals of the scene in which the participant is located.
In particular, multimodal signals of the scene in which the participant is located are collected using sensors, for example, of the visual type and/or of the auditory type.
Step 7.2: based on the tree semantic architecture, hierarchically extract the semantic description of the scene in which the participant is located from the multi-modal signals, and update the private-level semantic library accordingly.
For the specific hierarchical extraction process in step 7.2, refer to step 3.2; it is not repeated here.
Step 7.3: analyze the participants' query intent or broadcast intent using the private-level semantic library.
Specifically, the private-level semantic library is used to analyze a participant's query intent (the wish to know a semantic object in more detail) or broadcast intent (the wish to share the details of a semantic object with others). For query intent: the participant's gaze fixation position can be found through eye-movement recognition, and the participant's interest judged from the fixation time; the longer the fixation, the greater the interest, and the lower the semantic level at which the object is provided, down to a signal-level description. Query intent can also be inferred by analyzing the participants' conversation; for example, when both parties are talking about what they are wearing, their clothing should be given a low-level semantic description. For broadcast intent: through gesture recognition, when a participant holds up an object and points at it with the other hand, the object can be broadcast with a low-level semantic description; or, through speech recognition, when a participant says "please notice this", it indicates that the participant intends to broadcast a low-level semantic description of the object.
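The dwell-time rule for query intent (the longer the gaze rests on an object, the lower the semantic level requested, down to the signal level) can be sketched as a simple mapping. The one-level-per-second rule, the interest threshold, and the level numbering are illustrative assumptions, not values from the patent.

```python
def query_level(fixation_seconds, max_level=5):
    """Map gaze fixation time on a semantic object to the semantic
    level to request: longer fixation means greater interest, hence a
    lower level, bottoming out at level 0, the signal-level description."""
    # Illustrative rule: one level lower per full second of fixation.
    return max(max_level - int(fixation_seconds), 0)

def analyze_query_intents(fixations, min_interest_seconds=1.0):
    """fixations: {object_name: seconds the participant's gaze rested
    on it}. Returns {object_name: requested semantic level} for the
    objects fixated long enough to indicate interest."""
    return {obj: query_level(t) for obj, t in fixations.items()
            if t >= min_interest_seconds}

print(analyze_query_intents({"whiteboard": 6.5, "chair": 0.3, "p2_face": 2.0}))
```

The resulting {object: level} pairs are exactly what step 7.4 would forward to the server as a query intent; gesture- and speech-triggered broadcast intents could feed the same structure.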
Step 7.4: receive the query intents of other participants from the server side over the communication link.
Step 7.5: form a hierarchical semantic description from the queried or broadcast semantic object together with the semantic objects at higher levels, and send it to the server side over the communication link; the server side updates the public-level semantic library with the received hierarchical semantic description.
Step 7.6: the server side sends the broadcast hierarchical semantic descriptions to all user sides over the communication link and sends the queried hierarchical semantic descriptions in a targeted manner according to each user side's query intent; each user side updates its private-level semantic library with the broadcast hierarchical semantic descriptions and the hierarchical semantic descriptions it queried.
Step 7.7: the user side generates the virtual conference scene from the participant's viewing angle according to the private-level semantic library updated in step 7.6.
Step 7.8: display the virtual conference scene from the participant's viewing angle through the display module of the user side.
Step 8: end the conference.
Optionally, the method for updating a hierarchical semantic library comprises the following steps:
1) receive a hierarchical semantic description;
2) traverse each semantic object in the hierarchical semantic description in order from high level to low level;
3) judge whether the semantic object already exists in the hierarchical semantic library; if so, go to 5); if not, go to 4);
4) instantiate the semantic object and its lower-level semantic objects according to the tree semantic architecture;
5) update the attributes of the semantic object.
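The five steps above can be sketched directly. In this sketch a flat dict keyed by a semantic-object identifier stands in for the hierarchical semantic library, and `instantiate_subtree` is a hypothetical hook representing step 4; both are assumptions for illustration.

```python
def instantiate_subtree(obj_id, description):
    """Hypothetical stand-in for step 4: instantiate the semantic object
    and its lower-level objects from the tree semantic architecture."""
    return {"level": description["level"], "attributes": {}}

def update_library(library, hierarchical_description):
    """Steps 1-5: traverse the received hierarchical semantic description
    from high level to low level, instantiating unknown semantic objects
    and then updating each object's attributes."""
    ordered = sorted(hierarchical_description.items(),
                     key=lambda kv: kv[1]["level"])  # high (small) level first
    for obj_id, desc in ordered:
        if obj_id not in library:                    # step 3 -> step 4
            library[obj_id] = instantiate_subtree(obj_id, desc)
        library[obj_id]["attributes"].update(desc["attributes"])  # step 5
    return library

library = {}
description = {
    "participant/3":      {"level": 1, "attributes": {"identity": "p3"}},
    "participant/3/head": {"level": 2, "attributes": {"expression": "neutral"}},
}
update_library(library, description)
```

Because the same procedure is applied to both private-level and public-level libraries, the client and server can share this one update path.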
The invention aims to provide a semantic-fidelity-oriented three-dimensional virtual conference method. By predefining the semantics in the conference scene and, during the conference, actively perceiving user intent or passively receiving feedback, the semantics in the signals are extracted, transmitted, and reproduced in a targeted manner, realizing a virtual conference oriented to semantic fidelity and resolving the contradiction among communication bandwidth, computing power, and user experience in the prior art.
The semantic-fidelity-oriented three-dimensional virtual conference method solves the prior-art problems of a small number of participants, coarse avatar expressions and actions, unrealistic conference scenes, and difficult interaction between people and the scene; it increases the scale of the virtual conference and improves the participants' sense of realism.
Embodiment Two
Referring to fig. 2, fig. 2 shows a semantic-fidelity-oriented three-dimensional virtual conference system according to an embodiment of the present invention. The system comprises a server side and at least two user sides, the server side being connected to all the user sides through communication links.
In a specific embodiment, the server side includes:
the server-side communication module is used for transmitting the hierarchical semantic description and the user intention with at least two user sides based on the communication link, wherein the user intention comprises a query intention and/or a broadcast intention;
and the public level semantic library module is used for summarizing and storing the level semantic descriptions sent by all the user sides.
In one embodiment, the user terminal includes:
the multi-modal sensor module is used for collecting multi-modal signals from a scene where the participant is located and at least comprises a visual sensor and an auditory sensor;
the hierarchical perception module is used for hierarchically extracting hierarchical semantic description of a scene where the participant is located from the multi-modal signal;
the private level semantic library module is used for storing a corresponding semantic object in a conference scene where the user side is located;
the user intention module is used for analyzing the user intention from the private level semantic library and sending or receiving level semantic description according to the user intention;
the client communication module is used for transmitting the hierarchical semantic description and the user intention with the server based on the communication link;
and the display module is used for obtaining the postures of the participants according to the private level semantic library and generating and displaying the virtual meeting room scene under the visual angle of the participants.
The invention aims to provide a semantic-fidelity-oriented three-dimensional virtual conference system. By predefining the semantics in the conference scene and, during the conference, actively perceiving user intent or passively receiving feedback, the semantics in the signals are extracted, transmitted, and reproduced in a targeted manner, realizing a virtual conference oriented to semantic fidelity and resolving the contradiction among communication bandwidth, computing power, and user experience in the prior art.
The semantic-fidelity-oriented three-dimensional virtual conference system solves the prior-art problems of a small number of participants, coarse avatar expressions and actions, unrealistic conference scenes, and difficult interaction between people and the scene; it increases the scale of the virtual conference and improves the participants' sense of realism.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art can combine the various embodiments or examples described in this specification.
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all of these shall be considered as falling within the protection scope of the invention.

Claims (10)

1. A semantic fidelity-oriented virtual conference method is characterized by comprising the following steps:
step 1, constructing a tree-shaped semantic architecture with not less than two levels according to each semantic concept at least comprising participants, wherein nodes of each level of the tree-shaped semantic architecture consist of semantic concepts and attributes thereof, and the bottommost layer of the tree-shaped semantic architecture is a signal without attributes;
step 2, setting a corresponding default signal for each semantic concept of different levels in the tree semantic framework;
step 3, according to a multi-modal signal of a scene where a participant is located, extracting semantic descriptions of the scene where the participant is located from the multi-modal signal in a hierarchical manner and correspondingly updating private-level semantic libraries, wherein each private-level semantic library corresponds to a user side and is used for storing data of the user side;
step 4, updating a public level semantic library by using the private level semantic library, wherein the public level semantic library is used for storing all data so as to realize data sharing;
step 5, the server side sends the hierarchical semantic description composed of all the contents in the public hierarchical semantic library to all the user sides by using a communication link, and updates the corresponding private hierarchical semantic library at each user side according to the hierarchical semantic description composed of all the contents;
step 6, starting the conference, and generating a virtual conference scene in the participant visual angle by the user side according to the private level semantic library updated in the step 5 and displaying the virtual conference scene;
and 7, after the step 6, generating a virtual conference scene in the view angle of the participant in the conference process and displaying the virtual conference scene.
2. The virtual conference method oriented to semantic fidelity according to claim 1, wherein the top root node of the tree-shaped semantic framework is the semantic concept and its attribute, and according to the sequence from high to low, the semantic concept of the next level of the tree-shaped semantic framework is decomposed from the semantic concept of the previous level, and the decomposed semantic concept and its attribute are used as the node of the next level.
3. The virtual conference method oriented to semantic fidelity according to claim 1, wherein the step 3 comprises:
step 3.1, obtaining multi-mode signals of a scene where the participant is located;
and 3.2, extracting semantic descriptions of scenes where the participants are located from the multi-modal signals in a hierarchical manner based on the tree semantic architecture, and correspondingly updating the private-level semantic library.
4. Semantic fidelity-oriented virtual conferencing method according to claim 3, wherein the step 3.2 comprises:
step 3.21, identifying the participants;
step 3.22, instantiating a tree-shaped semantic framework of a predefined participant to obtain a first semantic object, identifying semantic concepts of the first semantic object from high to low, recording attributes of the semantic concepts, directly recording original signals of the semantic concepts at the bottommost layer of the tree-shaped semantic framework, and updating the private level semantic library by using the first semantic object;
3.23, identifying an object which interacts with the participant and predefines a tree semantic architecture;
and 3.24, instantiating a tree semantic architecture of the predefined object to obtain a second semantic object, identifying semantic concepts of the second semantic object from high to low, recording attributes of the semantic concepts, directly recording original signals of the semantic concepts in the tree semantic architecture, and updating the private level semantic library by using the second semantic object.
5. The virtual conference method oriented to semantic fidelity according to claim 1, wherein the step 4 comprises:
step 4.1, setting an initial semantic level;
step 4.2, intercepting the part which is not lower than the initial semantic level in each semantic object in the private level semantic library as basic semantic level description;
4.3, the user side sends the basic semantic hierarchy description to a server side through a communication link;
and 4.4, the server side updates the public level semantic library by using the received basic semantic level description.
6. The virtual conference method oriented to semantic fidelity according to claim 1, wherein the step 6 comprises:
6.1, representing the semantic objects in the private level semantic library by using a default signal at the bottommost layer;
step 6.2, arranging all semantic objects represented by the default signals in a virtual meeting place;
6.3, obtaining the postures of the participants of the user side according to the private level semantic library, and generating a virtual conference scene in the view angle of the participants;
and 6.4, displaying the virtual conference scene in the view angle of the participant through a display module of the user side.
7. The virtual conference method oriented to semantic fidelity according to claim 1, wherein the step 7 comprises:
step 7.1, obtaining multi-mode signals of a scene where the participant is located;
step 7.2, extracting semantic descriptions of scenes where the participants are located from the multi-modal signals in a hierarchical manner based on the tree semantic architecture, and correspondingly updating the private-level semantic library;
7.3, analyzing the query intention or broadcast intention of the participants by utilizing a private level semantic library;
7.4, receiving the query intention of other participants from the server side through the communication link;
step 7.5, forming a hierarchical semantic description by the queried or broadcast semantic object and the semantic objects at higher levels through a communication link, sending the hierarchical semantic description to a server, and updating the public hierarchical semantic library by using the received hierarchical semantic description;
7.6, the server side sends the broadcast hierarchical semantic descriptions to all the user sides by using the communication link and sends the queried hierarchical semantic descriptions in a targeted manner according to each user side's query intention; each user side updates its private hierarchical semantic library by using the broadcast hierarchical semantic descriptions and the hierarchical semantic descriptions it queried;
7.7, the user side generates a virtual conference scene in the participant visual angle according to the private level semantic library updated in the step 7.6;
and 7.8, displaying the virtual conference scene in the view angle of the participant through a display module of the user side.
8. A three-dimensional virtual conference system oriented to semantic fidelity, for implementing the virtual conference method oriented to semantic fidelity of any one of claims 1 to 7, the three-dimensional virtual conference system comprising: the system comprises a server and at least two user sides, wherein the server is connected with all the user sides through communication links.
9. The three-dimensional virtual conference system oriented to semantic fidelity according to claim 8, wherein the server side comprises:
the server-side communication module is used for transmitting hierarchical semantic description and user intention with at least two user sides based on the communication link, wherein the user intention comprises query intention and/or broadcast intention;
and the public level semantic library module is used for summarizing and storing the level semantic descriptions sent by all the user sides.
10. The three-dimensional virtual conference system oriented to semantic fidelity according to claim 8, wherein the user side comprises:
the multi-modal sensor module is used for collecting multi-modal signals from a scene where the participant is located and at least comprises a visual sensor and an auditory sensor;
the hierarchical perception module is used for hierarchically extracting hierarchical semantic description of a scene where the participant is located from the multi-modal signal;
the private level semantic library module is used for storing a corresponding semantic object in a conference scene where the user side is located;
the user intention module is used for analyzing the user intention from the private level semantic library and sending or receiving level semantic description according to the user intention;
the user side communication module is used for carrying out hierarchical semantic description and user intention transmission with the server side based on the communication link;
and the display module is used for obtaining the postures of the participants according to the private level semantic library and generating and displaying the virtual meeting room scene under the visual angle of the participants.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210207225.9A CN114363557B (en) 2022-03-04 2022-03-04 Semantic fidelity-oriented virtual conference method and three-dimensional virtual conference system

Publications (2)

Publication Number Publication Date
CN114363557A true CN114363557A (en) 2022-04-15
CN114363557B CN114363557B (en) 2022-06-24

Family

ID=81094755

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102170361A (en) * 2011-03-16 2011-08-31 西安电子科技大学 Virtual-reality-based network conference method
US20160378861A1 (en) * 2012-09-28 2016-12-29 Sri International Real-time human-machine collaboration using big data driven augmented reality technologies
US20140184496A1 (en) * 2013-01-03 2014-07-03 Meta Company Extramissive spatial imaging digital eye glass apparatuses, methods and systems for virtual or augmediated vision, manipulation, creation, or interaction with objects, materials, or other entities
CN103678569A (en) * 2013-12-09 2014-03-26 北京航空航天大学 Construction method of virtual scene generation-oriented video image material library
CN103888714A (en) * 2014-03-21 2014-06-25 国家电网公司 3D scene network video conference system based on virtual reality
US20180045963A1 (en) * 2016-08-11 2018-02-15 Magic Leap, Inc. Automatic placement of a virtual object in a three-dimensional space
CN107071334A (en) * 2016-12-24 2017-08-18 深圳市虚拟现实技术有限公司 3D video-meeting methods and equipment based on virtual reality technology
US20190340825A1 (en) * 2016-12-26 2019-11-07 Interdigital Ce Patent Holdings Device and method for generating dynamic virtual contents in mixed reality
US20190188895A1 (en) * 2017-12-14 2019-06-20 Magic Leap, Inc. Contextual-based rendering of virtual avatars
WO2021247156A1 (en) * 2020-06-04 2021-12-09 Microsoft Technology Licensing, Llc Classification of auditory and visual meeting data to infer importance of user utterances
US20210383127A1 (en) * 2020-06-04 2021-12-09 Microsoft Technology Licensing, Llc Classification of auditory and visual meeting data to infer importance of user utterances
CN113315972A (en) * 2021-05-19 2021-08-27 西安电子科技大学 Video semantic communication method and system based on hierarchical knowledge expression

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PIERLUIGI ZAMA RAMIREZ等: "Shooting Labels: 3D Semantic Labeling by Virtual Reality", 《2020 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND VIRTUAL REALITY》, 15 January 2021 (2021-01-15) *
ZHONGQIANG ZHANG等: "S³Net: Spectral–Spatial–Semantic Network for Hyperspectral Image Classification With the Multiway Attention Mechanism", 《 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING》, 14 April 2021 (2021-04-14) *
JI Lian-en et al.: "Semantic-based three-dimensional interaction techniques in virtual environments", Journal of Software (《软件学报》), no. 07, 23 July 2006 (2006-07-23) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant