CN113315972B - Video semantic communication method and system based on hierarchical knowledge expression

Info

Publication number: CN113315972B
Application number: CN202110543408.3A
Authority: CN
Inventors: 石光明; 高大化; 杨旻曦; 张中强; 董宇波; 谢雪梅; 刘丹华
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-19
Filing date: 2021-05-19
Publication date: 2022-04-19
Anticipated expiration: 2041-05-19
Also published as: CN113315972A

Abstract

The invention provides a video semantic communication method based on hierarchical knowledge expression, which mainly addresses the incomplete semantic extraction, insufficient semantic representation capability and redundant semantic description of the prior art. The implementation scheme is as follows: constructing a hierarchical knowledge base consisting of a multi-level signal perception network, a semantic abstract network, a semantic reconstruction network and a signal reconstruction network; collecting a video signal to be transmitted; extracting the structured semantic features of the video signal based on the signal perception network and the semantic abstract network in the hierarchical knowledge base, and transmitting these features through an ultra-narrow band channel; and reconstructing the video signal from the structured semantic features by using the semantic reconstruction network and the signal reconstruction network in the hierarchical knowledge base. By mining semantic features of different scales and representing semantics with a structured data structure, the invention improves the completeness of semantic extraction as well as the semantic representation capability and the utilization of communication bandwidth. The method can be used for online conferences, human-computer interaction and the intelligent Internet of Things.

Description

Video semantic communication method and system based on hierarchical knowledge expression

Technical Field

The invention belongs to the technical field of video communication, and particularly relates to a video semantic communication method and system based on hierarchical knowledge expression, which can be used for online conferences, human-computer interaction and the intelligent Internet of Things.

Background

Video communication is a communication service that delivers video information. Current video communication technology conveys video information by transmitting the video signal in full, and judges the quality of the communication by the integrity of the transmitted signal. With the rapid improvement of video resolution and the geometric growth in the number of communication terminals, communication bandwidth, which is gradually approaching its growth limit, cannot meet the requirements of video communication scenarios of the intelligent era, such as large online interactive conferences and the intelligent Internet of Things.

In the conventional video communication method, a video signal is compressed and encoded with a signal compression algorithm, typified by the wavelet transform; the resulting code is transmitted through a channel, and the video signal is finally reconstructed at the receiving end. However, owing to the limits of the compression algorithm, the bit rate after compression still cannot meet the video communication requirements of the intelligent era. For example, transmitting a single 4K video stream at 30 frames per second with the current latest video coding method, H.265, requires a bandwidth of 40 Mbps, so a 5G terminal device cannot transmit tens of video streams simultaneously to meet the interaction requirements of a large online conference.

In the video communication method based on deep learning, a video encoder based on a deep network converts the video signal into feature vectors, the feature vectors are transmitted through a channel, and at the receiving end a video decoder based on a generative adversarial network restores the video signal from the received feature vectors. Compared with the traditional video communication method, the neural network can generate feature vectors of any length, so the compression rate of the deep-learning method can be high. However, because training a deep network requires a large amount of data and time, a trained network can only be used in a specific scene; when the scene changes, the data set must be rebuilt and the network retrained, which is inflexible. In addition, because the extracted features are not targeted, key details are easily lost at high compression rates, which distorts the generated video.

In many application scenarios, video communication does not need to transmit the video signal in full, but only needs to convey the semantic information that the video signal represents. For example, in an interactive video conference, the two communicating parties need the meaning conveyed by facial expressions and body movements, not information such as the other party's environment or clothing texture. A video semantic communication method that extracts and transmits the semantics expressed by the video signal in the target scene, and reproduces the video signal from those semantics, can therefore save communication bandwidth effectively and meet the video communication requirements of the intelligent era. For example, the patent application with publication number CN111246176A, entitled "a segmented video transmission method", discloses a video transmission method based on text semantics. In that method, the sender first identifies and extracts a foreground target in the video signal and uses the feature point coordinates of the foreground target, described in text, as the text semantics; the text semantics are then transmitted to the receiver; the receiver reconstructs the target contour from the text semantics; and finally a trained generation network reconstructs the video signal from the target contour. By extracting, transmitting and reconstructing the semantics in the video signal, the method improves compression efficiency and conveys the semantic information of the video signal, but it has the following two disadvantages:

firstly, because the semantics transmitted by the method are the contours of foreground targets, only coordinates can be transmitted, while information such as target color and texture cannot; the semantic extraction is therefore incomplete and the application scenarios are limited;

secondly, the method represents semantics with unstructured data structures such as text, whose semantic representation capability is insufficient, so the semantic types that the method can transmit are limited and their expression is imprecise; moreover, the unstructured description is redundant, which wastes communication bandwidth.

Disclosure of Invention

The invention aims to provide a video semantic communication method and system based on hierarchical knowledge expression, which are used for solving the problems of incomplete semantic extraction, insufficient semantic representation capability and redundant semantic description in the prior art, expanding application scenes and avoiding waste of communication bandwidth.

In order to achieve the purpose, the video semantic communication method based on hierarchical knowledge expression comprises the following steps:

1) constructing a hierarchical knowledge base K:

1a) establishing a semantic perception knowledge base $K_0$ for storing the signal perception network $W_e^0$, which extracts the primary structured semantic features $G_0$ from the video, and the signal reconstruction network, which reconstructs the video from the primary structured semantic features $G_0$;

1b) establishing L semantic abstract knowledge bases $K_l$ of progressively increasing level, each storing the semantic abstract network, which generates the high-level structured semantic features $G_l$ from the lower-level structured semantic features $G_{l-1}$, and the semantic reconstruction network, which reconstructs the lower-level structured semantic features $G_{l-1}$ from the high-level structured semantic features $G_l$, where $L \ge 1$, $l$ is the index of the semantic level, and $1 \le l \le L$;

1c) forming the hierarchical knowledge base K from the semantic perception knowledge base $K_0$ and the L semantic abstract knowledge bases $K_l$ of progressively increasing level, arranged in hierarchical order;

2) collecting an original video V of F frames to be transmitted, where $F \ge 1$;

3) extracting semantic features from the original video V based on the signal perception network $W_e^0$ and the semantic abstract networks in the hierarchical knowledge base K, to obtain the top-level structured semantic features $G_L$ corresponding to the original video V;

4) setting up an ultra-narrow band channel with bandwidth $Q \le 4$ Kbps, and transmitting the top-level structured semantic features $G_L$ corresponding to the original video V over it;

5) restoring the received top-level structured semantic features $G_L$ based on the signal reconstruction network and the semantic reconstruction networks in the hierarchical knowledge base K, to obtain a reconstructed video V'.

In order to achieve the above object, the video semantic communication system based on hierarchical knowledge expression of the present invention includes:

the video acquisition device is used for acquiring an original video;

the semantic encoder is connected with the video acquisition device and used for performing semantic encoding on the original video to obtain semantic features of the original video;

the ultra-narrow band communication device is connected with the semantic encoder and is used for transmitting the semantic features of the video on an ultra-narrow band channel;

the semantic decoder is connected with the ultra-narrow band communication device and is used for reconstructing semantic features to obtain a reconstructed video;

the video display device is connected with the semantic decoder and is used for displaying the reconstructed video;

the system is characterized in that:

the knowledge base query port of the semantic encoder is connected with an information source level knowledge base, which stores a semantic extraction network and is used for encoding the original video level by level to obtain the structured semantic features corresponding to the original video;

and the knowledge base query port of the semantic decoder is connected with an information sink level knowledge base, which stores a video reconstruction network and is used for reconstructing the structured semantic features level by level to obtain a reconstructed video.

Compared with the prior art, the invention has the following advantages:

firstly, when extracting the semantics of the video signal, the invention performs level-by-level semantic extraction based on the hierarchical knowledge base, fully mines the semantic features of different scales in the video signal, avoids the single type of semantic feature used in the prior art, and effectively improves the completeness of semantic extraction;

secondly, when representing the semantics of the video signal, the invention uses a more flexible structured semantic representation, in which the relations between objects in the structure efficiently represent the semantic objects in the video signal and the interactions between them; this avoids the inaccurate and redundant semantic descriptions caused by representing semantics with text in the prior art, and effectively improves the semantic representation capability and the utilization of communication bandwidth.

Drawings

FIG. 1 is a flow chart of an implementation of a video semantic communication method based on hierarchical knowledge representation according to the present invention;

FIG. 2 is a schematic structural diagram of a video semantic communication system based on hierarchical knowledge expression according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments.

This embodiment targets the video semantic communication requirements of online interactive conferences, and implements the video semantic communication method and system based on hierarchical knowledge expression for single-person body actions and multi-person interactive behaviors in a conference scene.

Referring to fig. 1, the video semantic communication method based on hierarchical knowledge expression of the present embodiment includes the following steps:

Step 1, constructing a hierarchical knowledge base K.

1.1) establishing a semantic perception knowledge base $K_0$ for storing the signal perception network $W_e^0$, which extracts the primary structured semantic features $G_0$ from the video, and the signal reconstruction network, which reconstructs the video from the primary structured semantic features $G_0$.

In this example, both single-person body actions and multi-person interactions in the scene semantics can be described by semantic objects, such as joint points and persons, and by relations between the semantic objects, such as bones and person-to-person interaction relations. To represent the semantic objects and the relations between them effectively, the primary structured semantic features $G_0$, the signal perception network $W_e^0$ and the signal reconstruction network are structured as follows:

the primary structured semantic features $G_0$ are represented by a semantic graph consisting of points and edges, which comprises a primary node set $A_0$ and a primary edge set $B_0$, wherein:

the primary node set $A_0$ consists of the primary nodes, which comprise the 14 human joint points of the head, neck, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left hip, right hip, left knee, right knee, left foot and right foot; $i_0$ is the index of a node in the primary node set and runs from 1 to the total number of nodes in the primary node set; the $i_0$-th node is defined by its semantic class, its semantic feature vector and its child node set, where the semantic feature vector is a two-dimensional vector formed by the planar two-dimensional coordinates of the joint point;

the primary edge set $B_0$ consists of the primary edges, which comprise the bones, such as those of the arms and thighs, that connect the joint points; $j_0$ is the index of an edge in the primary edge set and runs from 1 to the total number of edges in the primary edge set; the $j_0$-th edge is defined by its semantic class and its semantic feature vector, where the semantic feature vector is a one-dimensional vector formed by a scalar reflecting the degree of occlusion;
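The primary semantic graph described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Node`, `Edge` and `SemanticGraph`, the helper `make_primary_graph`, the edge `endpoints` field and the bone name used in the usage note are hypothetical and do not come from the patent, which only specifies that nodes carry a semantic class, a feature vector and a child node set, and that edges carry a semantic class and a feature vector.

```python
from dataclasses import dataclass, field

# The 14 joint classes listed in the embodiment.
JOINT_CLASSES = [
    "head", "neck", "left_shoulder", "right_shoulder", "left_elbow",
    "right_elbow", "left_hand", "right_hand", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_foot", "right_foot",
]


@dataclass
class Node:
    semantic_class: str                            # e.g. "neck"
    feature: tuple                                 # for a joint: planar (x, y)
    children: list = field(default_factory=list)   # child node indices; empty at the primary level


@dataclass
class Edge:
    semantic_class: str    # e.g. a bone name
    endpoints: tuple       # assumption: indices of the two joints the bone connects
    feature: tuple         # one-element vector holding the occlusion scalar


@dataclass
class SemanticGraph:
    level: int
    nodes: list
    edges: list


def make_primary_graph(coords, bones):
    """Build G_0 from 14 (x, y) pairs and (name, i, j, occlusion) bone tuples."""
    if len(coords) != len(JOINT_CLASSES):
        raise ValueError("expected one (x, y) pair per joint class")
    nodes = [Node(cls, tuple(xy)) for cls, xy in zip(JOINT_CLASSES, coords)]
    edges = [Edge(name, (i, j), (occ,)) for name, i, j, occ in bones]
    return SemanticGraph(level=0, nodes=nodes, edges=edges)
```

For instance, `make_primary_graph(coords, [("left_upper_arm", 2, 4, 0.0)])` yields a graph with 14 joint nodes and one unoccluded bone between the left shoulder and the left elbow.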

for the signal perception network $W_e^0$, in view of the character actions and interactions in a conference scene, a trained human pose detection network is selected as the signal perception network $W_e^0$ that extracts the primary structured semantic features $G_0$ from the video;

for the signal reconstruction network, in view of the character actions and interactions in a conference scene, a trained semantic-graph-based video image generation network is selected as the signal reconstruction network that reconstructs the video from the primary structured semantic features $G_0$.

At present, OpenPose, AlphaPose and the like are commonly used human pose detection networks, and graph2image, SPADE and the like are commonly used semantic-graph-based video image generation networks. This example preferably, but without limitation, uses AlphaPose as $W_e^0$ and graph2image as the signal reconstruction network.

1.2) establishing L semantic abstract knowledge bases $K_l$ of progressively increasing level, each storing the semantic abstract network, which generates the high-level structured semantic features $G_l$ from the lower-level structured semantic features $G_{l-1}$, and the semantic reconstruction network, which reconstructs the lower-level structured semantic features $G_{l-1}$ from the high-level structured semantic features $G_l$.

In this example, character actions and interactions in a conference scene involve three levels of semantics, from primary to advanced: the joint point level, the single-person posture level and the multi-person interaction level. A hierarchical semantic graph formed from several semantic graphs is therefore needed as the structured representation of the multi-level semantics. The high-level structured semantic features $G_l$, the semantic abstract network and the semantic reconstruction network are structured as follows:

the high-level structured semantic features $G_l$ are represented by a semantic graph consisting of points and edges, which comprises an advanced node set $A_l$ and an advanced edge set $B_l$, wherein:

in the advanced node set $A_l$, $i_l$ is the index of a node and runs from 1 to the total number of nodes in the advanced node set; the $i_l$-th node is defined by its semantic class, its semantic feature vector and its child node set;

in the advanced edge set $B_l$, $j_l$ is the index of an edge and runs from 1 to the total number of edges in the advanced edge set; the $j_l$-th edge is defined by its semantic class and its semantic feature vector;
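To make the role of the child node set concrete, the sketch below builds the three levels named in this example (joint point, single-person posture, multi-person interaction) by grouping lower-level nodes under higher-level ones. It is a hand-written stand-in, not the trained abstraction network of the patent: the function `pool_level`, the dict layout, the class labels "person" and "interaction", and the use of the mean as the higher-level feature are all assumptions made for illustration.

```python
def pool_level(lower_nodes, groups, semantic_class):
    """Build higher-level nodes whose child sets index into lower_nodes.

    Stand-in for a learned abstraction: each new node's feature vector is
    simply the mean of its children's feature vectors.
    """
    higher = []
    for child_ids in groups:
        feats = [lower_nodes[i]["feature"] for i in child_ids]
        dim = len(feats[0])
        mean = tuple(sum(f[d] for f in feats) / len(feats) for d in range(dim))
        higher.append({
            "semantic_class": semantic_class,
            "feature": mean,
            "children": list(child_ids),   # the child node set
        })
    return higher


# Level 0: two joints for each of two people (toy coordinates).
joints = [
    {"semantic_class": "head", "feature": (0.0, 0.0), "children": []},
    {"semantic_class": "neck", "feature": (0.0, 2.0), "children": []},
    {"semantic_class": "head", "feature": (4.0, 0.0), "children": []},
    {"semantic_class": "neck", "feature": (4.0, 2.0), "children": []},
]
persons = pool_level(joints, [[0, 1], [2, 3]], "person")   # level 1
scene = pool_level(persons, [[0, 1]], "interaction")       # level 2
```

Each level keeps only indices into the level below, so the hierarchical semantic graph is a stack of ordinary graphs linked through their child node sets.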

for the semantic abstract network, in view of the feature extraction of graph-structured data, a trained downsampling network based on a graph convolutional neural network is selected as the semantic abstract network that generates the high-level structured semantic features $G_l$ from the lower-level structured semantic features $G_{l-1}$;

for the semantic reconstruction network, in view of the feature restoration of graph-structured data, a trained upsampling network based on a graph convolutional neural network is selected as the semantic reconstruction network that reconstructs the lower-level structured semantic features $G_{l-1}$ from the high-level structured semantic features $G_l$.

The graph convolutional neural networks commonly used at present are GCN, GraphSAGE and GAT. This example preferably, but without limitation, uses a GAT-based downsampling network as the semantic abstract network and a GAT-based upsampling network as the semantic reconstruction network.

1.3) forming the hierarchical knowledge base K from the semantic perception knowledge base $K_0$ and the L semantic abstract knowledge bases $K_l$ of progressively increasing level, arranged in hierarchical order.

Step 2, acquiring an original video V of F frames to be transmitted; in this example, $F = 1$.

Step 3, extracting semantic features from the original video V based on the signal perception network $W_e^0$ and the semantic abstract networks in the hierarchical knowledge base K, to obtain the top-level structured semantic features $G_L$ corresponding to the original video V.

3.1) extracting the primary semantic features from the original video V based on the signal perception network $W_e^0$ in the semantic perception knowledge base $K_0$, to obtain the primary structured semantic features $G_0$;

3.2) extracting the high-level semantic features of the primary structured semantic features $G_0$ based, in turn, on the L semantic abstract knowledge bases $K_l$ of progressively increasing level, to obtain the top-level structured semantic features $G_L$:

3.2.1) let $l = 1$;

3.2.2) based on the semantic abstract network in the semantic abstract knowledge base $K_l$, generating the high-level structured semantic features $G_l$ from the lower-level structured semantic features $G_{l-1}$;

3.2.3) judging whether $l \ge L$; if so, the top-level structured semantic features $G_L$ are obtained; otherwise, let $l = l + 1$ and return to 3.2.2).
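The control flow of steps 3.1) to 3.2.3) can be sketched as a short loop. The function name `extract_top_level` and its parameters are hypothetical; `perceive` stands for the signal perception network and `abstract_nets[l - 1]` for the semantic abstract network stored in the level-l knowledge base, both passed in as plain callables rather than trained models.

```python
def extract_top_level(video, perceive, abstract_nets):
    """Return G_L by applying the perception network, then L abstraction networks.

    perceive:       callable, video -> G_0          (step 3.1)
    abstract_nets:  list of L callables, G_{l-1} -> G_l, ordered by level
    """
    L = len(abstract_nets)
    if L < 1:
        raise ValueError("the method requires L >= 1")
    g = perceive(video)                 # 3.1: primary features G_0
    l = 1                               # 3.2.1
    while True:
        g = abstract_nets[l - 1](g)     # 3.2.2: G_{l-1} -> G_l
        if l >= L:                      # 3.2.3: top level reached
            return g
        l += 1
```

With toy callables, `extract_top_level("v", lambda v: [v], [lambda g: g + [1], lambda g: g + [2]])` applies the level-1 network before the level-2 network.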

Step 4, transmitting the top-level structured semantic features $G_L$ corresponding to the original video V through the ultra-narrow band channel.

4.1) at the sending end, binary-coding the top-level structured semantic features $G_L$ to obtain the binary code $S_b$;

commonly used binary coding methods include arithmetic coding, Huffman coding and the like; this example preferably, but without limitation, uses arithmetic coding;

4.2) at the sending end, modulating the binary code $S_b$ to obtain a signal S, and transmitting the signal S through an ultra-narrow band channel with bandwidth $Q = 3$ Kbps;

4.3) at the receiving end, demodulating the signal S to obtain the binary code $S_b$, and binary-decoding $S_b$ to obtain the top-level structured semantic features $G_L$.
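To give a feel for how small a structured payload can be relative to the 3 Kbps channel, the sketch below packs 14 joint coordinates into bytes and estimates the transmission time. It is not the arithmetic coding preferred in this example: fixed-width 8-bit quantization is swapped in as a simpler stand-in, joint coordinates are used in place of the actual top-level features $G_L$ (whose layout the text does not specify), and the names `encode_joints`, `decode_joints` and `airtime_seconds` are hypothetical. The figure Q is treated here as a bit rate.

```python
import struct


def _quantize(value, extent):
    """Map a coordinate in [0, extent] to an integer in [0, 255]."""
    return min(255, max(0, round(value / extent * 255)))


def encode_joints(coords, width, height):
    """Quantize each (x, y) to 8 bits per axis and pack into bytes."""
    payload = bytearray()
    for x, y in coords:
        payload += struct.pack("BB", _quantize(x, width), _quantize(y, height))
    return bytes(payload)


def decode_joints(payload, width, height):
    """Inverse of encode_joints, up to quantization error."""
    return [(qx / 255 * width, qy / 255 * height)
            for qx, qy in struct.iter_unpack("BB", payload)]


def airtime_seconds(payload, bit_rate_bps=3000):
    """Time needed to send the payload at the given bit rate."""
    return len(payload) * 8 / bit_rate_bps
```

Under these assumptions, 14 joints occupy 28 bytes (224 bits), which takes roughly 75 ms to send at 3000 bit/s; an entropy coder such as arithmetic coding would typically shrink the payload further.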

Step 5, restoring the received top-level structured semantic features $G_L$ based on the hierarchical knowledge base K, to obtain a reconstructed video V'.

The restoration of the top-level structured semantic features $G_L$ is the inverse of the extraction of the top-level structured semantic features $G_L$ from the original video V, and is implemented as follows:

5.1) restoring the top-level structured semantic features $G_L$ based, in turn, on the L semantic abstract knowledge bases $K_l$ of progressively decreasing level, to obtain the primary structured semantic features $G_0$:

5.1.1) let $l = L$;

5.1.2) based on the semantic reconstruction network in the semantic abstract knowledge base $K_l$, restoring the lower-level structured semantic features $G_{l-1}$ from the high-level structured semantic features $G_l$;

5.1.3) judging whether $l \le 1$; if so, the primary structured semantic features $G_0$ are obtained and 5.2) is executed; otherwise, let $l = l - 1$ and return to 5.1.2);

5.2) performing video reconstruction on the primary structured semantic features $G_0$ based on the signal reconstruction network in the semantic perception knowledge base $K_0$, to obtain the reconstructed video V'.
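Steps 5.1) and 5.2) mirror the extraction loop, walking the levels from L down to 1 before the final video reconstruction. In this sketch the function name `restore_video` and its parameters are hypothetical; `reconstruct_nets[l - 1]` stands for the semantic reconstruction network stored in the level-l knowledge base and `signal_reconstruct` for the signal reconstruction network, both given as plain callables.

```python
def restore_video(g_top, reconstruct_nets, signal_reconstruct):
    """Return V' from G_L by descending the levels, then reconstructing the video.

    reconstruct_nets:    list of L callables, G_l -> G_{l-1}, ordered by level
    signal_reconstruct:  callable, G_0 -> reconstructed video V'   (step 5.2)
    """
    L = len(reconstruct_nets)
    if L < 1:
        raise ValueError("the method requires L >= 1")
    g = g_top
    l = L                                  # 5.1.1
    while True:
        g = reconstruct_nets[l - 1](g)     # 5.1.2: G_l -> G_{l-1}
        if l <= 1:                         # 5.1.3: G_0 reached
            break
        l -= 1
    return signal_reconstruct(g)           # 5.2
```

Note the order: the level-L network runs first and the level-1 network last, the reverse of the order used during extraction.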

Referring to fig. 2, the video semantic communication system based on hierarchical knowledge expression in the present example includes: the system comprises a video acquisition device 1, a semantic encoder 2, an information source level knowledge base 6, an ultra-narrow band communication device 3, a semantic decoder 4, an information sink level knowledge base 7 and a video display device 5, wherein:

the video acquisition device 1 is connected with the semantic encoder 2 through a video data receiving port and is used for acquiring an original video;

in this example, it is preferable, but not limited, to use a camera with a resolution of 2K and a frame rate of 30 frames per second as the video capture device;

the semantic encoder 2 is connected with the video acquisition device 1 through a video data receiving port, is connected with the information source level knowledge base 6 through a knowledge base query port, and is connected with the ultra-narrow band communication device 3 through a semantic transmitting port, and is used for encoding the original video level by level to obtain the structural semantic features corresponding to the original video;

and the information source level knowledge base 6 is connected with the semantic encoder 2 through a knowledge base query port and is used for storing the semantic extraction network. In this example, it is preferable, but not limited, to use a PC workstation as the sender in the system and as the hardware platform for the semantic encoder 2 and the information source level knowledge base 6;

the ultra-narrow band communication device 3 is connected with the semantic encoder 2 through a semantic sending port and with the semantic decoder 4 through a semantic receiving port, and is used for transmitting the structured semantic features of the video on an ultra-narrow band channel. In this example, to show intuitively that the bandwidth used is only $Q = 3$ Kbps, a sound wave audible to the human ear with a frequency of 3 kHz is preferably, but without limitation, used as the carrier; the transmitting end emits the sound wave signal with a loudspeaker, and the receiving end receives it with a microphone;

the semantic decoder 4 is connected with the video display device 5 through a video data sending port, is connected with the information sink level knowledge base 7 through a knowledge base query port, and is connected with the ultra-narrow band communication device 3 through a semantic receiving port, and is used for reconstructing the structural semantic features level by level to obtain a reconstructed video;

and the information sink level knowledge base 7 is connected with the semantic decoder 4 through a knowledge base query port and is used for storing the video reconstruction network. In this example, it is preferable, but not limited, to use a PC workstation as the receiving end in the system and as the hardware platform for the semantic decoder 4 and the information sink level knowledge base 7;

and the video display device 5 is connected with the semantic decoder 4 through a video data sending port and is used for displaying the reconstructed video. In this example, it is preferable, but not limited, to use a display with a resolution of 2K and a frame rate of 30 frames per second as the video display device.
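The acoustic link used by the ultra-narrow band communication device 3 can be illustrated with a toy modulator and demodulator on a 3 kHz carrier. The patent does not state which modulation scheme is used, so binary phase-shift keying is an assumption here, as are the 48 kHz sample rate, the choice of one carrier cycle per bit and the function names. The sketch is noiseless and assumes perfect bit synchronization; a real loudspeaker-to-microphone link would also need filtering and timing recovery.

```python
import math

CARRIER_HZ = 3000                              # audible carrier from the example
SAMPLE_RATE = 48000                            # assumed audio sample rate
SAMPLES_PER_BIT = SAMPLE_RATE // CARRIER_HZ    # 16 samples: one carrier cycle per bit


def _carrier(n):
    """Carrier sample at absolute sample index n."""
    return math.sin(2 * math.pi * CARRIER_HZ * n / SAMPLE_RATE)


def modulate(bits):
    """BPSK: send the carrier as-is for 1, phase-inverted for 0."""
    samples = []
    for k, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for n in range(k * SAMPLES_PER_BIT, (k + 1) * SAMPLES_PER_BIT):
            samples.append(sign * _carrier(n))
    return samples


def demodulate(samples):
    """Correlate each bit period with the reference carrier and take the sign."""
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        corr = sum(samples[n] * _carrier(n)
                   for n in range(start, start + SAMPLES_PER_BIT))
        bits.append(1 if corr > 0 else 0)
    return bits
```

With one carrier cycle per bit, the toy link carries 3000 bit/s, matching the bit rate quoted for this example.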

The working principle of the system of the embodiment is as follows:

the video acquisition device 1 acquires an original video signal and sends the original video signal to the semantic encoder 2; the semantic encoder 2 queries the information source level knowledge base 6 to obtain a semantic extraction network; the semantic encoder 2 encodes the original video layer by layer according to the queried semantic extraction network to obtain the structural semantic features corresponding to the original video and sends the structural semantic features to the ultra-narrow band communication device 3; the ultra-narrow band communication device 3 transmits the structural semantic features of the video on an ultra-narrow band channel and sends the structural semantic features to the semantic decoder 4; the semantic decoder 4 queries the information sink level knowledge base 7 to obtain a video reconstruction network; the semantic decoder 4 reconstructs the structural semantic features layer by layer according to the inquired video reconstruction network to obtain a reconstructed video and sends the reconstructed video to the video display device 5; the video display device 5 displays the reconstructed video.

The foregoing description is only an example of the present invention and is not intended to limit the invention, so that it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. The video semantic communication method based on hierarchical knowledge expression is characterized by comprising the following steps of:

1) constructing a hierarchical knowledge base K:

1a) establishing a semantic perception knowledge base $K_0$ for storing the signal perception network, which extracts the primary structured semantic features $G_0$ from the video, and the signal reconstruction network, which reconstructs the video from the primary structured semantic features $G_0$;

the primary structured semantic features $G_0$ adopt a graph structure composed of a primary node set $A_0$ and a primary edge set $B_0$, wherein:

in the primary node set $A_0$, $i_0$ is the index of a node and runs from 1 to the total number of nodes in the primary node set; the $i_0$-th node is defined by its semantic class, its semantic feature vector and its child node set;

in the primary edge set $B_0$, $j_0$ is the index of an edge and runs from 1 to the total number of edges in the primary edge set; the $j_0$-th edge is defined by its semantic class and its semantic feature vector;

wherein $L \ge 1$, $l$ is the index of the semantic level, and $1 \le l \le L$; the $l$-th level structured semantic features $G_l$ adopt a graph structure composed of an advanced node set $A_l$ and an advanced edge set $B_l$, wherein:

in the advanced node set $A_l$, $i_l$ is the index of a node and runs from 1 to the total number of nodes in the advanced node set; the $i_l$-th node is defined by its semantic class, its semantic feature vector and its child node set;

in the advanced edge set $B_l$, $j_l$ is the index of an edge and runs from 1 to the total number of edges in the advanced edge set; the $j_l$-th edge is defined by its semantic class and its semantic feature vector;

3) extracting semantic features from the original video V based on the signal perception network and the semantic abstract networks in the hierarchical knowledge base K, to obtain the top-level structured semantic features $G_L$ corresponding to the original video V, implemented as follows:

3a) extracting the primary semantic features from the original video V based on the signal perception network in the semantic perception knowledge base $K_0$, to obtain the primary structured semantic features $G_0$;

3b) extracting the high-level semantic features of the primary structured semantic features $G_0$ based, in turn, on the L semantic abstract knowledge bases $K_l$ of progressively increasing level, to obtain the top-level structured semantic features $G_L$:

3b1) let $l = 1$;

3b2) based on the semantic abstract network in the semantic abstract knowledge base $K_l$

3b3) judging whether $l \ge L$; if so, the top-level structured semantic features $G_L$ are obtained; otherwise, let $l = l + 1$ and return to 3b2);

4) setting up an ultra-narrow band channel with bandwidth $Q \le 4$ Kbps, and transmitting the top-level structured semantic features $G_L$ corresponding to the original video V over it;

5) signal reconstruction network based on hierarchical knowledge base K

And semantic restructuring networks

2. The method of claim 1, wherein the signal perception network and the signal reconstruction network in 1a) are, respectively, a trained semantic graph generation network and a trained semantic-graph-based video reconstruction network.

3. The method of claim 1, wherein the semantic abstract network and the semantic reconstruction network in 1b) are, respectively, a trained graph-convolution-based downsampling network and a trained graph-convolution-based upsampling network.

4. The method according to claim 1, wherein the transmission in 4) of the top-level structured semantic features $G_L$ corresponding to the original video V through the ultra-narrow band channel is implemented as follows:

4a) at the sending end, binary-coding the top-level structured semantic features $G_L$ to obtain the binary code $S_b$;

4b) at the sending end, modulating the binary code $S_b$ to obtain a signal S, and transmitting the signal S through the ultra-narrow band channel;

4c) at the receiving end, demodulating the signal S to obtain the binary code $S_b$;

4d) at the receiving end, binary-decoding the binary code $S_b$ to obtain the top-level structured semantic features $G_L$.

5. The method of claim 1, wherein in 5) the reconstructed video V' is restored, based on the signal reconstruction network and the semantic reconstruction networks in the hierarchical knowledge base K, as follows:

5a) restoring the top-level structured semantic features $G_L$ based, in turn, on the L semantic abstract knowledge bases $K_l$ of progressively decreasing level, to obtain the primary structured semantic features $G_0$:

5a1) let $l = L$;

5a2) based on the semantic reconstruction network in the semantic abstract knowledge base $K_l$

5a3) judging whether $l \le 1$; if so, the primary structured semantic features $G_0$ are obtained and 5b) is executed; otherwise, let $l = l - 1$ and return to 5a2);

5b) based on the signal reconstruction network in the semantic perception knowledge base $K_0$