WO2024032119A1 - 一种多模态信源联合编码方法 (A multi-modal source joint coding method) - Google Patents
A multi-modal source joint coding method (一种多模态信源联合编码方法)
- Publication number
- WO2024032119A1 (PCT/CN2023/098536)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- modal
- knowledge base
- information sources
- sources
- feature maps
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/40—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
Definitions
- the invention relates to the technical field of information source coding, and in particular to a multi-modal information source joint coding method.
- Source coding, as a basic technology, is widely used in various fields.
- Source coding is a product of the combination of multimedia technology and Internet technology in the information age. It aims to use the fewest bits to represent the information source, either allowing a certain amount of distortion or allowing none.
- High-efficiency source coding technology can greatly improve the quality of decoded sources and reduce storage space under limited bandwidth.
- image compression such as PNG, BMP, JPEG, BPG, WEBP and other compression standards
- video compression such as H.264/AVC, H.265/HEVC, H.266/VVC, VP9, AV1, AVS1, AVS2, AVS3, etc.
- audio coding such as AAC, etc.
- text compression is only for text input
- image compression is only for images
- video compression is only for images or videos
- audio coding is only for audio input and cannot process other forms.
- Even when other forms can be processed, pre-processing is required and the result is inefficient.
- video compression coding standards cannot directly compress text. Although text can be organized into video form through preprocessing, its content is very different from normal video and has no actual physical meaning. The techniques in video coding and decoding standards are not designed for such abnormal signals, so even forced encoding will be inefficient.
- the purpose of the present invention is to provide a multi-modal information source joint coding method that exploits the correlation between different information sources in the coding and compression process, thereby reducing the repeated transmission of relevant information and lowering the transmission bandwidth and storage space; the decoder can recover different modal sources as needed, which means the method has modal scalability.
- a multi-modal source joint coding method includes the following steps:
- the common feature map represents the common part shared between different modal sources;
- the personality feature map represents the unique characteristics of each modal source;
- a knowledge base is introduced to jointly encode the multi-modal information sources; the knowledge base is multi-modal or single-modal, where a multi-modal knowledge base stores information in multiple different forms from different modal sources; single or multiple modal sources obtain the index for retrieving the knowledge base through "modal parsing", which is used to obtain knowledge base node entities for query and reasoning.
- One form of expression of the multimodal knowledge base includes text and images, which are represented by nodes and edges.
- Each node represents an entity or represents text or an image, and each edge represents a relationship between different nodes.
- the present invention proposes a multi-modal information source joint coding method, which characterizes each modal information source as common features and personality features.
- the common features of different modal information sources are the same, which enables joint coding of multiple modal sources.
- the present invention utilizes the correlation between different information sources in the coding and compression process to reduce repeated transmission of relevant information and thus reduce transmission bandwidth and storage space.
- the decoder can recover different modal information sources as needed, which has the advantage of modal scalability.
- the present invention introduces a knowledge base (in which there is known information that is strongly related to the source to be encoded), adds a priori knowledge, and explicitly associates the information sources of different modalities.
- the prior knowledge in the knowledge base is used to guide the multi-modal coding process. Therefore, compared with multi-modal joint coding without a knowledge base, it can further save storage space and reduce bandwidth.
- Figure 1 is a flow chart of a multi-modal source joint encoding method in Embodiment 1 of the present invention.
- Figure 2 is a flowchart of a knowledge-base-assisted multi-modal source joint coding method in Embodiment 2 of the present invention.
- Figure 3 shows an image-and-text multi-modal knowledge base in Embodiment 2 of the present invention.
- Figure 4 is a flow chart of a knowledge base-assisted multi-modal source joint coding method in Embodiment 3 of the present invention.
- Embodiment 1 gives an example with two information sources as input.
- a multi-modal information source joint encoding method includes the following steps:
- the first encoder A and the first encoder B are not specially restricted: each can be a convolutional neural network (CNN) or a temporal recurrent neural network (RNN); feature map feat1 and feature map feat2 can be one-dimensional vectors, two-dimensional matrices or even higher-dimensional tensors;
- the two sets of feature maps are connected and input into the second encoder C, and decoupled into a common feature map and a unique feature map;
- the common feature map represents the common part shared between different modal sources, usually at the semantic level;
- the personality feature map represents the unique characteristics of each modal source; taking the two modal sources of video and audio as an example, the common feature may be the words spoken by the characters in the video, since the audio usually also contains this information;
- the personality characteristics of the video can be the appearance of the characters in the video, or background information other than the characters, such as flowers and plants.
- the personality characteristics of the audio may include other unrelated audio, or the tone of voice, which is often difficult to express in video;
- the second encoder C may include a quantization process to achieve lossy coding; there are no special requirements for its structure, which can be a CNN, an RNN, or include a hyper prior model; in addition, it should be noted that the internal characteristics of feati1, featc and feati2 are not necessarily the same; for example, feati1 may internally contain side information featis1 and features featii1, where the side information featis1 is used to assist the generation of featii1, and the same applies to featc and feati2;
- Quality1(·,·) and Quality2(·,·) are used to measure the quality loss of modality 1 and modality 2 caused by encoding, respectively.
- for video or images, this can be measured with PSNR (peak signal-to-noise ratio), MS-SSIM (multi-scale structural similarity) or a perceptual loss; the rate terms measure the number of bits consumed to convert featc, feati1 and feati2 into binary streams, which can usually be obtained by estimation.
- when the encoder uses a variational autoencoder (VAE) structure, the code rates can be estimated by Shannon entropy
- λ1, λ and λ3 in the formula are hyperparameters
- λ1 controls the trade-off between the reconstruction quality of modality 1 and modality 2: when smaller source distortion of modality 1 is preferred, λ1 can be set smaller, and vice versa
- λ3 allocates the code rate between modality 1 and modality 2: with the total bandwidth or storage budget of the two modalities fixed, a larger λ3 favors a larger rate for modality 1 and a smaller rate for modality 2, and vice versa
- λ controls the compromise between quality and code rate: the higher the quality, the larger the code rate consumed, and the lower the quality, the smaller the code rate consumed; that is, λ selects the final rate point. The larger λ is, the lower the selected rate point, which suits lower-bandwidth scenarios, with correspondingly lower reconstruction quality, and vice versa.
- Embodiment 2 introduces a knowledge base based on Embodiment 1, so that multi-modal information sources can be jointly encoded more efficiently.
- the knowledge base in Figure 2 can be either multi-modal or single-modal.
- Multi-modal knowledge base means that the knowledge base stores different forms of information (usually from different modal sources);
- Figure 3 gives an example of a multi-modal knowledge base using text and images.
- the multi-modal knowledge base includes text and images, represented by nodes and edges; each node represents an entity, a piece of text or an image, and each edge represents a relationship between different nodes. For example, Claude Shannon was a guest of the World Computer Chess Championship, where "Claude Shannon" and "World Computer Chess Championship" are nodes and the edge "guestOf" represents the relationship between the two. The image of Claude Shannon is shown in the lower right corner of Figure 3.
- Embodiment 2 introduces a knowledge base on the basis of Embodiment 1.
- the modality-1 information source can obtain the index for retrieving the knowledge base through "modality-1 parsing", and the modality-2 information source can likewise obtain the index through "modality-2 parsing"; either one alone also suffices. Using both kinds of parsing retrieves more relevant information from the knowledge base or enhances robustness, and gives a larger improvement in the coding efficiency of the multi-modal sources. "Modality-1 parsing" and "modality-2 parsing" are mainly used to obtain knowledge base node entities for query and reasoning.
- the relevant information can be embedded and encoded by the third encoder D to obtain the knowledge base features, and jointly encoded with the source features through the second encoder C to remove the redundancy between the source coding and the knowledge base, thereby improving coding efficiency.
- in the decoding process, decoder A and decoder B also take the knowledge base features as input to decode the modality-1 and modality-2 sources.
- the purpose of the knowledge base introduced in Embodiment 2 is to add prior knowledge and to explicitly associate information sources of different modalities.
- the specific process of Embodiment 2 is a multi-modal source joint encoding method including the following steps:
- the modality-1 information source obtains the index for retrieving the knowledge base through "modality-1 parsing"
- the modality-2 information source obtains the index for retrieving the knowledge base through "modality-2 parsing"; "modality-1 parsing" and "modality-2 parsing" mainly serve to obtain knowledge base node entities for query and reasoning; after reasoning and querying in the knowledge base, the relevant information is embedded and encoded by encoder D to obtain the knowledge base features;
- the two sets of feature maps are connected and input into the second encoder C, and decoupled into a common feature map and a unique feature map;
- the common feature map represents the common part shared between different modal sources, usually at the semantic level;
- the personality feature map represents the unique characteristics of each modal source; taking the two modal sources of video and audio as an example, the common feature may be the words spoken by the characters in the video, since the audio usually also contains this information;
- the personality characteristics of the video can be the appearance of the characters in the video, or background information other than the characters, such as flowers and plants.
- the personality characteristics of the audio may include other unrelated audio, or the tone of voice, which is usually difficult to express in video;
- the second encoder C may include a quantization process to implement lossy coding
- the knowledge base features and information source features are jointly encoded through the second encoder C to remove the redundancy between the source coding and the knowledge base, thereby improving coding efficiency;
- in the decoding process, decoder A and decoder B also take the knowledge base features as input to decode the modality-1 and modality-2 sources.
- Embodiment 3 provides an embodiment of introducing a knowledge base.
- the function of the knowledge base is that, by querying it with the "Claude Shannon" keyword in the "text" source, his image can be obtained, so there is no need to encode the image region corresponding to Claude Shannon in the "image" source; the image and text can therefore be encoded more efficiently.
- the inputs of this embodiment are two modal information sources: “text” and “image”, which respectively correspond to “Mode 1" and “Mode 2” in Figure 2 of Embodiment 2.
- "Named entity recognition: BERT" corresponds to "modality-1 parsing": BERT from the field of natural language processing can be borrowed to parse the named entities in the text and obtain entity names, such as "Claude Shannon" and "Deep Thought", which are fed into the knowledge base for query and reasoning; knowledge base features are generated after encoding.
- the features are usually embedded feature vectors.
- modality 2 is not parsed in Figure 4, that is, the "modality-2 parsing" of Figure 2 is not used.
- the "text" modality passes through a text encoder, such as a GRU, and is encoded into text features.
- the "image” modality passes through scene graph generation technology to detect objects in the image and establish relationships between objects.
- the scene graph is passed through a convolutional network to generate image feature maps, labeled as image features.
- after the text features and image features are concatenated, they are sent together with the knowledge base features as input to the second encoder C for encoding, generating text personality features, image personality features, and common features of the text and image.
- Figure 4 does not show the process of losslessly encoding the features into a binary code stream, nor the part that decodes the binary code stream to generate the corresponding features.
- the parsed "entity name” also needs to be encoded and transmitted to the decoding end.
- the personality characteristics of the encoding-side image mainly include the clothing, posture and position characteristics of "Feng-hsiung Hsu”.
- the personality characteristics in the text mainly include "Feng-hsiung Hsu" and "first prize"; the common characteristics include information such as "Claude Shannon" and "Deep Thought". Adding the knowledge base therefore makes coding more efficient.
- the training process of this embodiment is similar to that of Embodiment 1, and the design of the loss function is also similar.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A multi-modal source joint coding method: multiple modal sources first pass through corresponding first encoders, which extract features and remove the internal redundancy of each modal signal, yielding corresponding feature maps; the groups of feature maps are then concatenated and fed into a second encoder, which decouples them into a common feature map and individual feature maps. The common feature map represents the part shared by the different modal sources, while each individual feature map represents the features unique to one modal source. Finally, the individual feature maps and the common feature map of the multiple modal sources are decoded by corresponding decoders to reconstruct the respective modal sources; that is, the features are separately entropy-coded and converted into binary streams for storage or transmission, and at the decoding end the binary streams are entropy-decoded and then passed through the corresponding decoders to recover the respective modal sources. By exploiting the correlation between different sources, the invention reduces the repeated transmission of relevant information, lowering transmission bandwidth and storage space; the decoding end can recover the different modal sources, providing modal scalability.
Description
The invention relates to the technical field of source coding, and in particular to a multi-modal source joint coding method.
Source coding, as a fundamental technology, is widely used in many fields. It is a product of the combination of multimedia technology and Internet technology in the information age, and aims to represent a source with the fewest bits, either allowing a certain amount of distortion or allowing none. Efficient source coding technology can greatly improve the quality of the decoded source under limited bandwidth and reduce storage space. Depending on the input, there are currently text compression, image compression (standards such as PNG, BMP, JPEG, BPG and WEBP), video compression (such as H.264/AVC, H.265/HEVC, H.266/VVC, VP9, AV1, AVS1, AVS2 and AVS3) and audio coding (such as AAC). These standards share one characteristic: each targets only a single kind of input. Text compression handles only text, image compression only images, video compression only images or video, and audio coding only audio; other forms of input cannot be processed, or can be processed only after pre-processing and then inefficiently. For example, video coding standards cannot directly compress text. Text can be organized into video form through pre-processing, but its content then differs greatly from normal video and has no actual physical meaning; the techniques in video coding standards are not designed for such abnormal signals, so even forced encoding is inefficient.
In practice, data of several modalities are often combined into one expression. For example, TV series and films most commonly contain three modalities: video, audio and subtitles. Under the standards above, current schemes encode the three modalities almost entirely separately. In reality, however, the three modal signals are correlated, i.e., a certain degree of redundancy exists between them, and existing independent coding methods cannot remove it, wasting bandwidth or storage space. A method that can jointly encode signals of multiple modalities is therefore needed, to remove the correlation between different modal signals and reduce redundancy, thereby reducing bandwidth and saving storage space.
Summary of the invention
In order to overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a multi-modal source joint coding method that exploits the correlation between different sources during coding and compression, reducing the repeated transmission of relevant information and thus lowering the transmission bandwidth and storage space; the decoding end can recover different modal sources as needed, i.e., the method has modal scalability.
To achieve the above purpose, the present invention adopts the following technical solution:
A multi-modal source joint coding method comprises the following steps:
1) Pass multiple modal sources through corresponding first encoders to extract features and remove the internal redundancy of each modal signal, obtaining the corresponding feature maps;
2) To remove the correlation between different modal signals, concatenate the groups of feature maps and feed them into a second encoder, which decouples them into a common feature map and individual feature maps; the common feature map represents the part shared between different modal sources, and the individual feature maps represent the features unique to each modal source;
3) Decode the individual feature maps and the common feature map of the multiple modal sources with the corresponding decoders and reconstruct the respective modal sources; that is, the features are separately entropy-coded and converted into binary streams for storage or transmission, and at the decoding end the binary streams are entropy-decoded and then passed through the corresponding decoders to recover the respective modal sources.
A knowledge base is introduced to jointly encode the multi-modal sources. The knowledge base is multi-modal or single-modal; a multi-modal knowledge base is one that stores information in multiple different forms from different modal sources. One or more modal sources obtain the index for retrieving the knowledge base through "modality parsing", which serves to obtain knowledge base node entities for query and reasoning.
In one representation, the multi-modal knowledge base contains text and images, represented by nodes and edges; each node represents an entity, a piece of text or an image, and each edge represents a relationship between different nodes.
The beneficial effects of the present invention are as follows: the invention proposes a multi-modal source joint coding method that characterizes each modal source in terms of common features and individual features; the common features of different modal sources are the same, which enables joint coding of multiple modal sources. Compared with encoding each modal source independently, the invention exploits the correlation between different sources during coding and compression, reducing the repeated transmission of relevant information and thus lowering the transmission bandwidth and storage space. At the same time, the decoding end can recover different modal sources as needed, i.e., the method has the advantage of modal scalability.
On the basis of the above multi-modal joint coding method, the invention introduces a knowledge base (containing known information strongly correlated with the sources to be encoded), which adds prior knowledge and explicitly associates the sources of different modalities; during encoding, the prior knowledge in the knowledge base is used to guide the multi-modal coding process. Compared with multi-modal joint coding without a knowledge base, this further saves storage space and reduces bandwidth.
Figure 1 is a flowchart of a multi-modal source joint coding method in Embodiment 1 of the present invention.
Figure 2 is a flowchart of a knowledge-base-assisted multi-modal source joint coding method in Embodiment 2 of the present invention.
Figure 3 shows an image-and-text multi-modal knowledge base in Embodiment 2 of the present invention.
Figure 4 is a flowchart of a knowledge-base-assisted multi-modal source joint coding method in Embodiment 3 of the present invention.
The present invention is described in detail below with reference to the drawings and embodiments.
Embodiment 1. Embodiment 1 gives an example with two sources as input: a multi-modal source joint coding method comprising the following steps:
1) Given two modal sources "modality 1" and "modality 2", denoted src1 and src2, the two modal signals pass through first encoder A and first encoder B respectively to extract features and remove the internal redundancy of each modal signal, yielding feature map feat1 and feature map feat2. First encoder A and first encoder B are not specially restricted: each can be a convolutional neural network (CNN) or a temporal recurrent neural network (RNN). Feature maps feat1 and feat2 can be one-dimensional vectors, two-dimensional matrices, or even higher-dimensional tensors;
2) To remove the correlation between different modal signals, the two groups of feature maps are concatenated and fed into second encoder C, which decouples them into a common feature map and individual feature maps. The common feature map represents the part shared between the different modal sources, usually at the semantic level; the individual feature maps represent the features unique to each modal source. Taking video and audio as an example, a common feature may be the words spoken by the people in the video, since the audio usually also contains this information; the individual features of the video may be the appearance of the people, or background information other than the people such as flowers and plants, while the individual features of the audio may include other unrelated audio, or the tone of voice that is usually hard to express in video;
This embodiment decouples the common and individual features, outputting the individual features feati1 of modality 1, the common features featc of the two modalities, and the individual features feati2 of modality 2. Second encoder C may include a quantization process to achieve lossy coding; its structure is not specially restricted and can be a CNN, an RNN, or include a hyper prior model. It should also be noted that the internal composition of feati1, featc and feati2 is not necessarily the same; for example, feati1 may internally contain side information featis1 and features featii1, where the side information featis1 assists the generation of featii1, and the same applies to featc and feati2;
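As one concrete realization of the quantization that second encoder C may contain, learned-compression systems commonly replace hard rounding with additive uniform noise during training so that the operation stays differentiable; this is a standard trick assumed here for illustration, not something the patent prescribes:

```python
import torch

def quantize(y: torch.Tensor, training: bool) -> torch.Tensor:
    # Training: additive uniform noise in [-0.5, 0.5) approximates rounding
    # while keeping gradients; inference: hard rounding gives true lossy coding.
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)
```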
3) The three kinds of features featc, feati1 and feati2 are separately entropy-coded and converted into binary streams for storage or transmission; at the decoding end, the binary streams are entropy-decoded to recover feati1, featc and feati2; then feati1 and featc are fed together into decoder A to recover modality 1, and feati2 and featc are fed together into decoder B to recover modality 2, yielding the reconstructions of src1 and src2.
The above is the flow at test time. During training, only paired multi-modal data are required; the encoders and decoders of the multiple modalities are trained together end to end with a rate-distortion loss function, in which:
Quality1(·,·) and Quality2(·,·) measure the quality loss of modality 1 and modality 2 caused by encoding, respectively. For video or images, they can be measured with PSNR (peak signal-to-noise ratio), MS-SSIM (multi-scale structural similarity) or a perceptual loss; the rate terms measure the number of bits consumed to convert featc, feati1 and feati2 into binary streams, which can usually be obtained by estimation. For example, the three kinds of features featc, feati1 and feati2 can be assumed to follow Gaussian distributions, with part of the features in featis1 representing the mean of the Gaussian distribution and another part representing the variance, i.e., the encoder adopts a variational autoencoder (VAE) structure; the code rates can then be estimated by Shannon entropy. In the formula, λ1, λ and λ3 are hyperparameters: λ1 controls the trade-off between the reconstruction quality of modality 1 and modality 2, i.e., when smaller source distortion of modality 1 is preferred, λ1 can be set smaller, and vice versa; λ3 allocates code rate between modality 1 and modality 2, i.e., with the total bandwidth or storage budget of the two modalities fixed, a larger λ3 favors a larger rate for modality 1 and a smaller rate for modality 2, and vice versa; λ controls the compromise between quality and code rate: the higher the quality, the larger the rate consumed, and the lower the quality, the smaller the rate, i.e., λ selects the final rate point, where a larger λ selects a lower rate point, suiting lower-bandwidth scenarios with correspondingly lower reconstruction quality, and vice versa.
Embodiment 2. Referring to Figure 2, Embodiment 2 introduces a knowledge base on the basis of Embodiment 1, allowing multi-modal sources to be jointly encoded more efficiently.
The knowledge base in Figure 2 can be either multi-modal or single-modal; a multi-modal knowledge base stores information in different forms (usually from different modal sources). Figure 3 gives an example of a multi-modal knowledge base using text and images: the knowledge base contains text and images represented by nodes and edges, where each node represents an entity, a piece of text or an image, and each edge represents a relationship between different nodes. For example, Claude Shannon was a guest of the World Computer Chess Championship: "Claude Shannon" and "World Computer Chess Championship" are nodes, and the edge "guestOf" represents the relationship between the two. The lower right corner of Figure 3 shows an image of Claude Shannon; "Claude Shannon" and his image are connected by a directed edge "imageOf". "Deep Thought" took part in the "World Computer Chess Championship" competition; the two are represented by two nodes, with "attend" expressing the relationship between them.
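For intuition, the Figure 3 fragment can be written down as a small directed graph. The sketch below uses networkx as an arbitrary implementation choice, and the image filename is hypothetical:

```python
import networkx as nx

kb = nx.DiGraph()
kb.add_node("Claude Shannon", kind="entity")
kb.add_node("World Computer Chess Championship", kind="entity")
kb.add_node("Deep Thought", kind="entity")
kb.add_node("shannon.jpg", kind="image")  # hypothetical image node

kb.add_edge("Claude Shannon", "World Computer Chess Championship", relation="guestOf")
kb.add_edge("shannon.jpg", "Claude Shannon", relation="imageOf")
kb.add_edge("Deep Thought", "World Computer Chess Championship", relation="attend")

# Query: find the image attached to an entity parsed from the text source.
images = [u for u, v, d in kb.in_edges("Claude Shannon", data=True)
          if d["relation"] == "imageOf"]
print(images)  # ['shannon.jpg']
```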
Embodiment 2 introduces a knowledge base on the basis of Embodiment 1. On top of Embodiment 1, the modality-1 source can obtain the index for retrieving the knowledge base through "modality-1 parsing", and the modality-2 source can likewise obtain the index through "modality-2 parsing"; either one alone also works. Using both kinds of parsing allows more relevant information to be retrieved from the knowledge base, or enhances robustness, and gives a larger improvement in the coding efficiency of the multi-modal sources. "Modality-1 parsing" and "modality-2 parsing" mainly serve to obtain knowledge base node entities for query and reasoning. After reasoning and querying in the knowledge base, the relevant information can be embedded and encoded by third encoder D to obtain the knowledge base features, which are jointly encoded with the source features through second encoder C, removing the redundancy between the source coding and the knowledge base and thereby improving coding efficiency. Correspondingly, during decoding, decoder A and decoder B also take the knowledge base features as input when decoding the modality-1 and modality-2 sources.
The purpose of the knowledge base introduced in Embodiment 2 is to add prior knowledge and to explicitly associate the sources of different modalities.
The specific flow of Embodiment 2 is a multi-modal source joint coding method comprising the following steps:
1) Given two modal sources "modality 1" and "modality 2", denoted src1 and src2, the two modal signals pass through first encoder A and first encoder B respectively to extract features and remove the internal redundancy of each modal signal, yielding feature maps feat1 and feat2;
The modality-1 source obtains the index for retrieving the knowledge base through "modality-1 parsing", and the modality-2 source obtains the index through "modality-2 parsing"; the two parsings mainly serve to obtain knowledge base node entities for query and reasoning. After reasoning and querying in the knowledge base, the relevant information is embedded and encoded by encoder D to obtain the knowledge base features;
2) To remove the correlation between different modal signals, the two groups of feature maps are concatenated and fed into second encoder C, which decouples them into a common feature map and individual feature maps. The common feature map represents the part shared between the different modal sources, usually at the semantic level; the individual feature maps represent the features unique to each modal source. Taking video and audio as an example, a common feature may be the words spoken by the people in the video, since the audio usually also contains this information; the individual features of the video may be the appearance of the people, or background information other than the people such as flowers and plants, and the individual features of the audio may include other unrelated audio, or the tone of voice that is usually difficult to express in video;
This embodiment decouples the common and individual features, outputting the individual features feati1 of modality 1, the common features featc of the two modalities, and the individual features feati2 of modality 2; second encoder C may include a quantization process to achieve lossy coding;
The knowledge base features and the source features are jointly encoded through second encoder C, removing the redundancy between the source coding and the knowledge base and thereby improving coding efficiency;
3) The three kinds of features featc, feati1 and feati2 are separately entropy-coded and converted into binary streams for storage or transmission; at the decoding end, the binary streams are entropy-decoded to recover feati1, featc and feati2; then feati1 and featc are fed together into decoder A to recover modality 1, and feati2 and featc are fed together into decoder B to recover modality 2.
During decoding, decoder A and decoder B also take the knowledge base features as input when decoding the modality-1 and modality-2 sources.
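A minimal sketch of how decoders A and B might take the knowledge base features as an extra input, assuming flat feature vectors and plain concatenation (the patent does not fix the decoder structure, so every layer and dimension here is illustrative):

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Decoder A or B in Embodiment 2: conditions the reconstruction on the
    individual, common and knowledge-base features (sketch)."""
    def __init__(self, feat_i: int = 128, feat_c: int = 128,
                 feat_kb: int = 64, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_i + feat_c + feat_kb, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, feat_i, feat_c, feat_kb):
        return self.net(torch.cat([feat_i, feat_c, feat_kb], dim=-1))
```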
Embodiment 3. Referring to Figure 4, Embodiment 3 gives an embodiment that introduces a knowledge base. The role of the knowledge base here is that, from the "Claude Shannon" keyword in the "text" source, his image can be obtained by querying the knowledge base, so the image region corresponding to Claude Shannon in the "image" source does not need to be encoded; the image and text can therefore be encoded more efficiently.
Referring to Figure 4, the inputs of this embodiment are two modal sources, "text" and "image", corresponding respectively to "modality 1" and "modality 2" in Figure 2 of Embodiment 2. "Named entity recognition: BERT" in the text source corresponds to "modality-1 parsing": BERT from the field of natural language processing can be borrowed to parse the named entities in the text, yielding entity names such as "Claude Shannon" and "Deep Thought", which are fed into the knowledge base for query and reasoning; after encoding, the knowledge base features are generated, usually as embedded feature vectors. Modality 2 is not parsed in Figure 4, i.e., the "modality-2 parsing" of Figure 2 is not used. For the main branch, the "text" modality passes through a text encoder, such as a GRU, and is encoded into text features; the "image" modality passes through scene graph generation, which detects the objects in the image and establishes the relationships between them, and the scene graph is then passed through a convolutional network to generate image feature maps, labeled image features. The text features and image features are then concatenated and, together with the knowledge base features, fed as input into second encoder C for encoding, generating text individual features, image individual features, and common features of the text and image. Figure 4 does not show the process of losslessly encoding the features into binary streams, nor the part that decodes the binary streams to generate the corresponding features. In addition, the parsed "entity names" also need to be encoded and transmitted to the decoding end.
At the decoding end, the text individual features, common features and knowledge base features are fed together as input into a text decoder, which outputs the text; the image individual features, common features and knowledge base features are fed together as input into an image decoder, which outputs the image. As can be seen from Figure 4, by introducing the knowledge base, the encoding end does not need to transmit the part of the image corresponding to Claude Shannon; it only transmits the parsed Claude Shannon entity, and the decoding end obtains the corresponding image of Claude Shannon from the knowledge base. In addition, "Edmonton" and "1989" need not be encoded and transmitted either, since they can be obtained by reasoning over the knowledge base. The individual features of the image at the encoding end mainly contain the clothing, posture and position of "Feng-hsiung Hsu"; the individual features of the text mainly contain "Feng-hsiung Hsu" and "first prize"; the common features contain information such as "Claude Shannon" and "Deep Thought". Adding the knowledge base therefore makes the coding more efficient. The training process of this embodiment is similar to that of Embodiment 1, and the loss function is designed similarly.
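The "named entity recognition: BERT" step of this embodiment can be sketched with an off-the-shelf checkpoint. The model name below is one arbitrary public choice rather than anything the patent specifies, and the exact entities returned depend on that checkpoint:

```python
from transformers import pipeline

# BERT-based named entity recognition standing in for "modality-1 parsing".
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = ("Claude Shannon was a guest of the World Computer Chess Championship, "
        "where Deep Thought won the first prize.")
entity_names = [ent["word"] for ent in ner(text)]
print(entity_names)  # e.g. ['Claude Shannon', 'World Computer Chess Championship', 'Deep Thought']
```

The parsed names would then serve as the indices for the knowledge base query and reasoning described above.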
Claims (3)
- A multi-modal source joint coding method, characterized by comprising the following steps: 1) passing multiple modal sources through corresponding first encoders to extract features and remove the internal redundancy of each modal signal, obtaining corresponding feature maps; 2) in order to remove the correlation between different modal signals, concatenating the groups of feature maps and feeding them into a second encoder, which decouples them into a common feature map and individual feature maps, where the common feature map represents the part shared between different modal sources and the individual feature maps represent the features unique to each modal source; 3) decoding the individual feature maps and the common feature map of the multiple modal sources with corresponding decoders and reconstructing the respective modal sources, i.e., the features are separately entropy-coded and converted into binary streams for storage or transmission, and at the decoding end the binary streams are entropy-decoded and then passed through the corresponding decoders to recover the respective modal sources.
- The method according to claim 1, characterized in that a knowledge base is introduced to jointly encode the multi-modal sources; the knowledge base is multi-modal or single-modal, where a multi-modal knowledge base stores information in multiple different forms from different modal sources; one or more modal sources obtain the index for retrieving the knowledge base through "modality parsing", which serves to obtain knowledge base node entities for query and reasoning.
- The method according to claim 2, characterized in that in one representation the multi-modal knowledge base contains text and images, represented by nodes and edges, where each node represents an entity, a piece of text or an image, and each edge represents a relationship between different nodes.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210969884.6 | 2022-08-12 | ||
CN202210969884.6A CN115604475A (zh) | 2022-08-12 | 2022-08-12 | A multi-modal source joint coding method
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024032119A1 (zh) | 2024-02-15 |
Family
ID=84843969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/098536 WO2024032119A1 (zh) | 2024-02-15 | A multi-modal source joint coding method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115604475A (zh) |
WO (1) | WO2024032119A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN115604475A (zh) | 2022-08-12 | 2023-01-13 | 西安电子科技大学(Cn) | A multi-modal source joint coding method |
- CN118611821A (zh) | 2023-03-04 | 2024-09-06 | 北京邮电大学 | Model-based information service provision method, system, device and medium |
- 2022-08-12: CN application CN202210969884.6A filed; published as CN115604475A (status: pending)
- 2023-06-06: PCT application PCT/CN2023/098536 filed; published as WO2024032119A1 (status: unknown)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190347523A1 (en) * | 2018-05-14 | 2019-11-14 | Quantum-Si Incorporated | Systems and methods for unifying statistical models for different data modalities |
- JP2019220974A (ja) | 2019-08-22 | 2019-12-26 | 三菱電機株式会社 | Decoding device |
- CN110807122A (zh) | 2019-10-18 | 2020-02-18 | 浙江大学 | Image-text cross-modal feature disentanglement method based on deep mutual information constraints |
- CN112800292A (zh) | 2021-01-15 | 2021-05-14 | 南京邮电大学 | Cross-modal retrieval method based on modality-specific and shared feature learning |
- CN113591902A (zh) | 2021-06-11 | 2021-11-02 | 中国科学院自动化研究所 | Cross-modal understanding and generation method and apparatus based on a multi-modal pre-trained model |
- CN114202024A (zh) | 2021-12-06 | 2022-03-18 | 深圳市安软科技股份有限公司 | Training method, system and related device for a multi-modal decoupled generative model |
- CN115604475A (zh) | 2022-08-12 | 2023-01-13 | 西安电子科技大学(Cn) | A multi-modal source joint coding method |
Also Published As
Publication number | Publication date |
---|---|
CN115604475A (zh) | 2023-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024032119A1 (zh) | A multi-modal source joint coding method | |
Santurkar et al. | Generative compression | |
Liu et al. | Deep image compression via end-to-end learning | |
CN113822147B (zh) | A deep compression method for collaborative machine semantic tasks | |
CN109903351B (zh) | An image compression method combining convolutional neural networks and traditional coding | |
CN115880762B (zh) | A scalable face image coding method and system for human-machine hybrid vision | |
CN116233445B (zh) | Video encoding and decoding processing method, apparatus, computer device and storage medium | |
US20230154053A1 (en) | System and method for scene graph lossless compression by context-based graph convolution | |
CN113132735A (zh) | A video coding method based on video frame generation | |
Zhang et al. | Learned scalable image compression with bidirectional context disentanglement network | |
Akbari et al. | Learned multi-resolution variable-rate image compression with octave-based residual blocks | |
CN116437089B (zh) | A deep video compression method based on key objects | |
Gao et al. | Cross modal compression with variable rate prompt | |
CN116208772A (zh) | Data processing method, apparatus, electronic device and computer-readable storage medium | |
CN116486300A (zh) | An end-to-end video text generation method based on feature changes | |
Kumar et al. | Vector quantization with codebook and index compression | |
CN115361556A (zh) | An adaptive high-efficiency video compression algorithm and system | |
CN115496134A (zh) | Traffic scene video description generation method and apparatus based on multi-modal feature fusion | |
CN113641846A (zh) | A cross-modal retrieval model based on strong-representation deep hashing | |
Gao et al. | Rate-distortion optimization for cross modal compression | |
KR102072576B1 (ko) | Apparatus and method for encoding and decoding data | |
Li et al. | Multiple description coding network based on semantic segmentation | |
Gulia et al. | Comprehensive Analysis of Flow Incorporated Neural Network based Lightweight Video Compression Architecture | |
Luo | Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression | |
Bachard et al. | CoCliCo: Extremely low bitrate image compression based on CLIP semantic and tiny color map |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23851347; Country of ref document: EP; Kind code of ref document: A1 |