WO2023217163A1 - Time-series image description method for dam defects based on a local self-attention mechanism - Google Patents

Time-series image description method for dam defects based on a local self-attention mechanism

Info

Publication number
WO2023217163A1
WO2023217163A1 (application PCT/CN2023/093153; WIPO reference CN2023093153W)
Authority
WO
WIPO (PCT)
Prior art keywords
attention
vector
sequence
self
image
Prior art date
Application number
PCT/CN2023/093153
Other languages
English (en)
French (fr)
Inventor
马洪琪
肖海斌
毛莺池
迟福东
戚荣志
庞博慧
周晓峰
陈豪
余记远
赵欢
Original Assignee
华能澜沧江水电股份有限公司
河海大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华能澜沧江水电股份有限公司, 河海大学
Priority to US18/337,409 (published as US20230368500A1)
Publication of WO2023217163A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Definitions

  • The invention belongs to the technical field of automatic generation of dam defect time-series image description text, and particularly relates to a dam defect time-series image description method based on local self-attention.
  • In fields such as construction engineering, inspection items or inspection points whose quality does not meet specified requirements are usually defined as defects. With the long-term operation of hydraulic structures such as hydropower stations and dams, factors such as material aging and environmental influences produce defects of varying severity.
  • The data collected by existing defect-image acquisition equipment such as drones and mobile cameras is video. To save costs, the video is compressed and encoded during acquisition and transmission, so the model cannot process the video data directly. It is therefore necessary to convert the video into an image sequence along the time dimension, quickly extract the image features with a model, and generate corresponding text describing the defect content, which helps users quickly produce inspection reports and standardize the inspection process.
  • Description text generation translates sequential images into natural language by modeling the feature relationships between images and text. Since images and text are data of two different modalities with heterogeneous underlying features, their correspondence is difficult to compute directly, which easily causes loss of feature information and harms the accuracy of the generated text. Moreover, unlike a single image, time-series images often contain a large number of frames, from which the model cannot directly extract text-related information.
  • The present invention provides a time-series image description method for dam defects based on a local self-attention mechanism, which dynamically establishes the contextual feature relationships of the image sequence while enabling each word in the text to correspond directly to the relevant image frame, effectively improving the accuracy of text generation.
  • The method includes the following steps: (1) the input time-series images are frame-sampled and a convolutional neural network is used to extract the feature sequence, which serves as the input of the self-attention encoder; (2) a Transformer network based on a deformable self-attention mechanism encodes the feature sequence, dynamically establishing the contextual relationship of each frame; (3) an LSTM network based on a local attention mechanism generates the description text, so that each predicted word can focus on the corresponding image frame, and the accuracy of text generation is improved by establishing the contextual dependence between image and text.
  • The input time-series images are frame-sampled and a convolutional neural network is used to extract the feature sequence.
  • the specific steps are as follows:
  • The query, key and value vectors of each sampled frame are obtained through linear fully connected layers: q = Linear(X) = W_Q·X, k = Linear(X) = W_K·X, v = Linear(X) = W_V·X, where W_Q, W_K and W_V are the feature matrices required to compute each vector and X is the feature representation of each frame of the input image sequence.
  • The q vector guides the current feature map to selectively attend to contextual features in the time dimension; the k vector is used to compute the attention weights between the current feature map and the other feature maps; the v vector is used to add the information of the current feature map into the self-attention weights.
  • The attention weights are obtained from the scaled dot product of the q and k vectors: Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v, where d_k is the input vector dimension, obtained by dividing the input sequence dimension by the number of self-attention heads. The dot product of the q and k vectors yields the similarity scores of the corresponding sequence elements, which are divided by √d_k for normalization to keep gradient propagation in the network stable.
  • A multi-head deformable encoding structure samples a set of key frames around the current frame: DeformAttn(q, p_q) = Σ_{m=1}^{M} W_m [Σ_{k∈Ω} A_mqk · W′_m · x_v(p_q + Δp_mqk)], where p_q is the position reference point of the current frame, x_v is the v vector corresponding to the feature map obtained above, and W_m and W′_m are feature matrices with learnable weights.
  • The description text is obtained by maximizing the conditional probability of each generated word: log p(S|s) = Σ_t log₁₀ p(S_t | S_{<t}, s), where the logarithm is base 10, s is the original text sequence, and S_t denotes the t-th word of the text.
  • The conditional probability is parameterized, and the probability of each word can be expressed as p(S_{n,i} | S_{n<j}, s) = softmax(g(h_j)), with h_j = f(h_{j−1}, s), where h_j is the hidden layer of the recurrent neural network. The function f computes the hidden state of the current position from the hidden-layer output of the previous position and the current vector; its output is converted by the function g into a vector with the same dimension as the vocabulary.
  • When generating each target word, the centre of the current attention window is predicted as p_t = S·sigmoid(v_pᵀ·tanh(W_p·h_t)), where the position matrix W_p and the penalty term v_p are both feature parameters with learnable weights and S is the length of the input sequence. The attention window corresponding to this position is [p_t−D, p_t+D], where D denotes the window width. The hidden-layer vectors of the input and output sequences are scored through the align function and constrained by a Gaussian distribution centred at p_t, giving the attention weights a_t(s) = align(h_t, h̄_s)·exp(−(s−p_t)²/(2σ²)), where σ = D/2 is used to normalize the calculation result.
  • The context features, i.e. the context vector c_t, the attention weights and the previously generated words are concatenated as input to the LSTM network; the output word at the current position is computed by a fully connected network with a softmax activation function, and finally the words of all positions are combined into the complete description text.
  • A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the dam defect time-series image description method based on the local self-attention mechanism as described above.
  • a computer-readable storage medium stores a computer program that executes the dam defect time-series image description method based on the local self-attention mechanism as described above.
  • the present invention has the following advantages:
  • Random frame sampling is used to preprocess the original time-series images, which effectively compresses the size of the encoding sequence.
  • Feature extraction based on the convolutional neural network enables the Transformer network to focus on the visual features of the image when the self-attention mechanism is applied.
  • The Transformer network based on the deformable self-attention mechanism can dynamically establish the contextual relationship of each frame, avoiding the slow gradient descent during training caused by computing global feature relationships, which would require long training times and many training rounds for the model to converge.
  • The LSTM network based on the local attention mechanism enables each predicted word to focus on the corresponding image frame, ensuring that the semantic information of the original time-series images is not missed in the generated text and improving the accuracy of the model.
  • Figure 1 is the overall framework diagram of the defect time-series image description in a specific embodiment;
  • Figure 2 is a schematic structural diagram of the Transformer network based on the deformable self-attention mechanism in a specific embodiment;
  • Figure 3 is a schematic diagram of the LSTM network structure based on the local attention mechanism in a specific embodiment.
  • In the inspection of a certain power station dam project, drones, mobile cameras and other video acquisition equipment capture time-series images of defects.
  • Each image sequence may contain four types of defects, namely cracks, alkaline precipitation, water seepage and concrete spalling; the time-series image features must be extracted by the model and the corresponding description text generated, thereby reducing the time spent on manual defect judgment and standardizing the inspection process.
  • Figure 1 shows the overall workflow of the dam defect time series image description method based on the local self-attention mechanism.
  • the specific implementation is as follows:
  • The q vector is a query vector, which guides the current feature map to selectively attend to contextual features in the time dimension.
  • The k vector is a keyword vector, used to compute the attention weights between the current feature map and the other feature maps.
  • The v vector is a value vector, used to add the information of the current feature map into the self-attention weights. The Transformer network consists of 8 attention heads and a 512-dimensional fully connected network, and the weights of each attention head are computed independently.
  • The attention weights are computed as Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v, where d_k is the input vector dimension, obtained by dividing the input sequence dimension by the number of self-attention heads; the dot product of the q and k vectors yields the similarity scores of the corresponding sequence elements, which are divided by √d_k for normalization to keep gradient propagation in the network stable.
  • W_m and W′_m are feature matrices with learnable weights; both are learned through the network and are identical in function and dimensions, but their weights differ. Δp_mqk and A_mqk respectively denote the sampling offset and self-attention weight of the k-th sampling point in the m-th self-attention head, normalized so that Σ_{k∈Ω} A_mqk = 1; they are obtained by training a fully connected network and finally linearly projected onto the query vector, and the sampled frame feature map containing contextual information is output through the 512-dimensional multi-layer perceptron network.
  • S_t denotes the t-th word of the text. The conditional probability is parameterized, and the probability of each word can be expressed as p(S_n | S_{n<j}, s) = softmax(g(h_j)), with h_j = f(h_{j−1}, s).
  • h_j is the hidden layer of the recurrent neural network; the function f computes the hidden state of the current position from the hidden-layer output of the previous position and the current vector, and its output is converted by the function g into a vector with the same dimension as the vocabulary.
  • The position matrix W_p and the penalty term v_p are both feature parameters with learnable weights, and S is the length of the input sequence. The hidden-layer vectors of the input and output sequences within the attention window [p_t−D, p_t+D] are then scored through the align function and constrained by a Gaussian distribution to obtain the attention weights a_t(s) = align(h_t, h̄_s)·exp(−(s−p_t)²/(2σ²)).
  • The above steps of the dam defect time-series image description method based on the local self-attention mechanism can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices.
  • They can be implemented with program code executable by the computing devices, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in a different order than here, or they may be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module.
  • embodiments of the present invention are not limited to any specific combination of hardware and software.
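To make the interaction of these components concrete, the following is a minimal PyTorch sketch of the three-stage pipeline described above (CNN frame features → self-attention encoding → LSTM decoding). It is an illustrative assumption of one possible arrangement, not the patented implementation: a standard nn.TransformerEncoder stands in for the deformable self-attention encoder, and a global mean context vector stands in for the per-word local attention; all names and dimensions are assumed.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DamDefectCaptioner(nn.Module):
    """Illustrative pipeline: CNN frame features -> self-attention encoder -> LSTM decoder."""
    def __init__(self, vocab_size, d_model=512, n_heads=8):
        super().__init__()
        backbone = models.resnet50(weights=None)                    # assumed backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop classifier head
        self.proj = nn.Linear(2048, d_model)                        # CNN features -> d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for deformable encoder
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, H, W) sampled frames; captions: (B, L) token ids
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)           # (B*T, 2048)
        ctx = self.encoder(self.proj(feats).view(B, T, -1))         # contextual frame features
        # naive global context; the method above uses per-word local attention instead
        c = ctx.mean(dim=1, keepdim=True).expand(-1, captions.size(1), -1)
        h, _ = self.decoder(torch.cat([self.embed(captions), c], dim=-1))
        return self.out(h)                                          # (B, L, vocab) word logits
```

The sketches after the individual steps in the description below replace the two stand-ins with the deformable encoding and local attention mechanisms the text actually describes.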

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a time-series image description method for dam defects based on a local self-attention mechanism. Frames are sampled from the input dam defect time-series images, a convolutional neural network is used to extract the feature sequence, and this sequence serves as the input of a self-attention encoder. The encoder consists of a Transformer network based on a deformable self-attention mechanism, which can dynamically establish the contextual feature relationships of each frame. An LSTM network based on a local attention mechanism generates the description text, so that every predicted word can establish a feature relationship with the image frames, building the contextual dependence between image and text to improve the accuracy of text generation. The present invention adds a dynamic mechanism on top of computing the global self-attention of the image frames, avoiding the slow model convergence caused by an excessive number of parameters. The LSTM network with local attention can directly establish the correspondence between the two modalities of image and text, making the generated description text more accurate and more comprehensive.

Description

Time-series image description method for dam defects based on a local self-attention mechanism
Technical Field
The invention belongs to the technical field of automatic generation of dam defect time-series image description text, and particularly relates to a dam defect time-series image description method based on local self-attention.
Background Art
In fields such as construction engineering, inspection items or inspection points whose quality does not meet specified requirements are usually defined as defects. With the long-term operation of hydraulic structures such as hydropower stations and dams, factors such as material aging and environmental influences produce defects of varying severity. The data collected by existing defect-image acquisition equipment such as drones and mobile cameras is video; to save costs, the video is compressed and encoded during acquisition and transmission, so the model cannot process the video data directly. It is therefore necessary to convert the video into an image sequence along the time dimension, quickly extract the image features with a model and generate corresponding text describing the defect content, which helps users quickly produce inspection reports and standardize the inspection process.
Description text generation translates time-series images into natural language by modeling the feature relationships between images and text. Since images and text are data of two different modalities with heterogeneous underlying features, their correspondence is difficult to compute directly, which easily causes loss of feature information and affects the accuracy of the generated text. Moreover, unlike a single image, time-series images often contain a large number of frames, from which the model cannot directly extract text-related information.
Summary of the Invention
Purpose of the invention: drones, mobile cameras and similar equipment are now widely used in the inspection of hydraulic structures, and the volume of collected video data is large; relying solely on manual review to find the defects in it is difficult and time-consuming. To overcome the difficulty the prior art has in describing defects, the present invention provides a time-series image description method for dam defects based on a local self-attention mechanism, which dynamically establishes the contextual feature relationships of the image sequence while enabling each word in the text to correspond directly to the relevant image frame, effectively improving the accuracy of the generated text. It provides an intuitive textual basis for completing dam safety inspection reports and reduces labor costs.
Technical solution: a time-series image description method for dam defects based on a local self-attention mechanism, including the following steps:
(1) frame-sample the input time-series images, use a convolutional neural network to extract the feature sequence, and use this feature sequence as the input of the self-attention encoder;
(2) use a Transformer network based on a deformable self-attention mechanism to encode the feature sequence of the time-series images, dynamically establishing the contextual relationship of each frame;
(3) use an LSTM network based on a local attention mechanism to generate the description text, so that each predicted word can focus on the corresponding image frame, improving the accuracy of text generation by establishing the contextual dependence between image and text.
The frame sampling of the input time-series images and the extraction of the feature sequence with a convolutional neural network proceed as follows:
(1.1) Divide the input time-series images into T non-overlapping segments of equal length, and randomly draw one frame x_t from each segment to form the set [x_1, x_2, …, x_T], increasing training diversity so that the network can learn different instance variations of the same defect.
(1.2) Process each sampled frame (i.e. the set [x_1, x_2, …, x_T]) with a convolutional neural network and extract its feature map as the input of the self-attention encoder, denoted F_t = [X_1, X_2, …, X_T], where X_t is the feature representation of each sampled frame.
The encoding of the feature sequence of the time-series images by the Transformer network based on the deformable self-attention mechanism proceeds as follows:
(2.1) To facilitate computing the contextual feature relationships of the time-series images, first obtain the query vector q, keyword vector k and value vector v corresponding to each sampled frame through linear fully connected layers:
q = Linear(X) = W_Q·X
k = Linear(X) = W_K·X
v = Linear(X) = W_V·X
where W_Q, W_K and W_V are the feature matrices required to compute each vector and X is the feature representation of each frame of the input image sequence. The q vector guides the current feature map to selectively attend to contextual features in the time dimension; the k vector is used to compute the attention weights between the current feature map and the other feature maps; the v vector is used to add the information of the current feature map into the self-attention weights.
(2.2) The attention weights of the current image patch are obtained from the scaled dot product of the q and k vectors:
Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v
where d_k is the input vector dimension, obtained by dividing the input sequence dimension by the number of self-attention heads. The dot product of the q and k vectors yields the similarity scores of the corresponding sequence elements, which are divided by √d_k for normalization to keep gradient propagation in the network stable.
(2.3) A multi-head deformable encoding structure is introduced into the Transformer network to avoid the slow model convergence caused by the excessive number of parameters of global self-attention. The model samples and computes attention weights over only a set of key frames around the current frame, i.e. the query vector q of each element in the sequence is assigned a fixed number of keyword vectors k:
DeformAttn(q, p_q) = Σ_{m=1}^{M} W_m [Σ_{k∈Ω} A_mqk · W′_m · x_v(p_q + Δp_mqk)]
where p_q is the position reference point of the current frame, x_v is the v vector corresponding to the feature map obtained above, and W_m and W′_m are feature matrices with learnable weights. Δp_mqk and A_mqk respectively denote the sampling offset and self-attention weight of the k-th sampling point in the m-th self-attention head, normalized so that Σ_{k∈Ω} A_mqk = 1; they are obtained by training a fully connected network and are finally linearly projected onto the query vector, yielding the sampled frame feature map containing contextual information.
The generation of the description text by the LSTM network based on the local attention mechanism proceeds as follows:
(3.1) Given the feature representation sequence of the sampled frames of the time-series images, decode this sequence and compute the conditional probability of generating each word to obtain the description text {S_n} of the corresponding event:
log p(S|s) = Σ_t log₁₀ p(S_t | S_{<t}, s)
where the logarithm is base 10, s is the original text sequence and S_t denotes the t-th word of the text. To facilitate the computation of the attention mechanism and the neural-network implementation, the conditional probability is parameterized, and the probability of each word can be expressed as:
p(S_{n,i} | S_{n<j}, s) = softmax(g(h_j))
h_j = f(h_{j−1}, s)
where h_j is the hidden layer of the recurrent neural network; the function f computes the hidden state of the current position from the hidden-layer output of the previous position and the current vector, and its output is converted by the function g into a vector with the same dimension as the vocabulary.
(3.2) While computing the text, a context vector c_t is introduced; by concatenating c_t with the hidden state h_t of the sequence and multiplying by a parameter matrix w_c with learnable weights, the hidden state carrying the attention mechanism is obtained:
h̃_t = tanh(w_c·[c_t; h_t])
Finally, the corresponding word sequence is output through the softmax function and a fully connected neural network:
p(S_t | S_{<t}, s) = softmax(W_s·h̃_t)
where W_s denotes the fully connected output layer.
(3.3) When generating each target word, compute the centre position p_t of the current attention, i.e. the local attention mechanism:
p_t = S·sigmoid(v_pᵀ·tanh(W_p·h_t))
so that the output word can attend to the input-sequence positions related to it. Here the position matrix W_p and the penalty term v_p are both feature parameters with learnable weights, and S is the length of the input sequence; the attention window corresponding to this position is [p_t−D, p_t+D], where D denotes the window width. The hidden-layer vectors of the input and output sequences are scored through the align function and constrained by a Gaussian distribution, giving the attention weights:
a_t(s) = align(h_t, h̄_s)·exp(−(s−p_t)²/(2σ²))
where s indexes positions within the window centred at p_t, and σ = D/2 is used to normalize the calculation result. Finally, the context features (i.e. the context vector c_t), the attention weights and the previously generated words are concatenated as input to the LSTM network; the output word at the current position is computed by the fully connected network and the softmax activation function, and the words of all positions are combined into the complete description text.
A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the dam defect time-series image description method based on the local self-attention mechanism described above.
A computer-readable storage medium stores a computer program for executing the dam defect time-series image description method based on the local self-attention mechanism described above.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
1. In the frame sampling and feature extraction stage, random frame sampling is used to preprocess the original time-series images, effectively compressing the size of the encoding sequence; at the same time, feature extraction based on the convolutional neural network enables the Transformer network to focus on the visual features of the images when the self-attention mechanism is applied.
2. The Transformer network based on the deformable self-attention mechanism can dynamically establish the contextual relationship of each frame, avoiding the slow gradient descent during training caused by computing global feature relationships, which would require long training times and many training rounds for the model to converge.
3. When generating the description text, the LSTM network based on the local attention mechanism enables each predicted word to focus on the corresponding image frame, ensuring that the semantic information of the original time-series images is not missed in the generated text and improving the accuracy of the model.
Brief Description of the Drawings
Figure 1 is the overall framework diagram of defect time-series image description in a specific embodiment;
Figure 2 is a schematic structural diagram of the Transformer network based on the deformable self-attention mechanism in a specific embodiment;
Figure 3 is a schematic structural diagram of the LSTM network based on the local attention mechanism in a specific embodiment.
Detailed Description of the Embodiments
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
In the inspection of a certain power station dam project, video acquisition equipment such as drones and mobile cameras captures time-series images of defects. Each image sequence may contain four types of defects, namely cracks, alkaline precipitation, water seepage and concrete spalling. The time-series image features must be extracted by the model and the corresponding description text generated, thereby reducing the time spent on manual defect judgment and standardizing the inspection process.
Figure 1 shows the overall workflow of the dam defect time-series image description method based on the local self-attention mechanism; the specific implementation is as follows:
(1) Frame-sample the input time-series images, use a convolutional neural network to extract the feature sequence, and use this sequence as the input of the self-attention encoder.
(1.1) Divide the input time-series images into T non-overlapping segments of equal length, and randomly draw one frame x_t from each segment to form the set [x_1, x_2, …, x_T], increasing training diversity so that the network can learn different instance variations of the same defect.
(1.2) Process each sampled frame with a convolutional neural network that uses ResNet50 as the backbone, extract its feature map as the input of the self-attention encoder, and compress its size to half that of the original image, denoted F_t = [X_1, X_2, …, X_T], where X_t is the feature representation of each sampled frame.
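As a concrete illustration of steps (1.1)–(1.2), the following is a minimal PyTorch/torchvision sketch of random frame sampling and ResNet50 feature extraction. It is a sketch under stated assumptions, not the patented implementation: the "compress to half size" operation is modeled here with adaptive average pooling, the input is assumed to be a list of at least T frame tensors, and all names are illustrative.

```python
import random
import torch
import torch.nn as nn
import torchvision.models as models

def sample_frames(video_frames, T):
    """Split the frame list into T non-overlapping, equal-length segments and
    draw one random frame x_t from each, giving [x_1, x_2, ..., x_T]."""
    seg_len = len(video_frames) // T          # assumes len(video_frames) >= T
    return [random.choice(video_frames[i * seg_len:(i + 1) * seg_len]) for i in range(T)]

class FrameEncoder(nn.Module):
    """ResNet50 feature maps, spatially downsampled to half the backbone output size."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")   # pretrained weights assumed
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # keep conv feature maps

    def forward(self, frames):                # frames: (T, 3, H, W)
        fmap = self.body(frames)              # (T, 2048, H/32, W/32)
        h, w = max(fmap.shape[-2] // 2, 1), max(fmap.shape[-1] // 2, 1)
        return nn.functional.adaptive_avg_pool2d(fmap, (h, w))  # halved feature maps

# usage sketch: F_t = FrameEncoder()(torch.stack(sample_frames(frames, T=16)))
```

Random sampling (rather than fixed-stride sampling) is what gives the network different instance variations of the same defect across training epochs.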
(2) Use the Transformer network based on the deformable self-attention mechanism to encode the feature sequence of the time-series images and dynamically establish the contextual relationship of each frame, as shown in Figure 2.
(2.1) To facilitate computing the contextual feature relationships of the time-series images, first obtain the query vector q, keyword vector k and value vector v corresponding to each sampled frame through linear fully connected layers:
q = Linear(X) = W_Q·X
k = Linear(X) = W_K·X
v = Linear(X) = W_V·X
where the q vector is the query vector, guiding the current feature map to selectively attend to contextual features in the time dimension; the k vector is the keyword vector, used to compute the attention weights between the current feature map and the other feature maps; the v vector is the value vector, used to add the information of the current feature map into the self-attention weights. The Transformer network consists of 8 attention heads and a 512-dimensional fully connected network, and the weights of each attention head are computed independently.
(2.2) The attention weights of the current image patch are obtained from the scaled dot product of the q and k vectors:
Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v
where d_k is the input vector dimension, obtained by dividing the input sequence dimension by the number of self-attention heads. The dot product of the q and k vectors yields the similarity scores of the corresponding sequence elements, which are divided by √d_k for normalization to keep gradient propagation in the network stable.
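A minimal sketch of the multi-head scaled-dot-product attention of step (2.2), using the 8 heads and 512-dimensional width stated above (so d_k = 512/8 = 64); class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """q = W_Q X, k = W_K X, v = W_V X; Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v,
    computed independently for each head."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, X):                                    # X: (B, T, d_model)
        B, T, _ = X.shape
        split = lambda t: t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_Q(X)), split(self.W_K(X)), split(self.W_V(X))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # similarity, scaled by √d_k
        attn = scores.softmax(dim=-1)                        # per-head attention weights
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, -1)   # recombine heads
        return self.out(ctx)
```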
(2.3) A multi-head deformable encoding structure is introduced into the Transformer network to avoid the slow model convergence caused by the excessive number of parameters of global self-attention. The model samples and computes attention weights over only a set of key frames around the current frame, i.e. the query vector q of each element in the sequence is assigned a fixed number of keyword vectors k:
DeformAttn(q, p_q) = Σ_{m=1}^{M} W_m [Σ_{k∈Ω} A_mqk · W′_m · x_v(p_q + Δp_mqk)]
where p_q is the position reference point of the current frame, and W_m and W′_m are feature matrices with learnable weights; both are learned through the network and are identical in function and dimensions, but their weights differ. Δp_mqk and A_mqk respectively denote the sampling offset and self-attention weight of the k-th sampling point in the m-th self-attention head, normalized so that Σ_{k∈Ω} A_mqk = 1; they are obtained by training a fully connected network and are finally linearly projected onto the query vector, and the sampled frame feature map containing contextual information is output through a 512-dimensional multi-layer perceptron network.
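The deformable sampling of step (2.3) can be sketched as a simplified one-dimensional (temporal) analogue of the deformable attention used in Deformable DETR: the offsets Δp_mqk and weights A_mqk are predicted by linear layers from the query, and features are gathered by linear interpolation along the time axis. For brevity this sketch shares the value transform across heads and averages full-width values over heads instead of splitting per-head channels; it is illustrative, not the patented implementation.

```python
import torch
import torch.nn as nn

class TemporalDeformableAttention(nn.Module):
    """Each query frame attends to K sampled key frames around its reference point p_q:
    out = W_m · Σ_{m,k} A_mqk · W'_m x_v(p_q + Δp_mqk), with Σ_k A_mqk = 1."""
    def __init__(self, d_model=512, n_heads=8, n_points=4):
        super().__init__()
        self.m, self.k = n_heads, n_points
        self.offsets = nn.Linear(d_model, n_heads * n_points)   # predicts Δp_mqk
        self.attn = nn.Linear(d_model, n_heads * n_points)      # predicts A_mqk (pre-softmax)
        self.W_v = nn.Linear(d_model, d_model)                  # W'_m (shared across heads here)
        self.W_o = nn.Linear(d_model, d_model)                  # W_m output projection

    def forward(self, x):                                       # x: (B, T, D)
        B, T, D = x.shape
        p_q = torch.arange(T, device=x.device).view(1, T, 1)    # reference point per frame
        dp = self.offsets(x).view(B, T, self.m * self.k)        # sampling offsets
        A = self.attn(x).view(B, T, self.m, self.k).softmax(-1) # normalized: Σ_k A_mqk = 1
        pos = (p_q + dp).clamp(0, T - 1).view(B, -1)            # sampled temporal positions
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        frac = (pos - pos.floor()).unsqueeze(-1)
        v = self.W_v(x)                                         # transformed values
        take = lambda i: torch.gather(v, 1, i.unsqueeze(-1).expand(-1, -1, D))
        samp = take(lo) * (1 - frac) + take(hi) * frac          # x_v(p_q + Δp), interpolated
        samp = samp.view(B, T, self.m, self.k, D)
        out = (A.unsqueeze(-1) * samp).sum(3).mean(2)           # weight K points, average heads
        return self.W_o(out)
```

Because each query touches only K sampled frames instead of all T, the cost per layer scales with K rather than the sequence length, which is the stated motivation for avoiding global self-attention here.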
(3) Use the LSTM network based on the local attention mechanism to generate the description text, so that each predicted word can focus on the corresponding image frame, improving the accuracy of text generation by establishing the contextual dependence between image and text, as shown in Figure 3.
(3.1) Given the feature representation sequence of the sampled frames of the time-series images, decode this sequence and compute the conditional probability of generating each word to obtain the description text {S_n} of the corresponding event:
log p(S|s) = Σ_t log₁₀ p(S_t | S_{<t}, s)
where S_t denotes the t-th word of the text. To facilitate the computation of the attention mechanism and the neural-network implementation, the conditional probability is parameterized, and the probability of each word can be expressed as:
p(S_n | S_{n<j}, s) = softmax(g(h_j))
h_j = f(h_{j−1}, s)
where h_j is the hidden layer of the recurrent neural network; the function f computes the hidden state of the current position from the hidden-layer output of the previous position and the current vector, and its output is converted by the function g into a vector with the same dimension as the vocabulary.
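The recurrence h_j = f(h_{j−1}, s) with output head g maps naturally onto an LSTM cell followed by a linear layer; a minimal sketch, in which the dimensions and vocabulary size are assumed values:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: h_j = f(h_{j-1}, s); p(word) = softmax(g(h_j))."""
    def __init__(self, d_in=512, d_hidden=512, vocab_size=10000):
        super().__init__()
        self.f = nn.LSTMCell(d_in, d_hidden)      # recurrent transition f
        self.g = nn.Linear(d_hidden, vocab_size)  # maps h_j to the vocabulary dimension

    def forward(self, s_j, state=None):
        h_j, c_j = self.f(s_j, state)             # hidden state of the current position
        return self.g(h_j).softmax(dim=-1), (h_j, c_j)
```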
(3.2) While computing the text, a context vector c_t is introduced; by concatenating c_t with the hidden state h_t of the sequence and multiplying by a parameter matrix w_c with learnable weights, the hidden state carrying the attention mechanism is obtained:
h̃_t = tanh(w_c·[c_t; h_t])
Finally, the corresponding word sequence is output through the softmax function and a fully connected neural network:
p(S_t | S_{<t}, s) = softmax(W_s·h̃_t)
where W_s denotes the fully connected output layer.
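Step (3.2) corresponds to the following small sketch: the context vector and hidden state are concatenated, transformed by the learnable matrix w_c with a tanh, and projected to the vocabulary. The output projection W_s is an assumed name for the fully connected layer mentioned in the text.

```python
import torch
import torch.nn as nn

class AttentionalOutput(nn.Module):
    """h̃_t = tanh(w_c·[c_t; h_t]); p(S_t) = softmax(W_s·h̃_t)."""
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        self.w_c = nn.Linear(2 * d_model, d_model, bias=False)  # learnable matrix w_c
        self.W_s = nn.Linear(d_model, vocab_size)               # fully connected output layer

    def forward(self, c_t, h_t):
        h_tilde = torch.tanh(self.w_c(torch.cat([c_t, h_t], dim=-1)))
        return self.W_s(h_tilde).softmax(dim=-1)                # probability of each word
```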
(3.3) When generating each target word, compute the centre position p_t of the current attention, i.e. the local attention mechanism:
p_t = S·sigmoid(v_pᵀ·tanh(W_p·h_t))
so that the output word can attend to the input-sequence positions related to it. The position matrix W_p and the penalty term v_p are both feature parameters with learnable weights, and S is the length of the input sequence; the attention window corresponding to this position is [p_t−D, p_t+D]. The hidden-layer vectors of the input and output sequences are scored through the align function and constrained by a Gaussian distribution, giving the attention weights:
a_t(s) = align(h_t, h̄_s)·exp(−(s−p_t)²/(2σ²)), with σ = D/2
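Step (3.3) follows the predictive local-attention formulation; a minimal sketch is given below. The text does not specify the align function, so the dot-product score used here is an assumption, and the window width D and all names are illustrative.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """p_t = S·sigmoid(v_pᵀ·tanh(W_p·h_t)); the weight of source position s is its align
    score damped by a Gaussian centred at p_t with σ = D/2."""
    def __init__(self, d_model=512, window_D=4):
        super().__init__()
        self.W_p = nn.Linear(d_model, d_model, bias=False)  # position matrix W_p
        self.v_p = nn.Linear(d_model, 1, bias=False)        # penalty term v_p
        self.D = window_D

    def forward(self, h_t, enc):               # h_t: (B, d); enc: (B, S, d) frame features
        B, S, _ = enc.shape
        p_t = S * torch.sigmoid(self.v_p(torch.tanh(self.W_p(h_t))))  # (B, 1) window centre
        align = (enc @ h_t.unsqueeze(-1)).squeeze(-1).softmax(dim=-1) # dot-product align score
        s = torch.arange(S, device=enc.device).float().unsqueeze(0)   # source positions
        sigma = self.D / 2.0
        gauss = torch.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))       # Gaussian constraint
        a_t = align * gauss                                           # local attention weights
        c_t = (a_t.unsqueeze(-1) * enc).sum(dim=1)                    # context vector c_t
        return c_t, a_t
```

The Gaussian factor is what ties each generated word to the few frames around p_t, rather than letting it attend uniformly over the whole sequence.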
Finally, the context features, the attention weights and the previously generated words are concatenated as input to the LSTM network; the output word at the current position is computed by the fully connected network and the softmax activation function, and the words of all positions are combined into the complete description text. The result of generating description text for the dam defect time-series images is shown in Figure 1: based on the features of the input image sequence, the model maps each mention of calcification to the several image frames most strongly correlated with it; by generating keywords directly from the images, the accuracy of the model's description text is effectively improved.
Obviously, those skilled in the art should understand that the above steps of the dam defect time-series image description method based on the local self-attention mechanism of the embodiments of the present invention can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented with program code executable by the computing devices, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in a different order than here, or they may be fabricated separately as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.

Claims (6)

  1. A time-series image description method for dam defects based on a local self-attention mechanism, characterized by including the following steps:
    (1) frame-sampling the input time-series images and using a convolutional neural network to extract a feature sequence;
    (2) using a Transformer network based on a deformable self-attention mechanism to encode the feature sequence of the time-series images, dynamically establishing the contextual relationship of each frame;
    (3) using an LSTM network based on a local attention mechanism to generate description text, so that each predicted word can focus on the corresponding image frame.
  2. The dam defect time-series image description method based on a local self-attention mechanism according to claim 1, characterized in that in step (1), frame-sampling the input time-series images and using a convolutional neural network to extract the feature sequence comprises the following specific steps:
    (1.1) dividing the input time-series images into T non-overlapping segments of equal length, and randomly drawing one frame x_t from each segment to form the set [x_1, x_2, …, x_T];
    (1.2) processing each sampled frame with a convolutional neural network and extracting its feature map as the input of the self-attention encoder, denoted F_t = [X_1, X_2, …, X_T], where X_t is the feature representation of each sampled frame.
  3. The dam defect time-series image description method based on a local self-attention mechanism according to claim 1, characterized in that in step (2), encoding the feature sequence of the time-series images with the Transformer network based on the deformable self-attention mechanism comprises the following specific steps:
    (2.1) obtaining the query vector q, keyword vector k and value vector v corresponding to each sampled frame through linear fully connected layers:
    q = Linear(X) = W_Q·X
    k = Linear(X) = W_K·X
    v = Linear(X) = W_V·X
    where the q vector guides the current feature map to selectively attend to contextual features in the time dimension; the k vector is used to compute the attention weights between the current feature map and the other feature maps; the v vector is used to add the information of the current feature map into the self-attention weights;
    (2.2) obtaining the attention weights of the current image patch from the scaled dot product of the q and k vectors:
    Attention(q, k, v) = softmax(q·kᵀ/√d_k)·v
    where d_k is the input vector dimension, obtained by dividing the input sequence dimension by the number of self-attention heads; the dot product of the q and k vectors yields the similarity scores of the corresponding sequence elements, which are divided by √d_k for normalization to keep gradient propagation in the network stable;
    (2.3) introducing a multi-head deformable encoding structure into the Transformer network, so that the model samples and computes attention weights over only a set of key frames around the current frame, i.e. the query vector q of each element in the sequence is assigned a fixed number of keyword vectors k:
    DeformAttn(q, p_q) = Σ_{m=1}^{M} W_m [Σ_{k∈Ω} A_mqk · W′_m · x_v(p_q + Δp_mqk)]
    where p_q is the position reference point of the current frame, and W_m and W′_m are feature matrices with learnable weights; Δp_mqk and A_mqk respectively denote the sampling offset and self-attention weight of the k-th sampling point in the m-th self-attention head, normalized so that Σ_{k∈Ω} A_mqk = 1, obtained by training a fully connected network and finally linearly projected onto the query vector, yielding the sampled frame feature map containing contextual information.
  4. The dam defect time-series image description method based on a local self-attention mechanism according to claim 1, characterized in that in step (3), generating the description text with the LSTM network based on the local attention mechanism comprises the following specific steps:
    (3.1) given the feature representation sequence of the sampled frames of the time-series images, decoding this sequence and computing the conditional probability of generating each word to obtain the description text {S_n} of the corresponding event:
    log p(S|s) = Σ_t log₁₀ p(S_t | S_{<t}, s)
    where S_t denotes the t-th word of the text, and the probability of each word is expressed as:
    p(S_{n,i} | S_{n<j}, s) = softmax(g(h_j))
    h_j = f(h_{j−1}, s)
    where h_j is the hidden layer of the recurrent neural network; the function f computes the hidden state of the current position from the hidden-layer output of the previous position and the current vector, and its output is converted by the function g into a vector with the same dimension as the vocabulary;
    (3.2) while computing the text, introducing a context vector c_t, and concatenating c_t with the hidden state h_t of the sequence and multiplying by a parameter matrix w_c with learnable weights to obtain the hidden state carrying the attention mechanism:
    h̃_t = tanh(w_c·[c_t; h_t])
    and finally outputting the corresponding word sequence through the softmax function and a fully connected neural network:
    p(S_t | S_{<t}, s) = softmax(W_s·h̃_t)
    (3.3) when generating each target word, computing the centre position p_t of the current attention, i.e. the local attention mechanism:
    p_t = S·sigmoid(v_pᵀ·tanh(W_p·h_t))
    so that the output word can attend to the input-sequence positions related to it, where the position matrix W_p and the penalty term v_p are both feature parameters with learnable weights and S is the length of the input sequence; the attention window corresponding to this position is [p_t−D, p_t+D], and the hidden-layer vectors of the input and output sequences are scored through the align function and constrained by a Gaussian distribution to obtain the attention weights:
    a_t(s) = align(h_t, h̄_s)·exp(−(s−p_t)²/(2σ²)), with σ = D/2
    and finally the context features, the attention weights and the previously generated words are concatenated as input to the LSTM network, and the output word at the current position is computed by the fully connected network and the softmax activation function.
  5. A computer device, characterized in that the computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the dam defect time-series image description method based on the local self-attention mechanism according to any one of claims 1 to 4.
  6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the dam defect time-series image description method based on the local self-attention mechanism according to any one of claims 1 to 4.
PCT/CN2023/093153 2022-05-11 2023-05-10 Time-series image description method for dam defects based on a local self-attention mechanism WO2023217163A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/337,409 US20230368500A1 (en) 2022-05-11 2023-06-19 Time-series image description method for dam defects based on local self-attention

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210513592.1 2022-05-11
CN202210513592.1A CN114998673B (zh) 2022-05-11 2022-05-11 Time-series image description method for dam defects based on a local self-attention mechanism

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/337,409 Continuation US20230368500A1 (en) 2022-05-11 2023-06-19 Time-series image description method for dam defects based on local self-attention

Publications (1)

Publication Number Publication Date
WO2023217163A1 (zh)

Family

Family ID: 83026948

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093153 WO2023217163A1 (zh) 2022-05-11 2023-05-10 Time-series image description method for dam defects based on a local self-attention mechanism

Country Status (2)

Country Link
CN (1) CN114998673B (zh)
WO (1) WO2023217163A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292243A (zh) * 2023-11-24 2023-12-26 Deep-learning-based spatiotemporal image prediction method, device and medium for magnetocardiography signals
CN117372936A (zh) * 2023-12-07 2024-01-09 Video description method and system based on a multimodal fine-grained alignment network
CN117493786A (zh) * 2023-12-29 2024-02-02 Remote sensing data reconstruction method combining a generative adversarial network with a graph neural network
CN117807603A (zh) * 2024-02-29 2024-04-02 Software supply chain audit method, system and computer-readable storage medium
CN118097318A (zh) * 2024-04-28 2024-05-28 Controllable defect image generation method and device based on visual-semantic fusion
CN118155227A (zh) * 2024-05-13 2024-06-07 Maintenance decision-making method and system for nuclear power equipment based on intelligent technology
CN118332414A (zh) * 2024-06-13 2024-07-12 Method and system for generating chart description text fusing numerical and visual features
CN118332342A (zh) * 2024-06-12 2024-07-12 Industrial process data augmentation and generation method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998673B (zh) 2022-05-11 2023-10-13 Time-series image description method for dam defects based on a local self-attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929092A (zh) * 2019-11-19 2020-03-27 Multi-event video description method based on a dynamic attention mechanism
CN111597819A (zh) * 2020-05-08 2020-08-28 Keyword-based method for generating dam defect image description text
CN113392717A (zh) * 2021-05-21 2021-09-14 Dense video description generation method based on a temporal feature pyramid
CN114998673A (zh) 2022-05-11 2022-09-02 Time-series image description method for dam defects based on a local self-attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598339B1 (en) * 2018-07-19 2024-09-04 Tata Consultancy Services Limited Systems and methods for end-to-end handwritten text recognition using neural networks
CN109389055B (zh) * 2018-09-21 2021-07-20 西安电子科技大学 Video classification method based on hybrid convolution and an attention mechanism
CN112215223B (zh) * 2020-10-16 2024-03-19 清华大学 Multi-directional scene text recognition method and system based on a multi-element attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929092A (zh) * 2019-11-19 2020-03-27 Multi-event video description method based on a dynamic attention mechanism
CN111597819A (zh) * 2020-05-08 2020-08-28 Keyword-based method for generating dam defect image description text
CN113392717A (zh) * 2021-05-21 2021-09-14 Dense video description generation method based on a temporal feature pyramid
CN114998673A (zh) 2022-05-11 2022-09-02 Time-series image description method for dam defects based on a local self-attention mechanism

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292243A (zh) * 2023-11-24 2023-12-26 Deep-learning-based spatiotemporal image prediction method, device and medium for magnetocardiography signals
CN117292243B (zh) * 2023-11-24 2024-02-20 Deep-learning-based spatiotemporal image prediction method, device and medium for magnetocardiography signals
CN117372936A (zh) * 2023-12-07 2024-01-09 Video description method and system based on a multimodal fine-grained alignment network
CN117372936B (zh) * 2023-12-07 2024-03-22 Video description method and system based on a multimodal fine-grained alignment network
CN117493786A (zh) * 2023-12-29 2024-02-02 Remote sensing data reconstruction method combining a generative adversarial network with a graph neural network
CN117493786B (zh) * 2023-12-29 2024-04-09 Remote sensing data reconstruction method combining a generative adversarial network with a graph neural network
CN117807603A (zh) * 2024-02-29 2024-04-02 Software supply chain audit method, system and computer-readable storage medium
CN117807603B (zh) * 2024-02-29 2024-04-30 Software supply chain audit method, system and computer-readable storage medium
CN118097318A (zh) * 2024-04-28 2024-05-28 Controllable defect image generation method and device based on visual-semantic fusion
CN118155227A (zh) * 2024-05-13 2024-06-07 Maintenance decision-making method and system for nuclear power equipment based on intelligent technology
CN118332342A (zh) * 2024-06-12 2024-07-12 Industrial process data augmentation and generation method
CN118332414A (zh) * 2024-06-13 2024-07-12 Method and system for generating chart description text fusing numerical and visual features

Also Published As

Publication number Publication date
CN114998673A (zh) 2022-09-02
CN114998673B (zh) 2023-10-13

Similar Documents

Publication Publication Date Title
WO2023217163A1 (zh) Time-series image description method for dam defects based on a local self-attention mechanism
CN111259940 (zh) Object detection method based on spatial attention maps
CN113515951 (zh) Story description generation method based on a knowledge-enhanced attention network and group-level semantics
US20230368500A1 (en) Time-series image description method for dam defects based on local self-attention
CN111464881 (zh) Fully convolutional video description generation method based on a self-optimization mechanism
EP3885966B1 (en) Method and device for generating natural language description information
CN111325323 (zh) Automatic description generation method for power transmission and transformation scenes fusing global and local information
CN111738169 (zh) Handwritten formula recognition method based on an end-to-end network model
CN113656581 (zh) Method, apparatus, device and storage medium for text classification and model training
CN113657115 (zh) Multimodal Mongolian sentiment analysis method based on sarcasm recognition and fine-grained feature fusion
CN110516530 (zh) Image description method based on non-aligned multi-view feature enhancement
CN114820871 (zh) Font generation method, model training method, apparatus, device and medium
CN107463928 (zh) Character sequence error-correction algorithm, system and device based on OCR and bidirectional LSTM
CN114973229 (zh) Text recognition model training and text recognition method, apparatus, device and medium
CN115994317 (zh) Incomplete multi-view multi-label classification method and system based on deep contrastive learning
CN114154016 (zh) Video description method based on target spatial semantic alignment
CN117475462 (zh) System and method for automatic identification and matching of mapping elements and features of distribution network lines
CN116092101 (zh) Training method, image recognition method, apparatus, device and readable storage medium
CN115984883 (zh) Hindi image-text recognition method based on an enhanced vision Transformer network
CN115331081 (zh) Image object detection method and apparatus
Zhu: Video captioning in compressed video
CN114970955 (zh) Short-video popularity prediction method and apparatus based on a multimodal pre-trained model
CN116310984 (zh) Multimodal video caption generation method based on Token sampling
CN113627556 (zh) Image classification implementation method and apparatus, electronic device and storage medium
CN118135466 (zh) Data processing method and apparatus, computer, storage medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802932

Country of ref document: EP

Kind code of ref document: A1