CN115690453A - Method and system for Thangka image recognition

Method and system for Thangka image recognition

Info

Publication number: CN115690453A
Authority: CN (China)
Prior art keywords: thangka, image, principal deity, network
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202211136353.5A
Other languages: Chinese (zh)
Inventors: 王露璐, 刘晓静
Current Assignee: Qinghai University
Original Assignee: Qinghai University
Application filed by Qinghai University
Priority to CN202211136353.5A
Publication of CN115690453A

Abstract

The invention discloses a method and a system for Thangka image recognition. To improve feature extraction at the image end, the encoder structure is improved: two convolutional neural networks extract the image features, and the extracted feature vectors are fused as the encoder output. A parallel convolutional tail chain is added to the ResNet-50 network so that the principal deity of the Thangka and the deity's other features are extracted simultaneously. The long short-term memory text generation model is optimized with a Transformer network, to which layer normalization is applied, finally yielding a deep-learning Thangka image description algorithm. The method introduces image description into the Thangka scenario, better assists understanding of Thangka images, enables convenient retrieval of large-scale Thangka images when combined with text-based information search technology, and is of great significance for the digital preservation and development of Thangka cultural resources. The system designed on the basis of this method is simple to operate, has a concise interface, and is suitable for all users.

Description

Method and system for Thangka image recognition
Technical Field
The invention relates to the technical field of computer image description, in particular to a Thangka image recognition method and system.
Background
As an art form that is both regional and religious, the Thangka is the core competitive strength of the culture-and-art industry in the Qinghai-Tibet region. However, because the principal deities, ritual instruments, and classical allusions depicted in figurative Thangkas involve domain knowledge of religion, region, history, and ethnicity, understanding the content of a Thangka image remains very difficult for most people. At present, understanding Thangka image content still relies on expert domain knowledge, and there is little research on Thangka image description.
The task of automatic image description is also called image captioning; put simply, it asks a machine to "speak about what it sees", that is, to automatically generate descriptive text for an input image. The task requires the system to build a mapping from the image to a corresponding textual description, which demands not only recognition of the salient objects in the image but also the linguistic ability to describe the image's most salient aspects: the former involves computer-vision feature extraction, the latter a text generation task.
From a computer-vision perspective, automatic image description differs from other image understanding tasks in that it must not only identify the objects in an image and their attributes but also determine the relationships between those objects, and then describe these features in grammatically and semantically correct text. From a text-generation perspective, automatic image description is viewed as a translation problem that "translates" images into text, which is why the encoder-decoder framework from machine translation was introduced to solve the task. The framework has two parts: the encoder understands the input image, and the current mainstream approach extracts image features with a convolutional neural network (CNN); the decoder takes the CNN-extracted image feature vectors as input and generates the description text, a text generation task in the NLP field, for which a recurrent neural network (RNN), and in particular a long short-term memory network (LSTM), is used to generate text sequences. The Transformer network, built on a self-attention mechanism, matches the compositional rules of Thangka better than the natural sequential structure of the LSTM, so it can further improve the performance of the whole model.
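To make the framework above concrete, the following is a minimal sketch (in PyTorch; all module names and dimensions are illustrative assumptions, not the method of this application) of a generic CNN-encoder / LSTM-decoder captioner of the kind described:

import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    # Generic encoder-decoder captioner: CNN features seed an LSTM decoder.
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # keep conv stem + GAP, drop FC head
        self.img_proj = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)      # (B, 2048) image feature vector
        img_tok = self.img_proj(feats).unsqueeze(1)  # image fed as the first "token"
        seq = torch.cat([img_tok, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                      # per-step vocabulary logits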
In conclusion, the Thangka is a cultural resource with distinctly ethnic features, but research that combines information technology with Thangka preservation and development is scarce, and research on multi-modal learning tasks for Thangka images is rarer still. How to introduce a deep neural network to assist in understanding Thangka image content therefore remains a topic worth studying.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method and a system for Thangka image recognition.
The invention provides a Thangka image recognition method, which specifically comprises the following steps:
S1, inputting the acquired Thangka image into an encoder, and extracting a principal deity feature vector of the Thangka image;
S2, inputting the principal deity feature vector and the acquired other feature vectors of the principal deity to a decoder, and obtaining target feature information;
S3, fusing the principal deity feature vector with the target feature information text to obtain the corresponding text description.
Further, in step S1, the encoder comprises two convolutional neural networks, specifically a main network ResNeXt-50 and a branch network MobileNet parallel to the main network, where the branch network MobileNet serves as an auxiliary classifier to simultaneously extract and output the principal deity and other feature vectors of the principal deity of the Thangka image; the principal deity feature vector finally output by the encoder is obtained by fusing the features extracted by the two convolutional neural networks.
Preferably, a global average pooling layer is added to the ResNeXt-50 network after the feature vector output of the last convolutional layer to control the scale of the feature vector, and the final principal deity feature vector is obtained by passing the vector output by the global average pooling layer through a fully-connected layer.
Further, in step S2, the decoder employs a Transformer-based model unit structure, constructed from 2 multi-head self-attention units and 1 feed-forward fully-connected unit, the feed-forward fully-connected unit being added between the 2 multi-head self-attention units; the scale problem of accumulated word embeddings is constrained using layer normalization (LN).
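A hedged sketch of such a decoder unit follows (PyTorch; head counts, dimensions, and the residual/normalization placement are illustrative assumptions, since only the 2-attention-plus-1-feed-forward arrangement is fixed above):

import torch
import torch.nn as nn

class DecoderUnit(nn.Module):
    # Feed-forward fully-connected unit placed between two multi-head
    # self-attention units, each sub-unit followed by layer normalization (LN).
    def __init__(self, dim=512, heads=8, ff_dim=2048):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.ln1(x + self.attn1(x, x, x)[0])  # first multi-head self-attention
        x = self.ln2(x + self.ff(x))              # feed-forward fully-connected unit in the middle
        x = self.ln3(x + self.attn2(x, x, x)[0])  # second multi-head self-attention
        return x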
The invention also protects a Thangka image recognition system, which comprises a front-end page and a back-end part;
the front-end page is used for interacting with the user and comprises a system home page, a Thangka image upload page, and a Thangka image description text display page;
the back-end part is used for Thangka image uploading and description text generation;
during image uploading, the uploaded Thangka image is stored on the server as a temporary file, and the file's storage path on the server is passed to the model that detects the principal deity.
The description text is generated by the steps of the above Thangka image recognition method.
Furthermore, the description text generation also comprises a template method for displaying the generated Thangka image text together with the associated extension content (religious meaning), wherein the extension content is stored on the server as static data and called directly when needed.
Compared with the prior art, the invention has the following beneficial effects:
to improve the feature extraction effect at the image end, the encoder structure is improved: two convolutional neural networks extract the image features, and the extracted feature vectors are fused as the encoder output; the structure of the ResNet-50 network is modified by adding a parallel convolutional tail chain so that the principal deity of the Thangka and the deity's other features are extracted simultaneously; the decoder is modified by optimizing the long short-term memory text generation model with a Transformer network and applying conditional layer normalization to it, finally yielding the deep-learning Thangka image description algorithm. The method introduces image description into the Thangka scenario, better assists understanding of Thangka images, enables convenient retrieval of large-scale Thangka images when combined with text-based information search technology, and is of great significance for the digital preservation and development of Thangka cultural resources. The system designed on the basis of this method is simple to operate, has a concise interface, and is suitable for all users.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is the ResNeXt-50 network structure;
FIG. 3 is the formula for the final image feature vector, obtained by passing the vector output by the global average pooling layer through a fully-connected layer;
FIG. 4 is the encoder structure;
FIG. 5 is the loss function formula of the encoder;
FIG. 6 is the BCEWithLogitsLoss function formula used for the encoder loss;
FIG. 7 is the BCEWithLogitsLoss flow chart;
FIG. 8 is the BCEWithLogitsLoss calculation formula;
FIG. 9 is the Transformer unit structure;
FIG. 10 is a flow chart of the system of the present invention;
FIG. 11 is a system home page of the present invention;
FIG. 12 is an image upload page of the system of the present invention;
fig. 13 is a Thangka image description text presentation page of the system of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Optimization process of the Thangka image recognition algorithm model
1. Method and approach for Thangka image recognition
Firstly, to improve the feature extraction effect at the image end, the encoder structure is improved: two convolutional neural networks extract the image features, and the extracted feature vectors are fused as the encoder output.
Secondly, the structure of the ResNet-50 network is modified by adding a parallel convolutional tail chain so that the principal deity of the Thangka and the deity's other features are extracted simultaneously.
Thirdly, the decoder is modified: the long short-term memory text generation model is optimized with a Transformer network, to which conditional layer normalization is applied.
2. Encoder
In the Thangka image description task, the encoder extracts image features from the input Thangka image; by processing the image through successive convolutional layers, the convolutional neural network can extract information at different scales from the image.
Typically, convolutional layers at different depths extract different levels of image information. The high-level features of a convolutional neural network generally contain more abstract and complex semantic information, while the low-level features focus on spatial information such as edges and contours. Deepening the network therefore helps extract more abstract semantic information. However, when the network is too deep, back-propagation leads to gradient explosion or gradient vanishing. To mitigate this, ResNet-50 is adopted as the base network for image feature extraction and optimized by adding a parallel tail chain, yielding the ResNeXt-50 main network of the encoder part. The specific structure of ResNeXt-50 is shown in FIG. 2.
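The residual idea that motivates this choice can be illustrated with a toy block (PyTorch; channel sizes are illustrative assumptions): the skip connection gives gradients a direct path and so eases the explosion/vanishing problem of very deep networks.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = x + F(x): the identity path lets gradients bypass the convolutions.
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))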
The ResNeXt-50 network of this application adds a global average pooling layer after the feature vector output of the last convolutional layer to control the scale of the feature vector. The final image feature vector is obtained by passing the vector output by the global average pooling layer through a fully-connected layer; the formula is shown in FIG. 3, where img denotes the input Thangka image, Conv_i the vector output by the last convolutional layer, and feature_i the final output image feature. The index i takes one of the values role or act: role denotes the main network and act the parallel network.
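The FIG. 3 formula, feature_i = FC(GAP(Conv_i(img))), can be sketched as follows (PyTorch; channel and output sizes are illustrative assumptions):

import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)  # global average pooling layer
fc = nn.Linear(2048, 512)      # fully-connected layer controlling the feature scale

conv_out = torch.randn(4, 2048, 7, 7)    # Conv_i(img): output of the last convolutional layer
feature = fc(gap(conv_out).flatten(1))   # feature_i, shape (4, 512)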
A Thangka image description mainly covers the type of the Thangka's principal deity and information such as the deity's statue and posture. Usually an attention mechanism is added at the image end to better extract the principal deity type information and other Thangka-related information; however, a Thangka is a painted image with little depth of field, making foreground and background hard to distinguish. Therefore, to better extract Thangka image features, MobileNet is used as a branch network, and the image feature vector finally output by the encoder is obtained by fusing the feature vectors of the two convolutional neural networks. The encoder structure is shown in FIG. 4.
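A minimal sketch of this two-branch encoder follows (PyTorch; the concatenate-then-project fusion and all sizes are assumptions, since the fusion operator is not spelled out here):

import torch
import torch.nn as nn
import torchvision.models as models

class ThangkaEncoder(nn.Module):
    # Main network (ResNeXt-50) plus a parallel MobileNet branch; their
    # feature vectors are fused into the final encoder output.
    def __init__(self, out_dim=512):
        super().__init__()
        main = models.resnext50_32x4d(weights=None)
        self.main = nn.Sequential(*list(main.children())[:-1])  # conv stem through GAP
        self.branch = models.mobilenet_v2(weights=None).features
        self.branch_pool = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Linear(2048 + 1280, out_dim)             # fusion: concat + FC

    def forward(self, img):
        f_main = self.main(img).flatten(1)                        # (B, 2048)
        f_branch = self.branch_pool(self.branch(img)).flatten(1)  # (B, 1280)
        return self.fuse(torch.cat([f_main, f_branch], dim=1))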
The base network is modified by adding a branch as an auxiliary classifier so that the principal deity of the Thangka image and features such as the deity's mudra, posture, and ritual instruments are extracted and output simultaneously. The loss function formula of the encoder is shown in FIG. 5.
where loss_base denotes the loss function of the ResNeXt-50 network, composed of the loss functions of the main network and the parallel network, and loss_trim denotes the loss function of the MobileNet network. The encoder loss is computed with the BCEWithLogitsLoss function, as shown in FIG. 6.
where base_out1 and base_out2 denote the output results of the main network and the parallel network of the ResNeXt-50 network, base_trim denotes the output result of the MobileNet network, and RoleTarget and ActTarget denote the ground-truth values of the principal deity category and of the deity's ritual instrument and mudra. The BCEWithLogitsLoss computation first normalizes the input vector with a Sigmoid function and then computes the loss with the binary cross-entropy function. The calculation flow is shown in FIG. 7, and the resulting BCEWithLogitsLoss calculation formula in FIG. 8.
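This is exactly the behavior of PyTorch's BCEWithLogitsLoss, which fuses the Sigmoid and binary cross-entropy steps of FIG. 7; the tensors below are illustrative stand-ins for the network outputs and targets:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)                    # e.g. base_out1: raw class scores
target = torch.randint(0, 2, (4, 10)).float()  # e.g. RoleTarget: multi-label ground truth

loss_fused = nn.BCEWithLogitsLoss()(logits, target)
loss_twostep = nn.BCELoss()(torch.sigmoid(logits), target)  # Sigmoid, then binary cross-entropy
assert torch.allclose(loss_fused, loss_twostep, atol=1e-5)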
Using the BCEWithLogitsLoss function makes the classification of the principal deity more accurate and allows more precise semantic information, such as the ritual instrument and mudra, to be extracted, so that a more accurate Thangka image description is finally obtained.
3. Decoder
The decoder borrows from the Transformer model, using its self-attention mechanism and feed-forward fully-connected layer to construct a self-attention sublayer and a feed-forward fully-connected sublayer, respectively. The Transformer network discards the recurrent structure of the RNN and learns the information features in text entirely through a multi-head self-attention mechanism; it models long-text dependencies better while allowing parallelized training, greatly improving performance and efficiency. Meanwhile, to increase the generalization of the model, a mask is applied to the extracted Thangka image feature vector at the input, with the formula:
feature_encoding = attention_mask(feature_act)
Therefore, in the Thangka image description model, the objective function to be optimized is the log maximum-likelihood function:
log p(W | feature_encoding) = Σ_{i=1}^{L} log p(W_i | W_{1:i-1}, feature_encoding)
where W is the matrix formed by the embedding vectors of each word in the sentence, expressible as {W_1, W_2, …, W_L}; L is the sentence length; W_i is the word-embedding vector of the i-th word; and p(W_i | W_{1:i-1}, feature_encoding) denotes the probability of generating the i-th word given the first i-1 already-generated words and the image feature vector.
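With teacher forcing, this summed log-probability reduces to a token-level negative log-likelihood, as in the following sketch (PyTorch; the decoder call is a placeholder for the Transformer decoder described here):

import torch
import torch.nn.functional as F

def caption_nll(decoder, feature_encoding, words):
    # words: (B, L) gold token ids; logits: (B, L-1, V) next-token scores
    logits = decoder(feature_encoding, words[:, :-1])
    log_p = F.log_softmax(logits, dim=-1)
    ll = log_p.gather(-1, words[:, 1:].unsqueeze(-1)).squeeze(-1)  # log p(W_i | W_1:i-1, feature_encoding)
    return -ll.sum(dim=1).mean()  # minimize the negative log-likelihood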
In the field of machine vision, normalization is usually performed along the batch dimension; but in text data, the information across different batches is weakly correlated and sentence lengths differ, so normalization along the batch dimension reduces variance poorly and may even erase differences between sentences. The Transformer network therefore uses layer normalization (LN) to constrain the scale problem of accumulated word embeddings, effectively reducing variance.
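The contrast can be seen directly: LayerNorm normalizes each token's embedding vector on its own, so it is indifferent to batch composition and sentence length (a short PyTorch sketch with illustrative sizes):

import torch
import torch.nn as nn

x = torch.randn(4, 12, 512)  # (batch, sentence length, embedding dim)
ln = nn.LayerNorm(512)       # normalizes over the last (embedding) dimension only
y = ln(x)                    # each token now has mean ~0 and variance ~1
print(y.mean(-1).abs().max(), y.var(-1, unbiased=False).mean())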
In the Thangka image description task, the category of the Thangka's principal deity has a great influence on the description: determining the principal deity category determines most of the semantic information in the description, including ritual instruments, mudras, and so on. Therefore, a self-attention unit and a feed-forward fully-connected unit are constructed using conditional layer normalization (CLN) as the normalization method, as shown in FIG. 9.
A decoder network is constructed using 2 self-attention units and 1 fully-connected unit, the fully-connected unit being added between the 2 self-attention units. Letting N denote the number of images and γ_c the condition input of the conditional layer normalization, the loss function of the model is as follows:
(The model loss function is given as an equation image in the original document.)
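A minimal sketch of conditional layer normalization as used here follows (PyTorch; predicting the gain and bias from the condition γ_c, e.g. an embedding of the principal deity category, is one common parameterization and an assumption, since only the idea is given here):

import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    # LayerNorm whose affine parameters are generated from a condition vector.
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.gain = nn.Linear(cond_dim, dim)  # gamma(γ_c)
        self.bias = nn.Linear(cond_dim, dim)  # beta(γ_c)

    def forward(self, x, cond):
        # x: (B, L, dim) token states; cond: (B, cond_dim) condition input γ_c
        g = self.gain(cond).unsqueeze(1)
        b = self.bias(cond).unsqueeze(1)
        return g * self.ln(x) + b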
example 2
A Thangka image identification method specifically comprises the following steps:
S1, inputting the acquired Thangka image into an encoder, and extracting a principal deity feature vector of the Thangka image. The encoder comprises two convolutional neural networks, specifically a main network ResNeXt-50 and a branch network MobileNet parallel to the main network; the branch network MobileNet serves as an auxiliary classifier to simultaneously extract and output the principal deity and other feature vectors of the principal deity of the Thangka image. The principal deity feature vector finally output by the encoder is obtained by fusing the features extracted by the two convolutional neural networks. The ResNeXt-50 network adds a global average pooling layer after the feature vector output of the last convolutional layer to control the scale of the feature vector, and the final principal deity feature vector is obtained by passing the pooled vector through a fully-connected layer.
S2, inputting the principal deity feature vector and the obtained other feature vectors of the principal deity to a decoder, and obtaining target feature information. The decoder adopts a Transformer-based model unit structure constructed from 2 multi-head self-attention units and 1 feed-forward fully-connected unit, the feed-forward fully-connected unit being added between the 2 multi-head self-attention units; the scale problem of accumulated word embeddings is constrained using layer normalization (LN).
S3, fusing the principal deity feature vector with the target feature information text to obtain the corresponding text description.
Example 3
1. System design idea
The aim is to let a user generate description text for a local Thangka digital image through a browser, assisting the user in understanding the Thangka image content. To help the user further, extension content is added on top of the model-generated text; this extension content is domain knowledge related to the principal deity, so the user can learn more about the Thangka image. FIG. 10 shows the main flow of the system of the present invention.
2. System implementation
1) Front-end page implementation
This part involves Web and JSP technologies and is mainly responsible for human-machine interaction. The Web technologies include HTML, CSS, and JavaScript. HTML (HyperText Markup Language) describes a web page through a set of tags, and the browser organizes the content according to the tags in the HTML file. CSS describes how HTML elements are displayed on screen, paper, or other media, with the purpose of separating content from presentation: the content is organized by HTML tags, while the presentation is controlled through CSS selectors and property values. JavaScript is a programming language for scripting web pages; by programming page behavior with JavaScript, simple human-computer interaction and some animation can be added. A page developed with HTML + CSS + JavaScript alone is static, whereas the Thangka description displayed by this system is generated dynamically from the uploaded Thangka image, so the display page is developed with JSP technology: Java code is embedded in the HTML document of the display page, and the page displays the Thangka description dynamically generated by the model in the backend.
2) Implementation of the backend portion
This part of the development involves parts of the Java EE technology stack and Java-Python hybrid development. Java EE is a set of standards for developing Web applications; the available technologies include the front-end JSP technology, the backend Servlet technology, and the JDBC technology for database access. The backend logic of the system mainly consists of image uploading and calling the deep learning model to generate the image description, and it is developed under the Java EE standard. Since the system's main development environment is Java EE but the deep learning part is developed in Python, Java-Python hybrid development is required to call the Python-trained deep learning model. There are three main methods for calling a Python deep learning model from Java.
The first method stores the parameters of the trained model in a structured document, re-implements the network structure in Java on the backend, and reads the parameter document to reproduce the trained model. Its drawbacks are obvious: on one hand the workload is very large, since the many deep learning frameworks available in Python cannot be used and the whole network must be implemented step by step from the bottom up; on the other hand, because the network must be built from scratch, the probability of bugs increases dramatically.
The second method packs the Python code into a jar file and calls it directly from the Java code. Its drawback is that many third-party libraries cannot be packed into jar files, so most deep learning frameworks cannot be used; the shortcomings largely match those of the first method: heavy workload and error-prone.
The third method creates a thread through Java's thread class, launches a Python interpreter from that thread, runs the Python code directly through the interpreter with the required parameters, and returns the result of the Python code to the Java code.
This system uses the third method, running the trained deep learning model through a process-class object. This approach preserves the efficiency and correctness of the Python code: Java only needs to obtain the result of the Python code to carry out the next operation.
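The Python side of this third method can be sketched as a standalone entry script that the Java backend launches as a subprocess, passing the server-side image path as an argument and reading the generated description from stdout (the describe stub is a hypothetical placeholder for the trained pipeline):

import sys

def describe(image_path):
    # Placeholder for the trained encoder-decoder pipeline; the real system
    # would load the model here and generate the Thangka description.
    return f"[description for {image_path}]"

if __name__ == "__main__":
    image_path = sys.argv[1]     # storage path handed over by the Java backend
    print(describe(image_path))  # Java reads this result from stdout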
3. System display
A Thangka image recognition system comprises a front-end page and a back-end part. The front-end page interacts with the user and comprises a system home page (FIG. 11), a Thangka image upload page (FIG. 12), and a Thangka image description text display page (FIG. 13). The back-end part handles Thangka image uploading and description text generation: the uploaded Thangka image is stored on the server as a temporary file, and the file's storage path on the server is passed to the model that detects the principal deity; the description text is generated by the steps of the Thangka image recognition method above; the description text generation also comprises a template method for displaying the generated Thangka image text together with the contained extension content (religious meaning), wherein the extension content is stored on the server as static data and called directly when needed.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A Thangka image recognition method, characterized by comprising the following steps:
S1, inputting the acquired Thangka image into an encoder, and extracting a principal deity feature vector of the Thangka image;
S2, inputting the principal deity feature vector and the acquired other feature vectors of the principal deity to a decoder, and obtaining target feature information;
S3, fusing the principal deity feature vector with the target feature information text to obtain the corresponding text description.
2. The Thangka image recognition method of claim 1, wherein in step S1 the encoder comprises two convolutional neural networks, specifically a main network ResNeXt-50 and a branch network MobileNet parallel to the main network, the branch network MobileNet serving as an auxiliary classifier to simultaneously extract and output the principal deity and other feature vectors of the principal deity of the Thangka image; and the principal deity feature vector finally output by the encoder is obtained by fusing the features extracted by the two convolutional neural networks.
3. The method of claim 2, wherein the ResNeXt-50 network adds a global average pooling layer after the feature vector output of the last convolutional layer to control the scale of the feature vector, and the final principal deity feature vector is obtained by passing the vector output by the global average pooling layer through a fully-connected layer.
4. The method of claim 1, wherein in step S2 the decoder employs a Transformer-based model unit structure constructed from 2 multi-head self-attention units and 1 feed-forward fully-connected unit, the feed-forward fully-connected unit being added between the 2 multi-head self-attention units; and the scale problem of accumulated word embeddings is constrained using layer normalization (LN).
5. A Thangka image recognition system, characterized by comprising a front-end page and a back-end part;
the front-end page is used for interacting with the user and comprises a system home page, a Thangka image upload page, and a Thangka image description text display page;
the back-end part is used for Thangka image uploading and description text generation;
during image uploading, the uploaded Thangka image is stored on the server as a temporary file, and the file's storage path on the server is passed to the model that detects the principal deity;
and the description text is generated by the steps of the method of any one of claims 1-4.
6. The Thangka image recognition system of claim 5, wherein the description text generation further comprises a template method for displaying the generated Thangka image text together with the contained extension content, wherein the extension content is stored on the server as static data and called directly when needed.
CN202211136353.5A 2022-09-19 2022-09-19 Method and system for Thangka image recognition Pending CN115690453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211136353.5A CN115690453A (en) 2022-09-19 2022-09-19 Method and system for Thangka image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211136353.5A CN115690453A (en) 2022-09-19 2022-09-19 Method and system for Thangka image recognition

Publications (1)

Publication Number Publication Date
CN115690453A true CN115690453A (en) 2023-02-03

Family

ID=85062346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211136353.5A Pending CN115690453A (en) 2022-09-19 2022-09-19 Method and system for Thangka image recognition

Country Status (1)

Country Link
CN (1) CN115690453A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination