CN115115883A - License classification method and system based on multi-mode feature fusion - Google Patents

License classification method and system based on multi-mode feature fusion Download PDF

Info

Publication number
CN115115883A
Authority
CN
China
Prior art keywords
text
information
license
feature
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210757445.9A
Other languages
Chinese (zh)
Inventor
金耀辉 (Jin Yaohui)
邱健 (Qiu Jian)
王晴晴 (Wang Qingqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210757445.9A priority Critical patent/CN115115883A/en
Publication of CN115115883A publication Critical patent/CN115115883A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a license classification method and system based on multi-modal feature fusion. The method fully considers the multi-modal information contained in a license image, such as visual features, text semantic features, and text position features, and fully utilizes this information together with the interrelations among the modalities. A convolutional neural network is constructed to extract visual features and convert them into a visual feature vector; a language model is trained on the text information unique to licenses, and the text in the license image is converted into a text information vector. Multi-modal fusion of the visual feature vector and the text information vector retains the original single-modal visual features and text information while also making the interaction between the two modalities a basis for classification. Because the invention considers not only the visual features of the license image but also the text information and the interrelation between the two, the classification result achieves higher accuracy and finer-grained classification.

Description

License classification method and system based on multi-mode feature fusion
Technical Field
The invention relates to the technical field of computer vision and natural language processing, in particular to a license classification method and system based on multi-modal feature fusion.
Background
Deep learning, particularly convolutional neural networks, has enjoyed great success in both computer vision and natural language processing. On this basis, multi-modal learning can aggregate information from multi-source data, so that the representation learned by the model is more complete; given sufficient training data, the richer the modality types, the more accurate the estimate of the representation space.
In the license classification problem, traditional approaches either classify with hand-crafted rules or use a single modality (only visual features, or only text information), as in prior art publication No. CN111738251, and therefore lack scalability and universality. The license image itself is a special kind of image containing abundant multi-modal features (visual features, text information, text block position coordinates, etc.). Meanwhile, licenses contain rich key-value pair relations, which are often not fully utilized in practical application scenarios.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a license classification method and system based on multi-modal feature fusion, which consider not only the visual features of the license image but also the text content information and the text block position coordinate information, so that the classification result achieves higher classification accuracy and finer-grained classification.
The invention is realized by the following technical scheme.
In one aspect of the invention, a license classification method based on multi-modal feature fusion is provided, which comprises the following steps:
inputting the whole layout area of the license image into a convolutional neural network according to multi-mode information contained in the license image, extracting visual features, and representing layout visual modal information by using a layout visual feature vector;
extracting, according to the multi-modal information contained in the license image, the text information in the license image by using an optical character recognition model, which comprises:
extracting the complete text content in the license image and locating the position coordinate information of each text block.
The distance between text blocks is then computed from their position coordinates according to the formula:

d_ij = sqrt((x_i - x_j)^2 + (y_i - y_j)^2)

where i and j denote different text blocks and (x_i, y_i), (x_j, y_j) are their position coordinates. If d_ij is less than a preset threshold θ, the text contents of the two blocks are judged to be related, and the text content information is reconstructed according to the coordinate position relation of the corresponding blocks:

t_ij = t_i + t_j

where t_i and t_j are the text contents of the i-th and j-th text blocks, so that texts with a key-value pair relation are aggregated into long text information;
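As a rough sketch of this reconstruction step (not the patent's implementation): the (x, y, w, h) box format, the use of center-point Euclidean distance, and the concrete threshold value are all assumptions made for the example.

```python
# Hypothetical sketch of the text-reconstruction step: OCR text blocks whose
# position coordinates are closer than a threshold theta are assumed to form
# a key-value pair and are concatenated (t_ij = t_i + t_j) into a long text.
# Box format (x, y, w, h), center-point distance, and theta are illustrative.
import math

def block_distance(box_i, box_j):
    """Euclidean distance between the center points of two text blocks."""
    cx_i, cy_i = box_i[0] + box_i[2] / 2, box_i[1] + box_i[3] / 2
    cx_j, cy_j = box_j[0] + box_j[2] / 2, box_j[1] + box_j[3] / 2
    return math.hypot(cx_i - cx_j, cy_i - cy_j)

def reconstruct_long_text(blocks, theta):
    """Merge nearby text blocks into long texts; `blocks` is [(text, box)]."""
    merged, used = [], set()
    for i, (t_i, box_i) in enumerate(blocks):
        if i in used:
            continue
        parts = [t_i]
        for j in range(i + 1, len(blocks)):
            if j not in used and block_distance(box_i, blocks[j][1]) < theta:
                parts.append(blocks[j][0])  # t_ij = t_i + t_j
                used.add(j)
        merged.append("".join(parts))
    return merged

blocks = [
    ("Name:", (10, 10, 60, 20)),
    ("Zhang San", (80, 10, 90, 20)),  # near "Name:" -> same key-value pair
    ("ID No:", (10, 200, 60, 20)),    # far below -> starts a new long text
    ("123456", (80, 200, 70, 20)),
]
print(reconstruct_long_text(blocks, theta=100.0))  # two merged long texts
```

The threshold θ is a tuning parameter of the method: a larger value merges more distant blocks into a single long text.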
using the reconstructed long text information of the license text data as a training data set and, combined with text semantic expression, training to obtain a language model conforming to the semantics of license text;
and encoding the reconstructed long text information into a fixed-length text feature vector with the trained language model.
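The encoding step can be imitated with a toy encoder; nothing here is the patent's actual language model. Mean-pooling per-character embeddings guarantees a fixed output dimension regardless of text length; the random embedding table, vocabulary size, and character-level tokenization are assumptions for illustration.

```python
# Toy stand-in for the trained language model's encoder: a reconstructed long
# text is mapped to a fixed-length vector by mean-pooling token embeddings.
# Embedding table, vocabulary size, and tokenization are illustrative only.
import zlib

import numpy as np

DIM = 16          # fixed output dimension
VOCAB = 1000      # toy vocabulary size
EMBED = np.random.default_rng(0).normal(size=(VOCAB, DIM))

def encode_text(text):
    """Return a fixed-length (DIM,) feature vector for text of any length."""
    if not text:
        return np.zeros(DIM)
    # character-level tokens, a common choice for Chinese license text
    ids = [zlib.crc32(ch.encode("utf-8")) % VOCAB for ch in text]
    return EMBED[ids].mean(axis=0)

v_short = encode_text("Name:Zhang San")
v_long = encode_text("Business License No. 123456, issued 2022, Shanghai")
assert v_short.shape == v_long.shape == (DIM,)  # same length either way
```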
According to the obtained layout visual feature vector and long text feature vector of the license image, their tensor outer product is computed, which explicitly simulates single-modal and bimodal interactions and yields a new multi-dimensional feature vector;
and the multi-dimensional feature vector, i.e. the feature vector information after multi-modal feature fusion, is passed through a convolutional neural network for judgment, clustered according to the different mapping results in the space, and classified to obtain a fine-grained license type.
According to another aspect of the present invention, there is provided a license classification system based on multi-modal feature fusion, including: the system comprises a multi-modal feature extraction module, a text reconstruction module, a language model training module, a long text information feature vector representation module, a tensor outer product calculation module and a multi-modal feature information fusion judgment module, wherein:
the multi-modal feature extraction module acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
the text reconstruction module judges the key-value pair relationship according to the distance relationship of the position coordinates of different text blocks, reconstructs the content of the text blocks according to the position information of the text blocks and obtains a long text with the key-value pair relationship;
the language model training module is used for training a language model of unique text semantics in the license image by using the reconstructed long text;
the text information feature vector representation module converts the extracted long text into feature representation with fixed length by using the language model obtained by training;
the tensor outer product calculation module is used for making a tensor outer product of the layout visual characteristic vector and the long text characteristic vector, and explicitly expressing monomodal and bimodal interaction to obtain a feature-fused multidimensional characteristic vector;
the multi-mode characteristic information fusion judgment module judges the information of the obtained multi-mode fusion characteristic vector through a convolutional neural network, and classifies the multi-mode fusion characteristic vector according to different mapping results in space to obtain the fine-grained license type to be judged.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
1) In addition to the multi-modal characteristics, the key-value pair relations in the text information are fully considered: the license text information is reconstructed according to these relations, so the reconstructed long text can fully express the license text information; meanwhile, the text position information required for the reconstruction also provides a reference for visual feature extraction, making the multi-modal feature extraction better match the characteristics of licenses.
2) By classifying license images with the multi-modal feature classification technique, the method improves classification precision as well as the scalability and universality of the system.
3) The rich layout visual information, text content information, and text block position coordinate information contained in the license image are fully utilized. Feature fusion of the multiple modal feature vectors explicitly represents the single-modal information while realizing bimodal interaction, so the license classification result has higher accuracy and finer-grained class divisions, a wider application range, and better robustness.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a block diagram of a license classification method and system based on multi-modal feature fusion according to a preferred embodiment of the present invention;
fig. 2 is a flowchart of a license classification method based on multi-modal feature fusion in a preferred embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention are described in detail below with reference to fig. 1 and 2. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention. The present invention is in no way limited to any specific configuration and algorithm set forth below, but rather covers any modification, replacement or improvement of elements, components or algorithms without departing from the spirit of the invention. In the drawings and the following description, well-known structures and techniques are not shown in order to avoid unnecessarily obscuring the present invention.
The embodiment of the invention provides a license classification method based on multi-modal feature fusion, addressing the problems that traditional license image classification methods classify with simple rules or with a single modality, cannot fully utilize the rich multi-modal information of license images, and lack robustness. A convolutional neural network extracts the layout visual information of the license image, represented as a layout visual feature vector. An optical character recognition model extracts the text information of the license picture, including text content and text block position coordinate information, from which a language model suited to license text semantics is trained; the trained language model converts the long text into a fixed-length long text feature vector. The tensor outer product of the layout visual feature vector and the text feature vector yields a new multi-dimensional feature vector. This fused feature vector explicitly simulates single-modal and bimodal interactions, is judged by a convolutional neural network, and is clustered according to the different mapping results in the space to obtain a fine-grained license classification result. The method is generally applicable to fine-grained classification of license images having both visual features and text information.
The license classification method based on multi-modal feature fusion provided by the embodiment comprises the following steps:
S0a, inputting the whole layout area of the license image into a convolutional neural network according to the unique layout visual information of the license image, extracting visual features, and acquiring the layout visual feature vector of the license image;
S0b, extracting text information in the license image through an optical character recognition model according to the unique text semantic information of the license image, wherein the text information comprises text content information and text block position coordinate information of the license image;
S1, judging the key-value pair relation according to the distance relation of the position coordinates of different text blocks, and reconstructing the content of the text blocks according to the position information of the text blocks to obtain the long text with the key-value pair relation;
S2, training, using the reconstructed long text information as training data, to obtain a language model adapted to the license text semantics;
S3, converting the long text information into a feature vector with a fixed length through the trained language model, representing the long text information in the license image;
S4, according to the obtained layout visual feature vector and the long text feature vector of the license image, making the tensor outer product to obtain a new multi-dimensional feature vector representing single-modal and bimodal interaction in an explicit way;
and S5, passing the new feature vector obtained after multi-modal feature fusion through a convolutional neural network to construct a classification network, and classifying, by estimating and clustering the representation space of the fused feature vector, to obtain a fine-grained license classification result.
The feature vectors pass through a convolutional neural network to obtain different mapping results in the space, and the mapping results correspond to different license categories. The method mainly aims to perform feature fusion on multi-mode feature vectors by fully utilizing rich layout visual information and text information contained in a license image, can explicitly simulate single-mode information and dual-mode interaction, and constructs a classification network through a convolutional neural network, so that a license classification result has higher accuracy and finer-grained classification.
The method provided by the above embodiment of the present invention is further described below with reference to the accompanying drawings as follows:
as shown in fig. 1, which is a flow chart of license classification based on multi-modal feature fusion, the method includes the following steps:
S0a, inputting the license image into a visual feature extraction network, and acquiring a vector representation A of the visual features of the image;
In step S0b, text information in the license image, including text content information and text block position coordinate information of the license image, is extracted through an optical character recognition model according to text semantic information unique to the license image.
In step S1, the key-value pair relationship is determined according to the distance relationship between the position coordinates of different text blocks, and the content of the text blocks is reconstructed according to their position information to obtain a long text with the key-value pair relationship.
In step S2, using the reconstructed long text information as training data, a language model adapted to the license text semantics is obtained through training.
In step S3, the long text information is converted into a fixed-length feature vector B through the trained language model, so as to represent the long text information in the license image.
In step S4, the visual feature vector A and the text feature vector B are each extended by one dimension (a constant 1 is appended), and the tensor outer product of the resulting vectors gives a new feature vector C, which not only captures the feature correlation between the two modalities but also retains the modality-specific information. The calculation formula is as follows:

C = [A; 1] ⊗ [B; 1]

where ⊗ denotes the tensor (outer) product.
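Assuming this tensor-fusion formulation (a constant 1 appended to each modality vector before the outer product), the step can be sketched with NumPy; the dimensions of A and B here are illustrative.

```python
# Sketch of step S4: extend each modality vector by one dimension with a
# constant 1, then take the outer product C = [A; 1] (x) [B; 1]. The 1-entries
# make C retain the raw unimodal features alongside every bimodal product.
import numpy as np

A = np.array([0.2, -1.0, 0.5])   # layout visual feature vector (toy size)
B = np.array([1.5, 0.3])         # long-text feature vector (toy size)

A1 = np.append(A, 1.0)           # dimension extended by 1
B1 = np.append(B, 1.0)
C = np.outer(A1, B1)             # fused (len(A)+1) x (len(B)+1) tensor

assert C.shape == (4, 3)
assert np.allclose(C[-1, :-1], B)   # B survives in the last row
assert np.allclose(C[:-1, -1], A)   # A survives in the last column
```

For more than two modalities the same construction generalizes to a higher-order outer product, at the cost of multiplicative growth in the fused dimension.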
In step S5, a classification network is constructed, through a convolutional neural network, from the new feature vector obtained by fusing the multi-modal features, and a fine-grained license classification result is obtained by classification. This is specifically implemented as follows:
feature fusion of the multi-modal feature vectors explicitly simulates single-modal information and bimodal interaction, yielding new feature vectors; after the new feature vectors pass through the convolutional neural network, a classification network is constructed, and the different mapping-result clusters obtained in the space correspond to different license categories.
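A minimal stand-in for this classification step, with a random linear map in place of the patent's (unspecified) trained convolutional network: the fused tensor is flattened and mapped to a probability distribution over license categories.

```python
# Hypothetical classification head: flatten the fused tensor, apply a linear
# map, and softmax into license-type probabilities. A trained CNN would
# replace the random weights in a real system.
import numpy as np

rng = np.random.default_rng(1)
NUM_CLASSES = 4                       # illustrative number of license types
C = rng.normal(size=(4, 3))           # fused tensor from the outer-product step
W = rng.normal(size=(NUM_CLASSES, C.size))
b = np.zeros(NUM_CLASSES)

logits = W @ C.ravel() + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over license categories

assert probs.shape == (NUM_CLASSES,)
assert abs(probs.sum() - 1.0) < 1e-9
print("predicted license class:", int(probs.argmax()))
```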
Practical experiments show that, when license images with visual features and text information are classified by the license classification method based on multi-modal feature fusion, fine-grained classification is completed well and high accuracy is achieved.
Another embodiment of the present invention provides a system for license classification with multi-modal feature fusion, including: the system comprises a multi-modal feature extraction module, a text reconstruction module, a language model training module, a long text information feature vector representation module, a tensor outer product calculation module and a multi-modal feature information fusion judgment module; wherein:
the multi-modal feature extraction module acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
the text reconstruction module judges the key-value pair relation according to the distance relation of the position coordinates of different text blocks, reconstructs the content of the text blocks according to the position information of the text blocks and obtains a long text with the key-value pair relation;
the language model training module is used for training a language model of unique text semantics in the license image by using the reconstructed long text;
the text information feature vector representation module converts the extracted long text into feature representation with fixed length by using the language model obtained by training;
the tensor outer product calculation module is used for making a tensor outer product of the layout visual characteristic vector and the long text characteristic vector, and explicitly expressing monomodal and bimodal interaction to obtain a feature-fused multidimensional characteristic vector;
the multi-mode characteristic information fusion judgment module judges the information of the obtained multi-mode fusion characteristic vector through a convolutional neural network, and classifies the multi-mode fusion characteristic vector according to different mapping results in space to obtain the fine-grained license type to be judged.
As a preferred embodiment, the multi-modal feature extraction module is connected with the text reconstruction module to obtain layout visual feature vectors and long text information with key-value pair relationship after reconstruction;
As a preferred embodiment, the text reconstruction module is connected with the language model training module, and the language model conforming to the semantic expression in the license is obtained by training according to the reconstructed license long text information as a training data set in combination with the text semantic expression;
as a preferred embodiment, the language model training module is connected with the long text information feature vector representation module, and the trained language model adaptive to the certificate image text semantic expression obtained by training is used for converting the reconstructed long text into feature representation with fixed length;
As a preferred embodiment, the multi-modal feature extraction module and the long text information feature vector representation module are connected with the tensor outer product calculation module; the layout visual feature vector and the long text feature vector are subjected to the tensor outer product, so as to explicitly represent the single modalities and their bimodal interaction, obtaining a feature-fused multi-dimensional feature vector;
As a preferred embodiment, the tensor outer product calculation module is connected with the multi-modal feature information fusion judgment module; the obtained multi-dimensional feature vector is input into the convolutional neural network for further processing, and the multi-modal features are judged and classified according to the different mapping-result clusters in the space to obtain a fine-grained license classification result.
The system provided by the above embodiment of the present invention has the working process as follows:
The license image is input into the multi-modal feature extraction module, which acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and which extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
judging the key-value pair relation according to the distance relation of position coordinates of different text blocks, and reconstructing the content of the text blocks according to the position information of the text blocks to obtain a long text with the key-value pair relation;
training a language model of unique text semantics in the license image by using the reconstructed long text;
converting the extracted long text into a feature representation with a fixed length by using a language model obtained by training;
making a tensor outer product of the layout visual feature vector and the long text feature vector, and explicitly expressing single-modal and bimodal interactions to obtain a feature-fused multi-dimensional feature vector;
and inputting the new multi-dimensional feature vector into a multi-mode feature information fusion judgment module, and calculating, judging and classifying through a classification network trained by a convolutional neural network to obtain a fine-grained license classification result.
The license classification method and system based on multi-modal feature fusion provided by the embodiments of the invention extract multi-modal feature information, including layout visual features, text content information, and text block position coordinate information; judge the key-value pair relation according to the distance relation of the position coordinates of different text blocks, and reconstruct the content of the text blocks according to their position information to obtain a long text with the key-value pair relation; further train a language model adapted to the text semantics unique to license images; convert the long text into a fixed-length feature representation with the language model; perform multi-modal feature fusion by the tensor outer product, explicitly representing single-modal and bimodal interactions to obtain a new multi-dimensional feature vector; and pass the new feature vector through a classification network to obtain a fine-grained license classification result. By exploiting the abundant visual features, text content information, and text block position coordinate information in the license image, and by fusing the multi-modal features with the tensor outer product so that single-modal and bimodal interactions are explicitly represented, the license classification result achieves higher accuracy and finer-grained class divisions, with a wider application range and better robustness.
It should be noted that, the steps in the method provided by the present invention can be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art can implement the step flow of the method by referring to the technical scheme of the system, that is, the embodiment in the system can be understood as a preferred example of the implementation method, and details are not described herein.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (6)

1. A license classification method based on multi-modal feature fusion is characterized by comprising the following steps:
multi-modal feature extraction, namely extracting the visual features of the whole layout of the license image, the text content information, and the text block position information, and outputting the layout visual feature vector, the text content information, and the text block position coordinate information;
text reconstruction, namely reconstructing the content of the text blocks according to their position information and the distance between the position coordinates of different text blocks, to obtain a reconstructed long text;
language model training, namely training, with the reconstructed long text as a data set, a language model adapted to the distinctive text semantics in the license image;
long text vector representation, namely converting the long text into a fixed-length feature representation with the trained language model;
multi-modal feature fusion, namely taking the tensor outer product of the layout visual features and the long text features, explicitly representing the unimodal and bimodal interactions, to obtain a new multi-dimensional feature vector;
and classification, namely calculating and classifying the multi-dimensional feature vector with a convolutional neural network to obtain a fine-grained license classification result.
2. The license classification method based on multi-modal feature fusion according to claim 1, is characterized in that:
extracting the visual feature information of the whole license image layout by using a convolutional neural network to obtain the layout visual feature vector;
extracting license text information in the image by using an optical character recognition model, wherein the license text information comprises text content information and text block position coordinate information;
the distance relationship between the text block position coordinates is as follows:
Figure FDA0003720008410000011
where i and j denote different text blocks; if d_ij is less than a preset threshold θ, the text contents of the two blocks are judged to be related, and the text content information is reconstructed according to the coordinate position relationship of the corresponding text blocks by the formula:

t_ij = t_i + t_j

where t_i and t_j are the text contents of the i-th and j-th text blocks; in this way, texts with a key-value pair relationship are aggregated into long text information;
according to the reconstructed license long text information, taking it as a training data set and combining text semantic expression, a language model conforming to the semantic expression in the license is obtained by training;
and encoding the reconstructed long text information into a text feature vector with a fixed length by using the trained language model.
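The text-block distance test and key-value reconstruction of claim 2 can be sketched as follows; the Euclidean distance used here is an assumption, since the exact distance formula appears only in the patent figure:

```python
import numpy as np

def reconstruct_long_text(blocks, theta):
    """Merge OCR text blocks whose position coordinates are closer than theta.

    `blocks` is assumed to be a list of (text, (x, y)) tuples, e.g. OCR output.
    Plain Euclidean distance between block coordinates is used as an
    illustration of d_ij; blocks with d_ij < theta are treated as a
    key-value pair and their texts are concatenated (t_ij = t_i + t_j).
    """
    pieces = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            (t_i, (x_i, y_i)) = blocks[i]
            (t_j, (x_j, y_j)) = blocks[j]
            d_ij = np.hypot(x_i - x_j, y_i - y_j)
            if d_ij < theta:                 # related key-value pair
                pieces.append(t_i + t_j)     # aggregate into long text
    return " ".join(pieces)
```

For example, a block "ID:" next to a block "12345" would be merged into "ID:12345", while a distant block is left out of the long text.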
3. The license classification method based on multi-modal feature fusion according to claim 1, wherein the multi-modal feature fusion is performed according to the layout visual feature vector and the long text feature vector to obtain a multi-dimensional feature vector of multi-modal fusion, specifically:
appending a constant 1 to the visual feature vector A and the text feature vector B to extend each by one dimension, and then taking their tensor outer product to obtain the multi-modal fused multi-dimensional feature vector C, by the formula:

C = [A; 1] ⊗ [B; 1]
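A minimal NumPy sketch of the outer-product fusion of claim 3, where [A; 1] denotes the vector A with a constant 1 appended; appending the 1s makes the result contain the unimodal terms A and B as well as the bimodal interaction A ⊗ B:

```python
import numpy as np

def tensor_fusion(A, B):
    """Outer product of [A; 1] and [B; 1].

    The last row carries a copy of B (unimodal), the last column a copy
    of A (unimodal), and the remaining entries the pairwise products
    A_m * B_n (bimodal interaction); the corner entry is the constant 1.
    """
    A1 = np.append(A, 1.0)
    B1 = np.append(B, 1.0)
    return np.outer(A1, B1)   # shape (len(A) + 1, len(B) + 1)
```

The fused matrix is typically flattened before being fed to the downstream classification network.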
4. The license classification method based on multi-modal feature fusion according to claim 1, characterized in that the multi-dimensional feature vector is calculated and classified with a convolutional neural network to obtain a fine-grained license classification result, specifically: inputting the multi-dimensional feature vector into a convolutional neural network to construct a classification network;
and clustering according to the different mapping results in the feature space, each cluster corresponding to a different license category, to obtain the license classification result.
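A minimal stand-in for the classification head of claim 4; a single softmax layer replaces the convolutional network purely for illustration, and the weights `W` and bias `b` are hypothetical trained parameters:

```python
import numpy as np

def classify(fused, W, b):
    """Map the flattened fused feature to a license category.

    The patent specifies a convolutional neural network; here a single
    linear layer followed by softmax illustrates mapping the fused
    multi-dimensional feature to fine-grained license categories.
    """
    logits = W @ fused.ravel() + b
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

The predicted index would be looked up in a table of license categories (e.g. business license, ID card, driving license) to yield the final fine-grained result.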
5. A license classification system based on multi-modal feature fusion, comprising:
the multi-modal feature extraction module, used for acquiring the visual features of the whole layout of the license image, the text content information, and the text block position information, and outputting the layout visual feature vector, the text content information, and the text block position coordinate information;
the text reconstruction module, which reconstructs the content of the text blocks according to their position information and the distance between the position coordinates of different text blocks, to obtain a reconstructed long text;
the language model training module is used for training a language model of unique text semantics in the license image by using the acquired reconstructed long text;
the long text information feature vector representation module converts the reconstructed long text into feature representation with fixed length by using the trained language model;
the tensor outer product calculation module, which takes the tensor outer product of the visual feature vector and the long text feature vector, explicitly representing the unimodal and bimodal interactions, to obtain the feature-fused multi-dimensional feature vector;
and the multi-modal feature information fusion judgment module, which obtains further feature representations through a convolutional neural network, performs information judgment on the fused multi-dimensional feature vector, and classifies it to obtain the fine-grained license category to be determined.
6. The system for license classification based on multi-modal feature fusion of claim 5, further comprising any one or more of:
the multi-modal feature extraction module is connected with the text reconstruction module to obtain layout visual feature vectors and long text information with key-value pair relationship after reconstruction;
the text reconstruction module is connected with the language model training module; with the reconstructed license long text information as a training data set, and combining text semantic expression, a language model conforming to the semantic expression in the license is obtained by training;
the language model training module is connected with the long text information feature vector representation module, and is used for converting the reconstructed long text into feature representation with fixed length by using the trained language model adaptive to certificate image text semantic expression;
the multi-modal feature extraction module and the long text information feature vector representation module are connected with the tensor outer product calculation module, which takes the tensor outer product of the layout visual feature vector and the long text feature vector, explicitly representing the unimodal and bimodal interactions, to obtain the feature-fused multi-dimensional feature vector;
the tensor outer product calculation module is connected with the multi-modal feature information fusion judgment module, the obtained multi-dimensional feature vectors are input into the convolutional neural network for further processing, and the multi-modal features are judged and classified according to different mapping result clusters in the space, so that a fine-grained license classification result is obtained.
CN202210757445.9A 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion Pending CN115115883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210757445.9A CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210757445.9A CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Publications (1)

Publication Number Publication Date
CN115115883A true CN115115883A (en) 2022-09-27

Family

ID=83329800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210757445.9A Pending CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN115115883A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620265A (en) * 2022-12-19 2023-01-17 华南理工大学 Locomotive signboard information intelligent identification method and system based on deep learning
CN117421641A (en) * 2023-12-13 2024-01-19 深圳须弥云图空间科技有限公司 Text classification method, device, electronic equipment and readable storage medium
CN117421641B (en) * 2023-12-13 2024-04-16 深圳须弥云图空间科技有限公司 Text classification method, device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN109409222B (en) Multi-view facial expression recognition method based on mobile terminal
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
US20210012198A1 (en) Method for training deep neural network and apparatus
CN115115883A (en) License classification method and system based on multi-mode feature fusion
Zhai et al. Deep structure-revealed network for texture recognition
Shen et al. FEXNet: Foreground extraction network for human action recognition
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
Raut Facial emotion recognition using machine learning
CN113343974B (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113657274A (en) Table generation method and device, electronic equipment, storage medium and product
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN115409107A (en) Training method of multi-modal association building model and multi-modal data retrieval method
Shao et al. Deep multi-center learning for face alignment
CN117540007A (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN111695507B (en) Static gesture recognition method based on improved VGGNet network and PCA
CN114120074B (en) Training method and training device for image recognition model based on semantic enhancement
Zhao et al. ChildPredictor: A child face prediction framework with disentangled learning
CN117011539A (en) Target detection method, training method, device and equipment of target detection model
CN114781348A (en) Text similarity calculation method and system based on bag-of-words model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination