CN115115883A - License classification method and system based on multi-mode feature fusion - Google Patents

License classification method and system based on multi-mode feature fusion Download PDF

Info

Publication number
CN115115883A
Authority
CN
China
Prior art keywords
text
information
license
feature
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210757445.9A
Other languages
Chinese (zh)
Inventor
金耀辉 (Jin Yaohui)
邱健 (Qiu Jian)
王晴晴 (Wang Qingqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210757445.9A priority Critical patent/CN115115883A/en
Publication of CN115115883A publication Critical patent/CN115115883A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a license classification method and system based on multi-modal feature fusion. The method fully considers the multi-modal information contained in a license image, such as visual features, text semantic features, and text position features, and fully utilizes this information together with the interrelations among the modalities. A convolutional neural network is constructed to extract visual features and convert them into a visual feature vector; a language model is trained on the text information unique to licenses, and the text in the license image is converted into a text information vector. Multi-modal fusion of the visual feature vector and the text information vector retains the original single-modal visual features and text information while also making the interaction between the two modalities a basis for classification. Because the invention considers not only the visual features of the license image but also the text information and the interrelation between the two, the classification result achieves higher accuracy and finer-grained classification.

Description

License classification method and system based on multi-mode feature fusion
Technical Field
The invention relates to the technical field of computer vision and natural language processing, in particular to a license classification method and system based on multi-modal feature fusion.
Background
Deep learning, particularly convolutional neural networks, has enjoyed great success in both computer vision and natural language processing. On this basis, multi-modal learning can aggregate information from multi-source data, so that the representation learned by the model is more complete; given sufficient training data, the richer the modality types, the more accurate the estimate of the representation space.
In the license classification problem, traditional approaches either classify with hand-crafted rules or use a single modality (only visual features, or only text information), as in prior art publication No. CN111738251, and therefore lack scalability and universality. The license image itself is a special kind of image containing abundant multi-modal features (visual features, text information, text block position coordinates, etc.). Meanwhile, licenses contain rich key-value pair relations, which are often not fully utilized in practical application scenarios.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a license classification method and system based on multi-modal feature fusion, which consider not only the visual features of the license image but also the text content information and the text block position coordinate information, so that the classification result achieves higher classification accuracy and finer-grained classification.
The invention is realized by the following technical scheme.
In one aspect of the invention, a license classification method based on multi-modal feature fusion is provided, which comprises the following steps:
inputting the whole layout area of the license image into a convolutional neural network according to multi-mode information contained in the license image, extracting visual features, and representing layout visual modal information by using a layout visual feature vector;
extracting, according to the multi-modal information contained in the license image, the text information in the license image by using an optical character recognition model, which comprises:
extracting the complete text content in the license image and locating the position coordinate information of each text block.
The distance between text blocks is then computed from their position coordinates according to the formula:

d_ij = sqrt((x_i - x_j)^2 + (y_i - y_j)^2)

where i and j denote different text blocks and (x_i, y_i), (x_j, y_j) are their position coordinates. If d_ij is less than a preset threshold θ, the text contents of the two blocks are judged to be related, and the text content information is reconstructed according to the coordinate position relation of the corresponding blocks:

t_ij = t_i + t_j

where t_i and t_j are the text contents of the i-th and j-th text blocks, so that texts with a key-value pair relation are aggregated into long text information;
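As a rough sketch of this reconstruction step (not the patent's implementation): the (x, y, w, h) box format, the use of center-point Euclidean distance, and the concrete threshold value are all assumptions made for the example.

```python
# Hypothetical sketch of the text-reconstruction step: OCR text blocks whose
# position coordinates are closer than a threshold theta are assumed to form
# a key-value pair and are concatenated (t_ij = t_i + t_j) into a long text.
# Box format (x, y, w, h), center-point distance, and theta are illustrative.
import math

def block_distance(box_i, box_j):
    """Euclidean distance between the center points of two text blocks."""
    cx_i, cy_i = box_i[0] + box_i[2] / 2, box_i[1] + box_i[3] / 2
    cx_j, cy_j = box_j[0] + box_j[2] / 2, box_j[1] + box_j[3] / 2
    return math.hypot(cx_i - cx_j, cy_i - cy_j)

def reconstruct_long_text(blocks, theta):
    """Merge nearby text blocks into long texts; `blocks` is [(text, box)]."""
    merged, used = [], set()
    for i, (t_i, box_i) in enumerate(blocks):
        if i in used:
            continue
        parts = [t_i]
        for j in range(i + 1, len(blocks)):
            if j not in used and block_distance(box_i, blocks[j][1]) < theta:
                parts.append(blocks[j][0])  # t_ij = t_i + t_j
                used.add(j)
        merged.append("".join(parts))
    return merged

blocks = [
    ("Name:", (10, 10, 60, 20)),
    ("Zhang San", (80, 10, 90, 20)),  # near "Name:" -> same key-value pair
    ("ID No:", (10, 200, 60, 20)),    # far below -> starts a new long text
    ("123456", (80, 200, 70, 20)),
]
print(reconstruct_long_text(blocks, theta=100.0))  # two merged long texts
```

The threshold θ is a tuning parameter of the method: a larger value merges more distant blocks into a single long text.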
using the reconstructed long text information of the license text data as a training data set and, combined with text semantic expression, training to obtain a language model conforming to the semantics of license text;
and encoding the reconstructed long text information into a fixed-length text feature vector with the trained language model.
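The encoding step can be imitated with a toy encoder; nothing here is the patent's actual language model. Mean-pooling per-character embeddings guarantees a fixed output dimension regardless of text length; the random embedding table, vocabulary size, and character-level tokenization are assumptions for illustration.

```python
# Toy stand-in for the trained language model's encoder: a reconstructed long
# text is mapped to a fixed-length vector by mean-pooling token embeddings.
# Embedding table, vocabulary size, and tokenization are illustrative only.
import zlib

import numpy as np

DIM = 16          # fixed output dimension
VOCAB = 1000      # toy vocabulary size
EMBED = np.random.default_rng(0).normal(size=(VOCAB, DIM))

def encode_text(text):
    """Return a fixed-length (DIM,) feature vector for text of any length."""
    if not text:
        return np.zeros(DIM)
    # character-level tokens, a common choice for Chinese license text
    ids = [zlib.crc32(ch.encode("utf-8")) % VOCAB for ch in text]
    return EMBED[ids].mean(axis=0)

v_short = encode_text("Name:Zhang San")
v_long = encode_text("Business License No. 123456, issued 2022, Shanghai")
assert v_short.shape == v_long.shape == (DIM,)  # same length either way
```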
According to the obtained layout visual feature vector and long text feature vector of the license image, their tensor outer product is computed, which explicitly simulates single-modal and bimodal interactions and yields a new multi-dimensional feature vector;
and the multi-dimensional feature vector, i.e. the feature vector information after multi-modal feature fusion, is passed through a convolutional neural network for judgment, clustered according to the different mapping results in the space, and classified to obtain a fine-grained license type.
According to another aspect of the present invention, there is provided a license classification system based on multi-modal feature fusion, including: the system comprises a multi-modal feature extraction module, a text reconstruction module, a language model training module, a long text information feature vector representation module, a tensor outer product calculation module and a multi-modal feature information fusion judgment module, wherein:
the multi-modal feature extraction module acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
the text reconstruction module judges the key-value pair relationship according to the distance relationship of the position coordinates of different text blocks, reconstructs the content of the text blocks according to the position information of the text blocks and obtains a long text with the key-value pair relationship;
the language model training module is used for training a language model of unique text semantics in the license image by using the reconstructed long text;
the text information feature vector representation module converts the extracted long text into feature representation with fixed length by using the language model obtained by training;
the tensor outer product calculation module is used for making a tensor outer product of the layout visual characteristic vector and the long text characteristic vector, and explicitly expressing monomodal and bimodal interaction to obtain a feature-fused multidimensional characteristic vector;
the multi-mode characteristic information fusion judgment module judges the information of the obtained multi-mode fusion characteristic vector through a convolutional neural network, and classifies the multi-mode fusion characteristic vector according to different mapping results in space to obtain the fine-grained license type to be judged.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
1) In addition to the multi-modal characteristics, the key-value pair relations in the text information are fully considered: the license text information is reconstructed according to these relations, so the reconstructed long text can fully express the license text information; meanwhile, the text position information required for the reconstruction also provides a reference for visual feature extraction, making the multi-modal feature extraction better match the characteristics of licenses.
2) By classifying license images with the multi-modal feature classification technique, the method improves classification precision as well as the scalability and universality of the system.
3) The rich layout visual information, text content information, and text block position coordinate information contained in the license image are fully utilized. Feature fusion of the multiple modal feature vectors explicitly represents the single-modal information while realizing bimodal interaction, so the license classification result has higher accuracy and finer-grained class divisions, a wider application range, and better robustness.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a block diagram of a license classification method and system based on multi-modal feature fusion according to a preferred embodiment of the present invention;
fig. 2 is a flowchart of a license classification method based on multi-modal feature fusion in a preferred embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention are described in detail below with reference to fig. 1 and 2. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention. The present invention is in no way limited to any specific configuration and algorithm set forth below, but rather covers any modification, replacement or improvement of elements, components or algorithms without departing from the spirit of the invention. In the drawings and the following description, well-known structures and techniques are not shown in order to avoid unnecessarily obscuring the present invention.
The embodiment of the invention provides a license classification method based on multi-modal feature fusion, addressing the problems that traditional license image classification methods classify with simple rules or with a single modality, cannot fully utilize the rich multi-modal information of license images, and lack robustness. A convolutional neural network extracts the layout visual information of the license image, represented as a layout visual feature vector. An optical character recognition model extracts the text information of the license picture, including text content and text block position coordinate information, from which a language model suited to license text semantics is trained; the trained language model converts the long text into a fixed-length long text feature vector. The tensor outer product of the layout visual feature vector and the text feature vector yields a new multi-dimensional feature vector. This fused feature vector explicitly simulates single-modal and bimodal interactions, is judged by a convolutional neural network, and is clustered according to the different mapping results in the space to obtain a fine-grained license classification result. The method is generally applicable to fine-grained classification of license images having both visual features and text information.
The license classification method based on multi-modal feature fusion provided by the embodiment comprises the following steps:
S0a, inputting the whole layout area of the license image into a convolutional neural network according to the unique layout visual information of the license image, extracting visual features, and acquiring the layout visual feature vector of the license image;
S0b, extracting text information in the license image through an optical character recognition model according to the unique text semantic information of the license image, wherein the text information comprises text content information and text block position coordinate information of the license image;
S1, judging the key-value pair relation according to the distance relation of the position coordinates of different text blocks, and reconstructing the content of the text blocks according to the position information of the text blocks to obtain the long text with the key-value pair relation;
S2, training, using the reconstructed long text information as training data, to obtain a language model adapted to the license text semantics;
S3, converting the long text information into a feature vector with a fixed length through the trained language model, representing the long text information in the license image;
S4, according to the obtained layout visual feature vector and the long text feature vector of the license image, making the tensor outer product to obtain a new multi-dimensional feature vector representing single-modal and bimodal interaction in an explicit way;
and S5, passing the new feature vector obtained after multi-modal feature fusion through a convolutional neural network to construct a classification network, and classifying, by estimating and clustering the representation space of the fused feature vector, to obtain a fine-grained license classification result.
The feature vectors pass through a convolutional neural network to obtain different mapping results in the space, and the mapping results correspond to different license categories. The method mainly aims to perform feature fusion on multi-mode feature vectors by fully utilizing rich layout visual information and text information contained in a license image, can explicitly simulate single-mode information and dual-mode interaction, and constructs a classification network through a convolutional neural network, so that a license classification result has higher accuracy and finer-grained classification.
The method provided by the above embodiment of the present invention is further described below with reference to the accompanying drawings as follows:
as shown in fig. 1, which is a flow chart of license classification based on multi-modal feature fusion, the method includes the following steps:
S0a, inputting the license image into a visual feature extraction network, and acquiring a vector representation A of the visual features of the image;
In step S0b, text information in the license image, including text content information and text block position coordinate information of the license image, is extracted through an optical character recognition model according to text semantic information unique to the license image.
In step S1, the key-value pair relationship is determined according to the distance relationship between the position coordinates of different text blocks, and the content of the text blocks is reconstructed according to their position information to obtain a long text with the key-value pair relationship.
In step S2, using the reconstructed long text information as training data, a language model adapted to the license text semantics is obtained through training.
In step S3, the long text information is converted into a fixed-length feature vector B through the trained language model, so as to represent the long text information in the license image.
In step S4, the visual feature vector A and the text feature vector B are each extended by one dimension (a constant 1 is appended), and the tensor outer product of the resulting vectors gives a new feature vector C, which not only captures the feature correlation between the two modalities but also retains the modality-specific information. The calculation formula is as follows:

C = [A; 1] ⊗ [B; 1]

where ⊗ denotes the tensor (outer) product.
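Assuming this tensor-fusion formulation (a constant 1 appended to each modality vector before the outer product), the step can be sketched with NumPy; the dimensions of A and B here are illustrative.

```python
# Sketch of step S4: extend each modality vector by one dimension with a
# constant 1, then take the outer product C = [A; 1] (x) [B; 1]. The 1-entries
# make C retain the raw unimodal features alongside every bimodal product.
import numpy as np

A = np.array([0.2, -1.0, 0.5])   # layout visual feature vector (toy size)
B = np.array([1.5, 0.3])         # long-text feature vector (toy size)

A1 = np.append(A, 1.0)           # dimension extended by 1
B1 = np.append(B, 1.0)
C = np.outer(A1, B1)             # fused (len(A)+1) x (len(B)+1) tensor

assert C.shape == (4, 3)
assert np.allclose(C[-1, :-1], B)   # B survives in the last row
assert np.allclose(C[:-1, -1], A)   # A survives in the last column
```

For more than two modalities the same construction generalizes to a higher-order outer product, at the cost of multiplicative growth in the fused dimension.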
In step S5, a classification network is constructed, through a convolutional neural network, from the new feature vector obtained by fusing the multi-modal features, and a fine-grained license classification result is obtained by classification. This is specifically implemented as follows:
feature fusion of the multi-modal feature vectors explicitly simulates single-modal information and bimodal interaction, yielding new feature vectors; after the new feature vectors pass through the convolutional neural network, a classification network is constructed, and the different mapping-result clusters obtained in the space correspond to different license categories.
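A minimal stand-in for this classification step, with a random linear map in place of the patent's (unspecified) trained convolutional network: the fused tensor is flattened and mapped to a probability distribution over license categories.

```python
# Hypothetical classification head: flatten the fused tensor, apply a linear
# map, and softmax into license-type probabilities. A trained CNN would
# replace the random weights in a real system.
import numpy as np

rng = np.random.default_rng(1)
NUM_CLASSES = 4                       # illustrative number of license types
C = rng.normal(size=(4, 3))           # fused tensor from the outer-product step
W = rng.normal(size=(NUM_CLASSES, C.size))
b = np.zeros(NUM_CLASSES)

logits = W @ C.ravel() + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over license categories

assert probs.shape == (NUM_CLASSES,)
assert abs(probs.sum() - 1.0) < 1e-9
print("predicted license class:", int(probs.argmax()))
```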
Practical experiments show that, when license images with visual features and text information are classified by the license classification method based on multi-modal feature fusion, fine-grained classification is completed well and high accuracy is achieved.
Another embodiment of the present invention provides a system for license classification with multi-modal feature fusion, including: the system comprises a multi-modal feature extraction module, a text reconstruction module, a language model training module, a long text information feature vector representation module, a tensor outer product calculation module and a multi-modal feature information fusion judgment module; wherein:
the multi-modal feature extraction module acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
the text reconstruction module judges the key-value pair relation according to the distance relation of the position coordinates of different text blocks, reconstructs the content of the text blocks according to the position information of the text blocks and obtains a long text with the key-value pair relation;
the language model training module is used for training a language model of unique text semantics in the license image by using the reconstructed long text;
the text information feature vector representation module converts the extracted long text into feature representation with fixed length by using the language model obtained by training;
the tensor outer product calculation module is used for making a tensor outer product of the layout visual characteristic vector and the long text characteristic vector, and explicitly expressing monomodal and bimodal interaction to obtain a feature-fused multidimensional characteristic vector;
the multi-mode characteristic information fusion judgment module judges the information of the obtained multi-mode fusion characteristic vector through a convolutional neural network, and classifies the multi-mode fusion characteristic vector according to different mapping results in space to obtain the fine-grained license type to be judged.
As a preferred embodiment, the multi-modal feature extraction module is connected with the text reconstruction module to obtain layout visual feature vectors and long text information with key-value pair relationship after reconstruction;
As a preferred embodiment, the text reconstruction module is connected with the language model training module, and the language model conforming to the semantic expression in the license is obtained by training according to the reconstructed license long text information as a training data set in combination with the text semantic expression;
as a preferred embodiment, the language model training module is connected with the long text information feature vector representation module, and the trained language model adaptive to the certificate image text semantic expression obtained by training is used for converting the reconstructed long text into feature representation with fixed length;
As a preferred embodiment, the multi-modal feature extraction module and the long text information feature vector representation module are connected with the tensor outer product calculation module; the layout visual feature vector and the long text feature vector are subjected to the tensor outer product, so as to explicitly represent the single modalities and their bimodal interaction, obtaining a feature-fused multi-dimensional feature vector;
As a preferred embodiment, the tensor outer product calculation module is connected with the multi-modal feature information fusion judgment module; the obtained multi-dimensional feature vector is input into the convolutional neural network for further processing, and the multi-modal features are judged and classified according to the different mapping-result clusters in the space to obtain a fine-grained license classification result.
The system provided by the above embodiment of the present invention has the working process as follows:
The license image is input into the multi-modal feature extraction module, which acquires the overall layout visual features of the license image by using a convolutional neural network and outputs the layout visual feature vector, and which extracts the text content information and text block position information by using an optical character recognition model and outputs the text content information and the text block position coordinate information;
judging the key-value pair relation according to the distance relation of position coordinates of different text blocks, and reconstructing the content of the text blocks according to the position information of the text blocks to obtain a long text with the key-value pair relation;
training a language model of unique text semantics in the license image by using the reconstructed long text;
converting the extracted long text into a feature representation with a fixed length by using a language model obtained by training;
making a tensor outer product of the layout visual feature vector and the long text feature vector, and explicitly expressing single-modal and bimodal interactions to obtain a feature-fused multi-dimensional feature vector;
and inputting the new multi-dimensional feature vector into a multi-mode feature information fusion judgment module, and calculating, judging and classifying through a classification network trained by a convolutional neural network to obtain a fine-grained license classification result.
The license classification method and system based on multi-modal feature fusion provided by the embodiments of the invention extract multi-modal feature information, including layout visual features, text content information, and text block position coordinate information; judge the key-value pair relation according to the distance relation of the position coordinates of different text blocks, and reconstruct the content of the text blocks according to their position information to obtain a long text with the key-value pair relation; further train a language model adapted to the text semantics unique to license images; convert the long text into a fixed-length feature representation with the language model; perform multi-modal feature fusion by the tensor outer product, explicitly representing single-modal and bimodal interactions to obtain a new multi-dimensional feature vector; and pass the new feature vector through a classification network to obtain a fine-grained license classification result. By exploiting the abundant visual features, text content information, and text block position coordinate information in the license image, and by fusing the multi-modal features with the tensor outer product so that single-modal and bimodal interactions are explicitly represented, the license classification result achieves higher accuracy and finer-grained class divisions, with a wider application range and better robustness.
It should be noted that, the steps in the method provided by the present invention can be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art can implement the step flow of the method by referring to the technical scheme of the system, that is, the embodiment in the system can be understood as a preferred example of the implementation method, and details are not described herein.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (6)

1. A license classification method based on multi-modal feature fusion is characterized by comprising the following steps:
multi-modal feature extraction, namely extracting the visual features of the whole layout of the license image, the text content information, and the text block position information, and outputting the layout visual feature vector, the text content information, and the text block position coordinate information;
text reconstruction, namely reconstructing the content of the text blocks according to their position information and the distance between the position coordinates of different text blocks, to obtain a reconstructed long text;
language model training, namely training, with the reconstructed long text as a data set, a language model adapted to the distinctive text semantics in the license image;
long text vector representation, namely converting the long text into a fixed-length feature representation with the trained language model;
multi-modal feature fusion, namely taking the tensor outer product of the layout visual features and the long text features, explicitly representing the unimodal and bimodal interactions, to obtain a new multi-dimensional feature vector;
and classification, namely calculating and classifying the multi-dimensional feature vector with a convolutional neural network to obtain a fine-grained license classification result.
2. The license classification method based on multi-modal feature fusion according to claim 1, is characterized in that:
extracting the visual feature information of the whole license image layout by using a convolutional neural network to obtain the layout visual feature vector;
extracting license text information in the image by using an optical character recognition model, wherein the license text information comprises text content information and text block position coordinate information;
the distance relationship between the text block position coordinates is as follows:
Figure FDA0003720008410000011
where i and j denote different text blocks; if d_ij is less than a preset threshold θ, the text contents of the two blocks are judged to be related, and the text content information is reconstructed according to the coordinate position relationship of the corresponding text blocks by the formula:

t_ij = t_i + t_j

where t_i and t_j are the text contents of the i-th and j-th text blocks; in this way, texts with a key-value pair relationship are aggregated into long text information;
according to the reconstructed license long text information, taking it as a training data set and combining text semantic expression, a language model conforming to the semantic expression in the license is obtained by training;
and encoding the reconstructed long text information into a text feature vector with a fixed length by using the trained language model.
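The text-block distance test and key-value reconstruction of claim 2 can be sketched as follows; the Euclidean distance used here is an assumption, since the exact distance formula appears only in the patent figure:

```python
import numpy as np

def reconstruct_long_text(blocks, theta):
    """Merge OCR text blocks whose position coordinates are closer than theta.

    `blocks` is assumed to be a list of (text, (x, y)) tuples, e.g. OCR output.
    Plain Euclidean distance between block coordinates is used as an
    illustration of d_ij; blocks with d_ij < theta are treated as a
    key-value pair and their texts are concatenated (t_ij = t_i + t_j).
    """
    pieces = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            (t_i, (x_i, y_i)) = blocks[i]
            (t_j, (x_j, y_j)) = blocks[j]
            d_ij = np.hypot(x_i - x_j, y_i - y_j)
            if d_ij < theta:                 # related key-value pair
                pieces.append(t_i + t_j)     # aggregate into long text
    return " ".join(pieces)
```

For example, a block "ID:" next to a block "12345" would be merged into "ID:12345", while a distant block is left out of the long text.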
3. The license classification method based on multi-modal feature fusion according to claim 1, wherein the multi-modal feature fusion is performed according to the layout visual feature vector and the long text feature vector to obtain a multi-dimensional feature vector of multi-modal fusion, specifically:
appending a constant 1 to the visual feature vector A and the text feature vector B to extend each by one dimension, and then taking their tensor outer product to obtain the multi-modal fused multi-dimensional feature vector C, by the formula:

C = [A; 1] ⊗ [B; 1]
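A minimal NumPy sketch of the outer-product fusion of claim 3, where [A; 1] denotes the vector A with a constant 1 appended; appending the 1s makes the result contain the unimodal terms A and B as well as the bimodal interaction A ⊗ B:

```python
import numpy as np

def tensor_fusion(A, B):
    """Outer product of [A; 1] and [B; 1].

    The last row carries a copy of B (unimodal), the last column a copy
    of A (unimodal), and the remaining entries the pairwise products
    A_m * B_n (bimodal interaction); the corner entry is the constant 1.
    """
    A1 = np.append(A, 1.0)
    B1 = np.append(B, 1.0)
    return np.outer(A1, B1)   # shape (len(A) + 1, len(B) + 1)
```

The fused matrix is typically flattened before being fed to the downstream classification network.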
4. The license classification method based on multi-modal feature fusion according to claim 1, characterized in that the multi-dimensional feature vector is calculated and classified with a convolutional neural network to obtain a fine-grained license classification result, specifically: inputting the multi-dimensional feature vector into a convolutional neural network to construct a classification network;
and clustering according to the different mapping results in the feature space, each cluster corresponding to a different license category, to obtain the license classification result.
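A minimal stand-in for the classification head of claim 4; a single softmax layer replaces the convolutional network purely for illustration, and the weights `W` and bias `b` are hypothetical trained parameters:

```python
import numpy as np

def classify(fused, W, b):
    """Map the flattened fused feature to a license category.

    The patent specifies a convolutional neural network; here a single
    linear layer followed by softmax illustrates mapping the fused
    multi-dimensional feature to fine-grained license categories.
    """
    logits = W @ fused.ravel() + b
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

The predicted index would be looked up in a table of license categories (e.g. business license, ID card, driving license) to yield the final fine-grained result.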
5. A license classification system based on multi-modal feature fusion, comprising:
the multi-modal feature extraction module, used for acquiring the visual features of the whole layout of the license image, the text content information, and the text block position information, and outputting the layout visual feature vector, the text content information, and the text block position coordinate information;
the text reconstruction module, which reconstructs the content of the text blocks according to their position information and the distance between the position coordinates of different text blocks, to obtain a reconstructed long text;
the language model training module is used for training a language model of unique text semantics in the license image by using the acquired reconstructed long text;
the long text information feature vector representation module converts the reconstructed long text into feature representation with fixed length by using the trained language model;
the tensor outer product calculation module, which takes the tensor outer product of the visual feature vector and the long text feature vector, explicitly representing the unimodal and bimodal interactions, to obtain the feature-fused multi-dimensional feature vector;
and the multi-modal feature information fusion judgment module, which obtains further feature representations through a convolutional neural network, performs information judgment on the fused multi-dimensional feature vector, and classifies it to obtain the fine-grained license category to be determined.
6. The system for license classification based on multi-modal feature fusion of claim 5, further comprising any one or more of:
the multi-modal feature extraction module is connected with the text reconstruction module to obtain layout visual feature vectors and long text information with key-value pair relationship after reconstruction;
the text reconstruction module is connected with the language model training module; with the reconstructed license long text information as a training data set, and combining text semantic expression, a language model conforming to the semantic expression in the license is obtained by training;
the language model training module is connected with the long text information feature vector representation module, and is used for converting the reconstructed long text into feature representation with fixed length by using the trained language model adaptive to certificate image text semantic expression;
the multi-modal feature extraction module and the long text information feature vector representation module are connected with the tensor outer product calculation module, which takes the tensor outer product of the layout visual feature vector and the long text feature vector, explicitly representing the unimodal and bimodal interactions, to obtain the feature-fused multi-dimensional feature vector;
the tensor outer product calculation module is connected with the multi-modal feature information fusion judgment module, the obtained multi-dimensional feature vectors are input into the convolutional neural network for further processing, and the multi-modal features are judged and classified according to different mapping result clusters in the space, so that a fine-grained license classification result is obtained.
CN202210757445.9A 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion Pending CN115115883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210757445.9A CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210757445.9A CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Publications (1)

Publication Number Publication Date
CN115115883A true CN115115883A (en) 2022-09-27

Family

ID=83329800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210757445.9A Pending CN115115883A (en) 2022-06-29 2022-06-29 License classification method and system based on multi-mode feature fusion

Country Status (1)

Country Link
CN (1) CN115115883A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620265A (en) * 2022-12-19 2023-01-17 华南理工大学 Locomotive signboard information intelligent identification method and system based on deep learning
CN117421641A (en) * 2023-12-13 2024-01-19 深圳须弥云图空间科技有限公司 Text classification method, device, electronic equipment and readable storage medium
CN117421641B (en) * 2023-12-13 2024-04-16 深圳须弥云图空间科技有限公司 Text classification method, device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN109409222B (en) Multi-view facial expression recognition method based on mobile terminal
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
US20210012198A1 (en) Method for training deep neural network and apparatus
CN115115883A (en) License classification method and system based on multi-mode feature fusion
Zhai et al. Deep structure-revealed network for texture recognition
Shen et al. FEXNet: Foreground extraction network for human action recognition
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
Raut Facial emotion recognition using machine learning
CN113343974B (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN113269089A (en) Real-time gesture recognition method and system based on deep learning
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113657274A (en) Table generation method and device, electronic equipment, storage medium and product
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116821781A (en) Classification model training method, text analysis method and related equipment
CN115409107A (en) Training method of multi-modal association building model and multi-modal data retrieval method
Shao et al. Deep multi-center learning for face alignment
CN117540007A (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN111695507B (en) Static gesture recognition method based on improved VGGNet network and PCA
CN114120074B (en) Training method and training device for image recognition model based on semantic enhancement
Zhao et al. ChildPredictor: A child face prediction framework with disentangled learning
CN117011539A (en) Target detection method, training method, device and equipment of target detection model
CN114781348A (en) Text similarity calculation method and system based on bag-of-words model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination