CN116775929A - Cross-modal retrieval method based on multi-level fine granularity semantic alignment - Google Patents

Cross-modal retrieval method based on multi-level fine granularity semantic alignment

Info

Publication number
CN116775929A
CN116775929A
Authority
CN
China
Prior art keywords
memory database
data
retrieval
model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310802642.2A
Other languages
Chinese (zh)
Inventor
谭梦悦
贺成龙
顾学海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Laiwangxin Technology Research Institute Co ltd
Original Assignee
Nanjing Laiwangxin Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Laiwangxin Technology Research Institute Co ltd filed Critical Nanjing Laiwangxin Technology Research Institute Co ltd
Priority to CN202310802642.2A priority Critical patent/CN116775929A/en
Publication of CN116775929A publication Critical patent/CN116775929A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90348Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2337Non-hierarchical techniques using fuzzy logic, i.e. fuzzy clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Automation & Control Theory (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval method based on multi-level fine-granularity semantic alignment, which comprises the following parts. Multi-modal fine-grained semantic understanding: low-quality images in battlefield information data are detected and text is extracted from complex backgrounds; combined with a pre-training framework for intra-modality fine-grained understanding and cross-modal semantic alignment, fine-grained semantic understanding under multi-modal conditions is realized. Cross-modal information aggregation and retrieval: through a high-performance content retrieval engine construction technology, multi-modal battlefield information data is quickly retrieved using an in-memory database with a bidirectional reference index; the fast data retrieval scenarios of the two application modes, searching images by text and searching images by image, are studied so that the method is better applied to information service scenarios. Addressing the complexity of cross-modal retrieval of battlefield data, the invention provides an effective solution that enhances cross-modal retrieval capability and improves retrieval efficiency.

Description

Cross-modal retrieval method based on multi-level fine granularity semantic alignment
Technical Field
The invention relates to a cross-modal retrieval method based on multi-level fine granularity semantic alignment.
Background
With the rapid development of the Internet of Things, social networks, electronic commerce and other fields, data is accumulating at an unprecedented speed. Effectively performing feature learning on massive, complex big data, discovering the hidden knowledge and rules within it, and further mining its potential value are of great significance for efficient aggregation and retrieval of data in a battlefield environment. Because information data collected in a wartime environment is highly diverse, comes from many sources, is large in volume and complex in content, it is difficult to manage and utilize effectively; massive information data must be searched and retrieved quickly, accurately and effectively to ensure the correctness and timeliness of decision making, so an effective and efficient cross-modal information content retrieval technology is needed.
Today, battlefield data is described in rich ways, and multi-modal data exists widely. Multi-modal data refers to data acquired through different fields or views of the same described object, and each field or view describing the data is called a modality. For example, a weapon name and its corresponding image are multi-modal data; in face recognition, multi-modal data may consist of images of a face taken from different angles. In multi-modal data, each modality can provide certain information for the other modalities, i.e., certain correlations exist between modalities. However, when mining multi-modal data, treating the data of different modalities identically or simply concatenating all modality features cannot guarantee the effectiveness of the mining task. Meanwhile, research oriented toward battlefield data mining simultaneously faces the problems of unstable quality and large quantity of battlefield information.
In light of the above, the difficulties of cross-modal retrieval of battlefield data are as follows:
1) Multi-modal information: battlefield information sources include various forms such as text and images, and unified semantic understanding is difficult;
2) Low information quality: the quality of battlefield information is unstable; for example, images with very low resolution or complex backgrounds are difficult to recognize or are recognized incorrectly;
3) Large information volume: battlefield information is huge in quantity, and efficient retrieval over such massive data is a challenge.
Therefore, this research addresses the above difficulties, learns more accurate features of complex data through the complementarity of information between modalities, and supports subsequent retrieval applications. The invention mainly carries out research in the following two directions:
1) Research on fine-grained semantic understanding of multi-modal data, which mainly covers effective identification of low-quality information and fusion of multi-modal information representations, respectively solving the problems caused by unstable data quality and the difficulty of multi-modal matching in traditional information search technology; the technical implementation consists of two parts: a high-quality information recognition technology based on fine-grained understanding of real scenes, and a cross-modal semantic understanding technology based on the fusion of semantic and representation features;
2) Research on cross-modal information aggregation and retrieval technology, which mainly solves the problem of efficient retrieval across modalities through high-performance index construction and scenario construction (searching images by text and searching images by image) for battlefield information retrieval; the technical implementation consists of: a high-performance content retrieval engine construction technology and a cross-modal retrieval scenario realization technology.
Disclosure of the Invention
The invention aims to solve the technical problem of providing a cross-modal retrieval method based on multi-level fine-granularity semantic alignment, addressing the shortcomings of existing multi-modal retrieval technology, which comprises the following steps:
Step 1, detecting low-quality images in the data by adopting a high-quality information recognition technology based on fine-grained understanding of real scenes, and extracting text from complex backgrounds;
Step 2, adopting a multi-modal semantic understanding technology based on multi-level fine granularity to realize fine-grained semantic understanding under multi-modal conditions;
Step 3, quickly retrieving the multi-modal data in-memory database through a high-performance content retrieval engine construction technology;
Step 4, designing and presenting the functional architecture of the two application modes, searching images by text and searching images by image, for the information data.
Step 1 comprises the following steps:
Step 1-1, the high-quality information recognition technology comprises fine-grained recognition of images based on mapping different resolutions into a common space, and complex-background OCR recognition based on scene data synthesis;
the fine-grained image recognition based on a common resolution space is realized by adopting an ultra-low-resolution image detection technique, converting high-resolution and low-resolution images into a common space, and performing face recognition, sensitive icon detection and similar-image recognition with a classification model;
the complex-background OCR (Optical Character Recognition) based on scene data synthesis is realized by adopting a Chinese text detection and recognition method combined with deep learning models.
Step 1-1 includes:
Step 1-1-1, detection of the low-quality images in the data is realized by adopting a low-resolution face recognition method based on a dual-branch CNN; the method uses two deep convolutional neural networks, the feature extraction network FECNN and the super-resolution feature extraction network SRFECNN (together called the dual-branch CNN), to convert high-resolution and low-resolution images into a common space, then performs face recognition and sensitive icon detection with an SVC classification model, and thereby realizes the fine-grained detection technology for ultra-low-resolution images;
the feature extraction convolutional neural network FECNN takes VGGNet as its basic framework, with the last fully connected layer of VGGNet removed;
the SRFECNN is the combination of the steganalysis residual network SRnet (Steganalysis Residual Network) and the feature extraction convolutional neural network FECNN: SRnet applies steganalysis techniques to detect hidden data and performs super-resolution reconstruction of the incoming image, so that more information is available for the subsequent recognition work; the feature extraction convolutional neural network FECNN (Feature Embedding Convolutional Neural Network) is a feedforward neural network with a deep structure that performs feature extraction through convolutional computation, one of the representative algorithms of deep learning; the output of SRnet is taken as the input of FECNN, and this connection forms the lower-branch network SRFECNN; SRnet consists of three segments: a front segment responsible for extracting noise residuals, a middle segment that reduces the dimension of the feature map, and a final standard fully connected linear classifier;
the dual-branch CNN converts the high-resolution image and the low-resolution image into a common space; the overall dual-branch algorithm is divided into an upper network and a lower network;
the upper network takes 224×224 images as standard input, so the high-resolution face image is resized by a bicubic interpolation algorithm and the resulting 224×224 image is fed into the feature extraction convolutional neural network FECNN;
the input of the lower network is likewise a 224×224 image obtained by bicubic interpolation;
the outputs of the upper and lower networks are both 1×4096 feature vectors; the lower network has one additional steganalysis residual network SRnet compared with the upper network, which performs super-resolution reconstruction of the incoming image; the output of SRnet is then used as the input of FECNN, and connecting them yields SRFECNN.
Step 1-1-2, text extraction under a complex background is realized as follows: recognition of text information in images with interfering backgrounds is achieved by adopting a Chinese text detection model based on the Connectionist Text Proposal Network (CTPN) and a Chinese text recognition model based on the Convolutional Recurrent Neural Network (CRNN);
The CTPN-based Chinese detection model is adopted and comprises the following steps:
step a1, obtaining feature vectors using the first 5 convolution stages of VGG16 (VGG is the Visual Geometry Group Network; 16 indicates that the VGG structure contains 13 convolutional layers and 3 fully connected layers);
step a2, extracting features from the feature vectors using a 3×3 sliding window, and predicting multiple anchors (anchors are predefined reference boxes in the neural network used to define target candidate regions) from these features;
step a3, inputting the features obtained in step a2 into a bidirectional long short-term memory network (BiLSTM), outputting a W×256 result, and inputting the result into a 512-dimensional fully connected layer FC;
step a4, the classification output consists of, from top to bottom:
2k vertical coordinates, the y-axis coordinates representing the center and height of each selection box;
2k scores, representing the category information of the k anchors, i.e., whether each anchor contains text;
k side-refinement values, representing the horizontal offset of the selection box;
step a5, a text-line construction algorithm produces elongated rectangular text detection boxes, which are then combined into text sequence boxes.
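For illustration, a hedged PyTorch sketch of the prediction head described in steps a1-a4 is given below; the anchor count k=10, the layer sizes and the module names are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

class CTPNHeadSketch(nn.Module):
    """VGG16 conv features -> 3x3 window -> bidirectional LSTM (W x 256)
    -> 512-d FC -> 2k vertical coords / 2k scores / k side refinements."""
    def __init__(self, k=10):
        super().__init__()
        self.backbone = models.vgg16(weights=None).features[:30]   # first 5 conv stages
        self.window = nn.Conv2d(512, 512, kernel_size=3, padding=1)  # 3x3 sliding window
        self.rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)  # -> 256 per step
        self.fc = nn.Linear(256, 512)
        self.vertical = nn.Linear(512, 2 * k)   # y-center and height of each anchor
        self.score = nn.Linear(512, 2 * k)      # text / non-text classification
        self.side = nn.Linear(512, k)           # horizontal side refinement

    def forward(self, img):                     # img: (B, 3, H, W)
        fmap = self.window(self.backbone(img))  # (B, 512, H', W')
        b, c, h, w = fmap.shape
        rows = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)   # one sequence per feature row
        out, _ = self.rnn(rows)                 # (B*H', W', 256)
        out = self.fc(out)                      # (B*H', W', 512)
        return self.vertical(out), self.score(out), self.side(out)
```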
The CRNN-based Chinese recognition model recognizes text sequences of indefinite length end to end; the CRNN network structure comprises, in order:
the convolutional layer (CNN), which extracts features from the input image to obtain a feature map;
the recurrent layer (RNN), which predicts the feature sequence with a bidirectional RNN, learning each feature vector in the sequence and outputting a predicted label distribution;
the transcription layer (CTC), which converts the series of label distributions obtained from the recurrent layer into the final label sequence using the CTC loss (Connectionist Temporal Classification loss).
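A condensed, non-authoritative PyTorch sketch of this CRNN pipeline (convolutional layer, bidirectional recurrent layer, CTC transcription) is given below; the convolutional configuration and dimensions are illustrative assumptions rather than the patent's exact model.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(                       # feature extraction
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, None)))            # collapse height -> feature sequence
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)           # per-step label distribution

    def forward(self, x):                               # x: (B, 1, H, W)
        f = self.cnn(x).squeeze(2).permute(0, 2, 1)     # (B, W', 256)
        f, _ = self.rnn(f)                              # bidirectional recurrent layer
        return self.fc(f).log_softmax(dim=2)            # (B, W', num_classes)

# CTC joins the per-step distributions into a variable-length label sequence, e.g.:
# criterion = nn.CTCLoss(blank=0)
# loss = criterion(logits.permute(1, 0, 2), targets, input_lengths, target_lengths)
```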
In step 2, the multi-modal semantic understanding technology based on multi-level fine granularity utilizes the semantic associations between different modalities in a high-level semantic space to integrate the information into a more user-friendly and useful set of retrieval results; it is mainly realized through multi-level semantic representation alignment based on a two-stage model architecture (MVPTR).
Fine-grained semantic understanding under multi-modal conditions is thereby achieved, where the modalities include text and image.
Step 3 comprises the following steps:
Step 3-1, the high-performance content retrieval engine construction technology refers to fast retrieval of a Java in-memory database based on a bidirectional reference index;
Step 3-2, fast retrieval of the multi-modal data in-memory database, namely constructing an optimized in-memory database retrieval and storage structure model and realizing in-memory database retrieval and feature analysis through a spatial information fusion planning method.
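The patent realizes this on a Java in-memory database connected to HBase; for illustration only, a minimal Python sketch of the bidirectional (forward plus reverse) indexing idea is given below. The data layout and field names are assumptions, not the patent's concrete storage format.

```python
from collections import defaultdict

class BidirectionalIndex:
    """Forward index (record id -> stored item) plus a reverse index
    (feature key -> record ids), so lookups can start from either side."""
    def __init__(self):
        self.forward = {}                    # record id -> record (metadata, embedding, ...)
        self.reverse = defaultdict(set)      # feature key -> set of record ids

    def insert(self, rec_id, record, feature_keys):
        self.forward[rec_id] = record
        for key in feature_keys:
            self.reverse[key].add(rec_id)

    def delete(self, rec_id):
        record = self.forward.pop(rec_id, None)
        if record is not None:
            for ids in self.reverse.values():
                ids.discard(rec_id)
        return record

    def by_id(self, rec_id):
        return self.forward.get(rec_id)

    def by_feature(self, feature_key):
        return [self.forward[i] for i in self.reverse.get(feature_key, ())]

index = BidirectionalIndex()
index.insert("img_001", {"modality": "image", "path": "a.jpg"}, ["tank", "desert"])
print(index.by_feature("tank"))              # records reachable from the feature side
```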
Step 3-1 includes: an optimized storage structure model of the Java in-memory database (Yao Zhe, Tao Jianwen. Multi-source adapted multi-tag classification framework using a lexicographical order storage mechanism [J]. Computer Engineering and Applications, 2017, 53(7): 88-96, 170.): a storage correlation feature analysis model of the memory database is constructed based on a method that separates and then fuses data storage and indexing; joint feature decomposition of the memory database is performed through index insertion, deletion and update operations to obtain the embedding dimension m of the index table; the feature quantity of the corresponding index is searched in the cache to obtain the embedding distribution order n, with u_i ∈ R^m, where R^m denotes an index table with embedding dimension m and u_i denotes the i-th index; a connection between HBase and the memory database is established through the Java API, the storage index of the memory database is created using HBase records, and, under the guidance of the optimized association index pattern v_j ∈ R^n, the secondary-index joint feature distribution set M of the mass data is obtained as follows:
wherein R^n denotes the memory database with embedding distribution order n, and v_j denotes the j-th index order;
a joint feature detection model of the memory database is constructed using an Observer coprocessor, and a data table matching method is adopted to obtain the retrieval feature distribution element x_t in the defined index file X; the reference feature quantity satisfying the bidirectional index of the memory database is as follows:
wherein P_i denotes the i-th reference feature quantity;
the storage node x_t of the memory database and the elements in the joint feature distribution domain n satisfy x_t ∈ B;
the distributed quantization parameter set of the storage space of the memory database is defined as the association rule feature quantity of index file X, where i_1, i_2, i_3 are obtained by a block joint parameter identification method; taking t = 0.5 from the statistical regression analysis result after the i_{n+1}-th iteration, the storage structure optimization model of the memory database is obtained;
Combining a design method with a custom data processing logic structure (Li Maoying, Salix, Hu Qinghua. Research progress on isomorphic transfer learning theory and algorithms [J]. Journal of Nanjing University of Information Science and Technology (Natural Science Edition), 2019, 11(3): 269-277.), the joint parameter feature set F = {f_1, f_2, …, f_n} of the memory database is obtained, where f_n is the n-th subset of feature set F;
because the blockchain has advantages such as decentralization and tamper resistance, it is applied to the cluster-center fusion process: the cluster center positions are acquired through blockchain nodes, cluster centers are merged pairwise according to their size and distance, and the iteration is repeated until only one blockchain-fused cluster center remains; a distributed graph database of the in-memory data (https://blog.csdn.net/cr7258/article/details/121186774) is obtained at this blockchain-fused cluster center, a cloud data processing framework is built on the basis of the Hadoop architecture, the data is fused within this framework, and the Hadoop cloud data fusion method is used to obtain the joint statistical feature quantity Q of the Java memory database, expressed as follows:
wherein N is the highly associated joint distribution sequence of the memory database;
the data attribute retrieval result of the memory database adopts a digital arithmetic coding method, and the structural feature quantity of the fast-retrieval join table of the memory database is expressed as (BREIMAN L. Random forests [J]. Machine Learning, 2001, 45(1): 5-32.):
wherein c_k is the data retrieval sequence, n is a subset of the highly associated joint distribution sequence of the memory database, f is a subset of the joint parameter feature set F of the memory database, and T denotes matrix transposition;
fuzzy clustering is performed on the n unknown information components, and the source knowledge point matching set retrieved by the memory database is obtained as s = (s_1, s_2, s_3, …, s_n)^T, where T denotes the matrix transpose;
By adopting a high-correlation factor analysis of the data, the multi-dimensional feature parameter set H_nm for fast retrieval of the memory database is (GASPAR Cano, JOSE Garcia-Rodriguez, ALBERTO Garcia-Garcia, et al. Automatic selection of molecular descriptors using random forest: application to drug discovery [J]. Expert Systems With Applications, 2017, 72(1): 151-159.):
wherein x_m(t) denotes the m-th subset in index file X and t = 0.5; with the control time of memory database retrieval being t_0, a multi-table join and aggregation method is adopted to obtain the bidirectional-index optimal feature solution set ξ of the Java memory database (Huang Liangtao, Zhao Zhicheng, Zhao Yaqun. A hierarchical identification scheme for cryptosystems based on random forests [J]. Chinese Journal of Computers, 2018, 41(2): 382-399.):
wherein A = {a_1, a_2, …, a_n} is the related attribute set retrieved by the memory database, a_n is the n-th subset of set A, B = {b_1, b_2, …, b_n} is the distribution set of decision-tree functions for a typical query, and b_n is the n-th subset of set B;
a graph model design method based on the relational database realizes feature analysis and fusion processing of the in-memory data (Hu Congrui, Liu, Feng Ling, et al. Bidirectional pointer storage optimization method for object proxy databases [J]. Chinese Journal of Computers, 2018, 41(8): 62-75.).
Step 3-2 includes: a decision tree classification method is adopted to construct the bidirectional-reference-index decision model for memory database retrieval, and optimizing control of the database's fast retrieval process is realized through index table names and index columns; based on a relational database model analysis method, by analyzing the operating logic of the memory database, a suitable spatial fusion path is automatically selected according to the read-write characteristics of the current user request; on this basis a lightweight data processing framework is constructed to guarantee the consistency and integrity of the memory database's data, and with the support of this framework a spatial fusion model of the memory database is constructed (Gao Haiying [J]. Microcomputer Applications, 2020, 36(8): 31-33.); the spatial fusion model is used to extract the semantic feature quantity of the database space, and, based on a graph data model query method, the query semantic distribution set X_i(t) is (GUO L, CHEHATA N, MALLET C, et al. Relevance of airborne lidar and multispectral image data for urban scene classification using random forests [J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2011, 66(1): 56-66.):
where dt is the retrieval of the memory database; in the multi-data-model database (GUO L, CHEHATA N, MALLET C, et al. Relevance of airborne lidar and multispectral image data for urban scene classification using random forests [J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2011, 66(1): 56-66.), fusion positioning of the memory database is performed in combination with a joint decision method, fuzzy clustering is applied to the universal feature quantities of the multi-model database's query tests, and differentiated fusion scheduling is applied to the query results, giving the multi-model data fusion model f_x(t) of the memory database as (Zhou Qingping, Tan Changgeng, Wang Hongjun. KNN text classification algorithm based on clustering improvement [J]. Application Research of Computers, 2016, 33(11): 3374-3377.):
wherein x_i(t) is the bidirectional reference feature matching set retrieved by the memory database; in a native multi-model database, a feature allocation mechanism with association factor fusion is adopted to obtain the channel matching set output R(x) of the memory database's bidirectional reference index as (WANG Liai, ZHOU Xudong, ZHU Xinkai, et al. Estimation of biomass in wheat using random forest regression algorithm and remote sensing data [J]. The Crop Journal, 2016, 4(3): 212-219.):
wherein α_l and τ_l are respectively the amplitude and delay of the memory database's associated feature distribution, l ∈ [0, L-1], τ_0 < τ_1 < … < τ_{L-1}, and L is the length of the memory database's associated feature sequence; a data engine query control method is adopted to obtain the fusion distribution set of the multi-data-model database, which represents the information conflict feature quantity on the relational database; mining guided by association knowledge yields the feature distribution entity set fused by association factors as x = (x_1, x_2, …, x_m)^T, and the field name y obtained by fast retrieval of the memory database is (Lu, Feng Zhongke. Prediction of the stock volume and growth of Beijing urban stands using a random forest model [J]. Journal of Northeast Forestry University, 2020, 48(5): 7-11.):
y = c_m X − R(x)
wherein X is the decision vector, x is the feature distribution entity set fused by association factors, z is the absolute value of the mean of X, i.e., X = {x | x z^{-1}}^T, and c_m is the scheduling set for data of different types and different structures;
setting f = n/MT as the sampling frequency, the channel model Y for fast retrieval of the memory database is obtained as (Cui Chunyu. LTE network coverage assessment study based on random forest [J]. Communication World, 2020, 27(4): 75-76.):
Through the processing of steps 3-1 to 3-2, the bidirectional-reference-index decision model for fast retrieval of the memory database is constructed;
optimizing control of the fast retrieval process of the memory database is realized through the index table name and the index column, and according to the correlation between the index table and the value table, the node fusion scheduling model z_k of the memory database is constructed as (Luo Yan, Shou Yasheng, Wang Tinggang, et al. Real-time operational risk assessment method for power grids based on random forest [J]. Information Technology, 2020, 44(4): 26-31.):
wherein h_k denotes the bidirectional reference feature distribution function of memory database retrieval; the two-dimensional topological structure design of the memory database (https://blog.csdn.net/qq_38807606/article/details/128320126) is carried out in a grid environment, fusion with the relational database PostgreSQL is used to obtain the reference coordinate point of data retrieval as (0, 0), and, with the transmission link distribution set of the memory database T_s = N_f T_f, joint linear correlation fusion is adopted to obtain the linear programming model as (SCORNET E, BIAU G, PHILIPPE-VERT J. Consistency of random forests [J]. Annals of Statistics, 2015, 43(4): 1716-1741.):
wherein, for the feature quantity of the graph storage management structure of the m×n-order relational database, h_n is the n-th value in the vector and m and n are the vector dimensions; a fuzzy parameter feature matching set for memory database retrieval is constructed within the same entity-representation attribute feature range, optimizing control of the memory database's fast retrieval process is realized through the index table name and the index column, and according to the correlation between the index table and the value table, the density ρ of spatial conflict nodes of the memory database is (ZHAO Teng, WANG Linteng, ZHANG Yan, et al. Relation factor identification of electricity consumption behavior of users and electricity demand forecasting based on mutual information and random forests [J]. Proceedings of the CSEE, 2016, 36(3): 604-614.):
Ambiguity detection is carried out on the bidirectional index channel of the memory database, and the joint feature distribution set ψ = [ψ_1, ψ_2, …, ψ_N] is obtained in the graph storage model structure based on the relational database, where ψ_N is the N-th value in the ψ distribution set;
the bidirectionally fused sparsity feature quantity λ_x of the memory database is expressed as (CHRZANOWSKA M, ALFARO E, WITKOWSKA D. The individual borrowers recognition: single and ensemble trees [J]. Expert Systems with Applications, 2008, 36(3): 6409-6414.):
wherein λ' is the reference feature distribution of the memory database's bidirectional index; combining the association relations of memory database retrieval, a fuzzy decision model for memory database retrieval is constructed, the spatial planning design of the memory database is carried out with the packet switching protocol taken into account, and the information conduction model O_x of Java memory database retrieval is (MALEKIPIRBAZARI M, AKSAKALLI V. Risk assessment in social lending via random forests [J]. Expert Systems with Applications, 2015, 42(10): 4621-4631.):
where s is the bit sequence output by the bidirectional index of the memory database and the other term is a relational data model parameter; through the above processing, optimized retrieval of the memory database is realized.
In step 4, the two application modes, searching images by text and searching images by image, refer to the final functions presented through text-based and image-based search: in text-based search, a short query phrase is entered in the search bar and the corresponding image set is returned; in image-based search, an image to be searched is uploaded and the image set corresponding to the elements of the uploaded image is returned.
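For illustration, assuming an aligned cross-modal encoder (such as the MVPTR model of step 2) that maps both texts and images into one embedding space, the two application modes can be sketched as follows; encode_text and encode_image are hypothetical placeholders for that encoder, not concrete APIs.

```python
import numpy as np

def cosine_top_k(query_vec, gallery_vecs, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def search_by_text(query, encode_text, image_vecs, image_ids, k=5):
    """Text-to-image: embed the query phrase and rank stored image embeddings."""
    idx, scores = cosine_top_k(encode_text(query), image_vecs, k)
    return [(image_ids[i], float(s)) for i, s in zip(idx, scores)]

def search_by_image(query_img, encode_image, image_vecs, image_ids, k=5):
    """Image-to-image: embed the uploaded image and rank stored image embeddings."""
    idx, scores = cosine_top_k(encode_image(query_img), image_vecs, k)
    return [(image_ids[i], float(s)) for i, s in zip(idx, scores)]
```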
The invention also provides a storage medium storing a computer program or instructions which, when executed, implement the above cross-modal retrieval method based on multi-level fine-granularity semantic alignment.
Beneficial effects: the cross-modal retrieval method makes key advances in the following respects. For processing low-quality images, a high-quality information recognition technology is adopted, which improves the quality of the data and provides a data basis for subsequent semantic understanding. A multi-level, multi-view cross-modal semantic understanding technology fusing semantics and representations is adopted to align text semantics with image semantics, providing the algorithmic foundation for cross-modal retrieval. To realize efficient retrieval among multi-modal data, a high-performance content retrieval engine construction technology is used, improving retrieval efficiency in business scenarios with massive data. In terms of usage scenarios, the cross-modal functions of searching images by text and searching images by image are realized, achieving multi-modal feedback for a single semantic query and improving service efficiency.
Based on multi-modal semantic understanding and cross-modal information retrieval, the method aligns multi-modal semantics as far as possible, so that cross-modal information can be associated and retrieved.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of the overall architecture of the present invention.
FIG. 2 is a schematic diagram of the fine-grained semantic understanding method for multi-modal data in the present invention.
FIG. 3 is a schematic diagram of the cross-modal information aggregation retrieval technique in the present invention.
FIG. 4 is a schematic diagram of the model structure of MVPTR in the present invention.
FIG. 5 is a schematic diagram of the construction of high-speed in-memory data retrieval in the present invention.
FIG. 6 is a schematic flow chart of the invention.
FIG. 7 is a schematic diagram of a search result using the method of the present invention.
Detailed Description
In terms of multi-modal fine-grained semantic understanding, the invention effectively enhances low-quality image information, effectively practices multi-modal semantic alignment, and further provides a multi-level fine-grained semantic representation fusion technology so that text and image semantics are aligned; in the cross-modal information aggregation retrieval process, retrieval efficiency is effectively improved through a high-performance content retrieval engine construction technology, and results are presented through the functions of searching images by text and searching images by image. The invention is described below in terms of multi-modal fine-grained semantic understanding and cross-modal information aggregation and retrieval.
As shown in fig. 1, the invention provides a cross-modal retrieval method based on multi-level fine granularity semantic alignment, which comprises the following steps:
Step 1, detecting low-quality images in the data by adopting a high-quality information recognition technology based on fine-grained understanding of real scenes, and extracting text from complex backgrounds;
Step 2, adopting a multi-modal semantic understanding technology based on multi-level fine granularity to realize fine-grained semantic understanding under multi-modal conditions;
Step 3, quickly retrieving the multi-modal data in-memory database through a high-performance content retrieval engine construction technology;
Step 4, designing and presenting the functional architecture of the two application modes, searching images by text and searching images by image, for the information data.
FIG. 2 shows the fine-grained semantic understanding method for multi-modal data in the invention. Further, in the multi-modal fine-grained semantic understanding method, step 1 includes:
Step 1-1, adopting a high-quality information recognition technology based on fine-grained understanding of real scenes; the real scenes include image data of government-related persons and parades in a certain region;
the high-quality information recognition technology comprises fine-grained recognition of images based on mapping different resolutions into a common space, and complex-background OCR recognition based on scene data synthesis.
Step 1-1-1, detection of the low-quality images in the data comprises the following:
the data comprises images of government-related persons and parade images from a certain region;
a low-quality image refers to an image with low resolution. Image resolution measures how finely an imaging system resolves image detail and indicates how detailed the target information in the image is. The higher the image resolution, the higher the pixel density and the more detailed information the image contains. Image resolution can be divided into spatial resolution, temporal resolution, radiometric resolution, spectral resolution, etc. Spatial resolution represents the size of the smallest target the sensor can resolve, i.e., the actual extent of the target represented by each pixel of the image. How to improve image quality and spatial resolution is therefore a problem to be solved in the imaging field. Information data generally suffers from low image resolution. Low resolution leads to serious loss of image information, so models trained on low-quality data sets often fail to achieve the desired effect. In addition, if the model is trained on high-resolution (HR) images while the data sources in the real business scenario are of very low quality, the mismatch between high-resolution and low-resolution (LR) data also compromises the effect.
The method for detecting the low-quality images adopts a low-resolution face recognition method based on a dual-branch CNN;
the dual-branch CNN (convolutional neural network) based low-resolution face recognition method uses two deep convolutional neural networks (Deep Convolutional Neural Networks), FECNN and SRFECNN, together called the dual-branch CNN, to convert high-resolution and low-resolution images into a common space, then applies an SVC classification model (Support Vector Classification, https://blog.csdn.net/lipeitong333/article/details/123042687) for face identification, sensitive icon detection and other applications, and thereby realizes the fine-grained detection technology for ultra-low-resolution images.
FECNN (feature extraction convolutional neural network) takes VGGNet (https://blog.csdn.net/fengbingchun/article/details/113056463) as its basic framework. In VGGNet, however, the final fully connected layer is used for the classification task, whereas face recognition is more flexible when feature extraction and classification are separated, so the last fully connected layer of VGGNet is removed. SRFECNN is the combination of SRnet and FECNN: the output of SRnet is used as the input of FECNN, and this connection forms the lower-branch network SRFECNN. SRnet (Steganalysis Residual Network) applies steganalysis techniques to detect hidden data and performs super-resolution reconstruction of the incoming image, so as to provide more information for the subsequent recognition tasks. SRnet consists of three segments: the front segment responsible for extracting noise residuals (layers L1-L7), the middle segment that reduces the dimension of the feature map (layers L8-L12), and the final standard fully connected linear classifier (Fully connected-Softmax).
The dual-branch CNN converts the high-resolution image and the low-resolution image into a common space; that is, the dual-branch algorithm is divided overall into an upper branch network and a lower branch network. The upper network takes 224×224 images as standard input, so the high-resolution face image, whatever its size, is passed through a bicubic interpolation algorithm before being fed into FECNN; the lower network input is likewise a 224×224 image obtained by bicubic interpolation. The difference is that the low-resolution image has dimensions N×N and the high-resolution image has dimensions M×M, with N always smaller than M. The outputs of the upper and lower networks are both 1×4096 feature vectors, and the lower branch network has one additional SRnet.
The SRnet performs super-resolution reconstruction of the incoming image so as to provide more information for the subsequent recognition work; the output of SRnet is then used as the input of FECNN, and connecting them yields the lower-branch network SRFECNN.
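Both branches therefore receive 224×224 inputs produced by bicubic interpolation regardless of the original resolution. A short preprocessing sketch is given below (PIL-based; the normalization constants are the common ImageNet values and are an assumption, not values specified by the patent).

```python
from PIL import Image
import numpy as np
import torch

def to_branch_input(path, size=224):
    # Bicubic resize to 224x224, convert to a normalized CHW tensor.
    img = Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
    x = torch.from_numpy(np.asarray(img)).float().permute(2, 0, 1) / 255.0
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    return ((x - mean) / std).unsqueeze(0)     # (1, 3, 224, 224)

# hr_batch = to_branch_input("high_res_face.jpg")   # upper branch input
# lr_batch = to_branch_input("low_res_face.jpg")    # lower branch input (then SRnet + FECNN)
```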
Step 1-1-2, realizing text extraction under complex background, comprising:
the complex background refers to images containing text within the information data related to a certain region;
text extraction adopts a CTPN-based Chinese detection model and a CRNN-based Chinese recognition model to recognize text information in images with interfering backgrounds.
FIG. 4 is a schematic diagram of a CTPN-based Chinese detection model according to the present invention. The Chinese detection model based on CTPN comprises the following specific implementation steps:
1) Firstly, obtaining feature vectors by using the first 5 convolution stages of VGG 16;
2) Features are extracted from the feature vectors obtained in the previous step using a 3×3 sliding window, and multiple anchors (used for defining target candidate regions) are predicted from these features.
3) The features obtained in the previous step are input into a bidirectional LSTM (BLSTM, Bidirectional Long Short-Term Memory network, https://zhuanlan.zhihu.com/p/42717426), which outputs a W×256 result; this result is then input into a 512-dimensional fully connected layer FC.
4) Finally, the classification output is divided into three parts, from top to bottom: 2k vertical coordinates, the y-axis coordinates representing the center and height of each selection box; 2k scores, representing the category information of the k anchors, i.e., whether each anchor contains text; and k side-refinement values, representing the horizontal offset of the selection box. The horizontal width of the anchors in the invention is fixed at 16 pixels, that is, the minimum unit of distinction for a selection box is 16 pixels.
5) The resulting elongated rectangles are then combined into text sequence boxes using a text-line construction algorithm.
Text detection based on CTPN (Connectionist Text Proposal Network) differs from text detection based on traditional object detection in that the length of text in business image data varies far more strongly than its height, so the start and end boundaries of text lines are difficult to match with traditional detection; the CTPN method predicts only the vertical position of text, not the horizontal position, detecting only small text segments of fixed width, accurately predicting their height, and finally connecting the text segments together to obtain the text line.
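A simplified sketch of this text-line construction step is given below: adjacent fixed-width (16-pixel) proposals with sufficient vertical overlap are chained into one line box. The gap and overlap thresholds are illustrative assumptions.

```python
def build_text_lines(proposals, max_gap=50, min_v_overlap=0.7):
    """proposals: list of (x1, y1, x2, y2) boxes, each 16 px wide."""
    def v_overlap(a, b):
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(h, 0) / max(min(a[3] - a[1], b[3] - b[1]), 1e-6)

    proposals = sorted(proposals, key=lambda b: b[0])   # left to right
    lines, current = [], []
    for box in proposals:
        # start a new line when the horizontal gap is too large or rows do not overlap
        if current and (box[0] - current[-1][2] > max_gap
                        or v_overlap(current[-1], box) < min_v_overlap):
            lines.append(current)
            current = []
        current.append(box)
    if current:
        lines.append(current)
    # merge each chain of segments into one enclosing text-line rectangle
    return [(c[0][0], min(b[1] for b in c), c[-1][2], max(b[3] for b in c)) for c in lines]
```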
The CRNN (Convolutional Recurrent Neural Network)-based Chinese recognition model is mainly used to recognize text sequences of indefinite length end to end; it converts text recognition into a time-dependent sequence learning problem without segmenting individual characters, i.e., image-based sequence recognition.
The CRNN network structure comprises three parts, in order:
1) CNN (convolutional layer): a deep CNN extracts features from the input image to obtain a feature map;
2) RNN (recurrent layer): a bidirectional RNN (recurrent neural network) predicts the feature sequence, learning each feature vector in the sequence and outputting a predicted label distribution;
3) CTC loss (transcription layer): the series of label distributions obtained from the recurrent layer is converted into the final label sequence using the CTC loss (Connectionist Temporal Classification loss).
CRNN adopts a BLSTM (bidirectional long short-term memory) model and a CTC transcription part, and can handle contextual relations when learning from character images, which effectively improves text recognition accuracy and greatly enhances the robustness of the whole model.
In step 2, a multi-modal semantic understanding technology based on semantic representation fusion is adopted, that is, multi-modal data is fused and understood by aligning the modalities at the semantic representation level.
Step 2-1, semantic representation fusion: multi-level fine granularity means that although multimedia information of different modalities has different data structures and expression forms, it expresses the same semantics; using the semantic associations of the different modalities in a high-level semantic space, the information is integrated into a more user-friendly and useful set of retrieval results;
Step 2-2, the multi-modal semantic understanding technology comprises multi-level semantic representation alignment based on a two-stage model architecture, realizing more accurate semantic matching at the fine-grained level. Vision and language are two important manifestations of human intelligence; to process visual and textual information cooperatively, recent multi-modal research on vision and language has focused on learning vision-language semantic alignment from different tasks, such as image-text retrieval, visual question answering and phrase grounding. To break the barriers between tasks and learn generalized multi-modal representations, the invention adopts MVPTR (a multi-level vision-language pre-training model), which is self-supervised pre-trained on large-scale image-text pairs, achieves good performance on downstream tasks after fine-tuning, and can cooperatively exploit semantic alignment at multiple levels.
As shown in FIG. 5, which is a schematic diagram of the MVPTR model structure in the present invention, the MVPTR model is divided into two stages: a single-modality learning stage and a cross-modality learning stage. In the single-modality stage, the model learns intra-modality interactions to obtain multi-level semantic representations of each modality; in the cross-modality stage, the model learns interactions between modalities and performs fine-grained reasoning.
In the single-modality learning stage, MVPTR learns only intra-modality interactions and representations through a visual encoder and a text encoder: the visual encoder takes the concatenated object feature sequence and object tag sequence as input to learn the relations between objects while aligning object features with the corresponding object-level concepts; the text encoder takes the concatenated word sequence and phrase sequence as input, providing structural information within phrases, and further learns phrase-level concepts in context.
In the visual encoder, object-level concepts are injected into the input visual sequence in the form of predicted tags. These object-level concepts can serve as anchor points that help align object representations with words. To further strengthen this anchoring effect, this work proposes a pre-training task, MCR.
MCR (Masked Concept Recovering) is similar to BERT's MLM (Masked Language Model, https://www.jianshu.com/p/56d0c0ea44a8) task: a portion of the input tag sequence is randomly masked, set to the special token [MASK] or replaced at random, and the masked part of the original tags is predicted through a linear layer based on the output of the visual encoder. The MCR task can be regarded as weakly supervised alignment between visual features and object concepts (predicting a particular tag requires learning the association between the corresponding object and it); MCR can likewise tag images, further align region representations, and facilitate the subsequent cross-modal interaction learning.
The cross-modal semantic interaction and alignment stage first uses the VSC task to align, at coarse granularity, the global representations obtained by the single-modal encoders, aligning the semantic spaces of the two encoders; the aligned token, phrase and object-feature sequences are then concatenated and fed into a multi-modal encoder for learning. To prevent a shortcut from tags to words from forming during the subsequent pre-training tasks, which would interfere with learning the true cross-modal relationships, the tag sequence is not fed in at this stage. In this stage, object features and phrase representations are further aligned through WPG and, based on the preceding representations, higher-level reasoning tasks including ITM and MLM are completed.
The VSC task: before input to the cross-modal encoder, MVPTR aligns the semantic spaces of the two single-modal encoders through VSC (Visual Semantic Contrastive, https://arxiv.org/abs/2210.11000), coarsely aligning image and text at the global level. The representations of the "[CLS]" (classification semantic representation) token obtained by the visual and text encoders are used as the global representations of the image and text, and the cosine similarity between the two vectors is used as their semantic similarity. InfoNCE (https://blog.csdn.net/qq_38978225/article/details/125295721) is used as the training loss; only matching image-text pairs within the same batch are positive sample pairs (corresponding to the diagonal of the cosine-similarity matrix in the model diagram), and the rest are negative sample pairs. After this global coarse-grained alignment, the token, phrase and object-feature sequences in the aligned space are concatenated and fed into the cross-modal encoder;
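A minimal sketch of the VSC objective described above is given below, assuming the [CLS] global representations of a batch of images and texts are already available; the temperature value is an assumption.

    import torch
    import torch.nn.functional as F

    def vsc_infonce(img_cls, txt_cls, temperature=0.07):
        # Matching image-text pairs sit on the diagonal of the similarity matrix;
        # all other pairs in the batch act as negatives (symmetric InfoNCE).
        img = F.normalize(img_cls, dim=-1)
        txt = F.normalize(txt_cls, dim=-1)
        logits = img @ txt.t() / temperature           # (B, B) cosine-similarity matrix
        targets = torch.arange(img.size(0), device=img.device)
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2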
WPG (Weakly-supervised Phrase Grounding, https://blog.csdn.net/klrp95/article/details/88252181) is a weakly supervised phrase grounding method. For each jointly encoded image-text pair, given the representations of n phrases and m object features produced by the cross-modal encoder, the semantic similarity between each phrase-region pair is computed via cosine similarity, yielding an n x m similarity matrix. Following the multiple-instance learning approach, the region most similar to each phrase is selected as the score of that phrase matching within the whole image, and the scores over all phrases are averaged to obtain an image-text matching score based on phrase-region matching. The training process can then be based on this image-sentence matching score.
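The phrase-region scoring described above can be sketched as follows (illustrative only; shapes are assumptions): take the best-matching region per phrase, then average over phrases.

    import torch.nn.functional as F

    def wpg_score(phrase_emb, region_emb):
        # phrase_emb: (n, d) phrase representations; region_emb: (m, d) object features.
        p = F.normalize(phrase_emb, dim=-1)
        r = F.normalize(region_emb, dim=-1)
        sim = p @ r.t()                      # (n, m) phrase-region cosine similarities
        best_per_phrase, _ = sim.max(dim=1)  # multiple-instance learning: best region per phrase
        return best_per_phrase.mean()        # image-text score from phrase-region matching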
ITM (Image-Text Matching, https://blog.csdn.net/csdn_tclz/article/details/109902169) is a pre-training task commonly used in vision-language pre-training models. It is essentially a sequence-relationship inference task: it must determine whether the images and texts in the multi-modal sequence match. In MVPTR, a multi-layer perceptron is learned to directly predict a binary matching score from the [CLS] token (classification semantic representation) features output by the cross-modal encoder. Similar to ALBEF (Align Before Fuse, https://arxiv.org/pdf/2107.07651.pdf), harder negative samples are sampled from the training batch for the ITM task based on the global similarity output by the VSC task.
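A minimal sketch of such an ITM head follows; the hidden size of 768 and the head structure are assumptions, not the exact configuration used by MVPTR.

    import torch.nn as nn
    import torch.nn.functional as F

    itm_head = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 2))

    def itm_loss(cls_features, match_labels):
        # cls_features: (batch, 768) [CLS] outputs of the cross-modal encoder
        # match_labels: (batch,) with 1 = matched pair, 0 = (hard) negative pair
        return F.cross_entropy(itm_head(cls_features), match_labels)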
MLM (Masked Language Model, https://www.jianshu.com/p/56d0c0ea44a8) is likewise a common task in pre-training models: key words in the descriptive text, such as quantifiers, adjectives, nouns and actions, are masked and then recovered, which is essentially a reasoning task from a different angle. The MLM settings are consistent with other pre-trained models: a portion of the tokens is randomly masked or replaced, and a multi-layer perceptron is learned to predict the original tokens from the representations output by the model.
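The same BERT-style masking scheme underlies both MCR (on the tag sequence) and MLM (on the word sequence); the sketch below assumes hypothetical vocabulary and hidden sizes.

    import torch
    import torch.nn as nn

    def mask_tokens(token_ids, vocab_size, mask_id, p=0.15):
        # BERT-style: of the selected positions, 80% -> [MASK], 10% -> random id, 10% -> unchanged.
        token_ids = token_ids.clone()
        labels = torch.full_like(token_ids, -100)       # -100 is ignored by CrossEntropyLoss
        chosen = torch.rand(token_ids.shape) < p
        labels[chosen] = token_ids[chosen]
        r = torch.rand(token_ids.shape)
        token_ids[chosen & (r < 0.8)] = mask_id
        replace = chosen & (r >= 0.8) & (r < 0.9)
        token_ids[replace] = torch.randint_like(token_ids, vocab_size)[replace]
        return token_ids, labels

    # Prediction head over the encoder outputs at the masked positions (assumed sizes).
    mlm_head = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 30522))
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)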
Step 2-3, fine-grained semantic understanding under multi-modal conditions is realized, including the function of semantic alignment between text and images;
as shown in FIG. 3, the cross-modal information aggregation retrieval technique of the present invention is shown. It comprises a high-performance content retrieval engine construction technique and the functional architecture of two application modes: text-to-image search and image-to-image search.
In step 3, the high-performance content retrieval engine construction technique uses a fast in-memory information database retrieval method based on a bidirectional reference index, so as to improve the retrieval efficiency of multi-modal data.
Step 3-1, as shown in FIG. 5, is a schematic diagram of the high-speed in-memory data retrieval construction in the present invention. The high-performance content retrieval engine performs fast retrieval of a Java in-memory database based on a bidirectional reference index. To realize this fast retrieval, an optimized storage structure model of the Java in-memory database is constructed using a dictionary-order (lexicographic) sorted storage mechanism. Analysis shows that, based on a method that separates and then fuses data storage and indexing, a storage-correlation feature analysis model of the in-memory database is constructed, and joint feature decomposition of the in-memory database is performed via index insert, delete and update operations, so that the embedding dimension of the index table is m; searching the cache for the feature quantities of the corresponding indexes gives the embedding distribution order n of the in-memory database, with u_i ∈ R^m denoting the i-th index. The invention connects HBase (https://baike.baidu.com/item/HBase/7670213?fr=aladin) to the in-memory database mainly through the pure Java API, so that HBase records are used to create the storage index of the in-memory database. Under the guidance of the optimized association index mode v_j ∈ R^n, the secondary-index joint feature distribution set of the mass data is obtained as follows:
A joint feature detection model of the in-memory database is constructed using an observer coprocessor, and a data-table matching method is adopted to obtain the retrieval feature distribution element x_t in the index file X; the reference feature quantity of the bidirectional index of the in-memory database that it satisfies (P_i denoting the i-th reference feature quantity) is:
That is, for the storage node x_t of the in-memory database, the elements in the joint feature distribution domain n satisfy x_t ∈ B. The distributed quantization parameter set of the in-memory database storage space is defined as the association-rule feature quantity of the index file X, where i_1, i_2, i_3, …, i_{n+1} are obtained by a block joint parameter identification method as the statistical regression analysis result after iteration, taking t = 0.5. A storage structure optimization model of the in-memory database is then obtained through entity model construction.
An optimized storage structure model of the in-memory database is built using a dictionary-order storage mechanism, combined with a custom data-processing logic structure design method (Li Maoying, et al. Research progress on homogeneous transfer learning theory and algorithms [J]. Journal of Nanjing University of Information Science and Technology (Natural Science Edition), 2019, 11(3): 269-277.), to obtain the joint parameter feature set F = {f_1, f_2, …, f_n} of the in-memory database.
Because blockchains have advantages such as decentralization and tamper resistance, a blockchain is applied in the cluster-center fusion process: the positions of the cluster centers are first obtained using blockchain nodes, the cluster centers are merged pairwise according to their size and distance, and the iteration is repeated until a single blockchain-fused cluster center is obtained. In this blockchain-fused cluster center, the distributed graph-database model of the in-memory data satisfies S = {s_1, s_2, …, s_n}. A cloud data processing framework is built on top of the Hadoop architecture and the collected data are fused within this framework, so that the Hadoop cloud data fusion method yields the joint statistical feature quantity expression Q of the Java in-memory database as:
where N is the highly associated joint distribution sequence of the in-memory database. A digital arithmetic coding method is applied to the data-attribute retrieval result of the in-memory database, and the structural feature quantity of the fast-retrieval join table of the in-memory database is expressed as (Breiman L. Random forests [J]. Machine Learning, 2001, 45(1): 5-32.):
where c_k is a rule item of the relational data of the in-memory database and k is the data retrieval sequence. Fuzzy clustering is performed on the n unknown information components, and the source knowledge-point matching set retrieved by the in-memory database is obtained as s = (s_1, s_2, s_3, …, s_n)^T. Using the method of data high-correlation factor analysis, the multi-dimensional feature parameter set H_{nm} for fast retrieval of the in-memory database is (Cano G, Garcia-Rodriguez J, Garcia-Garcia A, et al. Automatic selection of molecular descriptors using random forest: application to drug discovery [J]. Expert Systems With Applications, 2017, 72(1): 151-159.):
where x_m(t) represents the m-th subset in the index file X and t = 0.5. With the control time of the in-memory database retrieval being t_0, a multi-table join and aggregation method is adopted to obtain the bidirectional-index optimal feature solution set ξ of the Java in-memory database (Huang Liangtao, Zhao Zhicheng, Zhao Yaqun. A hierarchical cryptosystem identification scheme based on random forests [J]. Chinese Journal of Computers, 2018, 41(2): 382-399.):
where A = {a_1, a_2, …, a_n} is the set of related attributes retrieved from the in-memory database and B = {b_1, b_2, …, b_n} is the distribution set of decision-tree functions for a typical query. The larger the correlation attribute C(τ) of the in-memory database retrieval, the more similar x(t) is to x(t+τ), which yields the bidirectional reference index model for Java in-memory database retrieval.
The graph model design method based on the relational database realizes the feature analysis and fusion processing of the memory data.
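To make the notion of a bidirectional reference index concrete, the following is a purely illustrative sketch (not the patented storage structure model): a forward index from record key to record, plus an inverted index from indexed field value back to keys, with the indexed values kept in dictionary order.

    import bisect
    from collections import defaultdict

    class BidirectionalIndex:
        def __init__(self, indexed_field):
            self.field = indexed_field
            self.forward = {}                    # key -> record
            self.inverted = defaultdict(set)     # field value -> set of keys
            self.sorted_values = []              # dictionary-order list of indexed values

        def insert(self, key, record):
            self.forward[key] = record
            value = record[self.field]
            if value not in self.inverted:
                bisect.insort(self.sorted_values, value)   # keep lexicographic order
            self.inverted[value].add(key)

        def get(self, key):                      # forward lookup: key -> record
            return self.forward.get(key)

        def lookup(self, value):                 # reverse lookup: value -> records
            return [self.forward[k] for k in self.inverted.get(value, ())]

    idx = BidirectionalIndex("tag")
    idx.insert("img_001", {"tag": "fighter", "path": "/data/img_001.jpg"})
    print(idx.lookup("fighter"))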
Step 3-2: fast retrieval is achieved through high-performance optimization of in-memory database content retrieval, thereby accelerating Java in-memory database retrieval. For this optimization, a decision-tree classification method is used to construct a bidirectional reference index decision model for in-memory database retrieval; optimization control during fast database retrieval is realized through information such as index table names and index columns, and, based on a relational database model analysis method, the main operational logic of the in-memory database is analyzed so that an appropriate space-fusion path is selected automatically according to the read/write characteristics of the current user request.
A lightweight data processing framework is constructed on top of this data processing framework to ensure the consistency and integrity of the in-memory database's data; under the support of this framework, a space fusion model of the in-memory database is constructed, the constructed model is used to extract the semantic feature quantities of the database space, and the semantic distribution set is queried (Guo L, Chehata N, Mallet C, et al. Relevance of airborne lidar and multispectral image data for urban scene classification using random forests [J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2011, 66(1): 56-66.):
In the multi-data-model database, fusion positioning of the in-memory database is performed using a combined-judgment method, fuzzy clustering is applied to the general feature quantities of the in-memory database's query tests, and differentiated fusion scheduling is applied to the query results to obtain the multi-model data fusion model of the in-memory database, expressed as (Zhou Qingping, Tan Changgeng, Wang Hongjun. An improved KNN text classification algorithm based on clustering [J]. Application Research of Computers, 2016, 33(11): 3374-3377.):
where x_α is the bidirectional reference feature matching set for in-memory database retrieval. In the native multi-model database, a feature allocation mechanism with association-factor fusion is adopted to obtain the channel matching set output of the bidirectional reference index of the in-memory database as (Wang Liai, Zhou Xudong, Zhu Xinkai, et al. Estimation of biomass in wheat using random forest regression algorithm and remote sensing data [J]. The Crop Journal, 2016, 4(3): 212-219.):
where α_l and τ_l are the associated-feature distribution amplitude and delay of the in-memory database, with l ∈ [0, L-1] and τ_0 < τ_1 < … < τ_{L-1}. A data-engine query control method is adopted to obtain the multi-data-model database fusion distribution set satisfying c_j T_c < T_f, which represents the information-conflict feature quantity on the relational database; association knowledge is used to guide mining, and the feature-distribution entity set fused by association factors is obtained as x = (x_1, x_2, …, x_m)^T. The field name y obtained by fast retrieval of the in-memory database is (Lu, Feng Zhongke. Prediction of stand volume and growth of Beijing urban forests using a random forest model [J]. Journal of Northeast Forestry University, 2020, 48(5): 7-11.):
y = c_m X - R(x)
where X is the index rule vector associated with the decision vector, i.e., X = {x | x̄z^{-1}}^T, and c_m is the scheduling set for data of different types and structures. On this basis, with T' = MT/N and f = N/MT set as the sampling frequency, the channel model Y for fast retrieval of the in-memory database is obtained (Cui Chunyu. Research on LTE network coverage evaluation based on random forest [J]. Communication World, 2020, 27(4): 75-76.):
and (3) constructing a bidirectional reference index decision model for quickly retrieving the memory database through the processing of the steps 3-1 to 3-2.
Optimization control during fast retrieval of the in-memory database is realized through information such as index table names and index columns, and, according to the correlation between the index table and the value table, the node fusion scheduling model z_k of the in-memory database is constructed as follows (Luo Yan, Shou Xiang, Wang Tinggang, et al. A real-time power grid operation risk assessment method based on random forest [J]. Information Technology, 2020, 44(4): 26-31.):
where h_k denotes the bidirectional reference feature distribution function of in-memory database retrieval. The two-dimensional topological structure of the in-memory database is designed in a grid environment, and fusion with the relational database PostgreSQL gives the reference coordinate point of data retrieval as (0, 0); the transmission-link distribution set of the in-memory database is T_s = N_f T_f, and joint linear correlation fusion yields the linear programming model (Scornet E, Biau G, Vert J-P. Consistency of random forests [J]. Annals of Statistics, 2015, 43(4): 1716-1741.):
Here, the first quantity is the graph-storage management structure feature quantity of the m × n order relational database. Within the same entity-representation attribute feature range, a fuzzy parameter feature matching set for in-memory database retrieval is constructed, optimization control of the in-memory database during fast retrieval is realized through index table names, index columns and other information, and, according to the correlation between the index table and the value table, the density ρ of space-conflict nodes of the in-memory database is (Zhao Teng, Wang Linteng, Zhang Yan, et al. Relation factor identification of electricity consumption behavior of users and electricity demand forecasting based on mutual information and random forests [J]. Proceedings of the CSEE, 2016, 36(3): 604-614.):
Ambiguity detection is performed on the bidirectional index channel of the in-memory database, and the joint feature distribution set ψ = [ψ_1, ψ_2, …, ψ_N] is obtained in the graph storage model structure based on the relational database; the bidirectionally fused sparsity feature quantity λ_x of the in-memory database is expressed as (Chrzanowska M, Alfaro E, Witkowska D. The individual borrowers recognition: single and ensemble trees [J]. Expert Systems with Applications, 2008, 36(3): 6409-6414.):
where λ' is the reference feature distribution of the bidirectional index of the in-memory database. Combined with the association relations of in-memory database retrieval, a fuzzy decision model of in-memory database retrieval is constructed; the space-planning design of the in-memory database is carried out with the packet-switching protocol taken into account, and the information conduction model O_x of Java in-memory database retrieval is (Malekipirbazari M, Aksakalli V. Risk assessment in social lending via random forests [J]. Expert Systems with Applications, 2015, 42(10): 4621-4631.):
where s is the bit sequence output by the bidirectional index of the in-memory database and the remaining term is a relational data model parameter. According to the above analysis, optimized retrieval of the in-memory database is realized. In step 4, the two application modes, text-to-image search and image-to-image search, refer to presenting the final functionality through searching images by text and searching images by image.
Text-to-image search means entering a short query phrase in the search bar and returning the corresponding set of images. Milvus (https://zhuanlan.zhihu.com/p/517553501) is adopted: it is a cloud-native vector database characterized by high availability, high performance and easy scaling, used for real-time recall of massive vector data. Milvus is built on top of vector search libraries such as FAISS (https://zhuanlan.zhihu.com/p/357414033), Annoy (https://blog.csdn.net/tdaajames/article/details/125201950) and HNSW (https://blog.csdn.net/weixin_44839084/article/details/126662427), and its core goal is to solve dense-vector similarity retrieval. On top of the vector search libraries, Milvus supports data partitioning and sharding, data persistence, incremental data ingestion, hybrid scalar-vector queries, time travel and other functions, while greatly optimizing vector retrieval performance, so it can meet the application requirements of any vector retrieval scenario.
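As an illustration of the text-to-image path, the sketch below queries a Milvus collection with pymilvus; the collection name, field names, index parameters and the embed_text helper (assumed to wrap the text encoder) are all hypothetical, not part of the invention's disclosure.

    from pymilvus import connections, Collection

    connections.connect(alias="default", host="localhost", port="19530")
    collection = Collection("image_embeddings")    # hypothetical collection of image vectors
    collection.load()

    query_vec = embed_text("fighter")              # assumed helper producing the text embedding
    results = collection.search(
        data=[query_vec],
        anns_field="embedding",                    # hypothetical vector field name
        param={"metric_type": "IP", "params": {"nprobe": 16}},
        limit=10,
        output_fields=["image_path"],
    )
    for hit in results[0]:
        print(hit.distance, hit.entity.get("image_path"))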
Examples
The invention combines multi-modal fine-grained semantic understanding with a cross-modal information aggregation retrieval method, is characterized by cross-modal retrieval based on multi-level fine-grained semantics, and provides a retrieval scheme for battlefield data, which suffers from multi-modal information, low information quality and an extremely large volume of information. The invention is characterized in the following respects. In terms of multi-modal fine-grained semantic understanding, it mainly comprises research on effective recognition of low-quality information and on multi-modal information representation fusion, which respectively address the problems caused by unstable data quality and the difficulty of multi-modal matching in traditional information search techniques; the technical implementation consists of two parts: a high-quality information recognition technique based on real-scene fine-grained understanding, and a cross-modal semantic understanding technique based on the fusion of semantic and representation features. In terms of cross-modal information aggregation retrieval, the problem of efficient retrieval across modalities is solved mainly through high-performance index construction and scenario construction (text-to-image and image-to-image search) for battlefield information retrieval; the technical implementation consists of a high-performance content retrieval engine construction technique and a cross-modal retrieval scenario implementation technique;
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in FIG. 1, a schematic diagram of the overall architecture of the present invention is shown, including multi-modal fine-grained semantic understanding and cross-modal information aggregation retrieval.
As shown in FIG. 2, the fine-grained semantic understanding method for multi-modal data in the invention is illustrated. In a first aspect, the present invention provides a multi-modal fine-grained semantic understanding method, comprising:
step 1, detecting low- and medium-quality images in the data (in this embodiment, images of 50 pixels or less are defined as low-quality images) using a high-quality information recognition technique based on real-scene fine-grained understanding, and extracting text against complex backgrounds;
step 2, adopting a multi-mode semantic understanding technology based on multi-level fine granularity to realize fine granularity semantic understanding under the multi-mode condition;
as shown in FIG. 3, the cross-modal information aggregation retrieval method of the invention is provided. In a second aspect, the present invention provides a cross-modal information aggregation retrieval method, including:
step 3, rapidly searching the multi-mode data memory database by a high-performance content search engine construction technology;
And step 4, designing and displaying, for the information data, the functional architectures of the two application modes, namely text-to-image search and image-to-image search.
Further, in a multi-modal fine-grained semantic understanding method, the step 1 includes:
step 1-1, the real scenes on which the method is based include image data of government-related persons and parades in a certain region;
the high-quality information recognition technique comprises fine-grained image recognition based on transforming different resolutions into a common space, and complex-background OCR recognition based on scene data synthesis (see https://baike.baidu.com/item/OCR%E6%96%87%E5%AD%97%E8%AF%86%E5%88%AB/10392860).
Step 1-1-1, the detection of the middle and low quality images in the data comprises the following steps:
the data comprises administrative-related character image and parade image data of a certain area;
Low- and medium-quality images refer to images with low resolution. Image resolution measures how finely an imaging system can resolve image details and represents the level of detail of the information, i.e., how much target detail the image contains. The higher the resolution, the higher the pixel density and the more detailed information the image carries. Image resolution can be divided into spatial resolution, temporal resolution, radiometric resolution, spectral resolution, etc. Spatial resolution denotes the size of the smallest target the sensor can resolve, that is, the actual extent of the scene represented by each pixel of the image. How to improve image quality and spatial resolution is therefore a problem to be solved in the imaging field. Information data generally suffer from low image resolution, which causes serious loss of image information, so models trained on low-quality datasets often fail to achieve the desired effect. In addition, if a model is trained only on High-Resolution (HR) images while the data sources in real service scenarios are of very low quality, the mismatch between High-Resolution and Low-Resolution (LR) data also degrades performance.
The method for detecting low- and medium-quality images adopts a low-resolution face recognition method based on a double-branch CNN. This technique yields a significant improvement in recognition accuracy and outperforms other competing methods at various probe image resolutions, with the gain being especially pronounced (11.4%) on very low-resolution images of 6 x 6 pixels. The technique also provides HR image reconstruction with visual quality comparable to state-of-the-art super-resolution methods, and the trained model requires far less space than conventional deep convolutional neural networks such as VGGnet, making it suitable for systems with ordinary memory.
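A toy sketch of the two-branch idea follows (assumed layer sizes; the SRnet and FECNN stand-ins below are placeholders, not the published architectures): both branches map their input into a shared 4096-dimensional space where faces are compared by cosine similarity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FECNN(nn.Module):
        """Stand-in for a VGG-style feature extractor without its last FC layer."""
        def __init__(self, out_dim=4096):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(7), nn.Flatten(),
                nn.Linear(64 * 7 * 7, out_dim),
            )
        def forward(self, x):
            return self.backbone(x)

    class SRBlock(nn.Module):
        """Toy stand-in for the super-resolution / steganalysis residual network."""
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 3, 3, padding=1)
        def forward(self, x):
            return x + self.conv(x)        # residual refinement of the upsampled LR image

    hr_branch = FECNN()                               # upper branch: HR face resized to 224x224
    lr_branch = nn.Sequential(SRBlock(), FECNN())     # lower branch: SR module then FECNN

    hr = torch.randn(1, 3, 224, 224)
    lr = torch.randn(1, 3, 224, 224)                  # bicubic-upsampled low-resolution face
    similarity = F.cosine_similarity(hr_branch(hr), lr_branch(lr))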
Step 1-1-2, realizing text extraction under complex background, the method comprises the following steps:
the complex background comprises an image with text data in the related information data of a certain region;
and the text extraction adopts a CTPN-based Chinese detection model and a CRNN-based Chinese recognition model to recognize text information in images with interfering backgrounds.
FIG. 3 is a schematic diagram of a CTPN-based Chinese detection model according to the present invention. The Chinese detection model based on CTPN comprises the following specific implementation steps:
1) First, a feature map of size W×H×C is obtained using the first 5 convolutional stages of VGG16;
2) Features are extracted from the feature map obtained in the previous step using a 3×3 sliding window, and these features are used to predict multiple anchors, where the anchor definition is the same as in Faster R-CNN, i.e., it defines the target candidate regions.
3) The features obtained in the previous step are fed into a Bidirectional LSTM (BLSTM), which outputs a W×256 result that is then fed into a 512-dimensional fully connected layer FC;
4) Finally, the classification output is divided into three parts, listed from top to bottom: 2k vertical coordinates, i.e., the y-axis coordinates representing the height and center of the proposal box; 2k scores, i.e., the category information of the k anchors, indicating whether each anchor contains text; and k side-refinements, representing the horizontal offsets of the proposal boxes. The horizontal width of every anchor in the present invention is fixed at 16 pixels, that is, the smallest distinguishable proposal unit is 16 pixels wide.
5) Finally, a text-line construction algorithm combines the resulting elongated rectangular proposals into text sequence boxes.
Text detection based on CTPN differs from traditional object detection in that text in business image data varies much more in length than in height, so the start and end boundaries of conventionally detected text are difficult to match. The CTPN method predicts only the vertical position of text, not the horizontal position: it detects only small text segments of fixed width, accurately predicts the corresponding height, and finally connects these text segments together to obtain the text line.
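The final text-line construction step can be sketched as follows (thresholds are assumptions): fixed-width vertical proposals are chained when they are horizontally adjacent and vertically overlapping, then each chain is merged into one box.

    def merge_proposals(boxes, max_gap=50, min_v_overlap=0.7):
        """boxes: list of (x1, y1, x2, y2) proposals, each 16 px wide."""
        if not boxes:
            return []

        def v_overlap(a, b):
            inter = max(0, min(a[3], b[3]) - max(a[1], b[1]))
            return inter / min(a[3] - a[1], b[3] - b[1])

        boxes = sorted(boxes, key=lambda b: b[0])
        lines, current = [], [boxes[0]]
        for b in boxes[1:]:
            prev = current[-1]
            if b[0] - prev[2] <= max_gap and v_overlap(prev, b) >= min_v_overlap:
                current.append(b)          # same text line
            else:
                lines.append(current)
                current = [b]
        lines.append(current)
        # merge every chain of proposals into a single text box
        return [(min(b[0] for b in ln), min(b[1] for b in ln),
                 max(b[2] for b in ln), max(b[3] for b in ln)) for ln in lines]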
The overall evaluation was performed on five benchmark datasets, and the present technique achieves the best performance on all five. This is likely because the technique has very strong detection capability on highly challenging text, e.g., very small text, some of which is difficult even for humans.
In step 2, a multi-modal semantic understanding technique based on semantic representation fusion is adopted, i.e., multi-modal data are fused and understood by aligning the modalities at the semantic representation layer.
Step 2-1, the multi-level fine-grained multi-modal semantic understanding technique means that although multimedia information from different modalities has different data structures and forms of expression, it expresses the same semantics; by exploiting the semantic associations between modalities in a high-level semantic space, this information is integrated into a group of more user-friendly and more useful retrieval results;
as shown in fig. 5, a schematic diagram of the MVPTR model structure in the present invention is shown. The MVPTR model is divided into two stages: a single-modal learning stage and a cross-modal learning stage. In the single-modal stage, the model learns intra-modal interactions to obtain multi-level semantic representations for each modality; in the cross-modal stage, the model learns inter-modal interactions and performs fine-grained reasoning. The technique was tested on different datasets (MSCOCO, Flickr30k, VQA v2 and SNLI-VE), covering image-text retrieval tasks on MSCOCO and Flickr30k, visual question answering on VQA v2 and visual reasoning on SNLI-VE. The MVPTR dual-stage architecture enforces explicit single-modal and cross-modal representation learning as well as the learning of multi-level alignment. The proposed model achieves state-of-the-art results on the downstream tasks.
As shown in FIG. 3, the cross-modal information aggregation retrieval technique of the present invention is shown. It comprises a high-performance content retrieval engine construction technique and the functional architecture of two application modes: text-to-image search and image-to-image search.
In step 3, the high-performance content retrieval engine construction technique uses a fast in-memory information database retrieval method based on a bidirectional reference index, so as to improve the retrieval efficiency of multi-modal data.
Step 3-1, as shown in FIG. 5, is a schematic diagram of the high-speed in-memory data retrieval construction in the present invention. The high-performance content retrieval engine performs fast retrieval of the Java in-memory database based on a bidirectional reference index; to realize this, an optimized storage structure model of the Java in-memory database is constructed using a dictionary-order sorted storage mechanism. The technique builds the optimized storage structure model of the in-memory database with this dictionary-order sorted storage mechanism, performs joint feature decomposition of the in-memory database via index insert, delete and update operations, realizes optimization control during fast retrieval of the in-memory database, and thereby achieves optimized database retrieval. The experiments measure and compare database retrieval throughput, query time and precision, and show that the method has low time cost for in-memory database retrieval and good convergence.
In step 4, the two application modes, text-to-image search and image-to-image search, refer to presenting the final functionality through searching images by text and searching images by image.
Text-to-image search means entering a short query phrase in the search bar and returning the corresponding set of images. Milvus is adopted: it is a cloud-native vector database characterized by high availability, high performance and easy scaling, used for real-time recall of massive vector data. Milvus builds its vector search on libraries such as FAISS, Annoy and HNSW, and its core goal is to solve dense-vector similarity retrieval. On top of these vector search libraries, Milvus supports data partitioning, data persistence, incremental data ingestion, hybrid scalar-vector queries, time travel and other functions, while greatly optimizing vector retrieval performance, so it can meet the application requirements of any vector retrieval scenario.
FIG. 6 is a schematic flowchart of the present invention. Image-to-image search includes uploading the image to be searched and returning the image set corresponding to the uploaded image elements. The input is the query image; after image preprocessing and feature extraction, vector retrieval is used to search a distributed vector library, and once the results are obtained, post-processing supplements the detailed information of the images and a group of similar-image records is output. As shown in FIG. 7, the keyword "fighter" is searched with the method of the present invention, yielding text results (partially blurred because they involve sensitive information), picture results and video results.
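For illustration, the query-side preprocessing and feature extraction of the image-to-image path could look like the sketch below (the ResNet-50 backbone, input sizes and file name are assumptions; the invention's actual feature extractor is not specified here). The resulting vector would then be searched exactly as in the text-to-image sketch above.

    import torch
    import torchvision.transforms as T
    from torchvision import models
    from PIL import Image

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()      # keep the 2048-d pooled feature
    backbone.eval()

    with torch.no_grad():
        img = preprocess(Image.open("query.jpg")).unsqueeze(0)   # hypothetical query image
        query_vec = torch.nn.functional.normalize(backbone(img), dim=-1)[0].tolist()
    # query_vec can now be passed to Collection.search() as in the text-search sketch.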
The invention provides a cross-modal retrieval method based on multi-level fine-grained semantic alignment. There are many methods and ways to implement this technical solution; the above is only a preferred embodiment of the invention, and it should be noted that a person skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (10)

1. A cross-modal retrieval method based on multi-level fine granularity semantic alignment is characterized by comprising the following steps:
step 1, detecting middle and low quality images in data by adopting an information high quality recognition technology based on real scene fine granularity understanding, and extracting texts under a complex background;
step 2, adopting a multi-mode semantic understanding technology based on multi-level fine granularity to realize fine granularity semantic understanding under the multi-mode condition;
step 3, rapidly searching the multi-mode data memory database by a high-performance content search engine construction technology;
and step 4, designing and displaying, for the information data, the functional architectures of the two application modes, namely text-to-image search and image-to-image search.
2. The method of claim 1, wherein step 1 comprises:
step 1-1, the high-quality information recognition technique comprises fine-grained recognition of images transformed into a common space across resolutions, and complex-background OCR recognition based on scene data synthesis;
the fine-grained image recognition based on transforming resolutions into a common space is realized by adopting an ultra-low-resolution image detection technique, converting high-resolution and low-resolution images into a common space, and performing face recognition, sensitive-icon detection and similar-image recognition with a classification model;
the complex background OCR based on scene data synthesis realizes the recognition of text information in an image of an interference background by adopting a Chinese detection and recognition method combined with a deep learning model.
3. The method of claim 2, wherein step 1-1 comprises:
step 1-1-1, the detection of low- and medium-quality images in the data is realized by adopting a low-resolution face recognition method based on a double-branch CNN; the double-branch-CNN-based low-resolution face recognition method comprises: adopting a deep convolutional neural network called the double-branch CNN, consisting of the feature extraction convolutional neural networks FECNN and SRFECNN respectively, converting high-resolution and low-resolution images into a common space, then performing face recognition and sensitive-icon detection using an SVC classification model, and finally realizing a fine-grained detection technique for ultra-low-resolution images;
The feature extraction convolutional neural network FECNN takes VGGnet as a basic framework, and the last full-connection layer of VGGnet is removed;
the SRFECNN is the combination of the steganography analysis residual network SRnet and the feature extraction convolutional neural network FECNN; the steganography analysis residual network SRnet provides a steganalysis technique to detect hidden data and to reconstruct the super-resolution of the incoming image; the feature extraction convolutional neural network FECNN is a feedforward neural network with a deep structure that comprises feature extraction and convolution computation; the output of the steganography analysis residual network SRnet serves as the input of the feature extraction convolutional neural network FECNN, and this connection forms the lower branch network SRFECNN; the steganography analysis residual network SRnet contains three segments: a front segment responsible for extracting noise residuals, a middle segment that reduces the dimensionality of the feature maps, and a final standard fully-connected-layer linear classifier;
the double-branch CNN converts the high-resolution image and the low-resolution image into a common space, and the whole double-branch algorithm is divided into an upper network and a lower network;
The upper network inputs 224×224 images as standard, the high resolution face image needs to pass through bicubic interpolation algorithm, and the 224×224 dimension images are transmitted into the feature extraction convolutional neural network FECNN;
the lower network input is 224X 224 dimensional images obtained after bicubic interpolation;
the outputs of the upper and lower networks are 1×4096 feature vectors; the lower network has one more steganography analysis residual network SRnet than the upper network, which performs super-resolution reconstruction of the incoming image; the output of the steganography analysis residual network SRnet is then used as the input of the feature extraction convolutional neural network FECNN, and connecting them yields the SRFECNN;
step 1-1-2, the text extraction under the complex background is realized, which comprises the following steps: adopting a Chinese detection model based on a joint text suggestion network CTPN and a Chinese recognition model based on a convolution recurrent neural network CRNN to realize recognition of text information in an image of an interference background;
the Chinese detection model based on CTPN is adopted and comprises the following steps:
step a1, obtaining feature vectors by using the first 5 convolution stages of the visual geometry group network VGG 16;
step a2, extracting features on the feature vectors using a 3×3 sliding window, and predicting more than two anchors using the features;
Step a3, inputting the features obtained in step a2 into a bidirectional long short-term memory artificial neural network LSTM, outputting a W×256 result, and inputting the result into a 512-dimensional fully connected layer FC;
step a4, the output obtained by classification is sequentially from top to bottom:
2k vertical coordinates, the y-axis coordinates representing the height and center of the selection box;
2k score, which represents category information of k anchors, indicating whether the anchors are characters;
k side-refinements, which represent the horizontal offsets of the proposal boxes;
step a5, an algorithm of text construction is adopted to obtain an elongated rectangular text detection box, and then the rectangular text detection boxes are combined into a text sequence box;
the Chinese recognition model based on CRNN is used for recognizing text sequences with indefinite lengths from end to end, and the CRNN network structure sequentially comprises:
the CNN convolution layer is used for extracting features from the input image to obtain a feature map;
cyclic layer RNN: predicting the feature sequence by using a bidirectional RNN, learning each feature vector in the sequence, and outputting prediction label distribution;
the transcription layer CTC loss, which converts the series of label distributions obtained from the recurrent layer RNN into the final label sequence using the neural-network-based Connectionist Temporal Classification (CTC) loss.
4. The method according to claim 3, wherein in step 2, the multi-modal semantic understanding technology based on multi-level fine granularity is adopted, and the multi-level semantic representation alignment based on a dual-stage model architecture is implemented;
the fine-grained semantic understanding in a multi-modal situation is achieved, wherein the multi-modal includes two modalities of text and image.
5. The method of claim 4, wherein step 3 comprises:
step 3-1, the high-performance content retrieval engine construction technology refers to quick retrieval of a Java memory database based on a bidirectional reference index;
and 3-2, rapidly searching the multi-mode data memory database, namely constructing an optimized memory database searching and storing structure model, and realizing Java memory database searching and characteristic analysis by a space information fusion planning method.
6. The method of claim 5, wherein step 3-1 comprises: constructing an optimized storage structure model of a Java memory database by adopting a dictionary sequence storage mechanism, constructing a storage correlation characteristic analysis model of the memory database based on a data storage and index separation fusion method, adopting methods of inserting, deleting and updating indexes to perform joint characteristic decomposition of the memory database to obtain an embedding dimension m of an index table, searching for characteristic quantities of corresponding indexes in a cache to obtain an embedding distribution order n, u of the memory database i ∈R m ,R m Representing an index table with an embedding dimension m, u i Representing the ith index, establishing connection between the HBase and the memory database by connecting the Java API with the HBase, realizing the creation of a storage index of the memory database by using the HBase record, and performing the optimized association index mode v j ∈R n Under guidance, the secondary index combined characteristic distribution set M of the mass data is obtained as follows:
where R^n represents the in-memory database with embedding distribution order n, and v_j represents the j-th index order;
constructing a joint feature detection model of the in-memory database using an observer coprocessor, and adopting a data-table matching method to obtain the retrieval feature distribution element x_t in the index file X; the reference feature quantity of the bidirectional index of the in-memory database that it satisfies is as follows:
where P_i represents the i-th reference feature quantity;
for the storage node x_t of the in-memory database, the elements in the joint feature distribution domain n satisfy x_t ∈ B;
the distributed quantization parameter set of the in-memory database storage space is defined as the association-rule feature quantity of the index file X, where i_1, i_2, i_3, …, i_{n+1} are obtained by a block joint parameter identification method as the statistical regression analysis result after iteration, so as to obtain the storage structure optimization model of the in-memory database;
combining a custom data-processing logic structure design method to obtain the joint parameter feature set F = {f_1, f_2, …, f_n} of the in-memory database, where f_n is the n-th subset of the feature set F;
the method comprises the steps of applying block chains to a clustering center fusion process, obtaining a clustering center position by utilizing block chain link points, merging the clustering centers pairwise according to the size and the distance of the clustering centers, and repeating iteration until only one block chain fusion clustering center is obtained, obtaining a distributed graph database of memory data in the block chain fusion clustering center, building a cloud data processing frame on the basis of a Hadoop architecture, carrying out fusion processing on the data in the cloud data processing frame, and obtaining a joint statistical feature quantity Q of the Java memory database by utilizing a Hadoop cloud data fusion method, wherein the expression formula is as follows:
wherein N is a highly-associated joint distribution sequence of the memory database;
and adopting a digital arithmetic coding method for a data attribute retrieval result of the memory database, and designing the structure characteristic quantity of the quick retrieval connection table of the memory database to be expressed as:
where c_k is a rule item of the relational data of the in-memory database, k is the data retrieval sequence, n is a subset of the highly associated joint distribution sequence of the in-memory database, f is a subset of the joint parameter feature set F of the in-memory database, and T denotes the matrix transpose;
fuzzy clustering is performed on the n unknown information components, and the source knowledge-point matching set retrieved by the in-memory database is obtained as s = (s_1, s_2, s_3, …, s_n)^T, where T denotes the matrix transpose;
using the method of data high-correlation factor analysis, the multi-dimensional feature parameter set H_{nm} for fast retrieval of the in-memory database is:
where x_m(t) represents the m-th subset of the index file X; with the control time of in-memory database retrieval being t_0, a multi-table join and aggregation method is adopted to obtain the bidirectional-index optimal feature solution set ξ of the Java in-memory database:
where A = {a_1, a_2, …, a_n} is the set of related attributes retrieved from the in-memory database, a_n is the n-th subset in set A, B = {b_1, b_2, …, b_n} is the distribution set of decision-tree functions for a typical query, and b_n is the n-th subset of set B;
the graph model design method based on the relational database realizes the feature analysis and fusion processing of the memory data.
7. The method of claim 6, wherein step 3-2 comprises: constructing a bidirectional reference index decision model for in-memory database retrieval by adopting a decision-tree classification method, realizing optimization control of the database during fast retrieval through index table names and index columns, automatically selecting an appropriate space-fusion path according to the read/write characteristics of the current user request by analyzing the operational logic of the in-memory database based on a relational database model analysis method, constructing a lightweight data processing framework on this basis to ensure the consistency and integrity of the in-memory database's data, constructing a space fusion model of the in-memory database under the support of the data processing framework, extracting the semantic feature quantities of the database space with the space fusion model, and querying the semantic distribution set X_i(t) using a query method based on the graph data model:
where dt denotes the in-memory database retrieval; in the multi-data-model database, a combined-judgment method is used for fusion positioning of the in-memory database, fuzzy clustering is applied to the general feature quantities of its query tests, and differentiated fusion scheduling is applied to the query results to obtain the multi-model data fusion model f_x(t) of the in-memory database, expressed as:
where x_i(t) is the bidirectional reference feature matching set for in-memory database retrieval; in the native multi-model database, a feature allocation mechanism with association-factor fusion is adopted to obtain the channel matching set output R(x) of the bidirectional reference index of the in-memory database:
where α_l and τ_l are respectively the associated-feature distribution amplitude and delay of the in-memory database, with l ∈ [0, L-1], τ_0 < τ_1 < … < τ_{L-1}, and L being the length of the associated-feature sequence of the in-memory database; a data-engine query control method is adopted to obtain the multi-data-model database fusion distribution set, which represents the information-conflict feature quantity on the relational database; association knowledge is used to guide mining, obtaining the feature-distribution entity set fused by association factors as x = (x_1, x_2, …, x_m)^T; the field name y obtained by fast retrieval of the in-memory database is:
y = c_m X - R(x)
where X is the decision vector, x is the feature-distribution entity set fused by association factors, x̄ is the absolute value of the mean of x, i.e., X = {x | x̄z^{-1}}^T, and c_m is the scheduling set for data of different types and different structures;
setting f = N/MT as the sampling frequency, the channel model Y for fast retrieval of the in-memory database is obtained as follows:
and (3) constructing a bidirectional reference index decision model for quickly retrieving the memory database through the processing of the steps 3-1 to 3-2.
8. The method of claim 7, wherein step 3-2 further comprises: realizing optimization control during fast retrieval of the in-memory database through the index table names and index columns, and constructing the node fusion scheduling model z_k of the in-memory database according to the correlation between the index table and the value table, as follows:
where h_k denotes the bidirectional reference feature distribution function of in-memory database retrieval; the two-dimensional topological structure of the in-memory database is designed in a grid environment, fusion with the relational database PostgreSQL gives the reference coordinate point of data retrieval as (0, 0), the transmission-link distribution set of the in-memory database is T_s = N_f T_f, and joint linear correlation fusion yields the linear programming model:
where the first quantity is the graph-storage management structure feature quantity of the m × n order relational database, h_n is the n-th value in that vector, and m and n are the vector dimensions; a fuzzy parameter feature matching set for in-memory database retrieval is constructed within the same entity-representation attribute feature range, optimization control during fast retrieval of the in-memory database is realized through index table names and index columns, and, according to the correlation between the index table and the value table, the density ρ of space-conflict nodes of the in-memory database is as follows:
ambiguity detection is performed on the bidirectional index channel of the in-memory database, obtaining the joint feature distribution set ψ = [ψ_1, ψ_2, …, ψ_N], where ψ_N is the N-th value in the distribution set ψ;
obtaining the bidirectionally fused sparsity feature quantity λ_x of the in-memory database, expressed as:
where λ' is the reference feature distribution of the bidirectional index of the in-memory database; combined with the association relations of in-memory database retrieval, a fuzzy decision model of in-memory database retrieval is constructed, the space-planning design of the in-memory database is carried out with the packet-switching protocol taken into account, and the information conduction model O_x of Java in-memory database retrieval is:
where s is the bit sequence output by the bidirectional index of the in-memory database and the remaining term is a relational data model parameter.
9. The method according to claim 8, wherein in step 4, the two application modes of text-to-image search and image-to-image search refer to the final functions presented through searching images by text and searching images by image, wherein text-to-image search includes entering short query phrases in a search bar and returning the corresponding image set; image-to-image search includes uploading the image to be searched and returning the image set corresponding to the uploaded image elements.
10. A storage medium storing a computer program or instructions which, when executed, implement the method of any one of claims 1 to 9.
CN202310802642.2A 2023-07-03 2023-07-03 Cross-modal retrieval method based on multi-level fine granularity semantic alignment Pending CN116775929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310802642.2A CN116775929A (en) 2023-07-03 2023-07-03 Cross-modal retrieval method based on multi-level fine granularity semantic alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310802642.2A CN116775929A (en) 2023-07-03 2023-07-03 Cross-modal retrieval method based on multi-level fine granularity semantic alignment

Publications (1)

Publication Number Publication Date
CN116775929A true CN116775929A (en) 2023-09-19

Family

ID=88013185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310802642.2A Pending CN116775929A (en) 2023-07-03 2023-07-03 Cross-modal retrieval method based on multi-level fine granularity semantic alignment

Country Status (1)

Country Link
CN (1) CN116775929A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876651A (en) * 2024-03-13 2024-04-12 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium
CN117876651B (en) * 2024-03-13 2024-05-24 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110489395B (en) Method for automatically acquiring knowledge of multi-source heterogeneous data
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
Qiu et al. DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain
He et al. Vd-san: visual-densely semantic attention network for image caption generation
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
Hui et al. Collaborative spatial-temporal modeling for language-queried video actor segmentation
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
Al Rahhal et al. Multilanguage transformer for improved text to remote sensing image retrieval
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN111581964A (en) Theme analysis method for Chinese ancient books
CN116775929A (en) Cross-modal retrieval method based on multi-level fine granularity semantic alignment
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
Afyouni et al. AraCap: A hybrid deep learning architecture for Arabic Image Captioning
CN115309927A (en) Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning
Naik et al. Video captioning using sentence vector-enabled convolutional framework with short-connected lstm
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
CN117556076A (en) Pathological image cross-modal retrieval method and system based on multi-modal characterization learning
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Hoxha et al. Retrieving images with generated textual descriptions
Li et al. Cross‐modal retrieval with dual multi‐angle self‐attention
Thirumagal et al. GAN models in natural language processing and image translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination