CN115658942A

CN115658942A - Financial scene-oriented joint credit investigation intelligent data retrieval method

Info

Publication number: CN115658942A
Application number: CN202211346701.1A
Authority: CN
Inventors: 杨帆; 曹杰; 周云松; 王治平; 王培才; 茅天天
Original assignee: Jiangsu United Credit Reference Co ltd; Nanjing University of Finance and Economics
Current assignee: Jiangsu United Credit Reference Co ltd; Nanjing University of Finance and Economics
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2023-01-31

Abstract

A financial scene-oriented joint credit investigation intelligent data retrieval method. First, image features of the joint credit investigation are extracted with a VGG model and features of the joint credit investigation text are extracted with GloVe, and the extracted features are mapped into a subspace. Meanwhile, drawing on the constraint loss used in transfer learning, a modality adversarial loss makes a classifier unable to distinguish text features from image features across modalities, so that the text and image features of the joint credit investigation of a financial scene are fused into a whole and texts and images with similar features are gathered together. Finally, joint credit investigation text data of a single financial scene is input and the associated joint credit investigation picture data and text data are retrieved, or a piece of joint credit investigation picture data is input and the associated joint credit investigation picture information and text information are retrieved.

Description

Financial scene-oriented joint credit investigation intelligent data retrieval method

Technical Field

The invention relates to a financial scene-oriented joint credit investigation intelligent data retrieval method, and belongs to the technical field of artificial intelligence and joint credit investigation data security.

Background

With the development of the Internet era and the explosive growth of information in financial scenes, traditional data retrieval can only search for similar financial credit investigation information in a single form and cannot process other modalities such as joint credit investigation pictures. When information in other modalities related to a query must be retrieved at the same time, traditional text classification retrieval cannot meet the requirement, because the type of the input data differs from the type of the output information. For the financial data involved in joint credit investigation, the features of the data must first be extracted, which is a primary task in fields such as data mining and machine learning. The joint credit investigation data of a financial scene describe or relate to the same semantic concept from different dimensions, and their features must be able to identify which concept each piece of data belongs to. However, in many current financial scenes the different modalities of joint credit investigation data cannot be compared directly; they must be placed in the same space before they can be compared and retrieved.

CN2022107808165 discloses a credit investigation query processing system, method, device and medium, which are applied in the technical field of data processing for financial technology and are intended to solve the poor processing capability, stability and robustness of prior-art credit investigation query processing systems.

CN2018104363407 discloses a big data-based method and system for analysing the credit investigation risk of internet financial enterprises. The method comprises the following steps: credit investigation information of an internet financial enterprise is collected, a credit investigation risk assessment model for the enterprise is constructed from the collected information, and the enterprise's credit investigation risk is assessed with the constructed model, so that an internet financial platform can judge whether it is suitable to conduct financial transactions with that enterprise.

CN2020106703966 discloses a method for constructing a financial risk control model based on big-data credit investigation. Financial risk control data is sampled, the data comprising a training set and a verification set; a model of the financial risk control data is established from the sampled data, the model comprising the financial risk control data and a model parameter set; and the risk level of the enterprise to be assessed is judged according to the model.

CN2021105832803 discloses a blockchain-based personal credit investigation query method and device, and relates to the technical field of blockchains. When a personal credit investigation request sent by a current financial system is received, it is first determined whether the current financial system is a consortium chain node; once this is confirmed, the current financial system is authorized to query the personal credit investigation information of the current user according to the request, which improves the security of personal credit investigation; and the business handling information determined by the current financial system, based on the personal credit investigation information of the current user and the business handling request submitted by that user, is added to the blockchain address of the current user in the consortium chain.

CN2015106960220 relates to a financial service platform system based on the collection of enterprise credit investigation information. The system comprises enterprise credit collection points, credit investigation processing terminals, a financial credit investigation service platform and a credit processing system. The enterprise credit collection points are in direct contact with enterprises and are staffed by a number of credit information collectors, who use the credit investigation processing terminals; the collection points communicate with the financial credit investigation service platform through an information sending and receiving system; and the credit processing system is set up and managed by the financial credit investigation service platform. Each credit investigation processing terminal is provided with a credit data conversion system, which in turn is provided with a data transmission system. The system combines enterprise credit investigation with financial services, improves the collection of enterprise credit investigation information, and sets different credit security levels so that the financial service center can make use of different enterprise information.

CN2019110627642 discloses a supply chain financial credit investigation analysis method based on a multilayer genetic method against the background of big data. S1, data collection: all evaluation data of a company within a certain period of time is collected and processed through a supply chain finance institute to obtain an evaluation matrix. S2, setting matrix distribution parameters: a number of analysis points are set at random in the evaluation matrix, and, taking a base point of the evaluation matrix as the root node, all analysis points are connected to form an initial analysis tree. S3, encoding chromosomes: a time period in the initial analysis tree is selected at random and Prufer-coded to obtain an integer sequence, and the integer sequence is converted into chromosome codes.

However, the prior art still does not provide a machine learning-based intelligent data retrieval method for joint credit investigation.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an intelligent data retrieval method for financial scenes. The invention relates to a financial scene-oriented joint credit investigation intelligent data retrieval method that realizes machine learning-based intelligent data retrieval. A subspace method is adopted: a projection matrix is learned from the co-occurrence information of joint credit investigation data pairs from different financial scenes, the joint credit investigation data features of different modalities are projected into a common latent subspace, and the similarity between modalities is measured in that subspace, thereby realizing joint credit investigation intelligent data retrieval.
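As a minimal sketch of the subspace idea, the following shows features of two different dimensionalities being projected into one shared space where they can be compared directly. The matrices `W_IMAGE` and `W_TEXT`, the toy feature values and the use of Euclidean distance are illustrative assumptions; in the method itself the projections are learned during training.

```python
import math

def project(matrix, vector):
    """Multiply a (k x d) projection matrix by a length-d feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical projections: 4-D image features and 3-D text features
# are both mapped into the same 2-D subspace.
W_IMAGE = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
W_TEXT = [[0.5, 0.0, 0.0],
          [0.0, 0.5, 0.0]]

image_feature = [2.0, 0.0, 7.0, 1.0]
matching_text_feature = [4.0, 0.0, 9.0]
other_text_feature = [0.0, 6.0, 1.0]

image_embedding = project(W_IMAGE, image_feature)
matching_text_embedding = project(W_TEXT, matching_text_feature)
other_text_embedding = project(W_TEXT, other_text_feature)

# Once both modalities live in the same space, distances are comparable.
matched_distance = math.dist(image_embedding, matching_text_embedding)
other_distance = math.dist(image_embedding, other_text_embedding)
```

The raw features cannot be compared because their lengths differ; after projection both are length-2 vectors, and the matching text lies closer to the image than the unrelated text.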

To solve the above problems, the technical scheme adopted by the invention is as follows. In a financial scene-oriented joint credit investigation intelligent data retrieval method, image features of the joint credit investigation are first extracted with a VGG model and features of the joint credit investigation text are extracted with GloVe, and the extracted features are mapped into a subspace. Meanwhile, drawing on the constraint loss used in transfer learning, a modality adversarial loss makes a classifier unable to distinguish text features from image features across modalities, so that the text and image features of the joint credit investigation of a financial scene are fused into a whole and texts and images with similar features are gathered together. Finally, joint credit investigation text data of a single financial scene is input and the associated joint credit investigation picture data and text data are retrieved, or a piece of joint credit investigation picture data is input and the associated joint credit investigation picture information and text information are retrieved.

The method comprises the following steps:

1) The VGG model is currently a preferred algorithm for extracting CNN features from joint credit investigation images of financial scenes. Although it has a very large number of parameters, it performs extremely well in many transfer learning tasks. VGG variants differ in the size of their convolution kernels and in the number of convolutional layers; the model used by the invention is VGG16. The convolutional and pooling layers of VGG16 can be divided into blocks, numbered Block1 to Block5 from front to back. Each block contains several convolutional layers and one pooling layer; for example, Block4 contains three conv3-512 convolutional layers and one max-pooling layer. Within a block, all convolutional layers have the same number of channels. VGG16 requires input data of dimensions 224 × 224 × 3 (row resolution × column resolution × number of channels). The number of channels doubles with each block and stays unchanged once it reaches 512. After the last pooling layer, the data passes through two fully connected layers, and the feature vector is finally extracted.

2) The joint credit investigation text data set is preprocessed before feature extraction; the preprocessing comprises data loading, stop-word removal, word segmentation and feature extraction. A PyTorch custom data set is used to preprocess the joint credit investigation data, the data is converted into tensors that torch can process so that it can be input into a neural network, and the corresponding joint credit investigation data set is imported into a data loader.
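The text preprocessing in step 2) can be illustrated roughly as follows. This sketch assumes English text that can be split on non-alphanumeric characters, and the `STOP_WORDS` set is a deliberately tiny, hypothetical list; the source does not specify the tokenizer or the stop-word list actually used.

```python
import re

# Hypothetical, minimal stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

sample = "The credit report of the Enterprise is overdue."
tokens = preprocess(sample)
```

Here `tokens` keeps only the content-bearing words (`credit`, `report`, `enterprise`, `overdue`), which are the units later looked up in the word-vector file.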

3) After stop-word preprocessing of the joint credit investigation text, the invention uses glove.42B.300d as the pre-trained GloVe word-vector embedding file to train on the training set and extract the corresponding text features. The input joint credit investigation text is first case-normalized; the corresponding words are then matched in the word-vector embedding file glove.42B.300d, the related co-occurrence matrix data is loaded, and the computation is carried out.

4) The model training module is the core module of the invention. After the text features and image features of the joint credit investigation training set have been extracted, they enter the model training module as input data. First, the text features and image features each pass through a fully connected network to obtain features of uniform dimensionality, and a classification loss is calculated, so that the relations within each modality are preserved after the images and texts are mapped into the subspace. Meanwhile, the image and text features of positive and negative joint credit investigation examples are obtained, and a triplet loss is calculated with given parameters, so that the distance between features of different modalities with the same semantics is minimized and the distance between features with different semantics is maximized. In addition, drawing on the idea of transfer learning, a modality adversarial loss is introduced: a gradient reversal layer confuses the differences between modalities in the subspace, so that images and texts can be classified accurately within each modality while the modalities themselves cannot be told apart, which improves the effect of model training. The Adam optimizer (AdamOptimizer) is used in each iteration, introducing second-moment gradient correction to search for the global optimum, and the trained model is finally obtained.
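The triplet loss mentioned in step 4) can be sketched as below. The margin value, the Euclidean distance and the toy vectors are assumptions made for illustration; the source only states that the loss is computed from positive and negative examples with given parameters.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Toy subspace features: an image anchor with a same-semantics text
# (positive) and a different-semantics text (negative).
image_anchor = [0.0, 0.0]
text_positive = [0.0, 1.0]
text_negative = [3.0, 4.0]

# The positive is already much closer than the negative: no penalty.
loss_satisfied = triplet_loss(image_anchor, text_positive, text_negative)
# Swapping the roles violates the margin and yields a positive loss.
loss_violated = triplet_loss(image_anchor, text_negative, text_positive)
```

The loss is zero once the same-semantics pair is closer than the different-semantics pair by at least the margin, so training effort concentrates on triplets that still violate that ordering.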

5) The input of the retrieval module is joint credit investigation data entered directly by the user, which may be either joint credit investigation text data or joint credit investigation image data. After the user inputs data, the system first extracts the features of the input, then loads the previously trained feature model database for comparison, selects the features with the highest similarity, and matches them to the corresponding joint credit investigation data to output the retrieval result.

Beneficial effects: the invention discloses a financial scene-oriented joint credit investigation intelligent data retrieval method. First, image features of the joint credit investigation are extracted with a VGG model and features of the joint credit investigation text are extracted with GloVe, and the extracted features are mapped into a subspace. Meanwhile, drawing on the constraint loss used in transfer learning, a modality adversarial loss makes a classifier unable to distinguish text features from image features across modalities, so that the text and image features of the joint credit investigation of a financial scene are fused into a whole and texts and images with similar features are gathered together. Finally, joint credit investigation text data of a single financial scene is input and the associated joint credit investigation picture data and text data are retrieved, or a piece of joint credit investigation picture data is input and the associated joint credit investigation picture information and text information are retrieved.

In particular, the present invention has the following advantages:

1. Machine learning-based intelligent data retrieval for joint credit investigation is realized.

2. A projection matrix is learned from the co-occurrence information of joint credit investigation data pairs from different financial scenes, the joint credit investigation data features of different modalities are projected into a common latent subspace, and the similarity between modalities is measured in that subspace, thereby realizing joint credit investigation intelligent data retrieval.

Drawings

Fig. 1 is a schematic diagram of the VGG16 model.

Detailed Description

The present invention will be further illustrated by the following specific examples, which are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever.

The invention comprises the following steps:

1) Image data of the corresponding joint credit investigation is crawled from the financial scene using Python crawler technology: the requests library is used, with the urllib library as an auxiliary, and the crawled data is finally stored in a csv file. Duplicate data is filtered out during data preprocessing. The acquired data has multiple attributes, so the attributes that accurately describe the data must be selected at the data acquisition stage. To ensure the accuracy of the data, data with little or no relevance is deleted. To ensure uniqueness, the single most timely piece of data is selected and stored in a local database, forming the joint credit investigation image data set CID.

2) Text data of the corresponding joint credit investigation is crawled from the financial scene using Python crawler technology: the requests library is used, with the urllib library as an auxiliary, and the crawled data is finally stored in a csv file. Duplicate data is filtered out during data preprocessing. The acquired data has multiple attributes, so the attributes that accurately describe the data are selected at the data acquisition stage. To ensure the accuracy of the data, data with little or no relevance is deleted. To ensure uniqueness, the single most timely piece of data is selected and stored in a local database, forming the joint credit investigation text data set CTD.
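The cleaning applied to the crawled csv data in steps 1) and 2), removing duplicates and keeping the most timely record, might look like the sketch below. The column names (`record_id`, `subject`, `updated`, `content`) and the sample rows are hypothetical; the source does not describe the schema of the crawled files.

```python
import csv
import io

# Hypothetical crawled rows; the schema is illustrative only.
RAW_CSV = """record_id,subject,updated,content
r1,Company A,2022-03-01,old report
r2,Company A,2022-09-15,new report
r3,Company B,2022-05-20,only report
r4,Company A,2022-09-15,new report
"""

def keep_latest(rows, key="subject", time_field="updated"):
    """Keep one row per key: the first row carrying the latest date."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if current is None or row[time_field] > current[time_field]:
            latest[row[key]] = row
    return list(latest.values())

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = keep_latest(rows)
kept_ids = [row["record_id"] for row in cleaned]
```

In this toy input, the outdated record `r1` and the repeated record `r4` are dropped, leaving one up-to-date entry per subject.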

3) The VGG model is currently a preferred algorithm for extracting CNN features from the joint credit investigation image data set CID of a financial scene. Although it has a very large number of parameters, it performs extremely well in many transfer learning tasks. VGG variants differ in the size of their convolution kernels and in the number of convolutional layers; the model used by the invention is VGG16. The convolutional and pooling layers of VGG16 can be divided into blocks, numbered Block1 to Block5 from front to back. Each block contains several convolutional layers and one pooling layer; for example, Block4 contains three conv3-512 convolutional layers and one max-pooling layer. Within a block, all convolutional layers have the same number of channels. VGG16 requires input data of dimensions 224 × 224 × 3 (row resolution × column resolution × number of channels). The number of channels doubles with each block and stays unchanged once it reaches 512. After the last pooling layer, the data passes through two fully connected layers, and the feature vector is finally extracted, as shown in Table 1.

Table 1. VGG16 feature extraction model
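The block structure described in step 3) can be checked with a short shape trace. The per-block layer counts and channel widths below follow the standard published VGG16 configuration (3 × 3 convolutions that preserve spatial size, 2 × 2 max pooling with stride 2), which is assumed to be what the invention uses; the function name `trace_shapes` is illustrative.

```python
# (block name, number of 3x3 convolutional layers, output channels)
VGG16_BLOCKS = [
    ("Block1", 2, 64),
    ("Block2", 2, 128),
    ("Block3", 3, 256),
    ("Block4", 3, 512),
    ("Block5", 3, 512),
]

def trace_shapes(height=224, width=224):
    """Return (name, height, width, channels) after each block's pooling."""
    shapes = []
    for name, _num_convs, channels in VGG16_BLOCKS:
        # Padded 3x3 convolutions keep height and width unchanged;
        # the 2x2 max pool with stride 2 halves both.
        height //= 2
        width //= 2
        shapes.append((name, height, width, channels))
    return shapes

shapes = trace_shapes()
_, final_h, final_w, final_c = shapes[-1]
flattened_size = final_h * final_w * final_c
conv_layer_count = sum(num_convs for _, num_convs, _ in VGG16_BLOCKS)
```

A 224 × 224 × 3 input leaves Block5 as a 7 × 7 × 512 map, which is flattened before the fully connected layers. In the standard VGG16 definition those layers have 4096 units each, so a feature taken after the second one would be 4096-dimensional.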

4) The joint credit investigation text data set CTD is preprocessed before feature extraction; the preprocessing comprises data loading, stop-word removal, word segmentation and feature extraction. A PyTorch custom data set is used to preprocess the joint credit investigation data, the data is converted into tensors that torch can process so that it can be input into a neural network, and the corresponding joint credit investigation data set is imported into a data loader.
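Converting tokenized text into fixed-size numeric batches, as step 4) describes, can be sketched in plain Python as follows. In a PyTorch implementation the encoded lists would typically be wrapped in tensors and served through `torch.utils.data.Dataset` and `DataLoader`; this sketch uses plain lists so that it runs without torch, and the helper names, special tokens and toy corpus are hypothetical.

```python
PAD, UNK = "<pad>", "<unk>"

def build_vocab(token_lists):
    """Assign each distinct token an integer id, reserving 0 and 1."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, length):
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens][:length]
    return ids + [vocab[PAD]] * (length - len(ids))

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

corpus = [
    ["credit", "report", "overdue"],
    ["loan", "report"],
    ["credit", "loan", "default", "record"],
]
vocab = build_vocab(corpus)
encoded = [encode(tokens, vocab, length=4) for tokens in corpus]
batch_list = list(batches(encoded, batch_size=2))
```

Padding every sequence to the same length is what allows a batch to be stored as a single rectangular tensor.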

5) After stop-word preprocessing of the joint credit investigation text, the invention uses glove.42B.300d as the pre-trained GloVe word-vector embedding file to train on the training set and extract the corresponding text features. The input joint credit investigation text is first case-normalized; the corresponding words are then matched in the word-vector embedding file glove.42B.300d, the related co-occurrence matrix data is loaded, and the computation is carried out.
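The word-vector lookup in step 5) can be illustrated with a small stand-in for the GloVe text format, in which each line holds a word followed by its vector components. The 3-dimensional `TOY_GLOVE` table is hypothetical (the real glove.42B.300d vectors have 300 dimensions), and averaging the word vectors into one text feature is an assumption; the source does not state how word vectors are combined.

```python
import io

# Hypothetical 3-D stand-in for the "word v1 v2 ..." GloVe text format.
TOY_GLOVE = """credit 0.1 0.2 0.3
report 0.3 0.2 0.1
loan 0.5 0.5 0.5
"""

def load_embeddings(handle):
    """Parse GloVe-style lines into a {word: vector} dictionary."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table

def text_feature(tokens, table, dim):
    """Average the vectors of known tokens (mean pooling is an assumption)."""
    vectors = [table[t.lower()] for t in tokens if t.lower() in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

embeddings = load_embeddings(io.StringIO(TOY_GLOVE))
feature = text_feature(["Credit", "Report", "unknownword"], embeddings, dim=3)
```

Lowercasing before lookup mirrors the case normalization in the step above: a capitalized token such as `Credit` would otherwise miss its lowercase entry in the table.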

The model training module is the core module of the invention. After the text features and image features of the joint credit investigation training set have been extracted, they enter the model training module as input data. First, the text features and image features each pass through a fully connected network to obtain features of uniform dimensionality, and a classification loss is calculated, so that the relations within each modality are preserved after the images and texts are mapped into the subspace. Meanwhile, the image and text features of positive and negative joint credit investigation examples are obtained, and a triplet loss is calculated with given parameters, so that the distance between features of different modalities with the same semantics is minimized and the distance between features with different semantics is maximized. In addition, drawing on the idea of transfer learning, a modality adversarial loss is introduced: a gradient reversal layer confuses the differences between modalities in the subspace, so that images and texts can be classified accurately within each modality while the modalities themselves cannot be told apart, which improves the effect of model training. The Adam optimizer (AdamOptimizer) is used in each iteration, introducing second-moment gradient correction to search for the global optimum, and the trained model is finally obtained.
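The layer that confuses modality differences is assumed here to be a gradient reversal layer of the kind used in adversarial domain adaptation; the source does not give its definition. Such a layer passes features through unchanged in the forward pass and flips the sign of the gradient in the backward pass, so the feature extractor is updated to make the modality classifier's task harder. The class name, the scaling factor `lam` and the toy gradient values below are illustrative.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features reach the modality classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
outputs = layer.forward(features)

# Toy gradient of the modality classifier's loss w.r.t. its input.
classifier_grad = [1.0, -2.0, 4.0]
encoder_grad = layer.backward(classifier_grad)
```

Because the classifier still receives ordinary gradients while the encoder receives reversed ones, a single backward pass trains the two against each other without separate optimization loops.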

The input of the retrieval module is joint credit investigation data entered directly by the user, which may be either joint credit investigation text data or joint credit investigation image data. After the user inputs data, the system first extracts the features of the input, then loads the previously trained feature model database for comparison, selects the features with the highest similarity, and matches them to the corresponding joint credit investigation data to output the retrieval result.
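The retrieval step can be sketched as a similarity ranking over stored subspace features. Cosine similarity, the item identifiers and the 2-dimensional vectors in `FEATURE_DB` are assumptions for illustration; the source only states that the features with the highest similarity are selected.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, database, top_k=2):
    """Return the ids of the top_k stored items most similar to the query."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

# Hypothetical feature database holding both modalities in one subspace.
FEATURE_DB = {
    "image_001": [0.9, 0.1],
    "text_017": [0.8, 0.3],
    "image_042": [-0.7, 0.6],
}

query_embedding = [1.0, 0.0]
results = retrieve(query_embedding, FEATURE_DB, top_k=2)
```

Since images and texts share one subspace, a single ranked list can mix both modalities, which is what lets a text query return pictures and a picture query return texts.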

Claims

1. A financial scene-oriented joint credit investigation intelligent data retrieval method, characterized in that: first, a VGG model is used to extract image features of the joint credit investigation and GloVe is used to extract features of the joint credit investigation text, and the extracted features are mapped into a subspace; meanwhile, drawing on the constraint loss used in transfer learning, a modality adversarial loss makes a classifier unable to distinguish text features from image features across modalities, so that the text and image features of the joint credit investigation of a financial scene are fused into a whole and texts and images with similar features are gathered together; and finally, joint credit investigation text data of a single financial scene is input and the associated joint credit investigation picture data and text data are retrieved, or a piece of joint credit investigation picture data is input and the associated joint credit investigation picture information and text information are retrieved.

2. The financial scenario-oriented joint credit investigation intelligent data retrieval method as claimed in claim 1, wherein the method comprises the following steps:

1) The convolutional and pooling layers of VGG16 are divided into blocks numbered Block1 to Block5 from front to back; each block comprises several convolutional layers and one pooling layer; within a block, the convolutional layers have the same number of channels; to extract CNN features from an image, VGG16 requires input data of dimensions 224 × 224 × 3, i.e., row resolution × column resolution × number of channels; the number of channels doubles after each block and stays unchanged once it reaches 512; after the last pooling layer, the data passes through two fully connected layers and the feature vector is finally extracted; the data is converted into tensors for input into a neural network, and the corresponding joint credit investigation data set is imported into a data loader;

2) The joint credit investigation text data set is preprocessed before feature extraction, the preprocessing comprising data loading, stop-word removal, word segmentation and feature extraction; a PyTorch custom data set is used to preprocess the joint credit investigation data, the data is converted into tensors that torch can process so that it can be input into a neural network, and the corresponding joint credit investigation data set is imported into a data loader;

3) After stop-word preprocessing of the joint credit investigation text, glove.42B.300d is used as the pre-trained GloVe word-vector embedding file to train on the training set, and the corresponding text features are extracted; the input joint credit investigation text is first case-normalized, the corresponding words are then matched in the word-vector embedding file glove.42B.300d, the related co-occurrence matrix data is loaded, and the computation is carried out;

4) Model training: after the text features and image features of the joint credit investigation training set have been extracted, they enter the model training module as input data; first, the features of the joint credit investigation image data set CID and of the joint credit investigation text data set CTD each pass through a fully connected network to obtain features of uniform dimensionality, and a classification loss is calculated, so that the relations within each modality are preserved after the images and texts are mapped into the subspace; meanwhile, the image and text features of positive and negative joint credit investigation examples are obtained, and a triplet loss is calculated with given parameters, so that the distance between features of different modalities with the same semantics is minimized and the distance between features with different semantics is maximized; in addition, drawing on the idea of transfer learning, a modality adversarial loss is introduced, in which a gradient reversal layer confuses the differences between modalities in the subspace, so that images and texts can be classified accurately within each modality while the modalities themselves cannot be told apart, which improves the model training effect; and the Adam optimizer (AdamOptimizer) is used in each iteration, introducing second-moment gradient correction to search for the global optimum, and the trained model is finally obtained.

3. The financial scene-oriented joint credit investigation intelligent data retrieval method as claimed in claim 1, wherein the input of the retrieval module is joint credit investigation data entered directly by a user, the input being the text data CI of the joint credit investigation or the image data CT of the joint credit investigation; after the user inputs data, the retrieval module first extracts the features of the input, then loads the previously trained feature model database for comparison, selects the features with the highest similarity, and matches them to the corresponding joint credit investigation data to output the retrieval result.