CN111563191A - Data processing system based on graph network - Google Patents

Data processing system based on graph network

Info

Publication number
CN111563191A
Authority
CN
China
Prior art keywords
graph network
node
vector
attribute
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010647976.3A
Other languages
Chinese (zh)
Inventor
张学锋
刘世林
康青杨
韩远
吴桐
曾途
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd filed Critical Chengdu Business Big Data Technology Co Ltd
Priority to CN202010647976.3A
Publication of CN111563191A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/901: Indexing; Data structures therefor; Storage structures
    • G06F 16/9024: Graphs; Linked lists
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data processing system based on a graph network, comprising: a data collection device, which crawls basic information of each individual from public web pages using crawler technology and stores it in a database; and a graph network construction device, which extracts information from the database, constructs a graph network with individuals as nodes, connects nodes that have an association relation, takes the basic information of each individual as the attributes of its node, and encodes those attributes into a vector serving as the characterization vector of the node. A graph network constructed by this system greatly reduces the amount of computation performed on the graph network and improves both computational efficiency and the accuracy of computation results.

Description

Data processing system based on graph network
Technical Field
The invention belongs to the technical field of big data, and particularly relates to a data processing system based on a graph network.
Background
An enterprise cannot exist in isolation: in the course of its operation it necessarily has association relations with other enterprises, individuals, or organizations. Once these mutual relations are established, subsequent application and analysis can draw not only on an enterprise's own data but also on the data of associated enterprises, which improves the accuracy of analysis results. The current method for establishing inter-enterprise relations is to build a relationship network graph of enterprises (also called an enterprise knowledge graph or graph network): each node in the graph is an enterprise, nodes with an association relation are connected by an edge, and a more sophisticated version shows the specific relation on each edge or displays all or part of the data an enterprise generates in operation on its node. However, a graph network constructed this way has certain technical defects. The various data displayed on a single node (short strings, long texts, tables, etc.) are independent and discrete, which causes two problems when the graph network is used. First, the computational load is too large (because the network structure is usually very complex), so some or even all of the data in a node goes unused, which harms the accuracy of computation results. Second, when multiple kinds of data in a node must be computed on, multiple separate computations are needed, which not only increases the computational load but also greatly reduces processing efficiency.
Disclosure of Invention
The invention aims to reduce the amount of computation performed on a graph network while improving the accuracy of computation results and processing efficiency, and to that end provides a data processing system based on a graph network.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
A graph network based data processing system, the system comprising:
a data collection device, which crawls basic information of each individual from public web pages using crawler technology and stores it in a database; and
a graph network construction device, which extracts information from the database, constructs a graph network with individuals as nodes, connects nodes that have an association relation, takes the basic information of an individual as several attributes of its node, and encodes those attributes into a vector serving as the characterization vector of the node.
In the above scheme, the data collection device obtains the basic information of each individual from public web pages using crawler technology; the graph network construction device then constructs a graph network of enterprise nodes, with each individual as a node and the individual's basic information as node attributes, to represent (express) the relations between enterprises. A node comprises several attributes. Encoding the node into a single characterization vector unifies the scattered data: all of a node's information is expressed by one vector, so it can all be introduced into a computation simultaneously, improving the accuracy of the result. It also means computation is performed directly on one vector, which brings two benefits: compared with computing on the raw data, the computation process is greatly simplified and the computational load is reduced; and multiple kinds of data can be computed on at once, avoiding repeated computations and further improving processing efficiency.
That is to say, an application terminal can perform the corresponding graph computation directly on the graph network. Since the computation operates directly on the characterization vectors, the terminal's workload is greatly simplified: processing efficiency improves, the hardware requirements on the application terminal can be lowered, and hardware cost falls, which is of positive significance for big data applications.
When the graph network construction device constructs the graph network, it encodes, for each node, each of the node's several attributes into a vector, obtaining several attribute vectors, and then aggregates the attribute vectors into the characterization vector of the node.
The graph network construction device may aggregate the attribute vectors by equal-weight superposition to obtain the characterization vector of a node. This method is computationally simple, reduces the amount of computation, and preserves the features of each attribute.
Alternatively, the graph network construction device adopts a graph neural network model and aggregates the attribute vectors based on a specified learning task to obtain the characterization vectors of the nodes.
There are conventional applications of enterprise data, such as tax monitoring and enterprise classification. In the above scheme, the attribute vectors are aggregated by the graph neural network model based on the designated learning task, so downstream computation operates on already-encoded data: the computational load at the user terminal is further reduced and computational efficiency improves. Moreover, a characterization vector obtained for a designated learning task selectively extracts some features and weakens others, so computing with it further improves the accuracy of the results.
The characterization vectors of all nodes are equal in length.
In this scheme, all nodes are encoded into characterization vectors of a specified equal length, so the characterization of each node is more uniform, which benefits subsequent graph computation.
The graph network construction device may encode only specified attributes among the several attributes into attribute vectors.
Not all attributes are needed for every application, so selectively encoding only the attributes specified for a given application avoids interference from unnecessary attributes in the encoding result and also improves encoding efficiency; this addresses how to further improve encoding efficiency and the accuracy of encoding results.
The graph network construction device encodes the attributes into attribute vectors through a pre-trained attribute encoder.
The corresponding attributes are encoded by the pre-trained attribute encoder, and the attribute encoding results are then adjusted according to the application purpose during training of the graph network model.
The basic information comprises individual identification and operation activity information, and the data collection device establishes an association relationship between the individual identification and the operation activity information and then stores the association relationship.
The data collection device could crawl information from one or more web pages and store it directly, but this would not facilitate quick search. In this scheme, the individual identification and operation activity information are stored after an association relation is established between them; the individual identification can serve as an ID for building a directory, which makes it easier to quickly find the required data. That is, this solves the problem of how to search data quickly.
The system further comprises a plurality of application terminals, and the application terminals acquire the constructed graph network from the graph network construction device so as to execute the calculation of the specified task based on the graph network.
Compared with the prior art, the embodiment of the disclosure has the following beneficial effects:
(1) By converting node data into vector form, computation is performed on vectors instead of raw data, which greatly reduces the amount of computation performed on the graph network and improves processing efficiency.
(2) All attributes are aggregated into one vector, so more attribute features can be introduced in graph network based computation, further improving the accuracy of results.
(3) Compared with computing separately on each kind of data, multiple kinds of data can be computed on simultaneously via one vector, further improving computational efficiency.
(4) The amount of computation is greatly reduced, so the hardware performance required of the user terminal can be lowered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of a data processing system based on a graph network in an embodiment of the invention.
Fig. 2 is a flowchart of processing performed by the graph network constructing apparatus when constructing the graph network.
FIG. 3 is a diagram of the use of the BERT model to encode business introduction attributes.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention shall fall within the protection scope of the present invention.
Referring to fig. 1, the graph network based data processing system provided in this embodiment includes a data collection device 10, a graph network construction device 20, and several application terminals 30, which may interact with one another. For example, the data collection device collects data and provides it to the graph network construction device for constructing the graph network, and an application terminal obtains the graph network constructed by the graph network construction device directly from that device to perform the corresponding computing application.
It is easy to understand that the application terminals depend on the application; if only the data processing described above is required, the graph network based data processing system may comprise only the data collection device and the graph network construction device.
The data collection device, the graph network construction device, and the application terminals may each be a server, tablet computer, notebook computer, desktop computer, or the like, or even a palmtop computer or smartphone with strong processing capability.
More specifically, the data collection device crawls basic information of each individual from public web pages by using a crawler technology and stores the information in a database.
After the data collection device crawls the data, it could store the data directly in the database, but this may hinder later data retrieval by the graph network construction device; it is therefore preferable to associate the data with the individual before storage. For example, when the basic information includes the individual identification and operation activity information, the two are stored in the database after an association relation is established between them.
Because an enterprise may have relations with enterprises, natural persons, organizations, etc. in its business activities, a graph network may include enterprises, natural persons, and organizations; thus an individual here may be an enterprise, a natural person, or an organization.
Taking an enterprise as an example, the basic information may include the nature of the enterprise, the industry it belongs to, its province, registered capital, business scope, financial statements, other information (loans, tax, social security, water and electricity), and news opinion; taking a natural person as an example, it may include name, age, gender, personal profile, and the like. Different applications need different information, but somewhat more may be collected and stored at data collection time so that it is available when needed.
Associating the individual identification with the operation activity information maps specific information to the corresponding enterprise and avoids data storage errors; at the same time, a directory can be built on the individual identification, so corresponding information can be extracted from the database more quickly. In addition, information about the same enterprise is in most cases obtained from different web pages, so the data obtained from the various pages is consolidated with the individual identification as the key and stored after the association relation is established.
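As an illustration only, the following Python sketch shows one way such ID-keyed consolidation could look; the in-memory dictionary, the store helper, and the field names are hypothetical and not part of the patent.

```python
# Hypothetical sketch: merge crawled records for the same enterprise
# under one individual identification before storage. The helper and
# field names are illustrative assumptions.
records = {}

def store(individual_id, info):
    """Consolidate crawled fields under the enterprise's ID."""
    records.setdefault(individual_id, {}).update(info)

store("ORG-CODE-001", {"industry": "manufacturing"})      # from web page A
store("ORG-CODE-001", {"registered_capital": 5_000_000})  # from web page B
# records["ORG-CODE-001"] now holds both attributes for quick lookup by ID
```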
It is to be understood that the individual identification here is a representation of a business ID, and may be, for example, a business name, or information such as an organization code may be used as the business ID.
More specifically, the graph network construction device extracts information from the database and constructs a graph network with enterprises as nodes. An enterprise is usually associated with many individuals, so the graph network comprises several nodes, and nodes having an association relation are connected. The basic information of an individual is used as the attributes of its node, and the node is encoded into a characterization vector. As described above, an individual's basic information generally includes multiple kinds of information, so a node has multiple attributes (an attribute is, for example, one of industry, registered capital, financial statements, etc.); when a node is encoded as a characterization vector, its several attributes are encoded into one vector, which serves as the characterization vector of the node.
More specifically, when the graph network constructing apparatus constructs a graph network, each attribute of the plurality of attributes is encoded into a vector for each node, so that a plurality of attribute vectors can be obtained, and then the plurality of attribute vectors are aggregated, so that a characterization vector of the node can be obtained.
The attributes of each node may be encoded into vectors in its own manner, and there are many ways to encode an attribute as a vector. One implementation is to model the attributes of all nodes in the graph network and encode them into fixed-length vectors; that is, an attribute conversion model of a specified structure is constructed by modeling, and the resulting vectors are the desired characterization vectors.
As another embodiment, each attribute of a node may be encoded directly by a trained attribute encoder. The attribute encoder may be trained in an unsupervised manner (e.g., BERT or other unsupervised training methods based on a large corpus) or in a supervised manner on other specialized datasets (e.g., word segmentation datasets, or ImageNet-based image classification neural networks). Once one user has trained an attribute encoder, other users can use it directly when needed; that is, the attribute encoder is shared and reused, so other users avoid retraining when encoding node attributes, which improves encoding efficiency and saves computation.
It is to be understood that the attribute encoder employed may differ by attribute. For example, the enterprise-introduction attribute is encoded with a pre-trained BERT model. The encoded vector of the whole text is represented by the vector corresponding to [CLS], and [SEP] is the sentence separator, so multiple sentences may be input to the BERT model as shown in fig. 3. Token embedding maps each token directly to a word vector (the pre-trained model already contains the token-to-vector conversion). Segment embedding can be viewed as a one-hot encoding of the sentence id. Position embedding converts the position id via sines and cosines of different periods to obtain a position vector. After BERT's multi-layer Transformer structure, the vector corresponding to [CLS] is extracted as the characterization vector of the whole enterprise-introduction attribute.
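A minimal sketch of this kind of text encoding is shown below, using the Hugging Face transformers library; the library choice and the bert-base-chinese checkpoint are assumptions, since the patent does not name a specific implementation.

```python
# Sketch: encode an enterprise-introduction text with pre-trained BERT
# and take the [CLS] vector as the attribute's characterization vector.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

intro = "主营大数据分析与企业知识图谱服务。"  # enterprise introduction text
inputs = tokenizer(intro, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Position 0 corresponds to the [CLS] token; its vector represents the
# whole introduction attribute.
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```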
As another example, consider a picture attribute in a node, such as a product picture: the picture is passed through a VGG16 network pre-trained on ImageNet, and the encoding vector of its last hidden layer is extracted as the characterization vector of the picture.
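The following sketch illustrates this with torchvision; the preprocessing values and the use of the 4096-dimensional penultimate layer are common conventions assumed here, not mandated by the patent.

```python
# Sketch: encode a product picture with ImageNet-pre-trained VGG16,
# using the last hidden fully connected layer as the picture vector.
import torch
from torchvision import models, transforms
from PIL import Image

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# Drop the final classification layer so the forward pass ends at the
# last 4096-dimensional hidden layer.
vgg16.classifier = torch.nn.Sequential(*list(vgg16.classifier.children())[:-1])
vgg16.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("product.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    picture_vector = vgg16(image)  # shape: (1, 4096)
```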
For simpler information: numerical information (e.g., age, registered capital, etc.) may be encoded by normalization, and one-hot encoding may be used for categorical information (e.g., gender, industry, etc.).
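A sketch of these simpler encodings follows; the value ranges and category list are illustrative assumptions.

```python
# Sketch: min-max normalization for numerical attributes and one-hot
# vectors for categorical attributes.
import numpy as np

def normalize(value, lo, hi):
    """Scale a numerical attribute such as age or registered capital to [0, 1]."""
    return np.array([(value - lo) / (hi - lo)])

def one_hot(value, categories):
    """Encode a categorical attribute such as gender or industry."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

capital_vec = normalize(5_000_000, lo=0, hi=100_000_000)
industry_vec = one_hot("manufacturing", ["finance", "manufacturing", "retail"])
```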
It is also easy to understand that the above encodings are only examples of implementations; which specific way is used to encode the various attributes is not limited in this embodiment.
It should be noted here that, in order to simplify the operation and improve the calculation efficiency, when encoding the attributes into attribute vectors, the lengths of the respective attribute vectors may be restricted to be uniform.
There are various embodiments for aggregating all attribute vectors into the characterization vector of a node. As the simplest implementation, the graph network construction device aggregates all attribute vectors by equal-weight superposition, i.e., it directly sums all attribute vectors of the node to obtain its characterization vector.
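A minimal sketch of equal-weight superposition, assuming the attribute vectors share the same length:

```python
# Sketch: sum the attribute vectors directly (weight 1 each) to form
# the node's characterization vector.
import numpy as np

attribute_vectors = [np.array([0.2, 0.5, 0.1]),   # e.g. industry
                     np.array([0.7, 0.0, 0.3]),   # e.g. registered capital
                     np.array([0.1, 0.9, 0.4])]   # e.g. business introduction

node_vector = np.sum(attribute_vectors, axis=0)   # equal-weight superposition
```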
As another embodiment, the graph network construction device aggregates all attribute vectors based on a specified learning task using a graph neural network model. The designated learning task may be, for example, industry classification of enterprises, tax capability detection, or another common or extensible application.
When the graph network construction device learns with the graph neural network model on the designated task, it first aggregates all attribute vectors of each node into the node's characterization vector and then trains the graph neural network model; during learning the aggregation mode is continuously updated, i.e., the characterization vectors of the nodes are continuously updated, and the characterization vector obtained in the last update when learning finishes is the final characterization vector of the node.
A node characterization vector obtained by encoding for a particular learning task is more meaningful for applications: later computation for applications of that task is further simplified, and because the graph network model adopts a message propagation mechanism, information from surrounding nodes can be introduced into the current node as context. Therefore, in practice it may be preferable to aggregate all attribute vectors with the graph neural network model to obtain the characterization vectors of the nodes.
In more detail, referring to fig. 2, the graph network constructing apparatus, when encoding the basic information of the enterprise into the characterization vector of the node, performs the following steps:
and S300, encoding the attribute of each node into a characterization vector, wherein the characterization vector of the attribute can also be called an attribute vector.
Here, a universally applicable attribute encoder may be integrated into the graph network model in a module form to provide attribute representation for the graph network model.
In addition, the length of each attribute's characterization vector can be specified, and the lengths of all attributes' characterization vectors can be made the same to simplify later operations.
It should be noted that not all attributes may be needed for a learning task; for example, the enterprise-site attribute may be unnecessary. The attributes to be used can therefore be specified in advance, before encoding, and all attributes to be used must be converted into vector-encoded form. It is also easy to understand that whether an attribute is used depends on the specific application purpose; an attribute not used in the present application may be used in the next one.
S301: for each node, aggregate all attribute characterization vectors to obtain the node's characterization vector; this is the initial characterization vector.
For example, for a single node, the characterization vectors of all attributes (attribute vectors for short) are added element-wise. The weight of each attribute vector is the same at initialization, and as training progresses, the weights of different attribute vectors change according to their relevance to the learning objective (label) of the learning task; the result is fixed-length vectors of the same dimension. That is, the purpose of this step is to aggregate, for each node, all attribute characterization vectors contained in the node into the node's characterization vector, with the characterization vectors of all nodes having the same length.
For example, if a node has 3 attributes, 3 attribute vectors are obtained (an attribute vector is the characterization vector of an attribute). If the attribute vectors are (x1, x2, x3), (y1, y2, y3) and (z1, z2, z3), the characterization vector obtained after aggregation is

a·(x1, x2, x3) + b·(y1, y2, y3) + c·(z1, z2, z3),

where a, b and c are the weights of the 3 attribute vectors respectively. At initialization the values of a, b and c can be the same, but as training progresses the model automatically weighs the importance of each attribute vector according to the learning objective and adjusts the weight values.
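A sketch of this weighted aggregation with learnable weights follows; making the weights torch Parameters lets training adjust them as described, while the initial equal values and the dimensions are assumptions.

```python
# Sketch: aggregate K equal-length attribute vectors with learnable
# per-attribute weights (a, b, c, ...), initialized equally.
import torch
import torch.nn as nn

class WeightedAggregation(nn.Module):
    """Aggregate K attribute vectors of equal length with learnable weights."""
    def __init__(self, num_attributes):
        super().__init__()
        # Equal weights at initialization, adjusted during training.
        self.weights = nn.Parameter(torch.ones(num_attributes))

    def forward(self, attr_vectors):        # attr_vectors: (K, D)
        return (self.weights.unsqueeze(1) * attr_vectors).sum(dim=0)  # (D,)

agg = WeightedAggregation(num_attributes=3)
attrs = torch.randn(3, 64)                  # 3 attribute vectors of length 64
node_vector = agg(attrs)                    # characterization vector, length 64
```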
It should be noted that the number of attribute vectors in each node may differ, so the vector dimensions of nodes may differ; element-wise addition applies when the vector dimensions are equal, or when the characterization vectors of all nodes are not required to have the same length. If the characterization vectors of all nodes must have the same length, then for the case of differing dimensions, all attribute vectors can be spliced into one very long vector, which is then transformed to the specified dimension by a linear transformation, making the characterization vector of every node the same length.
For example, suppose an enterprise node contains the attributes registered capital, business introduction, and product picture, where registered capital exists in tabular form. The product picture is encoded with the attribute encoder model pre-trained on ImageNet to obtain a fixed-length picture vector, and the business introduction yields a fixed-length text vector via BERT/sent2vec. These fixed-length vectors and the tabular attribute (which does not need to be encoded into an attribute vector) are then spliced together and converted by a linear transformation module into a fixed-length vector of the length shared by all nodes, i.e., the characterization vector of the node. In other words, the linear transformation module produces a matrix such that matrix × original vector = new vector, and the new vector is the vector obtained after the lengths are unified.
The linear transformation module is part of the graph neural network model. Because the attributes used by each node may differ and the underlying feature dimensions (i.e., the number of attributes) may be inconsistent, the linear transformation module transforms each node's attribute features into node vectors of the same length. During backpropagation, since the addition uses equal weights, the gradient passes directly to the matrix parameters of the linear transformation, which are updated together with the rest of the model.
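A sketch of this splice-and-project step; the vector lengths and the 128-dimensional target size are illustrative assumptions.

```python
# Sketch: concatenate attribute vectors of different lengths and
# project to the node-vector dimension shared by all nodes.
import torch
import torch.nn as nn

picture_vec = torch.randn(4096)   # e.g. from the VGG16 encoder
text_vec = torch.randn(768)       # e.g. from the BERT encoder
capital_vec = torch.randn(1)      # tabular attribute, used as-is

concat = torch.cat([picture_vec, text_vec, capital_vec])  # one long vector
project = nn.Linear(concat.numel(), 128)  # learned matrix, trained jointly
node_vector = project(concat)             # fixed-length characterization vector
```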
S302: after the characterization vectors of the nodes are obtained in step S301, learn with the graph neural network model using the characterization vectors of the nodes.
In this step, a GCN (graph convolutional network) or GAT (graph attention network) model may be used for learning (other message-passing graph neural network models may also be used). Since graph neural network models with these message-passing mechanisms are a common technique, their specific training process is not described further.
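For concreteness, a minimal single-layer GCN sketch in plain PyTorch is given below, standing in for the GCN/GAT models mentioned above. It follows the standard GCN propagation ReLU(D^(-1/2) (A + I) D^(-1/2) H W); the toy graph and sizes are illustrative assumptions.

```python
# Sketch: one GCN layer with symmetric adjacency normalization.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))            # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
        return torch.relu(self.linear(A_norm @ H))  # propagate and transform

H = torch.randn(5, 128)          # characterization vectors of 5 nodes
A = torch.tensor([[0., 1., 0., 0., 1.],
                  [1., 0., 1., 0., 0.],
                  [0., 1., 0., 1., 0.],
                  [0., 0., 1., 0., 1.],
                  [1., 0., 0., 1., 0.]])
layer = GCNLayer(128, 64)
H_next = layer(H, A)             # updated node representations
```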
S303: update the aggregation mode of each node's attribute characterization vectors based on the learning result, e.g., update the weight of each attribute vector.
At initialization, the weight values of the attribute vectors can be the same, but as training progresses, the model automatically weighs the importance of each attribute vector according to the learning purpose and adjusts, i.e., updates, the weight values.
S304: update the characterization vector of each node based on the updated aggregation mode. That is, aggregate all attribute characterization vectors with the updated aggregation mode to obtain each node's updated characterization vector.
S305: judge whether learning is finished, e.g., whether the prediction accuracy reaches a set threshold; if so, learning is finished, otherwise it is not. If learning is finished, proceed to step S307; if not, proceed to step S306.
S306: learn with the updated characterization vectors of the nodes using the graph neural network model adopted in step S302, then return to step S303.
S307: take the characterization vector of each node obtained in the last update as the characterization vector of that node in the graph network.
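Pulling the steps together, a condensed sketch of the S300-S307 loop might look as follows, reusing the WeightedAggregation and GCNLayer classes from the sketches above; the classification head, accuracy threshold, and random data are assumptions, not the patent's prescribed training recipe.

```python
# Sketch: aggregate attribute vectors, learn with the graph model, and
# keep updating the aggregation weights until accuracy suffices.
import torch
import torch.nn as nn

agg = WeightedAggregation(num_attributes=3)   # from the earlier sketch
gcn = GCNLayer(64, 32)                        # from the earlier sketch
head = nn.Linear(32, 5)                       # e.g. 5 industry classes
optimizer = torch.optim.Adam(
    list(agg.parameters()) + list(gcn.parameters()) + list(head.parameters()))
loss_fn = nn.CrossEntropyLoss()

A = torch.ones(5, 5) - torch.eye(5)           # toy 5-node adjacency matrix
attrs = torch.randn(5, 3, 64)                 # 5 nodes, 3 attribute vectors each
labels = torch.randint(0, 5, (5,))

for epoch in range(100):                      # S302-S306: learn and update
    H = torch.stack([agg(a) for a in attrs])  # S301/S304: aggregate per node
    logits = head(gcn(H, A))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()                           # updates aggregation weights too
    optimizer.step()
    if (logits.argmax(1) == labels).float().mean() >= 0.95:
        break                                 # S305: accuracy threshold reached
# S307: rows of H from the last update are the final characterization vectors
```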
After the graph network construction device constructs the graph network, an application terminal can obtain the graph network from it to perform the computation of a specified task. For example, to classify the industry of an enterprise based on the graph network, only the vector elements representing the enterprise's business need to be extracted from the node's characterization vector for computation, so the computational load is very small. Likewise, to predict an enterprise's tax payment capability based on the graph network, only the corresponding vector elements (such as turnover, cost accounting, enterprise scale, etc.) need to be extracted from the node's characterization vector at once for computation, which greatly improves processing efficiency.
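A hypothetical sketch of this application-terminal usage; the index positions standing for turnover, cost, and enterprise scale are illustrative assumptions.

```python
# Sketch: slice only the task-relevant elements out of a node's
# characterization vector for downstream computation.
import numpy as np

node_vector = np.random.rand(128)       # characterization vector of one node
TAX_FEATURE_IDX = [10, 11, 42]          # e.g. turnover, cost, enterprise scale
tax_features = node_vector[TAX_FEATURE_IDX]
# Tax-capability prediction then operates on this small slice only.
```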
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could easily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A graph network based data processing system, comprising:
a data collection device, which crawls basic information of each individual from public web pages using crawler technology and stores it in a database; and
a graph network construction device, which extracts information from the database, constructs a graph network with individuals as nodes, connects nodes that have an association relation, takes the basic information of an individual as several attributes of its node, and encodes those attributes into a vector serving as the characterization vector of the node.
2. The graph network-based data processing system of claim 1, wherein the graph network constructing apparatus encodes each of the plurality of attributes into a vector for each node when constructing the graph network, obtains a plurality of attribute vectors, and then aggregates the plurality of attribute vectors to obtain the characterization vector of the node.
3. The data processing system of claim 2, wherein the graph network construction device aggregates the plurality of attribute vectors in an equal weight superposition manner to obtain the characterization vectors of the nodes.
4. The graph network-based data processing system of claim 2, wherein the graph network construction device aggregates the plurality of attribute vectors based on a specified learning task using a graph neural network model, thereby obtaining the characterization vectors of the nodes.
5. The graph network based data processing system of claim 4, wherein the characterization vectors of the nodes are of equal length.
6. The graph network based data processing system of claim 2, wherein the graph network constructing means encodes specified ones of the plurality of attributes as attribute vectors.
7. The graph network based data processing system of claim 2, wherein the graph network constructing means encodes the attributes into attribute vectors by a pre-trained attribute encoder.
8. The graph network-based data processing system of claim 1, wherein the basic information comprises individual identification and business activity information, and the data collection device associates the individual identification with the business activity information and stores the association.
9. The graph network-based data processing system according to claim 1, further comprising a plurality of application terminals, wherein the application terminals acquire the constructed graph network from the graph network construction device so as to perform computation of a specified task based on the graph network.
CN202010647976.3A 2020-07-07 2020-07-07 Data processing system based on graph network Pending CN111563191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010647976.3A CN111563191A (en) 2020-07-07 2020-07-07 Data processing system based on graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010647976.3A CN111563191A (en) 2020-07-07 2020-07-07 Data processing system based on graph network

Publications (1)

Publication Number Publication Date
CN111563191A true CN111563191A (en) 2020-08-21

Family

ID=72073952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010647976.3A Pending CN111563191A (en) 2020-07-07 2020-07-07 Data processing system based on graph network

Country Status (1)

Country Link
CN (1) CN111563191A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341719A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Neural Bit Embeddings for Graphs
CN109597856A (en) * 2018-12-05 2019-04-09 北京知道创宇信息技术有限公司 A kind of data processing method, device, electronic equipment and storage medium
CN110866190A (en) * 2019-11-18 2020-03-06 支付宝(杭州)信息技术有限公司 Method and device for training neural network model for representing knowledge graph

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869904A (en) * 2021-08-16 2021-12-31 工银科技有限公司 Suspicious data identification method, device, electronic equipment, medium and computer program

Similar Documents

Publication Publication Date Title
US20190392258A1 (en) Method and apparatus for generating information
CN112231569B (en) News recommendation method, device, computer equipment and storage medium
US11475227B2 (en) Intelligent routing services and systems
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
CN111274330A (en) Target object determination method and device, computer equipment and storage medium
CN110457585B (en) Negative text pushing method, device and system and computer equipment
US20230169271A1 (en) System and methods for neural topic modeling using topic attention networks
CN113887237A (en) Slot position prediction method and device for multi-intention text and computer equipment
CN116821372A (en) Knowledge graph-based data processing method and device, electronic equipment and medium
CN114519613B (en) Price data processing method and device, electronic equipment and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN112598039B (en) Method for obtaining positive samples in NLP (non-linear liquid) classification field and related equipment
CN114385694A (en) Data processing method and device, computer equipment and storage medium
CN116186295B (en) Attention-based knowledge graph link prediction method, attention-based knowledge graph link prediction device, attention-based knowledge graph link prediction equipment and attention-based knowledge graph link prediction medium
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN111563191A (en) Data processing system based on graph network
CN111382232A (en) Question and answer information processing method and device and computer equipment
US20230085599A1 (en) Method and device for training tag recommendation model, and method and device for obtaining tag
CN111538895A (en) Data processing system based on graph network
WO2021042517A1 (en) Artificial intelligence-based article gist extraction method and device, and storage medium
Wei et al. A Siamese network framework for bank intelligent Q&A prediction
CN113792549B (en) User intention recognition method, device, computer equipment and storage medium
CN113256395B (en) Product recommendation method, device, equipment and storage medium based on recommendation graph network
Bagavandas et al. Neural Computation in Authorship Attribution: The Case of Selected Tamil Articles∗

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200821