CN116578613B

CN116578613B - Data mining system for big data analysis

Info

Publication number: CN116578613B
Application number: CN202310855939.5A
Authority: CN
Inventors: 金萍; 葛浩然; 宗瑜
Original assignee: Hefei Shangchuang Information Technology Co ltd; West Anhui University
Current assignee: Hefei Shangchuang Information Technology Co ltd; West Anhui University
Priority date: 2023-07-13
Filing date: 2023-07-13
Publication date: 2023-09-08
Anticipated expiration: 2043-07-13
Also published as: CN116578613A

Abstract

The invention relates to the technical field of big data and discloses a data mining system for big data analysis, comprising: a data preprocessing module, which identifies the entities contained in government affair data, generates a government affair data map based on the relations between the entities, and generates an entity vector for each entity; a graph generation module, which generates a graph matrix based on the government affair data map; a region generation module, which equally divides the graph matrix into N non-overlapping sub-regions and sequentially splices the region parameters of all sub-regions to generate a sub-region vector; and a data processing module, which inputs the graph matrix, the sub-region vector and the entity vectors into a data encoding model and outputs classification labels related to the data mining target. By learning the graph matrix and the sampling region parameters, the invention compensates for missing association relations between entities and is suitable for data mining of government affair big data.

Description

Data mining system for big data analysis

Technical Field

The invention relates to the technical field of big data, in particular to a data mining system for big data analysis.

Background

Government affair big data refers to the mass data collected, integrated, analyzed and utilized by government departments. It has three characteristics. 1. Large scale: government affair big data is managed in units of TB and PB, a scale far beyond that of ordinary enterprises. 2. Complex and diverse data: it contains data in different formats, such as structured, unstructured and semi-structured data. 3. Multidimensional association: new knowledge and rules can be found by analyzing the associations between different dimensions. The government departments from which this data originates have complex organizational relations, so association relations between government affair big data that are found by a rule-based method or a general machine learning method are difficult to apply successfully to specific data mining tasks.

Disclosure of Invention

The invention provides a data mining system for big data analysis, which solves the technical problem in the related art that association relations between government affair big data found by a rule-based method or a general machine learning method are difficult to apply to specific data mining tasks.

The invention provides a data mining system for big data analysis, comprising: a data preprocessing module, which identifies the entities contained in government affair data, generates a government affair data map based on the relations between the entities, and generates an entity vector for each entity; a graph generation module, which generates a graph matrix based on the government affair data map, wherein each element of the graph matrix has the value 0 or 1 and indicates whether a relation exists between two entities; and a region generation module, which equally divides the graph matrix into N non-overlapping sub-regions, each sub-region having a size of M×M elements.

The row and column numbers of the elements at the lower left corner and the upper right corner of each divided sub-region are extracted to generate a region parameter, expressed as $(x_1, y_1, x_2, y_2)$, where $x_1$ and $y_1$ respectively denote the column number and row number of the element at the lower left corner of the sub-region, and $x_2$ and $y_2$ respectively denote the column number and row number of the element at the upper right corner of the sub-region; the region parameters of all sub-regions are sequentially spliced to generate a sub-region vector; and a data processing module, which inputs the graph matrix, the sub-region vector and the entity vectors into a data encoding model, the data encoding model comprising: a first linear layer, a first sampling layer, a first hidden layer and a first fully connected layer.

The first linear layer is calculated as follows: [formula not reproduced in the source text], where the terms of the formula denote, in order: the sub-region vector; the sampling region vector; the weight parameter of the first linear layer; rounding up to an integer; and the activation function, for which a sine function or a hyperbolic tangent function is selected.

The sampling area vector is input into a first sampling layer, the first sampling layer restores the sampling area vector into the area parameter of the sampling area, a corresponding sampling area is generated in the graph matrix, and empty elements in the graph matrix corresponding to the sampling area are filled with 1.

The graph matrix updated by the first sampling layer and the entity vectors are input into the first hidden layer, which updates the entity vectors to obtain weighted entity vectors; the graph matrix and the weighted entity vectors are then input into the first fully connected layer, which outputs a classification label related to the data mining target.

Further, the element in the i-th row and j-th column of the graph matrix is denoted $a_{ij}$; $a_{ij} = 0$ indicates that there is no relation between the i-th and j-th entities, and $a_{ij} = 1$ indicates that there is a relation between the i-th and j-th entities.

Further, if the graph matrix cannot be equally divided, edge filling is performed on the graph matrix, and after each edge filling is completed, the number of rows and columns of the graph matrix is increased by 1, and the edge filling is performed until the graph matrix can be equally divided.

Further, the definition of the region parameters of the sampling region is the same as the definition of the region parameters of the sub-region; if the region parameter of the sampling region is negative, it indicates that the size of the sampling region is 0.

Further, the calculation formula of the first hidden layer is as follows: $H' = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H W\right)$, where $H$ and $H'$ respectively denote the tensor matrices of the entity vectors and the weighted entity vectors, $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W$ denotes the weight matrix of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

Further, the first hidden layer has a multi-layer structure, and the calculation formula is as follows: $H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l+1)}\right)$, where $H^{(l+1)}$ and $H^{(l)}$ respectively denote the output and input of layer l+1 of the first hidden layer; when l = 1, $H^{(l)}$ denotes the tensor matrix of the entity vectors, and when l > 1, $H^{(l)}$ denotes the output of layer l of the first hidden layer; $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W^{(l+1)}$ denotes the weight matrix of layer l+1 of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

Further, the first linear layer includes a plurality of input channels, each input channel receiving a graph matrix with a different entity arrangement order; the first sampling layer generates a corresponding sampling region on each graph matrix and updates each input graph matrix; the first hidden layer has one channel for each graph matrix, each channel outputs a group of weighted entity vectors, and the weighted entity vectors of all channels are added and divided by the number of channels to give the weighted entity vectors input into the fully connected layer.

Further, the entities identified in the government data include government staff post names, names of internal agencies of government agencies, government work task names, policy file names, and the like.

A second sampling layer is arranged before the first fully connected layer. It extracts the entities from the record of one government work task of a government staff member, extracts from the weighted entity vectors those corresponding to the extracted entities to generate a first entity set, performs feature fusion on the first entity set, and inputs the result into the first fully connected layer; the classification label output by the first fully connected layer represents the offline workload of that government work task.

Further, the system further comprises an online evaluation module and a total workload evaluation module, wherein the online evaluation module collects online work behavior information of one government staff for one government work task and then generates online feature vectors representing online work behaviors.

The online work behavior information of a government staff member for one government work task comprises the number of browsed pages related to the government work task, the number of words input on those pages, and the policy file name entities related to the government work task; the number of browsed pages and the number of input words are encoded to obtain page features and word features, and the page features, the word features and the entity vectors of the policy file name entities related to the government work task are spliced to obtain a first spliced vector.

The first spliced vector is then input into a multi-layer perceptron whose classification label is the online workload time; the features fed into the output layer of the multi-layer perceptron are extracted as the online feature vector.

The total workload assessment module splices the online feature vector with the feature vector obtained by feature fusion of the first entity set to obtain a second spliced vector, and inputs the second spliced vector into a total workload assessment model, which outputs a classification label representing a score or rating of the total workload.

Further, the system also comprises a portrait vector module and a vector index module, wherein the portrait vector module extracts the weighted entity vector of a portrait object name output by the first hidden layer as the portrait vector of the portrait object; the entities identified in the government data include photo images, portrait object names, person tracks, policy file names, government agency names, and government staff post names.

The first fully connected layer outputs two classification labels, corresponding respectively to the portrait object having a record of illegal behavior and the portrait object having no record of illegal behavior.

The vector index module clusters the portrait vectors of all generated portrait objects, takes each cluster center as a code vector, and maps each code vector to the portrait vectors in its cluster.

The beneficial effects of the invention are as follows: by learning the graph matrix and the sampling region parameters, the invention compensates for missing association relations between entities, making up for relations between data entities that a specific data mining task requires but lacks; the model can achieve high accuracy with a small number of hidden layers, so it is suitable for data mining of government affair big data.

Drawings

FIG. 1 is a block diagram of a data mining system for big data analysis according to the present invention.

FIG. 2 is a block diagram of a data mining system for big data analysis according to the present invention.

FIG. 3 is a block diagram of a data mining system for big data analysis according to the present invention.

In the figures: a data preprocessing module 101, a graph generation module 102, a region generation module 103, a data processing module 104, a total workload assessment module 201, an online evaluation module 202, a portrait vector module 301 and a vector index module 302.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.

As shown in fig. 1, a data mining system for big data analysis, comprising: the data preprocessing module 101 identifies entities included in the government data based on the government data, generates a government data map based on a relationship between the entities, and generates entity vectors for the entities.

For government data of text type, entities are identified by Named Entity Recognition (NER), and the entity vectors can be generated by encoding with a term frequency-inverse document frequency (TF-IDF) model, a document vector model (Doc2vec), and the like.

The relations between entities can be determined manually, by bootstrapping algorithms, and the like.

The graph generation module 102 generates a graph matrix based on the government affair data graph, wherein the value of an element in the graph matrix is 0 or 1, and the graph matrix indicates whether a relationship exists between entities.

The element in the i-th row and j-th column of the graph matrix is denoted $a_{ij}$; $a_{ij} = 0$ indicates that there is no relation between the i-th and j-th entities, and $a_{ij} = 1$ indicates that there is a relation between the i-th and j-th entities.
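The construction of the graph matrix from identified relations can be sketched as follows. This is a minimal illustration rather than the patented implementation: the entity names are hypothetical, and treating every relation as undirected (a symmetric matrix) is an assumption that the text does not state.

```python
def build_graph_matrix(entities, relations):
    """Return a 0/1 matrix whose entry [i][j] is 1 when entities i and j are related."""
    index = {name: i for i, name in enumerate(entities)}
    n = len(entities)
    matrix = [[0] * n for _ in range(n)]
    for a, b in relations:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: relations are treated as undirected
    return matrix


# Hypothetical entity names, for illustration only.
entities = ["clerk", "tax office", "annual report review", "policy file A"]
relations = [("clerk", "tax office"), ("clerk", "annual report review")]
graph_matrix = build_graph_matrix(entities, relations)
```

Rows and columns follow the order of the entity list, so the serial number of an entity in the list is also its row and column index in the matrix.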

The region generation module 103 equally divides the graph matrix into N non-overlapping sub-regions, each sub-region having a size of M×M elements.

If the graph matrix cannot be equally divided, edge filling is performed on the graph matrix; each edge filling increases the number of rows and the number of columns of the graph matrix by 1, and edge filling is repeated until the graph matrix can be equally divided.

$M = \left\lceil \sqrt{R/N} \right\rceil$, where R denotes the total number of elements of the graph matrix, M is the number of columns of a sub-region, and $\lceil \cdot \rceil$ denotes rounding up. N is a user-defined hyperparameter and generally takes the value 100.
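The edge filling and equal division can be sketched as below. Several points are assumptions made only for this sketch: N is taken to be a perfect square so that the sub-regions tile a square grid, the filled elements are zeros, and each filling step appends one row at the bottom and one column at the right.

```python
import math


def pad_until_divisible(matrix, n_regions):
    """Pad a square matrix one row and one column at a time until it splits
    evenly into n_regions square sub-regions; return (padded matrix, M)."""
    k = math.isqrt(n_regions)
    if k * k != n_regions:
        raise ValueError("this sketch assumes N is a perfect square")
    padded = [row[:] for row in matrix]
    while len(padded) % k != 0:
        for row in padded:
            row.append(0)                        # new right-hand column
        padded.append([0] * (len(padded) + 1))   # new bottom row
    return padded, len(padded) // k


original = [[0] * 5 for _ in range(5)]
padded, m = pad_until_divisible(original, n_regions=4)
```

Here a 5×5 matrix with N = 4 is padded once to 6×6, giving four sub-regions of 3×3 elements. The input matrix is copied, so the unpadded version remains available for later stages.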

The row and column numbers of the elements at the lower left corner and the upper right corner of each divided sub-region are extracted to generate a region parameter, expressed as $(x_1, y_1, x_2, y_2)$, where $x_1$ and $y_1$ respectively denote the column number and row number of the element at the lower left corner of the sub-region, and $x_2$ and $y_2$ respectively denote the column number and row number of the element at the upper right corner of the sub-region.

The region parameters of all sub-regions are sequentially spliced to generate a sub-region vector. The splicing order is ascending order of the upper right corner elements of the sub-regions.

The row and column numbers are the row and column positions of an element in the graph matrix; for example, the element in the first row and first column has row number 1 and column number 1.
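A possible way to generate the sub-region vector is sketched below. Each region parameter is stored as four numbers: the column and row of the lower left element, then the column and row of the upper right element, all counted from 1. Visiting the sub-regions row by row, left to right, is an assumed reading of the splicing order, and the function and variable names are hypothetical.

```python
def sub_region_vector(side, m):
    """Concatenate (col_ll, row_ll, col_ur, row_ur) for every m-by-m block
    of a side-by-side matrix, using 1-based row and column numbers."""
    vector = []
    blocks = side // m
    for block_row in range(blocks):
        for block_col in range(blocks):
            col_ll = block_col * m + 1      # leftmost column of the block
            row_ll = (block_row + 1) * m    # bottom row of the block
            col_ur = (block_col + 1) * m    # rightmost column of the block
            row_ur = block_row * m + 1      # top row of the block
            vector.extend([col_ll, row_ll, col_ur, row_ur])
    return vector


vec = sub_region_vector(side=6, m=3)
```

For a 6×6 matrix with 3×3 sub-regions the vector has 16 entries, four per sub-region.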

A data processing module 104 for inputting the graph matrix, the sub-region vector, and the entity vector into a data encoding model, the data encoding model comprising: a first linear layer, a first sampling layer, a first hidden layer, and a first fully connected layer.

The first linear layer is calculated as follows: [formula not reproduced in the source text], where the terms of the formula denote, in order: the sub-region vector; the sampling region vector; the weight parameter of the first linear layer; rounding up to an integer; and the activation function, for which a sine function or a hyperbolic tangent function is selected.

Through the calculation of the first linear layer, a sampling region scaled on the basis of the sub-region is obtained.

The definition of the regional parameters of the sampling region is the same as the definition of the regional parameters of the sub-region; if the region parameter of the sampling region is negative, it indicates that the size of the sampling region is 0.
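The behavior of the first sampling layer, which restores region parameters from the sampling region vector and fills the corresponding elements of the graph matrix with 1, can be sketched as follows. The four-number layout of each region parameter (column and row of the lower left element, then of the upper right element, counted from 1) and the clipping of regions to the matrix bounds are assumptions of this sketch.

```python
def apply_sampling_regions(matrix, sampling_vector):
    """Return a copy of the matrix in which every element inside each
    sampling region is set to 1; existing 1s are left unchanged."""
    n = len(matrix)
    updated = [row[:] for row in matrix]
    for k in range(0, len(sampling_vector), 4):
        col_ll, row_ll, col_ur, row_ur = sampling_vector[k:k + 4]
        if min(col_ll, row_ll, col_ur, row_ur) < 0:
            continue  # a negative parameter means the region has size 0
        # Assumption: regions reaching outside the matrix are clipped.
        top, bottom = max(row_ur, 1), min(row_ll, n)
        left, right = max(col_ll, 1), min(col_ur, n)
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                updated[r - 1][c - 1] = 1
    return updated


empty = [[0] * 4 for _ in range(4)]
filled = apply_sampling_regions(empty, [1, 2, 2, 1, -1, 4, 4, 3])
```

The first region covers rows 1 to 2 and columns 1 to 2; the second contains a negative parameter and is skipped.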

If the graph matrix was edge-filled, the edge-filled portion is deleted from the graph matrix before it is input into the first hidden layer.

The calculation formula of the first hidden layer is as follows: $H' = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H W\right)$, where $H$ and $H'$ respectively denote the tensor matrices of the entity vectors and the weighted entity vectors, $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix of the same size, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W$ denotes the weight matrix of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

Further, the first hidden layer has a multi-layer structure, and the calculation formula is as follows: $H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l+1)}\right)$, where $H^{(l+1)}$ and $H^{(l)}$ respectively denote the output and input of layer l+1 of the first hidden layer; when l = 1, $H^{(l)}$ denotes the tensor matrix of the entity vectors, and when l > 1, $H^{(l)}$ denotes the output of layer l of the first hidden layer; $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix of the same size, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W^{(l+1)}$ denotes the weight matrix of layer l+1 of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

The row vector of the tensor matrix of entity vectors represents an entity vector, and the serial number of the row vector is consistent with the serial number of the entity in the graph matrix. The tensor matrix of the weighted entity vector is represented in the same way as the tensor matrix of the entity vector.
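One layer of the first hidden layer can be sketched without external libraries as below. The sketch assumes the layer takes the common graph-convolution form, ReLU applied to the symmetrically normalized matrix (graph matrix plus identity) multiplied by the entity vector matrix and the weight matrix, which matches the terms defined above. The helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hidden_layer(graph_matrix, entity_vectors, weight):
    """Compute ReLU(D^-1/2 (A + I) D^-1/2 H W) for one layer."""
    n = len(graph_matrix)
    # Sum of the graph matrix and the identity matrix of the same size.
    a_tilde = [[graph_matrix[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    # Diagonal of the degree matrix raised to the power -1/2.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    normalized = [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j]
                   for j in range(n)] for i in range(n)]
    pre_activation = matmul(matmul(normalized, entity_vectors), weight)
    return [[max(0.0, v) for v in row] for row in pre_activation]


A = [[0, 1], [1, 0]]
H = [[1.0, 2.0], [3.0, 4.0]]
W = [[-1.0, 0.0], [0.0, 1.0]]
H_weighted = hidden_layer(A, H, W)
```

Each row of the result is the weighted entity vector of the entity with the same serial number. A multi-layer structure would call `hidden_layer` repeatedly, passing the previous output as `entity_vectors` with a separate weight matrix per layer.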

In a further embodiment, the first linear layer comprises a plurality of input channels, each input channel receiving a graph matrix with a different entity arrangement order; in particular, the entity arrangement order of a graph matrix may be random or set manually.

The first sampling layer generates a corresponding sampling area on each graph matrix respectively and updates each input graph matrix; the first hidden layer has a channel corresponding to each graph matrix, each channel outputs a group of weighted entity vectors, and the weighted entity vectors of the channels are added and divided by the channel number to be used as the weighted entity vectors of the input full-connection layer.

Every channel of the first hidden layer receives the same entity vectors as input, but a different graph matrix.
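The channel averaging step can be illustrated with a short sketch. It assumes every channel returns a matrix of the same shape whose rows refer to the same entities in the same order; the function name is hypothetical.

```python
def average_channels(channel_outputs):
    """Element-wise mean of the weighted entity vector matrices of all channels."""
    count = len(channel_outputs)
    rows = len(channel_outputs[0])
    cols = len(channel_outputs[0][0])
    return [[sum(ch[r][c] for ch in channel_outputs) / count for c in range(cols)]
            for r in range(rows)]


channel_a = [[1.0, 2.0], [3.0, 4.0]]
channel_b = [[3.0, 4.0], [5.0, 6.0]]
averaged = average_channels([channel_a, channel_b])
```

The averaged matrix is what would be passed on to the fully connected layer.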

The data coding model is trained by adopting a training method of a neural network model.

By learning the graph matrix and the sampling region parameters, the data mining system for big data analysis gains the ability to discover unknown relations between entities and makes up for relations between data entities that a specific data mining task requires but lacks; the model can achieve high accuracy with a small number of hidden layers and is suitable for data mining of government affair big data.

If the entity vectors have inconsistent dimensions, one or more linear layers may be provided before the first hidden layer to map the entity vectors to the same dimension.

The data mining system for big data analysis also comprises a database for storing government data, graph matrix and the like.

As shown in fig. 2, in a specific embodiment, the data mining system for big data analysis is applied to measuring the workload of government staff, and the goal of data mining is to evaluate that workload. The entities identified in the government data include government staff post names, government agency names, names of internal agencies of government agencies, government work task names, policy file names, and the like.

For example, the name of a government work task may be reviewing the annual reports of enterprises in the jurisdiction; the content of such a task is easy to obtain on a government platform that distributes tasks through a system.

One method of feature fusion for the first entity set is: [formula not reproduced in the source text], where the indexed term of the formula denotes the e-th weighted entity vector in the first entity set, and K denotes the total number of weighted entity vectors in the first entity set.

For example, the classification labels correspond to time values representing the offline working time of the government work task; the online working time of the government work task is easy to count, and the sum of the online and offline working times can be used as the total time of the government work task.

One problem this system overcomes is that a government work task may be performed by several government staff members; the system takes into account the coordination between government staff and the workflow between government departments.

One problem with this embodiment is that evaluating the workload of a government work task by total working time alone is one-sided. On this basis, an online evaluation module 202 and a total workload assessment module 201 are added, wherein the online evaluation module 202 collects the online work behavior information of a government staff member for one government work task and then generates an online feature vector characterizing the online work behavior.

The total workload assessment module 201 splices the online feature vector with the feature vector obtained by feature fusion of the first entity set to obtain a second spliced vector, and inputs the second spliced vector into a total workload assessment model, which outputs a classification label representing a score or rating of the total workload.

For example, the evaluation levels are classified into large, medium, and small, and correspond to the outputs of one total workload evaluation model, respectively.

The type of total workload assessment model is a neural network model.

The online work behavior information of a government staff member for one government work task comprises the number of browsed pages related to the government work task, the number of words input on those pages, the policy file name entities related to the government work task, and the like; the number of browsed pages and the number of input words are encoded to obtain page features and word features, and the page features, the word features and the entity vectors of the policy file name entities related to the government work task are spliced to obtain a first spliced vector.

The first spliced vector is then input into a multi-layer perceptron (MLP) whose classification label is the online workload time; the features fed into the output layer of the multi-layer perceptron are extracted as the online feature vector.
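Splicing the features and reading the online feature vector from the multi-layer perceptron can be sketched as follows. The single hidden layer, the ReLU activation, and all numeric values are assumptions made for illustration; the text does not fix the structure of the perceptron or the encoding of the page and word counts.

```python
def dense(x, weights, bias):
    """Affine map: weights has one row per input and one column per output."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weights), bias)]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Return (output, hidden); hidden is what the output layer receives."""
    hidden = [max(0.0, v) for v in dense(x, w_hidden, b_hidden)]  # ReLU assumed
    output = dense(hidden, w_out, b_out)
    return output, hidden


# Hypothetical encoded features for one task.
page_features = [1.0]
word_features = [2.0]
policy_entity_vector = [0.5, 0.5]
first_spliced = page_features + word_features + policy_entity_vector

w_hidden = [[1.0, -1.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_hidden = [0.0, 0.0]
w_out = [[1.0], [1.0]]
b_out = [0.5]
prediction, online_feature_vector = mlp_forward(
    first_spliced, w_hidden, b_hidden, w_out, b_out)
```

In this reading, `prediction` is trained against the online workload time, while `online_feature_vector` is the intermediate representation reused by the total workload assessment module.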

Building on this embodiment, to obtain an overall evaluation of the workload of a government staff member over a period of time, the total workload assessment module 201 uses a recurrent neural network (RNN): the second spliced vectors of the government work tasks performed by the staff member during the period are input into the total workload assessment module 201 sequentially in time order, and after the last input the module outputs a classification label representing the score or rating of the staff member's total workload over the period. That is, when input proceeds in time steps, the classification label is output at the last time step.

The purpose of the workload evaluation can be specifically to adjust the rest time of government staff or evaluate the performance of the government staff.

As shown in fig. 3, in another specific embodiment, the data mining system for big data analysis is applied to the creation of portraits. The traditional way of creating a portrait in government affairs is to store the data related to the portrait object in structured form; structured storage easily leaves the portrait incomplete and requires a large amount of manual effort to search for the corresponding data. In this embodiment, the goal of data mining is the vectorization of portraits; the entities identified in the government data include photo images, portrait object names, person tracks, policy file names, government agency names, government staff post names, and the like.

A person track entity is recorded as a time-ordered sequence of addresses, so its entity vector is obtained by semantic encoding.

A data mining system for big data analysis includes a portrait vector module 301, the portrait vector module 301 extracting a weighted entity vector of a portrait object name output by a first hidden layer as a portrait vector of the portrait object.

The first fully connected layer outputs two classification labels, corresponding respectively to the portrait object having a record of illegal behavior and the portrait object having no record of illegal behavior; the label for a portrait object is the output obtained when the weighted entity vector of the portrait object is input into the first fully connected layer.

Generally, the person track data recorded in government affair big data is information related to specific behaviors of the portrait object, so this method of creating portraits yields a vectorized representation of both behavior and portrait features; the information of portrait objects with related portraits can be obtained through the vector index, and the behavior type of a target portrait can be evaluated.

The data mining system for big data analysis further comprises a vector index module 302, which clusters the portrait vectors of all generated portrait objects, takes each cluster center as a code vector, and maps each code vector to the portrait vectors in its cluster.

For a new portrait object generated after the big data is updated, the similarity between its portrait vector and each code vector is calculated, and the code vector with the maximum similarity is taken as the neighboring code vector; the information of the portrait objects corresponding to the portrait vectors mapped to the neighboring code vector is extracted as the index result, or those portrait vectors are taken directly as the index result.
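The vector index lookup can be sketched as below. The clustering step itself is taken as already done (the text does not name a clustering algorithm), cosine similarity is an assumed choice of similarity measure, and the object names and vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Mean vector of a cluster, used here as the code vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_index(clusters):
    """clusters maps a cluster id to a list of (object_name, portrait_vector)."""
    return [(centroid([vec for _, vec in members]), members)
            for members in clusters.values()]


def query(index, new_vector):
    """Return the names mapped to the code vector most similar to new_vector."""
    _, members = max(index, key=lambda entry: cosine(entry[0], new_vector))
    return [name for name, _ in members]


clusters = {
    0: [("object A", [1.0, 0.0]), ("object B", [0.9, 0.1])],
    1: [("object C", [0.0, 1.0])],
}
index = build_index(clusters)
neighbors = query(index, [0.8, 0.2])
```

Comparing a new portrait vector only against the code vectors keeps the lookup cost proportional to the number of clusters rather than the number of portrait objects.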

The embodiments have been described above, but they are not limited to the specific implementations described, which are illustrative rather than restrictive; many variations that those of ordinary skill in the art can make with the benefit of this disclosure fall within the scope of the embodiments.

Claims

1. A data mining system for big data analysis, comprising: a data preprocessing module, which identifies the entities contained in government affair data, generates a government affair data map based on the relations between the entities, and generates an entity vector for each entity; a graph generation module, which generates a graph matrix based on the government affair data map, wherein each element of the graph matrix has the value 0 or 1 and indicates whether a relation exists between two entities; a region generation module, which equally divides the graph matrix into N non-overlapping sub-regions, each sub-region having a size of M×M elements;

the row and column numbers of the elements at the lower left corner and the upper right corner of each divided sub-region are extracted to generate a region parameter, expressed as $(x_1, y_1, x_2, y_2)$, where $x_1$ and $y_1$ respectively denote the column number and row number of the element at the lower left corner of the sub-region, and $x_2$ and $y_2$ respectively denote the column number and row number of the element at the upper right corner of the sub-region; the region parameters of all sub-regions are sequentially spliced to generate a sub-region vector; a data processing module, which inputs the graph matrix, the sub-region vector and the entity vectors into a data encoding model, the data encoding model comprising: a first linear layer, a first sampling layer, a first hidden layer, and a first fully connected layer;

the first linear layer is calculated as follows: [formula not reproduced in the source text], where the terms of the formula denote, in order: the sub-region vector; the sampling region vector; the weight parameter of the first linear layer; rounding up to an integer; and the activation function, for which a sine function or a hyperbolic tangent function is selected;

the sampling area vector is input into a first sampling layer, the first sampling layer restores the sampling area vector into the area parameter of the sampling area, a corresponding sampling area is generated in the graph matrix, and empty elements in the graph matrix corresponding to the sampling area are filled with 1;

2. A data mining system for big data analysis according to claim 1, wherein the element in the i-th row and j-th column of the graph matrix is denoted $a_{ij}$; $a_{ij} = 0$ indicates that there is no relation between the i-th and j-th entities, and $a_{ij} = 1$ indicates that there is a relation between the i-th and j-th entities.

3. The data mining system for big data analysis of claim 1, wherein if the graph matrix is not equally divided, the graph matrix is edge-filled, and the number of rows and columns of the graph matrix is increased by 1 after each edge filling is completed, and the edge filling is performed until the graph matrix is equally divided.

4. A data mining system for big data analysis according to claim 1, characterized in that the definition of the region parameters of the sampling region is the same as the definition of the region parameters of the sub-region; if the region parameter of the sampling region is negative, it indicates that the size of the sampling region is 0.

5. The data mining system for big data analysis of claim 1, wherein the first hidden layer is calculated as: $H' = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H W\right)$, where $H$ and $H'$ respectively denote the tensor matrices of the entity vectors and the weighted entity vectors, $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W$ denotes the weight matrix of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

6. The data mining system for big data analysis of claim 1, wherein the first hidden layer is a multi-layer structure, and the calculation formula is as follows: $H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l+1)}\right)$, where $H^{(l+1)}$ and $H^{(l)}$ respectively denote the output and input of layer l+1 of the first hidden layer; when l = 1, $H^{(l)}$ denotes the tensor matrix of the entity vectors, and when l > 1, $H^{(l)}$ denotes the output of layer l of the first hidden layer; $\tilde{A}$ denotes the sum of the graph matrix and the identity matrix, $\tilde{D}$ denotes the degree matrix of $\tilde{A}$, $W^{(l+1)}$ denotes the weight matrix of layer l+1 of the first hidden layer, and $\sigma$ denotes the ReLU activation function.

7. The data mining system for big data analysis of claim 1, wherein the first linear layer comprises a plurality of input channels, each input channel comprising a different entity arrangement sequence of graph matrices; the first sampling layer generates a corresponding sampling area on each graph matrix respectively and updates each input graph matrix; the first hidden layer has a channel corresponding to each graph matrix, each channel outputs a group of weighted entity vectors, and the weighted entity vectors of the channels are added and divided by the channel number to be used as the weighted entity vectors of the input full-connection layer.

8. The data mining system for big data analysis of claim 1, wherein the entities identified in the government data include government staff post names, names of internal agencies of government agencies, government work task names, and policy file names;

9. The data mining system for big data analysis of claim 8, further comprising an online assessment module and a total workload assessment module, wherein the online assessment module collects online work behavior information of a government staff member for a government work task and then generates an online feature vector characterizing the online work behavior;

the online work behavior information of a government staff member for one government work task comprises the number of browsed pages related to the government work task, the number of words input on those pages, and the policy file name entities related to the government work task; the number of browsed pages and the number of input words are encoded to obtain page features and word features, and the page features, the word features and the entity vectors of the policy file name entities related to the government work task are spliced to obtain a first spliced vector;

the first spliced vector is then input into a multi-layer perceptron whose classification label is the online workload time; the features fed into the output layer of the multi-layer perceptron are extracted as the online feature vector;

10. The data mining system for big data analysis of claim 1, further comprising a portrait vector module and a vector index module, the portrait vector module extracting the weighted entity vector of a portrait object name output by the first hidden layer as the portrait vector of the portrait object; the entities identified in the government data include photo images, portrait object names, person tracks, policy file names, government agency names, and government staff post names;

the first fully connected layer outputs two classification labels, corresponding respectively to the portrait object having a record of illegal behavior and the portrait object having no record of illegal behavior;