CN114357175A - Data mining system based on semantic network - Google Patents

Info

Publication number
CN114357175A
Authority
CN
China
Prior art keywords
data
entity
service
mining
semantic network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111211723.2A
Other languages
Chinese (zh)
Inventor
白雪娇
王东升
田桂申
屈春一
曹阳
柴占军
章达英
邹睿翀
吴小锋
刘琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Times Economic Publishing House Co ltd
State Grid Inner Mongolia East Electric Power Co ltd Comprehensive Service Branch
Information and Telecommunication Branch of State Grid East Inner Mogolia Electric Power Co Ltd
State Grid Eastern Inner Mongolia Power Co Ltd
Original Assignee
China Times Economic Publishing House Co ltd
State Grid Inner Mongolia East Electric Power Co ltd Comprehensive Service Branch
Information and Telecommunication Branch of State Grid East Inner Mogolia Electric Power Co Ltd
State Grid Eastern Inner Mongolia Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Times Economic Publishing House Co ltd, State Grid Inner Mongolia East Electric Power Co ltd Comprehensive Service Branch, Information and Telecommunication Branch of State Grid East Inner Mogolia Electric Power Co Ltd, and State Grid Eastern Inner Mongolia Power Co Ltd
Priority to CN202111211723.2A
Publication of CN114357175A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data mining system based on a semantic network, comprising: a matching module for receiving data analysis requirements, extracting data mining services from the data analysis requirements using a pre-trained relation model, and determining the service nodes corresponding to the data mining services; a determining module for constructing a mining workflow based on the data mining services; and a display module for visually displaying the mining workflow and the service nodes corresponding to the service items in it. In an enterprise data mining scenario, the scheme associates historical data processing, mining rules, and service nodes, so that mining workflows are generated intelligently and working efficiency is improved.

Description

Data mining system based on semantic network
Technical Field
The invention relates to the field of big data, in particular to a data mining system based on a semantic network.
Background
The semantic network provides a more effective way to express, organize, manage, and use the massive, heterogeneous, and dynamic big data on the internet, and so raises the intelligence level of the network. In intelligent search, for example, a search engine does not merely match keywords but first performs semantic understanding: after the query is segmented into words, its description is normalized so that it can be matched against the knowledge base. The result returned for the query is a complete knowledge system that the search engine assembles after finding the corresponding entity in the knowledge base.
Search engines are becoming increasingly intelligent, relying on the contributions of thousands of users. In enterprise business operations, however, business data mining faces differing data volumes and data sources, so enterprise business data comes in many formats, generalizes poorly, and is weakly correlated. Data mining currently depends mainly on manual assignment and manual processing; in audit work, for example, project flows and service nodes are matched according to experience. Yet the individual processes of enterprise data mining share many general modules, so a semantic network can be used to build the relationships between the various data analysis and mining workflows. Existing general-purpose semantic networks usually refer to large-scale knowledge bases whose application scenarios emphasize universality and the integration of as many entities as possible. Compared with industry-specific semantic networks, they are less accurate and are affected by the scope of their concepts, and it is difficult to standardize the entities, attributes, and relationships between nodes of the entity base by relying on the ontology base's support for enterprise rules and constraints. This causes the poor universality and low relevance seen in enterprise data mining workflows.
Disclosure of Invention
In order to solve the problems existing in the prior art, the invention provides a data mining system based on a semantic network, which comprises:
the matching module is used for receiving data analysis requirements, extracting data mining services from the data analysis requirements by using a pre-trained relation model, and determining service nodes corresponding to the data mining services;
a determining module for constructing a mining workflow based on the data mining service;
the display module is used for displaying the service nodes corresponding to the service items in the mining workflow and the process in a visual mode;
wherein the relational model is trained by a process comprising:
obtaining context information related to a service node;
inputting the obtained context information into a first neural network model to determine relationships between entities associated with the business nodes to obtain a first semantic network of the business nodes;
receiving a shared semantic network from a server;
obtaining a second semantic network of the service node by inputting the obtained first semantic network and the received shared semantic network into a second neural network model for expanding the first semantic network;
extracting knowledge from a plurality of processing historical data based on the second semantic network to obtain entity data, and then obtaining an optimal relationship between a data mining service and related service nodes by using a knowledge integration algorithm;
wherein the entity data comprises: data mining services and service nodes.
Preferably, the shared semantic network is generated by the server based on semantic network data provided by all the service nodes.
Preferably, the inputting the obtained context information into the first neural network model further comprises:
inputting text information of the context information into the first neural network model; determining a privacy level of the first semantic network and inputting the determined privacy level to the first neural network model;
wherein the data in the first semantic network output from the first neural network model comprises data extracted according to the privacy level.
Preferably, the receiving the shared semantic network from the server further comprises:
sending information about a profile of the service node to a server;
a shared semantic network associated with a profile of the service node is received from a server.
Preferably, the training of the relational model comprises:
acquiring processing history data and service attribute data;
determining entity data from the processing history data and the service attribute data;
performing knowledge integration from the entity data according to entity types based on an entity alignment algorithm to obtain the relationship between multiple types of entities, and further constructing a relationship model;
wherein the entity types include: service nodes, data mining services and service execution servers.
Preferably, the determining entity data from the processing history data and the service attribute data includes:
when the processing historical data and the service attribute data are semi-structured data or unstructured data, performing knowledge extraction on the processing historical data and the service attribute data to obtain entity data;
when the processing history data and the service attribute data are structured data, integrating the processing history data or the service attribute data to obtain entity data;
the processing history data and the service attribute data further include data types, and the data types include: structured data, semi-structured data, and unstructured data.
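The branching on data type described above could be sketched as follows. This is a minimal illustration only: `DataType`, `extract_knowledge`, `integrate_structured`, and `to_entity_data` are hypothetical names, and the two processing functions are placeholders rather than the extraction or integration methods of the invention.

```python
from enum import Enum


class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


def extract_knowledge(record):
    """Placeholder for entity, relationship, and attribute extraction (hypothetical)."""
    return {"source": "knowledge_extraction", "payload": record}


def integrate_structured(record):
    """Placeholder for integrating already-structured data (hypothetical)."""
    return {"source": "integration", "payload": record}


def to_entity_data(record, data_type):
    """Route a record to integration or extraction according to its data type."""
    if data_type is DataType.STRUCTURED:
        return integrate_structured(record)
    # Semi-structured and unstructured data both go through knowledge extraction.
    return extract_knowledge(record)
```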
Preferably, the knowledge extraction comprises: entity extraction, relationship extraction and attribute extraction.
Preferably, the knowledge integration is performed from the entity data according to entity types based on the entity alignment algorithm to obtain relationships between multiple types of entities, and further construct a relationship model, including:
constructing training data by adopting a triple format according to the entity type;
and based on the training data, carrying out knowledge integration by mapping relevant attributes by adopting an entity alignment algorithm based on Bayesian estimation.
Preferably, performing knowledge integration by mapping relevant attributes based on the ontology model, using an entity alignment algorithm based on Bayesian estimation, includes:
determining, from the training data, a plurality of entities that are the same as or similar to the entity to be integrated;
applying a similarity mining algorithm to those entities to obtain the entity with the highest correlation;
and aligning the entity to be integrated with the entity having the highest correlation, connecting their related attributes, and labeling them.
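The three steps above (collect candidates, pick the most similar one, align and connect attributes) might look like the sketch below. Note the substitution: the text names an entity alignment algorithm based on Bayesian estimation, whereas this sketch uses plain Jaccard similarity over attribute pairs as a stand-in, purely to show the control flow. The function names and the dictionary layout are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute dicts, compared as (key, value) pairs."""
    sa, sb = set(a.items()), set(b.items())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def align(entity, candidates):
    """Pick the most similar candidate, merge its attributes, and label the match."""
    best = max(candidates, key=lambda c: jaccard(entity["attrs"], c["attrs"]))
    score = jaccard(entity["attrs"], best["attrs"])
    # Connect the related attributes; the entity's own values win on conflict.
    merged = {**best["attrs"], **entity["attrs"]}
    return {"id": entity["id"], "aligned_to": best["id"], "score": score, "attrs": merged}


target = {"id": "e1", "attrs": {"name": "Project A", "year": 2014}}
pool = [
    {"id": "s1", "attrs": {"name": "Project A", "year": 2014, "supplier": "X"}},
    {"id": "s2", "attrs": {"name": "Project B", "year": 2014}},
]
result = align(target, pool)
```

Here `result["aligned_to"]` is `"s1"`, and the merged attributes pick up the supplier recorded only on the candidate.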
Preferably, the triplet representation of the entity is as follows:
G=(E,R,S)
wherein $E = \{e_1, e_2, \ldots, e_{|E|}\}$ and $S = \{s_1, s_2, \ldots, s_{|S|}\}$ are the entity sets to be aligned in the training data, containing $|E|$ and $|S|$ entities of different types, respectively; and $R = \{r_1, r_2, \ldots, r_{|R|}\}$ is the set of relations between entities in the training data, containing $|R|$ different relations.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a semantic network-based data mining system comprising: a matching module for receiving data analysis requirements, extracting data mining services from the data analysis requirements using a pre-trained relation model, and determining the service nodes corresponding to the data mining services; a determining module for constructing a mining workflow based on the data mining services; and a display module for visually displaying the mining workflow and the service nodes corresponding to the service items in it. The relation model is trained by a process comprising: extracting knowledge from a plurality of processing history data based on a semantic network to obtain entity data, and then obtaining an optimal relationship between the data mining services and the related service nodes using a knowledge integration algorithm, wherein the entity data comprises data mining services and service nodes. Through knowledge extraction and knowledge integration, historical data processing, mining rules, and service nodes are associated in enterprise data mining practice, mining workflows are generated intelligently, and working efficiency is improved.
Drawings
FIG. 1 is a block diagram of a semantic network based data mining system of the present invention;
FIG. 2 is a process for constructing a relational model of the present invention;
FIG. 3 is a logic construction process of the present invention.
Detailed Description
Technically, the invention focuses on using natural language processing to understand massive, complex text in enterprise data, such as manuscripts and reports. Combined with a graph database, this better supports knowledge extraction and knowledge integration, and so allows an efficient and reasonable mining workflow to be constructed.
The invention constructs a semantic network of enterprise business data using several intelligent information processing technologies. Through knowledge extraction, structured data (such as relational databases), semi-structured data (such as XML and JSON), and unstructured data (such as pictures and text) are obtained from enterprise data, documents, and systems, and knowledge elements such as entities, relationships, and attributes are extracted from them. Taking an enterprise audit system as an example, the acquired data may include, but is not limited to, manuscripts and reports from the audit process and the resumes and work summaries of service nodes. Knowledge integration then eliminates ambiguity between referring terms (entities, relationships, attributes) and the factual objects they denote. In the invention, knowledge integration associates the service nodes, the characteristics of each service node, the data analysis requirements, and the data mining services to construct an association model. This addresses the relative independence, poor relevance, and difficult matching of entities in the data mining process.
For a better understanding of the present invention, reference is made to the following description taken in conjunction with the accompanying drawings and examples.
The invention provides a data mining system based on semantic network, as shown in figure 1, comprising:
the matching module is used for receiving data analysis requirements, extracting data mining services from the data analysis requirements by using a pre-trained relation model, and determining service nodes corresponding to the data mining services;
a determining module for constructing a mining workflow based on the data mining service;
the display module is used for displaying the service nodes corresponding to the service items in the mining workflow and the process in a visual mode;
the training of the relationship model comprises: obtaining context information related to a service node;
inputting the obtained context information into a first neural network model to determine relationships between entities associated with the business nodes to obtain a first semantic network of the business nodes;
receiving a shared semantic network from a server;
obtaining a second semantic network of the service node by inputting the obtained first semantic network and the received shared semantic network into a second neural network model for expanding the first semantic network;
performing knowledge extraction from a plurality of processing historical data based on the second semantic network to obtain each entity data, and then obtaining an optimal relationship between a data mining service and related service nodes by using knowledge integration; wherein the entity data comprises: data mining services and service nodes.
For the technical points in the above modules, the following is introduced by taking an audit scenario as an example:
Enterprise audit types may include project audits, responsibility audits, and other business types. The auditors include service node A, service node B, and so on, with each auditor corresponding to one service node. Through knowledge extraction and integration, the resumes and work summaries of the service nodes and the manuscripts and reports from the audit process are mined on the basis of big data to determine which projects auditor A is good at and which projects auditor B is good at, so that different business types are associated with different terminals. The data mining services involved in each audit scenario are determined at the same time, and a reasonable mining workflow is constructed. After audit tasks are distributed, data mining services and auditors are then matched automatically: for example, the project audit service, which involves the three items A-B-C, is assigned to auditor A, and the responsibility audit service, which involves the four items A-B-C-D, is assigned to auditor B. The invention thus makes the mining workflow more reasonable and improves audit efficiency.
In a further embodiment, the shared semantic network is generated by the server based on semantic network data provided by all traffic nodes. The inputting the obtained context information into the first neural network model further comprises:
inputting text information of the context information into the first neural network model; determining a privacy level of the first semantic network and inputting the determined privacy level to the first neural network model;
wherein the data in the first semantic network output from the first neural network model comprises data extracted according to the privacy level. The receiving a shared semantic network from a server further comprises: sending information about a profile of the service node to a server; a shared semantic network associated with a profile of the service node is received from a server.
The invention stores data in a graph database. Compared with a relational database, this gives a more rigorous and complete semantic network architecture, with clear logic from entities to concepts and from concepts to ontologies.
The invention uses a labeled property graph model, which contains four elements: nodes, relationships, attributes, and labels.
A node is the primary data element. It is linked to other nodes by relationships, may have one or more attributes (stored as key/value pairs), and carries one or more labels that describe its role in the graph. For example, in an audit and bidding scenario, a project is a node, the supplier and the person responsible for the audit are attributes, and violations and abnormal records are labels; the projects involved in the final audit scenario are connected through relationships, forming a reasonable audit workflow.
A relationship connects two nodes, is directional, and may have one or more attributes.
An attribute is a named value whose name (or key) is a string. Attributes can be indexed and constrained, and a composite index can be created over multiple attributes. For example, a node can be retrieved quickly by querying the supplier name.
Labels are used to group nodes, and indexing on labels can speed up node lookup.
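To make the four elements concrete, the following is a small in-memory sketch of such a graph model. It is not tied to any particular graph database product; the class names, the label index, and the sample project data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)  # key/value attributes


@dataclass
class Relationship:
    start: str  # relationships are directional: start -> end
    end: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []
        self.label_index = {}  # label -> set of node ids, for fast grouping

    def add_node(self, node):
        self.nodes[node.node_id] = node
        for label in node.labels:
            self.label_index.setdefault(label, set()).add(node.node_id)

    def relate(self, start, end, rel_type, **properties):
        self.relationships.append(Relationship(start, end, rel_type, properties))

    def by_label(self, label):
        return [self.nodes[i] for i in sorted(self.label_index.get(label, ()))]

    def neighbors(self, node_id, rel_type):
        return [r.end for r in self.relationships
                if r.start == node_id and r.rel_type == rel_type]


graph = PropertyGraph()
graph.add_node(Node("project_a", {"Project", "AbnormalRecord"},
                    {"supplier": "Supplier X", "auditor": "Auditor A"}))
graph.add_node(Node("project_b", {"Project"},
                    {"supplier": "Supplier Y", "auditor": "Auditor B"}))
graph.relate("project_a", "project_b", "FOLLOWED_BY", step=1)
```

Looking up `graph.by_label("AbnormalRecord")` goes through the label index rather than scanning every node, which is the kind of speed-up the text attributes to label indexing.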
The construction of the relationship model in the matching module is shown in fig. 2, and the construction process includes extraction and integration of knowledge, specifically:
The knowledge extraction process targets the enterprise's business data to be processed and extracts usable knowledge elements through automated techniques with manual assistance; it mainly comprises the extraction of entities, relationships, and attributes.
Entity extraction means automatically identifying named entities in the original database; the invention uses an entity extraction method based on statistical machine learning. A model is trained on the original database by machine learning, and the trained model is used to identify entities, combining a supervised learning algorithm with rules. Entities in text are recognized with the help of OCR text recognition. For each target entity, the mean of the word vectors of all its words is computed and used as the entity phrase vector. The entity is then labeled, which completes the extraction. Entities such as audit scenarios, data mining services, the service nodes responsible for approving a project, the suppliers involved in a project, and the violations found in a project are extracted from documents or data related to the audit workflow, for example manuscripts and reports. Entities such as service node status (on duty or off duty), business scope, and areas of expertise are updated using basic audit data, including but not limited to staff rosters, work summaries, and work resumes.
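The entity phrase vector described above is simply the component-wise mean of the word vectors of the entity's words. A minimal sketch, with toy three-dimensional vectors that are made up for illustration (`entity_phrase_vector` is a hypothetical helper name):

```python
def entity_phrase_vector(word_vectors):
    """Average the word vectors of an entity's words, component by component."""
    if not word_vectors:
        raise ValueError("an entity needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    count = len(word_vectors)
    return [sum(v[k] for v in word_vectors) / count for k in range(dim)]


# Toy word vectors; real ones would come from a trained embedding model.
toy_vectors = {
    "feasibility": [1.0, 0.0, 2.0],
    "study": [3.0, 2.0, 0.0],
    "report": [2.0, 4.0, 1.0],
}
phrase = entity_phrase_vector([toy_vectors[w] for w in ("feasibility", "study", "report")])
# phrase == [2.0, 2.0, 1.0]
```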
In the entity labeling mode provided by the invention, the concrete semantics of the entity in each sentence are considered, so that the noise in the entity labeling process is reduced, the accuracy of the labeling result is improved, the text data can be mined from multiple angles, and the relationship among texts can be enriched.
The invention adopts a machine-learning-based relation extraction method. Because a large proportion of enterprise audit data is unstructured and irregular, features such as vocabulary, syntax, and semantics are extracted by feature engineering and combined effectively. For example, entity attributes such as service node (A), organization (engineering department), responsibility (in charge of bidding), and time (2012-2015) are obtained from basic data, and entity attributes such as project (A), problem (violation in the bidding process, bid won by a blacklisted unit), and time (2014) are obtained from processing history data.
The attribute extraction considers the attribute of an entity as a dependency relationship between the entity and the attribute value, so the attribute extraction problem is converted into a relationship extraction problem.
When the business language lacks labeled corpora while labeled training corpora for the general language are plentiful, a mapping is established between the general language and the business language to obtain a training corpus for the business language. An entity relation extraction model based on language conversion is then trained on the comparison corpus of the two languages to obtain a relation extraction model for the business language. The system consists of a comparison corpus collection module and a relation extraction module. The comparison corpus collection module obtains a public space dictionary through dictionary expansion and collects the comparison corpus on that basis. The relation extraction module adopts a variational autoencoder. First, feature representations of the general-language instances and the business-language instances are learned separately. These are then input to a decoder, which judges whether each one comes from the general language or the business language, and the feature representation of the general language is mapped to the business language. Finally, the trained business-language relation extraction network classifies the relations of the business-language instances.
Specifically, the comparison corpus collection module uses an unsupervised dictionary expansion model to obtain word vectors and dictionaries of the general language and the business language in a public space. The dictionary is then used to translate general-language words into the business language, and the relation labels of the general language are mapped directly onto the business language to obtain a business-language training data set. Word embeddings are initialized with the word vectors obtained in the public space. Business-language words are looked up from general-language words, general-language words are looked up from business-language words, and the dictionaries generated in the two directions are concatenated, with repeated word pairs removed, to form the final dictionary.
In the relation extraction module, the general-language relation extraction data set is first translated with the dictionary generated by the comparison corpus collection module to obtain a business-language data set. With the comparison corpus obtained through the dictionary expansion model, language-conversion entity relation extraction is performed. Specifically, two sentence encoders learn the latent feature representations of the instances of the two languages, the feature representations of the general language are mapped to the business language, and the result is a language-adaptive business-language relation extraction network that predicts relations for the business language.
The sentence encoder uses an LSTM to extract features from the general-language instances and from the business-language instances obtained by dictionary translation, converting sentence instances that contain entity pairs into distributed latent feature representations. The sentence encoders then generate the feature representations of the two languages' instances as input to the decoder, and the decoder and encoders are trained iteratively. Through the competition between encoder and decoder, the trained business-language sentence encoder finally becomes language-adaptive.
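The final step of that paragraph, joining the dictionaries built in the two lookup directions and dropping repeated word pairs, can be illustrated as follows. The word pairs are invented for the example, and `merge_dictionaries` is a hypothetical helper; how each directional dictionary is induced from the word vectors is outside this sketch.

```python
def merge_dictionaries(general_to_business, business_to_general):
    """Combine both lookup directions into one list of unique (general, business) pairs."""
    pairs = []
    seen = set()
    forward = list(general_to_business.items())
    # Flip the business -> general entries so every pair reads (general, business).
    backward = [(g, b) for b, g in business_to_general.items()]
    for pair in forward + backward:
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


g2b = {"check": "audit", "offer": "bid"}          # general word -> business word
b2g = {"audit": "check", "supplier": "vendor"}    # business word -> general word
lexicon = merge_dictionaries(g2b, b2g)
# lexicon == [("check", "audit"), ("offer", "bid"), ("vendor", "supplier")]
```

The pair `("check", "audit")` is found in both directions and is kept only once.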
In the sentence encoder, a general-language sentence encoder is constructed from the general-language relation extraction data set, and a business-language sentence encoder is constructed from the business-language relation extraction data set obtained by dictionary translation. In the embedding layer, real-valued vectors first encode the words, the word position information, and the entity types. The input sentence is expressed as $x = \{w_1, w_2, \ldots, w_n\}$, and a word vector matrix (formula image BDA0003309180720000121 in the original) initializes each word as a vector of dimension $d_w$, where $V$ denotes a fixed-size vocabulary. Initialization uses the word vectors obtained by the comparison corpus collection module, which maps words from different languages into the same feature space. Since words close to the target entities are generally better indicators of the relationship between the entity pair, the position of each word relative to the two entities in the sentence is captured by converting the Euclidean distances from the word to the head entity and to the tail entity into real-valued vectors, which serve as the word's position embeddings. A position vector matrix (formula image BDA0003309180720000122 in the original) maps the Euclidean distances to two vectors of dimension $d_p$, where $D$ denotes the set of Euclidean distances. Each word therefore obtains two position vectors, one for each entity. To reflect the connection between the entity types and the relation type between the entities, the entity type embeddings of the two entities are added to each word in the sentence: a vector matrix (formula image BDA0003309180720000123 in the original) maps the entity types to two vectors of dimension $d_{et}$, where $E$ denotes the set of entity types. Finally, an input sentence is represented as a sequence of vectors $w = \{w_1, w_2, \ldots, w_n\}$, in which each word has embedding dimension $d = d_w + 2d_p + 2d_{et}$.
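As a quick check of the embedding dimension, each token vector is the concatenation of one word vector, two position vectors, and two entity-type vectors. The sizes below are toy values chosen only for illustration, and `embed_token` is a hypothetical helper:

```python
d_w, d_p, d_et = 4, 2, 3  # toy sizes, not values from the source


def embed_token(word_vec, head_pos_vec, tail_pos_vec, head_type_vec, tail_type_vec):
    """Concatenate word, two position, and two entity-type vectors into one token vector."""
    return word_vec + head_pos_vec + tail_pos_vec + head_type_vec + tail_type_vec


token = embed_token([0.1] * d_w, [0.2] * d_p, [0.3] * d_p, [0.4] * d_et, [0.5] * d_et)
# len(token) == d_w + 2 * d_p + 2 * d_et == 14
```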
After an input sentence is encoded, the convolutional layer slides multiple convolution kernels over the sentence to extract local information. The output of the $i$-th sliding window is:
$p_i = \Phi_c w_{i-w+1:i} + b$
where $w_{i-w+1:i}$ is the concatenation of the word vectors of the $w$ words in the $i$-th window, $\Phi_c$ is the convolution matrix, and $b$ is a bias vector (formula image BDA0003309180720000131 in the original), with $d_c$ denoting the number of convolution kernels.
Next, the pooling layer merges all local features extracted by the convolution, and an activation function is applied to obtain a final fixed-length representation. The $j$-th element of the output vector $x$ is:
$[x]_j = \tanh(\max_i p_{ij})$
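The convolution, max-pooling, and activation steps can be traced on a tiny example in plain Python. This sketch assumes zero-padding on the left so that the first window is well defined (the source does not say how the sentence boundary is handled), and the kernels, bias, and token vectors are toy values.

```python
import math


def conv_max_pool(token_vectors, kernels, bias, window):
    """Slide window-sized kernels over the sentence, max-pool over positions, apply tanh."""
    d = len(token_vectors[0])
    # Assumption: pad with zero vectors so window i covers tokens i-window+1 .. i.
    padded = [[0.0] * d for _ in range(window - 1)] + token_vectors
    pooled = []
    for kernel, b in zip(kernels, bias):  # one row of the convolution matrix per kernel
        scores = []
        for i in range(len(token_vectors)):
            concat = [x for vec in padded[i:i + window] for x in vec]
            scores.append(sum(k * x for k, x in zip(kernel, concat)) + b)  # p_ij
        pooled.append(math.tanh(max(scores)))  # [x]_j = tanh(max_i p_ij)
    return pooled


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[1.0, 0.0, 1.0, 0.0],   # sums the first component across the window
           [0.0, 1.0, 0.0, 1.0]]   # sums the second component across the window
features = conv_max_pool(tokens, kernels, bias=[0.0, 0.0], window=2)
```

The output has one entry per kernel regardless of sentence length, which is what gives the fixed-length representation mentioned in the text.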
After the feature representations generated by the general-language sentence encoder and the business-language sentence encoder are obtained, they are input into the relation extraction model for entity relation extraction. The relation extraction model consists of one fully connected layer and a classifier, and outputs the probability distribution of each input sample over all relations. The decoder uses one fully connected layer and an activation function to build a binary classifier; it receives the encoder output as input and judges whether the feature representation comes from the general-language sentence encoder or the business-language sentence encoder. The training process is as follows: the general-language sentence encoder and the classifier of the relation extraction model are pre-trained on the general-language relation extraction data set to minimize the relation classification loss; the decoder is then trained to minimize the decoder loss; the business-language sentence encoder and the classifier of the relation extraction model are trained on the business language to minimize a combined loss function; and these steps are iterated until the model converges. After successful training, the feature representations output by the encoders support accurate relation extraction, and the difference between the representations of the different languages is markedly reduced.
Because the enterprise audit data obtained in knowledge extraction is of uneven quality and its relationships are unclear, knowledge integration is required. Under preset business specifications, the knowledge integration of the invention performs data integration, disambiguation, processing, inference and verification, and updating on knowledge from different databases, archives, files, and systems, so that data, information, methods, and experience are integrated into a mining rule association model.
Knowledge integration is divided into several functional modules: entity alignment, data integration, knowledge reasoning, quality assessment, ontology construction, and knowledge updating.
Entity alignment is a key technology in the knowledge integration process: it infers whether different entities from different data sets map to the same object in the physical world. The entities in the invention are process data from data mining practice, such as manuscripts and reports, and the semantic similarity between entities is obtained from the relationships between the entities (specific objects in the manuscripts and reports). For example, in project data mining practice, a feasibility study report (an entity) may appear in a manuscript many times, but the feasibility study reports of different projects differ, so the feasibility study report corresponding to each project must be confirmed through context (the relationships between entities). The sentences in the training data are segmented with the Stanford word segmentation tool, and after segmentation word embedding is performed with a bidirectional LSTM network;
for each sentence pair in the processed training data, a word is selected from the sentence pair in sequence, the set of words aligned with that word is extracted from the sentence representing the hypothesis according to the labels of the training data, and this word set is used as the supervision signal for that word in the inter-sentence attention module, until all supervision signals have been extracted;
here, the alignment data set in the training data is annotated with the alignment information between the two sentences, and the labels of the training data comprise the relationship of each sentence pair and the alignment of the words. The invention also defines an improved loss function for joint learning: the improved triplet loss and the weighted softmax loss are learned jointly, so as to limit the distance between samples in the same sub-category and increase the distance between samples in different sub-categories.
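The text does not give the exact form of the improved triplet loss, so the following is a minimal sketch that combines the standard triplet margin loss with a class-weighted softmax (cross-entropy) loss. The margin, the mixing factor `alpha`, the class weights and all function names are assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the same-category sample closer
    than the different-category sample by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def weighted_softmax_loss(logits, label, class_weights):
    """Cross-entropy on softmax outputs, scaled by a per-class weight."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return -class_weights[label] * (logits[label] - log_sum)

def joint_loss(anchor, positive, negative, logits, label, class_weights,
               margin=1.0, alpha=1.0):
    """Sum of the two losses; `alpha` balances the classification term."""
    return (triplet_loss(anchor, positive, negative, margin)
            + alpha * weighted_softmax_loss(logits, label, class_weights))

# Toy values: the positive is at distance 5, the negative at distance 1.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
ce = weighted_softmax_loss([0.0, 0.0], 0, [2.0, 1.0])
total = joint_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [0.0, 0.0], 0, [2.0, 1.0])
```

The triplet term is zero once the different-category sample is already farther away than the same-category sample by the margin, which is how it limits distances within a sub-category and enlarges them across sub-categories.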
In enterprise data mining practice, unstructured text data is more abundant, and an entity alignment algorithm based on machine learning is more accurate. A triple representation is used. In its general form the knowledge base is written as $G = (E, R, S)$, where $E = \{e_1, e_2, \ldots, e_{|E|}\}$ is the set of entities in the knowledge base and contains $|E|$ different entities; $R = \{r_1, r_2, \ldots, r_{|R|}\}$ is the set of relations between entities and contains $|R|$ different relations; and $S$ represents the set of triples in the knowledge base. The basic forms of a triple mainly comprise entity 1, relation, entity 2 as well as concepts, attributes, attribute values and the like, where the entity is the most basic element in the semantic network and different relations exist among different entities. Concepts refer primarily to collections, categories, object types and kinds of things, such as suppliers, people, items, etc.; attributes refer primarily to the properties, characteristics and parameters an object may have, such as personnel working state, job title, project name, bid winning unit and the like; attribute values refer primarily to the values of the attributes specified for an object, such as job title, retirement, employment, expert, subject, organization, balance adjustment sheet, income, business risk, and the like. Each entity has a globally unique ID, each attribute-value pair characterizes an intrinsic property of the entity, and a relation links two entities together to characterize the association between them.
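As an illustration of this representation, the sketch below stores entities with a unique ID and attribute-value pairs, and links them through relation triples. The class names, field names and sample values are hypothetical and only mirror the structure described above.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:
    """One element of S: (entity 1, relation, entity 2), by entity ID."""
    head: str
    relation: str
    tail: str

@dataclass
class Entity:
    """An element of E with a globally unique ID, a concept and attribute-value pairs."""
    entity_id: str
    concept: str
    attributes: dict = field(default_factory=dict)

@dataclass
class KnowledgeBase:
    """G = (E, R, S) held as three simple containers."""
    entities: dict = field(default_factory=dict)   # E, keyed by entity ID
    relations: set = field(default_factory=set)    # R
    triples: set = field(default_factory=set)      # S

    def add_entity(self, entity):
        if entity.entity_id in self.entities:
            raise ValueError(f"duplicate entity ID: {entity.entity_id}")
        self.entities[entity.entity_id] = entity

    def add_triple(self, head, relation, tail):
        if head not in self.entities or tail not in self.entities:
            raise KeyError("both entities must exist before they are linked")
        self.relations.add(relation)
        self.triples.add(Triple(head, relation, tail))

# Hypothetical sample data.
kb = KnowledgeBase()
kb.add_entity(Entity("e1", "person", {"working state": "employed"}))
kb.add_entity(Entity("e2", "project", {"project name": "hypothetical project"}))
kb.add_triple("e1", "participates_in", "e2")
```

Keying entities by ID enforces the uniqueness requirement, and keeping `triples` as a set makes repeated insertion of the same fact harmless.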
For example, for a given service scenario, the related data mining services are identified from the processing historical data and the relationship between those services and the scenario is established. With more historical data, relationships between the scenario and further data mining services are obtained; the order and the association relationships of the data mining services are determined through training, and the optimal mining workflow for a service item is finally determined. This mining workflow can be solidified and used directly in a new project, and training can continue to optimize it when new projects arrive. The triples can also be used to find the service nodes that handled each data mining service, to re-establish the relationships between service nodes and working experience and between service nodes and working state, and finally to determine the optimal service node for each link in the project flow.
The whole algorithm is optimized with consideration for the fact that the processing historical data is mainly in Chinese. A triple $(h, r, t)$ is defined, in which the tail entity $t$ is regarded as the result of associating the head entity $h$ through the relation $r$, and the score function of the triple is defined as $s(h, r, t) = -\|h + r - t\|$. The probability of occurrence of the triple $(h, r, t)$ is then $p(h, r, t) = \delta(s(h, r, t))$, where $\delta$ is the sigmoid activation function.
The output of the sigmoid lies between 0 and 1, so in a classification task it is used as the probability of an event.
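The score and probability above can be computed directly. The sketch below assumes the Euclidean norm (the text does not say which norm is used) and uses toy two-dimensional embeddings; the function names are illustrative.

```python
import math

def score(h, r, t):
    """s(h, r, t) = -||h + r - t||, with the Euclidean norm assumed."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def probability(h, r, t):
    """p(h, r, t) = sigmoid(s(h, r, t))."""
    return sigmoid(score(h, r, t))

# Toy embeddings: in the first case h + r equals t exactly.
p_exact = probability([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
s_far = score([1.0, 0.0], [0.0, 1.0], [4.0, 5.0])
p_far = probability([1.0, 0.0], [0.0, 1.0], [4.0, 5.0])
```

Because the score is never positive, the probability defined this way is at most 0.5, reached when $h + r = t$; the farther $h + r$ lies from $t$, the closer the probability is to 0.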
To obtain the relation model, the invention takes maximizing the logarithm of the probability of occurrence of the triples as the objective function of machine learning:
Figure BDA0003309180720000161
where $\Delta$ represents the set of triples. Maximizing the probability of the triples is then equivalent to minimizing $L_{emb}$, and applying a negative sampling technique in the solving process gives:
Figure BDA0003309180720000162
where $\Delta'$ represents the set of negative triples; $k$ is the number of samplings, $n$ is the number of negative samples, and $\mathbb{E}_{(h',r,t')\sim P(\Delta')}$ denotes the expectation over $n$ negative triples drawn at random from $\Delta'$. Negative triples are generated by replacing the head or tail entity of a positive triple $(h, r, t)$, while guaranteeing that the negative triple satisfies
Figure BDA0003309180720000171
Here $h'$ represents the head entity after replacement by the negative sampling technique, and $t'$ represents the tail entity after replacement by the negative sampling technique.
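A minimal sketch of negative sampling for such an embedding model follows. The formulas in the figures are not reproduced in this text, so two points are assumptions: the rejection condition (a corrupted triple must not be a known positive triple) and the loss form (the common word2vec-style objective, $-\log p$ for positives and $-\log(1 - p)$ for negatives). All names and the toy data are illustrative.

```python
import math
import random

def score(emb, triple):
    """s(h, r, t) = -||h + r - t||, with the Euclidean norm assumed."""
    h, r, t = (emb[name] for name in triple)
    return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def corrupt(triple, entities, positives, rng, max_tries=100):
    """Replace the head or the tail with a random entity.

    Assumption: candidates that are known positive triples are rejected.
    """
    h, r, t = triple
    for _ in range(max_tries):
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in positives:
            return candidate
    raise RuntimeError("could not draw a negative triple")

def negative_sampling_loss(emb, positives, entities, n, rng):
    """Sum over positives of -log p(pos), plus -log(1 - p(neg)) for n negatives each."""
    loss = 0.0
    for triple in sorted(positives):
        loss -= log_sigmoid(score(emb, triple))
        for _ in range(n):
            negative = corrupt(triple, entities, positives, rng)
            loss -= log_sigmoid(-score(emb, negative))  # log(1 - sigmoid(s))
    return loss

# Toy data: four entities, one relation, random three-dimensional embeddings.
rng = random.Random(0)
entities = ["a", "b", "c", "d"]
emb = {name: [rng.uniform(-1, 1) for _ in range(3)] for name in entities + ["r"]}
positives = {("a", "r", "b"), ("b", "r", "c")}
negatives = [corrupt(p, entities, positives, rng)
             for p in sorted(positives) for _ in range(5)]
loss = negative_sampling_loss(emb, positives, entities, n=2, rng=rng)
```

In a real system the embeddings would be updated by gradient steps on this loss; the sketch only evaluates it once to show how positive and negative triples enter the objective.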
Currently, known triple conversion implemented with a negative sampling technique generally uses an adaptive algorithm to optimize the loss function of the embedding model. The entities in the invention mainly come from the data mining practice database and are mostly text; a maximum probability algorithm is adopted, and through the expectation the conversion rate from positive triples to negative triples is improved without affecting accuracy.
Ontology construction is an important part of knowledge integration and helps people or software share a common understanding of information structures. In the invention, the ontology is the set of concepts of the business data (such as categories, attributes and constraints), and ontology construction makes the model's treatment of the relevant professional terms, concepts and assumptions in the auditing field more explicit.
The invention adopts an enterprise modeling method: a logical model of the specified knowledge is established through the ontology, and a formalized integrated model is constructed in first-order logic, comprising an enterprise design ontology, a project ontology, a scheduling ontology or a service ontology, as shown in figure 3.
Quality assessment comprises quantifying the confidence of the knowledge and retaining the knowledge with higher confidence. Because there is no single correct ontology for any domain, building an ontology is a creative process, and the quality of an ontology can only be evaluated through practical application. In the invention, entity relation triples are labeled manually and used as training data, and the confidence of the extraction results is calculated with a regression model.
After the relationship model is constructed, for a new data analysis requirement the trained relationship model can be used to extract the data mining services related to the requirement and to determine the service nodes corresponding to those services; the determined data mining services are then connected according to the actual execution flow to form an optimal mining workflow, which is displayed.
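Connecting services according to an execution flow can be illustrated with a topological sort over precedence relations. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the service names and the dependency map are hypothetical, and in the described system the precedence relations would come from the trained relationship model rather than being written by hand.

```python
from graphlib import TopologicalSorter

def build_workflow(dependencies):
    """Order services so that each one runs after everything it depends on.

    dependencies maps a service name to the set of services that must
    finish before it starts.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical data mining services and their precedence relations.
deps = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "extract_entities": {"clean_data"},
    "report": {"extract_entities"},
}
workflow = build_workflow(deps)
```

`TopologicalSorter` raises `graphlib.CycleError` if the precedence relations are circular, which is a convenient check that the extracted relations actually describe an executable flow.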
Displaying in a visual mode further comprises visually displaying each service item on a data map based on the data storage structure of the database, with each violation in a service item used as a label and marked in a different color. The service nodes corresponding to the service items can likewise be marked in different colors and annotated with their service strength items. The mining workflow can thus be displayed and checked dynamically, conveniently and intuitively, and unreasonable data mining services, or service nodes for which a more appropriate choice is likely available among the existing service nodes, can be modified directly.
Knowledge updating: as service scenarios become richer and service rules are refined, the model can be retrained as required to keep it up to date.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data mining device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data mining device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data mining equipment to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data mining device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments; any modifications, equivalent replacements, improvements and the like made within the spirit and principle of the present invention fall within the scope of the claims of the present application.

Claims (10)

1. A semantic network-based data mining system, comprising:
the matching module is used for receiving data analysis requirements, extracting data mining services from the data analysis requirements by using a pre-trained relation model, and determining service nodes corresponding to the data mining services;
a determining module for constructing a mining workflow based on the data mining service;
the display module is used for displaying the service nodes corresponding to the service items in the mining workflow and the process in a visual mode;
wherein the relational model is trained by a process comprising:
obtaining context information related to a service node;
inputting the obtained context information into a first neural network model to determine relationships between entities associated with the business nodes to obtain a first semantic network of the business nodes;
receiving a shared semantic network from a server;
obtaining a second semantic network of the service node by inputting the obtained first semantic network and the received shared semantic network into a second neural network model for expanding the first semantic network;
extracting knowledge from a plurality of processing historical data based on the second semantic network to obtain entity data, and then obtaining an optimal relationship between a data mining service and related service nodes by using a knowledge integration algorithm;
wherein the entity data comprises: data mining services and service nodes.
2. The system of claim 1,
the shared semantic network is generated by the server based on semantic network data provided by all of the traffic nodes.
3. The system of claim 1, wherein inputting the obtained context information into the first neural network model further comprises:
inputting text information of the context information into the first neural network model; determining a privacy level of the first semantic network and inputting the determined privacy level to the first neural network model;
wherein the data in the first semantic network output from the first neural network model comprises data extracted according to the privacy level.
4. The system of claim 1, wherein the receiving the shared semantic network from the server further comprises:
sending information about a profile of the service node to a server;
a shared semantic network associated with a profile of the service node is received from a server.
5. The system of claim 1, wherein the training of the relational model comprises:
acquiring processing history data and service attribute data;
determining entity data from the processing history data and the service attribute data;
performing knowledge integration from the entity data according to entity types based on an entity alignment algorithm to obtain the relationship between multiple types of entities, and further constructing a relationship model;
wherein the entity types include: service nodes, data mining services and service execution servers.
6. The system of claim 5, wherein determining entity data from the processing history data and the traffic attribute data comprises:
when the processing historical data and the service attribute data are semi-structured data or unstructured data, performing knowledge extraction on the processing historical data and the service attribute data to obtain entity data;
when the processing history data and the service attribute data are structured data, integrating the processing history data or the service attribute data to obtain entity data;
the processing history data and the service attribute data further include data types, and the data types include: structured data, semi-structured data, and unstructured data.
7. The system of claim 6, wherein the knowledge extraction comprises: entity extraction, relationship extraction and attribute extraction.
8. The system of claim 5, wherein the knowledge integration from the entity data according to entity types based on the entity alignment algorithm to obtain relationships between multiple types of entities and further construct a relationship model, comprises:
constructing training data by adopting a triple format according to the entity type;
and based on the training data, carrying out knowledge integration by mapping relevant attributes by adopting an entity alignment algorithm based on Bayesian estimation.
9. The system according to claim 5, wherein the integration of knowledge by mapping relevant attributes using an entity alignment algorithm based on Bayesian estimation based on the ontology model comprises:
determining a plurality of entities with the same or similar entities as the entities to be integrated from the training data based on the entities to be integrated;
judging by adopting a similarity mining algorithm from the entities to obtain an entity with the highest correlation;
and aligning the entity to be integrated with the highest correlation degree, connecting the correlation attributes, and labeling.
10. The system of claim 9, wherein the entity's triplet representation is as follows:
G=(E,R,S)
wherein $E = \{e_1, e_2, \ldots, e_{|E|}\}$ and $S = \{s_1, s_2, \ldots, s_{|S|}\}$ are respectively the entity sets to be aligned in the training data, containing $|E|$ and $|S|$ entities of different types; $R = \{r_1, r_2, \ldots, r_{|R|}\}$ is the relation set of the entities in the training data, containing $|R|$ different relations.
CN202111211723.2A 2021-10-18 2021-10-18 Data mining system based on semantic network Pending CN114357175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111211723.2A CN114357175A (en) 2021-10-18 2021-10-18 Data mining system based on semantic network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111211723.2A CN114357175A (en) 2021-10-18 2021-10-18 Data mining system based on semantic network

Publications (1)

Publication Number Publication Date
CN114357175A true CN114357175A (en) 2022-04-15

Family

ID=81095893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111211723.2A Pending CN114357175A (en) 2021-10-18 2021-10-18 Data mining system based on semantic network

Country Status (1)

Country Link
CN (1) CN114357175A (en)

Similar Documents

Publication Publication Date Title
WO2022100045A1 (en) Training method for classification model, sample classification method and apparatus, and device
CN109165294B (en) Short text classification method based on Bayesian classification
US20200279105A1 (en) Deep learning engine and methods for content and context aware data classification
CN110705296A (en) Chinese natural language processing tool system based on machine learning and deep learning
CN113254507B (en) Intelligent construction and inventory method for data asset directory
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113282729B (en) Knowledge graph-based question and answer method and device
CN116089873A (en) Model training method, data classification and classification method, device, equipment and medium
CN113627190A (en) Visualized data conversion method and device, computer equipment and storage medium
CN113157859A (en) Event detection method based on upper concept information
CN114416979A (en) Text query method, text query equipment and storage medium
US20220027748A1 (en) Systems and methods for document similarity matching
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN117891939A (en) Text classification method combining particle swarm algorithm with CNN convolutional neural network
CN112989830A (en) Named entity identification method based on multivariate features and machine learning
CN114372148A (en) Data processing method based on knowledge graph technology and terminal equipment
Ziv et al. CompanyName2Vec: Company entity matching based on job ads
CN115329380A (en) Database table classification and classification method, device, equipment and storage medium
CN114357175A (en) Data mining system based on semantic network
CN110968795B (en) Data association matching system of company image lifting system
US20230297648A1 (en) Correlating request and response data using supervised learning
CN117235629B (en) Intention recognition method, system and computer equipment based on knowledge domain detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination