CN111143479A - Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm - Google Patents


Info

Publication number
CN111143479A
Authority
CN
China
Prior art keywords
entity
corpus
word
cluster
graph
Prior art date
Legal status
Granted
Application number
CN201911254786.9A
Other languages
Chinese (zh)
Other versions
CN111143479B (en)
Inventor
陆佳炜
王小定
高燕煦
朱昊天
高飞
肖刚
Current Assignee
Easy Point Life Digital Technology Co ltd
Fuzhou Zhiqing Intellectual Property Service Co ltd
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201911254786.9A
Publication of CN111143479A
Application granted
Publication of CN111143479B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/287 Visualization; Browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A knowledge graph relation extraction and REST service visualization fusion method based on the DBSCAN clustering algorithm comprises the following steps: first, construct a corpus of the target field; second, perform corpus-oriented entity extraction; third, combine Word2vec to perform guided secondary pre-grouping of the corpus and construct a knowledge graph with the DBSCAN clustering algorithm; fourth, classify the common visualization graph types, summarize the attributes and structural features of each type, and formally express the graph information by creating a visualization model tree VT; fifth, define M-JSON as the JSON prototype structure returned by REST Web services, match M-JSON against the data structures in the visualization model tree, and query the knowledge graph from the third step to check whether the matched attribute combinations have actual semantic associations, so as to select effective dimension combinations and improve the accuracy of automatically generated graphs.

Description

Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
Technical Field
The invention relates to a knowledge graph relation extraction and REST service visualization fusion method based on a DBSCAN clustering algorithm.
Background
Service-Oriented Computing (SOC) is a computing paradigm for distributed systems that currently receives attention from both industry and academia. Driven by the development of the SOC computing model, Web services have been further popularized and applied. With the introduction of the REST (Representational State Transfer) architectural style in 2000, REST services increasingly became an important component of Web services. The simple, lightweight, and fast features of REST services have promoted their popularity on the Internet, maintaining considerable, near-exponential growth and driving the growth of services. Diversified data services now cross into fields such as economics, medicine, sports, and daily life, producing massive amounts of data. However, regardless of the data faced, the primary human goal in obtaining data is to extract the effective information it contains.
Data visualization helps users analyze and understand data through interactive visual interfaces and data-to-image conversion techniques. The basis of visualization is data, and data in the networked era is multi-source and heterogeneous, which raises the problems of data source integration and data arrangement. Data service providers in many fields offer large numbers of services, each with differently structured data response modes and formats, which complicates data acquisition and analysis. With the development of multimedia and visualization technology, people are no longer satisfied with ordinary tabular data and pursue more intuitive, richer forms of data display and more convenient, efficient data processing tools. Therefore, automatically analyzing and organizing heterogeneous service data with reduced human intervention, and visualizing the data automatically, has important practical significance.
The knowledge graph was formally proposed by Google in June 2012 and is a graph-based data structure. A knowledge graph is a structured semantic knowledge base that presents each entity in the real world and the relationships between entities in the form of a graph, described formally. The basic building blocks of a knowledge graph are entity-relationship-entity triples and entity attribute-value pairs. A knowledge graph is stored as triples of the form "entity-relationship-entity" or "entity-attribute-value", and these data constitute a considerable entity-relationship network, i.e., a "graph" of knowledge.
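The triple structure just described can be sketched as a minimal in-memory store. This is an illustrative sketch only; the class and method names (`KnowledgeGraph`, `add_triple`, `related`) are assumptions, not structures from the patent.

```python
# Minimal sketch of an "entity-relation-entity" triple store.
# Attribute-value pairs are folded into triples by turning the attribute
# into a tail entity, as the text describes.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # {(head, relation, tail)}

    def add_triple(self, head, relation, tail):
        self.triples.add((head, relation, tail))

    def add_attribute(self, entity, attribute):
        # An entity attribute becomes a tail entity linked by a default relation.
        self.add_triple(entity, "has_attribute", attribute)

    def related(self, a, b):
        # Query whether any semantic association exists between two entities,
        # regardless of the specific relation type.
        return any((h, t) in {(a, b), (b, a)} for h, _, t in self.triples)

kg = KnowledgeGraph()
kg.add_triple("DBSCAN", "is_a", "clustering algorithm")
kg.add_attribute("DBSCAN", "epsilon")
print(kg.related("DBSCAN", "clustering algorithm"))  # True
```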
At present, although some data visualization modeling methods for REST services exist, their automatic data visualization efficiency is low, or the automatically generated visualizations contain a large amount of redundant graphics, which hinders user understanding and analysis. The knowledge graph has efficient information retrieval capability, strong semantic relation construction capability, and visual presentation capability; combining it with data visualization can more effectively reveal the rules hidden behind the data.
Disclosure of Invention
The invention provides a knowledge graph relation extraction and REST service visualization fusion method based on the DBSCAN clustering algorithm. Common visualization graph types are analyzed and inductively modeled, and the data structure of the REST service is matched against the structure of the visualization model to obtain candidate axis/legend attribute combinations that meet the requirements. Whether an attribute combination has semantic association is determined by querying the knowledge graph, further optimizing the visualization of Web data and improving the probability of generating effective graphs.
In order to realize the invention, the technical scheme is as follows:
a knowledge graph relation extraction and REST service visualization fusion method based on a DBSCAN clustering algorithm comprises the following steps:
The first step, construction of a target-field corpus: the content of the network corpus serves as the basis for constructing the knowledge graph, with network corpus entry information as the original corpus content. The original network corpus content is screened for knowledge graph construction by comparing and analyzing the web page content of the entries: besides title and body text, the original content includes HTML (HyperText Markup Language) tags, entry editing information, web link information, and other redundant information irrelevant to the entry. The entry content is filtered and cleaned, and the title and effective body text are extracted. The filtering covers: HTML tag/text style symbol filtering, entry template symbol and non-English character filtering, entry editing information filtering, picture information filtering, link information filtering, page-specific title attribute name filtering, and numerical value filtering of the entry web page content;
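The filtering list above can be sketched as a single regex cleaning pass. The specific patterns here are assumptions for illustration, not the patent's own rules:

```python
import re

# Illustrative cleaning pass for raw encyclopedia-entry text, following the
# filtering list above (HTML tags, templates, links, non-English characters,
# numerals). The patterns are assumptions, not the patent's own rules.
def clean_entry(html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", html)                 # HTML tags / style markup
    text = re.sub(r"\[\d+\]|\{\{[^}]*\}\}", " ", text)   # reference marks, entry templates
    text = re.sub(r"https?://\S+", " ", text)            # link information
    text = re.sub(r"[^A-Za-z\s.,;:!?'-]", " ", text)     # non-English characters, numerals
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

raw = "<p>DBSCAN{{edit}} is a <b>clustering</b> algorithm [1] from 1996, see https://example.org</p>"
print(clean_entry(raw))
```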
The second step, corpus-oriented entity extraction: a knowledge graph is a data information network with a graph structure formed by entities and relations. Its basic structure is the "entity-relation-entity" triple, which comprises two entities with a real semantic relationship and the relation between them, represented as G = (head, relation, tail), where G denotes the triple, head the head entity, tail the tail entity, and relation the relation between the head and tail entities. Each entity may also have attributes and attribute values; an attribute of an entity is likewise converted into a tail entity connected to that entity, with a relation established between the head and tail entity. Entity extraction is divided into three stages: named entity extraction, attribute entity extraction, and noun entity extraction;
The third step: combining Word2vec, perform guided secondary pre-grouping of the corpus and use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm to construct the knowledge graph. The structure of a triple G is (head, relation, tail); relation varies with head and tail, and in a knowledge graph it is in fact a set of relations used to represent the complex relationships among entities. The aim here is to judge whether semantic association exists between two attributes, i.e., whether a relation exists between two entities, without concern for what the relation specifically is. Word vectors are computed for the words in the corpus, the corpus is grouped twice with secondary clustering of the corresponding word vector sets, and entity relations are extracted with the DBSCAN clustering algorithm. The process is as follows:
The fourth step, construction of a visualization model tree (VT): classify the common visualization graph types, summarize the attributes and structural features of each type, and formally express the graph information by creating a visualization model tree (VT);
The fifth step, the data visualization optimization matching method based on the network corpus knowledge graph: define M-JSON as the prototype structure of the JSON returned by REST Web services; match the Web data prototype structure M-JSON against each StructModel in the visualization model tree VT by data structure, returning the set of candidate axis/legend attribute combinations that meet the conditions; on the basis of structure matching, query the knowledge graph constructed in the third step to check whether a matched candidate axis/legend attribute combination has actual semantic relevance, and select effective dimension combinations according to the query results to improve the precision of automatically generated graphs.
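The structure matching in the fifth step can be sketched under assumed representations: here an M-JSON prototype is reduced to a dict of field names to primitive type names, and a StructModel to the ordered list of types a chart accepts. Both representations, and the function name `match_axes`, are assumptions for illustration; the patent's actual structures are richer.

```python
from itertools import permutations

# Hedged sketch of matching a JSON prototype against a chart structure:
# return every ordered field combination whose types fit the chart.
def match_axes(m_json, struct_model):
    matches = []
    for fields in permutations(m_json, len(struct_model)):
        if all(m_json[f] == t for f, t in zip(fields, struct_model)):
            matches.append(fields)
    return matches

m_json = {"country": "string", "year": "string", "gdp": "number"}
bar_chart_model = ["string", "number"]   # category axis + value axis
print(match_axes(m_json, bar_chart_model))  # [('country', 'gdp'), ('year', 'gdp')]
```

Both matches are structurally valid; per the fifth step, the knowledge graph query would then decide which candidate combinations carry actual semantic association.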
Further, the process of the second step is as follows:
2.1, entity extraction: entity extraction, also known as named entity recognition, automatically identifies named entities from a text data set; named entities generally include person names, place names, organization names, and all other entities identified by name. Extraction is completed with mainstream named entity recognition systems in three steps: first, perform named entity recognition on the corpus content with the tool; second, tag the type attribute of each identified named entity; third, filter the named entities by type attribute, deleting unsuitable ones and keeping the labels of the rest; entry names are defined as named entities by default;
2.2, attribute entity extraction: the information boxes of the network corpus entries serve as the source of attributes. The information box of each entry is intercepted from the corpus, attribute names are extracted according to the information box structure, and each attribute name is taken as a tail entity of the named entity corresponding to the entry name; attribute values are not retained. If an entry has no information box, no tail entity is established for its corresponding named entity;
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop word filtering (Stop Word Filtering), and stemming (Stemming). Since the identified named entities have already been tagged in the named entity extraction step, the following operations are performed only on corpus content outside the tagged entities.
Still further, the process of 2.3 is as follows:
2.3.1, word splitting: design splitting rule patterns with regular expressions and split the corpus content into word texts according to spaces, symbols, and paragraphs;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, the text vocabulary must first be tagged with parts of speech. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique in corpus linguistics that labels the part of speech of each word according to its meaning and context. Many words carry several parts of speech and several meanings, and the correct choice depends on the context. The corpus already tagged with named entities serves as the target text for part-of-speech tagging; noun objects are found from the tagging results, and non-noun objects are removed from the corpus, except entry names, which are retained even when non-noun. At this point the corpus retains the named entities, noun objects, and original punctuation of each entry, with all content still in its original textual order;
2.3.3, stop word filtering: a stop word is a word or character automatically filtered out when processing natural language text, in order to save storage space and improve search efficiency in information retrieval. For a given purpose, any class of words can be chosen as stop words; they mainly fall into two types: one is the function words (Function Words) of human language, which are used very commonly and appear very frequently but have no exact concrete meaning; the other is content words (Content Words) that carry some concrete meaning but no definite reference or direction. In natural language processing there are existing stop word lists (Stop Word List); using a stop word list as a reference dictionary, stop words are deleted from the corpus by word comparison, further simplifying the corpus content and ensuring that no stop words remain;
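The stop-word deletion by dictionary comparison can be sketched as follows; the tiny stop word list here is only illustrative, a real system would use a full reference list:

```python
# Sketch of stop word filtering against a small reference dictionary.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def filter_stop_words(tokens):
    # Keep only tokens that are not in the stop word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "DBSCAN", "algorithm", "is", "a", "density", "method"]
print(filter_stop_words(tokens))  # ['DBSCAN', 'algorithm', 'density', 'method']
```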
2.3.4, stemming: stemming is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English. The same English word has singular/plural variants, tense variants, and different predicate forms for different pronouns; these words differ slightly in form but correspond to the same root, and when computing relevance they should be treated as the same word, which requires stemming. The Porter Stemming Algorithm is a mainstream stemming algorithm; its core idea is to process words according to the type of their morphological affixes. Apart from some special variants, most word variants follow certain rules, and the variants are divided into 6 categories by these rules.
Furthermore, in 2.3.4, the stem extraction steps are as follows:
2.3.4.1, according to the word variant category, perform affix removal and word restoration for each case to obtain the stem information of the noun objects in the corpus and reduce variant forms of the same word; the 6 word variant categories are:
2.3.4.1.1, plurals, and words ending in -ed or -ing;
2.3.4.1.2, words that contain a vowel and end in -y;
2.3.4.1.3, words with double suffixes;
2.3.4.1.4, words with suffixes such as -ic, -ful, -ness, -ative;
2.3.4.1.5, words of the form <c>vcvc<v> with suffixes such as -ant, -ence (where c is a consonant and v is a vowel);
2.3.4.1.6, words of the form <c>vc<v> containing more than one vc pair between vowels and consonants, ending in -e;
2.3.4.2, create the noun objects reduced to stems as noun entities, and update the noun objects in the corpus to their stem forms.
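A greatly simplified suffix-stripping sketch in the spirit of the rules above; the real Porter algorithm adds measure conditions and many more rules, so this is an illustration only, not the algorithm itself:

```python
# Simplified suffix stripping: try a few (suffix, replacement) rules in order,
# keeping at least a 3-letter stem. Only illustrative; not full Porter.
def simple_stem(word: str) -> str:
    for suffix, repl in [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""),
                         ("ness", ""), ("ful", ""), ("s", "")]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ["clusters", "clustering", "hopeful", "darkness"]:
    print(w, "->", simple_stem(w))
```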
In the third step, the DBSCAN algorithm is a density-based spatial clustering algorithm with noise handling: it examines connectivity among samples according to their density distribution and expands clusters from density-connected samples to obtain the final clustering result, as follows:
3.1, use Word2vec to train the corpus W into a word vector cluster Cube: Word2vec is a word vector tool that represents words as feature vectors, converting each word into numerical form as an N-dimensional vector. Embedding the corpus W into a vector space yields the word vector cluster Cube, in which the word vectors are discretely distributed; depending on how dense or sparse the correlations among word vectors are, the distribution exhibits different aggregation patterns. By analyzing the aggregation state of the word vectors, the relatedness distribution of the words can be obtained, and the word vectors can be grouped by correlation and sparsity to obtain the relations among words, i.e., the relations among entities;
3.2, perform guided pre-grouping of the corpus twice: because DBSCAN clustering is easily affected by the distribution of the data set, the corpus must be pre-grouped twice in a guided way to ensure that the core concepts, i.e., the main classification objects or keywords of the target field, serve as cluster centers;
3.3, on the basis of the guided grouping, cluster each word vector cluster cube_z in Cube with the DBSCAN clustering algorithm and compute the cluster center Centroid_z of cube_z; for each newly generated word vector cluster C_k, compute its cluster center Centroid_k; from the mapping between word vector objects and entity objects, find the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k; with Entity_z as the head entity, Entity_k as the tail entity, and the default entity relation R, construct the triple (Entity_z, R, Entity_k) and add it to the triple set. Through the DBSCAN clustering algorithm, a cluster center is automatically found for each corpus set and clustered, while the triples are constructed simultaneously.
The flow of 3.3 is as follows:
3.3.1, cluster each word vector cluster cube_z in Cube with the DBSCAN clustering algorithm and compute the cluster center Centroid_z of cube_z.
The DBSCAN clustering algorithm in step 3.3.1 is executed as follows:
3.3.1.1, in cube_z, select any unvisited sample p (i.e., a data point p) as the center and draw a circular neighborhood of radius ε (the ε-neighborhood);
3.3.1.2, if the number of samples in the ε-neighborhood is not less than minPts (the minimum number of samples in a neighborhood), create a new cluster C for p and add the samples in the neighborhood to a set N;
3.3.1.3, repeat steps 3.3.1.1 and 3.3.1.2 for each sample p′ in the set N, checking the membership of p′ before expanding its ε-neighborhood, and adding p′ to cluster C if p′ does not yet belong to any cluster;
3.3.1.4, when all samples in N have been visited, select another unvisited sample in cube_z and repeat step 3.3.1.1 until all samples in cube_z have been visited;
3.3.1.5, the obtained clustering result is a set of clusters;
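Steps 3.3.1.1-3.3.1.5 can be sketched in plain Python as below. The function and parameter names (`dbscan`, `eps`, `min_pts`) mirror the ε and minPts of the text but are otherwise illustrative; noise points are labeled -1.

```python
from math import dist

# Plain-Python sketch of the DBSCAN procedure described in steps 3.3.1.1-3.3.1.5.
def dbscan(points, eps, min_pts):
    labels = {}          # point index -> cluster id (-1 = noise)
    cluster_id = 0
    for p in range(len(points)):
        if p in labels:
            continue
        neighbors = [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1           # provisionally noise; a cluster may claim it later
            continue
        cluster_id += 1              # p is a core point: start a new cluster C
        labels[p] = cluster_id
        seeds = [q for q in neighbors if q != p]
        while seeds:                 # expand the cluster through the seed set N
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cluster_id   # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbors = [r for r in range(len(points)) if dist(points[q], points[r]) <= eps]
            if len(q_neighbors) >= min_pts:  # q is also a core point: expand further
                seeds.extend(q_neighbors)
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)],
                eps=2.0, min_pts=2)
print(labels)
```

With these parameters the first three points form one cluster, the next three another, and (50, 50) is labeled noise.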
3.3.2, for each newly generated word vector cluster C_k, compute the cluster center Centroid_k; from the mapping between word vector objects and entity objects, find the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k; with Entity_z as the head entity, Entity_k as the tail entity, and the default entity relation R, construct the triple (Entity_z, R, Entity_k) and add it to the triple set TP;
3.3.3, if the sample count of the smallest cluster in the clustering result is greater than the threshold Z, take the obtained cluster set ClusDS as input, reduce the values of ε and minPts, cluster each cluster again, and execute steps 3.3.1 and 3.3.2; if the sample count of the smallest cluster is not greater than the threshold Z, for each sample in each cluster find its corresponding entity Entity_q and the entity Entity_Q corresponding to the cluster center, and group them into triples (Entity_Q, R, Entity_q);
In the DBSCAN clustering algorithm, the sizes of the ε-neighborhood and minPts are determined by the number of samples in cube_z: the larger the number of samples in cube_z, the larger (ε, minPts). Using a larger neighborhood range and minimum sample count in the early stage limits the number of clusters; if smaller values were used, the large number of fine-grained groups produced would disperse information excessively, and the entities extracted at the cluster centers could not serve as upper-layer entities expressing core content. When the DBSCAN algorithm is called recursively, the (ε, minPts) values are decreased in a gradient manner, reducing the neighborhood range and minimum sample count, and the clusters obtained in the previous round are clustered again in turn, reducing the number of samples in each cluster;
all the entities in the corpus W are in relation with other entities, and the corresponding three tuples are combined with each other, so that the knowledge graph is formed.
The step 3.2 is as follows:
3.2.1, group the corpus W and its corresponding word vector cluster Cube a first time, as follows:
3.2.1.1, extract the root corpus label of the corpus W to form the core entity; obtain the network corpus through a crawler, extract the first-layer sub-classification labels of the root corpus label, and generate the first-layer sub-classification label set Tag = {t_1, t_2, ..., t_i, ..., t_n}, containing n sub-classification labels; each label has a corresponding entity and word vector; combine these entities with the core entity to form n triples and add them to the triple set TP;
3.2.1.2, taking the word vector corresponding to each tag t_i in the classification label set Tag as a cluster center, compute the Euclidean distance from each data point in the word vector cluster Cube to each centroid, then assign the data points to the cluster of the nearest center; the corpus W is thus divided into n corpus sets w_i (1 ≤ i ≤ n), where cube_i is the word vector cluster corresponding to corpus set w_i;
3.2.2, perform secondary grouping on each corpus set w_i and secondary clustering on the corresponding word vector cluster cube_i, following the first grouping step:
3.2.2.1, for each corpus set w_i and its cluster-center tag t_i, take t_i as a second-level core entity; obtain the network corpus through the crawler, extract the first-layer sub-classification labels of the cluster-center tag, and generate the classification label set Tag_i = {t_i1, t_i2, ..., t_ij, ..., t_im_i} (1 ≤ j ≤ m_i, 1 ≤ i ≤ n), meaning the current tag t_i contains m_i sub-classification labels; each label has a corresponding entity and word vector; combine these entities with the second-level core entity to form m_i triples and add them to the triple set TP;
3.2.2.2, taking the word vector corresponding to each label in Tag_i from step 3.2.2.1 as a cluster center, compute the Euclidean distance from each data point in the current word vector cluster cube_i to each centroid, then assign the data points to the cluster of the nearest center; each corpus set w_i is thus further divided into m_i corpus sets w_ij (1 ≤ j ≤ m_i, 1 ≤ i ≤ n), i.e., the original corpus W is divided into Σ_{i=1}^{n} m_i corpus sets w_ij, where cube_ij is the word vector cluster corresponding to w_ij.
The Euclidean Distance in step 3.2.1.2 is the key criterion for deciding the category to which a data point belongs. Given samples x_i = (x_i1, x_i2, ..., x_in) and x_j = (x_j1, x_j2, ..., x_jn), where i, j = 1, 2, ..., m, m denotes the number of samples, and n denotes the number of features, the Euclidean distance is computed as:

d(x_i, x_j) = sqrt( Σ_{k=1}^{n} (x_ik − x_jk)² )
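The distance formula and the nearest-center assignment of steps 3.2.1.2 and 3.2.2.2 can be sketched together; the function names `euclidean` and `assign_to_centers` are illustrative:

```python
from math import sqrt

def euclidean(x_i, x_j):
    # d(x_i, x_j) = sqrt(sum_k (x_ik - x_jk)^2)
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def assign_to_centers(points, centers):
    # Assign each data point to the cluster of its nearest center,
    # as in the guided pre-grouping steps.
    clusters = {c: [] for c in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

clusters = assign_to_centers([(0, 1), (9, 9), (1, 0)], [(0, 0), (10, 10)])
print(clusters)  # {0: [(0, 1), (1, 0)], 1: [(9, 9)]}
```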
3.2.3, combining the TF-IDF algorithm, find the keywords of each corpus set w_ij and use the keywords to group the corpus sets w_ij again;
3.2.3.1, use the TF-IDF algorithm to find the keywords of each corpus set w_ij;
The TF-IDF algorithm in step 3.2.3 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF refers to the frequency with which a given word appears in a given document:

TF_x,y = n_x,y / Σ_k n_k,y

where n_x,y is the number of times term x appears in document y and Σ_k n_k,y is the total number of words in document y. The inverse document frequency IDF evaluates how much information a word or term provides, i.e., whether the term is common across the whole document collection:

IDF_x = log( N / N_x )

where N is the total number of documents and N_x is the number of documents containing term x; here each entry in the text is treated as a document. Finally, the TF and IDF values are combined to give the TF-IDF formula:

TF-IDF_x,y = TF_x,y × IDF_x
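The three formulas above can be sketched directly, with each encyclopedia entry treated as one document (tokenized lists here stand in for the real corpus):

```python
from math import log

# Sketch of the TF-IDF computation above; each entry is one document.
def tf(term, doc):
    return doc.count(term) / len(doc)          # TF_x,y = n_x,y / sum_k n_k,y

def idf(term, docs):
    n_x = sum(1 for d in docs if term in d)    # N_x: documents containing x
    return log(len(docs) / n_x)                # IDF_x = log(N / N_x)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)     # TF-IDF_x,y = TF_x,y * IDF_x

docs = [["rest", "service", "data"], ["graph", "data"], ["rest", "rest", "api", "data"]]
print(round(tf_idf("rest", docs[2], docs), 3))
```

Note that a term appearing in every document (such as "data" here) gets IDF = log(1) = 0 and thus zero weight, which is exactly the "common term" behavior the text describes.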
3.2.3.2, manually screen the keywords of each corpus set w_ij, removing keywords weakly related to the core entity of the current corpus set and retaining the most relevant ones; the number of keywords retained depends on the overall quality of all extracted keywords;
3.2.3.3, construct triples from the entities corresponding to the screened keywords of each corpus set and the core entity of the current corpus set, and add them to the triple set TP; then, taking the word vectors of the keywords as cluster centers, compute Euclidean distances again and group each corpus set w_ij and its corresponding word vector cluster cube_ij;
3.2.3.4, with these keywords as the cluster centers of each corpus set w_ij, compute the Euclidean distance between the data points in the set and each centroid once more and classify the data points; the original corpus is now divided into many small corpus sets w_z.
All triples constructed in the guided pre-grouping process are added to the triple set TP. Each resulting word vector cluster is denoted cube_z and its corresponding corpus set w_z, where z is a natural number indicating the number of clusters in the set Cube and the number of corpus sets in the corpus W; the cluster center of cube_z is denoted Centroid_z, and its corresponding entity object is Entity_z.
The fourth step comprises the following steps:
4.1, VT is defined to comprise two parts: basic attributes (BASICATTRIBUTE) and a visualization structure (DVSCHEMA); the formal definition is given in (1), where the basic attributes store general information such as graph titles, subtitles, and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, basic visualization graphs can be generalized into four basic categories according to the data type, graph data structure and graph dimension required by the graph: general graphs (General), topological graphs (Topology), maps (Map) and text graphs (Text), with the formal definition as in (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, each of the four basic categories in step 4.3 contains two attributes: the graph type (VType) and the graph structure (StructModel); VType stores the graph types belonging to the category, and StructModel stores the basic visual structures of those graphs; the formal definition is shown in (4), where "A::B" denotes "A contains attribute B";
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>;
4.5, the four basic categories in step 4.4 each have their own mapping relation (Mapping), describing the data structure, data dimension, graph structure relation and data mapping position information of the corresponding graphs; from the Mapping information, combined with the data structure of the graph, the basic visual structure StructModel of each kind of graph can be abstracted.
In step 4.4, the graphs under the VType attribute of the four basic categories are as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, Topology comprises the network chart (NetworkChart), tree map (TreeMap) and area tree map (TreeMapChart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes the word cloud (WordCloudChart);
In step 4.5, the Mapping relation and basic visual structure StructModel of each kind of graph are defined as follows:
4.5.1, graphs of the General type are commonly used to represent two-dimensional or three-dimensional data, whose information can be represented by a tuple (XAxis, YAxis) or a triple (XAxis, YAxis, ZAxis); the Mapping structure of such graphs is as in (5), where LegendName denotes the legend name, storing each group's information as an ARRAY type; from the Mapping structure, the basic StructModel can be abstracted as in (6): the child node of StructModel is a temporary root node Root, and Root contains two child nodes, a key-value pair K_V and a legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphs of the Topology type are usually used to represent topological relation data; the tree map and area tree map can represent the tree structure with nested key-value pairs { key: value, children: { key: value } }, whose Mapping structure is as in (7); the network chart can be represented by a node set (Nodes) and an edge set (Links), with the Mapping structure as in (8), where source denotes the starting node of an edge and target the node it points to; from the Mapping structure, the basic StructModel can be abstracted as in (9): StructModel has two substructures, whose temporary root nodes are Root1 and Root2; Root1 contains two child nodes, a key-value pair K_V and a child node children whose substructure is a key-value pair K_V; Root2 contains two child nodes, the node set Nodes and the edge set Links, where the children of Nodes are key and value (value may be null) and the children of Links are the starting point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphs of the Map type are usually used to represent map information and are represented by a key-value-pair array [ { PlaceName: value } ] or a triple array [ { lng, lat, value } ]; the Mapping structure of such graphs is as in (10), where PlaceName denotes the place name, lng the longitude and lat the latitude; from the Mapping structure, the basic StructModel can be abstracted as in (11): StructModel has two substructures, whose temporary root nodes are Root1 and Root2; Root1 contains a child key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs of the Text type commonly use a tuple (Keyword, frequency) to represent keyword frequencies; the Mapping structure of such graphs is as in (12), where Keyword is a word extracted from the text and frequency denotes its frequency of occurrence in the text; from the Mapping structure, the basic StructModel can be abstracted as in (13): the child node of StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>。
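As an illustrative rendering only (not the patent's implementation), the visual model tree VT formalized in definitions (1)–(13) can be sketched as a nested Python dictionary; the key names mirror the formal definitions, while concrete values are placeholders:

```python
# Hedged sketch of the visual model tree VT from definitions (1)-(13).
# None marks a slot to be filled when a concrete graph is instantiated.
VT = {
    "BASICATTRIBUTE": {"title": "", "subtitle": "", "attributes": {}},
    "DVSCHEMA": {
        "General": {
            "VType": ["BarChart", "LineChart", "PieChart", "RadarChart", "ScatterChart"],
            "StructModel": {"Root": {"K_V": None, "LegendNode": None}},   # (6)
        },
        "Topology": {
            "VType": ["NetworkChart", "TreeMap", "TreeMapChart"],
            "StructModel": {                                              # (9)
                "Root1": {"K_V": None, "children": {"K_V": None}},
                "Root2": {"Nodes": {"key": None, "value": None},
                          "Links": {"source": None, "target": None}},
            },
        },
        "Map": {
            "VType": ["AreaMapChart", "CountryMapChart", "WorldMapChart"],
            "StructModel": {"Root1": {"K_V": None},                       # (11)
                            "Root2": {"lng": None, "lat": None, "value": None}},
        },
        "Text": {
            "VType": ["WordCloudChart"],
            "StructModel": {"Root": {"K_V": None}},                       # (13)
        },
    },
}
```

A matching engine can then walk this dictionary to compare a Web data prototype structure against each StructModel.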
The fifth step comprises the following steps:
5.1, matching the Web data prototype structure M-JSON against the StructModel of the visual model tree VT by data structure, and obtaining m qualified candidate coordinate-axis/legend attribute combinations in M-JSON, each combination expressed as a tuple consisting of a key-value pair L and an attribute name A, where L and A correspond respectively to LegendNode and K_V in step 4.5.1;
5.2, matching and optimizing m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, wherein the process is as follows:
5.2.1, each matching result from step 5.1 is expressed as a tuple P: (L::name, A::name); each result P_i = (L_i::name, A_i::name) is converted into the triple form G_i = (L_i::name, R, A_i::name) and added to the set S = {G_1, G_2, ..., G_m};
5.2.2, for each G_i in the set S in turn, the mapping F(L_i::name → head, R → relation, A_i::name → tail) is applied to map it to a triple (head, relation, tail), and whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph is matched; the result is True or False, expressed as 1 and 0 respectively. First the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, then the relation between them is matched; the result is 1 if and only if head, tail and relation all match successfully;
5.2.3, after the query of the objects in the set S is completed, a set Q = {(G_i, result_i)} is returned; Q records whether semantic association exists for each currently qualified tuple and serves as the judgment of the matching result of candidate coordinate-axis/legend attribute combinations, so a match is judged successful only when the structure matches and result_i is 1, which improves the accuracy of data attribute matching and reduces the generation rate of graphs without practical meaning.
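A minimal sketch of steps 5.2.1–5.2.3, assuming the knowledge graph is held as a set of (head, relation, tail) tuples and using "R" as a stand-in label for the default relation R:

```python
def match_candidates(candidates, kg_triples, relation="R"):
    """For each candidate pair (L::name, A::name), build the triple
    (head, relation, tail) and check whether it exists in the corpus
    knowledge graph; keep only pairs whose result is 1."""
    kept = []
    for l_name, a_name in candidates:
        triple = (l_name, relation, a_name)
        result = 1 if triple in kg_triples else 0  # True/False expressed as 1/0
        if result == 1:
            kept.append(triple)
    return kept
```

For example, with kg = {("score", "R", "goal")}, the candidate pair ("score", "goal") is kept while ("year", "list") is rejected, so no chart is generated for the semantically unrelated combination.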
The invention has the following beneficial effects: when Web data is visualized to generate graphs, the method can construct a network corpus knowledge graph from network corpus data, analyze, generalize and model common visualization graphs, optimize the matching process between the Web data prototype structure and common visualization graph models, reduce the generation of redundant graphs and improve the generation rate of effective graphs. Meanwhile, manual participation in graph screening during automatic data visualization is reduced, simplifying the Web data visualization process.
Drawings
FIG. 1 shows a corpus grouping and word vector set clustering flow diagram.
Fig. 2 shows a knowledge graph construction flow chart based on the DBSCAN algorithm.
Fig. 3 shows a block diagram of a visual model tree VT.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, a method for knowledge graph relation extraction and REST service visualization fusion based on a DBSCAN clustering algorithm includes the following steps:
The first step, construction of a target domain corpus: the corpus content of a network encyclopedia (such as Wikipedia) is taken as the basis for constructing the knowledge graph. To improve the text quality and comprehensiveness of the domain corpus content, the entry information of the network encyclopedia is used as the content of the original corpus, and the original network corpus content is screened before knowledge graph construction. Analysis of entry web pages shows that, besides the title and body text, the corpus content contains redundant information irrelevant to the entry, such as HTML tags, entry editing information and web page link information; the target entries are therefore filtered and cleaned, and the title and effective body content are extracted. The filtering includes: HTML tag/text style symbol filtering (e.g., deleting HTML tags such as <h1>text</h1>, <p>text</p> and <div>text</div> while retaining the text, and deleting style symbols such as span{font-color:#efefef;}), entry editing information filtering (e.g., deleting [edit] tags), picture information filtering (e.g., deleting <img src='…'/> picture tags), link information filtering (e.g., deleting <a href="…" title="…">text</a> tags while retaining the text information), page-specific title/attribute name filtering (e.g., deleting proprietary titles and attribute names such as Further reading and External links), and numerical value filtering (e.g., deleting numerical hyperlinks such as 20, 30, etc.);
For example, using the Wikipedia network encyclopedia, the web page content of the Wikipedia category Athletic sports is obtained through a crawler, and after filtering and screening, entry corpus content covering Athletic sports and its subcategories is obtained;
Step two, corpus-oriented entity extraction: the knowledge graph is a data information network with a graph structure formed by entities and relations; its basic structure is represented by the triple "entity-relation-entity", which comprises two entities with a real semantic relation and the relation between them, represented in the form G = (head, relation, tail), where G denotes the triple, head the head entity, tail the tail entity and relation the relation between the head and tail entities. Each entity also has attributes and attribute values; the attribute of an entity is likewise converted into a tail entity connected to it, and a relation is established between the head entity and that tail entity. Entity extraction is divided into three stages: named entity extraction, attribute entity extraction and noun entity extraction;
2.1, named entity extraction: entity extraction is also called named entity recognition, which automatically identifies named entities from a text data set; these generally refer to person names, place names, organization nouns and other entities identified by proper names. The process can be completed with mainstream named entity recognition systems such as Stanford NER, which can mark entities in text by class and recognize seven types of attributes: Time, Location, Person, Date, Organization, Money and Percent. Using such a tool, named entity recognition is performed on the corpus content, and each recognized named entity is marked with its type attribute. The process is as follows: first, named entity recognition is performed on the corpus content with the tool; second, the type attributes of the recognized named entities are marked; third, the named entities are filtered according to type attribute, unsuitable ones are deleted and the labels of the others retained; entry names are defined as named entities by default;
2.2, attribute entity extraction: the information box of an entry in the network encyclopedia is taken as the source of attributes, and attributes are extracted from the information box. The information box of each entry in the corpus is intercepted, attribute names are extracted according to the structure of the information box and used as tail entities of the named entity corresponding to the entry name; attribute values are not retained. If an entry has no information box, no tail entity is created for its named entity. Taking the information box (Info Box) of the Wikipedia entry "National Basketball Association (NBA)" as an example: the information box takes the form of a table in which row 1, column 1 contains "Sport", row 2, column 1 contains "Founded", row 2, column 2 contains "June 6, 1946; 73 years ago", and so on; triples are constructed by extracting the first-column contents, namely "Sport" and "Founded", together with the entry "National Basketball Association (NBA)";
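The attribute entity extraction in step 2.2 can be sketched as follows; the relation label "hasAttribute" is an illustrative placeholder, since the patent does not name the relation:

```python
def infobox_to_triples(entry_name, infobox_rows):
    """Each first-column attribute name of the entry's information box
    becomes a tail entity linked to the entry's named entity; the
    second-column attribute values are discarded (step 2.2)."""
    return [(entry_name, "hasAttribute", attr) for attr, _value in infobox_rows]
```

For the NBA example, the rows ("Sport", ...) and ("Founded", ...) yield two triples with "National Basketball Association (NBA)" as head entity.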
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop word filtering (Stop Word Filtering) and stemming (Stemming). Since the named entities recognized in the named entity extraction step are already marked, the following operations only process corpus content outside the marked entities;
2.3.1, word splitting: a splitting rule pattern is designed with regular expressions, and the corpus content is split into words according to spaces, symbols and paragraphs to obtain word texts;
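A minimal word-splitting sketch; the exact character class is an assumption standing in for the patent's unspecified splitting rule pattern:

```python
import re

# splitting rule pattern: break on whitespace (including paragraph breaks)
# and common punctuation symbols
SPLIT_PATTERN = re.compile(r"[\s.,;:!?()\[\]{}\"']+")

def split_words(text):
    """Split corpus content into word texts (step 2.3.1)."""
    return [w for w in SPLIT_PATTERN.split(text) if w]
```

For example, "Hello, world! (test)" splits into the three word texts Hello, world, test.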
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, the text vocabulary must first be part-of-speech tagged. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique that marks the part of speech of words in a corpus according to their meaning and context. Many words carry several parts of speech and several meanings, and the choice of part of speech depends on the contextual meaning. The corpus already marked with named entities is used as the tagging object text; noun objects are identified from the tagging results, and non-noun objects are eliminated from the corpus, excluding entry names that are not nouns. At this point the corpus retains the named entities, noun objects and original punctuation of each entry, with all content still in the original text order;
2.3.3, stop word filtering: the term comes from Stop Word, referring to words or phrases automatically filtered out when processing natural language text in information retrieval, in order to save storage space and improve search efficiency. For a given purpose, any class of words can be selected as stop words. Stop words mainly comprise two types: one type is the function words (Function Words) of human language, often articles, conjunctions, adverbs and prepositions; such words are used extremely commonly and occur very frequently but have no exact concrete meaning, such as a, an, the, which, on; the other type is content words (Content Words), words that have concrete meaning but no specific reference or direction, such as want, welcome, open, consider. In natural language processing a stop word list (Stop Word List) is provided; using it as a reference dictionary, stop words are deleted from the corpus by word comparison, further simplifying the corpus content and ensuring that no stop words remain in the corpus;
2.3.4, stemming: stemming is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English. The same English word has singular/plural variants (such as apple and apples), tense variants with ing and ed (such as doing and did), and variants of the predicate for different personal pronouns (such as like and likes). Although these words differ slightly in form, they all correspond to the same root and should be treated as the same word when computing relevance, which requires stemming. The Porter Stemming Algorithm is a mainstream stemming algorithm whose core idea is to classify, process and restore words according to the type of morphological affix. Apart from a few special variants, most word variants follow certain rules; the variants are divided into 6 categories according to these rules, and the stemming steps are as follows:
2.3.4.1, according to the word variant category, affix removal and word restoration are performed for each case to obtain the stem information of the noun objects in the corpus and reduce different forms of the same word. The 6 word variant classes are as follows:
2.3.4.1.1, plurals, and words ending in ed or ing;
2.3.4.1.2, words that contain vowels and end in y;
2.3.4.1.3, double suffix words;
2.3.4.1.4, suffixed with-ic, -ful, -new, -active, etc.;
2.3.4.1.5, words of the form <c>vcvc<v> with a suffix such as -ant or -ence (c is a consonant, v is a vowel);
2.3.4.1.6, words of the form <c>vc<v> containing more than 1 vc pair between vowels and consonants, and words ending in e;
2.3.4.2, creating noun objects reduced to word stems as noun entities, and updating the noun objects in the corpus to be expressed in the form of word stems;
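The suffix-stripping steps above can be sketched with a deliberately simplified, Porter-style reduction; this toy function handles only a few of the six variant classes (plural/-ed/-ing) and is not the full Porter Stemming Algorithm:

```python
def simple_stem(word: str) -> str:
    """Reduced Porter-style suffix stripping: try the longest matching
    suffix first and restore the stem, keeping at least 2 characters."""
    rules = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

For example, "apples" reduces to "apple" and "doing" to "do", so both forms are counted as one word when computing relevance.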
The third step: combining Word2vec, guided secondary pre-grouping is performed on the corpus, and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm is used to construct the knowledge graph. The structure of a triple G is (head, relation, tail); relation varies with head and tail, and in the knowledge graph relation is actually a relation set used to represent the complex relations among entities. The aim here is to judge whether semantic association exists between two attributes, i.e. whether a relation exists between two entities, not what that relation is. Referring to fig. 1, the corpus is grouped twice and the corresponding word vector set is clustered twice by computing word vectors of the words in the corpus; referring to fig. 2, the DBSCAN clustering algorithm is used to extract entity relations, and the processing flow is as follows:
the DBSCAN algorithm is a noise application space clustering algorithm based on density, examines the connectivity among samples according to the density distribution of the samples, and expands a cluster based on the connectable samples to obtain a final clustering result;
3.1, using Word2vec to train the corpus W into a word vector cluster Cube: Word2vec is a word vector tool that represents words as feature vectors. Word2vec converts words into numerical form, represented by an N-dimensional vector; the corpus W is embedded into a vector space to obtain the word vector cluster Cube. Each word vector is distributed discretely in Cube, and the distribution presents different aggregation patterns according to the density of word vector correlation. By analyzing the aggregation state of the word vectors, the relevance distribution of the words can be obtained, and the word vectors are grouped according to their different affinity relations to obtain the relations between words, i.e. the relations between entities;
3.2, performing guided pre-grouping on the corpus twice: because DBSCAN clustering is easily influenced by the distribution of the data set, in order to ensure that the core concepts, i.e. the main classification objects or keywords in the target domain, serve as cluster centers, the corpus needs to be pre-grouped twice in a guided manner, as follows:
3.2.1, grouping the corpus W and its corresponding word vector cluster Cube a first time, as follows:
3.2.1.1, extracting the root corpus label in the corpus W to form the core entity; the network corpus is obtained through a crawler, the first-layer sub-classification labels of the root corpus label are extracted, and the first-layer sub-classification label set Tag = {t_1, t_2, ..., t_i, ..., t_n} is generated, containing n sub-classification labels; each label has a corresponding entity and word vector, and these entities are combined with the core entity to form n triples, which are added to the triple set TP;
3.2.1.2, taking the word vector corresponding to each tag t_i in the label set Tag as a cluster center, the Euclidean distance from each data point in the word vector cluster Cube to each centroid is calculated, and the data points are then assigned to the cluster of the nearest cluster center; at this point the corpus W is divided into n corpus sets w_i (1 ≤ i ≤ n), and the word vector cluster corresponding to corpus set w_i is cube_i;
3.2.2, performing secondary grouping on each corpus set w_i and its corresponding word vector cluster cube_i, following the first-grouping procedure:
3.2.2.1, for each corpus set w_i and its cluster-center label t_i, the cluster-center label t_i is extracted as a secondary core entity. The network corpus is obtained through a crawler, the first-layer sub-classification labels of the cluster-center label are extracted, and the classification label set Tag_i = {t_i1, t_i2, ..., t_ij, ..., t_im_i} is generated, where 1 ≤ j ≤ m_i and 1 ≤ i ≤ n, indicating that the current label t_i contains m_i sub-classification labels in total; each label has a corresponding entity and word vector, and these entities are combined with the secondary core entity to form m_i triples, which are added to the triple set TP;
3.2.2.2, taking the word vector corresponding to each label in Tag_i from step 3.2.2.1 as a cluster center, the Euclidean distance from each data point in the current word vector cluster cube_i to each centroid is calculated, and the data points are then assigned to the cluster of the nearest cluster center; at this point each corpus set w_i is further divided into m_i corpus sets w_ij (1 ≤ j ≤ m_i, 1 ≤ i ≤ n), i.e. the original corpus W is divided into Σ_{i=1..n} m_i corpus sets w_ij, and the word vector cluster corresponding to w_ij is cube_ij;
The Euclidean distance (Euclidean Distance) in step 3.2.1.2 is an important basis for determining the class of a data point. Assume given samples x_i = (x_i1, x_i2, ..., x_in) and x_j = (x_j1, x_j2, ..., x_jn), where i, j = 1, 2, ..., m, m denotes the number of samples and n the number of features; the Euclidean distance is calculated as:
d(x_i, x_j) = sqrt( Σ_{k=1..n} (x_ik − x_jk)² )
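The distance formula and the nearest-centroid assignment used in steps 3.2.1.2 and 3.2.2.2 can be sketched as follows; the label names and vectors are illustrative:

```python
import math

def euclidean(x, y):
    """d(x_i, x_j) = sqrt(sum_k (x_ik - x_jk)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def assign_to_centroids(points, centroids):
    """Assign each word vector to the cluster of its nearest cluster
    center, dividing the vector cluster into one group per label."""
    groups = {label: [] for label in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: euclidean(p, centroids[c]))
        groups[nearest].append(p)
    return groups
```

For instance, with centroids for two labels at (0, 0) and (10, 10), the point (0, 1) joins the first group and (9, 9) the second.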
For example, guided pre-classification is first conducted on the constructed Athletic sports corpus W entity data set: the first-layer sub-classification labels of the previously crawled Wikipedia Athletic sports corpus label are extracted to form the label set Tag = {"Association football", "Baseball", "Badminton", "Basketball", "Beach volleyball", …}, containing n = 55 sub-classification labels; each label has a corresponding entity and a word vector trained by Word2vec, and these entities are connected with the core entity "Athletic sports" to form 55 triples. Then, taking these label objects as centroids, the Euclidean distance from each data point in the data set to each centroid is calculated and the data points are assigned to the class of the nearest centroid; at this point 55 clusters with event classes as centroids, i.e. 55 grouped data sets, are obtained, and the corpus W is likewise divided into 55 corpus sets;
Then, taking the cluster-center label t_x "Association football" and its corresponding corpus set w_x as an example, the sub-classification labels of "Association football" are obtained according to the first grouping step, giving the sub-classification label set Tag_x = {"composition", "club", "player", "counter", "channels", "refer", "manager"}; entities are then constructed for these seven labels and combined with "Association football" to form triples, which are added to the triple set TP. Then, following the first-grouping step, the word vectors corresponding to these labels are taken as cluster centers, and the samples in the corpus set with "Association football" as cluster center are grouped again on the basis of Euclidean distance, generating new clusters and corresponding corpus sets; at this point the corpus set w_x with cluster-center label "Association football" is divided into 7 clusters, i.e. 7 grouped data sets, and the corpus set w_x is likewise divided into 7 corpus sets;
3.2.3, finding the keywords of each corpus set w_ij with the TF-IDF algorithm, and using the keywords to group the corpus sets w_ij again;
3.2.3.1, finding the keywords of each corpus set w_ij using the TF-IDF algorithm;
The TF-IDF algorithm in step 3.2.3 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (term frequency) is the frequency with which a given word occurs in a given document, calculated as:
TF_{x,y} = n_{x,y} / Σ_k n_{k,y}
where n_{x,y} is the number of times the term x appears in document y and Σ_k n_{k,y} is the total number of words in document y. The inverse document frequency IDF (inverse document frequency) evaluates the amount of information provided by a word or term, i.e. whether the term is common across the whole document collection, with the formula:
IDF_x = log(n / N_x)
where n is the total number of documents and N_x the number of documents containing the term x; each entry in the text is treated as a document. Finally, TF and IDF are jointly calculated to obtain the TF-IDF formula:
TF-IDF_{x,y} = TF_{x,y} × IDF_x
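The three formulas can be computed directly; in this sketch each document is assumed to be a token list, matching the convention that each entry is one document:

```python
import math

def tf(term, doc):
    """TF_{x,y} = n_{x,y} / sum_k n_{k,y} for one document (token list)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """IDF_x = log(n / N_x) over the whole document collection."""
    n_docs_with_term = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_docs_with_term)

def tf_idf(term, doc, docs):
    """TF-IDF_{x,y} = TF_{x,y} x IDF_x."""
    return tf(term, doc) * idf(term, docs)
```

A term occurring in every document gets IDF 0 and therefore TF-IDF 0, which is exactly why common words are not selected as keywords.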
3.2.3.2, for each corpus set w_ij, the keywords are manually screened: keywords with low relevance to the core entity of the current corpus set are removed and the most relevant portion retained; the number of keywords retained depends on the overall quality of all extracted keywords;
3.2.3.3, a triple is constructed from the entity corresponding to each screened keyword of each corpus set and the core entity of the current corpus set, and added to the triple set TP. Then, taking the word vectors of the keywords as cluster centers, Euclidean distances are calculated again and each corpus set w_ij and its corresponding word vector cluster cube_ij are grouped;
3.2.3.4, taking these keywords as the centroids of each corpus set w_ij, the Euclidean distance from the data points in the set to each centroid is calculated once more and the data points are classified. The original corpus has now been divided into a number of corpus sets w_z;
All triples constructed in the guided pre-grouping process are added to the triple set TP; each resulting word vector cluster is denoted cube_z and its corresponding corpus set w_z, where z is a natural number representing both the number of clusters in the cluster set Cube and the number of corpus sets in the corpus W; the cluster center of cube_z is denoted Centroid_z, and the entity object corresponding to it is Entity_z;
For example, keywords in each corpus set are found through TF-IDF calculation; e.g., in the corpus set corresponding to the "composition" cluster under the "Association football" category, keywords such as "score", "winner", "result", "pass", "goal" and "shot" are found, but some words appear frequently yet are weakly associated with "composition", such as "phase", "list", "year", "body" and "find". Therefore the keywords of each corpus set are screened with manual intervention: keywords with low relevance to the entity of the current cluster center are removed and the highly relevant portion retained; the entities corresponding to the screened keywords of each corpus set are combined with the entity of the current cluster center to construct triples with the latter as head entity. Then, taking the word vectors of these keywords as cluster centers, Euclidean distances are calculated again, and the word vector clusters and corpus sets are grouped;
3.3, on the basis of the guided grouping, each word vector cluster cube_z in Cube is clustered by the DBSCAN clustering algorithm, and the cluster center Centroid_z of cube_z is calculated; for each newly generated word vector cluster C_k, its cluster center Centroid_k is computed; according to the mapping relation between word vector objects and entity objects, the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k are found; with Entity_z as head entity, Entity_k as tail entity and the default entity relation R, the triple (Entity_z, R, Entity_k) is constructed and added to the triple set. A cluster center is automatically found and clustering performed for each corpus set by the DBSCAN clustering algorithm while triples are constructed; the flow is as follows:
3.3.1, each word vector cluster cube_z in Cube is clustered with the DBSCAN clustering algorithm, and the cluster center Centroid_z of cube_z is calculated;
The DBSCAN clustering algorithm in step 3.3.1 is executed as follows:
3.3.1.1, in cube_z, any unvisited sample p (i.e. data point p) is selected as the center, and a circular neighborhood of radius ε (the ε-neighborhood) is drawn;
3.3.1.2, if the number of samples in the epsilon-neighborhood is not less than minPts (the minimum number of samples in the neighborhood), creating a new cluster C for p, and adding the samples in the neighborhood into a set N;
3.3.1.3, repeating steps 3.3.1.1 and 3.3.1.2 for the sample p 'in the set N, judging the subordinate of p' before dividing the epsilon-neighborhood each time, and adding p 'into the cluster C if p' does not belong to any cluster;
3.3.1.4, when all samples in N are accessed, in cubezTo select another sample that has not been visited and repeat step 3.3.1.1 until cubezThe samples in (1) are all accessed;
3.3.1.5, the obtained clustering result: a cluster set;
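Steps 3.3.1.1–3.3.1.5 can be sketched directly in code. The following is a minimal pure-Python DBSCAN over point tuples (parameter names `eps`/`min_pts` mirror ε and minPts; noise points that are never absorbed into a cluster are simply omitted from the output):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over a list of n-dimensional points (tuples).
    Returns a list of clusters, each a list of point indices, mirroring
    steps 3.3.1.1-3.3.1.5 of the text."""
    def neighbors(i):
        # epsilon-neighborhood of sample i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    visited = [False] * len(points)
    labels = [None] * len(points)
    clusters = []
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            continue  # not a core point; may still be absorbed by a later cluster
        cluster = []
        clusters.append(cluster)
        cid = len(clusters) - 1
        labels[i] = cid
        cluster.append(i)
        seeds = list(nbrs)  # the set N of step 3.3.1.2
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_nbrs = neighbors(j)
                if len(j_nbrs) >= min_pts:
                    seeds.extend(j_nbrs)  # expand the cluster through core points
            if labels[j] is None:  # step 3.3.1.3: join C only if unassigned
                labels[j] = cid
                cluster.append(j)
    return clusters
```

A production system would more likely call an existing implementation (e.g., scikit-learn's `DBSCAN`), but the loop above follows the steps of the text one-to-one.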
3.3.2, for each newly generated word vector cluster C_k, compute the cluster center Centroid_k; according to the mapping between word vector objects and entity objects, find the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k; with Entity_z as the head entity, Entity_k as the tail entity, and the default entity relation R, construct the triple (Entity_z, R, Entity_k) and add it to the triple set TP;
3.3.3, if the sample count of the smallest cluster in the clustering result is greater than the threshold Z, take the obtained cluster set ClusDS as input, reduce the values of (ε, minPts), cluster each cluster again, and execute steps 3.3.1 and 3.3.2; if the sample count of the smallest cluster is not greater than the threshold Z, query the entity Entity_q corresponding to each sample in each cluster and the entity Entity_Q corresponding to that cluster's center, and combine them into multiple triples (Entity_Q, R, Entity_q);
Here, in the DBSCAN clustering algorithm, the size of the ε-neighborhood and minPts is determined by the number of samples in Cube_z: the more samples Cube_z contains, the larger the value of (ε, minPts). A larger neighborhood range and minimum sample count are used in the early stage to limit the number of clusters; if smaller values were used, the large number of fine-grained groups produced would disperse the information excessively, and the entity extracted for a cluster center could not represent the core content as an upper-layer entity. When the DBSCAN algorithm is called recursively, the value of (ε, minPts) is decreased along a gradient, shrinking the neighborhood range and minimum sample count, so that the clusters obtained in the previous round are clustered again in turn and the number of samples in each cluster decreases;
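The triple construction of step 3.3.2 might look as follows; mapping a cluster center back to an entity via the member vector nearest to the centroid is our assumption for this sketch, since a computed centroid rarely coincides exactly with a stored word vector:

```python
import math

def centroid(vectors):
    """Arithmetic mean of a list of equal-length vectors (tuples)."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(len(vectors[0])))

def build_triples(clusters, vec_to_entity, head_entity, relation="R"):
    """For each cluster of word vectors, compute its center, map the member
    vector nearest to that center back to an entity via the word-vector/entity
    mapping, and emit a triple (head_entity, relation, cluster_entity)."""
    triples = []
    for cluster in clusters:
        c = centroid(cluster)
        nearest = min(cluster, key=lambda v: math.dist(v, c))
        triples.append((head_entity, relation, vec_to_entity[nearest]))
    return triples
```

The resulting triples use the default relation R, exactly as in the text: the method records that an association exists, not what the association is.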
All entities in the corpus W thus establish relations with other entities, and the corresponding triples combine with one another to form a knowledge graph. Because the cluster centers and clustering conditions are found automatically, entity relations with weak relevance may be produced; therefore, after the knowledge graph is constructed, manual checking and screening are needed to remove entity associations of low relevance and improve the quality of the knowledge graph;
For example, the original Athletic sports corpus has at this point been divided into multiple corpus sets; cluster centers are then found automatically by the DBSCAN clustering algorithm, clustering is performed, and triples are constructed at the same time. The entity corresponding to each computed cluster center forms a triple with the entity corresponding to the center of its parent group from the previous layer. Each new group is then taken as a data set and the DBSCAN algorithm is called again, repeating the above operations until every group contains fewer than 10 data points (here the threshold Z is 10). Finally, the entity corresponding to each data point in a group forms a triple with the entity corresponding to the current center. All entities in the Athletic sports corpus thus establish relations with other entities, and the triples they form combine with one another into a knowledge graph. However, the centers and clustering conditions found by automatic clustering may sometimes produce entity associations with weak correlation, so manual proofreading and screening are needed at the end to remove entity associations of extremely low relevance;
Fourthly, constructing a visual model tree (VT for short): classify the various visual graphs, summarize the attributes and structural features of each kind of graph, and formally express the graph information by creating the visual model tree VT through the following steps:
4.1, define VT as comprising two parts, basic attributes (BASICATTRIBUTE) and visual structure (DVSCHEMA), with the formal definition as in (1), where the basic attributes store general information such as the graph title, subtitle and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, the basic visualization graph can be generalized into four basic categories according to the data type, the graph data structure and the graph dimension required by the graph: general graph (General), topological graph (Topology), Map (Map), Text graph (Text), formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 include two attributes: the graphic type (VType) and the graphic structure (structModel) are adopted, the VType stores the type of the graphics to which the type belongs, the structModel stores the basic visual structure of the graphics to which the type belongs, the formalized definition is shown as (4), and the ' A:: B ' indicates that the ' A ' contains the attribute B ';
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>
in step 4.4, the graphs of the VType attributes of the four basic categories are as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, Topology includes network chart (NetworkChart), tree map (TreeMap) and treemap chart (TreeMapChart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes word cloud (WordCloudChart);
4.5, in the step 4.4, the four basic categories have respective Mapping relations (Mapping), and describe data structures, data dimensions, graph structure relations and data Mapping position information of various graphs; according to Mapping information and by combining a data structure of the graph, a basic visual structure structModel of various graphs can be abstracted;
In 4.5, the Mapping relation and the basic visual structure StructModel of each kind of graph are defined by the following steps:
4.5.1, graphs of the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a tuple (XAxis, YAxis) or a triple (XAxis, YAxis, ZAxis). The Mapping structure of such graphs is as in (5), where LegendName denotes the legend name, storing each group's information as an ARRAY type. From the Mapping structure, the basic StructModel structure can be abstracted as in (6): the child node of StructModel is a temporary root node Root, and Root contains two child nodes, a key-value pair K_V and a legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphs of the Topology type are usually used to represent topological relation data. The tree map and treemap chart can represent the hierarchical structure with nested key-value pairs {key: value, children: {key: value}}, whose Mapping structure is as in (7); the network chart can be represented by a node set (Nodes) and an edge set (Links), whose Mapping structure is as in (8), where source denotes the starting node of an edge link and target denotes the node it points to. From the Mapping structure, the basic StructModel structure can be abstracted as in (9): StructModel has two substructures, with Root1 and Root2 as their temporary root nodes. Root1 contains two child nodes, a key-value pair K_V and a child node children whose substructure is a key-value pair K_V; Root2 contains two child nodes, the node set and the edge set, where the node set's children are key and value (value may be null) and the edge set's children are the start point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphs of the Map type are usually used to represent map information, represented by a key-value pair array [{PlaceName: value}] or a triple array [{lng, lat, value}]. The Mapping structure of such graphs is as in (10), where PlaceName denotes the place name, lng the longitude and lat the latitude. From the Mapping structure, the basic StructModel structure can be abstracted as in (11): StructModel has two substructures, with Root1 and Root2 as their temporary root nodes; Root1 contains a child key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat, and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs of the Text type commonly use a two-tuple (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphs is as in (12), where Keyword is a word extracted from the text and frequency denotes how often the word occurs in the text. From the Mapping structure, the basic StructModel structure can be abstracted as in (13): the child node of StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>
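The DVSCHEMA definitions (1)–(13) could be materialized as a nested structure; the dict layout below is one possible encoding (our own, not the patent's), with the category and chart names taken from the definitions above:

```python
# Sketch of the visualization model tree VT as nested Python dicts.
VT = {
    "BASICATTRIBUTE": {"title": "", "subtitle": "", "attributes": {}},
    "DVSCHEMA": {
        "General": {
            "VType": ["BarChart", "LineChart", "PieChart", "RadarChart", "ScatterChart"],
            "StructModel": {"Root": {"K_V": {}, "LegendNode": {}}},       # (6)
        },
        "Topology": {
            "VType": ["NetworkChart", "TreeMap", "TreeMapChart"],
            "StructModel": {                                               # (9)
                "Root1": {"K_V": {}, "children": {"K_V": {}}},
                "Root2": {"Nodes": ["key", "value"], "Links": ["source", "target"]},
            },
        },
        "Map": {
            "VType": ["AreaMapChart", "CountryMapChart", "WorldMapChart"],
            "StructModel": {"Root1": {"K_V": {}}, "Root2": ["lng", "lat", "value"]},  # (11)
        },
        "Text": {
            "VType": ["WordCloudChart"],
            "StructModel": {"Root": {"K_V": {}}},                          # (13)
        },
    },
}
```

A matcher can then walk `VT["DVSCHEMA"]` and compare each `StructModel` against the M-JSON prototype structure described in the fifth step.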
Fifthly, the data visualization optimization matching method based on the network corpus knowledge graph: define M-JSON as the prototype structure of the JSON returned by a REST Web service; match the Web data prototype structure M-JSON against each StructModel in the visual model tree VT according to the data structure, the returned result being the set of qualifying candidate axis/legend attribute combinations; on the basis of structure matching, use the knowledge graph constructed in the third step to query whether the matched candidate axis/legend attribute combinations have actual semantic correlation, optimize the matching according to the query result, and select valid dimension combinations to improve the accuracy (Precision) of automatically generated graphs. The steps are as follows:
5.1, matching the Web data prototype structure M-JSON with a structModel of a visual model tree VT according to a data structure, matching M attribute combination results of candidate coordinate axes/legends meeting conditions in the M-JSON, wherein each combination result is expressed as a binary group consisting of a key value pair L and an attribute name A, and L and A respectively correspond to LegendNode and K _ V in the step 4.5.1;
5.2, matching and optimizing m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, wherein the process is as follows:
5.2.1, each matching result in step 5.1 is represented as a two-tuple P = (L::name, A::name). Each matching result P_i = (L_i::name, A_i::name) is converted into the triple form G_i = (L_i::name, R, A_i::name) and added to the set S = {G_1, G_2, ..., G_m};
5.2.2, for each G_i in the set S in turn, apply the mapping F(L_i::name → head, R → relation, A_i::name → tail) to convert it into a triple (head, relation, tail), and match whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph; the result is True or False, denoted 1 and 0 respectively. First match the head entity node head and the tail entity node tail in the corpus knowledge graph, then match the relation between them; when head, tail and relation are all matched successfully, result = 1, otherwise result = 0;
5.2.3, after the queries for the objects in the set S are completed, return the set Q = {(G_i, result_i)}. Q records whether each qualifying two-tuple has semantic association and serves as the judgment of the matching result for candidate axis/legend attribute combinations: a match is judged successful only when both the structure matches and result_i = 1, which improves the accuracy of data attribute matching and reduces the rate at which graphs without practical meaning are generated.
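Steps 5.2.1–5.2.3 reduce to a membership test against the knowledge graph's triple set; a minimal sketch (the relation label and the sample triples in the test are illustrative):

```python
def optimize_matches(candidates, kg_triples, relation="R"):
    """Steps 5.2.1-5.2.3 in miniature: each candidate (L_name, A_name) pair
    from structure matching becomes a triple (L_name, R, A_name); result_i is
    1 when that triple exists in the knowledge graph, else 0. Returns the set
    Q as a list of (triple, result) pairs."""
    q = []
    for legend, attr in candidates:
        triple = (legend, relation, attr)  # mapping F: L -> head, R -> relation, A -> tail
        q.append((triple, 1 if triple in kg_triples else 0))
    return q
```

Downstream, only the combinations with result equal to 1 are kept for graph generation, filtering out axis/legend pairings that match structurally but carry no semantic association.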

Claims (10)

1. A knowledge graph relation extraction and REST service visualization fusion method based on a DBSCAN clustering algorithm is characterized by comprising the following steps:
the first step, the construction of a target domain corpus: the method comprises the following steps of taking network corpus content as a basis for constructing a knowledge graph, using network corpus entry information as original corpus content, screening the original network corpus content for constructing the knowledge graph, comparing and analyzing webpage content of the network entries, wherein the original corpus content comprises HTML labels besides headers and text information, editing information of the entries, webpage link information and other redundant information irrelevant to the entries, filtering and cleaning the content of the network entries, extracting the headers and effective text content, and filtering the content, wherein the filtering comprises the following steps: performing HTML tag/text style symbol filtering, entry template symbol and non-English character filtering, entry editing information filtering, picture information filtering, link information filtering, page proprietary title attribute name filtering and numerical value filtering on webpage contents of entries;
step two, corpus-oriented entity extraction: the knowledge graph is a data information network with a graph structure formed by entities and relations; its basic structure is represented by "entity-relation-entity" triples, each comprising two entities with a real semantic relation and the relation between them, represented in the form G = (head, relation, tail), where G denotes the triple, head the head entity, tail the tail entity, and relation the relation between the head and tail entities; each entity also has attributes and attribute values, and an entity's attribute is likewise converted into a tail entity connected to that entity, with a relation established between the head and tail entities; entity extraction is divided into three stages: named entity extraction, attribute entity extraction and noun entity extraction;
the third step: combining Word2vec, perform instructive secondary pre-grouping on the corpus and construct the knowledge graph with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm: the structure of a triple G is (head, relation, tail); relation varies with head and tail, and in a knowledge graph it is actually a relation set used to represent the complex relations between many entities. The purpose here is to judge whether semantic association exists between two attributes, i.e., whether a relation exists between two entities, rather than to attend to the content of the relation; the word vectors of the corpus vocabulary are computed, the corpus is pre-grouped twice together with the corresponding word vector set and clustered twice, and entity relations are extracted with the DBSCAN clustering algorithm;
fourthly, constructing a visual model tree VT: classifying various visual graphs, summarizing and summarizing the attributes and structural features of various graphs, and formally expressing various graph information by creating a visual model tree VT;
fifthly, the data visualization optimization matching method based on the network corpus knowledge graph comprises the following steps: defining M-JSON as a prototype structure of JSON returned by REST Web service; matching the Web data prototype structure M-JSON with each structModel in the visual model tree VT according to the data structure, wherein the returned result is a set formed by attribute combinations of candidate coordinate axes/legends which meet the conditions; on the basis of structure matching, the knowledge graph constructed in the third step is utilized to inquire whether the attribute combination of the matched candidate coordinate axis/legend has actual semantic correlation, the matching is optimized according to the inquiry result, and effective dimension combination is selected, so that the accuracy rate of automatically generating the graph is improved.
2. The knowledge-graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm according to claim 1, wherein the process of the second step is as follows:
2.1, entity extraction: entity extraction, also called named entity recognition, is the automatic recognition of named entities from text data sets; named entities generally refer to people's names, place names, organization names and all other entities identified by a name, and extraction is accomplished with mainstream named entity recognition systems by the following steps: first, perform named entity recognition on the corpus content with the tool; second, mark the type attribute of each identified named entity; third, filter the named entities by type attribute, delete unsuitable ones, and keep the labels of the remaining named entities; entry names are defined as named entities by default;
2.2, extracting attribute entities: extracting attributes from information frames of the vocabulary entry network corpus by taking the information frames of the vocabulary entry network corpus as the sources of the attributes, then intercepting information of the information frames of each vocabulary entry from a corpus, extracting attribute names according to the structure of the information frames, taking the attribute names as tail entities of the named entities corresponding to the names of the vocabulary entries, not reserving attribute values, and if no information frame exists in a certain vocabulary entry, not establishing the tail entities for the named entities corresponding to the vocabulary entry;
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop word filtering (Stop Word Filtering), and stemming (Stemming); since the named entity extraction step has already tagged the identified named entities, the following operations only process the corpus content outside the tagged entities.
3. The method for knowledge-graph relation extraction and REST service visualization fusion based on DBSCAN clustering algorithm according to claim 2, wherein the process of 2.3 is as follows:
2.3.1, splitting words: designing a splitting rule mode by using a regular expression, splitting words of the corpus content according to spaces, symbols and paragraphs, and obtaining word texts;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, the text vocabulary must be tagged with parts of speech. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique in corpus linguistics that tags the part of speech of each word in the corpus according to its meaning and context; many words carry several parts of speech and several meanings, and the choice of part of speech depends on the contextual meaning. The corpus that has undergone named entity tagging is used as the target text for part-of-speech tagging; noun objects are found according to the tagging results, and non-noun objects, excluding the entry names, are removed from the corpus. At this point the corpus retains the named entities, noun objects and original punctuation of each entry;
2.3.3, stop word filtering: stop words, named from the term Stop Word, are words or phrases automatically filtered out when processing natural language text in order to save storage space and improve search efficiency in information retrieval; for a given purpose, any kind of word can be chosen as a stop word. Stop words mainly fall into two kinds: one is the function words (Function Words) of human language, which are used very commonly and appear very frequently but carry no exact practical meaning; the other is content words (Content Words), here referring to words that have some concrete meaning but no definite reference or direction. In natural language processing there exists a stop word list (Stop Word List); using it as a reference dictionary, stop words are deleted from the corpus by word comparison, further simplifying the corpus content and ensuring that no stop words remain;
2.3.4, stem extraction: stemming is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English; the same English word has singular/plural deformations, tense deformations, and deformations for the different predicates corresponding to personal pronouns. These words differ slightly in form but correspond to the same root, and since they must be treated as the same word when computing correlation, stemming is needed.
4. The DBSCAN clustering algorithm-based knowledge graph relationship extraction and REST service visualization fusion method according to claim 3, wherein in 2.3.4, the stem extraction step is as follows:
2.3.4.1, according to the word deformation category, perform affix removal and word restoration for each case to obtain the stem information of the noun objects in the corpus, reducing the cases where different forms belong to the same word; the 6 word deformation cases are as follows:
2.3.4.1.1, plurals, and words ending in -ed or -ing;
2.3.4.1.2, words that contain a vowel and end in -y;
2.3.4.1.3, words with double suffixes;
2.3.4.1.4, words with suffixes such as -ic, -ful, -ness, -ative;
2.3.4.1.5, words of the form <c>vcvc<v> with suffixes such as -ant, -ence (c denotes a consonant, v a vowel);
2.3.4.1.6, words of the form <c>vc<v> containing more than one vc pair between vowels and consonants, and words ending in -e;
2.3.4.2, create the noun objects restored to their stems as noun entities, and update the noun objects in the corpus to their stem form.
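The affix-removal cases of step 2.3.4.1 can be illustrated with a much-simplified suffix stripper; this is a rough subset of the rules written for illustration only, and a real system would use a full Porter stemmer:

```python
def simple_stem(word):
    """Strip a few common English suffixes (a rough subset of the deformation
    cases in 2.3.4.1); not a full Porter stemmer."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"                      # "bodies" -> "body"
    if w.endswith(("ches", "shes", "sses", "xes")):
        return w[:-2]                            # "matches" -> "match"
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]                               # "passing" -> "pass"
        if len(w) > 2 and w[-1] == w[-2] and w[-1] not in "ls":
            w = w[:-1]                           # undouble: "running" -> "run"
        return w
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]                            # "jumped" -> "jump"
    if w.endswith("s") and not w.endswith(("ss", "us")):
        return w[:-1]                            # "goals" -> "goal"
    return w
```

The point of the sketch is the shape of the rules (suffix condition plus a restoration step), not their coverage; the measure-based <c>vcvc<v> conditions of cases 2.3.4.1.5 and 2.3.4.1.6 are omitted here.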
5. The method for fusion of knowledge-graph relation extraction and REST service visualization based on the DBSCAN clustering algorithm according to any of claims 1 to 4, wherein in the third step, the DBSCAN algorithm is a density-based spatial clustering algorithm for applications with noise, which examines the connectivity among samples according to their density distribution and expands clusters based on the connectable samples to obtain the final clustering result; the process is as follows:
3.1, train the corpus W into a word vector cluster Cube with Word2vec: Word2vec is a word vector tool that represents words as feature vectors, converting each word into numerical form as an N-dimensional vector; the corpus W is embedded into a vector space to obtain the word vector cluster Cube, in which the word vectors are discretely distributed. Depending on how strongly the word vectors correlate, their density varies and the distribution exhibits different aggregation patterns; by analyzing the aggregation state of the word vectors, the relevance distribution of the words can be obtained, and the word vectors are grouped according to their different affinities to obtain the relations between words, i.e., the relations between entities;
3.2, perform two instructive pre-groupings on the corpus: because DBSCAN clustering is easily affected by the distribution of the data set, in order to ensure that the core concepts, i.e., the main classification objects or keywords of the target field, serve as cluster centers, the corpus needs two instructive pre-groupings;
3.3, on the basis of the guided grouping, cluster each word vector cluster Cube_z in Cube with the DBSCAN clustering algorithm and compute the cluster center Centroid_z of Cube_z; for each newly generated word vector cluster C_k, compute its cluster center Centroid_k; according to the mapping between word vector objects and entity objects, find the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k; with Entity_z as the head entity, Entity_k as the tail entity, and the default entity relation R, construct the triple (Entity_z, R, Entity_k) and add it to the triple set; a cluster center is found automatically for each corpus set by the DBSCAN clustering algorithm, clustering is performed, and triples are constructed at the same time.
6. The knowledge-graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm according to claim 5, wherein the flow of 3.3 is as follows:
3.3.1, cluster each word vector cluster Cube_z in Cube with the DBSCAN clustering algorithm and compute the cluster center Centroid_z of Cube_z;
The DBSCAN clustering algorithm in step 3.3.1 is executed as follows:
3.3.1.1, in Cube_z, select any unvisited sample p (i.e., a data point p) as the center and draw a circular neighborhood (the ε-neighborhood) of radius ε;
3.3.1.2, if the number of samples in the ε-neighborhood is not less than minPts (the minimum number of samples in a neighborhood), create a new cluster C for p and add the samples in the neighborhood to a set N;
3.3.1.3, repeat steps 3.3.1.1 and 3.3.1.2 for each sample p' in the set N, checking the membership of p' before dividing its ε-neighborhood each time, and adding p' to cluster C if p' does not yet belong to any cluster;
3.3.1.4, when all samples in N have been visited, select another unvisited sample in Cube_z and repeat from step 3.3.1.1 until all samples in Cube_z have been visited;
3.3.1.5, output the obtained clustering result: a cluster set;
3.3.2, for each newly generated word vector cluster C_k, compute the cluster center Centroid_k; according to the mapping between word vector objects and entity objects, find the entities Entity_z and Entity_k corresponding to Centroid_z and Centroid_k; with Entity_z as the head entity, Entity_k as the tail entity, and the default entity relation R, construct the triple (Entity_z, R, Entity_k) and add it to the triple set TP;
3.3.3, if the sample count of the smallest cluster in the clustering result is greater than the threshold Z, take the obtained cluster set ClusDS as input, reduce the values of (ε, minPts), cluster each cluster again, and execute steps 3.3.1 and 3.3.2; if the sample count of the smallest cluster is not greater than the threshold Z, query the entity Entity_q corresponding to each sample in each cluster and the entity Entity_Q corresponding to that cluster's center, and combine them into multiple triples (Entity_Q, R, Entity_q);
wherein, in the DBSCAN clustering algorithm, the size of the ε-neighborhood and minPts is determined by the number of samples in Cube_z: the more samples Cube_z contains, the larger the value of (ε, minPts); a larger neighborhood range and minimum sample count are used in the early stage to limit the number of clusters, since if smaller values were used, the large number of fine-grained groups produced would disperse the information excessively, and the entity extracted for a cluster center could not represent the core content as an upper-layer entity; when the DBSCAN algorithm is called recursively, the value of (ε, minPts) is decreased along a gradient, shrinking the neighborhood range and minimum sample count, so that the clusters obtained in the previous round are clustered again in turn and the number of samples in each cluster decreases;
all entities in the corpus W thus establish relations with other entities, and the triples they form combine with one another accordingly to form a knowledge graph.
7. The knowledge-graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm according to claim 5, wherein the 3.2 steps are as follows:
3.2.1, grouping the corpus W and the word vector cluster Cube corresponding to the corpus W for one time, and the steps are as follows:
3.2.1.1, extract the root corpus label in the corpus W to form the core entity; obtain the network corpus by a crawler, extract the first-layer sub-classification labels of the root corpus label in the corpus, and generate the first-layer sub-classification label set Tag = {t_1, t_2, ..., t_i, ..., t_n}, which contains n sub-classification labels; each label has a corresponding entity and word vector; combine these entities with the core entity to form n triples, and add the triples to the triple set TP;
3.2.1.2, for each tag t_i in the classification label set Tag, take the corresponding word vector as a cluster center, compute the Euclidean distance from each data point in the word vector cluster Cube to each center, then assign the data points to the cluster of the nearest center, dividing the corpus W into n corpus sets w_i (1 ≤ i ≤ n), where the word vector cluster corresponding to corpus set w_i is cube_i;
3.2.2, perform secondary grouping on each corpus set w_i and secondary clustering on the corresponding word vector cluster cube_i, following the steps of the first grouping; the flow is as follows:
3.2.2.1, for each corpus set w_i and its corresponding cluster-center label t_i, take the cluster-center label t_i as a second-level core entity, obtain the network corpus through a crawler, extract the first-layer sub-classification labels of the cluster-center label, and generate the classification label set Tag_i = {t_i1, t_i2, ..., t_ij, ..., t_im_i} (1 ≤ j ≤ m_i, 1 ≤ i ≤ n), indicating that the current tag t_i contains m_i sub-classification labels in total; each label has a corresponding entity and word vector; combine these entities with the second-level core entity to form m_i triples and add them to the triple set TP;
3.2.2.2, taking the word vector corresponding to each label in the set Tagi from step 3.2.2.1 as a cluster center, calculating the Euclidean distance from each data point in the current word vector cluster cubei to each centroid, then assigning the data points to the cluster of the nearest cluster center according to the principle of proximity; at this point each corpus set wi is divided again into mi corpus sets wij, wherein 1 <= j <= mi and 1 <= i <= n, i.e., the original corpus W is divided into

Σ(i=1..n) mi

corpus sets wij, and the word vector cluster corresponding to wij is cubeij;
Wherein the Euclidean Distance in step 3.2.1.2 is an important basis for determining the class of a data point; suppose given samples

xi = (xi1, xi2, ..., xin) and xj = (xj1, xj2, ..., xjn),

wherein i, j = 1, 2, ..., m indexes the samples and n represents the number of features; the Euclidean distance is calculated as:

d(xi, xj) = sqrt( Σ(k=1..n) (xik − xjk)² )
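The Euclidean distance and the nearest-centroid assignment used in steps 3.2.1.2 and 3.2.2.2 can be sketched as follows; this is a minimal illustration, and the function names are ours:

```python
import math

def euclidean(x_i, x_j):
    """d(x_i, x_j) = sqrt(sum_k (x_ik - x_jk)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def assign_to_centroids(points, centroids):
    """Assign each point to its nearest centroid (principle of proximity)."""
    groups = {c: [] for c in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: euclidean(p, centroids[c]))
        groups[nearest].append(p)
    return groups
```

In the claim, the centroids are the word vectors of the classification labels, and the points are the word vectors in the current cluster Cube or cubei.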
3.2.3, finding the keywords of each corpus set wij with the TF-IDF algorithm, and using them to group the corpus sets wij again;
3.2.3.1, finding the keywords of each corpus set wij with the TF-IDF algorithm;
The TF-IDF algorithm in step 3.2.3 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (term frequency) refers to the frequency with which a given word appears in a given document, and its calculation formula is:

TFx,y = nx,y / Σk nk,y

wherein nx,y refers to the number of times the term x appears in the document y, and Σk nk,y refers to the total number of words in the document y. The inverse document frequency IDF (inverse document frequency) is used to evaluate the amount of information provided by a word or term, i.e. whether the term is common across the whole document collection, and its formula is:

IDFx = log( N / Nx )

wherein N refers to the total number of documents and Nx refers to the number of documents containing the term x; each entry in the text is taken as a document. Finally, the values of TF and IDF are combined to obtain the formula of TF-IDF:

TF-IDFx,y = TFx,y × IDFx
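The TF and IDF formulas above can be sketched directly, treating each tokenized entry as a document; `math.log` is used for the logarithm, whose base the claim does not specify, and the sketch assumes the queried term occurs in at least one document:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """TF_{x,y} = n_{x,y} / sum_k n_{k,y} for one tokenized document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """IDF_x = log(N / N_x); corpus is a list of tokenized documents."""
    n_docs = len(corpus)
    n_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / n_with_term)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF_{x,y} = TF_{x,y} * IDF_x."""
    return tf(term, doc_tokens) * idf(term, corpus)
```

A term that appears often in one corpus set wij but rarely in the others thus scores high and is kept as a candidate keyword for step 3.2.3.2.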
3.2.3.2, manually screening the keywords of each corpus set wij, removing the keywords with low relevance to the core entity of the current corpus set and retaining the portion of keywords with the highest relevance; the number of keywords retained depends on the overall quality of all the extracted keywords;
3.2.3.3, constructing triples from the entities corresponding to the screened keywords extracted from each corpus set and the core entity of the current corpus set, adding them to the triple set TP, then taking the word vectors of the keywords as cluster centers, calculating the Euclidean distance again, and grouping each corpus set wij and its corresponding word vector cluster cubeij;
3.2.3.4, using these keywords as the cluster centers of each corpus set wij, calculating the Euclidean distance between the data points in the set and each centroid once more, classifying the data points, and dividing the original corpus into a number of small corpus sets wz;
All triples constructed in the above guided pre-grouping process are added to the triple set TP. Each word vector cluster obtained is denoted cubez and its corresponding corpus set wz, wherein z is a natural number representing the number of clusters in the Cube set and the number of corpus sets in the corpus set W; the cluster center of cubez is denoted Centroidz, and the Entity object corresponding to its entity is Entityz.
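The accumulation of triples into the set TP during the guided pre-grouping can be sketched as follows. The relation name `subclass_of` and the (head, relation, tail) direction are illustrative assumptions, since the claim does not fix either:

```python
def build_triples(core_entity, sub_labels, relation="subclass_of"):
    """Each sub-classification label forms one (head, relation, tail)
    triple with the current core entity, as in steps 3.2.1.1 / 3.2.2.1."""
    return {(label, relation, core_entity) for label in sub_labels}

# TP grows level by level as the grouping recurses (sample labels only)
TP = set()
TP |= build_triples("Science", ["Physics", "Chemistry"])   # root level
TP |= build_triples("Physics", ["Mechanics", "Optics"])    # second level
```

Each grouping level contributes one batch of triples, so the finished TP mirrors the hierarchy of corpus sets wz.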
8. The knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm according to any one of claims 1 to 4, wherein the fourth step comprises the following steps:
4.1, defining VT to comprise two parts, the basic attributes (BASICATTRIBUTE) and the visual structure (DVSCHEMA), with the formal definition as (1), wherein the basic attributes store general information such as the graphic title, subtitle and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, according to the data type, graph data structure and graph dimension required by a graph, the basic visualization graphics can be generalized into four basic categories: general graphs (General), topological graphs (Topology), maps (Map) and text graphs (Text), with the formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 each include two attributes: the graphic type (VType) and the graphic structure (StructModel); VType stores the categories of graphics belonging to the category, and StructModel stores the basic visual structure of those graphics; the formal definition is (4), where "A::B" indicates that "A contains attribute B";
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>;
4.5, the four basic categories in step 4.4 have their respective Mapping relations (Mapping), which describe the data structure, data dimensions, graph structure relations and data mapping position information of the various graphs; according to the Mapping information and in combination with the data structure of each graph, the basic visual structure StructModel of each kind of graph can be abstracted.
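The VT definition in formulas (1)–(4) can be sketched as a nested structure. This is an illustrative rendering, not a normative schema; the chart names follow the VType enumeration of the four categories:

```python
# Visual model tree VT: BASICATTRIBUTE plus DVSCHEMA with four categories,
# each category holding a VType list and a StructModel slot.
VT = {
    "BASICATTRIBUTE": {
        "title": "",
        "subtitle": "",
        "attributes": {"position": None, "color": None,
                       "font": None, "fontSize": None},
    },
    "DVSCHEMA": {
        category: {"VType": [], "StructModel": None}
        for category in ("General", "Topology", "Map", "Text")
    },
}

VT["DVSCHEMA"]["General"]["VType"] = [
    "BarChart", "LineChart", "PieChart", "RadarChart", "ScatterChart"]
VT["DVSCHEMA"]["Topology"]["VType"] = [
    "NetworkChart", "TreeMap", "TreeMapChart"]
VT["DVSCHEMA"]["Map"]["VType"] = [
    "AreaMapChart", "CountryMapChart", "WorldMapChart"]
VT["DVSCHEMA"]["Text"]["VType"] = ["WordCloudChart"]
```

Filling the `StructModel` slots with the structures of formulas (6), (9), (11) and (13) completes the model tree.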
9. The method for knowledge graph relation extraction and REST service visualization fusion based on DBSCAN clustering algorithm according to claim 8, wherein in 4.4, the graphs to which VType attributes of four basic categories belong are as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, Topology comprises the network chart (NetworkChart), tree map (TreeMap) and area tree map (TreeMapChart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes the word cloud (WordCloudChart);
In step 4.5, the Mapping relation (Mapping) and the basic visual structure StructModel of each kind of graph are defined as follows:
4.5.1, graphics of the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a two-tuple (XAxis, YAxis) or a three-tuple (XAxis, YAxis, ZAxis). The Mapping structure of such graphics is (5), where LegendName represents the legend name and stores each grouping information in ARRAY type. According to the Mapping structure, the basic StructModel structure can be abstracted as (6): the child node of StructModel is a temporary root node Root, and Root contains two child nodes, a key-value pair K_V and a legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphs of the Topology type are commonly used to represent topological relational data. Tree maps and area tree maps can be represented with nested key-value pairs { key: value, children: { key: value } } expressing the tree structure, with the Mapping structure as (7); the network graph can be represented by a node set (Nodes) and an edge set (Links), with the Mapping structure as (8), where source represents the starting node of an edge link and target represents the node the edge points to. According to the Mapping structure, the basic StructModel structure can be abstracted as (9): StructModel has two sub-structures, with Root1 and Root2 as their temporary root nodes. Root1 contains two child nodes: a key-value pair K_V and a children node whose sub-structure is again a key-value pair K_V. Root2 contains two child nodes: the node set Nodes, whose child nodes are key and value (value may be null), and the edge set Links, whose child nodes are the starting point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphics of the Map type are commonly used to represent map information, with a key-value pair array [ { PlaceName: value } ] or a triple array [ { lng, lat, value } ]. The Mapping structure of such graphs is (10), where PlaceName represents a place name, lng represents the longitude and lat represents the latitude. The basic StructModel structure can be abstracted from the Mapping structure as (11): StructModel has two sub-structures, with Root1 and Root2 as their temporary root nodes; Root1 contains a child key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat, and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs of the Text type commonly use a two-tuple (Keyword, frequency) to represent keyword frequency. The Mapping structure of such graphs is (12), where Keyword is a word extracted from the text and frequency represents the number of occurrences of the word in the text. The basic StructModel structure can be abstracted from the Mapping structure as (13): the child node of StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>。
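The four StructModel shapes abstracted in formulas (6), (9), (11) and (13) can be sketched as nested dictionaries; `None` and `{}` are illustrative placeholders for the slots that a concrete Mapping fills in:

```python
# Basic visual structures per category; keys mirror the node names in the
# formal definitions (Root, K_V, LegendNode, children, Nodes, Links, ...).
STRUCT_MODELS = {
    # (6): Root with a key-value pair and a legend node
    "General": {"Root": {"K_V": {}, "LegendNode": {}}},
    # (9): two sub-structures — nested tree, and node set + edge set
    "Topology": [
        {"Root1": {"K_V": {}, "children": {"K_V": {}}}},
        {"Root2": {"Nodes": {"key": None, "value": None},
                   "Links": {"source": None, "target": None}}},
    ],
    # (11): place-name key-value pair, or longitude/latitude/value triple
    "Map": [
        {"Root1": {"K_V": {}}},
        {"Root2": {"lng": None, "lat": None, "value": None}},
    ],
    # (13): a single key-value pair (keyword -> frequency)
    "Text": {"Root": {"K_V": {}}},
}
```

Step five matches a Web data prototype against these shapes to find candidate attribute combinations.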
10. The knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm according to any one of claims 1 to 4, wherein the fifth step comprises the following steps:
5.1, matching the Web data prototype structure M-JSON with the StructModel of the visual model tree VT according to the data structure, yielding m candidate coordinate-axis/legend attribute combinations in M-JSON that satisfy the conditions, each combination result represented as a two-tuple consisting of a key-value pair L and an attribute name A, where L and A correspond to LegendNode and K_V in step 4.5.1 respectively;
5.2, matching and optimizing the m qualified attribute combinations with the constructed network corpus knowledge graph, as follows:
5.2.1, each matching result in step 5.1 is represented as a two-tuple P = (L::name, A::name); each result Pi = (Li::name, Ai::name) is converted to the triple form Gi = (Li::name, R, Ai::name) and added to the set S = {G1, G2, ..., Gm};
5.2.2, each Gi in the set S is mapped in turn onto the triple structure of the knowledge graph by the mapping F: (Li::name → head, R → relation, Ai::name → tail), giving a triple (head, relation, tail). Whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph is then matched, with result True or False, expressed as 1 and 0 respectively: first the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, then the relation between them, and the result is 1 if and only if the head entity, the tail entity and the relation are all matched successfully;
5.2.3, after the query over the set S is completed, a set Q = {(Gi, resulti)} is returned. Q is used to judge whether the current qualified two-tuples have semantic association, serving as the judgment of the matching result for the candidate coordinate-axis/legend attribute combinations; the match is judged successful only when both the structure matches and resulti is 1, which improves the accuracy of data attribute matching and reduces the rate at which graphics without practical meaning are generated.
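The structure-plus-semantics check of steps 5.2.1–5.2.3 can be sketched as follows; the relation constant `R` and the sample graph contents are illustrative assumptions:

```python
def match_combination(pair, kg_triples, relation="R"):
    """Map a candidate (L::name, A::name) pair onto (head, relation, tail)
    and check it against the corpus knowledge graph; returns 1 or 0."""
    head, tail = pair
    return 1 if (head, relation, tail) in kg_triples else 0

# toy knowledge graph: only (year, R, sales) is semantically associated
kg = {("year", "R", "sales")}
candidates = [("year", "sales"), ("id", "sales")]

# Q pairs each Gi with its result; only result == 1 survives
Q = [(c, match_combination(c, kg)) for c in candidates]
accepted = [c for c, r in Q if r == 1]
```

The structurally valid but semantically meaningless pair ("id", "sales") is filtered out, which is exactly the pruning the claim describes.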
CN201911254786.9A 2019-12-10 2019-12-10 Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm Active CN111143479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911254786.9A CN111143479B (en) 2019-12-10 2019-12-10 Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm


Publications (2)

Publication Number Publication Date
CN111143479A true CN111143479A (en) 2020-05-12
CN111143479B CN111143479B (en) 2023-09-01

Family

ID=70517820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911254786.9A Active CN111143479B (en) 2019-12-10 2019-12-10 Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm

Country Status (1)

Country Link
CN (1) CN111143479B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082183A1 (en) * 2011-02-22 2018-03-22 Thomson Reuters Global Resources Machine learning-based relationship association and related discovery and search engines
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN109255125A (en) * 2018-08-17 2019-01-22 浙江工业大学 A kind of Web service clustering method based on improvement DBSCAN algorithm
CN109902434A (en) * 2019-03-18 2019-06-18 浙江工业大学 Service data visual modeling and matching process towards REST framework style under cloud computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG YUANMING et al.: "Research on Data Service Dependency Graph Model and Automatic Composition Method" *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109828995B (en) * 2018-12-14 2020-12-11 中国科学院计算技术研究所 Visual feature-based graph data detection method and system
CN109828995A (en) * 2018-12-14 2019-05-31 中国科学院计算技术研究所 A kind of diagram data detection method, the system of view-based access control model feature
CN113867850B (en) * 2020-06-29 2023-12-29 阿里巴巴集团控股有限公司 Data processing method, device, equipment and storage medium
CN113867850A (en) * 2020-06-29 2021-12-31 阿里巴巴集团控股有限公司 Data processing method, device, equipment and storage medium
CN111813955A (en) * 2020-07-01 2020-10-23 浙江工商大学 Service clustering method based on knowledge graph representation learning
CN111813955B (en) * 2020-07-01 2021-10-19 浙江工商大学 Service clustering method based on knowledge graph representation learning
CN111831823A (en) * 2020-07-10 2020-10-27 湖北亿咖通科技有限公司 Corpus generation and model training method
CN112036182A (en) * 2020-07-31 2020-12-04 中国科学院信息工程研究所 Knowledge representation learning method and system for introducing attribute semantics from multiple angles
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112559907A (en) * 2020-12-09 2021-03-26 北京国研数通软件技术有限公司 Basic data retrieval and integrated display method based on spatio-temporal label spatio-temporal correlation
WO2022126944A1 (en) * 2020-12-17 2022-06-23 上海朝阳永续信息技术股份有限公司 Text clustering method, electronic device and storage medium
CN113191497A (en) * 2021-05-28 2021-07-30 国家电网有限公司 Knowledge graph construction method and system for substation stepping exploration site selection
CN113191497B (en) * 2021-05-28 2024-04-23 国家电网有限公司 Knowledge graph construction method and system for substation site selection
CN113379214A (en) * 2021-06-02 2021-09-10 国网福建省电力有限公司 Method for automatically filling and assisting decision of power grid accident information based on affair map
CN113987152A (en) * 2021-11-01 2022-01-28 北京欧拉认知智能科技有限公司 Knowledge graph extraction method, system, electronic equipment and medium
CN113988724A (en) * 2021-12-28 2022-01-28 深圳市迪博企业风险管理技术有限公司 Risk analysis method for financial activity knowledge graph of listed company
CN114398109A (en) * 2022-01-07 2022-04-26 福州大学 Method for constructing personalized intelligent assistant based on general knowledge graph
CN114675818B (en) * 2022-03-29 2024-04-19 江苏科技大学 Method for realizing measurement visualization tool based on rough set theory
CN114675818A (en) * 2022-03-29 2022-06-28 江苏科技大学 Method for realizing measurement visualization tool based on rough set theory
CN114780083A (en) * 2022-06-17 2022-07-22 之江实验室 Visual construction method and device of knowledge map system
US11907390B2 (en) 2022-06-17 2024-02-20 Zhejiang Lab Method and apparatus for visual construction of knowledge graph system
CN114996412A (en) * 2022-08-02 2022-09-02 医智生命科技(天津)有限公司 Medical question-answering method and device, electronic equipment and storage medium
CN114996412B (en) * 2022-08-02 2022-11-15 医智生命科技(天津)有限公司 Medical question and answer method and device, electronic equipment and storage medium
CN116842394A (en) * 2023-09-01 2023-10-03 苏州高视半导体技术有限公司 Algorithm parameter file generation method, electronic equipment and storage medium
CN116910131A (en) * 2023-09-12 2023-10-20 山东省国土测绘院 Linkage visualization method and system based on basic geographic entity database
CN116910131B (en) * 2023-09-12 2023-12-08 山东省国土测绘院 Linkage visualization method and system based on basic geographic entity database
CN117725555A (en) * 2024-02-08 2024-03-19 暗物智能科技(广州)有限公司 Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium
CN117725555B (en) * 2024-02-08 2024-06-11 暗物智能科技(广州)有限公司 Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium
CN117973519A (en) * 2024-03-29 2024-05-03 南京中医药大学 Knowledge graph-based data processing method

Also Published As

Publication number Publication date
CN111143479B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN111143479B (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN111177591B (en) Knowledge graph-based Web data optimization method for visual requirements
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN108763333B (en) Social media-based event map construction method
CN109189942B (en) Construction method and device of patent data knowledge graph
CN109800284B (en) Task-oriented unstructured information intelligent question-answering system construction method
CN106446148B (en) A kind of text duplicate checking method based on cluster
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN102955848B (en) A kind of three-dimensional model searching system based on semanteme and method
CN106951438A (en) A kind of event extraction system and method towards open field
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
CN111309925A (en) Knowledge graph construction method of military equipment
CN106776562A (en) A kind of keyword extracting method and extraction system
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN108319583B (en) Method and system for extracting knowledge from Chinese language material library
WO2014210387A2 (en) Concept extraction
CN106484797A (en) Accident summary abstracting method based on sparse study
CN101673306B (en) Website information query method and system thereof
CN112818661B (en) Patent technology keyword unsupervised extraction method
Alami et al. Hybrid method for text summarization based on statistical and semantic treatment
CN112036178A (en) Distribution network entity related semantic search method
CN105975547A (en) Approximate web document detection method based on content and position features
CN114090861A (en) Education field search engine construction method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230802

Address after: No. 9 Santong Road, Houzhou Street, Taijiang District, Fuzhou City, Fujian Province, 350000. Zhongting Street Renovation. 143 Shopping Mall, 3rd Floor, Jiahuiyuan Link Section

Applicant after: Fuzhou Zhiqing Intellectual Property Service Co.,Ltd.

Address before: No. 18 Chaowang Road, Zhaohui Six District, Hangzhou City, Zhejiang Province 310014

Applicant before: ZHEJIANG UNIVERSITY OF TECHNOLOGY

Effective date of registration: 20230802

Address after: Room 103, No.1 Pugong Shanxi Road, Phase III, Software Park, Xiamen Torch High tech Zone, Xiamen, Fujian Province, 361000

Applicant after: Easy Point Life Digital Technology Co.,Ltd.

Address before: No. 9 Santong Road, Houzhou Street, Taijiang District, Fuzhou City, Fujian Province, 350000. Zhongting Street Renovation. 143 Shopping Mall, 3rd Floor, Jiahuiyuan Link Section

Applicant before: Fuzhou Zhiqing Intellectual Property Service Co.,Ltd.

GR01 Patent grant