CN111177591A - Knowledge graph-based Web data optimization method facing visualization demand - Google Patents


Publication number
CN111177591A
Authority
CN
China
Prior art keywords
corpus
graph
entity
word
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911254814.7A
Other languages
Chinese (zh)
Other versions
CN111177591B (en)
Inventor
陆佳炜
王小定
高燕煦
朱昊天
徐俊
肖刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou Zhiqing Intellectual Property Service Co ltd
Shenzhen Shukangyun Information Technology Co ltd
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201911254814.7A priority Critical patent/CN111177591B/en
Publication of CN111177591A publication Critical patent/CN111177591A/en
Application granted granted Critical
Publication of CN111177591B publication Critical patent/CN111177591B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering


Abstract

A knowledge graph-based Web data optimization method for visualization requirements comprises the following steps: first, constructing a corpus for the target domain; second, extracting entities from the corpus; third, pre-grouping the corpus twice and constructing a knowledge graph with the k-means clustering algorithm; fourth, classifying common visualization graph types, summarizing their attributes and structural features, and formally expressing the graph information by creating a visualization model tree (VT); fifth, applying a data visualization optimization matching method based on the network corpus knowledge graph: M-JSON is defined as the JSON prototype structure returned by a REST Web service, M-JSON is matched against the data structures in the visualization model tree, and the knowledge graph from the third step is queried to determine whether a matched attribute combination has an actual semantic association, so that effective dimension combinations are selected and the accuracy of automatically generated graphs is improved.

Description

Knowledge graph-based Web data optimization method facing visualization demand
Technical Field
The invention relates to a knowledge graph-based Web data optimization method facing visualization requirements.
Background
Service-Oriented Computing (SOC) is a computing paradigm for distributed systems that currently receives attention from both industry and academia. Driven by the development of the SOC paradigm, Web services have been further popularized and applied. Since the REST (Representational State Transfer) architectural style was introduced in 2000, REST services have increasingly become an important component of Web services. The simplicity, lightness and speed of REST services have made them prevalent on the Internet, where their number has kept growing at a near-exponential rate. Diversified data services cut across fields such as economics, medical care, sports and daily life, producing a huge amount of data. However, regardless of the data involved, the primary goal in obtaining data is to extract the effective information it contains.
Data visualization assists users in analyzing and understanding data through interactive visual interfaces and data-to-image conversion techniques. The basis of visualization is data, and the data of the network era is multi-source and heterogeneous, which raises the problems of data source integration and data arrangement. Data service providers in various fields offer large numbers of services, each with its own response mode and differently structured response format, which makes data acquisition and data analysis difficult. With the development of multimedia and visualization technology, users are no longer satisfied with plain tabular data and pursue more intuitive, richer forms of data display and more convenient, efficient data processing tools. Therefore, automatically parsing and arranging heterogeneous service data with reduced human intervention, and visualizing the data automatically, has important practical significance.
The knowledge graph was formally proposed by Google in June 2012 and is a graph-based data structure. A knowledge graph is a structured semantic knowledge base that presents each entity in the real world, and the relations between those entities, in the form of a graph described formally. The basic building blocks of a knowledge graph are entity-relation-entity triples and entity attribute-value pairs. Knowledge graphs are stored as triple expressions of "entity-relation-entity" or "entity-attribute-value", and these data form a considerable network of entity relations, i.e., a "graph" of knowledge.
At present, although some data visualization modeling methods for REST services exist, their automatic visualization efficiency is low, or the automatically generated visualizations contain a large number of redundant graphs, which hinders user understanding and analysis. The knowledge graph has efficient information retrieval capability, strong semantic relation construction capability and visual presentation capability; combining the knowledge graph with data visualization can reveal the laws hidden behind the data more effectively.
Disclosure of Invention
The invention provides a knowledge graph-based Web data optimization method for visualization requirements: common visualization graphs are analyzed, summarized and modeled, the structure of the Web data is matched against that of the visualization model, and the attribute combinations of candidate coordinate axes/legends that satisfy the requirements are obtained. A knowledge graph is constructed from the network corpus, and whether a semantic association exists within an attribute combination is determined by querying the knowledge graph, further optimizing the visualization of Web data and improving the probability of generating effective graphs.
In order to realize the invention, the technical scheme is as follows:
a knowledge graph-based Web data optimization method facing visualization requirements comprises the following steps:
First step, construction of a target-domain corpus: network corpus content serves as the basis for constructing the knowledge graph, and network encyclopedia entry information is used as the original corpus content. The original content must be screened before graph construction: comparison and analysis of the entry web pages shows that they contain HTML tags, entry editing information, page link information and other redundant information irrelevant to the entry itself, so the entry content is filtered and cleaned, and the title and the effective body text are extracted. The filtering covers: HTML tag/text style symbol filtering, entry template symbol and non-English character filtering, entry editing information filtering, picture information filtering, link information filtering, page-specific title attribute name filtering and numeric value filtering of the entry web pages;
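The cleaning pass above can be sketched with ordinary regular expressions. The concrete patterns below are illustrative assumptions, not the patent's actual filtering rules:

```python
import re

def clean_entry(html: str) -> str:
    """Strip tags, links, numerals and other entry-page debris (illustrative rules)."""
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags / style markup
    text = re.sub(r"\[\d+\]|\[edit\]", " ", text)      # entry editing markers
    text = re.sub(r"https?://\S+", " ", text)          # link information
    text = re.sub(r"[^A-Za-z\s.,;:!?'-]", " ", text)   # non-English characters, numerals
    return re.sub(r"\s+", " ", text).strip()

print(clean_entry("<p>Apple[1] is a <a href='https://x.y'>fruit</a>.</p>"))
```

The order matters: URLs are removed before the character filter, which would otherwise break them into stray fragments.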
Second step, corpus-oriented entity extraction: the knowledge graph is a data information network with a graph structure formed by entities and relations. Its basic structure is the entity-relation-entity triple; a triple comprises two entities with a real semantic relation and the relation between them, written G = (head, relation, tail), where G denotes the triple, head the head entity, tail the tail entity, and relation the relation between them. Each entity also has attributes and attribute values; an entity's attribute is likewise converted into a tail entity connected to the entity, with a relation established between them;
the third step: combining Word2vec, performing secondary pre-grouping on a corpus, and constructing a knowledge graph by using a k-means clustering algorithm, wherein a triple G has a structure of (head, relation, tail), the relation has various relations along with the difference of the head and the tail, the relation is actually a relation set in the knowledge graph and is used for representing complex relation among various entities, the purpose is to judge whether semantic relation exists between two attributes, namely whether relation exists between two entities, and does not pay attention to the relation, the corpus is subjected to secondary grouping by calculating Word vectors of the vocabulary of the corpus, and the entity relation is extracted by using the k-means clustering algorithm;
Fourth step, construction of a visualization model tree (VT for short): classify the various visualization graphs, summarize the attributes and structural features of each class of graph, and formally express the graph information by creating the visualization model tree VT;
Fifth step, the data visualization optimization matching method based on the network corpus knowledge graph: M-JSON is defined as the prototype structure of the JSON returned by the REST Web service; the Web data prototype structure M-JSON is matched against each StructModel in the visualization model tree VT by data structure, and the returned result is the set of attribute combinations of candidate coordinate axes/legends that satisfy the conditions. On the basis of structure matching, the knowledge graph constructed in the third step is queried to determine whether a matched candidate axis/legend attribute combination has an actual semantic association; the matching is optimized according to the query result and effective dimension combinations are selected, improving the Precision of automatically generated graphs. Furthermore, in the second step, entity extraction is divided into three stages: named entity extraction, attribute entity extraction and noun entity extraction;
2.1, named entity extraction: entity extraction, also known as named entity recognition, automatically identifies named entities from a text data set; named entities generally refer to people, places, organizations and all other entities identified by a name. The process can be completed with a mainstream named entity recognition system, in the following steps: first, perform named entity recognition on the corpus content with the tool; second, label the type attribute of each identified named entity; third, filter the named entities by type attribute, deleting unsuitable ones and keeping the labels of the others; entry names are defined as named entities by default;
2.2, attribute entity extraction: the information box of each entry in the network corpus is taken as the source of attributes. The information box of each entry is intercepted from the corpus, attribute names are extracted according to the box's structure, and each attribute name becomes a tail entity of the named entity corresponding to the entry name; attribute values are not kept. If an entry has no information box, no tail entity is established for its named entity;
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop-word filtering (Stop Word Filtering) and stem extraction (Stemming). Since the named entities were already tagged in the named entity extraction step, the following operations process only the corpus content outside the tagged entities.
Still further, the process of 2.3 is as follows:
2.3.1, word splitting: design splitting rule patterns with regular expressions and split the corpus content into word texts by spaces, symbols and paragraphs;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, the text vocabulary must be part-of-speech tagged. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text-processing technique in corpus linguistics that labels the part of speech of each word according to its meaning and context. Many words carry several parts of speech and several meanings, and the choice of part of speech depends on the context. The corpus that has undergone named-entity tagging is used as the target text for part-of-speech tagging; noun objects are identified from the tagging results, and non-noun objects are removed from the corpus (excluding the non-noun entry names). At this point the corpus retains the named entities, the noun objects and the original punctuation of each entry;
2.3.3, stop-word filtering: stop words (from the English term Stop Word) are words or phrases automatically filtered out when processing natural-language text, in order to save storage space and improve search efficiency in information retrieval. For a given purpose, any kind of word may be chosen as a stop word. Stop words fall into two classes: one is the function words (Function Words) of human language, which are extremely common and occur very frequently but carry no precise meaning of their own; the other is content words (Content Words), a subset of words that have concrete meanings but no specific reference or direction in context. Natural language processing maintains stop-word lists (Stop Word List); using such a list as a reference dictionary, stop words are deleted from the corpus by word comparison, further simplifying the corpus content and ensuring that no stop words remain;
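Steps 2.3.1 and 2.3.3 can be sketched as follows. The tiny stop-word set is a stand-in for a real stop-word list; part-of-speech tagging (2.3.2) would additionally need a trained tagger and is omitted here:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}  # illustrative subset

def split_words(text: str) -> list[str]:
    """2.3.1: split the text into word tokens on spaces and symbols."""
    return re.findall(r"[A-Za-z]+", text)

def filter_stop_words(tokens: list[str]) -> list[str]:
    """2.3.3: delete stop words by comparison against the reference list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = split_words("The graph is a structured semantic knowledge base.")
print(filter_stop_words(tokens))  # → ['graph', 'structured', 'semantic', 'knowledge', 'base']
```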
2.3.4, stem extraction: stem extraction (stemming) is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English. The same English word may have singular/plural inflections, tense inflections, and inflections for different person pronouns. Although these forms differ slightly in shape, they all correspond to the same root and should be treated as the same word when computing relevance, which is why stemming is needed. The Porter Stemming Algorithm is a mainstream stemming algorithm; its core idea is to classify, process and restore words according to the type of their morphological affixes. Apart from some special inflections, most word inflections follow certain rules, which divide into 6 categories.
Still further, in 2.3.4, the stem extraction steps are as follows:
2.3.4.1, according to the category of word inflection, remove the affix and restore the word in each case to obtain the stem information of the noun objects in the corpus, reducing cases where different forms belong to the same word. The 6 categories of word inflection are as follows:
2.3.4.1.1, plurals, and words ending in -ed or -ing;
2.3.4.1.2, words containing a vowel and ending in y;
2.3.4.1.3, double-suffix words;
2.3.4.1.4, words with suffixes such as -ic, -ful, -ness, -ative;
2.3.4.1.5, words of the form <c>vcvc<v> with suffixes such as -ant, -ence (c is a consonant, v is a vowel);
2.3.4.1.6, words of the form <c>vc<v> containing more than 1 vc pair between vowels and consonants, ending in e;
2.3.4.2, creating noun objects reduced to stem as noun entities, and updating the noun objects in the corpus to be expressed in stem form.
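The affix-removal idea of 2.3.4.1 can be sketched with a toy suffix-stripper. This is only a miniature in the spirit of the Porter rules, not the full algorithm (a production system would use a real Porter implementation such as NLTK's PorterStemmer):

```python
def toy_stem(word: str) -> str:
    """Tiny suffix-stripping sketch: strip the first matching suffix if a
    reasonable stem (>= 3 letters) remains. Not the full Porter algorithm."""
    for suffix in ("ational", "izer", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("visualizations", "clusters", "mapped", "mapping"):
    print(w, "->", toy_stem(w))
```

Note how "mapped" and "mapping" reduce to the same stem, which is exactly the property relevance computation needs.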
In the third step, the construction process of the knowledge graph is as follows:
3.1, train word vectors with Word2vec: Word2vec is a word-vector tool that represents words as feature vectors. Words are converted into numerical form with Word2vec and represented as N-dimensional vectors.
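Training Word2vec itself requires a corpus and a library such as gensim; the sketch below only illustrates the end product, i.e. words as numeric vectors whose semantic closeness can be compared by cosine similarity. The 3-dimensional vectors are made-up toy values (real trained vectors have hundreds of dimensions):

```python
import math

# Hypothetical "word vectors"; a real model would be trained, e.g. with gensim.
vectors = {
    "economy":  [0.9, 0.1, 0.0],
    "finance":  [0.8, 0.2, 0.1],
    "football": [0.0, 0.9, 0.4],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# "economy" should sit closer to "finance" than to "football".
print(cosine(vectors["economy"], vectors["finance"]) >
      cosine(vectors["economy"], vectors["football"]))  # → True
```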
3.2, pre-group the corpus twice: because k-means clustering is easily affected by the distribution of the data set, and to ensure that the core concepts, i.e., the main classification objects of the target domain, become cluster centers, k-means cannot be used directly; the corpus must first be pre-grouped twice;
3.3, automatically searching a clustering center for the small corpus set through a k-means clustering algorithm, clustering, and simultaneously constructing a triple, wherein the method comprises the following steps:
3.3.1, determine the size of k according to the size of the corpus set; the larger the set, the larger the value of k.
3.3.2, construct a triple from the entity corresponding to the centroid obtained by k-means clustering and the entity corresponding to the centroid of the upper-layer grouping.
The k-means algorithm in step 3.3.2 is an unsupervised clustering algorithm. In the corpus, each word is represented by the word vector trained with Word2vec. Each corpus set is used as a data set, and clustering is computed with the k-means algorithm. The steps of k-means clustering are as follows:
3.3.2.1, selecting k objects in the data set as initial centers, wherein each object represents a clustering center;
3.3.2.2, dividing the objects in the word vector sample into the class corresponding to the nearest cluster center according to the Euclidean distance between the objects and the cluster centers;
3.3.2.3, updating the clustering center: taking the mean values corresponding to all the objects in each category as the clustering center of the category, and calculating the value of a target function;
3.3.2.4, judging whether the values of the clustering center and the objective function are changed, if not, outputting the result, and if so, returning to 3.3.2.2;
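Steps 3.3.2.1–3.3.2.4 can be sketched in plain Python. The 2-D toy points stand in for the Word2vec word vectors the patent actually clusters:

```python
import math
import random

def kmeans(points, k, iters=100):
    """Minimal k-means: pick k initial centers (3.3.2.1), assign points by
    Euclidean distance (3.3.2.2), recompute centers as means (3.3.2.3),
    stop when the centers no longer change (3.3.2.4)."""
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

The two well-separated blobs end up in one cluster each regardless of the initial center choice.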
3.3.3, taking each new group as a data set, call the k-means algorithm again, repeating steps 3.3.1-3.3.3 until every group contains fewer elements than a certain threshold Z;
3.3.4, constructing a triple of the entity corresponding to the data point in each group and the entity corresponding to the current centroid;
All entities in the corpus establish relations with other entities, and the triples they form combine into the knowledge graph. Because the cluster centers and clusters found by automatic clustering may produce weakly related entity relations, manual proofreading and screening are required after the graph is built to remove low-relevance entity relations and improve the quality of the knowledge graph.
The step 3.2 is as follows:
3.2.1, first grouping of the corpus, comprising the following steps:
3.2.1.1, extract the first-layer sub-classification labels of the previously acquired target-domain label; the target-domain label forms the core entity and generates the first-layer sub-classification label set Tag, which contains n sub-classification labels. Each label has a corresponding entity and word vector, and these entities connect with the core entity to form n triples;
3.2.1.2, taking the first-layer sub-classification label object as a centroid, calculating the Euclidean distance from each data point to each centroid in the corpus data set, then distributing the data points to the class corresponding to each centroid according to the principle of closeness, obtaining n clusters which take the first-layer sub-classification label as the centroid, namely n grouped data sets, and simultaneously dividing the corpus into n corpus sets;
The Euclidean Distance in step 3.2.1.2 is the key basis for deciding the class of a data point. Suppose there are given samples X_i = (x_i1, x_i2, ..., x_in) and X_j = (x_j1, x_j2, ..., x_jn), where i, j = 1, 2, ..., m, m denotes the number of samples, and n denotes the number of features. The Euclidean distance is calculated as:

d(X_i, X_j) = sqrt( Σ_{k=1}^{n} (x_ik − x_jk)² )
3.2.2, combining the TF-IDF algorithm, the corpus is grouped a second time, comprising the following steps:
3.2.2.1, find the keywords in each corpus set by calculating TF-IDF.
The TF-IDF algorithm in step 3.2.2 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (Term Frequency) refers to the frequency with which a given word occurs in a given document, calculated as:

TF_{x,y} = n_{x,y} / Σ_k n_{k,y}

where n_{x,y} is the number of times term x appears in document y, and Σ_k n_{k,y} is the total number of words in document y. The inverse document frequency IDF (Inverse Document Frequency) evaluates the amount of information provided by a word or term, i.e., whether the term is common across the whole document collection, calculated as:

IDF_x = log( N / N_x )

where N is the total number of documents and N_x is the number of documents containing term x; each entry in the text is treated as a document. Finally, the values of TF and IDF are combined to obtain the TF-IDF formula:

TF-IDF_{x,y} = TF_{x,y} × IDF_x
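The three formulas above can be checked with a few lines of Python over toy documents (log base and smoothing conventions vary between implementations; the plain definitions from the text are used here):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of term in doc over total words in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of total docs over docs containing term."""
    n_with_term = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy "entries", each treated as one document of tokenized words.
docs = [["web", "data", "graph"], ["graph", "entity"], ["web", "service", "rest"]]
print(round(tf_idf("entity", docs[1], docs), 3))  # → 0.549
```

"entity" scores high in its document because it appears in only one of the three documents, whereas "graph", appearing in two, is down-weighted by its lower IDF.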
3.2.2.2, manually screen the keywords of each corpus set, removing keywords with low correlation to the core entity of the current corpus set and keeping the most relevant ones; the number of keywords kept depends on the overall quality of all extracted keywords;
3.2.2.3, construct triples between the entities corresponding to the screened keywords of each corpus set and the core entity of the current corpus set. Within each corpus set, take the keywords as centroids, compute the Euclidean distance from each data point in the set to each centroid, and classify the data points. At this point the original corpus is divided into a number of small corpus sets.
The fourth step comprises the following steps:
4.1, defining VT comprises two parts of basic attribute (basic) and visual structure (DVSCHEMA), formalized definition is as (1), wherein the basic attribute stores general information of graphic titles, subtitles and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, the basic visualization graph can be generalized into four basic categories according to the data type, the graph data structure and the graph dimension required by the graph: general graph (General), topological graph (Topology), Map (Map), Text graph (Text), formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 each include two attributes: graph type (VType) and graph structure (StructModel). VType stores the graph types belonging to the category, and StructModel stores the basic visual structure of the graphs in the category; the formal definition is as (4):
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>。
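Definitions (1)-(4) can be mirrored as a nested dictionary. The concrete field values below are illustrative examples, not part of the patent's definitions:

```python
# Illustrative instantiation of the visualization model tree VT, definitions (1)-(4).
VT = {
    "BASICATTRIBUTE": {                       # definition (2)
        "title": "GDP by region",             # made-up example values
        "subtitle": "2019",
        "attributes": {"position": "center", "colors": ["#c23531"], "fontSize": 12},
    },
    "DVSCHEMA": {                             # definitions (3) and (4)
        "General":  {"VType": ["BarChart", "LineChart", "PieChart",
                               "RadarChart", "ScatterChart"],
                     "StructModel": {"Root": ["K_V", "LegendNode"]}},
        "Topology": {"VType": ["NetworkChart", "TreeMap", "TreeMapChart"],
                     "StructModel": {}},
        "Map":      {"VType": ["AreaMapChart", "CountryMapChart", "WorldMapChart"],
                     "StructModel": {}},
        "Text":     {"VType": ["WordCloudChart"],
                     "StructModel": {"Root": ["K_V"]}},
    },
}

assert set(VT) == {"BASICATTRIBUTE", "DVSCHEMA"}                       # (1)
assert set(VT["DVSCHEMA"]) == {"General", "Topology", "Map", "Text"}   # (3)
```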
In 4.4, the graph types under the VType attribute of the four basic categories are as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, Topology comprises the network chart (NetworkChart), tree map (TreeMap) and area tree map (TreeMapChart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes the word cloud (WordCloudChart);
4.5, in the step 4.4, the four basic categories have respective Mapping relations (Mapping), and describe data structures, data dimensions, graph structure relations and data Mapping position information of various graphs; according to Mapping information and by combining the data structure of the graph, the basic visual structure structModel of various graphs can be abstracted.
In the 4.5, Mapping relation Mapping and basic visualization structure StructModel of each type of graph are defined as follows:
4.5.1, graphs of the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a tuple (XAxis, YAxis) or a triple (XAxis, YAxis, ZAxis). The Mapping structure of such graphs is as (5), where LegendName denotes the legend name and stores each grouping's information as an ARRAY type. From the Mapping structure, the basic StructModel can be abstracted as (6): the child node of StructModel is a temporary root node Root, and Root contains two child nodes: a key-value pair K_V and a legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphs of the Topology type are commonly used to represent topological relation data. Tree maps and area tree maps can represent attribute structures with nested key-value pairs { key: value, children: { key: value } }, whose Mapping structure is as (7); a network graph can be represented by a node set (Nodes) and an edge set (Links), whose Mapping structure is as (8), where source denotes the start node of an edge and target denotes the node the edge points to. From the Mapping structures, the basic StructModel can be abstracted as (9). StructModel has two substructures, whose temporary root nodes are Root1 and Root2 respectively. Root1 contains two child nodes: a key-value pair K_V and a child node children, whose substructure is a key-value pair K_V; Root2 contains two child nodes: the node set Nodes and the edge set Links, where the children of Nodes are key and value (value may be null) and the children of Links are the start point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphs of the Map type are usually used to represent map information, represented by a key-value-pair array [ { PlaceName: value } ] or a triple array [ { lng, lat, value } ]. The Mapping structure of such graphs is as (10), where PlaceName denotes the place name, lng denotes the longitude, and lat denotes the latitude. From the Mapping structure, the basic StructModel can be abstracted as (11): StructModel has two substructures, whose temporary root nodes are Root1 and Root2 respectively; Root1 contains a child key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat, and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs in the Text type commonly use a binary group (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphs is as (12), where Keyword is a word extracted from the text and frequency represents the number of occurrences of the word in the text. According to the Mapping structure, the basic StructModel can be abstracted as (13): the child node of the StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>.
The fifth step comprises the following steps:
5.1, the Web data prototype structure M-JSON is matched with the StructModel of the visual model tree VT according to the data structure, and m attribute combination results of candidate coordinate axes/legends meeting the conditions are matched in the M-JSON; each combination result is represented as a binary group consisting of a key-value pair L and an attribute name A, where L and A correspond to LegendNode and K_V in step 4.5.1, respectively;
5.2, matching and optimizing m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, wherein the process is as follows:
5.2.1, each matching result in step 5.1 is represented in the form of a binary group P = (L::name, A::name), and each matching result Pi = (Li::name, Ai::name) is converted to the triple form Gi = (Li::name, R, Ai::name) and put into the set S = {G1, G2, …, Gm};
5.2.2, for each triple Gi in the set S in turn, the mapping F: (Li::name → head, R → relation, Ai::name → tail) is applied to map it to a triple (head, relation, tail), and whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph is checked; result denotes the matching result, True or False, represented as 1 and 0 respectively. First the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, and then the relation between the head entity node and the tail entity node is matched; if and only if the head entity head, the tail entity tail and the relation are all successfully matched, result = 1;
5.2.3, after the query of the objects in the set S is completed, the set Q = {(Gi, resulti)} is returned; Q is used for judging whether semantic association exists in the current qualified binary groups, and serves as the judgment of the matching result of the attribute combination of a candidate coordinate axis/legend, so the matching is judged successful only when the structures match and resulti = 1. This improves the accuracy of data attribute matching and reduces the generation rate of graphs without practical significance.
The invention has the following beneficial effects: when graphs are generated from Web data in a visual manner, the method constructs a network corpus knowledge graph from network corpus data, analyzes, generalizes and models common visual graphs, optimizes the matching process between the Web data prototype structure and the common visual graph models, reduces the generation of redundant graphs, and improves the generation rate of effective graphs. Meanwhile, the manual work of graph screening in the automatic data visualization process is reduced, simplifying the Web data visualization process.
Drawings
FIG. 1 shows a knowledge graph construction flow chart based on the k-means algorithm.
Fig. 2 shows a block diagram of a visual model tree VT.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and 2, a method for optimizing Web data based on a knowledge graph and oriented to visualization requirements includes the following steps:
The first step, the construction of a target domain corpus: the corpus content of a network corpus (such as Wikipedia) is taken as the basis for constructing the knowledge graph. To improve the text quality and comprehensiveness of the domain corpus content, the entry information of the network corpus is used as the content of the original corpus, and the original network corpus content used for constructing the knowledge graph is screened. Analysis of the entry web pages shows that the corpus content contains HTML (hypertext markup language) tags, entry editing information, web page link information and other redundant information irrelevant to the entry, so the target entries are filtered and cleaned, and the title and the effective text content are extracted. The filtering content comprises: HTML tag/text style symbol filtering (e.g., deleting HTML tags such as <h1>text</h1>, <p>text</p> and <div>text</div> while retaining the text; deleting style symbols such as span{font-color:#efefef;}), entry editing information filtering (e.g., deleting [edit] tags), picture information filtering (e.g., deleting <img src="…"/> picture tags), link information filtering (e.g., deleting <a href="…" title="…">text</a> hyperlink tags while retaining the text information), page title/attribute name filtering (e.g., deleting proprietary titles and attribute names such as Background reading and External links), and numerical value filtering (e.g., deleting numerical values such as 20, 30, and the like) on the web page content of the target entry;
For example, the network corpus Wikipedia is used: the web page content of the Wikipedia category Athletic sports is obtained through a crawler, and after filtering and screening, the entry corpus content of Athletic sports and its subcategories is obtained;
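As an illustrative sketch only (the tag patterns and the sample text below are hypothetical, and a production implementation would use a real HTML parser rather than regular expressions), the cleaning step described above can be approximated as follows:

```python
import re

def clean_entry(html: str) -> str:
    """Strip tags and redundant information from entry HTML, keeping plain text."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S)     # style/script blocks
    text = re.sub(r"<img[^>]*/?>", " ", text)                                  # picture information
    text = re.sub(r"\[edit\]", " ", text)                                      # entry editing information
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text, flags=re.S)               # keep link text only
    text = re.sub(r"<[^>]+>", " ", text)                                       # remaining HTML tags
    text = re.sub(r"\b\d+(,\d+)*\b", " ", text)                                # numerical values
    return re.sub(r"\s+", " ", text).strip()

sample = '<h1>NBA</h1><p>The league has <a href="/t" title="t">30 teams</a>[edit]</p>'
print(clean_entry(sample))  # → NBA The league has teams
```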
Step two, corpus-oriented entity extraction: the knowledge graph is a data information network with a graph structure formed by entities and relations. Its basic structure is represented by "entity-relation-entity" triples; a triple comprises two entities with a real semantic relation and the relation between them, represented in the form G = (head, relation, tail), where G represents the triple, head represents the head entity, tail represents the tail entity, and relation represents the relation between the head entity and the tail entity. Each entity also has attributes and attribute values; the attributes of an entity are likewise converted into tail entities connected with the entity, and a relation is established between the head entity and the tail entity. Entity extraction is divided into three stages: named entity extraction, attribute entity extraction and noun entity extraction;
2.1, entity extraction: entity extraction, also known as named entity recognition (NER), is the automatic recognition of named entities from a text dataset; a named entity generally refers to the names of people, places, institutions, and any other entity identified by a name. This process can be completed by mainstream named entity recognition systems; for example, Stanford NER can mark entities in a text by class and recognize seven types of attributes: Time, Location, Person, Date, Organization, Money and Percent. Using it as a tool, named entity recognition is performed on the corpus content, and each recognized named entity is labeled with its type attribute. The main process is as follows: first, named entity recognition is performed on the corpus content by the tool; second, the recognized named entities are labeled with their type attributes; third, the named entities are filtered according to the type attributes, unsuitable named entities are deleted, the labels of the other named entities are kept, and the default name of an entry is defined as a named entity.
2.2, extracting attribute entities: the information box of an entry in the network corpus is taken as the source of attributes, and attributes are extracted from it. The information box of each entry in the corpus is intercepted, the attribute names are extracted according to the structure of the information box and taken as tail entities of the named entity corresponding to the entry name, and the attribute values are not retained; if an entry has no information box, no tail entity is created for the named entity corresponding to that entry. Taking the information box (Info Box) of the entry "National Basketball Association (NBA)" in Wikipedia as an example, the information box takes the form of a table, where row 1 column 1 contains "Sport", row 1 column 2 contains "Basketball", row 2 column 1 contains "Founded", row 2 column 2 contains "June 6, 1946; 73 years ago", and so on; triples are constructed by extracting the contents of the first column, namely "Sport" and "Founded", together with the entry "National Basketball Association (NBA)";
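The infobox-to-triple construction can be sketched as follows; the relation name `hasAttribute` and the helper function are illustrative assumptions, with the table rows taken from the NBA example above:

```python
def infobox_to_triples(entry, infobox_rows):
    """Take each first-column attribute name as a tail entity of the entry's
    named entity; the attribute values (second column) are not retained."""
    return [(entry, "hasAttribute", row[0]) for row in infobox_rows]

rows = [("Sport", "Basketball"), ("Founded", "June 6, 1946; 73 years ago")]
triples = infobox_to_triples("National Basketball Association (NBA)", rows)
print(triples)
```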
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop word filtering (Stop Word Filtering) and stemming (Stemming); since the named entities are already tagged in the named entity extraction step, the following operations only process the corpus content outside the tagged entities;
2.3.1, word splitting: a splitting rule pattern is designed using regular expressions, and the corpus content is split into words according to spaces, symbols and paragraphs to obtain word texts;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, part-of-speech tagging is first performed on the text vocabulary. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique that tags the part of speech of each word in the corpus according to its meaning and context. Many words have several possible parts of speech and several meanings, and the choice of part of speech depends on the context meaning. The corpus with named entity labels is used as the tagging object text for part-of-speech tagging, noun objects are found according to the tagging result, and non-noun objects, excluding entry names, are eliminated from the corpus. At this point the corpus retains the named entities, the noun objects and the original punctuation in each entry, and all contents still keep the original text order;
2.3.3, stop word filtering: the term comes from Stop Word, referring to words that are automatically filtered out when processing natural language text, in order to save storage space and improve search efficiency in information retrieval. For a given purpose, any type of word may be selected as a stop word; in the present invention stop words mainly include two types: one type is the function words (Function Words) contained in human languages, commonly articles, conjunctions, adverbs, prepositions, etc., which are used frequently but have no exact practical meaning, such as a, an, the, which; the other type is content words (Content Words), referring to words that have actual concrete meaning but no explicit reference, such as want, welcome, enough, consider, indeed. In natural language processing there exist stop word lists (Stop Word List); using a stop word list as a reference dictionary, stop words are deleted from the corpus through word comparison, further simplifying the corpus content and ensuring that no stop words remain in the corpus;
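A minimal sketch of the stop word filtering step, assuming a small hand-made stop word list (a real system would use a full published list):

```python
STOP_WORDS = {"a", "an", "the", "which", "want", "welcome", "enough", "consider", "indeed"}

def remove_stop_words(tokens):
    """Delete stop words from a token list by comparison against a stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "league", "is", "indeed", "a", "business"]))
# → ['league', 'is', 'business']
```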
2.3.4, stemming: stemming is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English. The same English word has singular and plural inflections (such as apple and apples), tense inflections such as -ing and -ed (such as doing and did), inflections for different person subjects (such as like and likes), and so on; these words differ slightly in form but all correspond to the same root, and should be treated as the same word when calculating correlation, so stemming is needed. The Porter Stemming Algorithm is a mainstream stemming algorithm; its core idea is to classify and restore words according to the type of morphological affix. Apart from some special inflections, most word inflections follow certain rules, and according to these rules the inflections are divided into 6 categories. The stemming steps are as follows:
2.3.4.1, according to the word inflection category, affix removal and word restoration are performed for each case to obtain the stem information of the noun objects in the corpus and reduce different forms of the same word. The 6 inflection categories are as follows:
2.3.4.1.1, plurals and words ending in -ed or -ing;
2.3.4.1.2, words that contain a vowel and end in y;
2.3.4.1.3, words with double suffixes;
2.3.4.1.4, words with suffixes such as -ic, -ful, -ness and -ative;
2.3.4.1.5, words of the form <c>vcvc<v> with suffixes such as -ant and -ence (c is a consonant, v is a vowel);
2.3.4.1.6, words of the form <c>vc<v> with more than one vc pair between the vowels and consonants, and words ending in e;
2.3.4.2, noun objects reduced to stems are created as noun entities, and the noun objects in the corpus are updated to be expressed in stem form.
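The affix-removal idea can be sketched with a toy suffix stripper; this is far simpler than the actual Porter algorithm (which applies measure conditions and several rule passes), so the rules below are illustrative only:

```python
def toy_stem(word: str) -> str:
    """Toy suffix stripping in the spirit of Porter stemming: try a few
    ordered (suffix, replacement) rules and apply the first that fits."""
    w = word.lower()
    for suffix, repl in (("ies", "y"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")):
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:  # keep a minimal stem
            return w[: len(w) - len(suffix)] + repl
    return w

print([toy_stem(w) for w in ["apples", "doing", "likes", "studies"]])
# → ['apple', 'do', 'like', 'study']
```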
The third step: combining Word2vec, the corpus is pre-grouped twice, and the knowledge graph is constructed using the k-means clustering algorithm. In the triple structure G = (head, relation, tail), relation varies with head and tail; in the knowledge graph, relation is actually a set of relations used to represent the complex relations among entities. The aim of the method is to judge whether semantic association exists between two attributes, i.e., whether a relation exists between two entities, without attending to the specific kind of relation. The word vectors of the words in the corpus are computed, the corpus is grouped twice in advance, and the entity relations are extracted using the k-means clustering algorithm. The construction process of the knowledge graph is as follows:
3.1, training word vectors using Word2vec: Word2vec is a word vector tool that represents words as feature vectors. Word2vec converts words into numerical form, representing each word with an N-dimensional vector. For example, word vector calculation is performed on the content of the Athletic sports corpus using Word2vec, with the dimension of the word vectors set to 300. The higher the dimension of the word vector, the richer the feature expression of the word, but the time cost of training and the computation overhead when the model is called also increase.
3.2, pre-grouping the corpus twice: because k-means clustering is easily influenced by the distribution of the data set, in order to ensure that the core concepts, i.e., the main classification objects of the target field, become cluster centers, k-means clustering cannot be used directly; the corpus is therefore pre-grouped twice, as follows:
3.2.1, first grouping of the corpus, comprising the following steps:
3.2.1.1, the first-layer sub-classification labels of the previously acquired target field label are extracted, the target field label forming the core entity, to generate a first-layer sub-classification label set Tag containing n sub-classification labels; each label has a corresponding entity and word vector, and these entities are connected with the core entity to form n triples.
3.2.1.2, taking the first-layer sub-classification label objects as centroids, the Euclidean distance from each data point in the corpus data set to each centroid is calculated, and each data point is then assigned to the class of its nearest centroid. This yields n clusters with the first-layer sub-classification labels as centroids, i.e., n grouped data sets, and the corpus is likewise divided into n corpus sets.
The Euclidean distance (Euclidean Distance) in step 3.2.1.2 is an important criterion for determining the class of a data point. Given samples xi = (xi1, xi2, …, xin) and xj = (xj1, xj2, …, xjn), where i, j = 1, 2, …, m, m is the number of samples and n is the number of features, the Euclidean distance is calculated as:
d(xi, xj) = √( Σk=1..n (xik − xjk)² )
For example, pre-classification is first performed on the entity data set of the constructed Athletic sports corpus: the first-layer sub-classification labels of the previously crawled Athletic sports Wikipedia corpus label are extracted to form the label set Tag = { "Association football", "Basketball", "Baseball", "Badminton", "Beach", … }, which contains 55 sub-classification labels. Each label has a corresponding entity and a Word2vec-trained word vector, and these entities are connected with the core entity "Athletic sports" to form 55 triples. Then, taking the label objects as centroids, the Euclidean distance from each data point in the data set to each centroid is calculated, and each data point is assigned to the class of its nearest centroid. At this point, 55 clusters with the event types as centroids are obtained, i.e., 55 grouped data sets, and the corpus is likewise divided into 55 corpus sets.
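The centroid-assignment rule of step 3.2.1.2 can be sketched as follows, with toy two-dimensional vectors standing in for the 300-dimensional Word2vec vectors (label names are illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def assign_to_centroid(point, centroids):
    """Return the label of the nearest centroid, i.e. the pre-grouping rule."""
    return min(centroids, key=lambda label: euclidean(point, centroids[label]))

centroids = {"Baseball": [1.0, 0.0], "Badminton": [0.0, 1.0]}  # toy 2-d "word vectors"
print(assign_to_centroid([0.9, 0.2], centroids))  # → Baseball
```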
3.2.2, second grouping of the corpus in combination with the TF-IDF algorithm, comprising the following steps:
3.2.2.1, the keywords in each corpus set are found by calculating TF-IDF.
The TF-IDF algorithm in step 3.2.2 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (Term Frequency) refers to the frequency with which a given term appears in a given document, and is calculated as:
TFx,y = nx,y / Σk nk,y
where nx,y is the number of times the term x appears in the document y, and Σk nk,y is the total number of words in document y. The inverse document frequency IDF (Inverse Document Frequency) is used to evaluate the amount of information provided by a word or term, i.e., whether the term is common across all the documents, and is calculated as:
IDFx = log( N / Nx )
where N is the total number of documents and Nx is the number of documents containing the term x; each entry text is treated as one document. Finally, the TF and IDF values are combined to give the TF-IDF:
TF-IDFx,y = TFx,y × IDFx
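The three formulas above can be sketched directly (each document is a token list; note that this sketch assumes every queried term occurs in at least one document):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of term in doc over the total word count of doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of total documents over documents containing term."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["football", "match", "team", "football"], ["list", "team"], ["list", "body"]]
# "football" is concentrated in one document, "team" is spread over two,
# so "football" scores higher as a keyword of docs[0].
print(tf_idf("football", docs[0], docs) > tf_idf("team", docs[0], docs))  # → True
```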
3.2.2.2, the keywords of each corpus set are screened manually: keywords with low correlation to the core entity of the current corpus set are removed and the part of the keywords with the highest correlation is retained; the number of keywords retained depends on the overall quality of all the extracted keywords.
3.2.2.3, triples are constructed from the entities corresponding to the screened keywords extracted from each corpus set and the core entity of the current corpus set. Within each corpus set, the keywords are taken as centroids, the Euclidean distances from the data points in the set to each centroid are calculated again, and the data points are classified. The original corpus has now been divided into a number of corpus sets.
For example, keywords in each Athletic sports corpus set are found through TF-IDF calculation; the corpus set corresponding to Association football contains, for instance, "football", "nation", "match", "team" and "competition", but also some vocabulary, such as "list", "find" and "body", that appears frequently yet has no strong correlation. Therefore, the keywords of each corpus set are screened with manual intervention: keywords with low correlation to the core entity of the current corpus set are removed, and the part of the keywords with the highest correlation is retained. Triples are constructed from the entities corresponding to the screened keywords extracted from each small corpus set and the core entity of the current corpus set. Then, taking the keywords as centroids within each corpus set, the Euclidean distances from the data points in the set to each centroid are calculated and the data points are classified, dividing the corpus into a number of small corpus sets.
3.3, cluster centers are automatically sought for the small corpus sets through the k-means clustering algorithm, clustering is performed, and triples are constructed at the same time, comprising the following steps:
3.3.1, the size of k is determined according to the size of the corpus set: the larger the set, the larger the k value.
3.3.2, triples are constructed from the entities corresponding to the centroids obtained by k-means clustering and the entity corresponding to the centroid of the upper-layer grouping.
The k-means algorithm in step 3.3.2 is an unsupervised clustering algorithm; in the corpus, each word is represented by its Word2vec-trained word vector. Each corpus set is taken as a data set, and clustering calculation is performed using the k-means clustering algorithm. The steps of k-means clustering are as follows:
3.3.2.1, selecting k objects in the data set as initial centers, wherein each object represents a clustering center;
3.3.2.2, dividing the objects in the word vector sample into the class corresponding to the nearest cluster center according to the Euclidean distance between the objects and the cluster centers;
3.3.2.3, updating the cluster centers: the mean of all the objects in each class is taken as the cluster center of that class, and the value of the objective function is calculated;
3.3.2.4, judging whether the cluster centers and the value of the objective function have changed: if not, the result is output; if so, return to 3.3.2.2.
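The k-means steps 3.3.2.1-3.3.2.4 can be sketched in plain Python (toy two-dimensional points stand in for word vectors; the convergence test compares successive centers rather than an explicit objective value):

```python
import random

def k_means(points, k, iters=100):
    """Plain k-means: assign each point to its nearest center, then recompute means."""
    centers = random.sample(points, k)  # 3.3.2.1: k initial centers from the data set
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                # 3.3.2.2: assign by squared Euclidean distance
            idx = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        new_centers = [                 # 3.3.2.3: mean of each class becomes its center
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:      # 3.3.2.4: centers unchanged, converged
            break
        centers = new_centers
    return centers, clusters

random.seed(0)
pts = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
centers, clusters = k_means(pts, 2)
print(sorted(len(c) for c in clusters))  # → [2, 2]
```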
3.3.3, the new groups are taken as data sets, the k-means algorithm is called again, and steps 3.3.1-3.3.3 are repeated until each group contains fewer elements than a certain threshold Z.
3.3.4, triples are constructed from the entities corresponding to the data points in each group and the entity corresponding to the current centroid.
All the entities in the corpus have now established relationships with other entities, and the triples formed by the entities combine with one another to form the knowledge graph. Because the cluster centers and clusters found by automatic clustering may produce entity relations with weak correlation, manual proofreading and screening are needed after the construction of the knowledge graph is completed to remove entity associations with low correlation and thereby improve the quality of the knowledge graph.
For example, the original Athletic sports corpus is divided into a number of corpus sets, and then cluster centers are automatically sought through the k-means clustering algorithm, clustering is performed, and triples are constructed. The size of k is determined by the size of the corpus set, larger sets having larger k values. The entities corresponding to the calculated centroids and the entity corresponding to the centroid of the upper-layer grouping form triples. The new groups are then used as data sets, the k-means algorithm is called again, and the above operation is repeated until each group contains fewer than 10 elements (the threshold Z here being 10). Finally, triples are constructed from the entities corresponding to the data points in each group and the entity corresponding to the current centroid. At this point, all entities in the Athletic sports corpus have established relationships with other entities, and the triples correspondingly formed by the entities combine with one another to form the knowledge graph. However, the centroids and clusters found by automatic clustering may sometimes produce entity associations with weak correlation, so manual proofreading and screening are finally required to remove entity associations with extremely low correlation.
Fourthly, referring to fig. 2, construction of the visual model tree (VT): various visual graphs are classified, their attributes and structural features are summarized and generalized, and the information of the various graphs is formally expressed by creating a visual model tree (VT), comprising the following steps:
4.1, VT is defined to comprise two parts, the basic attributes (BASICATTRIBUTE) and the visual structure (DVSCHEMA), with the formal definition as (1), where the basic attributes store general information such as the graph title, subtitle and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, the basic visualization graph can be generalized into four basic categories according to the data type, the graph data structure and the graph dimension required by the graph: general graph (General), topological graph (Topology), Map (Map), Text graph (Text), formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, each of the four basic categories in step 4.3 contains two attributes: the graph type (VType) and the graph structure (StructModel). VType stores the graph types belonging to the category, and StructModel stores the basic visual structures of the graphs in the category; the formal definition is as (4), which indicates that each of the four categories contains these two attributes;
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>
In step 4.4, the graphs under the VType attribute of the four basic categories are as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, the Topology comprises a network map (NetworkChart), a tree map (TreeMap) and an area tree map (treemapcart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes the word cloud (WordCloudChart);
4.5, the four basic categories in step 4.4 have respective mapping relations (Mapping) describing the data structure, data dimension, graph structure relations and data mapping position information of the various graphs; according to the Mapping information, combined with the data structure of the graph, the basic visual structure StructModel of each kind of graph can be abstracted.
In step 4.5, the Mapping relation and the basic visual structure StructModel of each kind of graph are defined as follows:
4.5.1, graphs in the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a binary group (XAxis, YAxis) or a triple (XAxis, YAxis, ZAxis). The Mapping structure of such graphs is as (5), where LegendName represents the legend name, storing the information of each group in ARRAY type. According to the Mapping structure, the basic StructModel can be abstracted as (6): the child node of the StructModel is a temporary root node Root, and Root contains two child nodes: a key-value pair K_V and a legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphs in the Topology type are commonly used to represent topological relation data. Tree graphs and area tree graphs can represent tree structures with nested key-value pairs { key: value, children: { key: value } }, whose Mapping structure is as (7); the network graph can be represented by a node set (Nodes) and an edge set (Links), whose Mapping structure is as (8), where source represents the starting node of an edge link and target represents the node the edge points to. According to the Mapping structures, the basic StructModel can be abstracted as (9): the StructModel has two substructures, with Root1 and Root2 as the temporary root nodes of the two substructures respectively. Root1 contains two child nodes: a key-value pair K_V and a children node whose substructure is a key-value pair K_V; Root2 contains two child nodes: the node set Nodes and the edge set Links, where the child nodes of Nodes are key and value (value may be null), and the child nodes of Links are the starting point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphs in the Map type are usually used to represent map information, represented by a key-value pair array [ { PlaceName: value } ] or a triple array [ { lng, lat, value } ]. The Mapping structure of such graphs is as (10), where PlaceName represents a place name, lng represents longitude, and lat represents latitude. According to the Mapping structure, the basic StructModel can be abstracted as (11): the StructModel has two substructures, with Root1 and Root2 as the temporary root nodes of the two substructures respectively; Root1 contains a child-node key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat, and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs in the Text type commonly use a binary group (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphs is as (12), where Keyword is a word extracted from the text and frequency represents the number of occurrences of the word in the text. According to the Mapping structure, the basic StructModel can be abstracted as (13): the child node of the StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>
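How a StructModel might drive structure matching can be sketched as follows; the matching rule (string-valued keys as candidate LegendNode, numeric keys as candidate K_V) is a simplifying assumption for the General category of definition (6), not the patent's full algorithm:

```python
def matches_general(record: dict) -> list:
    """Return candidate (legend, axis) attribute pairs from a flat JSON record:
    string-valued keys can serve as LegendNode, numeric keys as K_V."""
    legends = [k for k, v in record.items() if isinstance(v, str)]
    axes = [k for k, v in record.items() if isinstance(v, (int, float))]
    return [(l, a) for l in legends for a in axes]

sample = {"country": "China", "year": 2019, "gdp": 14.28}
print(matches_general(sample))  # → [('country', 'year'), ('country', 'gdp')]
```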
fifthly, the data visualization optimization matching method based on the network corpus knowledge graph comprises the following steps: defining M-JSON as a prototype structure of JSON returned by RESTWeb service; matching the Web data prototype structure M-JSON with each structModel in the visual model tree VT according to the data structure, wherein the returned result is a set formed by attribute combinations of candidate coordinate axes/legends which meet the conditions; on the basis of structure matching, by using the knowledge graph constructed in the third step, inquiring whether the attribute combination of the matched candidate coordinate axis/legend has actual semantic correlation, optimizing and matching according to the inquiry result, and selecting an effective dimension combination to improve the accuracy (Precision) of automatically generated graphs, wherein the steps are as follows:
5.1, matching the Web data prototype structure M-JSON with the StructModel of the visualization model tree VT according to the data structure, and matching in the M-JSON m attribute-combination results of candidate coordinate axes/legends that meet the conditions, where each combination result is represented as a doublet consisting of a key-value pair L and an attribute name A, and L and A correspond to LegendNode and K_V in step 4.5.1 respectively.
5.2, matching and optimizing m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, wherein the process is as follows:
5.2.1, each matching result in step 5.1 is represented in the form of a doublet P = (L::name, A::name). Each matching result P_i = (L_i::name, A_i::name) is converted into the triple form G_i = (L_i::name, R, A_i::name) and added to the set S = {G_1, G_2, ..., G_m}.
5.2.2, for each G_i in the set S in turn, map G_i onto the triple structure of the knowledge graph via F(L_i::name → head, R → relation, A_i::name → tail) to obtain a triple (head, relation, tail). Then match whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph; result denotes the matching result, True or False, represented as 1 and 0 respectively. First, the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, and then the relation between the head entity node and the tail entity node is matched. If and only if the head entity head, the tail entity tail, and the relation are all successfully matched, result = 1; otherwise, result = 0.
5.2.3, after the query of the objects in the set S is completed, a set Q = {(G_i, result_i)} is returned. Q is used to judge whether the currently qualified doublets have semantic association, serving as the judgment of the attribute-combination matching result of the candidate coordinate axes/legends. Thus, a match is judged successful only when the structures match and result_i = 1. This improves the accuracy of data-attribute matching and reduces the generation rate of graphs without practical significance.
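The optimization in steps 5.2.1–5.2.3 can be sketched as follows. This is a hypothetical illustration, not the patent's actual implementation: the knowledge graph is modelled as a plain set of (head, relation, tail) triples, and all entity names are invented.

```python
# Toy knowledge graph: per the third step, only the existence of SOME
# relation between two entities matters, not which relation it is.
kg = {
    ("country", "related", "population"),
    ("year", "related", "sales"),
}

def kg_has_relation(kg, head, tail):
    """True if any triple links head to tail (relation itself is ignored)."""
    return any(h == head and t == tail for h, _, t in kg)

def optimize_matches(pairs, kg):
    """pairs: list of (L::name, A::name) doublets from structure matching (5.1).
    Returns the set Q of ((head, R, tail), result) pairs per 5.2.1-5.2.3."""
    q = []
    for l_name, a_name in pairs:                 # 5.2.1: P_i -> G_i
        g = (l_name, "R", a_name)
        result = 1 if kg_has_relation(kg, l_name, a_name) else 0   # 5.2.2
        q.append((g, result))
    return q

matches = [("country", "population"), ("country", "color")]
Q = optimize_matches(matches, kg)
accepted = [g for g, r in Q if r == 1]           # 5.2.3: keep only result == 1
```

Only structure matches that also pass the knowledge-graph lookup survive, which is the semantic filtering the section describes.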

Claims (10)

1. A knowledge graph-based Web data optimization method for visualization demand is characterized by comprising the following steps:
the first step, the construction of a target domain corpus: the method comprises the following steps of taking network corpus content as a basis for constructing a knowledge graph, using network corpus entry information as original corpus content, screening the original network corpus content for constructing the knowledge graph, comparing and analyzing webpage content of the network entries, wherein the original corpus content comprises HTML labels besides headers and text information, editing information of the entries, webpage link information and other redundant information irrelevant to the entries, filtering and cleaning the content of the network entries, extracting the headers and effective text content, and filtering the content, wherein the filtering comprises the following steps: performing HTML tag/text style symbol filtering, entry template symbol and non-English character filtering, entry editing information filtering, picture information filtering, link information filtering, page proprietary title attribute name filtering and numerical value filtering on webpage contents of entries;
step two, corpus-oriented entity extraction: the knowledge graph is a data information network with a graph structure formed by entities and relations, the basic structure of the knowledge graph is represented by an entity-relation-entity triple, the triple comprises two entities with real semantic relations and the relation between the two entities, and is represented by a (head, relation, tail) form, wherein G represents the triple, head represents a head entity, tail represents a tail entity, and relation represents the relation between the head entity and the tail entity; each entity also comprises attributes and attribute values, the attributes of the entities are also converted into tail entities connected with the entities, a relationship is established between the attributes and the tail entities, and the entity extraction is divided into three stages of named entity extraction, attribute entity extraction and noun entity extraction;
the third step: combining Word2vec, performing secondary pre-grouping on a corpus, and constructing a knowledge graph by using a k-means clustering algorithm, wherein a triple G has a structure of (head, relation, tail), the relation has various relations along with the difference of the head and the tail, the relation is actually a relation set in the knowledge graph and is used for representing complex relation among various entities, the purpose is to judge whether semantic relation exists between two attributes, namely whether relation exists between two entities, and does not pay attention to the relation, the corpus is subjected to secondary grouping by calculating Word vectors of the vocabulary of the corpus, and the entity relation is extracted by using the k-means clustering algorithm;
fourthly, constructing a visual model tree VT: classifying various visual graphs, summarizing and summarizing the attributes and structural features of various graphs, and formally expressing various graph information by creating a visual model tree VT;
fifthly, the data visualization optimization matching method based on the network corpus knowledge graph comprises the following steps: defining M-JSON as a prototype structure of JSON returned by REST Web service; matching the Web data prototype structure M-JSON with each structModel in the visual model tree VT according to the data structure, wherein the returned result is a set formed by attribute combinations of candidate coordinate axes/legends which meet the conditions; on the basis of structure matching, the knowledge graph constructed in the third step is utilized to inquire whether the attribute combination of the matched candidate coordinate axis/legend has actual semantic correlation, the matching is optimized according to the inquiry result, and effective dimension combination is selected, so that the accuracy rate of automatically generating the graph is improved.
2. The method for optimizing knowledge-graph-based Web data based on visualization requirements of claim 1, wherein in the second step, the entity extraction is divided into three stages of named entity extraction, attribute entity extraction and noun entity extraction;
2.1, entity extraction: entity extraction, also called named entity recognition, is the automatic recognition of named entities from a text data set; named entities generally refer to entities identified by person names, place names, organization names and all other proper names, and extraction is accomplished with mainstream named entity recognition systems, comprising the steps of: firstly, performing named entity recognition on the corpus content through a tool; secondly, tagging the type attributes of the identified named entities; thirdly, filtering the named entities according to the type attributes, deleting the unsuitable named entities while keeping the tags of the other named entities, and defining the entry names as named entities by default;
2.2, extracting attribute entities: extracting attributes from information frames of the vocabulary entry network corpus by taking the information frames of the vocabulary entry network corpus as the sources of the attributes, then intercepting information of the information frames of each vocabulary entry from a corpus, extracting attribute names according to the structure of the information frames, taking the attribute names as tail entities of the named entities corresponding to the names of the vocabulary entries, not reserving attribute values, and if no information frame exists in a certain vocabulary entry, not establishing the tail entities for the named entities corresponding to the vocabulary entry;
2.3, noun entity extraction, comprising four steps: the method comprises the following steps of Word splitting (Split), part-of-speech Tagging (POS Tagging), Stop Word Filtering (Stop Word Filtering) and stem extraction (Stemming), wherein the named entities are already tagged in the named entity extraction step, so that the following operation only extracts the corpus content outside the tagged entities.
3. The knowledge-graph-based Web data optimization method for visualization requirements as claimed in claim 2, wherein the process of 2.3 is as follows:
2.3.1, splitting words: designing a splitting rule mode by using a regular expression, splitting words of the corpus content according to spaces, symbols and paragraphs, and obtaining word texts;
2.3.2, part-of-speech tagging: in order to obtain the nouns in the corpus, part-of-speech tagging needs to be performed on the text vocabulary; part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique in corpus linguistics that tags the part of speech of each word in the corpus according to its meaning and context; many words have several parts of speech and multiple meanings, and the choice of part of speech depends on the context; the corpus that has undergone named-entity tagging is used as the tagging object text, part-of-speech tagging is performed, noun objects are found according to the tagging results, and non-noun objects (excluding the non-noun entry names) are removed from the corpus; at this point the corpus retains the named entities, noun objects, and original punctuation in each entry, and all other content has been removed;
2.3.3, stop word filtering: stop words (from the English term "Stop Word") are words or phrases that are automatically filtered out when processing natural language text, in order to save storage space and improve search efficiency in information retrieval; for a given purpose, any kind of word can be selected as a stop word, and stop words fall into two kinds: one kind is the Function Words contained in human language, which are used very commonly and appear very frequently but have no exact actual meaning; the other kind is Content Words, referring to some words that have actual specific meaning but no specific reference or direction; natural language processing maintains a Stop Word List, which is used as a reference dictionary to delete stop words from the corpus through word comparison, further simplifying the corpus content and ensuring that no stop words remain in the corpus;
2.3.4, stem extraction: stem extraction (stemming) is the process of removing morphological affixes to obtain the corresponding root, a processing step specific to Western languages such as English; the same English word has singular/plural inflections, tense inflections, and inflections for the different predicates corresponding to personal pronouns; these words differ slightly in form but all correspond to the same root, and when computing correlation they must be treated as the same word, which requires stemming; the Porter Stemming Algorithm is a mainstream stem extraction algorithm whose core idea is to classify and restore words according to the type of morphological affix; apart from some special inflections, most word inflections follow certain rules, and the inflections are divided into 6 categories according to these rules.
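Steps 2.3.1 and 2.3.3 can be sketched as below. The regular expression and the tiny stop-word list are illustrative assumptions, not the patent's actual splitting rule patterns or Stop Word List.

```python
import re

# Illustrative stand-in for a real Stop Word List (assumption).
STOP_WORDS = {"the", "is", "of", "a", "in"}

def split_words(text):
    """2.3.1: split corpus text into word tokens via a regular expression,
    discarding spaces, symbols, and paragraph breaks."""
    return re.findall(r"[A-Za-z]+", text)

def filter_stop_words(words):
    """2.3.3: delete stop words from the token stream by dictionary lookup."""
    return [w for w in words if w.lower() not in STOP_WORDS]

tokens = split_words("The capital of France is Paris.")
content = filter_stop_words(tokens)
```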
4. The knowledge-graph-based Web data optimization method for visualization requirements as claimed in claim 3, wherein in 2.3.4, the stem extraction step is as follows:
2.3.4.1, according to the word-inflection category, perform affix removal and word restoration for each case to obtain the stem information of the noun objects in the corpus, reducing cases where the same word appears in different forms; the 6 word-inflection categories are as follows:
2.3.4.1.1, plurals, and words ending in -ed or -ing;
2.3.4.1.2, words that contain vowels and end in y;
2.3.4.1.3, double suffix words;
2.3.4.1.4, words with suffixes such as -ic, -ful, -ness, -ative;
2.3.4.1.5, words of the form &lt; c &gt; vcvc &lt; v &gt; with suffixes such as -ant, -ence (c is a consonant, v is a vowel);
2.3.4.1.6, words of the form &lt; c &gt; vc &lt; v &gt; containing more than one vc pair between vowels and consonants, ending in -e;
2.3.4.2, creating the noun objects restored to stems as noun entities, and updating the noun objects in the corpus to their stem forms.
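A highly simplified suffix-stripping sketch in the spirit of the rule classes above; the real Porter Stemming Algorithm additionally checks measure conditions and has far more rules, so this toy (with invented suffix table and length thresholds) only illustrates the classify-and-restore idea.

```python
def toy_stem(word):
    """Strip a few common affixes (plural -s/-es/-ies, -ed, -ing), loosely
    mirroring categories 2.3.4.1.1-2.3.4.1.2. Thresholds are invented to
    avoid over-stripping very short words."""
    for suffix, repl in (("sses", "ss"), ("ies", "i"),
                         ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + repl
    return word

# "mapped" and "mapping" collapse onto the same root, as 2.3.4 intends.
stems = [toy_stem(w) for w in ["graphs", "entities", "mapped", "mapping"]]
```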
5. The visualization demand-oriented Web data optimization method based on the knowledge graph as claimed in any one of claims 1 to 4, wherein in the third step, the construction process of the knowledge graph is as follows:
3.1, training word vectors using Word2vec: Word2vec is a word vector tool that represents words as feature vectors; Word2vec converts words into numerical form, representing each word with an N-dimensional vector;
3.2, pre-grouping the corpus twice: because k-means clustering is easily affected by the distribution of the data set, k-means cannot be used directly if the core concept is to be preserved, namely that the main classification objects of the target field serve as cluster centers; the corpus therefore needs to be pre-grouped twice;
3.3, automatically searching a clustering center for the small corpus set through a k-means clustering algorithm, clustering, and simultaneously constructing a triple, wherein the method comprises the following steps:
3.3.1, determining the size of k according to the size of the corpus aggregation, wherein the larger the aggregation is, the larger the k value is;
3.3.2, constructing a triple by the entity corresponding to the centroid obtained by k-means clustering calculation and the entity corresponding to the centroid when the upper layer is grouped;
the k-means algorithm in step 3.3.2 is an unsupervised clustering algorithm, each Word is represented by a Word vector trained by Word2vec in a corpus, each small corpus set is used as a data set, clustering calculation is carried out by using the k-means clustering algorithm, and the k-means clustering step is as follows:
3.3.2.1, selecting k objects in the data set as initial centers, wherein each object represents a clustering center;
3.3.2.2, dividing the objects in the word vector sample into the class corresponding to the nearest cluster center according to the Euclidean distance between the objects and the cluster centers;
3.3.2.3, updating the clustering center: taking the mean values corresponding to all the objects in each category as the clustering center of the category, and calculating the value of a target function;
3.3.2.4, judging whether the values of the clustering center and the objective function are changed, if not, outputting the result, and if so, returning to 3.3.2.2;
3.3.3, taking the new group as a data set, calling the k-means algorithm again, and repeating the steps 3.3.1-3.3.3 until each group contains fewer elements than a certain threshold Z;
3.3.4, constructing a triple of the entity corresponding to the data point in each group and the entity corresponding to the current centroid;
all entities in the corpus establish relations with other entities, and the triples they form combine with one another into a knowledge graph; because automatic clustering may find cluster centers and cluster assignments that produce entity relations with weak relevance, manual proofreading and screening are needed after the knowledge graph is built to remove entity relations with low relevance and improve the quality of the knowledge graph.
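The clustering loop of 3.3.2.1–3.3.2.4 and the triple construction of 3.3.4 can be sketched as below. The two-dimensional "word vectors" and entity names are invented stand-ins for real Word2vec output.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100):
    centers = points[:k]                          # 3.3.2.1: initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 3.3.2.2: assign to nearest center
            i = min(range(k), key=lambda c: euclidean(p, centers[c]))
            clusters[i].append(p)
        new_centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                # 3.3.2.4: converged, output result
            break
        centers = new_centers                     # 3.3.2.3: update cluster centers
    return centers

vecs = {"apple": (0.0, 0.1), "car": (5.0, 5.1),   # invented 2-D "word vectors"
        "pear": (0.1, 0.0), "bus": (5.1, 5.0)}
centers = kmeans(list(vecs.values()), k=2)

# 3.3.4: each word forms a triple with the entity standing for its centroid.
triples = [(w, "related_to", "cluster_%d" %
            min(range(2), key=lambda c: euclidean(v, centers[c])))
           for w, v in vecs.items()]
```

The resulting triples would then be merged into the knowledge graph and manually screened, as the paragraph above notes.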
6. The knowledge-graph-based Web data optimization method for visualization requirements as claimed in claim 5, wherein the step of 3.2 is as follows:
3.2.1, first-pass grouping of the corpus, comprising the following steps:
3.2.1.1, extracting a first layer of sub-classification labels of the previously acquired target field labels, wherein the target field labels form core entities and generate a first layer of sub-classification label set Tag, wherein the first layer of sub-classification label set Tag comprises n sub-classification labels, each label has a corresponding entity and word vector, and the entities are connected with the core entities to form n triples;
3.2.1.2, taking the first-layer sub-classification label object as a centroid, calculating the Euclidean distance from each data point to each centroid in the corpus data set, then distributing the data points to the class corresponding to each centroid according to the principle of closeness, obtaining n clusters which take the first-layer sub-classification label as the centroid, namely n grouped data sets, and simultaneously dividing the corpus into n corpus sets;
wherein the Euclidean Distance in step 3.2.1.2 is an important basis for determining the class of a data point; assume given samples X_i = (x_i1, x_i2, ..., x_in) and X_j = (x_j1, x_j2, ..., x_jn), where i, j = 1, 2, ..., m, m represents the number of samples and n represents the number of features; the Euclidean distance is calculated as:

d(X_i, X_j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 )
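A direct transcription of the Euclidean distance calculation in this step; a minimal sketch with invented sample values.

```python
import math

def euclidean_distance(xi, xj):
    """Euclidean distance between two n-dimensional samples X_i and X_j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

d = euclidean_distance((0.0, 3.0), (4.0, 0.0))   # 3-4-5 right triangle
```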
3.2.2, combining TF-IDF algorithm, grouping the material base for the second time, comprising the following steps:
3.2.2.1, searching out the key words in each corpus set by calculating TF-IDF;
the TF-IDF algorithm in step 3.2.2 is a numerical statistical method for evaluating the importance of a word to a given document; the term frequency TF (Term Frequency) refers to the frequency with which a given word appears in a given document, and is calculated as:

TF_x,y = n_x,y / (sum over k of n_k,y)

where n_x,y is the number of times the term x appears in the document y and the denominator is the total number of words in the document y; the inverse document frequency IDF (Inverse Document Frequency) is used to evaluate how much information a word or term provides, i.e., whether the term is common across the whole document collection, and is calculated as:

IDF_x = log( N / N_x )

where N is the total number of documents and N_x is the number of documents containing the term x; taking each entry in the text as a document, the values of TF and IDF are finally combined to obtain the TF-IDF formula:

TF-IDF_x,y = TF_x,y × IDF_x
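The TF-IDF computation above can be sketched directly; each entry is treated as a document (a list of words), and the toy corpus below is invented for illustration.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of term in doc over total words in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / N_x)."""
    n_x = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_x)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["solar", "panel", "energy"],
        ["solar", "eclipse"],
        ["wind", "energy", "energy", "turbine"]]
score = tf_idf("energy", docs[2], docs)   # tf = 2/4, idf = ln(3/2)
```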
3.2.2.2, manually screening the keywords of each corpus set, removing keywords with low correlation to the core entity of the current corpus set, and retaining the most relevant keywords; the number of keywords retained depends on the overall quality of all the extracted keywords;
3.2.2.3, constructing a triple of the entity corresponding to the screened keyword extracted from each corpus set and the core entity of the current corpus set, taking the keywords as the mass center in each corpus set, calculating the Euclidean distance from the data point in the set to each mass center, classifying the data point, and dividing the primitive corpus into a plurality of small corpus sets at the moment.
7. The knowledge-graph-based Web data optimization method for visualization requirements according to any one of claims 1 to 4, wherein the fourth step comprises the following steps:
4.1, defining VT comprises two parts of basic attribute (basic) and visual structure (DVSCHEMA), formalized definition is as (1), wherein the basic attribute stores general information of graphic titles, subtitles and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE includes three attributes: title (title), subtitle (subtitle) and attributes (attributes), wherein the formal definitions are shown as (2), the title is used for storing the title of the finally generated visual graph, the subtitle is used for storing the subtitle of the finally generated visual graph, and the attributes are used for storing the setting parameters of the position, the color combination, the font and the font size of the finally generated visual graph;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, the basic visualization graph can be generalized into four basic categories according to the data type, the graph data structure and the graph dimension required by the graph: general graph (General), topological graph (Topology), Map (Map), Text graph (Text), formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 contain two attributes: the graph type (VType) and the graph structure (StructModel), wherein VType stores the graph types belonging to the category and StructModel stores the basic visualization structure of the graphs in the category; the formal definition is as (4), indicating that each of the four categories contains the two attributes VType and StructModel;
(4)、DVSCHEMA::=&lt;General&gt;&lt;Topology&gt;&lt;Map&gt;&lt;Text&gt;::&lt;VType&gt;&lt;StructModel&gt;。
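Definitions (1)–(4) can be rendered as a nested structure. This is a hypothetical sketch: the titles and attribute values are placeholders, and the VType lists are taken from claim 8 (chart-name spellings normalized).

```python
# Nested-dict rendering of the visualization model tree VT.
# Field names mirror the patent's BASICATTRIBUTE / DVSCHEMA terms.
VT = {
    "BASICATTRIBUTE": {               # (2): title, subtitle, attributes
        "title": "Sales by Country",
        "subtitle": "2019",
        "attributes": {"position": "top", "color": "#336699", "font": "sans"},
    },
    "DVSCHEMA": {                     # (3), (4): four basic categories
        "General":  {"VType": ["BarChart", "LineChart", "PieChart",
                               "RadarChart", "ScatterChart"],
                     "StructModel": {"Root": ["K_V", "LegendNode"]}},   # (6)
        "Topology": {"VType": ["NetworkChart", "TreeMap", "TreeMapChart"],
                     "StructModel": {}},
        "Map":      {"VType": ["AreaMapChart", "CountryMapChart", "WorldMapChart"],
                     "StructModel": {}},
        "Text":     {"VType": ["WordCloudChart"],
                     "StructModel": {"Root": ["K_V"]}},                 # (13)
    },
}

categories = sorted(VT["DVSCHEMA"])
```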
8. the knowledge-graph-based Web data optimization method facing visualization requirements of claim 7, wherein in 4.4, the graph of VType attribute of four basic categories is as follows:
4.4.1 General includes bar chart (BarChart), line chart (LineChart), pie chart (PieChart), radar chart (RadarChart), scatter chart (ScatterChart);
4.4.2, the Topology comprises a network chart (NetworkChart), a tree map (TreeMap) and an area tree map (TreeMapChart);
4.4.3, Map includes area Map (AreaMapChart), country Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes word cloud (WorldCloudchart);
4.5, in the step 4.4, the four basic categories have respective Mapping relations (Mapping), and describe data structures, data dimensions, graph structure relations and data Mapping position information of various graphs; according to Mapping information and by combining the data structure of the graph, the basic visual structure structModel of various graphs can be abstracted.
9. The knowledge-graph-based Web data optimization method for visualization requirements as claimed in claim 7, wherein in 4.5, Mapping relation Mapping and basic visualization structure structModel of each type of graph are defined as follows:
4.5.1, graphics in General type, commonly used to represent two-dimensional or three-dimensional data, information can be represented by tuples (XAxis, YAxis) or triples (XAxis, YAxis, ZAxis), Mapping structure of such graphics as (5), where Legendname represents legend name, storing each grouping information in ARRAY type; according to the Mapping structure, the structure of the basic structModel can be abstracted as (6), the child node of the structModel is a temporary Root node Root, and the Root comprises two child nodes: a key-value pair K _ V and a legend node Legendnode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
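The structure matching of step 5.1 against the General StructModel (6) can be sketched as below; a hypothetical heuristic with all example field names invented, pairing array-valued fields (potential legend nodes L) with numeric attributes inside their items (potential axis values A).

```python
def general_candidates(m_json):
    """Return candidate (L, A) doublets for the General category:
    every array-of-objects field paired with each numeric attribute
    of its first item (a crude stand-in for full StructModel matching)."""
    pairs = []
    for key, value in m_json.items():
        if isinstance(value, list) and value and isinstance(value[0], dict):
            for attr, v in value[0].items():
                if isinstance(v, (int, float)):
                    pairs.append((key, attr))
    return pairs

# Invented M-JSON prototype returned by a REST Web service.
m_json = {"country": "USA",
          "records": [{"year": 2019, "sales": 13.5, "note": "est."}]}
pairs = general_candidates(m_json)
```

Each returned doublet would then be checked against the knowledge graph per step 5.2.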
4.5.2, graphs of the Topology type are commonly used to represent topological relation data; tree maps and area tree maps can be represented by nested key-value pairs { key: value, children: { key: value } }, with the Mapping structure as (7); a network graph can be represented by a node set (Nodes) and an edge set (Links), with the Mapping structure as (8), where source represents the starting node of an edge link and target represents the node the edge link points to; according to the Mapping structure, the structure of the basic StructModel can be abstracted as (9): the StructModel has two substructures, Root1 and Root2 are the temporary root nodes of the two substructures respectively, and Root1 contains two child nodes: a key-value pair K_V and a child node children whose substructure is a key-value pair K_V; Root2 contains two child nodes: the node set Nodes and the edge set Links, where the child nodes of Nodes are key and value (value may be null) and the child nodes of Links are the starting point source and the target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphs of the Map type are usually used to represent map information and are expressed by key-value pair arrays [ { PlaceName: value } ] or triple arrays [ { lng, lat, value } ]; the Mapping structure of such graphs is as (10), where PlaceName represents a place name, lng represents longitude, and lat represents latitude; the structure of the basic StructModel can be abstracted from the Mapping structure as (11): the StructModel has two substructures, Root1 and Root2 are the temporary root nodes of the two substructures respectively, Root1 contains a child-node key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat, and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphs of the Text type commonly use a doublet (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphs is as shown in (12), where Keyword is a word extracted from the text and frequency represents the number of occurrences of the word in the text; the structure of the basic StructModel can be abstracted from the Mapping structure as shown in (13): the child node of the StructModel is a temporary root node Root, and Root contains a key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>。
10. the knowledge-graph-based Web data optimization method for visualization requirements according to any one of claims 1 to 4, wherein the fifth step comprises the following steps:
5.1, matching the Web data prototype structure M-JSON with a structModel of a visual model tree VT according to a data structure, and matching M attribute combination results of candidate coordinate axes/legends meeting conditions in the M-JSON, wherein each combination result is represented as a binary group consisting of a key value pair L and an attribute name A, and L and A respectively correspond to Legendnode and K _ V in the step 4.5.1;
5.2, matching and optimizing m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, wherein the 5.2 process is as follows:
5.2.1, each matching result in step 5.1 is represented in the form of a doublet P = (L::name, A::name), and each matching result P_i = (L_i::name, A_i::name) is converted into the triple form G_i = (L_i::name, R, A_i::name) and added to the set S = {G_1, G_2, ..., G_m};
5.2.2, for each G_i in the set S in turn, mapping G_i onto the triple structure of the knowledge graph via F(L_i::name → head, R → relation, A_i::name → tail) to obtain a triple (head, relation, tail); matching whether the current triple (head, relation, tail) exists in the constructed corpus knowledge graph, where result is True or False, represented as 1 and 0 respectively; first the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, then the relation between the head entity node and the tail entity node is matched; if and only if the head entity head, the tail entity tail and the relation are all successfully matched, result is 1, and otherwise result is 0;
5.2.3, after the query of the objects in the set S is completed, a set Q = {(G_i, result_i)} is returned; Q is used to judge whether the currently qualified doublets have semantic association, serving as the judgment of the attribute-combination matching result of the candidate coordinate axes/legends, so that a match is judged successful only when the structures match and result_i is 1, thereby improving the accuracy of data-attribute matching and reducing the generation rate of images without practical significance.
CN201911254814.7A 2019-12-10 2019-12-10 Knowledge graph-based Web data optimization method for visual requirements Active CN111177591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911254814.7A CN111177591B (en) 2019-12-10 2019-12-10 Knowledge graph-based Web data optimization method for visual requirements


Publications (2)

Publication Number Publication Date
CN111177591A true CN111177591A (en) 2020-05-19
CN111177591B CN111177591B (en) 2023-09-29

Family

ID=70655440


Country Status (1)

Country Link
CN (1) CN111177591B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680516A (en) * 2020-06-04 2020-09-18 宁波浙大联科科技有限公司 PDM system product design requirement information semantic analysis and extraction method and system
CN111985236A (en) * 2020-06-02 2020-11-24 中国航天科工集团第二研究院 Visual analysis method based on multi-dimensional linkage
CN112016276A (en) * 2020-10-29 2020-12-01 广州欧赛斯信息科技有限公司 Graphical user-defined form data acquisition system
CN112364173A (en) * 2020-10-21 2021-02-12 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN112507036A (en) * 2020-11-30 2021-03-16 武汉烽火众智数字技术有限责任公司 Knowledge graph visualization analysis method
CN112541072A (en) * 2020-12-08 2021-03-23 成都航天科工大数据研究院有限公司 Supply and demand information recommendation method and system based on knowledge graph
CN112596031A (en) * 2020-12-22 2021-04-02 电子科技大学 Target radar threat degree assessment method based on knowledge graph
CN113342913A (en) * 2021-06-02 2021-09-03 合肥泰瑞数创科技有限公司 Community information model-based epidemic prevention control method, system and storage medium
CN113609309A (en) * 2021-08-16 2021-11-05 脸萌有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
CN115048096A (en) * 2022-08-15 2022-09-13 广东工业大学 Dynamic visualization method and system for data structure
CN118260273A (en) * 2024-05-30 2024-06-28 江西展群科技有限公司 Database storage optimization method, system and medium based on enterprise data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150227559A1 (en) * 2007-07-26 2015-08-13 Dr. Hamid Hatami-Hanza Methods and systems for investigation of compositions of ontological subjects
US20160373423A1 (en) * 2015-06-16 2016-12-22 Business Objects Software, Ltd. Contextual navigation facets panel
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN108345647A (en) * 2018-01-18 2018-07-31 北京邮电大学 Domain knowledge map construction system and method based on Web
US20180232443A1 (en) * 2017-02-16 2018-08-16 Globality, Inc. Intelligent matching system with ontology-aided relation extraction


Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985236A (en) * 2020-06-02 2020-11-24 中国航天科工集团第二研究院 Visual analysis method based on multi-dimensional linkage
CN111680516A (en) * 2020-06-04 2020-09-18 宁波浙大联科科技有限公司 PDM system product design requirement information semantic analysis and extraction method and system
CN112364173B (en) * 2020-10-21 2022-03-18 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN112364173A (en) * 2020-10-21 2021-02-12 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN112016276A (en) * 2020-10-29 2020-12-01 广州欧赛斯信息科技有限公司 Graphical user-defined form data acquisition system
CN112507036A (en) * 2020-11-30 2021-03-16 武汉烽火众智数字技术有限责任公司 Knowledge graph visualization analysis method
CN112541072B (en) * 2020-12-08 2022-12-02 成都航天科工大数据研究院有限公司 Supply and demand information recommendation method and system based on knowledge graph
CN112541072A (en) * 2020-12-08 2021-03-23 成都航天科工大数据研究院有限公司 Supply and demand information recommendation method and system based on knowledge graph
CN112596031A (en) * 2020-12-22 2021-04-02 电子科技大学 Target radar threat degree assessment method based on knowledge graph
CN113342913A (en) * 2021-06-02 2021-09-03 合肥泰瑞数创科技有限公司 Community information model-based epidemic prevention control method, system and storage medium
CN113609309A (en) * 2021-08-16 2021-11-05 脸萌有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
WO2023022655A3 (en) * 2021-08-16 2023-04-13 脸萌有限公司 Knowledge graph construction method and apparatus, storage medium, and electronic device
CN113609309B (en) * 2021-08-16 2024-02-06 脸萌有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
CN115048096A (en) * 2022-08-15 2022-09-13 广东工业大学 Dynamic visualization method and system for data structure
CN115048096B (en) * 2022-08-15 2022-11-04 广东工业大学 Dynamic visualization method and system for data structure
CN118260273A (en) * 2024-05-30 2024-06-28 江西展群科技有限公司 Database storage optimization method, system and medium based on enterprise data
CN118260273B (en) * 2024-05-30 2024-07-30 江西展群科技有限公司 Database storage optimization method, system and medium based on enterprise data

Also Published As

Publication number Publication date
CN111177591B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN111143479B (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN111177591B (en) Knowledge graph-based Web data optimization method for visual requirements
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN108763333B (en) Social media-based event map construction method
CN109800284B (en) Task-oriented unstructured information intelligent question-answering system construction method
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN104636466B (en) Entity attribute extraction method and system for open webpage
Kowalski Information retrieval architecture and algorithms
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
Zubrinic et al. The automatic creation of concept maps from documents written using morphologically rich languages
CN102955848B (en) A kind of semantics-based three-dimensional model retrieval system and method
US20150081277A1 (en) System and Method for Automatically Classifying Text using Discourse Analysis
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN108319583B (en) Method and system for extracting knowledge from a Chinese corpus
CN113268606B (en) Knowledge graph construction method and device
CN106484797A (en) Incident summary extraction method based on sparse learning
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Alami et al. Hybrid method for text summarization based on statistical and semantic treatment
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN112036178A (en) Distribution network entity related semantic search method
Moncla et al. Automated geoparsing of Paris street names in 19th century novels
CN112148886A (en) Method and system for constructing content knowledge graph
CN111104437A (en) Test data unified retrieval method and system based on object model
CN112989208A (en) Information recommendation method and device, electronic equipment and storage medium
CN113673252A (en) Automatic join recommendation method for data table based on field semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230831

Address after: Room 202, Building B, Tian'an Digital Entrepreneurship Park, No. 441 Huangge Road, Huanggekeng Community, Longcheng Street, Longgang District, Shenzhen City, Guangdong Province, 518000

Applicant after: Shenzhen Shukangyun Information Technology Co.,Ltd.

Address before: No. 9 Santong Road, Houzhou Street, Taijiang District, Fuzhou City, Fujian Province, 350000. Zhongting Street Renovation. 143 Shopping Mall, 3rd Floor, Jiahuiyuan Link Section

Applicant before: Fuzhou Zhiqing Intellectual Property Service Co.,Ltd.

Effective date of registration: 20230831

Address after: No. 9 Santong Road, Houzhou Street, Taijiang District, Fuzhou City, Fujian Province, 350000. Zhongting Street Renovation. 143 Shopping Mall, 3rd Floor, Jiahuiyuan Link Section

Applicant after: Fuzhou Zhiqing Intellectual Property Service Co.,Ltd.

Address before: No. 18 Chaowang Road, Zhaohui District Six, Hangzhou City, Zhejiang Province, 310014

Applicant before: ZHEJIANG UNIVERSITY OF TECHNOLOGY

GR01 Patent grant
GR01 Patent grant