Disclosure of Invention
The invention provides a knowledge graph-based Web data optimization method oriented to visual requirements, which analyzes, generalizes and models common visual graphics, and performs structural matching between the structure of the Web data and the structure of the visual model to obtain candidate coordinate-axis/legend attribute combinations that meet the requirements. A knowledge graph is constructed from a network corpus, and whether an attribute combination has a semantic association is obtained by querying the knowledge graph, so as to further optimize the visualization of the Web data and improve the probability of generating effective graphics.
In order to realize the invention, the following technical scheme is adopted:
a visual demand-oriented knowledge graph-based Web data optimization method comprises the following steps:
firstly, constructing a target field corpus: network corpus content is taken as the basis for constructing the knowledge graph, and the entry information of the network corpus is taken as the original corpus content. The original corpus content is screened before constructing the knowledge graph: comparative analysis of the entry web pages shows that, besides the title and body text, they contain HTML tags, entry editing information, web page link information and other redundant information irrelevant to the entry, so the entry content is filtered and cleaned and the title and effective body text are extracted. The filtering content comprises: HTML tag/text style symbol filtering, entry template symbol and non-English character filtering, entry editing information filtering, picture information filtering, link information filtering, page-specific title/attribute name filtering and numerical filtering of the entry web page content;
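As an illustrative sketch of this cleaning step, the following Python fragment filters a raw entry page with regular expressions; the patterns and the helper name `clean_entry_html` are assumptions for illustration, not the patent's exact rules:

```python
import re

def clean_entry_html(raw_html: str) -> str:
    """Hypothetical sketch of the first-step filtering; the concrete
    patterns are illustrative, not the patent's exact rules."""
    text = re.sub(r"<img[^>]*/?>", " ", raw_html)      # picture information filtering
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text)   # drop hyperlink tags, keep text
    text = re.sub(r"<[^>]+>", " ", text)               # remaining HTML tags
    text = re.sub(r"\[\s*edit\s*\]", " ", text)        # entry editing information
    text = re.sub(r"\b\w+\s*\{[^}]*\}", " ", text)     # inline style rules, e.g. span{...}
    text = re.sub(r"\b\d+(\.\d+)?\b", " ", text)       # numerical filtering
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_entry_html('<h1>Sport</h1><p>Basketball has 5 players.</p>')` keeps only the effective text.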
Secondly, entity extraction facing to a corpus: the knowledge graph is a data information network with a graph structure formed by entities and relations, the basic structure of the knowledge graph is represented by a triplet of 'entity-relation-entity', the triplet comprises two entities with real semantic relations and the relation between the two entities, the relation is represented by a form of G= (head, relation, tail), wherein G represents the triplet, head represents the head entity, tail represents the tail entity, and relation represents the relation between the head entity and the tail entity; each entity also comprises an attribute and an attribute value, the attribute of the entity is converted into a tail entity connected with the entity, and a relationship is established between the entity and the tail entity;
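The triplet G = (head, relation, tail) can be sketched as a simple data structure; the relation name `has_attribute` used for the attribute-to-tail-entity conversion is illustrative, not from the patent:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """G = (head, relation, tail): two entities with a real semantic
    relation, the basic unit of the knowledge graph."""
    head: str
    relation: str
    tail: str

# An entity attribute is converted into a tail entity linked to the entity;
# the entity and relation names below are illustrative examples.
g = Triple(head="NBA", relation="has_attribute", tail="Sport")
```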
and a third step of: combining Word2vec, performing two rounds of pre-grouping on the corpus, and using the k-means clustering algorithm to construct the knowledge graph. The structure of a triplet G is (head, relation, tail); as head and tail vary, relation also takes many forms, so within the knowledge graph relation is in fact a relation set used to represent the complex relations among entities. The aim here is only to judge whether a semantic relation exists between two attributes, i.e. whether any relation exists between the two entities, without attending to which relation it is; therefore the corpus is pre-grouped twice by computing word vectors of the corpus vocabulary, and entity relations are extracted with the k-means clustering algorithm;
Fourthly, constructing a visual model Tree (VT for short): classifying various visual graphics, summarizing the attribute and the structural characteristics of the various graphics, and formally expressing various graphic information by creating a visual model tree (VT);
fifthly, a data visualization optimizing and matching method based on a network corpus knowledge graph comprises the following steps: defining M-JSON as a prototype structure of JSON returned by REST Web service; matching each structModel in the Web data prototype structure M-JSON and the visual model tree VT according to the data structure, and returning a result which is a set formed by attribute combinations of candidate coordinate axes/legends meeting the conditions; based on structure matching, inquiring whether the attribute combination of the matched candidate coordinate axis/legend has actual semantic association or not by utilizing the knowledge graph constructed in the third step, optimizing matching according to the inquiry result, and selecting an effective dimension combination so as to improve the accuracy of automatically generating the graph. Further, in the second step, the entity extraction is divided into three stages of named entity extraction, attribute entity extraction and noun entity extraction;
2.1, entity extraction: entity extraction, also known as named entity recognition, is the automatic recognition of named entities from a text dataset; named entities commonly refer to person names, place names, institution names, and all other entities identified by proper names. The process can be accomplished with a mainstream named entity recognition system, whose steps include: 1. performing named entity recognition on the corpus content with the tool; 2. labeling each identified named entity with its type attribute; 3. filtering named entities according to type attribute, deleting unsuitable ones and retaining the labels of the others; entry names are defined as named entities by default;
2.2, extracting attribute entities: extracting attributes from information frames of the vocabulary network by taking the information frames as sources of the attributes, then intercepting the information frame information of each vocabulary in the corpus, extracting attribute names according to the information frame structure, and taking the attribute names as tail entities of named entities corresponding to the names of the corresponding vocabulary, wherein attribute values are not reserved, and if the information frames do not exist in a certain vocabulary, the tail entities do not need to be created for the named entities corresponding to the vocabulary;
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech tagging (POS Tagging), stop word filtering (Stop Word Filtering) and stem extraction (Stemming); the named entity extraction step has already tagged the identified named entities, so the following operations extract only the corpus content outside the tagged entities.
Still further, the procedure of 2.3 is as follows:
2.3.1, word splitting: using a regular expression design splitting rule mode, and carrying out word splitting on the corpus content according to spaces, symbols and paragraphs to obtain word texts;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, part-of-speech tagging must first be carried out on the text vocabulary. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique that, in corpus linguistics, labels each word's part of speech according to its meaning and context; many words carry several parts of speech and several meanings at once, and the correct choice depends on the context. The corpus already marked with named entities is used as the tagging object text; noun objects are located according to the tagging results, and non-noun objects are removed from the corpus, excluding the non-noun entry names. At this point the corpus retains the named entities, the noun objects and the original punctuation of each sentence, with all contents still in the original text order;
2.3.3, stop word filtering: the term stems from Stop Word, a word automatically filtered out when processing natural language text in order to save storage space and improve search efficiency in information retrieval; for a given purpose, any kind of word may be selected as a stop word. Stop words fall into two categories: one is the function words (Function Words) of human language, which are very common in use, extremely high in frequency and without exact concrete meaning; the other is a subset of content words (Content Words), i.e. words that have a practical, specific meaning but no definite reference or pointing. Natural language processing already provides ready-made stop word lists; using such a list as a reference dictionary, stop words are deleted from the corpus by word comparison, further condensing the corpus content and ensuring that no stop words remain;
2.3.4, extracting word stems: stemming is the process of removing morphological affixes to obtain the corresponding root word, a treatment specific to Western languages such as English; the same English word has singular/plural inflections, tense inflections, and inflections of the predicate agreeing with different personal pronouns. Although slightly different in form, these words correspond to the same root and should be treated as the same word when computing correlation, which requires stemming. The Porter stemming algorithm (Porter Stemming Algorithm) is a mainstream stemming algorithm whose core idea is to classify, process and restore words according to the type of morphological affix; apart from some special variants, most word variants are regular and are divided into 6 categories according to these rules.
Still further, in the 2.3.4, the stem extraction step is as follows:
2.3.4.1, performing affix removal and word recovery according to word deformation categories, and obtaining stem information of noun objects in a corpus to reduce the situations of different shapes of the same word, wherein 6 different word deformations are as follows:
2.3.4.1.1, plurals and words ending in -ed or -ing;
2.3.4.1.2, words which contain a vowel and end with y;
2.3.4.1.3, double-suffix words;
2.3.4.1.4, words with suffixes such as -ic, -ful, -less, -ative;
2.3.4.1.5, in the case of <c>vcvc<v>, words with suffixes such as -ant, -ence (c is a consonant, v is a vowel);
2.3.4.1.6, in the case of <c>vc<v> with more than 1 vc pair between them, words ending with e;
2.3.4.2, creating the noun objects restored to stems as noun entities, updating the noun objects in the corpus and representing them in stem form.
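The noun entity pipeline above (word splitting, stop word filtering, stem extraction) can be sketched as follows; the stop word list is a tiny sample, and the `stem` function applies only a few simplified suffix rules in the spirit of the Porter algorithm, not its full 6-category rule set:

```python
import re

STOP_WORDS = {"a", "an", "the", "which", "is", "are", "of", "in"}  # tiny sample list

def split_words(text):
    # 2.3.1: split on spaces/symbols via a regular-expression rule pattern
    return re.findall(r"[A-Za-z]+", text)

def stem(word):
    """Greatly simplified suffix stripping (illustrative only)."""
    for suffix, repl in (("sses", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

def noun_entities(text):
    # 2.3.3 + 2.3.4: drop stop words, reduce the remaining words to stems
    return [stem(w.lower()) for w in split_words(text) if w.lower() not in STOP_WORDS]
```

For example, `noun_entities("The apples are falling")` reduces both inflected words to their stems.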
In the third step, the construction flow of the knowledge graph is as follows:
3.1, using Word2vec to train word vectors: Word2vec is a word vector tool that represents words as feature vectors, converting words into numerical form represented by N-dimensional vectors.
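As a hedged illustration of what trained word vectors provide: words become N-dimensional numerical vectors whose similarity can be compared. The 3-dimensional vectors below are hand-set toy values, not actual Word2vec output:

```python
import math

# Toy "word vectors" (illustrative values, not trained Word2vec output)
VEC = {
    "basketball": [0.9, 0.1, 0.0],
    "football":   [0.8, 0.2, 0.1],
    "painting":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    # Cosine similarity between two word vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

With these toy values, "basketball" is closer to "football" than to "painting", which is the property the clustering in step 3.3 relies on.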
3.2, pre-grouping the corpus twice: because k-means clustering is easily affected by the distribution of the data set, and in order to ensure that the core concepts, i.e. the main classification objects of the target field, become cluster centers, k-means clustering cannot be used directly, and the corpus is first pre-grouped twice;
3.3, automatically searching a clustering center for the small corpus set through a k-means clustering algorithm, clustering, and constructing triples at the same time, wherein the method comprises the following steps:
3.3.1, determining the size of k according to the size of the small corpus set, wherein the larger the set is, the larger the k value is.
3.3.2, constructing a triplet between the entity corresponding to each centroid obtained by k-means clustering and the entity corresponding to the centroid of the previous layer of grouping.
The k-means algorithm in step 3.3.2 is an unsupervised clustering algorithm; each word is represented by the word vector trained by Word2vec from the corpus. Each small corpus set is taken as a data set, and clustering is computed with the k-means algorithm. The steps of k-means clustering are as follows:
3.3.2.1, selecting k objects in the data set as initial centers, wherein each object represents a clustering center;
3.3.2.2, objects in the word vector sample are classified into classes corresponding to the cluster centers closest to the objects according to Euclidean distance between the objects and the cluster centers;
3.3.2.3, update cluster center: taking the average value corresponding to all objects in each category as a clustering center of the category, and calculating the value of an objective function;
3.3.2.4, judging whether the values of the clustering center and the objective function are changed, if not, outputting a result, and if so, returning to 3.3.2.2;
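The k-means steps 3.3.2.1-3.3.2.4 can be sketched in pure Python as follows; this is a minimal illustration over small point lists, not the patent's exact implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal sketch of steps 3.3.2.1-3.3.2.4 (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # 3.3.2.1: k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # 3.3.2.2: nearest center by Euclidean distance
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        new_centers = [                                  # 3.3.2.3: mean of each cluster
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                       # 3.3.2.4: stop when unchanged
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of word-vector-like points, the sketch converges to two balanced clusters.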
3.3.3, taking the new groups as data sets, calling the k-means algorithm again, and repeating steps 3.3.1-3.3.3 until each group contains fewer elements than a certain threshold Z;
3.3.4, constructing a triplet between the entity corresponding to the data point in each group and the entity corresponding to the current centroid;
All entities in the corpus are related to other entities, and the triples formed by these entities combine with each other to form the knowledge graph. Because cluster centers and clusters are found by automatic clustering, entity relations with weak correlation may be generated; therefore manual checking and screening are needed after the knowledge graph is built, removing entity associations with low correlation so as to improve the quality of the knowledge graph.
The step of 3.2 is as follows:
3.2.1, grouping the corpus for the first time, with the following steps:
3.2.1.1, extracting a first layer of sub-classification labels of the target field label obtained previously, wherein the target field label forms a core entity, and generating a first layer of sub-classification label set Tag, wherein n sub-classification labels are included in total, each label has a corresponding entity and word vector, and the entities are connected with the core entity to form n triples;
3.2.1.2, taking the first-layer sub-classification label objects as centroids, calculating the Euclidean distance from each data point in the corpus data set to each centroid, and then assigning each data point to the class of its nearest centroid, obtaining n clusters, i.e. n grouped data sets; with the first-layer sub-classification labels as centroids, the corpus is thus divided into n corpus sets;
wherein the Euclidean distance (Euclidean Distance) in step 3.2.1.2 is the main basis for judging a data point's category. Given samples x_i = (x_i1, x_i2, …, x_in) and x_j = (x_j1, x_j2, …, x_jn), where i, j = 1, 2, …, m, m is the number of samples and n is the number of features, the Euclidean distance is calculated by:

d(x_i, x_j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_in − x_jn)² )
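The distance computation and the nearest-centroid assignment of step 3.2.1.2 can be sketched as:

```python
import math

def euclidean(x, y):
    # d(x_i, x_j): square root of the summed squared feature differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def assign_to_centroid(point, centroids):
    """Return the index of the nearest centroid (the 'nearby' principle)."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))
```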
3.2.2, combining TF-IDF algorithm, grouping the corpus secondarily, the steps are as follows:
3.2.2.1, searching out the keywords in each corpus set by calculating TF-IDF.
The TF-IDF algorithm in step 3.2.2 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (Term Frequency) is the frequency with which a given word appears in the given document, and its calculation formula is:

TF_x,y = n_x,y / Σ_k n_k,y

where n_x,y is the number of times the term x appears in document y, and Σ_k n_k,y is the total number of words in document y. The inverse document frequency IDF (Inverse Document Frequency) evaluates how much information a word or term provides, i.e. whether the term is common across all documents, and its calculation formula is:

IDF_x = log(N / n_x)

where N is the total number of documents and n_x is the number of documents in which the term x appears; here each entry text is treated as a document. Finally the values of TF and IDF are combined, giving the TF-IDF formula:

TF-IDF_x,y = TF_x,y × IDF_x
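The three formulas can be sketched directly; the toy documents below are illustrative token lists standing in for entry texts:

```python
import math

def tf(term, doc):
    # TF_x,y = n_x,y / Σ_k n_k,y
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF_x = log(N / n_x); assumes the term occurs in at least one document
    return math.log(len(docs) / sum(1 for d in docs if term in d))

def tf_idf(term, doc, docs):
    # TF-IDF_x,y = TF_x,y × IDF_x
    return tf(term, doc) * idf(term, docs)

# Toy tokenized "entry texts" (contents are illustrative)
docs = [["nba", "basketball", "team"], ["basketball", "game"], ["opera", "music"]]
```

A term confined to one document ("nba") scores higher than one spread across documents ("basketball"), which is what makes TF-IDF useful for keyword selection in step 3.2.2.1.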
3.2.2.2, manually screening the keywords of each corpus set, removing keywords with low correlation to the current set's core entity and retaining the most relevant ones; the number of retained keywords depends on the overall quality of all the extracted keywords;
3.2.2.3, constructing triples between the entities corresponding to the selected keywords of each corpus set and the core entity of that set; then, taking the keywords as centroids within each corpus set, calculating again the Euclidean distance from the data points in the set to each centroid and classifying the data points; at this point the original corpus is divided into many small corpus sets.
The fourth step comprises the following steps:
4.1, VT is defined to comprise basic attributes (BASICATTRIBUTE) and a visual structure (DVSCHEMA), with the formal definition as (1), where the basic attributes store general information such as the graphic title, subtitle and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE comprises three attributes: title (title), storing the title of the finally generated visual graphic; subtitle (subtitle), storing the subtitle of the finally generated visual graphic; and attributes (attributes), storing the position, color combination and font-size setting parameters of the finally generated visual graphic;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, DVSCHEMA categorizes common visual graphics into four basic categories according to the data types, graphic data structures and graphic dimensions the graphics require: general graphics (General), topology (Topology), map (Map) and text graphics (Text), with the formal definition as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 each comprise two attributes: graphic type (VType) and graphic structure (StructModel). VType stores the graphic types belonging to the category, and StructModel stores the basic visual structures of those graphics; the formal definition is as (4), where A::B indicates that A contains attribute B;
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>。
In step 4.4, the graphic types stored in the VType attribute of the four basic categories are as follows:
4.4.1, general includes bar graph (barChart), line graph (LineChart), pie graph (PieChart), radar graph (RadarChart), scatter graph (ScaterChart);
4.4.2, the Topology includes network map (netchart), tree map (TreeMap), area tree map (TreeMapChart);
4.4.3, maps include regional Map (AreaMapChart), national Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes word cloud (WordCloudChart);
4.5, the four basic categories in step 4.3 each have their own Mapping relation (Mapping), describing the data structure, data dimensions, graphic structure relations and data mapping position information of each kind of graphic; according to the Mapping information and the data structure of the graphic, the basic visual structure StructModel of each kind of graphic can be abstracted.
In the 4.5, mapping relation Mapping of various graphics and basic visualization structure StructModel are defined as follows:
4.5.1, graphics in the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a binary group (XAxis, YAxis) or a triple (XAxis, YAxis, ZAxis). The Mapping structure of such graphics is as (5), where LegendName represents the legend name and each group of information is stored as ARRAY type. According to the Mapping structure, the basic StructModel can be abstracted as (6): the child node of StructModel is the temporary root node Root, and Root contains two child nodes: the key-value pair K_V and the legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphics in the Topology type are typically used to represent topological relationship data. The tree map and the area tree map can represent tree structures with nested key-value pairs { key: value, children: { key: value } }, with the Mapping structure as (7); the network map can represent graph structures with node sets (Nodes) and edge sets (Links), with the Mapping structure as (8), where source represents the starting node of an edge link and target the node it points to. According to the Mapping structures, the basic StructModel can be abstracted as (9): StructModel has two substructures, with Root1 and Root2 as their temporary root nodes. Root1 contains two child nodes: the key-value pair K_V and the child node children, whose substructure is the key-value pair K_V; Root2 contains two child nodes: the node set Nodes and the edge set Links, where the child nodes of the node set are the keyword key and the value value (value may be null), and the child nodes of the edge set are the starting point source and the target node target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphics in the Map type are typically used to represent map information, with key-value pair arrays [ { PlaceName: value } ] or triple sets [ { lng, lat, value } ]; the Mapping structure of such graphics is as (10), where PlaceName represents a place name, lng represents longitude and lat represents latitude. According to the Mapping structure, the basic StructModel can be abstracted as (11): StructModel has two substructures, with Root1 and Root2 as their temporary root nodes; Root1 contains the child-node key-value pair K_V, and Root2 contains three child nodes: longitude lng, latitude lat and value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphics in the Text type commonly use a binary group (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphics is as (12), where Keyword is a word extracted from the text and frequency is the number of times the word occurs in the text. According to the Mapping structure, the basic StructModel can be abstracted as (13): the child node of StructModel is the temporary root node Root, and Root contains the key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>。
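The four StructModel definitions (6), (9), (11) and (13) can be rendered as nested dictionaries for illustration; `None` marks leaf child nodes, and the optional value of Nodes is kept as an ordinary key:

```python
# The VT structures of 4.5.1-4.5.4 as plain Python dictionaries;
# key names follow the definitions above, None marks leaf child nodes.
STRUCT_MODELS = {
    "General":  {"Root": {"K_V": None, "LegendNode": None}},            # (6)
    "Topology": {"Root1": {"K_V": None, "children": {"K_V": None}},     # (9)
                 "Root2": {"Nodes": {"key": None, "value": None},
                           "Links": {"source": None, "target": None}}},
    "Map":      {"Root1": {"K_V": None},                                # (11)
                 "Root2": {"lng": None, "lat": None, "value": None}},
    "Text":     {"Root": {"K_V": None}},                                # (13)
}
```

A structure-matching step can then walk a Web data prototype and one of these templates in parallel.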
the fifth step comprises the following steps:
5.1, matching the Web data prototype structure M-JSON with each StructModel of the visual model tree VT according to the data structure, obtaining m qualifying candidate coordinate-axis/legend attribute combination results in M-JSON, where each combination result is expressed as a binary group consisting of a key-value pair L and an attribute name A; L and A correspond to LegendNode and K_V in step 4.5.1, respectively;
5.2, matching and optimizing the m qualifying attribute combinations with the constructed network corpus knowledge graph, the process being as follows:
5.2.1, each matching result in step 5.1 is represented as a binary group P = (L::name, A::name); each matching result P_i = (L_i::name, A_i::name) is converted into the triplet form G_i = (L_i::name, R, A_i::name) and put into the set S = {G_1, G_2, …, G_m};
5.2.2, the three parameters of each G_i in set S are sequentially mapped onto the triplet structure of the knowledge graph as F(L_i::name → head, R → relation, A_i::name → tail), giving a triplet (head, relation, tail); it is then matched whether the current triplet (head, relation, tail) exists in the corpus knowledge graph, the query result being True or False, expressed as 1 and 0 respectively. The head entity node head and the tail entity node tail are matched in the corpus knowledge graph first, then the relation between them; the result is set to 1 if and only if the head entity head, the tail entity tail and the relation relation are all matched successfully;
5.2.3, after the queries over set S are completed, the set Q = {(G_i, result_i)} is returned; Q indicates whether the currently qualifying binary groups have semantic associations and serves as the judgment of the matching results of the candidate coordinate-axis/legend attribute combinations, so that a match is judged successful only if the structure matches and result_i is 1, thereby improving the accuracy of data attribute matching and reducing the generation rate of graphics without practical meaning.
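Steps 5.2.1-5.2.3 can be sketched as a membership query against the triple set; the knowledge graph content and the relation name `related` are toy assumptions for illustration:

```python
# Toy knowledge graph: a set of (head, relation, tail) triples, where the
# single generic relation stands in for R (only existence matters here).
KG = {("team", "related", "score"), ("team", "related", "city")}

def optimize_matches(candidates, kg=KG, relation="related"):
    """candidates: (legend_name, attribute_name) pairs from structure
    matching; returns Q = [(G_i, result_i)] with result_i in {0, 1}."""
    q = []
    for legend, attr in candidates:
        triple = (legend, relation, attr)          # F: L::name->head, R->relation, A::name->tail
        q.append((triple, 1 if triple in kg else 0))
    return q
```

Only combinations whose result is 1 (structure match plus semantic association) are kept for graph generation.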
The beneficial effects of the invention are mainly shown in the following steps: when the Web data is visualized to generate the graph, the method can utilize the Web corpus data to construct a Web corpus knowledge graph, analyze, generalize and model the common visualized graph, optimize the matching process of the Web data prototype structure and the common visualized graph model, reduce the generation of redundant graph and improve the generation rate of the effective graph. Meanwhile, the manual participation in the graphic screening work is reduced in the automatic data visualization process, and the Web data visualization flow is simplified.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and 2, a knowledge graph-based Web data optimization method facing to visual requirements includes the following steps:
firstly, constructing a target field corpus: the entry content of a network corpus (such as Wikipedia) is taken as the basis for constructing the knowledge graph, improving the text quality and comprehensiveness of the domain corpus. The entry information of the network corpus is used as the original corpus content and screened before the knowledge graph is built; analysis of the entry web pages shows that besides the title and body text they contain HTML tags, entry editing information, web page link information and other redundant information irrelevant to the entry, so the target entries are filtered and cleaned and the title and effective body text are extracted. The filtering content comprises: HTML tag/text style symbol filtering (e.g., deleting HTML tags such as <h1>text</h1>, <p>text</p>, <div>text</div> while retaining the text; deleting style rules such as span{font-color:#eee}), entry editing information filtering (e.g., deleting [edit] tags), picture information filtering (e.g., deleting <img src='…'/> picture tags), link information filtering (e.g., deleting <a href="…" title="…">text</a> hyperlink tags while retaining the text information), page-specific title/attribute name filtering (e.g., deleting proprietary titles and attribute names such as "Further reading"), and numerical filtering (e.g., deleting numerical values such as 20, 30);
For example, using Wikipedia (Wikipedia) web corpus, obtaining webpage content of Wikipedia Athletic sports (athletics sports) through a crawler, filtering and screening to obtain entry corpus content containing Athletic sports (athletics sports) and sub-classifications thereof;
secondly, entity extraction facing the corpus: the knowledge graph is a data information network with a graph structure formed by entities and relations; its basic structure is represented by the triplet "entity-relation-entity", which comprises two entities with an actual semantic relation and the relation between them, written in the form G=(head, relation, tail), where head represents the head entity, tail the tail entity, and relation the relation between them. Each entity also has attributes and attribute values; an entity's attribute is likewise converted into a tail entity connected to the entity, and a relation is established between the two. Entity extraction is divided into three stages: named entity extraction, attribute entity extraction and noun entity extraction;
2.1, entity extraction: entity extraction, also known as named entity recognition, is the automatic recognition of named entities from a text dataset; named entities commonly refer to person names, place names, institution names, and all other entities identified by proper names. The process can be completed with a mainstream named entity recognition system; for example, Stanford NER can mark the entities in a text by type and recognize seven attribute types including Time, Location, Person, Date, Organization, Money and Percent. Such a named entity recognition system is used as the tool to recognize named entities in the corpus content, and each recognized named entity is labeled with its type attribute. The main process is: 1. performing named entity recognition on the corpus content with the tool; 2. labeling each identified named entity with its type attribute; 3. filtering named entities according to type attribute, deleting unsuitable ones and retaining the labels of the others; entry names are defined as named entities by default.
2.2, extracting attribute entities: the information box of an entry in the network corpus is taken as the source of attributes; the information box of each entry in the corpus is intercepted, attribute names are extracted according to the information box structure and taken as tail entities of the named entity corresponding to the entry name, while attribute values are not retained; if an entry has no information box, no tail entities need to be created for its named entity. Taking the information box (Info Box) of the entry "National Basketball Association (NBA)" in Wikipedia as an example, it is formed as a table: the content of row 1, column 1 is "Sport", row 1, column 2 is "Basketball", row 2, column 1 is "Founded", row 2, column 2 is "June 6, 1946; 73 years ago", etc.; the first-column contents "Sport" and "Founded" are extracted and combined with the entry "National Basketball Association (NBA)" respectively to construct triples;
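The information box handling of this step can be sketched as follows; the relation name `has_attribute` and the row format are assumptions for illustration:

```python
def infobox_triples(entry_name, infobox_rows):
    """Sketch of step 2.2: each first-column attribute name becomes the
    tail entity of a triple; second-column attribute values are discarded.
    The relation name 'has_attribute' is illustrative."""
    return [(entry_name, "has_attribute", row[0]) for row in infobox_rows]

# Rows modeled on the NBA info box example above
rows = [("Sport", "Basketball"), ("Founded", "June 6, 1946; 73 years ago")]
triples = infobox_triples("National Basketball Association", rows)
```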
2.3, noun entity extraction, comprising four steps: word splitting (Split), part-of-speech Tagging (POS tag), stop word filtering (Stop Word Filtering) and stem extraction (Stemming), the named entity extraction step already tags the identified named entity, so that the next operation only extracts the corpus content outside the tagged entity;
2.3.1, word splitting: using a regular expression design splitting rule mode, and carrying out word splitting on the corpus content according to spaces, symbols and paragraphs to obtain word texts;
2.3.2, part-of-speech tagging: to obtain the nouns in the corpus, part-of-speech tagging must first be carried out on the text vocabulary. Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique that, in corpus linguistics, labels each word's part of speech according to its meaning and context; many words carry several parts of speech and several meanings at once, and the correct choice depends on the context. The corpus already tagged with named entities is used as the tagging object text; noun objects are found according to the tagging results and non-noun objects are removed from the corpus, excluding the non-noun entry names. At this point the named entities, noun objects and original punctuation of each sentence remain in the corpus, with all contents still in the original text order;
2.3.3, stop word filtering: the term Stop Word comes from information retrieval, and refers to words that are automatically filtered out when processing natural language text in order to save storage space and improve search efficiency. For a given purpose, any type of word may be selected as a stop word. In the present invention, stop words mainly include two types: one is the function words (Function Words) of human language, such as articles, conjunctions, adverbs or prepositions, which are extremely common and occur with very high frequency but carry no exact actual meaning, such as a, an, the, which, and the like; the other is content words (Content Words), referring here to words that have substantial meaning but no explicit reference or pointing, such as want, welcome, enough, consider, indeed, etc. Ready-made stop word lists (Stop Word List) are available in natural language processing; using such a list as a reference dictionary, stop words are deleted from the corpus through word comparison, further simplifying the corpus content and ensuring that no stop words remain in the corpus;
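The filtering of step 2.3.3 amounts to a dictionary lookup against a stop word list; the short list below is illustrative only (a real system would use a full published list), covering both categories named above:

```python
# Illustrative stop word list: function words (a, an, the, which, ...)
# and vague content words (want, welcome, enough, consider, indeed, ...).
STOP_WORDS = {"a", "an", "the", "which", "and", "or", "of", "in",
              "want", "welcome", "enough", "consider", "indeed"}

def filter_stop_words(tokens):
    """Delete stop words from a token list by dictionary comparison."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(filter_stop_words(["The", "NBA", "is", "a", "basketball", "league"]))
```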
2.3.4, stemming: stemming is the process of removing morphological affixes to obtain the corresponding root word, and is a processing step specific to Western languages such as English. English words have singular/plural inflections (such as apple and apples), tense inflections such as -ing and -ed (such as doing and did), inflections of verbs agreeing with different subjects (such as likes and like), and so on; although different in form, these words correspond to the same root, and should be treated as the same word when computing relevance, which requires stemming. The Porter stemming algorithm (Porter Stemming Algorithm) is a mainstream stemming algorithm; its core idea is to classify, process and restore words according to the type of morphological affix. Apart from some special inflections, most word inflections are regular, and they are divided into 6 categories according to these rules. The stemming steps are as follows:
2.3.4.1, removing affixes and restoring words according to the inflection category, obtaining the stem information of the noun objects in the corpus and reducing cases where the same word appears in different forms. The 6 inflection categories are as follows:
2.3.4.1.1, plurals and words ending in -ed or -ing;
2.3.4.1.2, words which contain a vowel and end with y;
2.3.4.1.3, double-suffix words;
2.3.4.1.4, words with suffixes such as -ic, -ful, -less, -ative;
2.3.4.1.5, in the case <c>vcvc<v>, words with suffixes such as -ant, -ence (c denotes a consonant, v denotes a vowel);
2.3.4.1.6, in the case <c>vc<v> where more than one vc pair occurs between the vowels and consonants, words ending with e;
2.3.4.2, creating noun entities from the noun objects restored to stems, and updating the noun objects in the corpus so that they are represented in stem form.
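The full Porter algorithm applies several rule steps; the sketch below implements only the first inflection category (plural/-ed/-ing removal) to illustrate affix removal. It is a deliberate simplification, not the Porter stemmer itself:

```python
def simple_stem(word):
    """Strip a few regular affixes (plurals, -ed, -ing) -- a toy
    illustration of the first Porter rule category; real use would
    call a full Porter stemmer implementation."""
    w = word.lower()
    vowels = set("aeiou")
    if w.endswith("sses"):
        return w[:-2]                      # glasses -> glass
    if w.endswith("ies"):
        return w[:-3] + "y"                # bodies -> body (simplified)
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                      # apples -> apple
    if w.endswith("ing") and vowels & set(w[:-3]):
        return w[:-3]                      # doing -> do
    if w.endswith("ed") and vowels & set(w[:-2]):
        return w[:-2]                      # founded -> found
    return w

for w in ["apples", "doing", "founded", "glasses"]:
    print(w, "->", simple_stem(w))
```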
And a third step of: combining Word2vec, pre-grouping the corpus twice, and constructing the knowledge graph using the k-means clustering algorithm. The structure of a triplet G is (head, relation, tail); as the head and tail differ, the relation varies as well, and relation is in fact a relation set in the knowledge graph used to represent the complex relations among the various entities. The aim of the method is to judge whether semantic association exists between two attributes, namely whether a relation exists between two entities, without concern for what the relation specifically is. The corpus is pre-grouped twice by calculating word vectors for the corpus vocabulary, and entity relations are extracted using the k-means clustering algorithm. The construction flow of the knowledge graph is as follows:
3.1, training word vectors with Word2vec: Word2vec is a word vector tool that represents words as feature vectors. Word2vec converts words into numerical form, representing each word with an N-dimensional vector. For example, Word2vec is used to perform word vector calculation on the content of the acquired Athletic sports (athletics sports) corpus, with the word vector dimension set to 300. The higher the word vector dimension, the richer the feature expression of a word, but the time cost of training and the calculation cost when the model is invoked also increase.
3.2, pre-grouping the corpus twice: since k-means clustering is easily affected by the distribution of the data set, in order to ensure that the core concepts, namely the main classification objects of the target field, become cluster centers, k-means clustering cannot be used directly; the corpus is therefore pre-grouped twice, with the following steps:
3.2.1, grouping the corpus for the first time, with the following steps:
3.2.1.1, extracting the first-layer sub-classification labels under the previously obtained target field label, the target field label forming the core entity, and generating the first-layer sub-classification label set Tag containing n sub-classification labels in total; each label has a corresponding entity and word vector, and these entities are connected with the core entity to form n triples.
3.2.1.2, taking the first-layer sub-classification label objects as centroids, calculating the Euclidean distance from each data point in the corpus data set to each centroid, and then assigning each data point to the class of its nearest centroid. This yields n clusters with the first-layer sub-classification labels as centroids, i.e. n grouped data sets, while the corpus is likewise divided into n corpus sets.
Wherein, the Euclidean distance (Euclidean Distance) in step 3.2.1.2 is the main basis for judging the category of a data point. Given samples x_i = (x_i1, x_i2, …, x_in) and x_j = (x_j1, x_j2, …, x_jn), wherein i, j = 1, 2, …, m, m is the number of samples and n is the number of features, the Euclidean distance is calculated by:
d(x_i, x_j) = sqrt( (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_in − x_jn)² )
For example, the entity data set of the constructed Athletic sports (athletics sports) corpus is first pre-grouped: the first-layer sub-classification labels of the previously crawled Athletic sports (athletics sports) Wikipedia corpus labels are extracted to form the label set Tag = { "Association football", "Baseball", "Badminton", … }, containing 55 sub-classification labels in total; each label has a corresponding entity and a Word2vec-trained word vector, and these entities are connected with the core entity "Athletic sports" to form 55 triples. Taking the label objects as centroids, the Euclidean distance from each data point in the data set to each centroid is calculated, and the data points are then assigned to the class of the nearest centroid. At this point, 55 clusters with the event categories as centroids, i.e. 55 grouped data sets, are obtained, and the corpus is likewise divided into 55 corpus sets.
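The centroid assignment of step 3.2.1.2 can be sketched as follows; the 2-D vectors stand in for the 300-dimensional Word2vec vectors, and all names are hypothetical examples:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_centroids(points, centroids):
    """Assign each data point to the class of its nearest centroid."""
    groups = {name: [] for name in centroids}
    for label, vec in points.items():
        nearest = min(centroids, key=lambda c: euclidean(vec, centroids[c]))
        groups[nearest].append(label)
    return groups

# Hypothetical word vectors (2-D for readability).
centroids = {"Baseball": (0.0, 0.0), "Badminton": (10.0, 10.0)}
points = {"pitcher": (1.0, 0.5), "shuttlecock": (9.0, 9.5), "racket": (8.0, 10.0)}
print(assign_to_centroids(points, centroids))
```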
3.2.2, combining TF-IDF algorithm, grouping the corpus secondarily. The method comprises the following steps:
3.2.2.1, searching out the keywords in each corpus set by calculating TF-IDF.
Wherein the TF-IDF algorithm in step 3.2.2 is a numerical statistical method for evaluating the importance of a word to a given document. The term frequency TF (Term Frequency) refers to the frequency with which a given word appears in a given document, and is calculated by the formula:
TF_x,y = n_x,y / Σ_k n_k,y
wherein n_x,y refers to the number of times the term x appears in document y, and Σ_k n_k,y refers to the total number of words in document y. The inverse document frequency IDF (Inverse Document Frequency) is used to evaluate the amount of information a word or term provides, i.e. whether the term is common across the whole document set, and is calculated by the formula:
IDF_x = log( N / N_x )
wherein N refers to the total number of documents and N_x refers to the number of documents in which the term x appears; here the text of each entry serves as a document. Finally, multiplying the TF and IDF values gives the TF-IDF formula:
TF-IDF_x,y = TF_x,y × IDF_x
3.2.2.2, manually screening the keywords of each corpus set, removing keywords with low correlation to the core entity of the current corpus set, and retaining the portion of keywords with the highest correlation; the number of keywords retained is related to the overall quality of all the extracted keywords.
3.2.2.3, constructing triples between the entities corresponding to the selected keywords extracted from each corpus set and the core entity of the current corpus set. Then, within each corpus set, the keywords are taken as centroids, the Euclidean distance from the data points in the set to each centroid is calculated, and the data points are classified. The original corpus has now been divided into a number of small corpus sets.
For example, the keywords in each Athletic sports (athletics sports) corpus set are found through TF-IDF calculation; in the corpus set corresponding to Association football, keywords such as "text", "place", "match" and "team" are found, but some words, such as "list", "finals" and "body", are frequent yet have little relevance. Therefore, manual screening is needed for the keywords of each corpus set: keywords with low correlation to the core entity of the current corpus set are removed, and the portion of keywords with the highest correlation is retained. Triples are constructed between the entities corresponding to the screened keywords extracted from each small corpus set and the core entity of the current corpus set. Then, within each corpus set, the keywords are taken as centroids, the Euclidean distance from the data points in the set to each centroid is calculated, the data points are classified, and the corpus is divided into a number of small corpus sets.
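The keyword search of step 3.2.2.1 can be sketched directly from the TF and IDF formulas above; the toy documents and terms are illustrative assumptions:

```python
import math

def tf_idf(docs):
    """Compute TF-IDF for every term in every document.
    docs: {doc_name: [tokens]} -- each entry text is one document."""
    N = len(docs)
    # Document frequency N_x: number of documents containing term x.
    df = {}
    for tokens in docs.values():
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    scores = {}
    for name, tokens in docs.items():
        total = len(tokens)                 # sum_k n_k,y
        for t in set(tokens):
            tf = tokens.count(t) / total    # n_x,y / sum_k n_k,y
            idf = math.log(N / df[t])       # log(N / N_x)
            scores[(t, name)] = tf * idf
    return scores

docs = {"football": ["match", "team", "team", "goal"],
        "baseball": ["match", "pitcher", "team", "bat"]}
s = tf_idf(docs)
print(s[("goal", "football")] > s[("match", "football")])
```

A term that occurs in every document (like "match" here) gets IDF 0, so only terms distinctive to a corpus set surface as keywords.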
3.3, automatically searching a clustering center for the small corpus set through a k-means clustering algorithm, clustering, and constructing triples at the same time, wherein the method comprises the following steps:
3.3.1, determining the size of k according to the size of the small corpus set, wherein the larger the set is, the larger the k value is.
And 3.3.2, constructing a triplet by an entity corresponding to the centroid obtained through k-means clustering calculation and an entity corresponding to the centroid in the last layer of grouping.
The k-means algorithm in step 3.3.2 is an unsupervised clustering algorithm; each word is represented by the word vector trained from the corpus by Word2vec. Each small corpus set is taken as a data set, and clustering calculation is carried out using the k-means clustering algorithm. The steps of k-means clustering are as follows:
3.3.2.1, selecting k objects in the data set as initial centers, wherein each object represents a clustering center;
3.3.2.2, objects in the word vector sample are classified into classes corresponding to the cluster centers closest to the objects according to Euclidean distance between the objects and the cluster centers;
3.3.2.3, update cluster center: taking the average value corresponding to all objects in each category as a clustering center of the category, and calculating the value of an objective function;
3.3.2.4, judging whether the values of the clustering center and the objective function are changed, if not, outputting a result, and if so, returning to 3.3.2.2.
3.3.3, calling the k-means algorithm again by taking the new packet as a data set, and repeating the steps 3.3.1-3.3.3 until each packet only contains the element number smaller than a certain threshold value Z.
And 3.3.4, constructing a triplet by the entity corresponding to the data point in each group and the entity corresponding to the current centroid.
All entities in the corpus are now associated with other entities, and the triples they form combine to constitute the knowledge graph. Because the cluster centers and clusters found by automatic clustering may produce weakly correlated entity relations, manual checking and screening are needed after the knowledge graph construction is completed, removing entity relations with low correlation so as to improve the quality of the knowledge graph.
For example, the original Athletic sports (athletics sports) corpus has at this point been divided into a number of small corpus sets; cluster centers are then automatically found through the k-means clustering algorithm, clustering is performed, and triples are constructed at the same time. The designated size of k is determined by the size of the corpus set: the larger the set, the larger the k value. Finally, the entity corresponding to each calculated centroid and the entity corresponding to the centroid of the previous-layer grouping are used to construct a triplet. The k-means algorithm is then called again with the new groups as data sets, repeating the above operations until each group contains fewer than 10 elements (here the threshold Z=10). Finally, the entity corresponding to each data point in each group and the entity corresponding to the current centroid are used to construct a triplet. At this point, all entities in the Athletic sports (athletics sports) corpus are associated with other entities, and the triples they form combine to constitute the knowledge graph. However, the centroids and clusters found by automatic clustering may produce weakly correlated entity associations, so manual checking and screening are finally needed to remove entity associations with extremely low correlation.
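Steps 3.3.2.1–3.3.2.4 can be sketched as a plain k-means loop; 1-D values stand in for the word vectors, and the naive initialisation is an assumption (the patent does not specify one):

```python
def kmeans(points, k, iters=100):
    """Plain k-means: pick k initial centers, assign each point to its
    nearest center, recompute centers as cluster means, and stop once
    the centers no longer change (steps 3.3.2.1-3.3.2.4)."""
    centers = points[:k]                           # naive initialisation
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                 # unchanged -> converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], 2)
print(centers, clusters)
```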
Fourth, referring to fig. 2, a visual model tree (VT) is constructed: the various visualization graphics are classified, the attributes and structural characteristics of each category are summarized, and the graphic information is formally expressed by creating the visual model tree (VT). The steps are as follows:
4.1, defining VT to comprise basic attributes (BASICATTRIBUTE) and a visual structure (DVSCHEMA); the formal definition is as (1), wherein the basic attributes store general information such as graphic title, subtitle and other text styles;
(1)、VT::=<BASICATTRIBUTE><DVSCHEMA>
4.2, BASICATTRIBUTE comprises three attributes: title (title), storing the title of the finally generated visual graphic; subtitle (subtitle), storing the subtitle of the finally generated visual graphic; and attributes (attributes), storing parameters such as the position, color scheme and font size of the finally generated visual graphic;
(2)、BASICATTRIBUTE::=<title><subtitle><attributes>
4.3, DVSCHEMA categorizes common visualization graphics into four basic categories according to the data types, graphic data structures and graphic dimensions required by the graphics: general graphics (General), topology (Topology), map (Map) and text graphics (Text); the formal definition is as (3);
(3)、DVSCHEMA::=<General><Topology><Map><Text>
4.4, the four basic categories in step 4.3 each comprise two attributes: graphics type (VType) and graphics structure (StructModel); VType stores the graphic types belonging to the category, and StructModel stores the basic visual structures of the graphics belonging to the category; the formal definition is as (4), wherein A::B denotes that A contains attribute B;
(4)、DVSCHEMA::=<General><Topology><Map><Text>::<VType><StructModel>
In step 4.4, the graphics covered by the VType attribute of each of the four basic categories are as follows:
4.4.1, general includes bar graph (barChart), line graph (LineChart), pie graph (PieChart), radar graph (RadarChart), scatter graph (ScaterChart);
4.4.2, Topology includes network graph (NetChart), tree map (TreeMap), area tree map (TreeMapChart);
4.4.3, maps include regional Map (AreaMapChart), national Map (CountryMapChart), world Map (WorldMapChart);
4.4.4, Text includes word cloud (WordCloudChart);
4.5, the four basic categories in step 4.4 each have their own mapping relation (Mapping), describing the data structure, data dimensions, graphic structure relations and data mapping position information of each kind of graphic; according to the Mapping information and the data structure of the graphics, the basic visual structure StructModel of each kind of graphic can be abstracted.
In step 4.5, the mapping relation Mapping and the basic visual structure StructModel of each kind of graphic are defined as follows:
4.5.1, graphics in the General type are commonly used to represent two-dimensional or three-dimensional data; the information can be represented by a binary group (XAxis, YAxis) or a triplet (XAxis, YAxis, ZAxis). The Mapping structure of such graphics is as (5), wherein LegendName represents the legend name and each group of information is stored as an ARRAY type. According to the Mapping structure, the structure of the basic StructModel can be abstracted as (6): the root of the StructModel is the temporary root node Root, and Root comprises two child nodes: the key-value pair K_V and the legend node LegendNode;
(5)、Mapping::=<XAxis,YAxis,[ZAxis]><LegendName>
(6)、StructModel::=<Root::<K_V><LegendNode>>
4.5.2, graphics in the Topology type are typically used to represent topological relation data. The tree map and the area tree map can represent attribute structures with nested key-value pairs { key: value, children: { key: value } }; the Mapping structure is as (7). The network graph can represent the graph structure with a node set (Nodes) and an edge set (Links); the Mapping structure is as (8), wherein source represents the starting node of an edge link and target represents the node the edge link points to. According to the Mapping structures, the structure of the basic StructModel can be abstracted as (9). The StructModel has two substructures, Root1 and Root2 being the temporary root nodes of the two substructures respectively. Root1 comprises two child nodes: the key-value pair K_V and the child node children, whose substructure is the key-value pair K_V. Root2 comprises two child nodes: the node set Nodes and the edge set Links, wherein the child nodes of the node set are the keyword key and the value value (value may be null), and the child nodes of the edge set are the starting point source and the target target;
(7)、Mapping::=<K_V><children::<K_V>>
(8)、Mapping::=<Nodes::<key,[value]><Links::<source><target>>
(9)、StructModel::=<Root1::<K_V><children::<K_V>>><Root2::<Nodes::<key,[value]>,<Links::<source><target>>>
4.5.3, graphics in the Map type are typically used to represent map information, with key-value pair arrays [ { PlaceName: value } ] or triplet sets [ { lng, lat, value } ]. The Mapping structure of such graphics is as (10), wherein PlaceName represents a place name, lng represents longitude and lat represents latitude. According to the Mapping structure, the structure of the basic StructModel can be abstracted as (11): the StructModel has two substructures, Root1 and Root2 being the temporary root nodes of the two substructures respectively; Root1 comprises the child node key-value pair K_V, and Root2 comprises three child nodes: longitude lng, latitude lat and value value;
(10)、Mapping::=<Data1::<PlaceName><value>><Data2::<lng><lat><value>>
(11)、StructModel::=<Root1::<K_V>>,<Root2::<lng>,<lat>,<value>>
4.5.4, graphics in the Text type commonly use a binary group (Keyword, frequency) to represent keyword frequency; the Mapping structure of such graphics is as (12), wherein Keyword is a word extracted from the text and frequency represents the number of times the word occurs in the text. According to the Mapping structure, the structure of the basic StructModel can be abstracted as (13): the root of the StructModel is the temporary root node Root, and Root comprises the key-value pair K_V;
(12)、Mapping::=<Keyword><frequency>
(13)、StructModel::=<Root::<K_V>>
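The formal definitions (1)–(13) can be held in code as ordinary nested structures. A sketch of the VT skeleton as a Python dictionary follows; the dictionary representation itself is an assumption, and the names mirror the definitions above:

```python
# Visual model tree VT sketched as nested dicts, mirroring definitions
# (1)-(4) and the StructModel patterns (6), (9), (11) and (13).
VT = {
    "BASICATTRIBUTE": {"title": "", "subtitle": "", "attributes": {}},
    "DVSCHEMA": {
        "General": {
            "VType": ["BarChart", "LineChart", "PieChart", "RadarChart", "ScatterChart"],
            "StructModel": {"Root": {"K_V": None, "LegendNode": None}},      # (6)
        },
        "Topology": {
            "VType": ["NetChart", "TreeMap", "TreeMapChart"],
            "StructModel": {                                                 # (9)
                "Root1": {"K_V": None, "children": {"K_V": None}},
                "Root2": {"Nodes": {"key": None, "value": None},
                          "Links": {"source": None, "target": None}},
            },
        },
        "Map": {
            "VType": ["AreaMapChart", "CountryMapChart", "WorldMapChart"],
            "StructModel": {"Root1": {"K_V": None},                          # (11)
                            "Root2": {"lng": None, "lat": None, "value": None}},
        },
        "Text": {
            "VType": ["WordCloudChart"],
            "StructModel": {"Root": {"K_V": None}},                          # (13)
        },
    },
}

print(sorted(VT["DVSCHEMA"].keys()))
```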
Fifthly, a data visualization optimization matching method based on the network corpus knowledge graph comprises the following steps: defining M-JSON as the prototype structure of the JSON returned by REST Web services; matching the Web data prototype structure M-JSON against each StructModel in the visual model tree VT according to the data structure, the returned result being the set of attribute combinations of candidate coordinate axes/legends meeting the conditions; on the basis of structure matching, using the knowledge graph constructed in the third step to query whether the matched attribute combinations of candidate coordinate axes/legends have actual semantic association, optimizing the matching according to the query result, and selecting effective dimension combinations so as to improve the accuracy (Precision) of automatic graph generation, the steps being as follows:
5.1, matching the Web data prototype structure M-JSON with the StructModel of the visual model tree VT according to the data structure, and obtaining the attribute combination results of m candidate coordinate axes/legends meeting the conditions in M-JSON, wherein each combination result is expressed as a binary group consisting of a key-value pair L and an attribute name A, L and A corresponding to LegendNode and K_V in step 4.5.1 respectively.
5.2, matching and optimizing the m attribute combinations meeting the conditions by combining the constructed network corpus knowledge graph, the process being as follows:
5.2.1, each matching result in step 5.1 is represented in the form of a binary group: P = (L::name, A::name). Each matching result P_i = (L_i::name, A_i::name) is converted into the triplet form G_i = (L_i::name, R, A_i::name) and put into the set S = {G_1, G_2, …, G_m}.
5.2.2, the three parameters of each G_i in the set S are sequentially mapped onto the triplet structure of the knowledge graph as F(L_i::name → head, R → relation, A_i::name → tail), yielding the triplet (head, relation, tail). Whether the current triplet (head, relation, tail) exists is then matched in the constructed corpus knowledge graph, the result being True or False, expressed as 1 and 0 respectively. First, the head entity node head and the tail entity node tail are matched in the corpus knowledge graph, and then the relation between them is matched. If and only if the head entity head, the tail entity tail and the relation are all successfully matched, result is 1; otherwise, result is 0.
5.2.3, after the query of the objects in the set S is completed, the set Q = {(G_i, result_i)} is returned as the basis for judging the attribute combination matching results of the candidate coordinate axes/legends, i.e. whether the currently qualified binary groups have semantic association. A match is judged successful only if the structure matches and result_i is 1. In this way, the accuracy of data attribute matching is improved, and the generation rate of graphics without practical meaning is reduced.
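The query of steps 5.2.1–5.2.3 amounts to a triple-existence test over the knowledge graph. A minimal sketch, with the graph held as a set of (head, relation, tail) tuples and all entity names being hypothetical examples:

```python
# Hypothetical knowledge-graph triples from the sports corpus.
KG = {("team", "related_to", "score"),
      ("player", "related_to", "score"),
      ("team", "related_to", "founded")}

def has_relation(kg, head, tail):
    """result_i = 1 iff some relation links head to tail in the graph."""
    return int(any(h == head and t == tail for h, _, t in kg))

def optimize_matches(kg, candidates):
    """Keep only candidate (L, A) axis/legend pairs whose names are
    semantically associated in the knowledge graph (step 5.2.3)."""
    S = [(L, "R", A) for L, A in candidates]                   # G_i triples
    Q = [((L, R, A), has_relation(kg, L, A)) for L, R, A in S]
    return [(g[0], g[2]) for g, result in Q if result == 1]

candidates = [("team", "score"), ("team", "color"), ("player", "score")]
print(optimize_matches(KG, candidates))
```

Structure matching alone would keep all three candidate pairs; the knowledge-graph filter discards ("team", "color"), the combination without semantic association.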