CN115982390B - Industrial chain construction and iterative expansion development method - Google Patents
- Publication number
- CN115982390B (application number CN202310260247.6A)
- Authority
- CN
- China
- Prior art keywords
- industrial
- target
- industry
- node
- industrial chain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention provides an industrial chain construction and iterative expansion development method, which comprises the steps of obtaining a target industrial type input by a user and obtaining industrial corpus data corresponding to the target industrial type; designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain industry new words; determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words; the method comprises the steps of designing a data storage structure of a target industrial chain tree aiming at the upstream and downstream logic and the node association relation of the industrial chain, and carrying out iterative updating based on the original industrial chain tree through the data storage structure. The method provided by the invention greatly improves the efficiency of industrial map construction and updating.
Description
Technical Field
The invention belongs to the technical field of data visualization technology and data application.
Background
At present, analyzing an industry requires constructing an industrial chain map of that industry. The construction process usually requires manually consulting a large amount of industrial data, which is cumbersome, and a map built by manually consulting data may also be incomplete.
An industrial chain needs to be sufficiently reusable, iterable and extensible. Industries are dynamic, and as they evolve, new industries continue to emerge. How to mine the new words that appear in an industry, how to acquire the hierarchical relations among industrial words, and how to add these industry changes to the original industrial map data have become a great challenge.
Meanwhile, industrial chains are highly subjective. Different industry standards currently coexist, different websites and institutions classify the same industrial noun into different industries, and different people understand the construction of an industrial chain, the types of its nodes and relations, and its granularity differently; different settings directly lead to different application results. The prior art lacks a universal development method for discovering industrial new words and constructing industrial chains in a personalized way, which hinders engineering development efficiency.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention aims to provide an industrial chain construction and iterative expansion development method that addresses the limitations of existing industrial maps, namely the low accuracy of manual data analysis and the cumbersome construction and expansion.
To achieve the above objective, an embodiment of a first aspect of the present invention provides an industrial chain construction and iterative expansion development method, including:
acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logic of the industrial chain and the association relation of the nodes, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure.
In addition, an industrial chain construction and iterative expansion development method according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, after obtaining the industrial corpus data corresponding to the target industrial type, the method further includes:
performing unified preprocessing on the industrial corpus data, including splitting the industrial corpus data at the boundaries between Chinese characters and non-Chinese characters so as to remove non-Chinese text and encoding symbols.
Further, in one embodiment of the present invention, designing the industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data includes:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store candidate words;
querying the Trie, acquiring the frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and the left and right information entropy of the fragments that make up the candidate words;
querying the Trie, acquiring the word frequencies of the candidate words and of their left and right fragments, and calculating the pointwise mutual information according to the word frequencies;
calculating the score of the candidate words according to a formula, and filtering the candidate words with lower scores by setting a threshold value for the score to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
$$score = PMI - \min(h_{r\_l}, h_{l\_r}) + \min(h_l, h_r)$$
wherein $PMI$ represents the pointwise mutual information, $\min(h_{r\_l}, h_{l\_r})$ represents the left and right information entropy of the fragments that make up the candidate word, and $\min(h_l, h_r)$ represents the left and right information entropy of the candidate word.
Further, in an embodiment of the present invention, the determining the relationship between the industry new words according to the context relationship and the parallel semantic relationship extraction method, and constructing a target industry chain tree according to the industry new words and the relationship between the industry new words includes:
performing depth expansion and width expansion of the target industrial chain tree by using an upper-lower relation and parallel semantic relation extraction method; the width expansion of the target industrial chain tree is performed through a width expansion algorithm, and the depth expansion of the target industrial chain tree is performed through a depth expansion algorithm.
Further, in an embodiment of the present invention, the expanding the width of the target industrial chain tree by the width expanding algorithm includes:
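using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
$$f_{e,ty} = \log(1 + C_{e,ty})\left[\log|V| - \log\Big(\sum_{e'} C_{e',ty}\Big)\right]$$
wherein $e$ represents an entity, $ty$ represents an entity type, $C_{e,ty}$ is the returned confidence score, and $|V|$ is the total number of candidate entities;
denoting the sibling similarity of two entities $e_1$ and $e_2$ as $sim_{sib}(e_1, e_2)$, and calculating the similarity of the two sibling entities using the matching pattern features:
$$sim^{sk}_{sib}(e_1, e_2 \mid SK) = \frac{\sum_{sk \in SK} \min(f_{e_1,sk}, f_{e_2,sk})}{\sum_{sk \in SK} \max(f_{e_1,sk}, f_{e_2,sk})}$$
wherein $sk$ represents a skip pattern, $SK$ represents a set of skip patterns, and $f_{e,sk}$ is the association weight between entity $e$ and skip pattern $sk$;
calculating $sim^{tp}_{sib}(e_1, e_2)$ in the same way from the entity-type features, wherein all of the acquired type features are used;
acquiring the embedded features of the two entities through word2vec to calculate $sim^{emb}_{sib}(e_1, e_2)$, and calculating the sibling similarity using a multiplication metric:
$$sim_{sib}(e_1, e_2 \mid SK) = \sqrt{\big(1 + sim^{sk}_{sib}(e_1, e_2 \mid SK)\big) \cdot sim^{emb}_{sib}(e_1, e_2)} \cdot \sqrt{1 + sim^{tp}_{sib}(e_1, e_2)}$$
calculating the score of the entity according to the sibling similarity:
$$score(e \mid S, SK_t) = \frac{1}{|S|} \sum_{e' \in S} sim_{sib}(e, e' \mid SK_t)$$
wherein $S$ is the set of seed entities and $SK_t$ is a selected subset of matching pattern features;
and screening the entities according to the scores, so as to expand the width of the target industrial chain tree.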
Further, in an embodiment of the present invention, the performing the depth expansion of the target industrial chain tree by the depth expansion includes:
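using $v(x)$ to represent the embedding vector of item $x$, and given a target parent node $e_t$ and a set of reference edges $E = \{\langle e_p, e_c \rangle\}$, wherein $e_p$ is the parent node of $e_c$, calculating the score of placing node $e_x$ under parent node $e_t$:
$$sim_{par}(\langle e_t, e_x \rangle \mid E) = \cos\Big(v(e_t) - v(e_x),\ \frac{1}{|E|} \sum_{\langle e_p, e_c \rangle \in E} \big(v(e_p) - v(e_c)\big)\Big)$$
wherein $\cos(v(x), v(y))$ represents the cosine similarity between the vectors $v(x)$ and $v(y)$;
based on $sim_{par}$, scoring each candidate entity $e_i$ and selecting the entities with a score above a threshold as the initial child nodes under node $e_t$, thereby performing depth expansion of the target industrial chain tree.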
Further, in one embodiment of the present invention, the designing the data storage structure of the target industry chain tree for the industry chain upstream and downstream logic and node association relation includes:
designing a parent_id field, and storing a unique identifier of a parent node;
designing a full-path field, and storing in it the ancestor nodes of the current node at all levels as a character string concatenated in the form id#id#id….
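The storage structure described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `ChainNode`, the helper functions, the in-memory dictionary and the sample node names are all assumptions, and the full path is assumed to list ancestor ids from the root downwards, joined with `#`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChainNode:
    node_id: str
    name: str
    parent_id: Optional[str]  # unique identifier of the parent (None for the root)
    full_path: str            # ancestor ids joined with '#', root first


def add_node(store: Dict[str, ChainNode], node_id: str, name: str,
             parent_id: Optional[str] = None) -> ChainNode:
    """Insert a node and derive its full path from the parent's full path."""
    if parent_id is None:
        full_path = ""
    else:
        parent = store[parent_id]
        full_path = (f"{parent.full_path}#{parent.node_id}"
                     if parent.full_path else parent.node_id)
    node = ChainNode(node_id, name, parent_id, full_path)
    store[node_id] = node
    return node


def ancestors(node: ChainNode) -> List[str]:
    """Ancestor ids recovered from the full path without walking the tree."""
    return node.full_path.split("#") if node.full_path else []


def descendants(store: Dict[str, ChainNode], node_id: str) -> List[str]:
    """All nodes whose full path contains the given id."""
    return [n.node_id for n in store.values() if node_id in ancestors(n)]


store: Dict[str, ChainNode] = {}
add_node(store, "1", "artificial intelligence")
add_node(store, "2", "upstream support", "1")
add_node(store, "3", "underlying hardware", "2")
```

Storing the full path alongside `parent_id` lets a whole subtree be selected with a single scan or prefix match, which is convenient when a patch adds or moves nodes during iterative updating.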
To achieve the above object, a second aspect of the present invention provides an apparatus for industrial chain construction and iterative expansion development, comprising:
The acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
the construction module is used for determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
and the updating module is used for carrying out iterative updating based on the original industrial chain tree through the data storage structure by designing the data storage structure of the target industrial chain tree aiming at the industrial chain upstream and downstream logic and the node association relation.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the industrial chain construction and iterative expansion development method described above when executing the computer program.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements an industrial chain construction and iterative expansion development method as described above.
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers core services such as rapid construction of industrial maps, discovery of industrial new words, extraction of industrial hierarchical relations, and iterative updating. It addresses the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industrial maps, and allows a user to conveniently and quickly generate the industrial map for a given category according to the user's needs, thereby balancing automated processing against manual intervention and improving the extensibility and development efficiency of applications.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic flow chart of an industrial chain construction and iterative expansion development method according to an embodiment of the present invention.
Fig. 2 is a flow chart of an industrial new word discovery method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of left and right information entropy of a candidate word and left and right information entropy of a candidate word constituent segment provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of a hierarchical tree expansion algorithm according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of patch data generation of a new industrial chain according to an embodiment of the present invention.
Fig. 6-9 are schematic diagrams of an industrial map importing implementation process according to an embodiment of the present invention.
Fig. 10 is a schematic flow chart of an industrial chain construction and iterative expansion development device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes an industrial chain construction and iterative expansion development method of an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an industrial chain construction and iterative expansion development method according to an embodiment of the present invention.
As shown in fig. 1, the industrial chain construction and iterative expansion development method comprises the following steps:
S101: acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
S102: designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
S103: determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
S104: the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logic of the industrial chain and the association relation of the nodes, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure.
The invention adopts an unsupervised method. First, a statistical strategy is used to extract all text fragments that may form words from a large-scale corpus; segmenting the corpus into such fragments amounts to a rough, shallow word segmentation. Then, linguistic knowledge is used to remove useless fragments that are not new words, the degree of association is calculated, the words and word combinations with the highest association are found, and the text fragments are cleaned and filtered once. Finally, all extracted words are compared with the existing lexicon, and the text fragments not covered by the lexicon are taken as the new-word lexicon. Fig. 2 is a flowchart of the industrial new word discovery method.
After the industrial corpus is imported into the system, the data needs unified preprocessing. The industrial corpus often contains not only Chinese characters but also a large number of Arabic numerals, upper- and lower-case English letters, and special punctuation such as ellipses, which hinders the subsequent recognition of industrial new words. Taking industrial research reports as an example, they use many numerical values to enhance credibility and persuasiveness. Given that the maximum length of an industrial noun fragment is set to 8 characters, numerals and letters easily combine into many 8-character fragments that often have high adjacent entropy and mutual information. If left unprocessed, these fragments, which have no value for constructing an industrial chain map, would become entries in the industrial new word list.
Further, in an embodiment of the present invention, after obtaining the industrial corpus data corresponding to the target industrial type, the method further includes:
performing unified preprocessing on the industrial corpus data, including splitting the industrial corpus data at the boundaries between Chinese characters and non-Chinese characters so as to remove non-Chinese text and encoding symbols.
After splitting, each original long sentence becomes several short sentences, and the subsequent new word recognition is carried out on these short sentences.
Further, in one embodiment of the present invention, designing the industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data includes:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store candidate words;
querying the Trie, acquiring the frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and the left and right information entropy of the fragments that make up the candidate words;
querying the Trie, acquiring the word frequencies of the candidate words and of their left and right fragments, and calculating the pointwise mutual information according to the word frequencies;
calculating the score of the candidate words according to a formula, and filtering the candidate words with lower scores by setting a threshold value for the score to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
$$score = PMI - \min(h_{r\_l}, h_{l\_r}) + \min(h_l, h_r)$$
wherein $PMI$ represents the pointwise mutual information, $\min(h_{r\_l}, h_{l\_r})$ represents the left and right information entropy of the fragments that make up the candidate word, and $\min(h_l, h_r)$ represents the left and right information entropy of the candidate word.
Specifically, the corpus is divided into a set of single characters, and the characters are combined pairwise to form candidate words. Since a prefix and a suffix are required to calculate information entropy, fragments of length 3 need to be stored. Because prefix lookups and word-frequency statistics are needed later, the invention uses Trie trees to store the data. A prefix Trie and a suffix Trie are constructed from the 3-gram sequences; each Trie takes single characters as nodes, and each node records the frequency of the string formed from the root node to the current node.
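A minimal sketch of the prefix and suffix Tries follows. It assumes that every n-gram of length 1 to 3 is inserted, so that each node holds the exact frequency of its string, and that the suffix Trie stores reversed n-grams; the class and function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # frequency of the string spelled from the root to here


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, chars: str) -> None:
        node = self.root
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _find(self, chars: str) -> Optional[TrieNode]:
        node = self.root
        for ch in chars:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, chars: str) -> int:
        node = self._find(chars)
        return node.count if node else 0

    def next_char_counts(self, chars: str) -> Dict[str, int]:
        """Frequency list of the characters that follow `chars`."""
        node = self._find(chars)
        return {ch: c.count for ch, c in node.children.items()} if node else {}


def build_tries(segments: Iterable[str], max_len: int = 3) -> Tuple[Trie, Trie]:
    prefix, suffix = Trie(), Trie()
    for seg in segments:
        for i in range(len(seg)):
            for n in range(1, max_len + 1):
                if i + n <= len(seg):
                    gram = seg[i:i + n]
                    prefix.add(gram)        # right neighbours via children
                    suffix.add(gram[::-1])  # left neighbours via children
    return prefix, suffix


prefix_trie, suffix_trie = build_tries(["abcd", "abce"])
```

Querying `prefix_trie.next_char_counts("bc")` gives the right-neighbour frequency list of "bc", and `suffix_trie.next_char_counts("cb")` gives its left-neighbour list.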
The Trie is queried to obtain the frequency lists of prefixes and suffixes, and the left and right information entropy of the candidate words and of the fragments that make up the candidate words are calculated. Because many information entropies are involved, each is labeled as follows: Candidate is the candidate word, left is the fragment on its left, and right is the fragment on its right; h_l_l and h_l_r are the left and right information entropy of the left fragment, h_r_l and h_r_r are the left and right information entropy of the right fragment, and h_l and h_r are the left and right information entropy of the candidate word, as shown in fig. 3.
The Trie is queried to obtain the word frequency of the candidate word and of its left and right fragments. With the word frequencies available, the actual co-occurrence probability P(a, b) and the individual occurrence probabilities P(a) and P(b) are easily obtained, from which the mutual information and the internal solidification degree are calculated. The word-formation criteria used in the invention comprise two parts: the internal solidification degree and the degree of freedom. The internal solidification degree measures how often the word occurs and how far its parts form a meaningful collocation; the higher it is, the more likely the text fragment is a word. The degree of freedom measures the richness of the contexts to the left and right of the word; the higher it is, the more likely the text fragment is a word.
The internal solidification degree measures whether a word collocation is reasonable and is calculated with pointwise mutual information (PMI), an index from computational linguistics. If the PMI is high, that is, if the frequency with which two words co-occur is far greater than the product of their individual probabilities under free combination, the collocation of the two words is more reasonable. PMI is calculated as follows:
$$PMI(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$
where $P(a)$, $P(b)$ and $P(a, b)$ denote the occurrence probabilities of a, b and the combination ab in the corpus, respectively.
For a word consisting of more than two characters, the fragment is split into two sub-fragments at each character position in turn, the mutual information of every split is calculated, and the minimum over all splits is taken as the internal solidification degree:
$$\mathrm{Solid}(c_1 c_2 \cdots c_m) = \min_{1 \le i < m} \log \frac{p(c_1 \cdots c_m)}{p(c_1 \cdots c_i)\,p(c_{i+1} \cdots c_m)}$$
where $c_1 c_2 \cdots c_m$ is a character string of length $m$ and $p(\cdot)$ denotes the frequency of occurrence of a string.
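The internal solidification calculation can be sketched as follows, assuming natural logarithms and a caller-supplied probability function; the names `pmi`, `internal_cohesion`, `prob` and the toy counts are hypothetical.

```python
import math
from typing import Callable


def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Pointwise mutual information of a two-part split."""
    return math.log(p_ab / (p_a * p_b))


def internal_cohesion(word: str, prob: Callable[[str], float]) -> float:
    """Minimum PMI over every way of splitting `word` into two sub-fragments."""
    return min(
        pmi(prob(word), prob(word[:i]), prob(word[i:]))
        for i in range(1, len(word))
    )


# Toy frequencies over a corpus of 16 tokens (illustrative values only).
counts = {"a": 4, "b": 4, "c": 8, "ab": 4, "bc": 2, "abc": 2}
total = 16


def prob(s: str) -> float:
    return counts[s] / total


cohesion = internal_cohesion("abc", prob)
```

Taking the minimum means a candidate is only as cohesive as its weakest split: here "ab|c" has PMI 0 even though "a|bc" scores log 4, so "abc" is not considered cohesive.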
The Trie is queried to obtain the left and right adjacent characters of the sub-fragments, and the left and right adjacent entropy of the candidate word is calculated. Looking only at the internal solidification degree of a text fragment is not sufficient; its external behavior must also be considered. Assume that the set of left adjacent characters of a word fragment $w$ is $S_l$ and the set of right adjacent characters is $S_r$. The left and right adjacent entropies are calculated respectively as follows:
$$h_l(w) = -\sum_{a \in S_l} P(aw \mid w)\,\log P(aw \mid w), \qquad h_r(w) = -\sum_{b \in S_r} P(wb \mid w)\,\log P(wb \mid w)$$
The boundary degree of freedom of a candidate word in the invention takes the adjacent entropy on both sides into account: only a word with a high degree of freedom on both the left and the right is regarded as a reasonable word. Therefore, the smaller of the left and right adjacent entropies is used as the adjacent entropy value when scoring a candidate word. It measures the richness of the left and right neighbors of a word: the larger the entropy, the richer the neighbors. The degree of freedom is calculated as follows:
$$\mathrm{Free}(w) = \min\big(h_l(w), h_r(w)\big)$$
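A minimal sketch of the adjacent entropy and degree of freedom is given below. It assumes natural logarithms and takes neighbour frequency lists (for example, those returned by a Trie query) as plain dictionaries; the function names are hypothetical.

```python
import math
from typing import Mapping


def neighbor_entropy(neighbor_counts: Mapping[str, int]) -> float:
    """Shannon entropy of the adjacent-character distribution of a fragment."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in neighbor_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


def freedom(left_counts: Mapping[str, int], right_counts: Mapping[str, int]) -> float:
    """Degree of freedom: the smaller of the left and right adjacent entropies."""
    return min(neighbor_entropy(left_counts), neighbor_entropy(right_counts))
```

For instance, a fragment that is always preceded by the same character has left entropy 0, so its degree of freedom is 0 regardless of how varied its right neighbours are.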
Based on the word-formation characteristics of new words, in practical application the invention calculates a score for each candidate word, indicating how likely it is to be a new word in the current context. The score is calculated as follows:
$$score = PMI - \min(h_{r\_l}, h_{l\_r}) + \min(h_l, h_r)$$
The score consists of three parts:
1) Pointwise mutual information: the higher the pointwise mutual information, the higher the internal cohesion.
2) The minimum of the information entropies $h_{r\_l}$ and $h_{l\_r}$ of the two word fragments, $\min(h_{r\_l}, h_{l\_r})$: the larger this value, the less likely it is that the two fragments appear together.
3) The minimum of the left and right information entropy of the word, $\min(h_l, h_r)$: the larger this value, the more contexts the candidate word appears in and the more likely it is to be a word.
Thus, a higher score indicates a greater likelihood of word formation. Candidate words with lower scores are filtered out of the candidate word set by setting a threshold on the score, finally yielding the candidate word set of the target field.
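The scoring and threshold filtering can be sketched as follows. The way the three parts are combined here (PMI, minus the smaller fragment entropy, plus the smaller word entropy) is an assumption inferred from the description of the three parts, and the statistics, threshold and names are illustrative only.

```python
from typing import Dict, List, NamedTuple


class CandidateStats(NamedTuple):
    pmi: float    # pointwise mutual information of the candidate
    h_l: float    # left entropy of the candidate word
    h_r: float    # right entropy of the candidate word
    h_l_r: float  # right entropy of the left fragment
    h_r_l: float  # left entropy of the right fragment


def score(stats: CandidateStats) -> float:
    # Assumed combination: reward cohesion and free boundaries, penalise
    # fragments that themselves combine freely with other neighbours.
    return stats.pmi - min(stats.h_r_l, stats.h_l_r) + min(stats.h_l, stats.h_r)


def filter_candidates(stats: Dict[str, CandidateStats], threshold: float) -> List[str]:
    """Keep the candidates whose score reaches the threshold."""
    return [word for word, s in stats.items() if score(s) >= threshold]


toy_stats = {
    "芯片": CandidateStats(pmi=5.0, h_l=2.0, h_r=1.5, h_l_r=0.5, h_r_l=0.25),
    "的芯": CandidateStats(pmi=1.0, h_l=0.5, h_r=0.0, h_l_r=2.0, h_r_l=3.0),
}
kept = filter_candidates(toy_stats, threshold=3.0)
```

The threshold is a tuning parameter: raising it shortens the list that goes to manual review at the cost of missing rarer terms.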
The candidate word set also contains some common words that should not be treated as new words of the target domain. For this reason, a Chinese stop word list is downloaded from Baidu; stop words are common Chinese words, and any word in the candidate word set that also appears in the stop word list is removed from the candidate word set. Meanwhile, the words in the candidate word set are not necessarily new relative to the source domain, so words that occur in the source-domain corpus also need to be filtered out.
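This filtering step amounts to a set difference, sketched below with hypothetical names and sample words; how the stop word list and the source-domain vocabulary are loaded is left out.

```python
from typing import Iterable, List


def remove_known_words(candidates: Iterable[str],
                       stop_words: Iterable[str],
                       source_domain_words: Iterable[str]) -> List[str]:
    """Drop candidates that are stop words or already occur in the source domain."""
    known = set(stop_words) | set(source_domain_words)
    return [word for word in candidates if word not in known]


new_words = remove_known_words(
    candidates=["因此", "算力", "光刻胶"],
    stop_words=["因此", "但是"],
    source_domain_words=["算力"],
)
```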
The resulting industrial new word list still contains many garbage strings and wrongly segmented strings, most of which resemble common collocations or word-internal fragments, and the algorithm alone cannot filter out all unreasonable candidate words. Manual review is therefore also required, and the method supports users in adding, deleting, viewing and exporting candidate words at any time. The layer-by-layer screening of the new word discovery algorithm already yields higher-quality results and greatly reduces the manual workload. The manually reviewed candidate words are stored as new words of the industrial field, so that the industrial chain can subsequently be constructed and iteratively updated on the basis of these new words.
Based on the above steps, a new vocabulary of the target domain can be obtained.
After the industrial new words are extracted, their hierarchical positions in the industrial chain are determined according to their meanings and characteristics: entity pairs with an upper-lower relation are searched for in the corpus, the hierarchical structure of the industrial chain is built, and the new words are added to the industrial chain. An industrial map usually focuses on the upstream-downstream relations of an industry. For this purpose, the invention uses a hierarchical tree structure to build the network of industrial relations, and performs depth expansion and width expansion of the hierarchical tree by means of the upper-lower relation and parallel semantic relation extraction methods.
Further, in an embodiment of the present invention, the determining the relationship between the industry new words according to the context relationship and the parallel semantic relationship extraction method, and constructing a target industry chain tree according to the industry new words and the relationship between the industry new words includes:
performing depth expansion and width expansion of the target industrial chain tree by using an upper-lower relation and parallel semantic relation extraction method; the width expansion of the target industrial chain tree is performed through a width expansion algorithm, and the depth expansion of the target industrial chain tree is performed through a depth expansion algorithm.
Further, in an embodiment of the present invention, the expanding the width of the target industrial chain tree by the width expanding algorithm includes:
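using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
$$f_{e,ty} = \log(1 + C_{e,ty})\left[\log|V| - \log\Big(\sum_{e'} C_{e',ty}\Big)\right]$$
wherein $e$ represents an entity, $ty$ represents an entity type, $C_{e,ty}$ is the returned confidence score, and $|V|$ is the total number of candidate entities;
denoting the sibling similarity of two entities $e_1$ and $e_2$ as $sim_{sib}(e_1, e_2)$, and calculating the similarity of the two sibling entities using the matching pattern features:
$$sim^{sk}_{sib}(e_1, e_2 \mid SK) = \frac{\sum_{sk \in SK} \min(f_{e_1,sk}, f_{e_2,sk})}{\sum_{sk \in SK} \max(f_{e_1,sk}, f_{e_2,sk})}$$
wherein $sk$ represents a skip pattern, $SK$ represents a set of skip patterns, and $f_{e,sk}$ is the association weight between entity $e$ and skip pattern $sk$;
calculating $sim^{tp}_{sib}(e_1, e_2)$ in the same way from the entity-type features, wherein all of the acquired type features are used;
acquiring the embedded features of the two entities through word2vec to calculate $sim^{emb}_{sib}(e_1, e_2)$, and calculating the sibling similarity using a multiplication metric:
$$sim_{sib}(e_1, e_2 \mid SK) = \sqrt{\big(1 + sim^{sk}_{sib}(e_1, e_2 \mid SK)\big) \cdot sim^{emb}_{sib}(e_1, e_2)} \cdot \sqrt{1 + sim^{tp}_{sib}(e_1, e_2)}$$
calculating the score of the entity according to the sibling similarity:
$$score(e \mid S, SK_t) = \frac{1}{|S|} \sum_{e' \in S} sim_{sib}(e, e' \mid SK_t)$$
wherein $S$ is the set of seed entities and $SK_t$ is a selected subset of matching pattern features;
and screening the entities according to the scores, so as to expand the width of the target industrial chain tree.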
Further, in an embodiment of the present invention, the performing the depth expansion of the target industrial chain tree by the depth expansion includes:
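using $v(x)$ to represent the embedding vector of item $x$, and given a target parent node $e_t$ and a set of reference edges $E = \{\langle e_p, e_c \rangle\}$, wherein $e_p$ is the parent node of $e_c$, calculating the score of placing node $e_x$ under parent node $e_t$:
$$sim_{par}(\langle e_t, e_x \rangle \mid E) = \cos\Big(v(e_t) - v(e_x),\ \frac{1}{|E|} \sum_{\langle e_p, e_c \rangle \in E} \big(v(e_p) - v(e_c)\big)\Big)$$
wherein $\cos(v(x), v(y))$ represents the cosine similarity between the vectors $v(x)$ and $v(y)$;
based on $sim_{par}$, scoring each candidate entity $e_i$ and selecting the entities with a score above a threshold as the initial child nodes under node $e_t$, thereby performing depth expansion of the target industrial chain tree.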
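The depth expansion scoring can be sketched as below. This is an illustration under stated assumptions: the score is taken to be the cosine similarity between the parent-minus-child offset of the candidate edge and the mean offset of the reference edges, the embeddings are toy 2-dimensional vectors rather than word2vec output, and all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def parent_score(emb: Dict[str, Vector], parent: str, child: str,
                 reference_edges: List[Tuple[str, str]]) -> float:
    """Score for placing `child` under `parent`, given known (parent, child) edges."""
    offset = [p - c for p, c in zip(emb[parent], emb[child])]
    mean_ref = [
        sum(emb[p][k] - emb[c][k] for p, c in reference_edges) / len(reference_edges)
        for k in range(len(offset))
    ]
    return cosine(offset, mean_ref)


def initial_children(emb: Dict[str, Vector], parent: str, candidates: List[str],
                     reference_edges: List[Tuple[str, str]],
                     threshold: float) -> List[str]:
    return [c for c in candidates
            if parent_score(emb, parent, c, reference_edges) > threshold]


toy_emb = {
    "upstream support": [1.0, 1.0],
    "underlying hardware": [0.0, 1.0],
    "midstream platform": [2.0, 2.0],
    "cloud platform": [1.0, 2.0],
    "chip foundry": [2.0, 3.0],
}
edges = [("upstream support", "underlying hardware")]
children = initial_children(
    toy_emb, "midstream platform", ["cloud platform", "chip foundry"], edges, 0.5
)
```

The idea is that parent-child pairs at the same level tend to share a similar offset direction in embedding space, so a known edge under one sibling can seed the children of another.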
As shown in fig. 4, two expected width extension results are shown. When a given set { "upstream support", "midstream platform" }, we want to find their siblings "downstream integration services" and put them under the parent node "artificial intelligence". Similarly, our goal is to find all siblings of { "underlying hardware", "application technology" }, and append them under the parent node "upstream support".
This naturally creates a tree width expansion problem, and a width expansion algorithm is therefore employed to solve it. One key component of the width expansion algorithm is the computation of the sibling similarity of two entities $e_1$ and $e_2$, denoted $sim_{sib}(e_1, e_2)$. The method mainly relies on matching parallel semantic patterns in natural language: some punctuation marks (such as the enumeration comma), fixed words (such as "or" and "and"), or sentence patterns are generally used to express parallel relations, so matching patterns for parallel semantics can be obtained. First, a weight is assigned between each pair of entity and matching pattern as follows:
$$f_{e,sk} = \log(1 + X_{e,sk})\left[\log|V| - \log\Big(\sum_{e'} X_{e',sk}\Big)\right]$$
where $X_{e,sk}$ is the raw co-occurrence count between entity $e$ and skip pattern $sk$, and $|V|$ is the total number of candidate entities.
Similarly, we can define the association weights between entities and types as follows:
where $C_{e,ty}$ is the confidence score returned by the concept knowledge graph, indicating how confident it is that entity $e$ has type $ty$. The type information of each entity $e$ is obtained by linking the entity to the concept knowledge graph, and the returned types are used as a property of the entity. Unlinked entities simply have no entity type property. The invention selects Probase (A Probabilistic Taxonomy), proposed by Microsoft, as the input concept knowledge graph; based on the text content of an entity, this graph maps the entity to different semantic concepts and labels each with a corresponding probability.
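The two weight formulas themselves are not legible in this text. As a minimal sketch only, the code below assumes a TF-IDF-style form, `log(1 + X) * (log|V| - log(sum of X over all entities))`, applied the same way to co-occurrence counts and to type confidence scores. That form, the function name `association_weights`, and the toy data are assumptions for illustration, not the patent's confirmed definition.

```python
import math
from collections import defaultdict


def association_weights(raw, num_candidates):
    """Weight every (entity, feature) pair from raw counts or confidence scores.

    ASSUMED form, not confirmed by the source:
        f(e, c) = log(1 + X[e, c]) * (log|V| - log(sum of X[e', c] over all e'))
    """
    feature_totals = defaultdict(float)
    for (_, feature), value in raw.items():
        feature_totals[feature] += value

    weights = {}
    for (entity, feature), value in raw.items():
        # Features shared by many entities are less discriminative.
        rarity = math.log(num_candidates) - math.log(feature_totals[feature])
        weights[(entity, feature)] = math.log(1.0 + value) * rarity
    return weights


# Hypothetical toy data: co-occurrence counts of entities with skip patterns.
raw_counts = {
    ("upstream support", "__ , midstream platform and"): 3,
    ("downstream integration service", "__ , midstream platform and"): 1,
    ("underlying hardware", "such as __ or"): 2,
}
pattern_weights = association_weights(raw_counts, num_candidates=50)
```

The same function could be called with `(entity, type)` keys and Probase-style confidence scores as values to obtain the entity-type weights.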
After this, the similarity of the two sibling entities is calculated using the matching pattern features as follows:
where $SK$ represents the selected set of matching pattern features. Similarly, all type features can be used to calculate the type-based similarity $\mathrm{sim}_{sib}^{tp}(e_1, e_2)$ between two entities $e_1$ and $e_2$. Finally, cosine similarity over the embedding features of the two entities gives the embedding-based similarity $\mathrm{sim}_{sib}^{emb}(e_1, e_2)$.
To combine the three similarities, the present invention uses a multiplication metric to calculate sibling similarity, as follows:
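The similarity formulas are not legible in this text. The sketch below shows one way the three similarities could be computed and combined. It assumes a weighted-Jaccard form (sum of minimum weights over sum of maximum weights) for the pattern-based and type-based similarities and a plain product of `(1 + similarity)` terms as the multiplicative measure. Both forms, and all function names, are assumptions for illustration.

```python
import math


def weighted_jaccard(f1, f2, features):
    """ASSUMED similarity: sum of min weights over sum of max weights."""
    num = sum(min(f1.get(c, 0.0), f2.get(c, 0.0)) for c in features)
    den = sum(max(f1.get(c, 0.0), f2.get(c, 0.0)) for c in features)
    return num / den if den > 0 else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sibling_similarity(sim_sk, sim_emb, sim_tp):
    """ASSUMED multiplicative measure over the three similarities."""
    return (1.0 + sim_sk) * (1.0 + sim_emb) * (1.0 + sim_tp)


# Hypothetical per-entity feature weights and embeddings.
f_a = {"p1": 2.0, "p2": 1.0}
f_b = {"p1": 1.0, "p3": 3.0}
sim_sk = weighted_jaccard(f_a, f_b, ["p1", "p2", "p3"])
sim_emb = cosine([1.0, 0.0], [1.0, 1.0])
combined = sibling_similarity(sim_sk, sim_emb, sim_tp=0.5)
```

A multiplicative combination rewards pairs that agree on all three signals at once; a pair that scores well on only one signal gains little.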
Given a set of seed entities $S$ and a list of candidate entities $V$, each matching pattern feature is first scored by its cumulative strength with the entities in $S$ (i.e., the sum over $e \in S$ of the weights between entity $e$ and that matching pattern), and the 200 matching pattern features with the highest scores are selected. On this basis, 10 subsets of matching pattern features $SK_t$, $t = 1, 2, \dots, 10$, are generated. Each subset $SK_t$ has 120 matching pattern features.
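The selection step above can be sketched as follows. The text does not say how the 10 subsets are drawn from the top 200 features; this sketch assumes random sampling without replacement within each subset. The function name `select_feature_subsets` and the seeded random generator are illustrative choices.

```python
import random


def select_feature_subsets(weights, seeds, top_k=200, num_subsets=10,
                           subset_size=120, rng=None):
    """Score features by cumulative weight with the seeds, then draw subsets.

    weights: {(entity, feature): weight}
    ASSUMPTION: each subset is a random sample without replacement.
    """
    rng = rng or random.Random(0)
    seed_set = set(seeds)

    # Cumulative strength of each feature with the seed entities only.
    strength = {}
    for (entity, feature), weight in weights.items():
        if entity in seed_set:
            strength[feature] = strength.get(feature, 0.0) + weight

    top = sorted(strength, key=strength.get, reverse=True)[:top_k]
    size = min(subset_size, len(top))
    return [rng.sample(top, size) for _ in range(num_subsets)]


# Hypothetical toy weights with only two features for the seed entity.
toy_weights = {
    ("upstream support", "p1"): 3.5,
    ("upstream support", "p2"): 1.2,
    ("unrelated entity", "p3"): 9.0,
}
toy_subsets = select_feature_subsets(toy_weights, seeds={"upstream support"})
```

Drawing several overlapping subsets means that a single noisy feature can influence only some of the resulting rankings.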
Given one subset $SK_t$ of matching pattern features, a candidate entity in $V$ is considered only if it is associated with at least one matching pattern feature in $SK_t$. The score of a considered entity is calculated as follows:
For each $SK_t$, a ranked list $L_t$ of candidate entities is obtained from their scores. Let $r_t^i$ denote the rank of entity $e_i$ in $L_t$; if $e_i$ does not occur in $L_t$, we set $r_t^i = \infty$. Finally, the mean reciprocal rank (mrr) of each entity $e_i$ is calculated, and the entities whose average rank is higher than $r$ are added to the set $S$, as follows:
The key insight of the aggregation mechanism described above is that unrelated entities do not occur frequently across multiple ranked lists $L_t$ and are therefore likely to have a lower mrr score. In the present invention, $r = 5$ is set.
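The rank aggregation can be illustrated with a short sketch. It assumes that the mrr of an entity is the mean of `1 / rank` over all ranked lists, that an absent entity contributes 0 (rank of infinity), and that "average rank higher than r" means `mrr > 1 / r`. These readings, and the function names, are assumptions, since the formula is not legible in this text.

```python
def rank_lists(scores_per_subset):
    """Turn each {entity: score} dict into a best-first ranked list."""
    return [sorted(scores, key=scores.get, reverse=True)
            for scores in scores_per_subset]


def mean_reciprocal_rank(entity, ranked_lists):
    """Mean of 1/rank over all lists; absent entities contribute 0."""
    total = 0.0
    for ranked in ranked_lists:
        if entity in ranked:
            total += 1.0 / (ranked.index(entity) + 1)
    return total / len(ranked_lists)


def expand_seed_set(seeds, scores_per_subset, r=5):
    """Add candidates whose mrr exceeds 1/r (ASSUMED threshold form)."""
    ranked_lists = rank_lists(scores_per_subset)
    candidates = set().union(*ranked_lists) - set(seeds)
    accepted = {e for e in candidates
                if mean_reciprocal_rank(e, ranked_lists) > 1.0 / r}
    return set(seeds) | accepted


# Hypothetical scores of candidates under three feature subsets.
toy_scores = [
    {"a": 0.9, "b": 0.5, "noise": 0.1},
    {"a": 0.8, "b": 0.7},
    {"b": 0.9, "a": 0.6},
]
expanded = expand_seed_set({"seed"}, toy_scores, r=5)
```

In this toy input, "noise" appears in only one list at rank 3, so its mrr stays below the threshold and it is not added.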
Newly added nodes in the classification tree (e.g., the node "downstream integration service" in Fig. 4) do not yet have any child nodes, so the width expansion algorithm cannot be applied to them directly. To solve this problem, a depth expansion algorithm obtains the initial child nodes of the target node by considering the relationship between the target node's sibling nodes and its nephew/niece nodes. Take the node "downstream integration service" in Fig. 4 as an example. The node was generated by the preceding width expansion algorithm and therefore has no child nodes. The goal is to find its initial child nodes (e.g., "end devices" and "application software") by modeling the relationship between a sibling of "downstream integration service" (e.g., "upstream support") and that sibling's child nodes, which are the nephew/niece nodes of "downstream integration service" (e.g., "middleware" and "operating system").
Our depth expansion algorithm relies on term embedding, which encodes term semantics in dense vectors of fixed length. Let $v(t)$ denote the embedding vector of item $t$. The offset between two item embeddings can represent the relationship between them, so that $v(\text{"upstream support"}) - v(\text{"base hardware"}) \approx v(\text{"downstream integration service"}) - v(\text{"application software"})$. Thus, given a target parent node $e_p$ and a set of reference edges $E = \{\langle e_p', e_c' \rangle\}$, where $e_p'$ is the parent node of $e_c'$, we calculate the score of placing a node $e_x$ under the parent node $e_p$ as follows:
where $\cos(v(x), v(y))$ denotes the cosine similarity between the vectors $v(x)$ and $v(y)$. Finally, each candidate entity $e_x$ is scored on this basis, and the entities with a score above a threshold are selected as the initial child nodes under node $e_p$.
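The scoring formula is not legible in this text. As a sketch under stated assumptions, the code below scores a candidate child by the cosine similarity between the candidate offset (parent embedding minus candidate embedding) and the mean offset of the reference edges. This specific form, the two-dimensional toy embeddings, and the function names are illustrative assumptions.

```python
import math


def subtract(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def parent_child_score(parent_vec, candidate_vec, reference_edges):
    """ASSUMED score: cosine of the candidate offset and the mean reference offset.

    reference_edges: list of (reference_parent_vec, reference_child_vec).
    """
    offsets = [subtract(p, c) for p, c in reference_edges]
    mean_offset = [sum(col) / len(offsets) for col in zip(*offsets)]
    return cosine(subtract(parent_vec, candidate_vec), mean_offset)


def select_children(parent_vec, candidates, reference_edges, threshold):
    """Keep candidates whose score is above the threshold."""
    return [name for name, vec in candidates.items()
            if parent_child_score(parent_vec, vec, reference_edges) > threshold]


# Hypothetical 2-D embeddings chosen so the offsets are easy to read.
v = {
    "upstream support": [1.0, 1.0],
    "underlying hardware": [1.0, 0.0],
    "downstream integration service": [3.0, 1.0],
    "application software": [3.0, 0.0],
    "artificial intelligence": [2.0, 5.0],
}
edges = [(v["upstream support"], v["underlying hardware"])]
candidates = {name: v[name] for name in ("application software",
                                         "artificial intelligence")}
children = select_children(v["downstream integration service"],
                           candidates, edges, threshold=0.8)
```

Here "application software" has the same offset from its prospective parent as the reference edge, so it passes the threshold, while "artificial intelligence" points the opposite way and is rejected.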
Thus, an industrial chain hierarchical relationship tree of the target field can be obtained.
The upstream and downstream relationship of the industrial chain is the core of the industrial map and has extremely low fault tolerance, so it is generally constructed manually by analysts and experts. The invention therefore designs the data storage structure of the industrial map around the upstream and downstream logic of the industrial chain and the association relations between nodes, and provides functions such as visual editing of the industrial map and one-click import of map data. An industrial map data conversion method simplifies these steps: a user can select suitable industry terms from the automatically mined industry new words and then quickly build a custom industrial chain or iteratively update an existing industrial map, which facilitates subsequent fine-grained analysis and forward-looking assessment of the industrial map.
Further, in one embodiment of the present invention, designing the data storage structure of the target industrial chain tree for the upstream and downstream logic of the industrial chain and the node association relation includes:
designing a parent_id field to store the unique identifier of the parent node;
using a full_path field to store all ancestor nodes of the current node at every level, represented as a string concatenated in the form of id#id#id….
In particular, in the industrial map application scenario, attention usually centers on the upstream and downstream relationships and the hierarchical dependencies between industry nodes. The hierarchy is generally not too deep, essentially within ten levels, so the database design goal is to store a multi-level structure and to retrieve a complete branch simply and efficiently. For this kind of hierarchy, with a limited number of levels but a larger data volume, a parent_id field is designed to store the unique identifier of the parent node, so that the direction of an industry can be obtained quickly and the industrial map tree can be obtained through recursive queries. In industrial map visualization and analysis, displaying the industrial map of a given field requires frequently extracting all child node information of a given node. If only parent_id is used, a deep tree requires many database queries to retrieve a single tree, which is very inefficient. To improve efficiency, a full_path field stores all ancestor nodes of the current node at every level, represented as a string concatenated in the form of id#id#id…. A node and its child nodes can then be matched conveniently by prefix with a LIKE statement, the level of each node in the tree can be obtained, and the tree can be assembled more conveniently and efficiently in the application code layer. If the relationship of a node in a tree is updated, only the full_path fields of that node and its child nodes need to be maintained. This design both supports querying and encapsulating the industrial map structure data and makes it easy to maintain the hierarchical relationship between a node and its child nodes. The design of the main fields of the overall database table is shown in Table 1.
TABLE 1
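The parent_id and full_path design described above can be sketched with an in-memory SQLite database. Only the field names parent_id and full_path come from the text; the table name `industry_node`, the `name` column, and the helper functions are hypothetical. The sketch also assumes that full_path ends with the node's own id, so that a node and its descendants share a common prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE industry_node (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        name      TEXT NOT NULL,
        parent_id INTEGER,                    -- unique identifier of the parent
        full_path TEXT NOT NULL DEFAULT ''    -- ancestor ids plus own id, '#'-joined
    )
    """
)


def add_node(name, parent_id=None):
    """Insert a node and derive its full_path from the parent's full_path."""
    cur = conn.execute(
        "INSERT INTO industry_node (name, parent_id) VALUES (?, ?)",
        (name, parent_id),
    )
    node_id = cur.lastrowid
    if parent_id is None:
        path = str(node_id)
    else:
        (parent_path,) = conn.execute(
            "SELECT full_path FROM industry_node WHERE id = ?", (parent_id,)
        ).fetchone()
        path = f"{parent_path}#{node_id}"
    conn.execute(
        "UPDATE industry_node SET full_path = ? WHERE id = ?", (path, node_id)
    )
    return node_id


def subtree(node_id):
    """Fetch a node and all of its descendants with a single prefix query."""
    (path,) = conn.execute(
        "SELECT full_path FROM industry_node WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT name FROM industry_node "
        "WHERE full_path = ? OR full_path LIKE ? ORDER BY full_path",
        (path, path + "#%"),
    )
    return [name for (name,) in rows]


ai = add_node("artificial intelligence")
up = add_node("upstream support", ai)
mid = add_node("midstream platform", ai)
hw = add_node("underlying hardware", up)
```

Matching on `path + "#%"` rather than `path + "%"` avoids treating, for example, `1#21` as a descendant of `1#2`. The level of a node is the number of `#` separators in its full_path.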
To facilitate understanding and analysis of industrial map data, the present invention uses a graph layout view to inspect the map data. JavaScript tree controls based on Ajax technology are widely used in Web application development; the invention is implemented with the AntV G6 graph visualization engine, which provides basic graph visualization capabilities such as graph creation, rendering, element configuration, layout, interaction, and animation, and solves the problems of displaying and editing hierarchical industrial map data. Users can add, delete, and modify nodes and edges, change the superior-subordinate relationships of industrial map nodes by dragging, and click a node to configure the concept and attributes of the industry node entity, such as the entity definition and the field it belongs to, which improves the flexibility and extensibility of the industrial map.
The method uses a Tree Diff algorithm to compare the nodes of the new tree and the old tree and determine which nodes need to be updated; the differences form patch data that is sent to the server. The invention uses a depth-first strategy, which ensures that the ancestor nodes of a child node are already up to date when the child node is modified. The comparison of new and old nodes serves three database maintenance tasks: creating new nodes, deleting obsolete nodes, and updating existing nodes. Each editing action of the user is held temporarily in the front end. The front-end "add", "modify", and "delete" operations do not operate on the database directly but mark the data with a status: data to be added, modified, and deleted are placed in an add object, an update object, and a delete object respectively, and the classified data is sent to the server when "save" is clicked. The specific steps are as follows:
(1) If the node content has no id attribute, the node is considered newly added and is added to the add object. The unique identifier id of a node is generated automatically when the node is inserted into the database, and the server places the id in the node content before returning it to the browser, so every existing node has an id attribute.
(2) If the node content has an id attribute, compare whether all attribute values of the new and old nodes, except the children, are consistent:
1) If the attribute values are consistent, the node does not need to be modified;
2) If the attribute values are inconsistent, add the node to the update object and reassign its parent_id and full_path.
(3) Examine the child nodes of the new node and the old node:
1) If only the new node has child nodes, go to step (1) for those child nodes;
2) If only the old node has child nodes, the new node is considered to have discarded them, so the child nodes of the old node need to be deleted and are added to the delete object;
3) If both the new node and the old node have child nodes, compute the intersection of the child node set of the new node and the child node set of the old node, matching nodes by id; the nodes in the intersection are examined further by returning to step (1). A child node of the new node that is not in the intersection is considered newly added and is added to the add object. A child node of the old node that is not in the intersection is added to the delete object.
FIG. 5 is a schematic diagram of patch data generation for a new industrial chain. After receiving the request and the patch data, the server performs batch operations on the database: it inserts the data in the add object, modifies the data in the update object, and deletes the data in the delete object.
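The comparison steps above can be sketched as a depth-first walk over the two trees. The node shape (dicts with `id`, `name`, and `children`) and the function names are assumptions. The sketch records a `parent_id` on each added node, leaves the reassignment of parent_id and full_path for updated nodes to the server, and does not handle nodes dragged to a different parent.

```python
def attrs(node):
    """All attributes of a node except its children."""
    return {k: v for k, v in node.items() if k != "children"}


def subtree_ids(node):
    """Ids of a node and all of its descendants, depth-first."""
    ids = [node["id"]]
    for child in node.get("children", []):
        ids.extend(subtree_ids(child))
    return ids


def diff_tree(old, new, patch):
    """Compare an old node with its edited version and fill the patch."""
    if attrs(old) != attrs(new):                    # step (2): changed attributes
        patch["update"].append(attrs(new))

    old_children = {c["id"]: c for c in old.get("children", [])}
    kept = set()
    for child in new.get("children", []):           # step (3), depth-first
        child_id = child.get("id")
        if child_id is None:                        # step (1): no id, new node
            patch["add"].append(dict(child, parent_id=new["id"]))
        elif child_id in old_children:              # in the intersection
            kept.add(child_id)
            diff_tree(old_children[child_id], child, patch)
        else:
            raise ValueError("moved nodes are outside the scope of this sketch")

    for child_id, child in old_children.items():    # dropped by the new tree
        if child_id not in kept:
            patch["delete"].extend(subtree_ids(child))
    return patch


old_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream support", "children": [
        {"id": 4, "name": "underlying hardware", "children": []}]},
    {"id": 3, "name": "midstream platform", "children": []},
]}
new_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream support (renamed)", "children": []},
    {"name": "downstream integration service", "children": []},
]}
patch = diff_tree(old_tree, new_tree, {"add": [], "update": [], "delete": []})
```

For this input the patch holds one added node, one updated node (id 2), and two deleted ids (4 and 3), which the server could then apply in batch.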
In addition to visual editing, the platform also provides a one-click import function for industrial maps: a user can create or update an industrial map by importing an Excel table. The core implementation steps are as follows:
Read the Excel file. The node-xlsx module for Node.js is used to read and write the Excel file stream. The module reads the sheet row by row, so the resulting data structure is a two-dimensional array, and cells merged across rows or columns are read as NULL. For the sheet shown in Fig. 6, the reading result is shown in Fig. 7.
Convert the valid data in each row into a hierarchical tree structure. The two-dimensional array read by the Node script can be converted into a nested structure, where the length of each row array is the maximum depth of that row. As shown in Fig. 7, the first three values of the second row are NULL, which means they repeat the first three values of the first row, so the nested object generated from the second row only needs to be merged with the nested object generated from the first row. The second and third rows are merged in the same way, and so on until the complete tree is obtained. During implementation, when several rows contain data at the same level, it is not immediately clear under which row's node the current data should be inserted. Observing the data shows that the insertion level is always correct as long as each insertion first obtains the depth of the current item and then inserts it under the most recently inserted parent, which sits one level above the current item. A depth-first search is therefore used to find the parent object one level above the current level, which gives the parent relationship; combining the parent's complete path full_path with the industry field root_id, a new node can be built and stored, as shown in Fig. 8. Meanwhile, the current object is appended as the last item of the parent's child array. As shown in the figure, merging the json object generated from the first three items with the fourth item yields the industrial map json structure from the current node to the root node, as shown in Fig. 9. This structure matches the industrial hierarchy data structure required by the front-end tree control, so the front end can display it directly.
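The row-to-tree conversion can be sketched as follows. Instead of the depth-first search for the most recently inserted parent described above, this sketch keeps a list of the last inserted node at each depth, which is assumed to identify the same parent. The input mimics the two-dimensional array with NULL (here `None`) for merged cells; the sample rows and the function name are hypothetical.

```python
def rows_to_tree(rows):
    """Convert row-wise sheet data into nested {"name", "children"} nodes."""
    roots = []
    last_at_depth = []  # last_at_depth[d] is the newest node at depth d
    for row in rows:
        for depth, value in enumerate(row):
            if value is None:          # merged cell: reuse the node above
                continue
            node = {"name": value, "children": []}
            if depth == 0:
                roots.append(node)
            else:
                # Append as the last child of the newest node one level up.
                last_at_depth[depth - 1]["children"].append(node)
            del last_at_depth[depth:]  # deeper entries belong to an old branch
            last_at_depth.append(node)
    return roots


# Hypothetical sheet content after reading; None marks merged cells.
rows = [
    ["artificial intelligence", "upstream support", "underlying hardware"],
    [None, None, "application technology"],
    [None, "midstream platform"],
    [None, "downstream integration service", "end devices"],
]
tree = rows_to_tree(rows)
```

Each new node could additionally be given a full_path by concatenating its parent's full_path with its own id once the node has been stored.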
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers core services such as rapid construction of industrial maps, discovery of industry new words, extraction of industrial hierarchical relations, and iterative updating. It addresses the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industrial maps, so that a user can conveniently and quickly generate an industrial map for the corresponding category according to the user's needs. This balances the automatic processing flow against manual intervention and improves the extensibility and development efficiency of the application.
In order to realize the embodiment, the invention also provides an industrial chain construction and iteration expansion development device.
Fig. 10 is a schematic structural diagram of an industrial chain construction and iterative expansion development device according to an embodiment of the present invention.
As shown in fig. 10, the industrial chain construction and iterative expansion development apparatus includes: an acquisition module 100, a screening module 200, a construction module 300, an update module 400, wherein,
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
The construction module is used for determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree for the upstream and downstream logic of the industrial chain and the node association relation, and for carrying out iterative updating based on the original industrial chain tree through the data storage structure.
To achieve the above object, an embodiment of the present invention provides a computer device, which is characterized by comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the industrial chain construction and iterative expansion development method as described above when executing the computer program.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the industrial chain construction and iterative expansion development method as described above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Claims (7)
1. The industrial chain construction and iterative expansion development method is characterized by comprising the following steps of:
acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words, wherein the depth expansion and the width expansion of the target industrial chain tree are performed through the upper-lower relation and the parallel semantic relation extraction method, the width expansion of the target industrial chain tree is performed through a width expansion algorithm, and the depth expansion of the target industrial chain tree is performed through the depth expansion;
designing a data storage structure of the target industrial chain tree for the upstream and downstream logic of the industrial chain and the node association relation, and carrying out iterative updating based on the original industrial chain tree through the data storage structure;
the width expansion of the target industrial chain tree by a width expansion algorithm comprises the following steps: using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
wherein $e$ represents an entity and $ty$ represents an entity type; denoting the sibling similarity of two entities $e_1$ and $e_2$ as $\mathrm{sim}_{sib}(e_1, e_2)$, the similarity of the two sibling entities is calculated using the matching pattern features:
wherein $sk$ represents a skip pattern and $SK$ represents a set of skip patterns,
calculating $\mathrm{sim}_{sib}^{tp}(e_1, e_2 \mid TP)$ using the features of the entity and the type, wherein $TP$ represents all of the acquired features,
acquiring the embedding features of the two entities through word2vec to calculate $\mathrm{sim}_{sib}^{emb}(e_1, e_2)$, and calculating the sibling similarity using a multiplication metric:
calculating the score of the entity according to the sibling similarity:
screening the entities according to the scores, so as to expand the width of the target industrial chain tree;
the depth expansion of the target industrial chain tree by depth expansion comprises the following steps:
using $v(t)$ to denote the embedding vector of item $t$; given a target parent node $e_p$ and a set of reference edges $E = \{\langle e_p', e_c' \rangle\}$, wherein $e_p'$ is the parent node of $e_c'$, calculating the score of placing a node $e_x$ under the parent node $e_p$:
wherein $\cos(v(x), v(y))$ denotes the cosine similarity between the vectors $v(x)$ and $v(y)$,
2. The method of claim 1, further comprising, after obtaining the industrial corpus data corresponding to the target industrial type:
performing unified preprocessing on the industrial corpus data, which includes splitting the industrial corpus data by Chinese and non-Chinese characters and removing the words and coding symbols.
3. The method of claim 1, wherein the designing an industry new word discovery algorithm to perform an unsupervised pre-segmentation on the industry corpus data comprises:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store candidate words;
inquiring the Trie, acquiring a frequency list of a prefix and a suffix, and calculating left and right information entropy of the candidate words and left and right information entropy of fragments formed by the candidate words;
querying the Trie tree to acquire the word frequencies of the candidate words and the word frequencies of their left and right fragments, and calculating the pointwise mutual information according to the word frequencies;
calculating the score of the candidate words according to a formula, and filtering the candidate words with lower scores by setting a threshold value for the score to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
4. The method of claim 1, wherein the step of designing the data storage structure of the target industry chain tree for the industry chain upstream and downstream logic and node association relationship comprises:
designing a parent_id field, and storing a unique identifier of a parent node;
using a full_path field to store all ancestor nodes of the current node at every level, represented as a string concatenated in the form of id#id#id….
5. The industrial chain construction and iteration expansion development device is characterized by comprising the following modules:
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
The construction module is used for determining the relation between the industry new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words, wherein the depth expansion and the width expansion of the target industry chain tree are carried out through the upper-lower relation and the parallel semantic relation extraction method, the width expansion of the target industry chain tree is carried out through a width expansion algorithm, and the depth expansion of the target industry chain tree is carried out through the depth expansion;
the updating module is used for designing a data storage structure of the target industrial chain tree for the upstream and downstream logic of the industrial chain and the node association relation, and for carrying out iterative updating based on the original industrial chain tree through the data storage structure;
the width expansion of the target industrial chain tree by a width expansion algorithm comprises the following steps: using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
wherein $e$ represents an entity and $ty$ represents an entity type; denoting the sibling similarity of two entities $e_1$ and $e_2$ as $\mathrm{sim}_{sib}(e_1, e_2)$, the similarity of the two sibling entities is calculated using the matching pattern features:
wherein $sk$ represents a skip pattern and $SK$ represents a set of skip patterns,
calculating $\mathrm{sim}_{sib}^{tp}(e_1, e_2 \mid TP)$ using the features of the entity and the type, wherein $TP$ represents all of the acquired features,
acquiring the embedding features of the two entities through word2vec to calculate $\mathrm{sim}_{sib}^{emb}(e_1, e_2)$, and calculating the sibling similarity using a multiplication metric:
calculating the score of the entity according to the sibling similarity:
screening the entities according to the scores, so as to expand the width of the target industrial chain tree;
the depth expansion of the target industrial chain tree by depth expansion comprises the following steps:
using $v(t)$ to denote the embedding vector of item $t$; given a target parent node $e_p$ and a set of reference edges $E = \{\langle e_p', e_c' \rangle\}$, wherein $e_p'$ is the parent node of $e_c'$, calculating the score of placing a node $e_x$ under the parent node $e_p$:
wherein $\cos(v(x), v(y))$ denotes the cosine similarity between the vectors $v(x)$ and $v(y)$,
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the industrial chain construction and iterative expansion development method of any one of claims 1-4 when the computer program is executed.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the industrial chain construction and iterative expansion development method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310260247.6A CN115982390B (en) | 2023-03-17 | 2023-03-17 | Industrial chain construction and iterative expansion development method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115982390A (en) | 2023-04-18 |
CN115982390B (en) | 2023-06-23 |
Family
ID=85968496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310260247.6A Active CN115982390B (en) | 2023-03-17 | 2023-03-17 | Industrial chain construction and iterative expansion development method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115982390B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116975626B (en) * | 2023-06-09 | 2024-04-19 | 浙江大学 | Automatic updating method and device for supply chain data model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |