CN115982390A - Industrial chain construction and iterative expansion development method - Google Patents
- Publication number
- CN115982390A (CN202310260247.6A)
- Authority
- CN
- China
- Prior art keywords
- industry
- industrial
- target
- words
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Machine Translation (AREA)
Abstract
The invention provides an industrial chain construction and iterative expansion development method, which comprises the steps of obtaining a target industrial type input by a user, and obtaining industrial corpus data corresponding to the target industrial type; designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on industry corpus data to obtain an industry new word; determining the relation between the new industrial words according to the superior-subordinate relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words; the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logics of the industrial chain and the node incidence relation, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure. By the method provided by the invention, the construction and updating efficiency of the industrial map is greatly improved.
Description
Technical Field
The invention belongs to the technical field of data visualization and data application.
Background
At present, analyzing an industry requires constructing an industry chain map for it. The construction process usually involves manually looking up a large amount of industry data, which makes it complex.
The industry chain needs sufficient reusability, iterability and extensibility. Industries are dynamic, and new ones continuously emerge as they develop. Mining the new words that appear in an industry, obtaining the hierarchical relations among industrial words, and merging these changes into the existing industrial map data so that the whole map stays up to date are all major challenges.
Meanwhile, industry chains are highly subjective. Different industrial standards currently exist, different websites and institutions classify the same industrial term into different industries, and different people understand differently how a chain should be constructed, what types of nodes and relationships it has, and how fine-grained it should be; different settings directly lead to different application results. The prior art lacks a general development method for discovering new industrial words and for building customized industry chains, so engineering development efficiency has not improved.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide an industrial chain construction and iterative expansion development method that addresses the low accuracy of manual data analysis for existing industrial maps and the complexity of constructing and expanding them.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for building an industrial chain and iteratively expanding and developing, including:
acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relations among the industry new words using a superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industrial chain tree from the industry new words and the relations among them;
and designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
In addition, the method for building and iteratively expanding the development of the industrial chain according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, after acquiring the industry corpus data corresponding to the target industry type, the method further includes:
performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises cutting the industrial corpus data at the boundaries between Chinese and non-Chinese characters, and removing modal particles and encoding symbols.
Further, in an embodiment of the present invention, the designing an industry new word discovery algorithm performs unsupervised pre-segmentation on the industry corpus data, including:
dividing the industry corpus data into a set of single characters, and combining the characters in the set pairwise to serve as candidate words;
constructing a Trie tree to store the candidate words;
querying the Trie tree to obtain the frequency lists of prefixes and suffixes, and calculating the left and right information entropies of the candidate words and of the segments that make up the candidate words;
querying the Trie tree to obtain the word frequency of the candidate words and of their left and right segments, and calculating the pointwise mutual information from these word frequencies;
calculating a score for each candidate word according to a formula, and filtering out candidate words whose scores fall below a set threshold to obtain the candidate word set of the target field, wherein the formula is represented as:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
wherein $\mathrm{PMI}$ represents the pointwise mutual information, $h_{r\_l}$ and $h_{l\_r}$ represent the left information entropy of the right segment and the right information entropy of the left segment that make up the candidate word, and $h_l$ and $h_r$ represent the left and right information entropies of the candidate word.
Further, in an embodiment of the present invention, the determining the relations among the industry new words using a superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industry chain tree from the industry new words and the relations among them, includes:
performing depth expansion and width expansion of the target industrial chain tree using the superior-subordinate relation and parallel semantic relation extraction method, wherein the width expansion of the target industry chain tree is performed by a width expansion algorithm and the depth expansion of the target industry chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, the performing width expansion of the target industry chain tree by a width expansion algorithm includes:
representing an industry new word by an entity and the part of speech of the industry new word by a type, and defining the association weight between the entity and the type,
wherein the quantities involved are the entity, the entity type, and the confidence score returned for that entity and type;
denoting the sibling similarity of two entities, and calculating the similarity of the two sibling entities based on their matching pattern features,
wherein sk denotes a skip pattern and SK denotes the set of skip patterns;
computing a second similarity in the same way from the entity type features, over all the acquired type features;
obtaining embedded features of the two entities via word2vec, and using a multiplicative metric to compute the sibling similarity;
calculating the score of each entity according to the sibling similarity;
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
Further, in an embodiment of the present invention, the performing the depth expansion of the target industry chain tree by the depth expansion includes:
representing each item by a vector; given a target parent node and a set of reference edges, each of which links a parent node to one of its child nodes, calculating the score of placing a candidate node under the target parent node,
wherein the score uses the cosine similarity between vectors;
scoring each candidate entity on this basis, and selecting the entities whose scores exceed a threshold as the initial child nodes under the target parent node, thereby performing depth expansion of the target industry chain tree.
Further, in an embodiment of the present invention, the designing the data storage structure of the target industry chain tree according to the industry chain upstream and downstream logic and the node association relationship includes:
designing a parent_id field to store the unique identifier of the parent node;
storing the ancestor nodes of the current node at every level in a full_path field, represented as a concatenated string of the form id#id….
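The two fields above can be illustrated with a minimal sketch. The record layout (a plain dictionary), the string identifiers, and the helper names `make_node`, `ancestor_ids` and `is_descendant` are all hypothetical; only the `parent_id` and `full_path` fields and the `#` separator come from the text.

```python
SEPARATOR = "#"  # taken from the "id#id…" form; everything else is assumed

def ancestor_ids(node):
    """Return the ancestor ids from the root down to the direct parent."""
    return node["full_path"].split(SEPARATOR) if node["full_path"] else []

def make_node(node_id, parent=None):
    """Build a storage record; `parent` is the parent's record, or None for the root."""
    if parent is None:
        return {"id": node_id, "parent_id": None, "full_path": ""}
    ancestors = ancestor_ids(parent) + [parent["id"]]
    return {
        "id": node_id,
        "parent_id": parent["id"],
        "full_path": SEPARATOR.join(ancestors),
    }

def is_descendant(node, ancestor_id):
    """True if `ancestor_id` appears anywhere above `node` in the tree."""
    return ancestor_id in ancestor_ids(node)

root = make_node("1")
upstream = make_node("2", root)
hardware = make_node("5", upstream)
print(hardware)
```

Storing the whole ancestor chain on each record means a subtree can be located by matching on `full_path` without walking `parent_id` links one level at a time, which is one plausible reason for keeping both fields.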
In order to achieve the above object, a second aspect of the present invention provides an apparatus for building an industrial chain and iteratively expanding development, including the following modules:
an acquisition module, used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
a screening module, used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain industry new words;
a construction module, used for determining the relations among the industry new words using a superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industry chain tree from the industry new words and the relations among them;
and an updating module, used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logic of the industrial chain and the node association relations, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
In order to achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement an industry chain building and iterative expansion development method as described above.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to implement an industry chain building and iterative expansion development method as described above when executed by a processor.
The industrial chain construction and iterative expansion development method provided by the embodiments of the invention covers the core services of rapid industrial map construction, industry new word discovery, industrial hierarchical relation extraction, and update iteration. It addresses the low accuracy of manual data analysis for industrial maps and the complexity of constructing and expanding them, and can conveniently and quickly generate and expand the industrial map of the corresponding category on demand, thereby balancing automated processing with manual intervention and improving the extensibility and development efficiency of applications.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a method for building an industrial chain and developing an iterative extension according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a method for discovering an industry new word according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of left and right information entropies of candidate words and left and right information entropies of candidate word constituent segments according to an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating an overview of a hierarchical tree expansion algorithm process according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of generating patch data of a new industry chain according to an embodiment of the present invention.
Fig. 6 to 9 are schematic diagrams illustrating an industry map importing implementation process according to an embodiment of the present invention.
Fig. 10 is a flowchart illustrating an industrial chain building and iterative expansion developing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting it.
The industrial chain construction and iterative extension development method of the embodiment of the present invention is described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for building an industrial chain and developing an iterative extension according to an embodiment of the present invention.
As shown in fig. 1, the method for building and iteratively expanding the industry chain includes the following steps:
S101: acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
S102: designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain industry new words;
S103: determining the relations among the industry new words using a superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industrial chain tree from the industry new words and the relations among them;
S104: designing a data storage structure of the target industrial chain tree according to the upstream and downstream logic of the industrial chain and the node association relations, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
The invention adopts an unsupervised method. Based on the common characteristics of words, a statistical strategy extracts every text segment in a large-scale corpus that could form a word, cutting the corpus into many text segments; this amounts to a rough word segmentation. Linguistic knowledge is then used to eliminate useless segments that are not new words, the degree of association is calculated to find the character combinations with the highest association, and the text segments are cleaned and filtered once. Finally, all extracted words are compared with the existing lexicon, and the text segments not covered by the lexicon form the new word lexicon. Fig. 2 is a flowchart of the industry new word discovery method.
After the industrial corpus is imported into the system, the data must be preprocessed uniformly. Industrial corpora often contain not only Chinese characters but also many Arabic numerals, upper- and lower-case English letters, and special punctuation such as ellipses, all of which hinder the subsequent recognition of industry new words. Take an industry research report as an example: such reports use many numerical values to enhance credibility and persuasiveness. If the maximum segment length of an industry noun is set to 8 characters, digits and letters easily combine into many 8-character segments. These segments often have high adjacent entropy and mutual information, so if left unprocessed, segments with no value for constructing an industry chain map would become entries in the industry new word list.
Further, in an embodiment of the present invention, after acquiring the industry corpus data corresponding to the target industry type, the method further includes:
performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises cutting the industrial corpus data at the boundaries between Chinese and non-Chinese characters, and removing modal particles and encoding symbols.
Cutting turns each original long sentence into several short sentences, on which the subsequent new word recognition is then performed.
Further, in an embodiment of the present invention, the designing industry new word discovery algorithm performs unsupervised pre-segmentation on the industry corpus data, including:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set pairwise to serve as candidate words;
constructing a Trie tree to store the candidate words;
querying the Trie tree to obtain the frequency lists of prefixes and suffixes, and calculating the left and right information entropies of the candidate words and of the segments that make up the candidate words;
querying the Trie tree to obtain the word frequency of the candidate words and of their left and right segments, and calculating the pointwise mutual information from these word frequencies;
calculating a score for each candidate word according to a formula, and filtering out candidate words whose scores fall below a set threshold to obtain the candidate word set of the target field, wherein the formula is represented as:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
wherein $\mathrm{PMI}$ represents the pointwise mutual information, $h_{r\_l}$ and $h_{l\_r}$ represent the left information entropy of the right segment and the right information entropy of the left segment that make up the candidate word, and $h_l$ and $h_r$ represent the left and right information entropies of the candidate word.
Specifically, the corpus is split into a set of single characters, and the characters are combined pairwise to serve as candidate words. Because calculating the information entropy requires a prefix and a suffix, segments of length 3 need to be stored. Because prefix and suffix lookups and word-frequency statistics are needed later, the invention stores the data in Trie trees: a prefix Trie tree and a suffix Trie tree are built from the 3-gram sequences, each Trie tree uses a single character as a node, and each node records the number of occurrences of the word formed along the path from the root node to that node.
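The prefix and suffix Trie trees can be sketched as follows. This is an illustrative layout rather than the patented implementation: the class and function names are hypothetical, and the suffix tree is assumed to store each segment reversed so that the characters to the left of a word become children of its node.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # occurrences of the string spelled by the path

class CountingTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, chars):
        """Walk the path for `chars`, counting one occurrence at every node."""
        node = self.root
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1

    def _find(self, chars):
        node = self.root
        for ch in chars:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, chars):
        node = self._find(chars)
        return node.count if node else 0

    def next_char_counts(self, chars):
        """Frequency list of the characters that follow `chars` in this trie."""
        node = self._find(chars)
        return {ch: child.count for ch, child in node.children.items()} if node else {}

def build_tries(clauses, n=3):
    """Build a prefix trie and a reversed (suffix) trie from n-gram windows."""
    prefix, suffix = CountingTrie(), CountingTrie()
    for clause in clauses:
        for i in range(len(clause)):
            prefix.insert(clause[i:i + n])                        # window starting at i
            suffix.insert(clause[max(0, i - n + 1):i + 1][::-1])  # window ending at i, reversed
    return prefix, suffix

prefix, suffix = build_tries(["abcab"])
print(prefix.frequency("ab"), prefix.next_char_counts("ab"), suffix.next_char_counts("ba"))
```

Here `prefix.next_char_counts("ab")` gives the right neighbours of "ab", and `suffix.next_char_counts("ba")` (the word reversed) gives its left neighbours, which are the two frequency lists needed for the left and right information entropies.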
The Trie tree is queried to obtain the frequency lists of prefixes and suffixes, and the left and right information entropies of the candidate words and of the segments that make up the candidate words are calculated. Because many information entropies are involved, each is given a distinguishing label, as shown in Fig. 3: Candidate is the candidate word, left is the segment forming its left part, and right is the segment forming its right part; h_l_l and h_l_r are the left and right information entropies of the left segment, h_r_l and h_r_r are the left and right information entropies of the right segment, and h_l and h_r are the left and right information entropies of the candidate word.
The Trie tree is queried to obtain the word frequency of the candidate words and of their left and right segments. From the word frequencies, the actual co-occurrence probability P(a, b) and the expected co-occurrence probability P(a) × P(b) are readily obtained, so the mutual information and the internal solidification degree can be calculated. The invention mainly uses two word-formation criteria: the internal solidification degree and the free application degree. The internal solidification degree measures how frequently a word appears and how meaningful its collocation is; the higher it is, the more likely the text segment is a word. The free application degree measures the variety of the characters adjacent to the word on its left and right; the higher it is, the more likely the text segment is a word.
The internal solidification degree measures whether a word collocation is reasonable, and is calculated using pointwise mutual information (PMI), an index from computational linguistics. A high PMI means that the probability of the two words co-occurring is far greater than the product of their individual probabilities, that is, the probability expected if they were concatenated freely, which indicates that the collocation of the two words is more reasonable. The PMI is calculated as follows:
$$\mathrm{PMI}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$
where $P(a)$, $P(b)$ and $P(a, b)$ respectively denote the probabilities that a, b and the combination ab appear in the corpus.
For a word made up of more than two characters, the segment is split into two sub-segments at each possible position in turn, the mutual information of every split is calculated, and the minimum of these values is taken as the internal solidification degree:
$$\mathrm{solidification}(c_1^m) = \min_{1 \le i < m} \log \frac{p(c_1^m)}{p(c_1^i)\, p(c_{i+1}^m)}$$
where $c_i^j$ denotes the substring from the $i$-th to the $j$-th character, so that $c_1^m$ represents a character string of length $m$, and $p(c_1^m)$ denotes the frequency of occurrence of the word $c_1^m$.
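The two calculations above can be sketched in a few lines. The sketch assumes the probabilities of a word and of all of its sub-segments are already available in a dictionary; the function names and the example probabilities are hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information of a pair from its joint and marginal probabilities."""
    return math.log(p_ab / (p_a * p_b))

def internal_solidification(word, prob):
    """Minimum PMI over every way of splitting `word` into two sub-segments."""
    return min(
        pmi(prob[word], prob[word[:i]], prob[word[i:]])
        for i in range(1, len(word))
    )

# Made-up probabilities for a three-character string and its sub-segments.
prob = {"abc": 0.1, "a": 0.5, "bc": 0.2, "ab": 0.25, "c": 0.2}
print(internal_solidification("abc", prob))
```

Taking the minimum means that a single weak split point is enough to mark the whole string as loosely bound, which is the conservative behaviour the text describes.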
The Trie tree is queried to obtain the left and right adjacent characters of the sub-segments, and the left and right adjacent entropies of the candidate words are calculated. Internal cohesion alone is not sufficient; the text segment must also behave as a whole externally. Assuming that the set of left adjacent characters of a word segment $w$ is $A$ and the set of right adjacent characters is $B$, the left and right adjacent entropies are calculated respectively as:
$$h_l(w) = -\sum_{a \in A} P(aw \mid w)\, \log P(aw \mid w), \qquad h_r(w) = -\sum_{b \in B} P(wb \mid w)\, \log P(wb \mid w)$$
where $P(aw \mid w)$ is the probability that $a$ is the character immediately to the left of $w$, and $P(wb \mid w)$ is the probability that $b$ is the character immediately to its right.
In the invention, the boundary freedom of a candidate word considers the left and right adjacent entropies together: only a word with high freedom on both sides is regarded as a reasonable word. Therefore, when scoring a candidate word, the smaller of the left and right adjacent entropies is taken as the adjacent entropy value. It measures the variety of the characters adjacent to a word on its left and right; the larger the entropy, the greater the variety. The free application degree is calculated as follows:
$$\mathrm{free}(w) = \min\bigl(h_l(w),\, h_r(w)\bigr)$$
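As a rough illustration of the adjacent entropy and the free application degree, the sketch below takes the neighbour frequency lists as plain dictionaries mapping a neighbouring character to its count. The function names and the example counts are hypothetical.

```python
import math

def adjacent_entropy(neighbor_counts):
    """Shannon entropy of the distribution of characters adjacent to a word."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log(count / total)
        for count in neighbor_counts.values()
    )

def free_application_degree(left_counts, right_counts):
    """The smaller of the left and right adjacent entropies."""
    return min(adjacent_entropy(left_counts), adjacent_entropy(right_counts))

left = {"的": 3, "在": 3, "用": 2}  # varied left neighbours
right = {"片": 8}                   # always followed by the same character
print(free_application_degree(left, right))
```

In the example the right side always has the same neighbour, so its entropy is zero and the candidate is penalised even though its left side is varied; this is the effect of taking the minimum.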
To capture these characteristics of a new word, in practical application the invention calculates a score for each candidate word, representing the likelihood that it forms a new word in the current context. The score is calculated as follows:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
The score consists of three corresponding parts:
1) Pointwise mutual information: the higher the pointwise mutual information, the higher the degree of internal cohesion.
2) The minimum of the information entropies $h_{r\_l}$ and $h_{l\_r}$ of the two word segments: the larger this value, the less likely the two segments are to appear together.
3) The minimum of the left and right information entropies of the word: the larger this value, the more varied the contexts in which the candidate word appears, and the more likely it is to be a word.
Therefore, a higher score indicates a higher probability of being a word. Candidate words whose scores fall below a set threshold are removed from the candidate word set, yielding the candidate word set of the target field.
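The scoring and threshold filtering can be sketched as below. The exact way the three parts are combined is an assumption here: the sketch adds the pointwise mutual information, subtracts the smaller of the two inner segment entropies (h_r_l and h_l_r), and adds the smaller of the word's own left and right entropies (h_l and h_r), which matches the direction of each part as described. The function names, statistics and threshold are hypothetical.

```python
import math

def candidate_score(pmi, h_l, h_r, h_l_r, h_r_l):
    """Assumed combination: PMI - min(h_r_l, h_l_r) + min(h_l, h_r)."""
    return pmi - min(h_r_l, h_l_r) + min(h_l, h_r)

def filter_candidates(stats, threshold):
    """Keep the candidate words whose score reaches the threshold."""
    return {
        word for word, values in stats.items()
        if candidate_score(**values) >= threshold
    }

# Made-up statistics for two candidate words.
stats = {
    "芯片": {"pmi": 5.0, "h_l": 2.0, "h_r": 1.5, "h_l_r": 0.5, "h_r_l": 0.3},
    "的产": {"pmi": 1.0, "h_l": 0.2, "h_r": 0.1, "h_l_r": 2.0, "h_r_l": 1.8},
}
print(filter_candidates(stats, threshold=3.0))
```

A suitable threshold depends on the corpus size and the logarithm base, so it would normally be tuned against a sample of manually reviewed candidates.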
Some common Chinese words also end up in the candidate word set, and these should not be treated as new words of the target field. For this reason, a Chinese stop word list is downloaded from Baidu; the stop words are common Chinese words, and any word in the candidate word list that also appears in the stop word list is removed from the candidate word list. Meanwhile, the words in the candidate word set are not necessarily all new relative to the source domain, so words that already exist in the source-domain corpus also need to be filtered out.
The resulting industry new word list still contains many junk strings and mis-segmented strings. Most junk strings are common collocations or fragments from inside words, and an algorithm alone cannot filter out all unreasonable candidate words. Manual review is therefore needed, and the user can add, delete, modify, query and export the candidate words at any time. Because the new word discovery algorithm has already screened the candidates layer by layer, the results are of relatively high quality, which greatly reduces the workload of manual intervention. The manually reviewed candidate words are stored as the new words of the industry field, so that the industry chain can be constructed and iteratively updated on the basis of these industry new words.
Based on the steps, a new word list of the target field can be obtained.
After the industry new words are extracted, the hierarchical position of each new word in the industry chain must be determined from its meaning and characteristics: entity pairs among the new words that have a superior-subordinate relation are searched for in the corpus, the hierarchical structure of the industry chain is constructed, and the new words are added to the industry chain. An industrial map generally focuses on the upstream and downstream relations of the industry. The invention therefore uses a hierarchical tree structure to build the network of industrial relations, and performs depth expansion and width expansion of the hierarchical tree using the superior-subordinate relation and parallel semantic relation extraction method.
Further, in an embodiment of the present invention, the determining the relations among the industry new words using a superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industry chain tree from the industry new words and the relations among them, includes:
performing depth expansion and width expansion of the target industrial chain tree using the superior-subordinate relation and parallel semantic relation extraction method, wherein the width expansion of the target industry chain tree is performed by a width expansion algorithm and the depth expansion of the target industry chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, the performing width expansion of the target industry chain tree by a width expansion algorithm includes:
representing an industry new word by an entity and the part of speech of the industry new word by a type, and defining the association weight between the entity and the type,
wherein the quantities involved are the entity, the entity type, and the confidence score returned for that entity and type;
denoting the sibling similarity of two entities, and calculating the similarity of the two sibling entities based on their matching pattern features,
wherein sk denotes a skip pattern and SK denotes the set of skip patterns;
computing a second similarity in the same way from the entity type features, over all the acquired type features;
obtaining embedded features of the two entities via word2vec, and using a multiplicative metric to compute the sibling similarity;
calculating the score of each entity according to the sibling similarity;
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
Further, in an embodiment of the present invention, the performing the depth expansion of the target industry chain tree by depth expansion includes:
representing each item by a vector; given a target parent node and a set of reference edges, each of which links a parent node to one of its child nodes, calculating the score of placing a candidate node under the target parent node,
wherein the score uses the cosine similarity between vectors;
scoring each candidate entity on this basis, and selecting the entities whose scores exceed a threshold as the initial child nodes under the target parent node, thereby performing depth expansion of the target industry chain tree.
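The depth expansion step can be illustrated with a word-analogy style score. The exact scoring formula is not recoverable from the text, so the form used below is an assumption: the vector offset between the target parent and a candidate is compared, by cosine similarity, with the average parent-minus-child offset of the reference edges. All names, vectors and the threshold are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def offset(parent_vec, child_vec):
    return [p - c for p, c in zip(parent_vec, child_vec)]

def parent_child_score(parent_vec, candidate_vec, reference_edges):
    """Assumed score: cosine between the candidate edge offset and the mean reference offset."""
    offsets = [offset(p, c) for p, c in reference_edges]
    mean_offset = [sum(column) / len(offsets) for column in zip(*offsets)]
    return cosine(offset(parent_vec, candidate_vec), mean_offset)

def select_children(parent_vec, candidates, reference_edges, threshold):
    """Return the names of candidates whose score exceeds the threshold."""
    return [
        name for name, vec in candidates.items()
        if parent_child_score(parent_vec, vec, reference_edges) > threshold
    ]

# Toy two-dimensional vectors; real embeddings would come from word2vec.
reference_edges = [([1.0, 0.0], [0.0, 0.0])]
parent = [2.0, 1.0]
candidates = {"good": [1.0, 1.0], "bad": [2.0, 0.0]}
print(select_children(parent, candidates, reference_edges, threshold=0.5))
```

The intuition is that a plausible child should sit in the same relative direction from its parent as known children do from theirs.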
Fig. 4 shows two expected width expansion results. Given the set { "upstream support", "midstream platform" }, we want to find their sibling node "downstream integration services" and place it under the parent node "artificial intelligence". Similarly, our goal is to find all siblings of { "underlying hardware", "application technology" } and attach them under the parent node "upstream support".
This naturally forms a tree width expansion problem, so a width expansion algorithm is used to solve it. One key component of the width expansion algorithm is computing the sibling similarity of two entities. This is done mainly through parallel semantic pattern matching: natural language generally uses certain punctuation marks (such as the enumeration comma), fixed words (such as "or" and "and") or sentence patterns to express a parallel relation, so matching patterns for parallel semantics can be obtained. First, a weight is assigned between each pair of entity and matching pattern as follows:
where the count used is the raw co-occurrence count between entity e and skip pattern sk, and |V| is the total number of candidate entities.
Similarly, we can define the association weight between an entity and a type by:
where the confidence term is the score returned by the concept knowledge graph, indicating how confident it is that the entity has the given type. The type information of each entity is obtained by linking the entity to a concept knowledge base, and the returned types serve as features of the entity. Entities that cannot be linked simply have no such type features. The invention selects Probase (a probabilistic taxonomy) proposed by Microsoft as the input concept knowledge graph. With this graph, an entity can be mapped to different semantic concepts according to its text content, each labeled with a corresponding probability.
After this, the similarity of the two sibling entities is computed using the matching pattern features, as follows:
where SK denotes the selected set of matching pattern features. Similarly, a type-based similarity can be computed from all type features, and finally an embedding-based similarity is computed as the cosine similarity between the embedded features of the two entities.
To combine the three similarities, the present invention uses a multiplicative metric to compute the sibling similarity as follows:
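As an illustration of how the three similarities might be combined, the following Python sketch computes a sibling similarity from sparse pattern features, sparse type features, and dense embeddings. The sum-of-minima over sum-of-maxima ratio and the particular multiplicative form are assumptions chosen for the sketch, and the names `weighted_jaccard`, `cosine`, and `sibling_similarity` are hypothetical.

```python
import math


def weighted_jaccard(f1, f2):
    """Similarity of two sparse feature dicts: sum of minima over sum of maxima."""
    keys = set(f1) | set(f2)
    num = sum(min(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    den = sum(max(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sibling_similarity(e1, e2):
    """Combine pattern, type, and embedding similarity multiplicatively.

    Assumed form: the pattern similarity acts as a gate, so two entities
    that share no matching pattern get a sibling similarity of zero.
    """
    sim_sk = weighted_jaccard(e1["patterns"], e2["patterns"])
    sim_ty = weighted_jaccard(e1["types"], e2["types"])
    sim_emb = cosine(e1["embedding"], e2["embedding"])
    return sim_sk * math.sqrt((1.0 + sim_ty) * (1.0 + sim_emb))
```

Each entity is passed as a dict with the keys `patterns`, `types`, and `embedding`. An entity that could not be linked to the concept knowledge graph simply carries an empty `types` dict.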
Given a set S of seed entities and a list V of candidate entities, each matching pattern feature is first scored by its cumulative strength with the entities in S, and the 200 highest-scoring matching pattern features are selected. On this basis, 10 subsets of matching pattern features, indexed by t = 1, 2, …, 10, are generated by sampling without replacement. Each subset contains 120 matching pattern features.
Given one such subset, a candidate entity in V is considered only if it is associated with at least one matching pattern feature in that subset. The score of each considered entity is calculated as follows:
For each subset, a ranked list of candidate entities is obtained from their scores. The rank of each entity in each list is recorded, and an entity that does not appear in a list is assigned an infinite rank for that list. Finally, the mean reciprocal rank (mrr) of each entity is calculated, and entities whose average rank is above r are added to the set S, as follows:
A key insight of the aggregation mechanism described above is that unrelated entities do not appear frequently near the top of multiple ranked lists and therefore tend to have a lower mrr score. In the present invention, r = 5 is used.
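The selection and aggregation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the per-subset score is the average weighted-Jaccard similarity to the seeds, the mrr is averaged over the subsets, and an entity is accepted when its mrr is at least 1/r. The function names and the `feats` structure (entity to pattern-weight dict) are hypothetical.

```python
import random


def restricted_similarity(f1, f2, subset):
    """Weighted-Jaccard similarity using only the pattern features in `subset`."""
    num = sum(min(f1.get(p, 0.0), f2.get(p, 0.0)) for p in subset)
    den = sum(max(f1.get(p, 0.0), f2.get(p, 0.0)) for p in subset)
    return num / den if den > 0 else 0.0


def expand_width(seeds, candidates, feats, top_k=200, n_subsets=10,
                 subset_size=120, r=5, rng_seed=0):
    """Return the candidates to add to the (non-empty) seed set."""
    rng = random.Random(rng_seed)

    # Score every pattern by its cumulative strength with the seed entities.
    strength = {}
    for s in seeds:
        for pattern, weight in feats[s].items():
            strength[pattern] = strength.get(pattern, 0.0) + weight
    top = sorted(strength, key=strength.get, reverse=True)[:top_k]

    mrr = {c: 0.0 for c in candidates}
    for _ in range(n_subsets):
        # Sample one feature subset without replacement.
        subset = rng.sample(top, min(subset_size, len(top)))
        scored = []
        for c in candidates:
            # Only consider candidates that share at least one sampled pattern.
            if not any(p in feats[c] for p in subset):
                continue
            score = sum(restricted_similarity(feats[c], feats[s], subset)
                        for s in seeds) / len(seeds)
            scored.append((score, c))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # A candidate missing from this list contributes nothing (infinite rank).
        for rank, (_, c) in enumerate(scored, start=1):
            mrr[c] += 1.0 / (rank * n_subsets)

    return [c for c in candidates if mrr[c] >= 1.0 / r]
```

With the defaults, the sketch mirrors the figures in the text: 200 top patterns, 10 subsets of 120 patterns each, and r = 5. When fewer patterns are available than `subset_size`, every subset contains all of them and the ensemble degenerates to a single ranking.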
Nodes newly added to the classification tree (e.g., the node "downstream integration services" in fig. 4) do not yet have any child nodes, so the width expansion algorithm cannot be applied to them directly. To solve this problem, a depth expansion algorithm is used to obtain the initial child nodes of a target node by considering the relationship between the target node's siblings and its niece/nephew nodes. Take the node "downstream integration services" in fig. 4 as an example. This node was generated by the preceding width expansion algorithm and therefore has no child nodes. The goal is to find its initial child nodes (e.g., "terminal devices" and "application software") by modeling the relationship between a sibling of "downstream integration services" (e.g., "upstream support") and that sibling's children, i.e. the niece/nephew nodes (e.g., "middleware" and "operating system").
The depth expansion algorithm relies on term embeddings, which encode the semantics of a term in a dense vector of fixed length. The embedding vector of a term t is denoted v(t). The offset between two term embeddings can represent the relationship between the terms, so that, for example, v("upstream support") - v("underlying hardware") ≈ v("downstream integration services") - v("application software"). Thus, given a target parent node and a set of reference edges, each joining a parent node to one of its child nodes, the score of placing a candidate node under the target parent node is computed as follows:
where the similarity term denotes the cosine similarity between two vectors. Finally, each candidate entity is scored on this basis, and the entities whose score is above a threshold are selected as the initial child nodes of the target parent node.
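The offset-based scoring can be sketched in Python as below. The sketch assumes that the score is the average, over all reference edges, of the cosine similarity between the candidate's parent-child offset and each reference offset. The averaging, the function names, and the toy two-dimensional vectors in the usage note are assumptions made for illustration.

```python
import math


def offset(u, v):
    """Element-wise difference u - v of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def parent_child_score(vec, target_parent, candidate, reference_edges):
    """Average cosine similarity between the candidate offset and each reference offset."""
    cand_offset = offset(vec[target_parent], vec[candidate])
    sims = [cosine(cand_offset, offset(vec[parent], vec[child]))
            for parent, child in reference_edges]
    return sum(sims) / len(sims)


def initial_children(vec, target_parent, candidates, reference_edges, threshold):
    """Keep the candidates whose score exceeds the threshold."""
    return [c for c in candidates
            if parent_child_score(vec, target_parent, c, reference_edges) > threshold]
```

For example, with `vec` mapping "upstream support" to `[1, 1]`, "underlying hardware" to `[1, 0]`, "downstream integration services" to `[3, 1]`, and "application software" to `[3, 0]`, the candidate offset equals the reference offset, so "application software" receives the maximum score under the target parent "downstream integration services".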
Thus, the industrial chain hierarchical relation tree of a target field can be obtained.
The upstream and downstream relationships of the industrial chain are the core of the industrial map and tolerate very few errors, so they are generally constructed manually by analysts and experts. The data storage structure of the industrial map is therefore designed around the upstream and downstream logic of the industrial chain and the node association relationships, and functions such as visual editing of the industrial map and one-click import of map data are provided. An industrial map data conversion method simplifies the steps: a user can select suitable industry terms from the automatically mined industry new words and either build a custom industrial chain quickly or iteratively update an existing industrial map, which facilitates subsequent fine-grained analysis and forward-looking assessment of the industrial map.
Further, in an embodiment of the present invention, the designing the data storage structure of the target industry chain tree according to the industry chain upstream and downstream logic and the node association relationship includes:
designing a parent_id field that stores the unique identifier of the parent node;
storing all ancestor nodes of the current node, at every level, in a full_path field as a concatenated string (id#id…).
Specifically, in an industrial map application, the map usually focuses on the upstream and downstream relationships and the hierarchical dependencies among industry nodes, and the hierarchy is generally not very deep, basically within ten levels. The design goal of the database is therefore to store a multi-level structure and to retrieve a complete branch simply and efficiently. For this kind of hierarchy, limited in depth but large in data volume, a parent_id field is designed to store the unique identifier of the parent node, so that the parent of an industry node can be located quickly and the industrial map tree can be obtained through recursive queries. In industrial map visualization and analysis, displaying the map of a given field frequently requires extracting all descendant nodes of a given node. With parent_id alone, a deep tree requires many database queries to assemble, which is very inefficient. To improve efficiency, all ancestor nodes of the current node are stored in a full_path field as a concatenated string (id#id…). A node and its descendants can then be matched with a LIKE prefix query, the level of each node in the tree can be derived, and the tree can be assembled more conveniently and efficiently in application code. If the position of a node in the tree changes, only the full_path fields of that node and its descendants need to be maintained. This design supports both querying and packaging of industrial map structure data and maintenance of the hierarchical relationship between a node and its descendants. The main fields of the database table are shown in table 1.
TABLE 1
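The parent_id and full_path design can be tried out with the following Python sketch, which uses an in-memory SQLite database. The table name `industry_node`, the `name` column, and the helper functions are hypothetical, and the sketch assumes that full_path ends with the node's own id, so that a node and its descendants share a common prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE industry_node ("
    "id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, full_path TEXT)"
)


def add_node(name, parent_id=None):
    """Insert a node and derive its full_path from the parent's full_path."""
    cur = conn.execute(
        "INSERT INTO industry_node (name, parent_id) VALUES (?, ?)",
        (name, parent_id),
    )
    node_id = cur.lastrowid
    if parent_id is None:
        path = str(node_id)
    else:
        (parent_path,) = conn.execute(
            "SELECT full_path FROM industry_node WHERE id = ?", (parent_id,)
        ).fetchone()
        path = f"{parent_path}#{node_id}"
    conn.execute(
        "UPDATE industry_node SET full_path = ? WHERE id = ?", (path, node_id)
    )
    return node_id


def subtree(node_id):
    """Fetch a node and all of its descendants with one prefix query."""
    (path,) = conn.execute(
        "SELECT full_path FROM industry_node WHERE id = ?", (node_id,)
    ).fetchone()
    # Matching on path + '#%' avoids confusing '1#2' with '1#20'.
    return conn.execute(
        "SELECT name, full_path FROM industry_node "
        "WHERE full_path = ? OR full_path LIKE ? ORDER BY full_path",
        (path, path + "#%"),
    ).fetchall()


def depth(full_path):
    """Level of a node in the tree; the root has depth 0."""
    return full_path.count("#")
```

A whole branch is read with a single query instead of one query per level, which is the efficiency gain the paragraph above describes. The cost is that moving a node requires rewriting the full_path of that node and of every descendant.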
To make the industrial map data easier to understand and analyze, the method presents the map data in a tiled, layered layout. In Web application development, JavaScript tree controls based on Ajax are widely used. The invention is implemented with the AntV G6 graph visualization engine, which provides basic graph visualization capabilities such as drawing, rendering, element configuration, layout, interaction, and animation, and which handles the display and editing of hierarchical industrial map data well. A user can add, delete, and modify nodes and edges, can change the parent-child relationships of industrial map nodes by dragging, and can configure the entity concept and attributes of an industry node (such as the entity definition and the field to which it belongs) by clicking the node, which improves the flexibility and extensibility of the industrial map.
The method uses a Tree Diff algorithm to compare the nodes of the old and new trees and find the differences, thereby determining which nodes need to be updated and forming patch data to send to the server. A depth-first strategy is adopted, which ensures that ancestor nodes are already up to date when a child node is modified. The comparison between new and old nodes serves three database maintenance operations: creating new nodes, deleting discarded nodes, and updating existing nodes. Each editing action of the user is held temporarily in the front end. Instead of operating on the database directly, the data to be added, modified, and deleted is marked by state and placed into an add object, an update object, and a delete object respectively, and the classified data is sent to the server when the user clicks save. The specific steps are as follows:
(1) If the node content has no id attribute, the node is considered newly added and is placed in the add object. Because the unique node identifier id is generated automatically when a node is inserted into the database, and the server puts the id into the node content and returns it to the browser, every existing node has an id attribute.
(2) If the node content has an id attribute, compare whether the attribute values of the new node and the old node, other than children, are consistent:
1) If the attribute values are consistent, the node is considered unmodified;
2) If the attribute values are inconsistent, add the node to the update object and reassign its parent_id and full_path.
(3) Examine the child nodes of the new node and the old node:
1) If only the new node has child nodes, go to step (1);
2) If only the old node has child nodes, the new node is considered to have discarded them, so the child nodes of the old node must be deleted and are added to the delete object;
3) If both the new node and the old node have child nodes, traverse both child node sets and take their intersection, where two nodes are in the intersection if their ids are the same; for each node in the intersection, continue the comparison by going to step (1). A child of the new node that is not in the intersection is considered newly added and is placed in the add object. A child of the old node that is not in the intersection is placed in the delete object.
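The comparison steps above can be sketched as a small recursive function. This is an illustrative Python sketch, not the front-end implementation: nodes are assumed to be plain dicts with an optional `id`, arbitrary attributes, and an optional `children` list, and the shape of the patch entries is hypothetical.

```python
def attrs(node):
    """All attributes of a node except its children."""
    return {k: v for k, v in node.items() if k != "children"}


def diff_tree(old, new, patch=None):
    """Depth-first comparison of two versions of a node that has an id.

    Returns a patch with three buckets: nodes to add (with the id of the
    parent they hang under), nodes to update, and ids to delete.
    """
    if patch is None:
        patch = {"add": [], "update": [], "delete": []}

    # Step (2): same id, compare every attribute except children.
    if attrs(old) != attrs(new):
        patch["update"].append(attrs(new))

    # Step (3): match children by id.
    old_children = {c["id"]: c for c in old.get("children", [])}
    kept = set()
    for child in new.get("children", []):
        child_id = child.get("id")
        if child_id in old_children:
            kept.add(child_id)
            diff_tree(old_children[child_id], child, patch)
        else:
            # Step (1): no id, or not a child of the old node: newly added.
            patch["add"].append({"parent_id": new["id"], "node": child})
    for child_id in old_children:
        if child_id not in kept:
            patch["delete"].append(child_id)
    return patch
```

A newly added child is recorded together with its whole subtree, and a deleted id is understood to remove the node's descendants as well. Reassigning parent_id and full_path for updated nodes is left to the server side in this sketch.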
As shown in fig. 5, which is a schematic diagram of generating the patch data of a new industrial chain, after receiving the request and the patch data the server performs batch add, delete, and modify operations on the database: it adds the data in the add object, modifies the data in the update object, and deletes the data in the delete object.
Besides visual editing, the platform also provides a one-click import function for industrial maps: a user can create or update an industrial map by importing an Excel table. The core implementation steps are as follows:
Reading the Excel file. The Node.js node-xlsx module is used to read and write the Excel file stream. The module reads the Excel file row by row, so the data read is a two-dimensional array, and the value read for a cell that belongs to merged rows or columns is NULL. For the table shown in fig. 6, the reading result is shown in fig. 7.
Converting the valid data in each row into a hierarchical tree structure. The two-dimensional array read by the Node script can be converted into a nested structure, where the length of each row array is the maximum depth of that row. As shown in fig. 7, the first three values of the second row are NULL, meaning they repeat the first three values of the first row, so the nested object generated from the second row only needs to be merged with the nested object generated from the first row. In the same way, the second and third rows are merged, and repeating this process yields the complete tree. During implementation, when several rows share the same level it is not immediately clear under which row the current data should be inserted. Observing the data shows that, on each insertion, it is enough to obtain the depth of the current item and insert it under the most recently inserted item of the parent level, that is, the level one above the current depth; this guarantees that the item is inserted at the correct level. A depth-first search is therefore used to find, in the parent object tree, the node one level above the current level, which gives the parent relationship. Combining the parent node's complete path full_path with the root_id of the industry field, a new node can be constructed and stored in the database, as shown in fig. 8. At the same time, the current object is appended as the last item of the parent's children array, for example merging the JSON object generated from the first three items with the fourth item, which yields the JSON structure of the industrial map from the current node to the root node, as shown in fig. 9. This fully matches the industry hierarchy data structure required by the front-end tree control and is convenient for the front end to visualize.
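The row-to-tree conversion can be illustrated with a short Python sketch. It assumes that each row is a list in which the column index is the depth and `None` stands for a cell merged with the row above. Instead of a depth-first search, the sketch simply remembers the last inserted node at each depth, which yields the same parent for well-formed input; the function name and the `name`/`children` keys are hypothetical.

```python
def rows_to_tree(rows):
    """Convert spreadsheet rows into a list of nested root nodes."""
    roots = []
    last_at_depth = []  # last_at_depth[d] is the most recent node at depth d
    for row in rows:
        for depth, value in enumerate(row):
            if value is None:
                # Merged cell: the ancestor at this depth is unchanged.
                continue
            node = {"name": value, "children": []}
            if depth == 0:
                roots.append(node)
            else:
                # Attach under the most recently inserted node one level up.
                last_at_depth[depth - 1]["children"].append(node)
            # Forget deeper nodes: they belonged to the previous branch.
            del last_at_depth[depth:]
            last_at_depth.append(node)
    return roots
```

For instance, the rows `["artificial intelligence", "upstream support", "underlying hardware"]`, `[None, None, "application technology"]`, and `[None, "midstream platform", "middleware"]` produce a single root whose two children carry two and one child nodes respectively.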
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers core services such as rapid construction of the industrial map, discovery of industry new words, extraction of industry hierarchical relations, and iterative updating. It addresses the low precision of manual data analysis and the complexity of building and extending conventional industrial maps, allows the industrial map of a given category to be generated and extended conveniently and quickly on demand, balances automatic processing against manual intervention, and improves the extensibility and development efficiency of applications.
In order to implement the above embodiments, the present invention further provides an industrial chain building and iterative expansion development device.
Fig. 10 is a schematic structural diagram of an apparatus for building an industry chain and iteratively expanding development according to an embodiment of the present invention.
As shown in fig. 10, the apparatus for building and iteratively expanding an industry chain includes: an acquisition module 100, a screening module 200, a construction module 300, and an update module 400, wherein
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain an industry new word;
the construction module is used for determining the relation between the industry new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating based on the original industrial chain tree through the data storage structure.
In order to achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for building an industry chain and iteratively expanding development as described above.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for building and iteratively expanding an industry chain as described above.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A method for building and iteratively expanding and developing an industrial chain is characterized by comprising the following steps:
acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the new industrial words according to the superior-inferior relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words;
and designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
2. The method according to claim 1, further comprising, after obtaining the industry corpus data corresponding to the target industry type:
and performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises the steps of cutting the industrial corpus data according to a Chinese character mode and a non-Chinese character mode, and removing language words and coding symbols.
3. The method according to claim 1, wherein designing the industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data comprises:
dividing the industrial corpus data into a single character set, and combining every two characters in the set to serve as candidate words;
constructing a Trie tree to store candidate words;
querying the Trie tree to obtain the frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and of the fragments forming the candidate words;
querying the Trie tree to obtain the word frequency of the candidate words and of their left and right fragments, and calculating the pointwise mutual information from these word frequencies;
calculating the scores of the candidate words according to a formula, and filtering out low-scoring candidate words by setting a score threshold, to obtain the candidate word set of the target field, wherein the formula is expressed as follows:
wherein the terms of the formula represent, respectively, the pointwise mutual information, the left and right information entropy of the fragments constituting the candidate word, and the left and right information entropy of the candidate word.
4. The method according to claim 1, wherein the determining the relationship between the industry new words according to the superior-inferior relationship and the parallel semantic relationship extraction method, and the constructing the target industry chain tree according to the industry new words and the relationship between the industry new words comprises:
performing depth expansion and width expansion of the target industry chain tree by the superior-subordinate relation and parallel semantic relation extraction method, wherein the width expansion of the target industry chain tree is performed through a width expansion algorithm and the depth expansion of the target industry chain tree is performed through a depth expansion algorithm.
5. The method of claim 4, wherein the expanding the width of the target industry chain tree by a width expansion algorithm comprises:
representing an industry new word by an entity, representing the part of speech of the industry new word by a type, and defining the association weight between the entity and the type:
wherein the symbols denote, respectively, the entity, the entity type, and the returned confidence score;
denoting the sibling similarity of two entities, and calculating the similarity of the two sibling entities using the matching pattern features:
wherein sk denotes a skip pattern and SK denotes the set of skip patterns;
computing a type-based similarity using the type features of the entities, wherein the feature set represents all acquired type features;
obtaining the embedded features of the two entities via word2vec, and using a multiplicative metric to compute the sibling similarity:
calculating the score of each entity according to the sibling similarity:
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
6. The method of claim 4, wherein the depth expansion of the target industry chain tree by the depth expansion algorithm comprises:
using v(t) to represent the embedding vector of a term t; given a target parent node and a set of reference edges, each joining a parent node to one of its child nodes, calculating the score of placing a candidate node under the target parent node:
wherein the similarity term denotes the cosine similarity between two vectors;
7. The method of claim 1, wherein designing the data storage structure of the target industry chain tree by aiming at the industry chain upstream and downstream logic and the node association relationship comprises:
designing a parent_id field that stores the unique identifier of the parent node;
storing all ancestor nodes of the current node, at every level, in a full_path field as a concatenated string (id#id…).
8. The device for building and iteratively expanding and developing the industrial chain is characterized by comprising the following modules:
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain an industry new word;
the construction module is used for determining the relation between the industry new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of industry chain construction and iterative expansion development as claimed in any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for industrial chain construction and iterative extension development according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310260247.6A CN115982390B (en) | 2023-03-17 | 2023-03-17 | Industrial chain construction and iterative expansion development method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115982390A (en) | 2023-04-18 |
CN115982390B (en) | 2023-06-23 |
Family
ID=85968496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310260247.6A Active CN115982390B (en) | 2023-03-17 | 2023-03-17 | Industrial chain construction and iterative expansion development method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115982390B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116975626A (en) * | 2023-06-09 | 2023-10-31 | 浙江大学 | Automatic updating method and device for supply chain data model |
CN116975626B (en) * | 2023-06-09 | 2024-04-19 | 浙江大学 | Automatic updating method and device for supply chain data model |
Also Published As
Publication number | Publication date |
---|---|
CN115982390B (en) | 2023-06-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||