CN115982390B - Industrial chain construction and iterative expansion development method - Google Patents


Publication number
CN115982390B
Authority
CN
China
Prior art keywords
industrial
target
industry
node
industrial chain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310260247.6A
Other languages
Chinese (zh)
Other versions
CN115982390A (en
Inventor
鄂海红
宋美娜
梁月梅
周文安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202310260247.6A priority Critical patent/CN115982390B/en
Publication of CN115982390A publication Critical patent/CN115982390A/en
Application granted granted Critical
Publication of CN115982390B publication Critical patent/CN115982390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Abstract

The invention provides an industrial chain construction and iterative expansion development method, comprising: obtaining a target industry type input by a user and obtaining industrial corpus data corresponding to the target industry type; designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industrial corpus data to obtain industry new words; determining the relations between the industry new words using upper-lower (hypernym-hyponym) relation and parallel semantic relation extraction methods, and constructing a target industrial chain tree from the industry new words and the relations between them; and designing a data storage structure for the target industrial chain tree based on the upstream and downstream logic of the industrial chain and the association relations between nodes, and iteratively updating the original industrial chain tree through the data storage structure. The method greatly improves the efficiency of constructing and updating industrial maps.

Description

Industrial chain construction and iterative expansion development method
Technical Field
The invention belongs to the technical field of data visualization technology and data application.
Background
At present, analyzing an industry requires constructing an industrial chain map of that industry. The construction process often requires manually consulting a large amount of industrial data, which makes it complicated, and a map built from manually consulted data may also be incomplete.
An industrial chain needs to be sufficiently reusable, iterable and extensible. Industries themselves are dynamic: as they evolve, new industries continue to emerge. How to mine the new words that appear in an industry, how to obtain the hierarchical relations among industry words, and how to add these industry changes to the existing industrial map data have become a major challenge for the map as a whole.
Meanwhile, industrial chains are highly subjective. Different industry standards currently exist, different websites and institutions classify the same industry term into different industries, and different people understand differently how an industrial chain should be constructed, what types of nodes and relations it has, and at what granularity. Different settings lead directly to different application results. The prior art lacks a general development method for discovering industry new words and constructing industrial chains in a personalized way, which hinders engineering development efficiency.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention aims to provide an industrial chain construction and iterative expansion development method that overcomes the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industrial maps.
To achieve the above objective, an embodiment of a first aspect of the present invention provides an industrial chain construction and iterative expansion development method, including:
acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logic of the industrial chain and the association relation of the nodes, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure.
In addition, an industrial chain construction and iterative expansion development method according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, after obtaining the industrial corpus data corresponding to the target industrial type, the method further includes:
uniformly preprocessing the industrial corpus data, including splitting the industrial corpus data at the boundaries between Chinese and non-Chinese characters so as to remove non-Chinese text and encoding symbols.
Further, in one embodiment of the present invention, designing the industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industrial corpus data includes:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store the candidate words;
querying the Trie tree to obtain frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and the left and right information entropy of the fragments that make up the candidate words;
querying the Trie tree to obtain the word frequencies of the candidate words and of their left and right fragments, and calculating the pointwise mutual information from the word frequencies;
calculating a score for each candidate word according to a formula, and filtering out candidate words with lower scores by setting a threshold on the score, to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
[Formula: Figure SMS_1]
wherein [Figure SMS_2] represents the pointwise mutual information, [Figure SMS_3] represents the left and right information entropy of the fragments that make up the candidate word, and [Figure SMS_4] represents the left and right information entropy of the candidate word.
Further, in an embodiment of the present invention, determining the relations between the industry new words according to the upper-lower relation and parallel semantic relation extraction methods, and constructing a target industrial chain tree according to the industry new words and the relations between them, includes:
performing depth expansion and width expansion of the target industrial chain tree using the upper-lower relation and parallel semantic relation extraction methods, wherein the width expansion of the target industrial chain tree is performed by a width expansion algorithm and the depth expansion of the target industrial chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, expanding the width of the target industrial chain tree by the width expansion algorithm includes:
using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
[Formula: Figure SMS_5]
wherein [Figure SMS_6] represents the entity, [Figure SMS_7] represents the entity type, and [Figure SMS_8] [Figure SMS_9] represents a returned confidence score;
denoting the sibling similarity of two entities [Figure SMS_10] and [Figure SMS_11] as [Figure SMS_12], and calculating the similarity of the two sibling entities using the matching pattern features:
[Formula: Figure SMS_13]
wherein [Figure SMS_14] represents a skip pattern and [Figure SMS_15] represents the set of skip patterns;
calculating [Figure SMS_16] using the features of the entity and the type, wherein [Figure SMS_17] represents all the acquired features;
obtaining the embedding features [Figure SMS_18] of the two entities through word2vec, and calculating the sibling similarity using a multiplicative measure:
[Formula: Figure SMS_19]
calculating the score of the entity according to the sibling similarity:
[Formula: Figure SMS_20]
and screening the entities according to the scores, thereby expanding the width of the target industrial chain tree.
Further, in an embodiment of the present invention, performing the depth expansion of the target industrial chain tree by the depth expansion algorithm includes:
using [Figure SMS_22] to denote the representation of term [Figure SMS_24]; given a target parent node [Figure SMS_26] and a set of reference edges [Figure SMS_23], wherein [Figure SMS_25] is the parent node of [Figure SMS_27], calculating the score of placing node [Figure SMS_28] under parent node [Figure SMS_21]:
[Formula: Figure SMS_29]
wherein [Figure SMS_30] represents the cosine similarity between vectors [Figure SMS_31] and [Figure SMS_32];
based on [Figure SMS_33], scoring each candidate entity [Figure SMS_34], and selecting the entities whose scores exceed a threshold as the initial child nodes under node [Figure SMS_35], thereby performing the depth expansion of the target industrial chain tree.
Further, in one embodiment of the present invention, the designing the data storage structure of the target industry chain tree for the industry chain upstream and downstream logic and node association relation includes:
designing a parent_id field, and storing a unique identifier of a parent node;
storing all ancestor nodes of the current node, at every level, in a full path field, represented as a string concatenated in the form id # id # id …
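The two fields can be illustrated with a minimal in-memory sketch. The names `parent_id` and `full_path` follow the text; the `ChainNode` class, the dictionary used as a store, the helper functions and the decision to write the path without spaces are illustrative assumptions, not the schema of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChainNode:
    id: int
    name: str
    parent_id: Optional[int]  # unique identifier of the parent node (None for the root)
    full_path: str            # ancestor ids from the root down to the parent, joined by '#'


def add_node(nodes: Dict[int, ChainNode], node_id: int, name: str,
             parent_id: Optional[int] = None) -> ChainNode:
    """Insert a node and derive its ancestor path from the parent's path."""
    if parent_id is None:
        path = ""
    else:
        parent = nodes[parent_id]
        path = f"{parent.full_path}#{parent.id}" if parent.full_path else str(parent.id)
    node = ChainNode(node_id, name, parent_id, path)
    nodes[node_id] = node
    return node


def descendants(nodes: Dict[int, ChainNode], node_id: int) -> List[int]:
    """Return ids of all nodes that list `node_id` among their ancestors."""
    return [n.id for n in nodes.values() if str(node_id) in n.full_path.split("#")]


nodes: Dict[int, ChainNode] = {}
add_node(nodes, 1, "artificial intelligence")
add_node(nodes, 2, "upstream support", parent_id=1)
add_node(nodes, 3, "underlying hardware", parent_id=2)
print(nodes[3].full_path)
```

Because each node carries its ancestor path, a whole subtree can be selected in one scan without walking `parent_id` links, which is one plausible reason for storing the path alongside `parent_id`.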
To achieve the above object, a second aspect of the present invention provides an apparatus for industrial chain construction and iterative expansion development, comprising:
The acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
the construction module is used for determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
and the updating module is used for designing a data storage structure for the target industrial chain tree based on the upstream and downstream logic of the industrial chain and the association relations between nodes, and for iteratively updating the original industrial chain tree through the data storage structure.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the industrial chain construction and iterative expansion development method described above.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements an industrial chain construction and iterative expansion development method as described above.
The industrial chain construction and iterative expansion development method provided by the embodiments of the invention covers core services such as rapid construction of industrial maps, discovery of industry new words, extraction of industrial hierarchical relations, and iterative updating. It overcomes the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industrial maps, and lets a user quickly and conveniently generate the industrial map for a given category according to the user's needs, thereby balancing automated processing against manual intervention and improving the extensibility and development efficiency of applications.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic flow chart of an industrial chain construction and iterative expansion development method according to an embodiment of the present invention.
Fig. 2 is a flow chart of an industrial new word discovery method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of left and right information entropy of a candidate word and left and right information entropy of a candidate word constituent segment provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of a hierarchical tree expansion algorithm according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of patch data generation of a new industrial chain according to an embodiment of the present invention.
Fig. 6-9 are schematic diagrams of an industrial map importing implementation process according to an embodiment of the present invention.
Fig. 10 is a schematic flow chart of an industrial chain construction and iterative expansion development device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes an industrial chain construction and iterative expansion development method of an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an industrial chain construction and iterative expansion development method according to an embodiment of the present invention.
As shown in fig. 1, the industrial chain construction and iterative expansion development method comprises the following steps:
S101: acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
S102: designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
S103: determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
S104: the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logic of the industrial chain and the association relation of the nodes, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure.
The invention adopts an unsupervised method. First, a statistical strategy is used to extract all text fragments in a large-scale corpus that may form words, segmenting the corpus into a number of text fragments; this is equivalent to a rough, shallow word segmentation. Then, linguistic knowledge is used to remove useless fragments that are not new words, the relevance is calculated, and the words and word combinations with the highest relevance are sought, which cleans and filters the text fragments. Finally, all extracted words are compared with an existing lexicon, and the text fragments not found in the lexicon are taken as the new word lexicon. Fig. 2 is a flowchart of the industrial new word discovery method.
After the industrial corpus is imported into the system, the data must be uniformly preprocessed. Industrial corpora often contain not only Chinese characters but also large numbers of Arabic numerals, upper- and lower-case English letters, and special punctuation such as ellipses, which hinder the subsequent recognition of industry new words. Industrial research reports, for example, use many numerical values to make their claims more credible and convincing. Given that the maximum fragment length of an industry noun is set to 8 characters, digits and letters easily combine into many 8-character fragments, which often have high adjacent entropy and mutual information. If left unprocessed, these fragments, which have no value for industrial chain map construction, would become entries in the industry new word list.
Further, in an embodiment of the present invention, after obtaining the industrial corpus data corresponding to the target industrial type, the method further includes:
uniformly preprocessing the industrial corpus data, including splitting the industrial corpus data at the boundaries between Chinese and non-Chinese characters so as to remove non-Chinese text and encoding symbols.
The cut corpus is changed into a plurality of short sentences from an original long sentence, and then the subsequent new word recognition work is carried out on the obtained short sentences.
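A minimal sketch of this cutting step follows. It assumes that "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name and the sample sentence are hypothetical and only illustrate how one long sentence becomes several short clauses.

```python
import re
from typing import List

# Assumption: Chinese characters are those in the CJK Unified Ideographs block.
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_chinese_clauses(text: str) -> List[str]:
    """Cut text at every non-Chinese character and keep only the Chinese runs."""
    return _CHINESE_RUN.findall(text)


clauses = split_chinese_clauses("2022年AI芯片市场规模达850亿元，同比增长20%。")
print(clauses)
```

Digits, Latin letters and punctuation act as cut points and are discarded, so fragments that mix numbers and letters never reach the candidate-word stage.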
Further, in one embodiment of the present invention, designing the industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industrial corpus data includes:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store the candidate words;
querying the Trie tree to obtain frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and the left and right information entropy of the fragments that make up the candidate words;
querying the Trie tree to obtain the word frequencies of the candidate words and of their left and right fragments, and calculating the pointwise mutual information from the word frequencies;
calculating a score for each candidate word according to a formula, and filtering out candidate words with lower scores by setting a threshold on the score, to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
[Formula: Figure SMS_36]
wherein [Figure SMS_37] represents the pointwise mutual information, [Figure SMS_38] represents the left and right information entropy of the fragments that make up the candidate word, and [Figure SMS_39] represents the left and right information entropy of the candidate word.
Specifically, the corpus is divided into a single character set, and the character sets are combined two by two to be used as candidate words. Since a prefix and a suffix are required to calculate information entropy, a fragment of length 3 needs to be stored. The present invention uses Trie trees to store data since the search for prefixes and statistics of word frequencies are subsequently involved. And constructing a prefix Trie tree and a suffix Trie tree by using the 3-gram sequence, wherein the Trie tree takes single characters as nodes, and each node records the frequency of forming the vocabulary from the root node to the current node.
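The Trie construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the class and function names are hypothetical, all n-grams up to length 3 are inserted, and the suffix Trie is built by inserting each n-gram reversed.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


class TrieNode:
    """One character in the Trie; `count` is the frequency of the string ending here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0


class NgramTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, gram: str) -> None:
        node = self.root
        for ch in gram:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _find(self, gram: str) -> Optional[TrieNode]:
        node: Optional[TrieNode] = self.root
        for ch in gram:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, gram: str) -> int:
        """Frequency of `gram` (0 if it never occurred)."""
        node = self._find(gram)
        return node.count if node else 0

    def next_counts(self, gram: str) -> Dict[str, int]:
        """Frequency list of the characters that directly extend `gram`."""
        node = self._find(gram)
        return {ch: c.count for ch, c in node.children.items()} if node else {}


def ngrams(sentence: str, max_n: int = 3) -> Iterator[str]:
    for i in range(len(sentence)):
        for n in range(1, min(max_n, len(sentence) - i) + 1):
            yield sentence[i:i + n]


def build_tries(sentences: Iterable[str], max_n: int = 3) -> Tuple[NgramTrie, NgramTrie]:
    """Build a prefix Trie and a suffix Trie (the latter stores reversed n-grams)."""
    prefix, suffix = NgramTrie(), NgramTrie()
    for sentence in sentences:
        for gram in ngrams(sentence, max_n):
            prefix.add(gram)
            suffix.add(gram[::-1])
    return prefix, suffix


prefix, suffix = build_tries(["芯片设计", "芯片制造"])
print(prefix.count("芯片"), prefix.next_counts("芯片"), suffix.next_counts("片"))
```

In this sketch, `prefix.next_counts(w)` yields the right-neighbour frequency list of `w`, and `suffix.next_counts(w[::-1])` yields its left-neighbour frequency list, which are the inputs needed for the entropy calculations.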
The Trie tree is queried to obtain frequency lists of prefixes and suffixes, and the left and right information entropy of the candidate words and of the fragments that make up the candidate words are calculated. Because many information entropies are involved, each is labelled as follows: Candidate is the candidate word, Left is the fragment that forms its left part, and Right is the fragment that forms its right part; h_l_l and h_l_r are the left and right information entropy of the left fragment; h_r_l and h_r_r are the left and right information entropy of the right fragment; and h_l and h_r are the left and right information entropy of the candidate word, as shown in fig. 3.
The Trie tree is queried to obtain the word frequencies of the candidate words and of their left and right fragments. With the word frequencies available, the actual co-occurrence probability P(a, b) and the expected occurrence probabilities P(a) and P(b) are easily obtained, from which the mutual information and the internal cohesion are calculated. The word-formation criteria used in the invention have two main parts: internal cohesion and degree of freedom. Internal cohesion measures how often a word occurs and the degree to which it is a meaningful collocation; the higher the internal cohesion, the more likely the text fragment is a word. The degree of freedom measures how varied the word's left and right neighbours are; the higher the degree of freedom, the more likely the text fragment is a word.
Internal cohesion measures whether a word collocation is reasonable, and is calculated using pointwise mutual information (PMI), an index from computational linguistics. If the PMI is high, that is, if the frequency with which two words co-occur is far greater than the product of their probabilities under free combination, the collocation of the two words is more reasonable. The PMI is calculated as follows:
$$\mathrm{PMI}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$
where $P(a)$, $P(b)$ and $P(a, b)$ denote the occurrence probabilities of a, b and the combination ab in the corpus, respectively.
For words made up of multi-character fragments, the fragment is split into two sub-fragments at each character position, the mutual information of every split is calculated, and the minimum over all splits is taken as the internal cohesion:
$$\mathrm{PMI}(c_1 \dots c_m) = \min_{1 \le i < m} \log \frac{P(c_1 \dots c_m)}{P(c_1 \dots c_i)\,P(c_{i+1} \dots c_m)}$$
where $c_1 \dots c_m$ is a character string of length m and $P(c_1 \dots c_m)$ denotes the frequency of occurrence of $c_1 \dots c_m$.
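The minimum-over-splits calculation can be sketched in a few lines. The sketch assumes natural logarithms and estimates each probability as an n-gram count divided by the total number of single characters; a plain `Counter` stands in for the Trie to keep the example self-contained, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def ngram_counts(sentences: Iterable[str], max_n: int) -> Counter:
    """Count every n-gram of length 1..max_n in the given sentences."""
    counts: Counter = Counter()
    for s in sentences:
        for i in range(len(s)):
            for n in range(1, min(max_n, len(s) - i) + 1):
                counts[s[i:i + n]] += 1
    return counts


def internal_cohesion(word: str, counts: Counter, total: int) -> float:
    """Minimum PMI over all ways of splitting `word` into two sub-fragments."""
    if len(word) < 2 or counts[word] == 0:
        raise ValueError("word must have at least two characters and occur in the corpus")
    p_word = counts[word] / total
    return min(
        math.log(p_word / ((counts[word[:i]] / total) * (counts[word[i:]] / total)))
        for i in range(1, len(word))
    )


counts = ngram_counts(["ab", "ab", "ac", "cb"], max_n=2)
total = sum(c for gram, c in counts.items() if len(gram) == 1)  # number of characters
print(internal_cohesion("ab", counts, total))
```

Taking the minimum means a fragment is only considered cohesive if every possible split point is cohesive, so a string such as a real word plus one stray character is penalised by its weakest split.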
The Trie tree is queried to obtain the left and right adjacent characters of the sub-fragments, and the left and right adjacent entropy of the candidate word is calculated. Looking only at the internal cohesion of a text fragment is not enough; its external behaviour must also be examined as a whole. Assume that the set of left-adjacent characters of a word fragment $w$ is $A$ and the set of right-adjacent characters is $B$. The left and right adjacent entropies are then calculated as follows:
$$h_l(w) = -\sum_{a \in A} P(a \mid w)\,\log P(a \mid w)$$
$$h_r(w) = -\sum_{b \in B} P(b \mid w)\,\log P(b \mid w)$$
where $P(a \mid w)$ is the proportion of occurrences of $w$ immediately preceded by $a$, and $P(b \mid w)$ is the proportion immediately followed by $b$.
The boundary degree of freedom of a candidate word takes the adjacent entropy on both sides into account: only a fragment with a high degree of freedom on both the left and the right is regarded as a reasonable word. Therefore, when candidate words are scored, the smaller of the left and right adjacent entropies is used as the adjacent entropy value. It measures how varied the left and right neighbours of a word are; the larger the entropy, the greater the variety. The degree of freedom is calculated as follows:
$$\mathrm{Free}(w) = \min\big(h_l(w),\, h_r(w)\big)$$
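As a worked illustration, the degree of freedom can be computed directly from neighbour counts. The sketch below assumes natural logarithms and scans the sentences instead of querying a Trie; the function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def neighbor_counts(word: str, sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count the characters immediately to the left and right of each occurrence of `word`."""
    left: Counter = Counter()
    right: Counter = Counter()
    for s in sentences:
        start = s.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[s[start - 1]] += 1
            if end < len(s):
                right[s[end]] += 1
            start = s.find(word, start + 1)
    return left, right


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy of a neighbour frequency list (0.0 if there are no neighbours)."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def degree_of_freedom(word: str, sentences: Iterable[str]) -> float:
    """Smaller of the left and right adjacent entropies."""
    left, right = neighbor_counts(word, list(sentences))
    return min(neighbor_entropy(left), neighbor_entropy(right))


sentences = ["x芯片a", "y芯片b", "z芯片a"]
print(degree_of_freedom("芯片", sentences))
```

Here the left neighbours are three distinct characters while the right neighbours repeat, so the right entropy is the smaller one and determines the result.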
Based on these word-formation characteristics, in practical application the invention calculates a score for each candidate word, indicating how likely it is to be a new word in the current context. The score is calculated as follows:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{l\_r},\, h_{r\_l}) + \min(h_l,\, h_r)$$
The score consists of three parts:
1) The pointwise mutual information $\mathrm{PMI}$: the higher it is, the higher the internal cohesion.
2) The minimum of the entropies of the two constituent fragments, $\min(h_{l\_r}, h_{r\_l})$, taken over the right information entropy of the left fragment and the left information entropy of the right fragment: the larger this value, the less likely the two fragments are to occur together.
3) The minimum of the left and right information entropy of the candidate word, $\min(h_l, h_r)$: the larger this value, the richer the contexts in which the candidate word appears and the more likely it is to be a word.
Thus, a higher score indicates a greater likelihood of word formation. Candidate words whose scores fall below a set threshold are filtered out of the candidate word set, finally yielding the candidate word set for the target field.
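The scoring and thresholding step can be sketched as below. The combination used here (mutual information, minus the smaller fragment entropy, plus the smaller candidate entropy) is an assumption inferred from how the three parts are described; the parameter names reuse the h_l_r, h_r_l, h_l and h_r labels, and the statistics and threshold are made-up illustrative values.

```python
from typing import Dict, Set


def candidate_score(pmi: float, h_l_r: float, h_r_l: float, h_l: float, h_r: float) -> float:
    """Assumed combination: cohesion, minus fragment freedom, plus candidate freedom."""
    return pmi - min(h_l_r, h_r_l) + min(h_l, h_r)


def filter_candidates(stats: Dict[str, Dict[str, float]], threshold: float) -> Set[str]:
    """Keep the candidates whose score reaches the threshold."""
    return {word for word, s in stats.items() if candidate_score(**s) >= threshold}


# Hypothetical statistics for two candidate fragments.
stats = {
    "芯片": dict(pmi=5.0, h_l_r=0.2, h_r_l=0.5, h_l=2.0, h_r=1.5),
    "的芯": dict(pmi=1.0, h_l_r=2.0, h_r_l=1.8, h_l=0.3, h_r=0.1),
}
kept = filter_candidates(stats, threshold=3.0)
print(kept)
```

In this toy data the first fragment is cohesive and appears in varied contexts, while the second is a loose pairing whose parts are freer than the whole, so only the first survives the threshold.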
The candidate word set also contains some common words that should not count as new words of the target field. For this reason, a Chinese stop word list is downloaded from Baidu; stop words are common Chinese words, and any word in the candidate word set that appears in the stop word list is also removed. Meanwhile, words in the candidate word set are not necessarily new relative to the source domain, so words that occur in the source-domain corpus must also be filtered out.
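Both filters amount to set differences, as the following sketch shows. The function name and the sample word lists are hypothetical; in practice the stop words would come from the downloaded list and the source vocabulary from the source-domain corpus.

```python
from typing import Iterable, Set


def filter_new_words(candidates: Iterable[str], stop_words: Set[str],
                     source_vocab: Set[str]) -> Set[str]:
    """Drop candidates that are stop words or already occur in the source domain."""
    return {w for w in candidates if w not in stop_words and w not in source_vocab}


new_words = filter_new_words(
    ["芯片", "我们", "市场", "算力"],
    stop_words={"我们"},
    source_vocab={"市场"},
)
print(new_words)
```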
The resulting industry new word list still contains many junk strings and wrongly segmented strings. Most of these resemble common collocations or word-internal fragments, and an algorithm alone cannot filter out all unreasonable candidate words. Manual review is therefore still required, and the method supports users adding, deleting, looking up and exporting candidate words at any time. Because the new word discovery algorithm has already screened the candidates layer by layer, the results are of higher quality, which greatly reduces the manual workload. The manually reviewed candidate words are stored as new words of the industrial field, so that the industrial chain can subsequently be constructed and iteratively updated based on them.
Based on the above steps, a new vocabulary of the target domain can be obtained.
After the industry new words are extracted, their hierarchical positions in the industrial chain are determined from their meanings and characteristics: entity pairs with an upper-lower relation are sought in the corpus, the hierarchical structure of the industrial chain is built, and the new words are added to the industrial chain. Industrial maps usually focus on upstream-downstream relations between industries. For this purpose, the invention establishes the network of industrial relations as a hierarchical tree, and performs depth expansion and width expansion of the tree using upper-lower relation and parallel semantic relation extraction methods.
Further, in an embodiment of the present invention, determining the relations between the industry new words according to the upper-lower relation and parallel semantic relation extraction methods, and constructing a target industrial chain tree according to the industry new words and the relations between them, includes:
performing depth expansion and width expansion of the target industrial chain tree using the upper-lower relation and parallel semantic relation extraction methods, wherein the width expansion of the target industrial chain tree is performed by a width expansion algorithm and the depth expansion of the target industrial chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, expanding the width of the target industrial chain tree by the width expansion algorithm includes:
using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
[Formula: Figure SMS_58]
wherein [Figure SMS_59] represents the entity, [Figure SMS_60] represents the entity type, and [Figure SMS_61] [Figure SMS_62] represents a returned confidence score;
denoting the sibling similarity of two entities [Figure SMS_63] and [Figure SMS_64] as [Figure SMS_65], and calculating the similarity of the two sibling entities using the matching pattern features:
[Formula: Figure SMS_66]
wherein [Figure SMS_67] represents a skip pattern and [Figure SMS_68] represents the set of skip patterns;
calculating [Figure SMS_69] using the features of the entity and the type, wherein [Figure SMS_70] represents all the acquired features;
obtaining the embedding features [Figure SMS_71] of the two entities through word2vec, and calculating the sibling similarity using a multiplicative measure:
[Formula: Figure SMS_72]
calculating the score of the entity according to the sibling similarity:
[Formula: Figure SMS_73]
and screening the entities according to the scores, thereby expanding the width of the target industrial chain tree.
Further, in an embodiment of the present invention, performing the depth expansion of the target industrial chain tree by the depth expansion algorithm includes:
using [Figure SMS_75] to denote the embedding vector of term [Figure SMS_77]; given a target parent node [Figure SMS_79] and a set of reference edges [Figure SMS_76], wherein [Figure SMS_78] is the parent node of [Figure SMS_80], calculating the score of placing node [Figure SMS_81] under parent node [Figure SMS_74]:
Figure SMS_82
wherein [Figure SMS_83] denotes the cosine similarity between vectors [Figure SMS_84] and [Figure SMS_85];
based on [Figure SMS_86], scoring each candidate entity [Figure SMS_87] and selecting the entities whose scores are above a threshold as the initial child nodes under node [Figure SMS_88], thereby performing the depth expansion of the target industrial chain tree.
Fig. 4 shows two expected width expansion results. Given the set {"upstream support", "midstream platform"}, we want to find their sibling "downstream integration services" and place it under the parent node "artificial intelligence". Similarly, our goal is to find all siblings of {"underlying hardware", "application technology"} and append them under the parent node "upstream support".
This naturally gives rise to a tree width expansion problem, which is solved with a width expansion algorithm. One key component of the width expansion algorithm is the computation of the sibling similarity of two entities [Figure SMS_89] and [Figure SMS_90], denoted [Figure SMS_91]. The method relies mainly on matching parallel semantic patterns. In natural language, certain punctuation marks (such as the enumeration comma "、"), fixed words (such as "or" and "and") or sentence patterns are commonly used to express parallel relations, so matching patterns for parallel semantics can be obtained. First, a weight is assigned between each pair of entity and matching pattern as follows:
Figure SMS_92
wherein [Figure SMS_93] is the raw co-occurrence count between entity e and skip pattern sk, and |V| is the total number of candidate entities.
Similarly, we can define the association weights between entities and types as follows:
Figure SMS_94
wherein [Figure SMS_95] is the confidence score returned by the concept knowledge graph, indicating how confident it is that entity [Figure SMS_96] has type [Figure SMS_97]. The type information of each entity [Figure SMS_98] is obtained by linking the entity to the concept knowledge graph, and the returned types serve as features of the entity. Entities that cannot be linked have no such type features. The invention selects Probase (A Probabilistic Taxonomy), proposed by Microsoft, as the input concept knowledge graph; it maps an entity to different semantic concepts according to the text of the entity and labels each mapping with a corresponding probability.
After this, the similarity of the two sibling entities is calculated using the matching pattern features as follows:
Figure SMS_99
where SK denotes the selected set of matching pattern features. Similarly, [Figure SMS_100] can be calculated using all the type features. Finally, the cosine similarity between the embedded features [Figure SMS_101] of the two entities is used to calculate the similarity between them.
To combine the three similarities, the present invention uses a multiplication metric to calculate sibling similarity, as follows:
Figure SMS_102
Given a set of seed entities S and a list of candidate entities V, each matching pattern feature is first scored by its cumulative strength with the entities in S (i.e. [Figure SMS_103]), and the 200 matching pattern features with the highest scores are selected. On this basis, 10 matching pattern feature subsets [Figure SMS_104], t = 1, 2, …, 10, are formed; each subset [Figure SMS_105] contains 120 matching pattern features.
Given a subset [Figure SMS_106], a candidate entity in V is considered only if it is associated with at least one matching pattern feature in [Figure SMS_107]. The score of each considered entity is calculated as follows:
Figure SMS_108
For each [Figure SMS_111], a ranked list [Figure SMS_114] of candidate entities is obtained based on their scores. We use [Figure SMS_116] to denote the rank of entity [Figure SMS_110] in [Figure SMS_113]; if [Figure SMS_115] does not occur in [Figure SMS_117], we set [Figure SMS_109]. Finally, we calculate the mean reciprocal rank (mrr) of every entity [Figure SMS_112] and add the entities whose average rank is higher than r to the set S.
The key insight of the aggregation mechanism described above is that unrelated entities do not occur frequently in multiple [Figure SMS_118] and thus tend to have a lower mrr score. In the present invention, r = 5.
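As a non-limiting illustration, the rank aggregation described above can be sketched in Python. The sketch assumes that an entity missing from a ranked list contributes nothing to its score, and that "average rank higher than r" is read as a mean reciprocal rank of at least 1/r. The function name aggregate_by_mrr and the toy ranked lists are hypothetical and are not part of the original disclosure.

```python
from collections import defaultdict


def aggregate_by_mrr(ranked_lists, r=5):
    """Aggregate several ranked candidate lists by mean reciprocal rank."""
    total = len(ranked_lists)
    mrr = defaultdict(float)
    for ranked in ranked_lists:
        for position, entity in enumerate(ranked, start=1):
            # An entity absent from a list adds nothing for that list.
            mrr[entity] += 1.0 / position
    for entity in mrr:
        mrr[entity] /= total
    # Assumption: keep entities whose mean reciprocal rank is at least 1/r.
    selected = sorted(
        (e for e, s in mrr.items() if s >= 1.0 / r),
        key=lambda e: -mrr[e],
    )
    return selected, dict(mrr)


# One ranked list per matching pattern feature subset (toy data).
ranked_lists = [
    ["downstream integration services", "chip", "cloud"],
    ["downstream integration services", "cloud", "chip"],
    ["chip", "downstream integration services", "weather"],
]
selected, scores = aggregate_by_mrr(ranked_lists, r=2)
print(selected)
```

An entity such as "weather", which appears in only one list and at a low position, receives a small score and is filtered out, which is the behaviour the aggregation mechanism is meant to produce.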
Nodes newly added to the classification tree (e.g., the node "downstream integration services" in Fig. 4) do not yet have any child nodes, so the width expansion algorithm cannot be applied to them directly. To solve this problem, a depth expansion algorithm obtains the initial child nodes of the target node by considering the relationship between the siblings of the target node and its nieces/nephews. Take the node "downstream integration services" in Fig. 4 as an example. The node is generated by the preceding width expansion algorithm and therefore has no child nodes. Our goal is to find its initial child nodes (e.g., "end devices" and "application software") by modeling the relationship between a sibling of the node "downstream integration services" (e.g., "upstream support") and the nieces/nephews of the node (e.g., "middleware", "operating system").
The depth expansion algorithm relies on term embeddings, which encode the semantics of a term in a dense vector of fixed length. Let v(t) denote the embedding vector of term t. The offset between two term embeddings can represent the relationship between them, which gives v("upstream support") − v("underlying hardware") ≈ v("downstream integration services") − v("application software"). Thus, given a target parent node [Figure SMS_119] and a set of reference edges [Figure SMS_120], wherein [Figure SMS_121] is the parent node of [Figure SMS_122], we calculate the score of placing node [Figure SMS_123] under parent node [Figure SMS_124] as follows:
Figure SMS_125
wherein [Figure SMS_126] denotes the cosine similarity between vectors [Figure SMS_127] and [Figure SMS_128]. Finally, based on [Figure SMS_129], each candidate entity [Figure SMS_130] is scored, and the entities whose scores are above a threshold are selected as the initial child nodes under node [Figure SMS_131].
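As a non-limiting illustration, the offset-based scoring can be sketched in Python. The scoring formula itself is not reproduced in this text, so the sketch assumes that the score is the average cosine similarity between the offset of the candidate edge and the offset of each reference edge. The two-dimensional toy vectors stand in for word2vec embeddings, and the names parent_child_score, emb and the threshold value 0.5 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def offset(u, v):
    """Element-wise difference u - v, used as the relation vector of an edge."""
    return [a - b for a, b in zip(u, v)]


def parent_child_score(emb, parent, candidate, reference_edges):
    """Average similarity between the candidate edge and the reference edges."""
    target = offset(emb[parent], emb[candidate])
    sims = [cosine(target, offset(emb[p], emb[c])) for p, c in reference_edges]
    return sum(sims) / len(sims)


# Toy embeddings (assumption: real vectors would come from word2vec).
emb = {
    "upstream support": [1.0, 0.0],
    "underlying hardware": [1.0, 1.0],
    "downstream integration services": [3.0, 0.0],
    "application software": [3.0, 1.0],
    "midstream platform": [2.0, 0.0],
}
reference_edges = [("upstream support", "underlying hardware")]
parent = "downstream integration services"
candidates = ["application software", "midstream platform"]

scores = {c: parent_child_score(emb, parent, c, reference_edges) for c in candidates}
children = [c for c in candidates if scores[c] > 0.5]
print(children)
```

In this toy setting the candidate whose offset from the target parent parallels the reference edge is kept as an initial child node, while the unrelated candidate is rejected.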
Thus, an industrial chain hierarchical relationship tree of the target field can be obtained.
The upstream-downstream relationships of the industrial chain are the core of the industrial map and tolerate very little error, so they are generally constructed manually by analysts and experts. The invention therefore designs the data storage structure of the industrial map around the upstream-downstream logic of the industrial chain and the association relations between nodes, and provides functions such as visual editing of the industrial map and one-click import of map data. An industrial map data conversion method is designed to simplify the steps: a user can select suitable industry terms from the automatically mined industry new words and quickly build a custom industrial chain, or iteratively update an existing industrial map, which facilitates subsequent fine-grained analysis and forward-looking research and judgment on the industrial map.
Further, in one embodiment of the present invention, the designing the data storage structure of the target industry chain tree for the industry chain upstream and downstream logic and node association relation includes:
designing a parent_id field, and storing a unique identifier of a parent node;
storing all hierarchical ancestor nodes of the current node in a full_path field, represented as a concatenated string of the form id#id#id….
Specifically, in the industrial map application scenario, attention is usually on the upstream-downstream relationships and the hierarchical dependencies between industry nodes. The hierarchy is generally not deep, basically within ten levels, so the goal of the database design is to store a multi-level structure and to retrieve a complete branch simply and efficiently. For this hierarchy of limited depth but large data volume, a parent_id field is designed to store the unique identifier of the parent node, so that the parent of a node can be obtained quickly and the industrial map tree can be obtained by recursive query. For visual analysis of the industrial map, displaying the map of a given field frequently requires extracting all descendant nodes of a given node. With parent_id alone, assembling a deep tree requires many database queries, which is very inefficient. To improve efficiency, a full_path field stores all hierarchical ancestor nodes of the current node as a concatenated string of the form id#id#id…. A node and its descendants can then be matched with a prefix LIKE query, the level of each node in the tree can be read from its path, and the tree can be assembled more conveniently and efficiently in the application code layer. When the relationships of nodes in a tree are updated, only the full_path fields of the affected node and its descendants need to be maintained. This design supports querying and packaging the industrial map structure data and also makes the hierarchical relationships between a node and its descendants easy to maintain. The main fields of the database table are shown in Table 1.
TABLE 1
Figure SMS_132
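As a non-limiting illustration, the parent_id and full_path design can be sketched with Python's built-in sqlite3 module. The table name industry_node, the helper names add_node and subtree, and the choice to end each full_path with the node's own id are assumptions made for the sketch; the actual field list is the one given in Table 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE industry_node (
           id        INTEGER PRIMARY KEY,
           name      TEXT NOT NULL,
           parent_id INTEGER,
           full_path TEXT NOT NULL
       )"""
)


def add_node(name, parent_id=None):
    """Insert a node and derive its full_path from the parent's full_path."""
    cur = conn.execute(
        "INSERT INTO industry_node (name, parent_id, full_path) VALUES (?, ?, '')",
        (name, parent_id),
    )
    node_id = cur.lastrowid
    if parent_id is None:
        path = str(node_id)
    else:
        (parent_path,) = conn.execute(
            "SELECT full_path FROM industry_node WHERE id = ?", (parent_id,)
        ).fetchone()
        path = f"{parent_path}#{node_id}"
    conn.execute("UPDATE industry_node SET full_path = ? WHERE id = ?", (path, node_id))
    return node_id


def subtree(node_id):
    """Return (name, depth) for a node and all its descendants in one query."""
    (path,) = conn.execute(
        "SELECT full_path FROM industry_node WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT name, full_path FROM industry_node "
        "WHERE full_path = ? OR full_path LIKE ? ORDER BY full_path",
        (path, path + "#%"),
    ).fetchall()
    # The depth of a node is the number of separators in its path.
    return [(name, p.count("#")) for name, p in rows]


ai = add_node("artificial intelligence")
up = add_node("upstream support", ai)
hw = add_node("underlying hardware", up)
mid = add_node("midstream platform", ai)
print(subtree(up))
```

The prefix pattern includes the trailing "#" so that the path "1#2" does not also match an unrelated path such as "1#20".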
To facilitate understanding and analysis of industrial map data, the invention presents the map data in a graph layout. In Web application development, JavaScript tree controls based on Ajax technology are widely used. The invention is implemented with the AntV G6 graph visualization engine, which provides basic graph visualization capabilities such as graph drawing, rendering, element configuration, layout, interaction and animation, and thereby supports displaying and editing the hierarchical data of the industrial map. Users can add, delete and modify nodes and edges, change the upper-lower relationships of industrial map nodes by dragging, and click a node to configure the concept and attributes of the industry node entity, such as the entity definition and the field it belongs to, which improves the flexibility and extensibility of the industrial map.
The method uses a Tree Diff algorithm to compare the nodes of the new tree and the old tree and determine their differences, thereby identifying the nodes that need to be updated, forming patch data and sending it to the server. The invention adopts a depth-first strategy, which ensures that the ancestors of a child node are already up to date when the child node is modified. The comparison of new and old nodes serves three database maintenance operations: creating new nodes, deleting discarded nodes and updating existing nodes. Each editing action of the user is held temporarily in the front end. The front-end "add", "modify" and "delete" actions do not operate on the database directly but mark the data with a status; the data to be added, modified and deleted are placed in an add object, an update object and a delete object respectively, and the classified data is sent to the server when "save" is clicked. The specific steps are as follows:
(1) If the node content has no id attribute, the node is considered newly added and is put into the add object. The unique node identifier id is generated automatically when the node is inserted into the database, and the server places the id in the node content and returns it to the browser, so every existing node has an id attribute.
(2) If the node content has an id attribute, compare whether all attribute values of the new and old nodes, except the children, are consistent:
1) if the attribute values are consistent, the node does not need to be modified;
2) if the attribute values are inconsistent, put the node into the update object and reassign its parent_id and full_path.
(3) Compare the child nodes of the new node and the old node:
1) if only the new node has child nodes, go to step (1) for those child nodes;
2) if only the old node has child nodes, the new node is considered to have discarded them, so the child nodes of the old node need to be deleted and are put into the delete object;
3) if both the new node and the old node have child nodes, compute the intersection of the two child node sets, that is, the child nodes with the same id. The nodes in the intersection proceed to the next round of comparison, starting again from step (1). Child nodes of the new node that are not in the intersection are considered newly added and are put into the add object; child nodes of the old node that are not in the intersection are put into the delete object.
FIG. 5 is a schematic diagram of patch data generation for a new industrial chain. After receiving the request and the patch data, the server performs batch insert, update and delete operations on the database: it inserts the data in the add object, modifies the data in the update object and deletes the data in the delete object.
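As a non-limiting illustration, the comparison steps above can be sketched in Python. Nodes are modelled as dictionaries with an optional "id", arbitrary attributes and an optional "children" list; the function name diff_trees and the shape of the patch entries are hypothetical. The sketch follows the steps literally, so a node dragged to a different parent is reported as a deletion under the old parent and an addition under the new one.

```python
def _fields(node):
    """All attributes of a node except its children."""
    return {k: v for k, v in node.items() if k != "children"}


def diff_trees(old, new, patch):
    """Depth-first comparison of two nodes that share the same id."""
    # Step (2): compare every attribute except the children.
    if _fields(old) != _fields(new):
        patch["update"].append(_fields(new))

    # Step (3): compare the two child sets by id.
    old_by_id = {c["id"]: c for c in old.get("children", [])}
    kept = set()
    for child in new.get("children", []):
        child_id = child.get("id")
        if child_id in old_by_id:
            # In the intersection: compare recursively.
            kept.add(child_id)
            diff_trees(old_by_id[child_id], child, patch)
        else:
            # Step (1): no id, or id unknown under this parent, so newly added.
            patch["add"].append({"parent_id": new["id"], "node": child})
    for child_id in old_by_id:
        if child_id not in kept:
            patch["delete"].append(child_id)
    return patch


old_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream support", "children": [
        {"id": 4, "name": "underlying hardware"}]},
    {"id": 3, "name": "midstream platform"},
]}
new_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream support layer", "children": [
        {"id": 4, "name": "underlying hardware"},
        {"name": "application technology"}]},
]}
patch = diff_trees(old_tree, new_tree, {"add": [], "update": [], "delete": []})
print(patch)
```

Only the id of a discarded child is recorded in the delete list; under the full_path design its descendants could be removed on the server with a single prefix match.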
In addition to visual editing, the platform also provides a one-click import function for industrial maps: a user can create or update an industrial map by importing an Excel table. The core implementation steps are as follows:
(1) Read the Excel file. The node-xlsx module for Node.js is used to read and write the Excel file stream. The module reads the sheet row by row, so the data read is a two-dimensional array, and cells belonging to merged rows or columns are read as NULL. For the sheet shown in Fig. 6, the reading result is shown in Fig. 7.
(2) Convert the valid data in each row into a hierarchical tree structure. The two-dimensional array read by the Node.js script can be converted into a nested structure, where the length of each row array is the maximum depth of that row. As shown in Fig. 7, the first three values of the second row are NULL, which means they are the same as the first three values of the first row, so only the nested object generated from the second row needs to be merged with the nested object generated from the first row. The third row is merged with the second row in the same way, and so on until the complete tree is obtained. During implementation, when several rows contain data at the same level, it is not immediately clear under which item the current data should be inserted. Observation of the data shows that the insertion level is always correct if, at each insertion, the depth of the current item is obtained and the item is inserted under the most recently inserted parent one level above that depth. A depth-first search is therefore used to find the parent object one level above the current level, which gives the parent-child relationship; combining the full path full_path and the industry field root_id of the parent node, a new node can be created and stored in the database, as shown in Fig. 8. At the same time, the current object is appended as the last item of the children array of the parent. As shown in the figure, the JSON object generated from the first three items is merged with the fourth item, giving the industrial map JSON structure from the current node to the root node, as shown in Fig. 9. This matches the hierarchical data structure required by the front-end tree control and is convenient for front-end visualization.
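As a non-limiting illustration, the conversion from the row-wise two-dimensional array to a nested tree can be sketched in Python, with None standing in for the NULL values of merged cells. Instead of a depth-first search for the parent, the sketch keeps the most recently inserted node at each depth, which yields the same parent under the insertion rule described above; this substitution, the function name rows_to_tree and the sample rows are assumptions.

```python
def rows_to_tree(rows):
    """Build nested {"name", "children"} nodes from row-wise sheet data."""
    roots = []
    last_at_depth = []  # last_at_depth[d] is the newest node at depth d
    for row in rows:
        for depth, value in enumerate(row):
            if value is None:
                # Merged cell: the value is inherited from an earlier row.
                continue
            node = {"name": value, "children": []}
            if depth == 0:
                roots.append(node)
            else:
                # Attach under the most recently inserted node one level up.
                last_at_depth[depth - 1]["children"].append(node)
            # Deeper entries belong to the previous branch, so drop them.
            del last_at_depth[depth:]
            last_at_depth.append(node)
    return roots


rows = [
    ["artificial intelligence", "upstream support", "underlying hardware"],
    [None, None, "application technology"],
    [None, "midstream platform", "operating system"],
]
tree = rows_to_tree(rows)
root = tree[0]
print([child["name"] for child in root["children"]])
```

The sketch assumes well-formed input in which every non-empty cell has a parent in an earlier column; a row that skips a level would need explicit validation.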
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers core services such as rapid construction of industrial maps, discovery of industry new words, extraction of industry hierarchical relations and iterative updating. It addresses the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industrial maps, enables a user to quickly generate the industrial map of a given category according to their needs, balances the automated processing flow with manual intervention, and improves the extensibility and development efficiency of the application.
In order to realize the embodiment, the invention also provides an industrial chain construction and iteration expansion development device.
Fig. 10 is a schematic structural diagram of an industrial chain construction and iterative expansion development device according to an embodiment of the present invention.
As shown in fig. 10, the industrial chain construction and iterative expansion development apparatus includes: an acquisition module 100, a screening module 200, a construction module 300, an update module 400, wherein,
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
The construction module is used for determining the relation between the industrial new words according to the upper-lower relation and the parallel semantic relation extraction method and constructing a target industrial chain tree according to the industrial new words and the relation between the industrial new words;
and the updating module is used for designing the data storage structure of the target industrial chain tree for the upstream-downstream logic of the industrial chain and the node association relations, and for performing iterative updating based on the original industrial chain tree through the data storage structure.
To achieve the above object, an embodiment of the present invention provides a computer device, which is characterized by comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the industrial chain construction and iterative expansion development method as described above when executing the computer program.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the industrial chain construction and iterative expansion development method as described above.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (7)

1. The industrial chain construction and iterative expansion development method is characterized by comprising the following steps of:
acquiring a target industry type input by a user, and acquiring industrial corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relations between the industry new words according to the upper-lower relation and parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industry new words and the relations between them, wherein the depth expansion and the width expansion of the target industrial chain tree are performed by the upper-lower relation and parallel semantic relation extraction method, the width expansion of the target industrial chain tree is performed by a width expansion algorithm, and the depth expansion of the target industrial chain tree is performed by a depth expansion algorithm;
The method comprises the steps of designing a data storage structure of a target industrial chain tree aiming at the upstream and downstream logic and the node association relation of the industrial chain, and carrying out iterative updating based on the original industrial chain tree through the data storage structure;
the width expansion of the target industrial chain tree by a width expansion algorithm comprises the following steps: using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
Figure QLYQS_1
wherein [Figure QLYQS_2] denotes the entity and [Figure QLYQS_3] denotes the entity type; denoting the sibling similarity of two entities [Figure QLYQS_4] and [Figure QLYQS_5] as [Figure QLYQS_6], the similarity of the two sibling entities is calculated using the matching pattern features:
Figure QLYQS_7
wherein [Figure QLYQS_8] denotes a skip pattern and [Figure QLYQS_9] denotes the set of skip patterns,
calculating [Figure QLYQS_10] using the features of the entity and the type, wherein [Figure QLYQS_11] denotes all the acquired features,
acquiring the embedded features [Figure QLYQS_12] of the two entities through word2vec, and calculating the sibling similarity using a multiplication metric:
Figure QLYQS_13
calculating the score of the entity according to the sibling similarity:
Figure QLYQS_14
screening the entities according to the scores, so as to expand the width of the target industrial chain tree;
the depth expansion of the target industrial chain tree by the depth expansion algorithm comprises the following steps:
using [Figure QLYQS_17] to denote the embedding vector of term [Figure QLYQS_19]; given a target parent node [Figure QLYQS_20] and a set of reference edges [Figure QLYQS_16], wherein [Figure QLYQS_18] is the parent node of [Figure QLYQS_21], calculating the score of placing node [Figure QLYQS_22] under parent node [Figure QLYQS_15]:
Figure QLYQS_23
wherein [Figure QLYQS_24] denotes the cosine similarity between vectors [Figure QLYQS_25] and [Figure QLYQS_26],
based on [Figure QLYQS_27], scoring each candidate entity [Figure QLYQS_28] and selecting the entities whose scores are above a threshold as the initial child nodes under node [Figure QLYQS_29], thereby performing the depth expansion of the target industrial chain tree.
2. The method of claim 1, further comprising, after obtaining the industrial corpus data corresponding to the target industrial type:
the unified preprocessing of the industrial corpus data comprises the steps of cutting the industrial corpus data according to Chinese characters and non-Chinese characters to remove the words and coding symbols.
3. The method of claim 1, wherein the designing an industry new word discovery algorithm to perform an unsupervised pre-segmentation on the industry corpus data comprises:
dividing the industrial corpus data into a set of single characters, and combining the characters in the set into candidate words;
constructing a Trie tree to store candidate words;
inquiring the Trie, acquiring a frequency list of a prefix and a suffix, and calculating left and right information entropy of the candidate words and left and right information entropy of fragments formed by the candidate words;
querying the Trie, acquiring the word frequencies of the candidate words and of their left and right fragments, and calculating the pointwise mutual information according to the word frequencies;
calculating the score of the candidate words according to a formula, and filtering the candidate words with lower scores by setting a threshold value for the score to obtain a candidate word set in the target field, wherein the formula is expressed as follows:
Figure QLYQS_30
wherein [Figure QLYQS_31] denotes the pointwise mutual information, [Figure QLYQS_32] denotes the left and right information entropy of the fragments composing the candidate word, and [Figure QLYQS_33] denotes the left and right information entropy of the candidate word.
4. The method of claim 1, wherein the step of designing the data storage structure of the target industry chain tree for the industry chain upstream and downstream logic and node association relationship comprises:
designing a parent_id field, and storing a unique identifier of a parent node;
storing all hierarchical ancestor nodes of the current node in a full_path field, represented as a concatenated string of the form id#id#id….
5. The industrial chain construction and iteration expansion development device is characterized by comprising the following modules:
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industrial new word discovery algorithm to perform unsupervised pre-segmentation on the industrial corpus data to obtain industrial new words;
The construction module is used for determining the relations between the industry new words according to the upper-lower relation and parallel semantic relation extraction method, and constructing a target industrial chain tree according to the industry new words and the relations between them, wherein the depth expansion and the width expansion of the target industrial chain tree are performed by the upper-lower relation and parallel semantic relation extraction method, the width expansion of the target industrial chain tree is performed by a width expansion algorithm, and the depth expansion of the target industrial chain tree is performed by a depth expansion algorithm;
the updating module is used for carrying out iterative updating based on the original industrial chain tree through the data storage structure by designing the data storage structure of the target industrial chain tree aiming at the industrial chain upstream and downstream logic and the node association relation;
the width expansion of the target industrial chain tree by a width expansion algorithm comprises the following steps: using an entity to represent an industry new word, using a type to represent the part of speech of the industry new word, and defining the association weight between the entity and the type:
Figure QLYQS_34
wherein [Figure QLYQS_35] denotes the entity and [Figure QLYQS_36] denotes the entity type; denoting the sibling similarity of two entities [Figure QLYQS_37] and [Figure QLYQS_38] as [Figure QLYQS_39], the similarity of the two sibling entities is calculated using the matching pattern features:
Figure QLYQS_40
wherein [Figure QLYQS_41] denotes a skip pattern and [Figure QLYQS_42] denotes the set of skip patterns,
calculating [Figure QLYQS_43] using the features of the entity and the type, wherein [Figure QLYQS_44] denotes all the acquired features,
acquiring the embedded features [Figure QLYQS_45] of the two entities through word2vec, and calculating the sibling similarity using a multiplication metric:
Figure QLYQS_46
calculating the score of the entity according to the sibling similarity:
Figure QLYQS_47
screening the entities according to the scores, so as to expand the width of the target industrial chain tree;
the depth expansion of the target industrial chain tree by depth expansion comprises the following steps:
by using
Figure QLYQS_49
Representation item->
Figure QLYQS_51
Is given a target parent node +.>
Figure QLYQS_54
A set of reference edges
Figure QLYQS_50
Wherein->
Figure QLYQS_52
Is->
Figure QLYQS_53
Is to calculate the node +_>
Figure QLYQS_55
Put in father node->
Figure QLYQS_48
Scoring of:
Figure QLYQS_56
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure QLYQS_57
representation vector->
Figure QLYQS_58
And->
Figure QLYQS_59
The degree of cosine similarity between the two,
and, based on
Figure QLYQS_60
scoring each candidate entity
Figure QLYQS_61
and selecting the entities whose scores exceed a threshold as the initial child nodes under node
Figure QLYQS_62
thereby performing depth expansion of the target industrial chain tree.
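The width and depth expansion steps of claim 5 can be illustrated with a minimal Python sketch. The claim's formulas survive here only as image placeholders, so every functional form below is an assumption rather than the claimed formula: pattern similarity is taken to be a weighted Jaccard overlap of feature weights, the multiplicative measure is taken to be the square root of (1 + pattern similarity) × (1 + embedding cosine), and the depth score is taken to be the mean cosine similarity between the candidate edge's embedding offset and each reference edge's offset. The function names and the plain-list vectors (standing in for word2vec embeddings) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pattern_similarity(f1: Dict[str, float], f2: Dict[str, float]) -> float:
    """Weighted Jaccard overlap of two feature-weight maps (assumed form)."""
    keys = set(f1) | set(f2)
    num = sum(min(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    den = sum(max(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def sibling_similarity(f1, f2, v1: Vector, v2: Vector) -> float:
    """Multiplicative combination of feature and embedding similarity (assumed)."""
    product = (1.0 + pattern_similarity(f1, f2)) * (1.0 + cosine(v1, v2))
    return math.sqrt(max(0.0, product))  # clamp guards against rounding below zero


def width_scores(candidates: List[str], siblings: List[str],
                 features: Dict[str, Dict[str, float]],
                 vectors: Dict[str, Vector]) -> Dict[str, float]:
    """Score each candidate by its mean sibling similarity to the existing siblings."""
    if not siblings:
        return {c: 0.0 for c in candidates}
    return {
        c: sum(sibling_similarity(features[c], features[s], vectors[c], vectors[s])
               for s in siblings) / len(siblings)
        for c in candidates
    }


def depth_score(parent_vec: Vector, child_vec: Vector,
                reference_edges: List[Tuple[Vector, Vector]]) -> float:
    """Mean cosine between the candidate edge offset and each reference edge offset."""
    if not reference_edges:
        return 0.0
    offset = [p - c for p, c in zip(parent_vec, child_vec)]
    sims = [cosine(offset, [p - c for p, c in zip(ref_parent, ref_child)])
            for ref_parent, ref_child in reference_edges]
    return sum(sims) / len(sims)


def select_children(parent_vec: Vector, candidates: Dict[str, Vector],
                    reference_edges: List[Tuple[Vector, Vector]],
                    threshold: float) -> List[str]:
    """Keep candidates whose depth score exceeds the threshold."""
    return [name for name, vec in candidates.items()
            if depth_score(parent_vec, vec, reference_edges) > threshold]
```

In this sketch, the candidates returned by `width_scores` would be ranked or thresholded to add sibling nodes, and `select_children` would supply the initial child nodes under a target parent. The threshold value and the ranking policy are left open, as the claim does not fix them.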
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the industrial chain construction and iterative expansion development method of any one of claims 1-4 when the computer program is executed.
7. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the industrial chain construction and iterative expansion development method according to any one of claims 1-4.
CN202310260247.6A 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method Active CN115982390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310260247.6A CN115982390B (en) 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method

Publications (2)

Publication Number Publication Date
CN115982390A CN115982390A (en) 2023-04-18
CN115982390B CN115982390B (en) 2023-06-23

Family

ID=85968496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310260247.6A Active CN115982390B (en) 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method

Country Status (1)

Country Link
CN (1) CN115982390B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975626B (en) * 2023-06-09 2024-04-19 浙江大学 Automatic updating method and device for supply chain data model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955950A (en) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and device
CN111897917B (en) * 2020-07-28 2023-06-16 成都灵尧科技有限责任公司 Rail transit industry term extraction method based on multi-modal natural language features
CN112860692B (en) * 2021-01-29 2023-07-25 城云科技(中国)有限公司 Database table structure conversion method and device and electronic equipment thereof
CN113779200A (en) * 2021-09-14 2021-12-10 中国电信集团系统集成有限责任公司 Target industry word stock generation method, processor and device
CN114757147A (en) * 2022-04-02 2022-07-15 辽宁工程技术大学 BERT-based automatic hierarchical tree expansion method
CN114742061A (en) * 2022-04-26 2022-07-12 平安国际智慧城市科技股份有限公司 Text processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115982390A (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant