CN115982390A - Industrial chain construction and iterative expansion development method - Google Patents

Publication number
CN115982390A
Authority
CN
China
Prior art keywords
industry
industrial
target
words
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310260247.6A
Other languages
Chinese (zh)
Other versions
CN115982390B (en)
Inventor
鄂海红
宋美娜
梁月梅
周文安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310260247.6A
Publication of CN115982390A
Application granted
Publication of CN115982390B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides an industrial chain construction and iterative expansion development method, which comprises the steps of obtaining a target industrial type input by a user, and obtaining industrial corpus data corresponding to the target industrial type; designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on industry corpus data to obtain an industry new word; determining the relation between the new industrial words according to the superior-subordinate relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words; the data storage structure of the target industrial chain tree is designed according to the upstream and downstream logics of the industrial chain and the node incidence relation, and iterative updating is carried out on the basis of the original industrial chain tree through the data storage structure. By the method provided by the invention, the construction and updating efficiency of the industrial map is greatly improved.

Description

Industrial chain construction and iterative expansion development method
Technical Field
The invention belongs to the technical field of data visualization and data application.
Background
At present, analyzing an industry requires constructing an industry chain map for it. The construction process usually involves manually consulting a large amount of industry data, which makes it complex.
An industry chain needs to be sufficiently reusable, iterable, and extensible. Industries are dynamic, and new industries continually emerge as they develop. How to mine the new words appearing in an industry, how to obtain the hierarchical relations among industry terms, and how to merge these changes into the existing industry map data so that the map stays up to date are all major challenges.
Meanwhile, industry chains are highly subjective. Multiple industry standards currently exist, different websites and institutions classify the same industry term under different industries, and different people understand the construction of an industry chain, the types of its nodes and relations, and its granularity differently. Different settings directly lead to different application results. In the prior art, there is no general development method for discovering new industry words and building customized industry chains, so engineering development efficiency has not improved.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the invention aims to provide an industrial chain construction and iterative expansion development method that addresses the low accuracy of manual data analysis and the cumbersome construction and expansion of existing industry maps.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for building an industrial chain and iteratively expanding and developing, including:
acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the new industrial words according to the superior-subordinate relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words;
and designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
In addition, the method for building and iteratively expanding the development of the industrial chain according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, after acquiring the industry corpus data corresponding to the target industry type, the method further includes:
and performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises splitting the industrial corpus data at the boundaries between Chinese-character and non-Chinese-character runs, and removing modal particles and encoding symbols.
Further, in an embodiment of the present invention, the designing an industry new word discovery algorithm performs unsupervised pre-segmentation on the industry corpus data, including:
dividing the industry corpus data into a single character set, and combining every two characters in the set to serve as candidate words;
constructing a Trie tree storage candidate word;
inquiring the Trie tree, acquiring a frequency list of prefixes and suffixes, and calculating left and right information entropies of the candidate words and left and right information entropies of segments formed by the candidate words;
inquiring the Trie tree, acquiring the word frequency of the candidate words and the word frequencies of the left and right segments, and calculating mutual information between points according to the word frequencies;
calculating the scores of the candidate words according to a formula, and filtering out the candidate words with lower scores by setting a threshold on the scores, to obtain a candidate word set of the target field, wherein the formula is expressed as:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
wherein $\mathrm{PMI}$ represents the pointwise mutual information, $h_{r\_l}$ and $h_{l\_r}$ represent the left information entropy of the right segment and the right information entropy of the left segment that make up the candidate word, and $h_l$ and $h_r$ represent the left and right information entropies of the candidate word.
Further, in an embodiment of the present invention, the determining the relation between the industry new words according to the superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words includes:
performing depth expansion and width expansion of the target industry chain tree by the superior-subordinate relation and parallel semantic relation extraction method, wherein the width expansion of the target industry chain tree is performed by a width expansion algorithm, and the depth expansion of the target industry chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, the performing width expansion of the target industry chain tree by a width expansion algorithm includes:
representing an industry new word by an entity, representing the part of speech of the industry new word by a type, and defining the association weight between the entity and the type:
[Figure SMS_5]
wherein [Figure SMS_6] represents an entity, [Figure SMS_7] represents an entity type, and [Figure SMS_8] [Figure SMS_9] is the returned confidence score;
denoting the sibling similarity of two entities [Figure SMS_10] and [Figure SMS_11] as [Figure SMS_12], and calculating the similarity of two sibling entities based on matching pattern features:
[Figure SMS_13]
wherein [Figure SMS_14] represents a skip pattern and [Figure SMS_15] represents the set of skip patterns;
performing feature computation using the entity and the type: [Figure SMS_16], wherein [Figure SMS_17] represents all the acquired features;
obtaining the embedded features of the two entities via word2vec, [Figure SMS_18], and computing the sibling similarity with a multiplicative metric:
[Figure SMS_19]
calculating the score of the entity according to the sibling similarity:
[Figure SMS_20]
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
Further, in an embodiment of the present invention, the performing the depth expansion of the target industry chain tree by the depth expansion algorithm includes:
using [Figure SMS_22] to represent an item [Figure SMS_24]; given a target parent node [Figure SMS_26] and a set of reference edges [Figure SMS_23], wherein [Figure SMS_25] is the parent node of [Figure SMS_27], calculating the score of placing a node [Figure SMS_28] under the parent node [Figure SMS_21]:
[Figure SMS_29]
wherein [Figure SMS_30] represents the cosine similarity between the vectors [Figure SMS_31] and [Figure SMS_32];
based on [Figure SMS_33], scoring each candidate entity [Figure SMS_34], and selecting the entities whose scores are above a threshold as the initial child nodes under the node [Figure SMS_35], thereby performing depth expansion on the target industry chain tree.
Further, in an embodiment of the present invention, the designing the data storage structure of the target industry chain tree according to the industry chain upstream and downstream logic and the node association relationship includes:
designing a parent_id field that stores the unique identifier of the parent node;
storing all ancestor nodes of the current node, at every level, in a full_path field, represented as a concatenated string of the form id#id….
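The two storage fields can be illustrated with a minimal sketch. The class name `ChainNode`, the helper functions, and the choice to list ancestors root-first are assumptions made for illustration; only the `parent_id` and `full_path` fields and the `#` separator come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChainNode:
    """One node of an industry chain tree (hypothetical record layout)."""
    id: int
    name: str
    parent_id: Optional[int]  # unique identifier of the parent, None for the root
    full_path: str            # ancestor ids joined by '#', root first (assumed order)


def make_child(parent: Optional[ChainNode], node_id: int, name: str) -> ChainNode:
    """Create a node and derive its full_path from the parent's full_path."""
    if parent is None:
        return ChainNode(node_id, name, None, "")
    if parent.full_path:
        path = f"{parent.full_path}#{parent.id}"
    else:
        path = str(parent.id)
    return ChainNode(node_id, name, parent.id, path)


def is_descendant(node: ChainNode, ancestor_id: int) -> bool:
    """Check ancestry with one string split, without walking the tree."""
    return str(ancestor_id) in node.full_path.split("#")


root = make_child(None, 1, "artificial intelligence")
upstream = make_child(root, 2, "upstream support")
hardware = make_child(upstream, 5, "underlying hardware")
```

Storing the whole ancestor chain on each node means a subtree can be selected by matching on `full_path` alone, which is one plausible reason for keeping both fields.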
In order to achieve the above object, a second aspect of the present invention provides an apparatus for building an industrial chain and iteratively expanding development, including the following modules:
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain an industry new word;
the construction module is used for determining the relation between the industry new words according to the superior-subordinate relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating based on the original industrial chain tree through the data storage structure.
In order to achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement an industry chain building and iterative expansion development method as described above.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to implement an industry chain building and iterative expansion development method as described above when executed by a processor.
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers core services such as rapid construction of an industry map, discovery of new industry words, extraction of industry hierarchical relations, and iterative updating. It addresses the low accuracy of manual data analysis and the cumbersome construction and expansion of current industry maps, and can conveniently and rapidly generate and expand an industry map under the corresponding category according to the industry map requirements, thereby balancing automated processing against manual intervention and improving the extensibility and development efficiency of applications.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a method for building an industrial chain and developing an iterative extension according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a method for discovering an industry new word according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of left and right information entropies of candidate words and left and right information entropies of candidate word constituent segments according to an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating an overview of a hierarchical tree expansion algorithm process according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of generating patch data of a new industry chain according to an embodiment of the present invention.
Fig. 6 to 9 are schematic diagrams illustrating an industry map importing implementation process according to an embodiment of the present invention.
Fig. 10 is a flowchart illustrating an industrial chain building and iterative expansion developing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
The industrial chain construction and iterative extension development method of the embodiment of the present invention is described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for building an industrial chain and developing an iterative extension according to an embodiment of the present invention.
As shown in fig. 1, the method for building and iteratively expanding the industry chain includes the following steps:
S101: acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
S102: designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
S103: determining the relation between the new industrial words according to the superior-subordinate relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words;
S104: designing a data storage structure of the target industrial chain tree according to the upstream and downstream logic of the industrial chain and the node association relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
The invention adopts an unsupervised method. Using a statistical strategy based on the common characteristics of words, it extracts from a large-scale corpus all text segments that could form words, cutting the corpus into a number of text segments; this amounts to a rough word segmentation. Linguistic knowledge is then used to eliminate useless segments that are not new words: the degree of association is calculated, the character combinations with the highest association are found, and the text segments are cleaned and filtered once. Finally, all extracted words are compared with an existing lexicon, and the text segments not found in the lexicon are taken as the new-word lexicon. Fig. 2 is a flowchart of the industry new word discovery method.
After the industrial corpus is imported into the system, the data must be preprocessed uniformly. Industrial corpora often contain not only Chinese characters but also many Arabic numerals, uppercase and lowercase English letters, and special punctuation such as ellipses, which hinders the subsequent recognition of new industry words. Take an industry research report as an example: such a report uses many numerical values to strengthen its credibility and persuasiveness. If the maximum segment length of an industry noun is set to 8 characters, the numerals and letters very easily combine into many 8-character segments. These segments often have high adjacent entropy and mutual information, and if they are not handled, segments of no value for constructing an industry chain map become entries in the industry new word list.
Further, in an embodiment of the present invention, after acquiring the industry corpus data corresponding to the target industry type, the method further includes:
and performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises splitting the industrial corpus data at the boundaries between Chinese-character and non-Chinese-character runs, and removing modal particles and encoding symbols.
After splitting, each original long sentence becomes several short clauses, and the subsequent new word recognition is carried out on these clauses.
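The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions: the particle list `MODAL_PARTICLES` is a hypothetical sample rather than a curated resource, and "Chinese character" is approximated by the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical sample of modal particles; a real system would use a curated list.
MODAL_PARTICLES = set("啊吗呢吧呀哦嘛啦")

# A run of Chinese characters (basic CJK Unified Ideographs block only, an assumption).
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_into_clauses(text: str) -> list[str]:
    """Cut text at Chinese/non-Chinese boundaries and drop modal particles.

    Digits, Latin letters, punctuation, and encoding symbols never match
    CHINESE_RUN, so they act as separators and are discarded.
    """
    clauses = []
    for run in CHINESE_RUN.findall(text):
        cleaned = "".join(ch for ch in run if ch not in MODAL_PARTICLES)
        if cleaned:
            clauses.append(cleaned)
    return clauses


clauses = split_into_clauses("2023年AI芯片出货量增长35%，前景很好啊！")
```

Because numerals and letters are removed before candidate generation, the spurious mixed digit-letter segments described above cannot enter the candidate set.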
Further, in an embodiment of the present invention, the designing industry new word discovery algorithm performs unsupervised pre-segmentation on the industry corpus data, including:
dividing the industrial corpus data into a single character set, and combining every two characters in the set to serve as candidate words;
constructing a Trie tree storage candidate word;
inquiring the Trie tree, acquiring a frequency list of prefixes and suffixes, and calculating left and right information entropies of the candidate words and left and right information entropies of fragments formed by the candidate words;
inquiring the Trie tree, acquiring the word frequency of the candidate words and the word frequencies of the left and right segments, and calculating mutual information between points according to the word frequencies;
calculating the scores of the candidate words according to a formula, and filtering out the candidate words with lower scores by setting a threshold on the scores, to obtain a candidate word set of the target field, wherein the formula is expressed as:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
wherein $\mathrm{PMI}$ represents the pointwise mutual information, $h_{r\_l}$ and $h_{l\_r}$ represent the left information entropy of the right segment and the right information entropy of the left segment that make up the candidate word, and $h_l$ and $h_r$ represent the left and right information entropies of the candidate word.
Specifically, the corpus is divided into a set of single characters, and the characters are combined in pairs to serve as candidate words. Because a prefix and a suffix are needed to calculate the information entropy, segments of length 3 must be stored. Because prefix and suffix lookups and word frequency statistics follow, the invention stores the data in Trie trees. A prefix Trie tree and a suffix Trie tree are constructed from the 3-gram sequences; each Trie tree takes a single character as a node, and each node records the occurrence frequency of the word formed along the path from the root node to that node.
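A minimal sketch of such a pair of Trie trees is given below. The class and function names are hypothetical, and the choice to insert every n-gram of length 1 to 3 (so that each node's count equals the frequency of the string it spells) is one possible reading of the description, not necessarily the patented implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # single character -> TrieNode
        self.count = 0      # occurrences of the string spelled from the root


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, chars: str) -> None:
        node = self.root
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _find(self, chars: str):
        node = self.root
        for ch in chars:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, chars: str) -> int:
        node = self._find(chars)
        return node.count if node else 0

    def next_counts(self, chars: str) -> dict:
        """Frequency list of the characters that directly follow `chars`."""
        node = self._find(chars)
        if node is None:
            return {}
        return {ch: c.count for ch, c in node.children.items() if c.count}


def build_tries(clauses, max_n=3):
    """Prefix trie stores n-grams as written; suffix trie stores them reversed."""
    prefix, suffix = Trie(), Trie()
    for clause in clauses:
        for n in range(1, max_n + 1):
            for i in range(len(clause) - n + 1):
                gram = clause[i:i + n]
                prefix.add(gram)
                suffix.add(gram[::-1])
    return prefix, suffix


prefix_trie, suffix_trie = build_tries(["芯片产业", "芯片设计"])
```

Querying the prefix trie with a candidate gives its right neighbors, and querying the suffix trie with the reversed candidate gives its left neighbors, which are the frequency lists the entropy calculation needs.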
Query the Trie tree to obtain the frequency lists of prefixes and suffixes, and calculate the left and right information entropies of the candidate words and of the segments that form them. Because several information entropies are involved, each is given a distinguishing label: Candidate is the candidate word, left is the segment formed on the left, and right is the segment formed on the right; h_l_l and h_l_r are the left and right information entropies of the left segment, h_r_l and h_r_r are the left and right information entropies of the right segment, and h_l and h_r are the left and right information entropies of the candidate word, as shown in Fig. 3.
Query the Trie tree to obtain the word frequency of the candidate words and the word frequencies of the left and right segments. From the word frequencies, the actual co-occurrence probability P(a, b) and the expected co-occurrence probability P(a) × P(b) are readily obtained, from which the mutual information and the internal solidification degree can be calculated. The invention mainly uses two word-formation criteria: the internal solidification degree and the free application degree. The internal solidification degree measures how frequently a word occurs and how meaningful its internal collocation is; the higher it is, the more likely the text segment is a word. The free application degree measures the richness of the characters adjacent to the word on the left and right; the higher it is, the more likely the text segment is a word.
The internal solidification degree measures whether a word collocation is reasonable, and is calculated with the pointwise mutual information (PMI) index from computational linguistics. If the PMI is high, that is, if the probability that two words co-occur is far greater than the product of their individual probabilities, the collocation of the two words is more reasonable. The PMI is calculated as follows:
$$\mathrm{PMI}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$
wherein $P(a)$, $P(b)$, and $P(a, b)$ respectively denote the probabilities with which a, b, and the combination ab appear in the corpus.
For words consisting of multiple characters, the segment is split into two sub-segments at each possible position, the mutual information of every split is calculated, and the minimum over all splits is taken as the internal solidification degree:
$$\mathrm{PMI}(c_1 \ldots c_m) = \min_{1 \le i < m} \log \frac{p(c_1 \ldots c_m)}{p(c_1 \ldots c_i)\, p(c_{i+1} \ldots c_m)}$$
wherein $c_1 \ldots c_m$ represents a character string of length m, and $p(w)$ denotes the frequency of occurrence of the word $w$.
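As a rough illustration, the PMI and the minimum over all two-way splits can be computed as below. The function names, the `freq` dictionary of raw counts, and the use of a base-2 logarithm are assumptions; the text does not state the logarithm base.

```python
import math


def pmi(p_ab: float, p_a: float, p_b: float) -> float:
    """Pointwise mutual information of a and b (log base 2 assumed)."""
    return math.log2(p_ab / (p_a * p_b))


def solidification(word: str, freq: dict, total: int) -> float:
    """Minimum PMI over every split of `word` into two sub-segments.

    `freq` maps each segment to its raw count and `total` is the number
    of counted segments, so freq[x] / total approximates p(x).
    """
    p_word = freq[word] / total
    return min(
        pmi(p_word, freq[word[:i]] / total, freq[word[i:]] / total)
        for i in range(1, len(word))
    )


# Toy counts chosen so that every probability is a power of two.
toy_freq = {"abc": 1, "a": 4, "bc": 2, "ab": 2, "c": 8}
value = solidification("abc", toy_freq, total=16)
```

In the toy data the split a|bc has PMI 1 while ab|c has PMI 0, so the weaker split determines the result; taking the minimum means a string is only considered cohesive if no split point looks like an accidental juxtaposition.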
Query the Trie tree to obtain the left and right adjacent characters of the sub-segments, and calculate the left and right adjacent entropies of the candidate words. Internal cohesion alone is not sufficient for a text segment to be a word; the segment must also behave as a whole with respect to its external context. Assuming that the set of left-adjacent characters of a word segment $w$ is $A$ and the set of right-adjacent characters is $B$, the left and right adjacent entropies are calculated respectively as:
$$h_l(w) = -\sum_{a \in A} P(a \mid w) \log P(a \mid w)$$
$$h_r(w) = -\sum_{b \in B} P(b \mid w) \log P(b \mid w)$$
In the invention, the boundary freedom of a candidate word considers the left and right adjacent entropies simultaneously, and only a word with high freedom on both sides is regarded as a reasonable word. Therefore, when scoring a candidate word, the smaller of its left and right adjacent entropies is used as the adjacent entropy value. This measures the richness of the characters adjacent to the word on the left and right: the larger the entropy, the greater the richness. The free application degree is calculated as follows:
$$\min\bigl(h_l(w),\, h_r(w)\bigr)$$
where $h_l(w)$ and $h_r(w)$ are the left and right adjacent entropies of the candidate word $w$.
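The adjacent entropy and the free application degree can be sketched from neighbor frequency lists as follows. The function names and the base-2 logarithm are assumptions; the inputs are plain dictionaries mapping each adjacent character to its count.

```python
import math


def adjacent_entropy(neighbor_counts: dict) -> float:
    """Shannon entropy (base 2 assumed) of a neighbor-character distribution."""
    counts = [c for c in neighbor_counts.values() if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def free_application_degree(left_counts: dict, right_counts: dict) -> float:
    """Smaller of the left and right adjacent entropies."""
    return min(adjacent_entropy(left_counts), adjacent_entropy(right_counts))


# A segment with varied left neighbors but a single right neighbor is
# probably the front part of a longer word, so its degree is low.
degree = free_application_degree({"的": 1, "新": 1, "在": 1, "用": 1}, {"业": 4})
```

Taking the minimum penalizes segments that are free on only one side, which is the behavior the paragraph above describes.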
To capture the characteristics of new-word formation, in practical application the invention calculates a score for each candidate word, representing the likelihood that it forms a new word in the current context. The score is calculated as follows:
$$\mathrm{score} = \mathrm{PMI} - \min(h_{r\_l},\, h_{l\_r}) + \min(h_l,\, h_r)$$
The score consists of three parts:
1) The pointwise mutual information $\mathrm{PMI}$: the higher the pointwise mutual information, the higher the internal solidification degree.
2) The minimum of the information entropies of the two word segments, $\min(h_{r\_l}, h_{l\_r})$: the larger this value, the less likely the two segments are to appear together.
3) The minimum of the left and right information entropies of the word, $\min(h_l, h_r)$: the larger this value, the richer the contexts in which the candidate word appears, and the more likely it is to be a word.
Therefore, a higher score indicates a higher probability of being a word. Candidate words whose scores fall below a set threshold are filtered out of the candidate word set, yielding the candidate word set of the target field.
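The scoring and threshold step might look like the sketch below. The way the three parts are combined here (the first and third added, the second subtracted) is an assumption inferred from how each part is described, and the function names and threshold value are hypothetical.

```python
def candidate_score(pmi: float, h_l_r: float, h_r_l: float,
                    h_l: float, h_r: float) -> float:
    """Combine the three parts into one score (assumed combination).

    pmi            : internal solidification, raises the score
    min(h_l_r, h_r_l): entropy at the internal boundary between the two
                     segments, lowers the score
    min(h_l, h_r)  : free application degree of the candidate, raises the score
    """
    return pmi - min(h_l_r, h_r_l) + min(h_l, h_r)


def filter_candidates(scores: dict, threshold: float) -> set:
    """Keep the candidates whose score reaches the threshold."""
    return {word for word, s in scores.items() if s >= threshold}


scores = {
    "芯片": candidate_score(pmi=3.0, h_l_r=0.5, h_r_l=1.0, h_l=2.0, h_r=1.5),
    "片产": candidate_score(pmi=0.5, h_l_r=2.0, h_r_l=2.5, h_l=0.5, h_r=1.0),
}
kept = filter_candidates(scores, threshold=2.0)
```

The threshold is corpus-dependent; in practice it would be tuned against a sample of manually reviewed candidates.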
Some common Chinese words also end up in the candidate word set, and these should not be treated as new words of the target field. For this reason, a Chinese stop word list is downloaded from Baidu; the stop words are common Chinese words, and any candidate word that appears in the stop word list is removed from the candidate word list. Moreover, the words in the candidate word set are not necessarily all new relative to the source domain, so words that already exist in the source-domain corpus must also be filtered out.
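Both filters amount to set membership tests, as in the following minimal sketch. The function name and the sample word lists are hypothetical; the real stop word list and source-domain vocabulary would be loaded from files.

```python
def filter_new_words(candidates, stop_words, source_domain_words):
    """Drop candidates that are stop words or already known in the source domain.

    The input order of `candidates` is preserved in the result.
    """
    stop_words = set(stop_words)
    known = set(source_domain_words)
    return [w for w in candidates if w not in stop_words and w not in known]


new_words = filter_new_words(
    candidates=["我们", "芯片", "算力底座", "因此"],
    stop_words=["我们", "因此", "但是"],
    source_domain_words=["芯片"],
)
```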
The resulting industry new word list still contains many junk strings and mis-segmented strings. The junk strings are mostly common collocations and fragments internal to words, and such unreasonable candidates cannot be filtered out by the algorithm alone. Manual review is therefore needed, and the user can add, delete, modify, query, and export the candidate words at any time. The layer-by-layer screening of the new word discovery algorithm already yields fairly high-quality results, which greatly reduces the workload of manual intervention. The candidate words that pass manual review are stored as new words of the industry field, so that the construction and iterative updating of the industry chain can be performed on the basis of these new words.
Based on the steps, a new word list of the target field can be obtained.
After the new industry words are extracted, the hierarchical position of each new word in the industry chain must be determined from its meaning and characteristics: entity pairs among the new words that stand in a superior-subordinate relation are searched for in the corpus, the hierarchical structure of the industry chain is constructed, and the new words are added to the industry chain. An industry map generally focuses on the upstream and downstream relations of an industry. Therefore, the invention uses a hierarchical tree structure to build the network of industry relations, and performs depth expansion and width expansion of the hierarchical tree by the superior-subordinate relation and parallel semantic relation extraction method.
Further, in an embodiment of the present invention, the determining the relation between the new industry words according to the superior-subordinate relation and parallel semantic relation extraction method, and constructing a target industry chain tree according to the new industry words and the relation between the new industry words includes:
performing depth expansion and width expansion of the target industry chain tree by the superior-subordinate relation and parallel semantic relation extraction method, wherein the width expansion of the target industry chain tree is performed by a width expansion algorithm, and the depth expansion of the target industry chain tree is performed by a depth expansion algorithm.
Further, in an embodiment of the present invention, the performing width expansion of the target industry chain tree by a width expansion algorithm includes:
representing an industry new word by an entity, representing the part of speech of the industry new word by a type, and defining the association weight between the entity and the type:
[Figure SMS_58]
wherein [Figure SMS_59] represents an entity, [Figure SMS_60] represents an entity type, and [Figure SMS_61] [Figure SMS_62] is the returned confidence score;
denoting the sibling similarity of two entities [Figure SMS_63] and [Figure SMS_64] as [Figure SMS_65], and calculating the similarity of two sibling entities based on matching pattern features:
[Figure SMS_66]
wherein [Figure SMS_67] represents a skip pattern and [Figure SMS_68] represents the set of skip patterns;
performing feature computation using the entity and the type: [Figure SMS_69], wherein [Figure SMS_70] represents all the acquired features;
obtaining the embedded features of the two entities via word2vec, [Figure SMS_71], and computing the sibling similarity with a multiplicative metric:
[Figure SMS_72]
calculating the score of the entity according to the sibling similarity:
[Figure SMS_73]
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
Further, in an embodiment of the present invention, performing the depth expansion of the target industry chain tree by the depth expansion algorithm includes:
using (Figure SMS_75) to represent a term (Figure SMS_77); given a target parent node (Figure SMS_79) and a set of reference edges (Figure SMS_76), where (Figure SMS_78) is the parent node of (Figure SMS_80), calculating the score for placing a node (Figure SMS_81) under the parent node (Figure SMS_74):
Figure SMS_82
where (Figure SMS_83) denotes the cosine similarity between the vectors (Figure SMS_84) and (Figure SMS_85);
scoring each candidate entity (Figure SMS_87) based on (Figure SMS_86), selecting the entities whose score is above a threshold as the initial child nodes under the node (Figure SMS_88), and thereby performing depth expansion of the target industry chain tree.
Fig. 4 shows two expected width expansion results. Given the set { "upstream support", "midstream platform" }, the goal is to find their sibling node "downstream integration services" and place it under the parent node "artificial intelligence". Similarly, the goal is to find all siblings of { "underlying hardware", "application technology" } and attach them under the parent node "upstream support".
This naturally forms a tree width expansion problem, so a width expansion algorithm is used to solve it. A key component of the width expansion algorithm is computing the similarity of two entities (Figure SMS_89) and (Figure SMS_90), denoted (Figure SMS_91). This is done mainly through parallel semantic pattern matching: natural language generally expresses parallel relationships with certain punctuation marks (such as the enumeration comma), fixed words (such as "or" and "and"), or sentence patterns, from which the matching patterns of parallel semantics can be obtained. First, a weight is assigned between each pair of entity and matching pattern as follows:
Figure SMS_92
where (Figure SMS_93) is the raw co-occurrence count between entity e and skip mode sk, and |V| is the total number of candidate entities.
Similarly, the association weight between an entity and a type can be defined by:
Figure SMS_94
where (Figure SMS_95) is the confidence score returned by the concept knowledge graph, i.e. its degree of confidence that an entity (Figure SMS_96) has the type (Figure SMS_97). The type information of each entity (Figure SMS_98) is obtained by linking the entity to the concept knowledge base, which returns types as features of the entity. Entities that cannot be linked simply have no type features. The invention selects Probase (A Probabilistic Taxonomy), proposed by Microsoft, as the input concept knowledge graph; with this graph an entity can be mapped to different semantic concepts, each labelled with a corresponding probability according to the text content of the entity.
After this, the similarity of two sibling entities is computed using the matching pattern features, as follows:
Figure SMS_99
where SK denotes the selected set of matching pattern features. Similarly, all type features can be used to compute (Figure SMS_100), and finally the cosine similarity of the embedding features (Figure SMS_101) of the two entities is used to compute the similarity between them.
To combine the three similarities, the present invention uses a multiplicative measure to compute the sibling similarity as follows:
Figure SMS_102
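The three similarity signals and their multiplicative combination can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of the figures, which are not reproduced in this text: the weighted-Jaccard form of the feature similarity, the `1 +` smoothing of each factor, and the names `weighted_jaccard`, `cosine`, and `sibling_similarity` are all hypothetical choices made for the example.

```python
import math

def weighted_jaccard(f1, f2):
    """Similarity of two sparse feature-weight dicts (assumed form)."""
    keys = set(f1) | set(f2)
    num = sum(min(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    den = sum(max(f1.get(k, 0.0), f2.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sibling_similarity(e1, e2):
    """Multiply the pattern, type, and embedding similarities.

    Assumption: 1 is added to each factor so that a missing signal
    (for example an unlinkable entity with no type features) does not
    force the whole product to zero.
    """
    s_sk = weighted_jaccard(e1["patterns"], e2["patterns"])
    s_ty = weighted_jaccard(e1["types"], e2["types"])
    s_emb = cosine(e1["embedding"], e2["embedding"])
    return (1 + s_sk) * (1 + s_ty) * (1 + s_emb)

# Hypothetical entities with made-up feature weights
a = {"patterns": {"p1": 2.0, "p2": 1.0}, "types": {"t": 0.5}, "embedding": [1.0, 0.0]}
b = {"patterns": {"p1": 1.0}, "types": {}, "embedding": [1.0, 0.0]}
print(sibling_similarity(a, b))
```

In this sketch an entity without type features still receives a non-zero score from its pattern and embedding similarity, which matches the remark that unlinkable entities simply lack type features.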
Given a set S of seed entities and a list V of candidate entities, each matching pattern feature is first scored by its cumulative strength with the entities in S (i.e., Figure SMS_103), and the 200 matching pattern features with the highest scores are selected. On this basis, 10 subsets of matching pattern features (Figure SMS_104), t = 1, 2, …, 10, are generated by sampling without replacement. Each subset (Figure SMS_105) has 120 matching pattern features.
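The feature selection and sampling step can be sketched as below. The function name `sample_feature_subsets`, its parameters, and the fixed random seed are assumptions for illustration; `feature_strength` is taken to map each matching pattern feature to its cumulative strength with the seed set S.

```python
import random

def sample_feature_subsets(feature_strength, top_k=200, n_subsets=10,
                           subset_size=120, seed=0):
    """Keep the top_k strongest features, then draw n_subsets subsets.

    Each subset is drawn without replacement, so a feature appears
    at most once per subset (it may appear in several subsets).
    """
    ranked = sorted(feature_strength, key=feature_strength.get, reverse=True)
    top = ranked[:top_k]
    rng = random.Random(seed)
    size = min(subset_size, len(top))  # guard against small feature sets
    return [rng.sample(top, size) for _ in range(n_subsets)]

# Hypothetical strengths for 300 pattern features: sk0 weakest, sk299 strongest
strengths = {f"sk{i}": float(i) for i in range(300)}
subsets = sample_feature_subsets(strengths)
print(len(subsets), len(subsets[0]))
```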
Given a subset (Figure SMS_106), a candidate entity in V is considered only if it is associated with at least one of the matching pattern features in (Figure SMS_107). The score of each considered entity is calculated as follows:
Figure SMS_108
For each (Figure SMS_111), a ranked list (Figure SMS_114) of candidate entities is obtained from their scores. We use (Figure SMS_116) to denote the rank of entity (Figure SMS_110) in (Figure SMS_113); if (Figure SMS_115) does not occur in (Figure SMS_117), we set (Figure SMS_109). Finally, we calculate the mrr score of each entity (Figure SMS_112) and add the entities whose average rank is above r to the set S.
A key insight of this aggregation mechanism is that unrelated entities do not appear frequently in multiple (Figure SMS_118) and thus tend to have a lower mrr score. In the present invention, r = 5.
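The rank aggregation can be sketched as follows. Because the aggregation formula is not reproduced in this text, the sketch assumes that mrr is the mean reciprocal rank over the ranked lists, that an entity missing from a list contributes zero (equivalent to an infinite rank), and that "average rank above r" means a mean reciprocal rank of at least 1/r. The name `rank_ensemble` is hypothetical.

```python
def rank_ensemble(ranked_lists, r=5):
    """Return entities whose mean reciprocal rank is at least 1/r.

    ranked_lists: one ranked list of entities per feature subset,
    best entity first. Entities absent from a list add nothing
    to their reciprocal-rank sum for that list.
    """
    n_lists = len(ranked_lists)
    rr_sum = {}
    for ranking in ranked_lists:
        for position, entity in enumerate(ranking, start=1):
            rr_sum[entity] = rr_sum.get(entity, 0.0) + 1.0 / position
    selected = [e for e, s in rr_sum.items() if s / n_lists >= 1.0 / r]
    return sorted(selected, key=lambda e: -rr_sum[e])

# Hypothetical ranked lists from three feature subsets
lists = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "d"]]
print(rank_ensemble(lists, r=2))
```

An entity such as "d" that shows up in only one list ends with a low aggregate score, which is the behaviour the aggregation mechanism relies on to filter unrelated entities.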
Newly added nodes in the classification tree (e.g., the node "downstream integration services" in fig. 4) do not yet have any child nodes, so the width expansion algorithm cannot be applied to them directly. To solve this problem, a depth expansion algorithm is used to obtain the initial child nodes of the target node by considering the relationship between the target node's siblings and its niece/nephew nodes. Take the node "downstream integration services" in fig. 4 as an example. This node was generated by the preceding width expansion algorithm and therefore has no child nodes. The goal is to find its initial child nodes (e.g., "terminal devices" and "application software") by modeling the relationship between a sibling of the node "downstream integration services" (e.g., "upstream support") and its niece/nephew nodes (e.g., "middleware", "operating system").
The depth expansion algorithm relies on term embeddings, which encode the semantics of a term in a dense vector of fixed length. We denote the embedding vector of a term t by v(t). The offset between two term embeddings can represent the relationship between the terms, so that v("upstream support") - v("underlying hardware") ≈ v("downstream integration services") - v("application software"). Thus, given a target parent node (Figure SMS_119) and a set of reference edges (Figure SMS_120), where (Figure SMS_121) is the parent node of (Figure SMS_122), we compute the score for placing a node (Figure SMS_123) under the parent node (Figure SMS_124) as follows:
Figure SMS_125
where (Figure SMS_126) denotes the cosine similarity between the vectors (Figure SMS_127) and (Figure SMS_128). Finally, each candidate entity (Figure SMS_130) is scored based on (Figure SMS_129), and the entities whose score is above a threshold are selected as the initial child nodes of the node (Figure SMS_131).
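The offset-based scoring can be sketched as below. The scoring formula itself is not reproduced in this text, so the sketch assumes the score is the average cosine similarity between the candidate edge offset and each reference edge offset; the names `parent_child_score` and `offset` and the toy two-dimensional vectors are hypothetical.

```python
import math

def offset(u, v):
    """Element-wise difference u - v of two embedding vectors."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def parent_child_score(v, parent, candidate, reference_edges):
    """Average similarity between the candidate edge offset and the
    offsets of the reference (parent, child) edges (assumed form)."""
    target = offset(v[parent], v[candidate])
    sims = [cosine(target, offset(v[p], v[c])) for p, c in reference_edges]
    return sum(sims) / len(sims) if sims else 0.0

# Toy embeddings chosen so that the first candidate mirrors the reference edge
v = {
    "upstream support": [2.0, 0.0],
    "underlying hardware": [1.0, 0.0],
    "downstream integration services": [3.0, 1.0],
    "application software": [2.0, 1.0],
    "unrelated term": [3.0, 2.0],
}
edges = [("upstream support", "underlying hardware")]
for cand in ("application software", "unrelated term"):
    print(cand, parent_child_score(v, "downstream integration services", cand, edges))
```

Candidates whose score exceeds a chosen threshold would then be attached as the initial child nodes of the target parent.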
Thus, the industrial chain hierarchical relation tree of a target field can be obtained.
The upstream and downstream relationships of the industrial chain are the core of the industrial map and have an extremely low tolerance for error, so they are generally constructed manually by analysts and experts. The invention therefore designs the data storage structure of the industrial map around the upstream and downstream logic and the node association relationships of the industrial chain, and provides functions such as visual editing of the industrial map and one-click import of map data. By designing an industrial map data conversion method, the steps are simplified: a user can select suitable industry terms from the automatically mined new industry words and conveniently build a custom industrial chain or iteratively update an existing industrial map, which facilitates subsequent fine-grained analysis and forward-looking assessment of the industrial map.
Further, in an embodiment of the present invention, the designing the data storage structure of the target industry chain tree according to the industry chain upstream and downstream logic and the node association relationship includes:
designing a parent_id field that stores the unique identifier of the parent node;
storing all ancestor nodes of the current node, at every level, in a full_path field, represented as a concatenated string of the form id#id….
Specifically, in an industrial map application scenario, the map usually focuses on the upstream and downstream relationships and hierarchical dependencies among industry nodes, and the hierarchy is generally not very deep, essentially within ten levels; the design goal of the database is therefore to store a multi-level structure and to retrieve a complete branch simply and efficiently. For this bounded hierarchical structure with a large data volume, a parent_id field is designed to store the unique identifier of the parent node, so that the direction of an industry can be obtained quickly and the industrial map tree can be obtained through recursive queries. For visual analysis, displaying the industrial map of a given field frequently requires extracting all child node information of a given node; with parent_id alone, a deep tree requires many database queries to retrieve one tree, which is very inefficient. To improve efficiency, all ancestor nodes of the current node, at every level, are stored in a full_path field as a concatenated string of the form id#id…, so that a node and its child nodes can be matched conveniently with a LIKE prefix query, the level of each node in the tree can be obtained, and the tree can be assembled more conveniently and efficiently at the application code level. If the relationship of a node in the tree is updated, only the full_path fields of that node and its child nodes need to be maintained. This design supports the querying and encapsulation of industrial map structure data and also facilitates maintaining the hierarchical relationships of nodes and their child nodes. The main fields of the database table are shown in Table 1.
TABLE 1
Figure SMS_132
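The parent_id and full_path design can be illustrated with a small in-memory SQLite sketch. Only the parent_id and full_path fields come from the description; the table name `industry_node`, the `name` column, the helper functions, and the choice to end full_path with the node's own id are assumptions made for the example, since Table 1 is not reproduced in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE industry_node (
        id        INTEGER PRIMARY KEY,   -- generated on insert
        name      TEXT NOT NULL,
        parent_id INTEGER,               -- NULL for a root node
        full_path TEXT NOT NULL          -- e.g. '1#2#3' (assumed to end with own id)
    )""")

def add_node(name, parent_id=None):
    """Insert a node and derive its full_path from the parent's path."""
    cur = conn.execute(
        "INSERT INTO industry_node (name, parent_id, full_path) VALUES (?, ?, '')",
        (name, parent_id))
    node_id = cur.lastrowid
    if parent_id is None:
        path = str(node_id)
    else:
        (parent_path,) = conn.execute(
            "SELECT full_path FROM industry_node WHERE id = ?", (parent_id,)).fetchone()
        path = f"{parent_path}#{node_id}"
    conn.execute("UPDATE industry_node SET full_path = ? WHERE id = ?", (path, node_id))
    return node_id

def subtree(node_id):
    """Fetch a node and all of its descendants with one LIKE prefix query."""
    (path,) = conn.execute(
        "SELECT full_path FROM industry_node WHERE id = ?", (node_id,)).fetchone()
    return conn.execute(
        "SELECT name, full_path FROM industry_node "
        "WHERE full_path = ? OR full_path LIKE ? ORDER BY full_path",
        (path, path + "#%")).fetchall()

ai = add_node("artificial intelligence")
up = add_node("upstream support", ai)
hw = add_node("underlying hardware", up)
mid = add_node("midstream platform", ai)
print(subtree(up))
```

The level of a node follows from the number of `#` separators in its full_path, and including the separator in the LIKE pattern keeps the prefix `1#` from matching an unrelated path such as `10#4`.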
To make the industrial map data easier to understand and analyze, the method displays the map data in a tiled hierarchical layout. In Web application development, JavaScript tree controls based on Ajax technology are widely used; the invention is implemented with the AntV G6 graph visualization engine, which provides basic graph visualization capabilities such as drawing, rendering, element configuration, layout, interaction, and animation, and addresses the display and editing of hierarchical industrial map data. A user can add, delete, and modify nodes and edges, change the superior-subordinate relationships of industrial map nodes by dragging, and configure the entity concept and attributes of an industry node (such as the entity definition and the field to which it belongs) by clicking the node, which improves the flexibility and extensibility of the industrial map.
The method uses a Tree Diff algorithm to compare the nodes of the old and new trees and identify their differences, thereby determining the nodes that need to be updated and forming patch data to be sent to the server. The invention adopts a depth-first strategy, which ensures that the ancestor nodes are up to date when a child node is modified. The comparison between new and old nodes serves three database maintenance tasks: creating new nodes, deleting obsolete nodes, and updating existing nodes. Each editing action of the user is staged at the front end rather than applied to the database directly: the data to be added, modified, and deleted are marked by state and placed in the add object, the update object, and the delete object respectively, and the classified data are sent to the server when the user clicks save. The specific steps are as follows:
(1) If the node content has no id attribute, the node is considered newly added and is placed in the add object. Because the unique node identifier id is generated automatically when a node is inserted into the database, and the server puts the id into the node content and returns it to the browser, every existing node has an id attribute.
(2) If the node content has an id attribute, compare whether the attribute values of the new node and the old node, other than children, are consistent:
1) if the attribute values are consistent, the node is considered unmodified;
2) if the attribute values are inconsistent, add the node to the update object and reassign its parent_id and full_path.
(3) Examine the child nodes of the new node and the old node:
1) if only the new node has child nodes, go to step (1);
2) if only the old node has child nodes, the new node is considered to have discarded them, so the child nodes of the old node must be deleted and are added to the delete object;
3) if both the new node and the old node have child nodes, traverse the two child node sets and take their intersection, where nodes with the same id are considered to be in the intersection; the nodes in the intersection go to step (1) for further comparison. Child nodes of the new node that are not in the intersection are considered newly added and are placed in the add object; child nodes of the old node that are not in the intersection are placed in the delete object.
Fig. 5 is a schematic diagram of generating the patch data for a new industrial chain. After receiving the request and the patch data, the server performs batch add, delete, and modify operations on the database: it adds the data in the add object, modifies the data in the update object, and deletes the data in the delete object.
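The comparison steps that produce the patch data can be sketched as a depth-first recursion. The description implements this at the browser front end; the Python below is only an illustration of the logic, with nodes assumed to be dictionaries carrying an optional `id`, a `name`, and a `children` list. The function names are hypothetical, and the reassignment of parent_id and full_path on update is omitted for brevity. As a simplification, a child whose id is not among the old node's children is treated as newly added.

```python
def _attrs(node):
    """Node attributes without the children list."""
    return {k: val for k, val in node.items() if k != "children"}

def _collect(node, bucket):
    """Append a node and all of its descendants to a patch bucket."""
    bucket.append(_attrs(node))
    for child in node.get("children", []):
        _collect(child, bucket)

def diff_trees(old, new, patch=None):
    """Depth-first comparison of an old and a new tree node."""
    if patch is None:
        patch = {"add": [], "update": [], "delete": []}
    if "id" not in new:                      # step (1): no id means a new node
        _collect(new, patch["add"])
        return patch
    if _attrs(old) != _attrs(new):           # step (2): attributes changed
        patch["update"].append(_attrs(new))
    old_children = {c["id"]: c for c in old.get("children", [])}
    kept = set()
    for child in new.get("children", []):    # step (3): compare child sets
        cid = child.get("id")
        if cid in old_children:              # same id: recurse into the pair
            kept.add(cid)
            diff_trees(old_children[cid], child, patch)
        else:                                # only in the new tree
            _collect(child, patch["add"])
    for cid, child in old_children.items():  # only in the old tree
        if cid not in kept:
            _collect(child, patch["delete"])
    return patch

old_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream", "children": []},
    {"id": 3, "name": "midstream platform", "children": []}]}
new_tree = {"id": 1, "name": "artificial intelligence", "children": [
    {"id": 2, "name": "upstream support", "children": []},
    {"name": "downstream integration services", "children": []}]}
patch = diff_trees(old_tree, new_tree)
print(patch)
```

The resulting `patch` dictionary corresponds to the add, update, and delete objects that are sent to the server when the user saves.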
Besides visual editing, the platform also provides a one-click import function for industrial maps: a user can create or update an industrial map by importing an Excel spreadsheet. The core implementation steps are as follows:
Reading the Excel file. The Node.js node-xlsx module is used to read and write the Excel file stream. The module reads the Excel file row by row, so the data structure read is a two-dimensional array, and the value read for a cell that belongs to merged rows or columns is NULL. For the spreadsheet shown in fig. 6, the reading result is shown in fig. 7.
Converting the valid data in each row into a hierarchical tree structure. The two-dimensional array read by the Node script can be converted into a nested structure, where the length of each row array is the maximum depth of that row. As shown in fig. 7, the first three values of the second row are NULL, meaning that they take the first three values of the first row, so the nested object generated from the second row only needs to be merged with the nested object generated from the first row; the third row is merged in the same way, and repeating this process yields the complete tree. During implementation, when several data rows at the same level are encountered, it is not immediately clear under which one the current data should be inserted. Observation of the data shows that each insertion only needs the depth of the current item: the current data is inserted under the most recently inserted item at the parent level, one level above its own depth, which guarantees that the insertion level is correct. A depth-first search is therefore used to find the parent object one level above the current level, which gives the parent-child relationship; combined with the complete path full_path of the parent node and the root_id of the industry field, a new node can be constructed and stored, as shown in fig. 8. At the same time, the current object is appended as the last item of the parent's children array (for example, the JSON object generated from the first three items is merged with the fourth item), which yields the JSON structure of the industrial map from the current node to the root node, as shown in fig. 9. This fully matches the hierarchical data structure required by the front-end tree control and can be conveniently visualized by the front end.
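The row-to-tree conversion can be sketched as follows. The description locates the parent with a depth-first search over the tree built so far; this sketch swaps in a simpler stack of the most recently inserted node per level, which gives the same "insert under the last inserted parent" behaviour. The function name `rows_to_tree`, the node shape, and the sample rows are assumptions, with `None` standing for the NULL values read from merged cells.

```python
def rows_to_tree(rows):
    """Convert spreadsheet rows (None for merged cells) into a nested tree.

    Column index is taken as depth. Each non-empty cell becomes a child
    of the most recently inserted node one level above it.
    """
    root = {"name": None, "children": []}   # virtual root above column 0
    last = [root]                           # last[d] = parent for column d
    for row in rows:
        for depth, cell in enumerate(row):
            if cell is None:
                continue                    # merged cell: reuse earlier parent
            node = {"name": cell, "children": []}
            last[depth]["children"].append(node)
            del last[depth + 1:]            # deeper levels are now stale
            last.append(node)               # node is the parent for depth + 1
    return root["children"]

# Hypothetical rows in the style of a two-dimensional array read from Excel
rows = [
    ["artificial intelligence", "upstream support", "underlying hardware", "chips"],
    [None, None, None, "sensors"],
    [None, None, "application technology", "computer vision"],
    [None, "midstream platform", None, None],
]
tree = rows_to_tree(rows)
print(tree[0]["name"], [c["name"] for c in tree[0]["children"]])
```

A row whose first non-empty cell is deeper than any parent seen so far would raise an IndexError in this sketch, so input validation would be needed for malformed spreadsheets.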
The industrial chain construction and iterative expansion development method provided by the embodiment of the invention covers the core services of rapid industrial map construction, new industry word discovery, industrial hierarchical relationship extraction, and iterative updating. It addresses the low precision of manual data analysis and the complexity of constructing and extending existing industrial maps, allows the industrial map for a given category to be generated and extended conveniently and quickly according to demand, balances automatic processing against manual intervention, and improves the extensibility and development efficiency of the application.
In order to implement the above embodiments, the present invention further provides an industrial chain building and iterative expansion development device.
Fig. 10 is a schematic structural diagram of an apparatus for building an industry chain and iteratively expanding development according to an embodiment of the present invention.
As shown in fig. 10, the apparatus for building and iteratively expanding an industry chain includes: an acquisition module 100, a screening module 200, a construction module 300, and an update module 400, wherein
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain an industry new word;
the construction module is used for determining the relation between the industry new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating based on the original industrial chain tree through the data storage structure.
In order to achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for building an industry chain and iteratively expanding development as described above.
To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for building and iteratively expanding an industry chain as described above.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined by one skilled in the art, provided they are not mutually inconsistent.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method for building and iteratively expanding and developing an industrial chain is characterized by comprising the following steps:
acquiring a target industry type input by a user, and acquiring industry corpus data corresponding to the target industry type;
designing an industry new word discovery algorithm to perform unsupervised pre-word segmentation on the industry corpus data to obtain an industry new word;
determining the relation between the new industrial words according to the superior-inferior relation and the parallel semantic relation extraction method, and constructing a target industrial chain tree according to the new industrial words and the relation between the new industrial words;
and designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
2. The method according to claim 1, further comprising, after obtaining the industry corpus data corresponding to the target industry type:
and performing unified preprocessing on the industrial corpus data, wherein the preprocessing comprises the steps of cutting the industrial corpus data according to a Chinese character mode and a non-Chinese character mode, and removing language words and coding symbols.
3. The method according to claim 1, wherein designing the industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data comprises:
dividing the industrial corpus data into a set of single characters, and combining every two characters in the set to serve as candidate words;
constructing a Trie tree to store the candidate words;
querying the Trie tree, acquiring the frequency lists of prefixes and suffixes, and calculating the left and right information entropy of the candidate words and the left and right information entropy of the fragments that make up the candidate words;
querying the Trie tree, acquiring the word frequency of the candidate words and the word frequencies of their left and right fragments, and calculating the pointwise mutual information from the word frequencies;
calculating the scores of the candidate words according to a formula, and filtering out the candidate words with lower scores by setting a threshold on the scores to obtain the candidate word set of the target field, wherein the formula is:
Figure QLYQS_1
where (Figure QLYQS_2) denotes the pointwise mutual information, (Figure QLYQS_3) denotes the left and right information entropy of the fragments that make up the candidate word, and (Figure QLYQS_4) denotes the left and right information entropy of the candidate word.
4. The method according to claim 1, wherein determining the relationships between the new industry words by the superior-subordinate relationship and parallel semantic relationship extraction method, and constructing the target industry chain tree from the new industry words and the relationships between them, comprises:
performing depth expansion and width expansion of the target industry chain tree by the superior-subordinate relationship and parallel semantic relationship extraction method, wherein the width expansion of the target industry chain tree is performed by a width expansion algorithm and the depth expansion is performed by a depth expansion algorithm.
5. The method of claim 4, wherein the expanding the width of the target industry chain tree by a width expansion algorithm comprises:
representing each new industry word as an entity and its part of speech as a type, and defining the association weight between an entity and a type:
Figure QLYQS_5
where (Figure QLYQS_6) denotes an entity, (Figure QLYQS_7) denotes an entity type, and (Figure QLYQS_8, Figure QLYQS_9) is the returned confidence score;
denoting the sibling similarity of two entities (Figure QLYQS_10) and (Figure QLYQS_11) by (Figure QLYQS_12), and calculating the similarity of the two sibling entities using the matching pattern features:
Figure QLYQS_13
where (Figure QLYQS_14) denotes a skip mode and (Figure QLYQS_15) denotes the set of skip modes;
computing (Figure QLYQS_16) using the features of the entity and the type, where (Figure QLYQS_17) denotes all the acquired features;
obtaining the embedding features (Figure QLYQS_18) of the two entities via word2vec, and computing the sibling similarity with a multiplicative measure:
Figure QLYQS_19
calculating the score of each entity from the sibling similarity:
Figure QLYQS_20
and screening the entities according to the scores so as to expand the width of the target industry chain tree.
6. The method of claim 4, wherein performing the depth expansion of the target industry chain tree by the depth expansion algorithm comprises:
using (Figure QLYQS_22) to represent a term (Figure QLYQS_24); given a target parent node (Figure QLYQS_27) and a set of reference edges (Figure QLYQS_23), where (Figure QLYQS_25) is the parent node of (Figure QLYQS_26), calculating the score for placing a node (Figure QLYQS_28) under the parent node (Figure QLYQS_21):
Figure QLYQS_29
where (Figure QLYQS_30) denotes the cosine similarity between the vectors (Figure QLYQS_31) and (Figure QLYQS_32);
scoring each candidate entity (Figure QLYQS_34) based on (Figure QLYQS_33), selecting the entities whose score is above a threshold as the initial child nodes under the node (Figure QLYQS_35), and thereby performing depth expansion of the target industry chain tree.
7. The method of claim 1, wherein designing the data storage structure of the target industry chain tree by aiming at the industry chain upstream and downstream logic and the node association relationship comprises:
designing a parent_id field that stores the unique identifier of the parent node;
storing all ancestor nodes of the current node, at every level, in a full_path field, represented as a concatenated string of the form id#id….
8. The device for building and iteratively expanding and developing the industrial chain is characterized by comprising the following modules:
the acquisition module is used for acquiring a target industry type input by a user and acquiring industry corpus data corresponding to the target industry type;
the screening module is used for designing an industry new word discovery algorithm to perform unsupervised pre-segmentation on the industry corpus data to obtain an industry new word;
the construction module is used for determining the relation between the industry new words according to the upper-lower relation and the parallel semantic relation extraction method, and constructing a target industry chain tree according to the industry new words and the relation between the industry new words;
and the updating module is used for designing a data storage structure of the target industrial chain tree according to the upstream and downstream logics of the industrial chain and the node incidence relation, and performing iterative updating on the basis of the original industrial chain tree through the data storage structure.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of industry chain construction and iterative expansion development as claimed in any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for industrial chain construction and iterative extension development according to any one of claims 1 to 7.
CN202310260247.6A 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method Active CN115982390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310260247.6A CN115982390B (en) 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310260247.6A CN115982390B (en) 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method

Publications (2)

Publication Number Publication Date
CN115982390A true CN115982390A (en) 2023-04-18
CN115982390B CN115982390B (en) 2023-06-23

Family

ID=85968496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310260247.6A Active CN115982390B (en) 2023-03-17 2023-03-17 Industrial chain construction and iterative expansion development method

Country Status (1)

Country Link
CN (1) CN115982390B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975626A (en) * 2023-06-09 2023-10-31 浙江大学 Automatic updating method and device for supply chain data model

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN111897917A (en) * 2020-07-28 2020-11-06 嘉兴运达智能设备有限公司 Rail transit industry term extraction method based on multi-modal natural language features
CN112860692A (en) * 2021-01-29 2021-05-28 城云科技(中国)有限公司 Database table structure conversion method and device and electronic equipment thereof
CN113779200A (en) * 2021-09-14 2021-12-10 中国电信集团系统集成有限责任公司 Target industry word stock generation method, processor and device
CN114742061A (en) * 2022-04-26 2022-07-12 平安国际智慧城市科技股份有限公司 Text processing method and device, electronic equipment and storage medium
CN114757147A (en) * 2022-04-02 2022-07-15 辽宁工程技术大学 BERT-based automatic hierarchical tree expansion method

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN111897917A (en) * 2020-07-28 2020-11-06 嘉兴运达智能设备有限公司 Rail transit industry term extraction method based on multi-modal natural language features
CN112860692A (en) * 2021-01-29 2021-05-28 城云科技(中国)有限公司 Database table structure conversion method and device and electronic equipment thereof
CN113779200A (en) * 2021-09-14 2021-12-10 中国电信集团系统集成有限责任公司 Target industry word stock generation method, processor and device
CN114757147A (en) * 2022-04-02 2022-07-15 辽宁工程技术大学 BERT-based automatic hierarchical tree expansion method
CN114742061A (en) * 2022-04-26 2022-07-12 平安国际智慧城市科技股份有限公司 Text processing method and device, electronic equipment and storage medium

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116975626A (en) * 2023-06-09 2023-10-31 浙江大学 Automatic updating method and device for supply chain data model
CN116975626B (en) * 2023-06-09 2024-04-19 浙江大学 Automatic updating method and device for supply chain data model

Also Published As

Publication number Publication date
CN115982390B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
Zhang et al. Ad hoc table retrieval using semantic similarity
CN104317801B (en) Data cleaning system and method for big data
CN104268148B (en) Method and system for automatic extraction of forum page information based on time strings
CN104462582B (en) Web data similarity detection method based on secondary filtering of structure and content
CN104615687B (en) Fine-grained entity classification method and system for knowledge base updating
CN105045875B (en) Personalized search and device
CN102419778B (en) Information searching method for discovering and clustering sub-topics of query statement
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
US20150006528A1 (en) Hierarchical data structure of documents
Bing et al. Towards a unified solution: data record region detection and segmentation
CN106528648A (en) Distributed keyword approximate search method for RDF in combination with Redis memory database
Ujwal et al. Classification-based adaptive web scraper
Ahmadi et al. Unsupervised matching of data and text
CN115982390B (en) Industrial chain construction and iterative expansion development method
CN115617981A (en) Information level abstract extraction method for short text of social network
CN107491524B (en) Method and device for calculating Chinese word relevance based on Wikipedia concept vector
CN110162580A (en) Data mining and depth analysis method and application based on distributed early warning platform
Sharma et al. A probabilistic approach to apriori algorithm
Zeng et al. Construction of scenic spot knowledge graph based on ontology
Alobaid et al. Knowledge-graph-based semantic labeling: Balancing coverage and specificity
Li et al. A novel approach for mining probabilistic frequent itemsets over uncertain data streams
CN116401375B (en) Knowledge graph construction method and system
Yuan et al. Self-adaptive extracting academic entities from World Wide Web
JP5903372B2 (en) Keyword relevance score calculation device, keyword relevance score calculation method, and program
Ganeshmoorthy et al. Eliminating the Web Noise by Text Categorization and Optimization Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant