CN113987197B - Dynamic fusion and growth method for product node system in all fields - Google Patents


Info

Publication number
CN113987197B
Authority
CN
China
Prior art keywords
product
node
concept
candidate
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111166990.2A
Other languages
Chinese (zh)
Other versions
CN113987197A (en)
Inventor
张啸天
宗畅
杨彦飞
许源泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Liangzhi Data Technology Co ltd filed Critical Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202111166990.2A priority Critical patent/CN113987197B/en
Publication of CN113987197A publication Critical patent/CN113987197A/en
Application granted granted Critical
Publication of CN113987197B publication Critical patent/CN113987197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F16/35 - Clustering; Classification
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a dynamic fusion and growth method for an all-field product node system. Addressing the cognitive decision-making needs of fine-grained emerging fields in regional industrial economic development, the invention builds on an existing authoritative product classification system and continuously mines product concept nodes from massive semi-structured and unstructured heterogeneous Internet data sources, using natural language processing and knowledge graph techniques such as concept extraction, relationship discrimination, and attribute fusion. Product concepts are represented with text embedding techniques; the relationships between these concepts and the nodes of the original product system are then judged, fused, and attached, continuously expanding the node system to form an all-field product node system capable of dynamic fusion and growth. In addition, a human-machine collaborative interaction workflow during system construction and updating ensures the authority and accuracy of the all-field product node system.

Description

Dynamic fusion and growth method for product node system in all fields
Technical Field
The invention relates to the fields of computer technology, artificial intelligence and natural language processing, in particular to a dynamic fusion and growth method of a product node system in the whole field.
Background
With the development of computer science and artificial intelligence, automation and intelligence have become key elements of digital innovation and upgrading across industries, especially in economic digitization. Traditional cognitive analysis and decision-making for regional industrial economic development rely on expert experience, which sinks large amounts of labor cost; the product system framework used in industrial analysis cannot be effectively abstracted and accumulated, and a fine-grained product framework system cannot be standardized quickly and effectively.
Given the scenario requirements of product system construction, an all-field product system needs to be built on top of a set of international standard industry classifications, extending downward into product categories and finer-grained product subdivision nodes. The problems to be considered include: how to select the standard industry classification system used as the base, how to identify product concepts from semi-structured annual-report data and unstructured paper and patent texts, how to embed and represent system nodes, how to judge synonymy and upper-lower relationships between products, and how to ensure accurate, continuous dynamic fusion and growth of the product system, so that a high-quality all-field product system can be constructed.
Therefore, a dynamic fusion and growth method of a product node system in the whole field is needed, automatic construction of the product system is rapidly carried out, cognitive decision of regional industries is assisted, and the innovation requirement of industrial economy and digitization is met.
Disclosure of Invention
In view of the above, the invention aims to solve the problem that, in industrial analysis and research, the industry knowledge in experts' minds cannot be accumulated and an industrial product knowledge system cannot be constructed quickly, and provides a dynamic fusion and growth method for an all-field product node system. The method is a cross-language, full-industry-chain product system construction and self-growing dynamic update method for the system framework used in industrial analysis and decision-making; it can mine product concepts from heterogeneous data sources and dynamically fuse continuously emerging new products into the system, providing a new approach to the exploration, classification, and analysis of new products.
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:
A dynamic fusion and growth method of a full-field product node system comprises the following steps:
S1, taking the general product classification system required for building an all-field product system as the upper-layer framework of the product node system, and fine-tuning a pre-training language model on a data set of the general product classification system to obtain a domain language model, wherein the domain language model is used for obtaining the word embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set;
S3, training a synonym concept judgment model by using a product concept alias library, judging a synonym relationship between a candidate product concept in the candidate product concept set and a node in the existing product node system, fusing the product concept conforming to the synonym relationship with the node as a concept-node pair to obtain a node system after alias expansion, and simultaneously taking the product concept which does not conform to the synonym relationship with any node as a new product concept;
S4, based on the domain language model obtained in S1, constructing a node-node pair training set conforming to the upper-lower relationship according to the existing product node system, and training an upper-lower relationship classification judgment model so that it can judge the direct parent node of a node concept; then predicting the parent node of each new product concept obtained in S3 with the trained model, and hanging and expanding the new product concept into the product node system according to the prediction result;
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the all-field product node system.
Preferably, the step S1 specifically includes the following steps:
S11, according to the construction requirements of a product system in the whole field, taking a general product classification system HS code as a seed node system to form an upper layer framework of the product node system, and further obtaining an upper-lower relationship data set in the product node system;
S12, performing fine-tuning training on the description texts of the seed node system with the Bert pre-training language model to learn the semantic features of domain text expression, thereby obtaining a domain language model, and using the domain language model to obtain the feature vector of each node concept in the product node system.
Preferably, the step S2 specifically includes the following steps:
S21, carrying out rule-based structural analysis and extraction on product concepts in text for continuously acquired semi-structured text data containing the product concepts to generate a first candidate product concept set;
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted on new unstructured text data through continuous collection through the product concept extraction model, so that a second candidate product concept set is generated;
S23, combining the first candidate product concept set and the second candidate product concept set into a candidate product concept set, and using the candidate product concept set as a basis for expanding an existing product node system.
Further, the semi-structured text data containing the product concept is an enterprise annual report.
Further, the unstructured text data containing product concepts is patent text data and/or paper text data.
Preferably, the step S3 specifically includes the following steps:
S31, constructing a synonymous concept sample set conforming to a synonymous relation of a product according to product concept alias information, training a synonymous concept discrimination model based on the synonymous concept sample set by utilizing a sequence classification task in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in the existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and taking one candidate product concept in the candidate product concept set and one node in the existing product node system as a concept-node pair conforming to the synonymous relation if the two candidate product concepts conform to the synonymous concept relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair meeting the synonymous relation obtained in S31, fusing candidate product concept nouns into corresponding node attributes in the existing product node system, and storing the node attribute fields of the node instances in the product library to realize node alias attribute fusion.
Preferably, the step S4 specifically includes the following steps:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing feature vectors of each product node in a query product concept and node graph structure by using the domain language model obtained in the S1, carrying out propagation fusion and iterative updating on each node feature by adopting a GNN graph neural network model in the node graph structure to obtain respective word embedding representation of the query and the node, inputting the word embedding representation into a two-class model, and training the two-class model to enable the node in the query-node concept pair to be a direct father node of the query or not so as to obtain an upper-lower relationship classification judgment model;
S43, for the new product concepts obtained in the S3, judging the upper and lower relationship of each new product concept and each existing node in the product node system one by utilizing an upper and lower relationship classification judgment model, and calculating the existing node with the highest matching degree as a direct father level node so as to carry out the hanging expansion of the product node system.
Preferably, in the step S5, a verification tool of man-machine interaction is used to perform manual auditing and verification on the product concept extraction result, the product alias fusion result and the product system growth result in the steps S2, S3 and S4, and simultaneously sample data sets used for training each model in the steps S2, S3 and S4 are continuously deposited and updated according to the verification result, so that the model performance is iteratively improved, and a high-quality full-field product node system is continuously constructed.
The dynamic fusion and growth method of the all-field product node system provided by the invention can be used for quickly constructing a standardized product knowledge system facing to application scenes of regional industry cognitive decision, and besides, a set of semi-automatic construction process of the all-field product node system is also output, so that the method can be applied to an industrial chain analysis decision system, and is beneficial to improving the automation and intelligent degree of an industrial development cognitive decision process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to the structures illustrated in these drawings without the need of inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a whole flow chart of a dynamic fusion and growth method of a product node system in the whole field, which is provided by the embodiment of the invention;
FIG. 2 is a schematic diagram of a concept of identifying a product by mining heterogeneous data sources according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a modeling method of a query-node concept to an upper-lower relationship determination model according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of iterative updating of a product knowledge system through man-machine interaction according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, steps, operations, or elements, but do not preclude the presence or addition of one or more other features, steps, operations, or groups of elements.
It should also be understood that where directional indications are referred to in embodiments of the present invention, the directional indications are merely used to explain the relative relationship of the components being constructed at a particular location, and that if the particular location changes, the directional indications correspondingly change.
In addition, the product node system in the whole field constructed in the invention is constructed by using general industry product knowledge as a guide in the construction process, but the construction method is not limited to any industry, and the product node system in any specific industry can be constructed by the construction method and the dynamic updating method.
In the invention, the dynamic fusion and growth method of the product node system in the whole field comprises the following steps:
S1, taking an authoritative general product classification system required by building a full-field product system as an upper layer framework of a product node system, and further performing Fine tuning (Fine-tune) on a data set of the general product classification system by utilizing a pre-training language model to obtain a field language model, wherein the field language model is used for obtaining word embedding representation of each node in the product node system.
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set.
S3, training a synonym concept judgment model by using a product concept alias library, judging the synonym relation between the candidate product concepts in the candidate product concept set and nodes in the existing product node system, fusing the product concepts conforming to the synonym relation with the nodes as concept-node pairs to obtain a node system after alias expansion, and simultaneously taking the product concepts which do not conform to the synonym relation with any node as new product concepts.
S4, based on the domain language model obtained in the S1, a node-node pair training set conforming to the upper and lower relation is constructed according to the existing product node system, and an upper and lower relation classification judgment model is obtained through training, so that the direct father level node of the node concept (query) can be judged, the father level node of each new product concept obtained in the S3 is predicted by using the trained upper and lower relation classification judgment model, and the new product concept is hung and expanded into the product node system according to a prediction result.
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the all-field product node system.
The specific implementation of the steps S1 to S5 in this embodiment will be described below.
In the embodiment of the present invention, for the step S1 referred to in fig. 1, the specific implementation process is as follows:
S11, according to the construction requirements of the product system in the whole field, selecting a universal product classification system meeting the conditions as an upper layer framework of the product node system, and constructing an upper-lower relationship data set of nodes in the product node system. It should be noted that the selected general product classification system needs to have authority, in this embodiment, based on professional background knowledge, from aspects of node scale, node granularity, node expression form and the like, HS codes (customs codes) are selected as a seed node system to form an authority product node system upper layer architecture, and then an upper and lower relationship data set in the node system is obtained according to a standard system construction. The seed node system is used as the basic framework of the whole product node system, the nodes in the system represent product concepts of different levels and have upper and lower relations, and then the attribute expansion of the nodes and the hanging growth of new product concept nodes can be further carried out on the basis of the seed node system, so that the product node system is continuously expanded, and the product node system in the whole field is continuously and dynamically constructed.
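Constructing the upper-lower relationship data set from a seed node system, as described above, amounts to walking the seed hierarchy and emitting child-parent pairs. A minimal pure-Python sketch follows; the nested-dict tree and the concept names are illustrative stand-ins, not the actual HS-code data:

```python
def hierarchy_pairs(tree, parent=None):
    """Walk a seed classification tree (nested dicts keyed by concept
    name) and emit (child, parent) pairs, i.e. the upper-lower
    relationship data set of step S11."""
    pairs = []
    for node, children in tree.items():
        if parent is not None:
            pairs.append((node, parent))
        pairs.extend(hierarchy_pairs(children, parent=node))
    return pairs

# Toy slice of an HS-code-style seed system (names are illustrative).
seed = {"Machinery": {"Pumps": {"Centrifugal pump": {}}, "Valves": {}}}
pairs = hierarchy_pairs(seed)
# pairs == [("Pumps", "Machinery"), ("Centrifugal pump", "Pumps"), ("Valves", "Machinery")]
```

Each pair can then serve directly as a positive sample for the upper-lower relationship model trained later in step S4.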
S12, performing Fine-tuning training on a description text of a seed node system by using the Bert pre-training language model, learning semantic features in the text expression of the field, obtaining a field language model after training, and obtaining feature vectors of each node concept in the product node system, namely word embedding representation of the nodes by using the field language model.
Fig. 2 is a schematic diagram of extracting product concept words from heterogeneous data sources according to an embodiment of the present invention. Product concepts need to be extracted from two kinds of heterogeneous data, semi-structured text data and unstructured text data, to further expand the existing product node system. This process corresponds to step S2 of the method shown in Fig. 1, and its specific implementation is as follows:
S21, carrying out rule-based structural analysis and extraction on product concepts in the text for continuously acquired semi-structured text data containing the product concepts, and generating a first candidate product concept set.
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted through the product concept extraction model on new unstructured text data which is continuously collected, so that a second candidate product concept set is generated.
S23, combining the first candidate product concept set and the second candidate product concept set into a candidate product concept set, and using the candidate product concept set as a basis for expanding an existing product node system.
It should be noted that, because the product concepts in the industry are updated continuously over time, the extraction process of the candidate product concepts in this step may also be performed continuously, so as to implement dynamic update of the product node system. In actual operation, new semi-structured text data and unstructured text data can be continuously collected and accumulated, and then candidate product concepts are regularly extracted from the accumulated data for updating a product node system.
The specific forms of the above-described semi-structured text data and unstructured text data may be varied as long as product concepts are contained therein. In this embodiment, the semi-structured text data may be semi-structured product information table data disclosed in the annual newspaper of the enterprise, and related product information may be extracted by a rule method, and a specific extraction rule may be determined according to the language characteristics of the product concept in the text.
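As an illustration only (the field name, delimiters, and snippet below are hypothetical; real rules would be tuned to the actual annual-report table layout as the text notes), a minimal rule-based extractor for such semi-structured product information might look like:

```python
import re

def extract_products(annual_report_text):
    """Minimal rule-based sketch: pull candidate product concepts from a
    'Main products' field of a semi-structured annual-report snippet."""
    m = re.search(r"Main products:\s*(.+)", annual_report_text)
    if not m:
        return []
    # Product lists in such fields are typically delimited by commas,
    # semicolons, or the Chinese enumeration comma.
    return [p.strip() for p in re.split(r"[,、;]", m.group(1)) if p.strip()]

snippet = "Company X. Main products: centrifugal pump, ball valve, gear reducer"
candidates = extract_products(snippet)
# candidates == ["centrifugal pump", "ball valve", "gear reducer"]
```

The extracted strings would feed the first candidate product concept set of step S21.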
For unstructured text data, the embodiment of the present invention uses patent text data and paper text data. Product concepts in unstructured text cannot be extracted by rules, so accumulated paper and patent texts are used as the annotation corpus to form labeled data; a product concept extraction model is trained on this data based on a Bert+LSTM+CRF framework, product concept entities are mined from the accumulated patent and paper data sets, and candidate product concept words are generated. The product concept extraction model extracts concepts as follows:
First, the annotation labels for token sequences in the text data comprise three categories, "O", "B-Product", and "I-Product", where "B-Product" marks the beginning token of a product concept sequence, "I-Product" marks the middle and ending tokens of a product concept sequence, and "O" marks tokens outside any product concept sequence.
Then, the annotated training data is processed and fed into the Bert pre-training model to obtain pre-trained vector features; these features are fed into a classical LSTM+CRF model to obtain the final features for judging whether each token belongs to a product concept sequence, thereby completing model training and enabling extraction of new product concepts.
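Once the tagger has labeled each token, the B/I/O sequence must be decoded back into product concept strings. A small sketch of that decoding step (the tokens and tags below are a made-up example, character-level as is common for Chinese text):

```python
def decode_products(tokens, tags):
    """Decode a B/I/O tag sequence (as produced by a Bert+LSTM+CRF
    tagger) into product concept strings."""
    products, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-Product":
            if current:                       # close any open span
                products.append("".join(current))
            current = [tok]
        elif tag == "I-Product" and current:
            current.append(tok)
        else:  # "O" (or a stray "I-Product" with no opener) ends the span
            if current:
                products.append("".join(current))
            current = []
    if current:
        products.append("".join(current))
    return products

# Character-level tokens: "生产锂电池和电解液" (produces lithium batteries and electrolyte)
tokens = list("生产锂电池和电解液")
tags = ["O", "O", "B-Product", "I-Product", "I-Product", "O",
        "B-Product", "I-Product", "I-Product"]
# decode_products(tokens, tags) == ["锂电池", "电解液"]
```

The decoded spans form the second candidate product concept set of step S22.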
In addition, the specific implementation procedure of step S3 in fig. 1 is as follows:
S31, constructing a synonymous concept sample set conforming to a product synonymous relation according to pre-accumulated product concept alias information, wherein each synonymous concept sample corresponds to a group of product concept words with the same meaning but different expressions, training a synonymous concept discrimination model based on a synonymous concept sample set by utilizing classical sequence classification tasks in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in an existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and if one candidate product concept in the candidate product concept set conforms to the synonymous concept relation between one node in the existing product node system, taking the two as a concept-node pair conforming to the synonymous relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair meeting the synonymous relation obtained in S31, fusing candidate product concept nouns into corresponding node attributes in the existing product node system, and storing the node attribute fields of the node instances in the product library to realize node alias attribute fusion.
In this embodiment, for step S31, when training the inter-concept synonymy discrimination model, Bert is likewise used as the pre-training model to construct semantic feature vectors of concepts. Then, following the typical sequence classification task in the Bert fine-tune scenario, concept-node pairs are taken as input and whether they are synonymous as the prediction label, and the model is trained to predict whether new concept-node pairs satisfy the synonymy relationship.
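The routing logic of steps S31/S32 can be sketched as below. Note this is a hedged illustration: `is_synonym` is a stand-in for the trained Bert sequence-pair discrimination model (here replaced by a fixed toy synonym table), and the node records are hypothetical:

```python
def route_candidates(candidates, nodes, is_synonym):
    """S31/S32 routing sketch: synonymous candidates are fused into the
    matching node's alias attribute; the rest become new product
    concepts to be hung into the system in step S4."""
    new_concepts = []
    for concept in candidates:
        match = next((n for n in nodes if is_synonym(concept, n["name"])), None)
        if match is not None:
            if concept not in match["aliases"]:
                match["aliases"].append(concept)  # node alias attribute fusion
        else:
            new_concepts.append(concept)
    return new_concepts

# Toy stand-in for the trained discrimination model: a fixed synonym table.
synonyms = {("automobile", "car"), ("cell phone", "mobile phone")}
nodes = [{"name": "car", "aliases": []}, {"name": "mobile phone", "aliases": []}]
new = route_candidates(["automobile", "graphene film"], nodes,
                       lambda c, n: (c, n) in synonyms)
# new == ["graphene film"]; nodes[0]["aliases"] == ["automobile"]
```

In the actual system the alias fields would be persisted to the node instances in the product library, as step S32 describes.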
Fig. 3 illustrates the modeling method of the query-node concept pair (i.e., node-node pair) upper-lower relationship judgment model involved in step S4. The upper-lower relationship model between a new node and the existing nodes is trained and applied as follows:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing the feature vectors of the query product concept and of each product node in the node graph structure with the domain language model obtained in S1; propagating, fusing and iteratively updating the node features in the node graph structure with a GNN graph neural network model to obtain the respective word embedding representations of the query and the node; inputting these word embedding representations into a binary classification model and training it to predict whether the node in a query-node concept pair is the direct parent node of the query, thereby obtaining the upper-lower relation classification judgment model;
S43, for the new product concepts obtained in S3, judging the upper-lower relation between each new product concept and each existing node in the product node system one by one with the upper-lower relation classification judgment model, and taking the existing node with the highest matching degree as the direct parent node, so as to attach the new concept and expand the product node system.
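The attachment rule in S43 — score every existing node with the trained classifier and take the best match as the direct parent — can be sketched as follows; the function names and toy scores are illustrative assumptions standing in for the trained model.

```python
def attach_new_concept(new_concept, existing_nodes, score_fn):
    """Score every existing node with the (assumed) upper-lower relation
    classifier `score_fn` and attach the new concept under the
    highest-scoring node, as in step S43."""
    return max(existing_nodes, key=lambda node: score_fn(new_concept, node))

# Toy stand-in for the trained classifier's matching score.
scores = {("electric scooter", "vehicle"): 0.9,
          ("electric scooter", "food"): 0.1}
parent = attach_new_concept("electric scooter", ["vehicle", "food"],
                            lambda q, n: scores[(q, n)])
# parent == "vehicle"
```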
For ease of understanding, the graph structure data set constructed in step S41 and the feature vector generation method on the graph structure of step S42 are described in detail below for this embodiment.
As shown in FIG. 3, part 1 is the graph structure corresponding to node $n_i$; the graph structure comprehensively considers the parent nodes and sibling nodes of the node. For a node $n_i$, assume it is node b in the left graph and denote its set of adjacent nodes by N(b). To express the GNN calculation formula, define $\tilde{N}(b) = \{b\} \cup N(b)$ as the union of node b and its adjacent node set. The vector update formula of node b at the k-th iteration is:

$$h_b^{(k)} = \mathrm{Agg}^{(k)}\left(\left\{ h_u^{(k-1)} : u \in \tilde{N}(b) \right\}\right) \tag{1}$$

where $h_b^{(0)}$ is the word embedding representation in the initial state, $h_b^{(k)}$ is the word embedding representation of node b at the k-th iteration, and $\mathrm{Agg}^{(k)}$ performs propagation-aggregation over all nodes in $\tilde{N}(b)$ based on the result of the previous iteration. Commonly used Agg functions on graph structures are the graph convolutional network GCN and the graph attention network GAT, where GCN defines the Agg function as:

$$h_b^{(k)} = \rho\left( \sum_{u \in \tilde{N}(b)} \frac{1}{c_{bu}} W^{(k)} h_u^{(k-1)} \right) \tag{2}$$
where ρ is a nonlinear activation function (e.g., ReLU), W is a parameter learned by the model, and $c_{bu} = \sqrt{|\tilde{N}(b)|\,|\tilde{N}(u)|}$ is the normalization coefficient, whose value is the same in every participating iteration.
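A minimal numpy sketch of one GCN propagation step with the symmetric normalization coefficient described above; the matrix shapes, the toy two-node graph, and the choice of ReLU for ρ are assumptions for illustration.

```python
import numpy as np

def gcn_step(H, adj, W):
    """One GCN propagation step per formula (2): for each node b, sum
    (1/c_bu) * W h_u over u in Ñ(b), with the symmetric normalization
    c_bu = sqrt(|Ñ(b)| |Ñ(u)|), then apply ReLU. Self-loops implement
    the union Ñ(b) = {b} ∪ N(b)."""
    A = adj + np.eye(adj.shape[0])           # Ñ(b): add self-loops
    d = A.sum(axis=1)                        # degrees |Ñ(·)|
    norm = A / np.sqrt(np.outer(d, d))       # the 1/c_bu coefficients
    return np.maximum(norm @ H @ W.T, 0.0)   # ρ = ReLU

adj = np.array([[0.0, 1.0], [1.0, 0.0]])     # two connected nodes
H = np.eye(2)                                # initial word embeddings
W = np.eye(2)                                # toy weight matrix
H1 = gcn_step(H, adj, W)
```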
If the normalization coefficient is regarded as an importance weight between two nodes, then GCN treats every adjacent node equally in the iterative calculation: only the structural characteristics of the nodes are considered, and the weight information between a node and its adjacent nodes is ignored. The embodiment of the invention therefore adopts an optimized GAT graph attention model. GAT redefines the coefficient $1/c_{bu}$ so that node information is propagated taking into account not only structural information but also the information of the nodes themselves:

$$\alpha_{bu} = \mathrm{softmax}_{u \in \tilde{N}(b)}\left( \gamma\left( z^{\top} \left[ W h_b^{(k-1)} \,\|\, W h_u^{(k-1)} \right] \right) \right) \tag{3}$$

In the above formula, z and W are parameters to be learned, γ is a nonlinear activation function, and ∥ represents the splicing operation. Substituting (3) into (2) in place of $1/c_{bu}$ yields a single GAT model; the embodiment adopts a multi-head attention mechanism and splices the results of the heads to obtain the final feature vector of each node.
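The GAT attention weights can be sketched in numpy as follows, for a single head and a single node b; the use of LeakyReLU for γ and the toy inputs are assumptions for illustration (multi-head GAT would concatenate several such heads).

```python
import numpy as np

def gat_attention(h, W, z, neighbors):
    """Attention weights α_bu per formula (3) for node b (index 0):
    score = LeakyReLU(z · [W h_b ∥ W h_u]), then softmax over Ñ(b)."""
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)
    hb = W @ h[0]
    scores = np.array([leaky_relu(z @ np.concatenate([hb, W @ h[u]]))
                       for u in neighbors])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return e / e.sum()

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy node features
W = np.eye(2)                                       # toy weight matrix
z = np.ones(4)                                      # toy attention vector
alpha = gat_attention(h, W, z, neighbors=[0, 1, 2])
# alpha sums to 1; node 2 (largest features) gets the largest weight
```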
As shown in part 2 of FIG. 3, considering that the same node may be located at different positions in different graph structures, the embodiment also embeds a position-information representation in the iterative process, trained together with the GAT propagation. The position feature of node b at the k-th iteration is denoted $p_b^{(k)}$, and the $h_u^{(k-1)}$ in formula (2) is replaced by the matrix calculation after position embedding and splicing:

$$h_b^{(k)} = \rho\left( \sum_{u \in \tilde{N}(b)} \alpha_{bu} \left[ W \,\|\, O^{(k-1)} \right] \left[ h_u^{(k-1)} \,\|\, p_u^{(k-1)} \right] \right) \tag{4}$$

where ∥ represents the splicing operation and $O^{(k-1)}$ is the part of the training parameters that must be aligned and appended to the trained matrix W after feature splicing.
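The position-embedded update — splice each node feature with its position feature and multiply by the correspondingly spliced parameter matrix — can be sketched as below; the toy dimensions, the uniform attention matrix, and ReLU for ρ are assumptions for illustration.

```python
import numpy as np

def gat_pos_step(H, P, alpha, W, O):
    """Position-augmented update per formula (4): splice each node's
    feature h_u with its position feature p_u, multiply by the spliced
    parameter [W ∥ O], and aggregate with attention weights alpha
    (assumed precomputed per formula (3))."""
    HP = np.concatenate([H, P], axis=1)      # [h_u ∥ p_u]
    WO = np.concatenate([W, O], axis=1)      # [W ∥ O], aligned to splice
    return np.maximum(alpha @ HP @ WO.T, 0)  # ρ = ReLU

H = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy node features
P = np.array([[0.5], [0.5]])                 # toy 1-d position features
alpha = np.array([[0.5, 0.5], [0.5, 0.5]])   # uniform attention weights
W = np.eye(2)
O = np.zeros((2, 1))                         # position part of parameters
H1 = gat_pos_step(H, P, alpha, W, O)
```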
The final output feature vector Embed of the node feature map uses the weighted mean of the position-embedded vectors of all nodes in the graph, calculated as:

$$\mathrm{Embed} = \sum_{b} w_b \left[ h_b^{(K)} \,\|\, p_b^{(K)} \right], \qquad \sum_{b} w_b = 1 \tag{5}$$

where K is the final iteration and the weights $w_b$ are learned.
As shown in part 3 of FIG. 3, the feature vectors of the query and the node are spliced and then input into an MLP model for binary classification prediction, determining whether the product node is the direct parent of the query product concept word. The embodiment of the invention adopts the InfoNCE loss function for optimization, defined as:

$$\mathcal{L} = -\sum_{X_i \in X} \log \frac{\exp\left(f\left(x_i^{+}\right)\right)}{\sum_{j=1}^{N+1} \exp\left(f\left(x_{i,j}\right)\right)} \tag{6}$$

In the above formula, $X_i$ represents the set of 1 positive example $x_i^{+}$ and N negative examples generated from an upper-lower relationship edge $\langle n_P, n_C \rangle$ ($n_P$ is the direct parent node of $n_C$); X is the sample set generated from all edges; j traverses, from 1 to N+1, all the samples generated for $X_i$; and $f(\cdot)$ is the matching score output by the MLP.
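A numpy sketch of the 1-positive/N-negative contrastive loss described above; the convention of placing the positive sample at index 0 of each score group is an assumption for illustration.

```python
import numpy as np

def info_nce_loss(scores_per_edge):
    """InfoNCE-style loss per formula (6): each row holds the MLP
    matching scores for one edge's sample group, positive first,
    then N negatives; loss = -log softmax(positive), averaged."""
    total = 0.0
    for scores in scores_per_edge:
        e = np.exp(scores - scores.max())    # numerically stable softmax
        total += -np.log(e[0] / e.sum())     # index 0 = positive sample
    return total / len(scores_per_edge)

# One edge group: the positive scores 2.0, two negatives score 0.0.
loss = info_nce_loss([np.array([2.0, 0.0, 0.0])])
```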
FIG. 4 is a schematic diagram of the iterative updating of the product knowledge system in step S5. The updating can be realized with a human-machine interaction verification tool: the product concept extraction results, product alias fusion results, and product system growth results from steps S2, S3 and S4 are manually checked and verified; if verification passes, the update is applied, otherwise it is not. Meanwhile, the sample data sets used to train each model in steps S2, S3 and S4 are continuously accumulated and updated according to the verification results, iteratively improving model performance and continuously constructing a high-quality all-field product node system.
Taking the checking and iterative updating of the upper-lower relation discrimination and node system attachment as an example, the method comprises the following steps:
In part 1, for the newly added product nodes generated in step S42, a corresponding knowledge-checking interface tool can be designed and developed so that the upper-lower relationships in the system can be checked manually. If the relationship is correct, the self-growth update of the product node is confirmed; if adjustment is required, the hierarchical position of the product node is adjusted manually and the update of the new product node system is then confirmed;
In part 2, the database of product upper-lower hierarchical relationship node pairs is updated synchronously according to the manual verification results, expanding the training sample set;
In part 3, based on the new model training samples, the upper-lower relation judgment model is retrained periodically once the accumulated samples reach a sufficient magnitude, the model version is updated, and the model performance is continuously improved through iteration.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A dynamic fusion and growth method of a full-field product node system is characterized by comprising the following steps:
S1, taking a general product classification system required by building a full-field product system as an upper layer framework of a product node system, and further performing fine adjustment on a data set of the general product classification system by utilizing a pre-training language model to obtain a field language model, wherein the field language model is used for obtaining word embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set;
S3, training a synonym concept judgment model by using a product concept alias library, judging a synonym relationship between a candidate product concept in the candidate product concept set and a node in the existing product node system, fusing the product concept conforming to the synonym relationship with the node as a concept-node pair to obtain a node system after alias expansion, and simultaneously taking the product concept which does not conform to the synonym relationship with any node as a new product concept;
S4, based on the domain language model obtained in the S1, constructing a node-node pair training set conforming to the upper and lower relation according to the existing product node system, and training to obtain an upper and lower relation classification judgment model, so that the node classification judgment model can judge the direct father level node of the node concept, further predicting the father level node of each new product concept obtained in the S3 by using the trained upper and lower relation classification judgment model, and hanging and expanding the new product concept into the product node system according to the prediction result;
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the product node system in the whole field;
The step S1 specifically comprises the following steps:
S11, according to the construction requirements of a product system in the whole field, taking a general product classification system HS code as a seed node system to form an upper layer framework of the product node system, and further obtaining an upper-lower relationship data set in the product node system;
S12, performing fine tuning training on a description text of a seed node system by using a Bert pre-training language model, learning semantic features in a field text expression to obtain a field language model, and obtaining a feature vector of each node concept in a product node system by using the field language model;
The step S2 specifically comprises the following steps:
S21, carrying out rule-based structural analysis and extraction on product concepts in text for continuously acquired semi-structured text data containing the product concepts to generate a first candidate product concept set;
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted on new unstructured text data through continuous collection through the product concept extraction model, so that a second candidate product concept set is generated;
S23, merging the first candidate product concept set and the second candidate product concept set into a candidate product concept set, wherein the candidate product concept set is used as a basis for expanding an existing product node system;
the step S3 specifically comprises the following steps:
S31, constructing a synonymous concept sample set conforming to a synonymous relation of a product according to product concept alias information, training a synonymous concept discrimination model based on the synonymous concept sample set by utilizing a sequence classification task in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in the existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and taking one candidate product concept in the candidate product concept set and one node in the existing product node system as a concept-node pair conforming to the synonymous relation if the two candidate product concepts conform to the synonymous concept relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair which accords with the synonymous relation and is obtained in the S31, fusing candidate product concept nouns in the concept-node pair into corresponding node attributes in the existing product node system, and storing the candidate product concept nouns into alias attribute fields of node instances in a product library to realize node alias attribute fusion;
the step S4 specifically comprises the following steps:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing feature vectors of each product node in a query product concept and node graph structure by using the domain language model obtained in the S1, carrying out propagation fusion and iterative updating on each node feature by adopting a GNN graph neural network model in the node graph structure to obtain respective word embedding representation of the query and the node, inputting the word embedding representation into a two-class model, and training the two-class model to enable the node in the query-node concept pair to be a direct father node of the query or not so as to obtain an upper-lower relationship classification judgment model;
S43, for the new product concepts obtained in the S3, judging the upper and lower relationship of each new product concept and each existing node in the product node system one by utilizing an upper and lower relationship classification judgment model, and calculating the existing node with the highest matching degree as a direct father level node so as to carry out the hanging expansion of the product node system.
2. The method for dynamically fusing and growing a full-field product node system according to claim 1, wherein the semi-structured text data containing product concepts is an enterprise annual report.
3. The method for dynamically fusing and growing a full-field product node system according to claim 1, wherein the unstructured text data containing product concepts is patent text data and/or paper text data.
4. The dynamic fusion and growth method of the full-field product node system according to claim 1, wherein in the step S5, a verification tool of man-machine interaction is utilized to perform manual verification and verification on the product concept extraction result, the product alias fusion result and the product system growth result in the step S2, the step S3 and the step S4, and meanwhile, sample data sets used for training each model in the step S2, the step S3 and the step S4 are continuously deposited and updated according to the verification result, so that model performance is iteratively improved, and a high-quality full-field product node system is continuously constructed.
CN202111166990.2A 2021-10-01 2021-10-01 Dynamic fusion and growth method for product node system in all fields Active CN113987197B (en)

Publications (2)

Publication Number Publication Date
CN113987197A CN113987197A (en) 2022-01-28
CN113987197B true CN113987197B (en) 2024-04-23


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491469A (en) * 2018-03-07 2018-09-04 浙江大学 Introduce the neural collaborative filtering conceptual description word proposed algorithm of concepts tab
KR20200064880A (en) * 2018-11-29 2020-06-08 부산대학교 산학협력단 System and Method for Word Embedding using Knowledge Powered Deep Learning based on Korean WordNet
CN113191152A (en) * 2021-06-30 2021-07-30 杭州费尔斯通科技有限公司 Entity identification method and system based on entity extension


