CN113987197B - Dynamic fusion and growth method for product node system in all fields - Google Patents


Info

Publication number
CN113987197B
Authority
CN
China
Prior art keywords
product
node
concept
candidate
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111166990.2A
Other languages
Chinese (zh)
Other versions
CN113987197A (en)
Inventor
张啸天
宗畅
杨彦飞
许源泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Liangzhi Data Technology Co ltd filed Critical Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202111166990.2A priority Critical patent/CN113987197B/en
Publication of CN113987197A publication Critical patent/CN113987197A/en
Application granted granted Critical
Publication of CN113987197B publication Critical patent/CN113987197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F16/35 - Clustering; Classification
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a dynamic fusion and growth method for an all-field product node system. Addressing the cognitive decision-making needs of fine-grained emerging fields in regional industrial economic development, the invention builds on an existing authoritative product classification system and continuously mines product concept nodes from massive semi-structured and unstructured heterogeneous Internet data sources, using natural language processing and knowledge graph techniques such as concept extraction, relationship discrimination, and attribute fusion. Product concepts are represented with text embedding techniques; the relationships between these concepts and the nodes of the original product system are then judged, fused, and attached, continuously expanding the node system to form an all-field product node system capable of dynamic fusion and growth. In addition, a human-machine collaborative interaction workflow during system construction and updating ensures the authority and accuracy of the all-field product node system.

Description

Dynamic fusion and growth method for product node system in all fields
Technical Field
The invention relates to the fields of computer technology, artificial intelligence and natural language processing, in particular to a dynamic fusion and growth method of a product node system in the whole field.
Background
With the development of computer science and artificial intelligence, automation and intelligence have become key elements of digital innovation and upgrading across industries, especially in economic digitization. Traditional cognitive analysis and decision-making for regional industrial economic development rely on expert experience, which sinks large amounts of labor cost; the product system framework used in industrial analysis cannot be effectively abstracted and accumulated, and a fine-grained product framework system cannot be standardized quickly and effectively.
Given the scenario requirements of product system construction, an all-field product system needs to be built on top of a set of international standard industry classifications, extending downward into product categories and finer-grained product subdivision nodes. The problems to be considered include: how to select the standard industry classification system used as the base, how to identify product concepts from semi-structured annual-report data and unstructured paper and patent texts, how to embed and represent system nodes, how to judge synonymy and upper-lower relationships between products, and how to ensure accurate, continuous dynamic fusion and growth of the product system, so that a high-quality all-field product system can be constructed.
Therefore, a dynamic fusion and growth method of a product node system in the whole field is needed, automatic construction of the product system is rapidly carried out, cognitive decision of regional industries is assisted, and the innovation requirement of industrial economy and digitization is met.
Disclosure of Invention
In view of the above, the invention aims to solve the problem that, in industrial analysis and research, the industry knowledge in experts' minds cannot be accumulated and an industrial product knowledge system cannot be constructed quickly, and provides a dynamic fusion and growth method for an all-field product node system. The method is a cross-language, full-industry-chain product system construction and self-growing dynamic update method for the system framework used in industrial analysis and decision-making; it can mine product concepts from heterogeneous data sources and dynamically fuse continuously emerging new products into the system, providing a new approach to the exploration, classification, and analysis of new products.
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:
A dynamic fusion and growth method of a full-field product node system comprises the following steps:
S1, taking the general product classification system required for building an all-field product system as the upper-layer framework of the product node system, and fine-tuning a pre-training language model on a data set of the general product classification system to obtain a domain language model, wherein the domain language model is used for obtaining the word embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set;
S3, training a synonym concept judgment model by using a product concept alias library, judging a synonym relationship between a candidate product concept in the candidate product concept set and a node in the existing product node system, fusing the product concept conforming to the synonym relationship with the node as a concept-node pair to obtain a node system after alias expansion, and simultaneously taking the product concept which does not conform to the synonym relationship with any node as a new product concept;
S4, based on the domain language model obtained in S1, constructing a node-node pair training set conforming to the upper-lower relationship according to the existing product node system, and training an upper-lower relationship classification judgment model so that it can judge the direct parent node of a node concept; then predicting the parent node of each new product concept obtained in S3 with the trained model, and hanging and expanding the new product concept into the product node system according to the prediction result;
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the all-field product node system.
Preferably, the step S1 specifically includes the following steps:
S11, according to the construction requirements of a product system in the whole field, taking a general product classification system HS code as a seed node system to form an upper layer framework of the product node system, and further obtaining an upper-lower relationship data set in the product node system;
S12, performing fine-tuning training on the description texts of the seed node system with the Bert pre-training language model to learn the semantic features of domain text expression, thereby obtaining a domain language model, and using the domain language model to obtain the feature vector of each node concept in the product node system.
Preferably, the step S2 specifically includes the following steps:
S21, carrying out rule-based structural analysis and extraction on product concepts in text for continuously acquired semi-structured text data containing the product concepts to generate a first candidate product concept set;
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted on new unstructured text data through continuous collection through the product concept extraction model, so that a second candidate product concept set is generated;
S23, combining the first candidate product concept set and the second candidate product concept set into a candidate product concept set, and using the candidate product concept set as a basis for expanding an existing product node system.
Further, the semi-structured text data containing the product concept is an enterprise annual report.
Further, the unstructured text data containing product concepts is patent text data and/or paper text data.
Preferably, the step S3 specifically includes the following steps:
S31, constructing a synonymous concept sample set conforming to a synonymous relation of a product according to product concept alias information, training a synonymous concept discrimination model based on the synonymous concept sample set by utilizing a sequence classification task in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in the existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and taking one candidate product concept in the candidate product concept set and one node in the existing product node system as a concept-node pair conforming to the synonymous relation if the two candidate product concepts conform to the synonymous concept relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair meeting the synonymous relation obtained in S31, fusing candidate product concept nouns into corresponding node attributes in the existing product node system, and storing the node attribute fields of the node instances in the product library to realize node alias attribute fusion.
Preferably, the step S4 specifically includes the following steps:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing feature vectors of each product node in a query product concept and node graph structure by using the domain language model obtained in the S1, carrying out propagation fusion and iterative updating on each node feature by adopting a GNN graph neural network model in the node graph structure to obtain respective word embedding representation of the query and the node, inputting the word embedding representation into a two-class model, and training the two-class model to enable the node in the query-node concept pair to be a direct father node of the query or not so as to obtain an upper-lower relationship classification judgment model;
S43, for the new product concepts obtained in the S3, judging the upper and lower relationship of each new product concept and each existing node in the product node system one by utilizing an upper and lower relationship classification judgment model, and calculating the existing node with the highest matching degree as a direct father level node so as to carry out the hanging expansion of the product node system.
Preferably, in the step S5, a verification tool of man-machine interaction is used to perform manual auditing and verification on the product concept extraction result, the product alias fusion result and the product system growth result in the steps S2, S3 and S4, and simultaneously sample data sets used for training each model in the steps S2, S3 and S4 are continuously deposited and updated according to the verification result, so that the model performance is iteratively improved, and a high-quality full-field product node system is continuously constructed.
The dynamic fusion and growth method of the all-field product node system provided by the invention can be used for quickly constructing a standardized product knowledge system facing to application scenes of regional industry cognitive decision, and besides, a set of semi-automatic construction process of the all-field product node system is also output, so that the method can be applied to an industrial chain analysis decision system, and is beneficial to improving the automation and intelligent degree of an industrial development cognitive decision process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to the structures illustrated in these drawings without the need of inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a whole flow chart of a dynamic fusion and growth method of a product node system in the whole field, which is provided by the embodiment of the invention;
FIG. 2 is a schematic diagram of a concept of identifying a product by mining heterogeneous data sources according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a modeling method of a query-node concept to an upper-lower relationship determination model according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of iterative updating of a product knowledge system through man-machine interaction according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, steps, operations, or elements, but do not preclude the presence or addition of one or more other features, steps, operations, or groups of elements.
It should also be understood that where directional indications are referred to in embodiments of the present invention, the directional indications are merely used to explain the relative relationship of the components being constructed at a particular location, and that if the particular location changes, the directional indications correspondingly change.
In addition, the product node system in the whole field constructed in the invention is constructed by using general industry product knowledge as a guide in the construction process, but the construction method is not limited to any industry, and the product node system in any specific industry can be constructed by the construction method and the dynamic updating method.
In the invention, the dynamic fusion and growth method of the product node system in the whole field comprises the following steps:
S1, taking an authoritative general product classification system required by building a full-field product system as an upper layer framework of a product node system, and further performing Fine tuning (Fine-tune) on a data set of the general product classification system by utilizing a pre-training language model to obtain a field language model, wherein the field language model is used for obtaining word embedding representation of each node in the product node system.
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set.
S3, training a synonym concept judgment model by using a product concept alias library, judging the synonym relation between the candidate product concepts in the candidate product concept set and nodes in the existing product node system, fusing the product concepts conforming to the synonym relation with the nodes as concept-node pairs to obtain a node system after alias expansion, and simultaneously taking the product concepts which do not conform to the synonym relation with any node as new product concepts.
S4, based on the domain language model obtained in the S1, a node-node pair training set conforming to the upper and lower relation is constructed according to the existing product node system, and an upper and lower relation classification judgment model is obtained through training, so that the direct father level node of the node concept (query) can be judged, the father level node of each new product concept obtained in the S3 is predicted by using the trained upper and lower relation classification judgment model, and the new product concept is hung and expanded into the product node system according to a prediction result.
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the all-field product node system.
The specific implementation of the steps S1 to S5 in this embodiment will be described below.
In the embodiment of the present invention, for the step S1 referred to in fig. 1, the specific implementation process is as follows:
S11, according to the construction requirements of the product system in the whole field, selecting a universal product classification system meeting the conditions as an upper layer framework of the product node system, and constructing an upper-lower relationship data set of nodes in the product node system. It should be noted that the selected general product classification system needs to have authority, in this embodiment, based on professional background knowledge, from aspects of node scale, node granularity, node expression form and the like, HS codes (customs codes) are selected as a seed node system to form an authority product node system upper layer architecture, and then an upper and lower relationship data set in the node system is obtained according to a standard system construction. The seed node system is used as the basic framework of the whole product node system, the nodes in the system represent product concepts of different levels and have upper and lower relations, and then the attribute expansion of the nodes and the hanging growth of new product concept nodes can be further carried out on the basis of the seed node system, so that the product node system is continuously expanded, and the product node system in the whole field is continuously and dynamically constructed.
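Constructing the upper-lower relationship data set from a seed node system, as described above, amounts to walking the seed hierarchy and emitting child-parent pairs. A minimal pure-Python sketch follows; the nested-dict tree and the concept names are illustrative stand-ins, not the actual HS-code data:

```python
def hierarchy_pairs(tree, parent=None):
    """Walk a seed classification tree (nested dicts keyed by concept
    name) and emit (child, parent) pairs, i.e. the upper-lower
    relationship data set of step S11."""
    pairs = []
    for node, children in tree.items():
        if parent is not None:
            pairs.append((node, parent))
        pairs.extend(hierarchy_pairs(children, parent=node))
    return pairs

# Toy slice of an HS-code-style seed system (names are illustrative).
seed = {"Machinery": {"Pumps": {"Centrifugal pump": {}}, "Valves": {}}}
pairs = hierarchy_pairs(seed)
# pairs == [("Pumps", "Machinery"), ("Centrifugal pump", "Pumps"), ("Valves", "Machinery")]
```

Each pair can then serve directly as a positive sample for the upper-lower relationship model trained later in step S4.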
S12, performing Fine-tuning training on a description text of a seed node system by using the Bert pre-training language model, learning semantic features in the text expression of the field, obtaining a field language model after training, and obtaining feature vectors of each node concept in the product node system, namely word embedding representation of the nodes by using the field language model.
Fig. 2 is a schematic diagram of extracting product concept words from heterogeneous data sources according to an embodiment of the present invention. Product concepts need to be extracted from two kinds of heterogeneous data, semi-structured text data and unstructured text data, to further expand the existing product node system. This process corresponds to step S2 of the method shown in Fig. 1, and its specific implementation is as follows:
S21, carrying out rule-based structural analysis and extraction on product concepts in the text for continuously acquired semi-structured text data containing the product concepts, and generating a first candidate product concept set.
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted through the product concept extraction model on new unstructured text data which is continuously collected, so that a second candidate product concept set is generated.
S23, combining the first candidate product concept set and the second candidate product concept set into a candidate product concept set, and using the candidate product concept set as a basis for expanding an existing product node system.
It should be noted that, because the product concepts in the industry are updated continuously over time, the extraction process of the candidate product concepts in this step may also be performed continuously, so as to implement dynamic update of the product node system. In actual operation, new semi-structured text data and unstructured text data can be continuously collected and accumulated, and then candidate product concepts are regularly extracted from the accumulated data for updating a product node system.
The specific forms of the above-described semi-structured text data and unstructured text data may be varied as long as product concepts are contained therein. In this embodiment, the semi-structured text data may be semi-structured product information table data disclosed in the annual newspaper of the enterprise, and related product information may be extracted by a rule method, and a specific extraction rule may be determined according to the language characteristics of the product concept in the text.
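As an illustration only (the field name, delimiters, and snippet below are hypothetical; real rules would be tuned to the actual annual-report table layout as the text notes), a minimal rule-based extractor for such semi-structured product information might look like:

```python
import re

def extract_products(annual_report_text):
    """Minimal rule-based sketch: pull candidate product concepts from a
    'Main products' field of a semi-structured annual-report snippet."""
    m = re.search(r"Main products:\s*(.+)", annual_report_text)
    if not m:
        return []
    # Product lists in such fields are typically delimited by commas,
    # semicolons, or the Chinese enumeration comma.
    return [p.strip() for p in re.split(r"[,、;]", m.group(1)) if p.strip()]

snippet = "Company X. Main products: centrifugal pump, ball valve, gear reducer"
candidates = extract_products(snippet)
# candidates == ["centrifugal pump", "ball valve", "gear reducer"]
```

The extracted strings would feed the first candidate product concept set of step S21.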
For unstructured text data, the embodiment of the present invention uses patent text data and paper text data. Product concepts in unstructured text cannot be extracted by rules, so accumulated paper and patent texts are used as the annotation corpus to form labeled data; a product concept extraction model is trained on this data based on a Bert+LSTM+CRF framework, product concept entities are mined from the accumulated patent and paper data sets, and candidate product concept words are generated. The product concept extraction model extracts concepts as follows:
First, the annotation labels for token sequences in the text data comprise three categories, "O", "B-Product", and "I-Product", where "B-Product" marks the beginning token of a product concept sequence, "I-Product" marks the middle and ending tokens of a product concept sequence, and "O" marks tokens outside any product concept sequence.
Then, the annotated training data is processed and fed into the Bert pre-training model to obtain pre-trained vector features; these features are fed into a classical LSTM+CRF model to obtain the final features for judging whether each token belongs to a product concept sequence, thereby completing model training and enabling extraction of new product concepts.
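Once the tagger has labeled each token, the B/I/O sequence must be decoded back into product concept strings. A small sketch of that decoding step (the tokens and tags below are a made-up example, character-level as is common for Chinese text):

```python
def decode_products(tokens, tags):
    """Decode a B/I/O tag sequence (as produced by a Bert+LSTM+CRF
    tagger) into product concept strings."""
    products, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-Product":
            if current:                       # close any open span
                products.append("".join(current))
            current = [tok]
        elif tag == "I-Product" and current:
            current.append(tok)
        else:  # "O" (or a stray "I-Product" with no opener) ends the span
            if current:
                products.append("".join(current))
            current = []
    if current:
        products.append("".join(current))
    return products

# Character-level tokens: "生产锂电池和电解液" (produces lithium batteries and electrolyte)
tokens = list("生产锂电池和电解液")
tags = ["O", "O", "B-Product", "I-Product", "I-Product", "O",
        "B-Product", "I-Product", "I-Product"]
# decode_products(tokens, tags) == ["锂电池", "电解液"]
```

The decoded spans form the second candidate product concept set of step S22.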
In addition, the specific implementation procedure of step S3 in fig. 1 is as follows:
S31, constructing a synonymous concept sample set conforming to a product synonymous relation according to pre-accumulated product concept alias information, wherein each synonymous concept sample corresponds to a group of product concept words with the same meaning but different expressions, training a synonymous concept discrimination model based on a synonymous concept sample set by utilizing classical sequence classification tasks in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in an existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and if one candidate product concept in the candidate product concept set conforms to the synonymous concept relation between one node in the existing product node system, taking the two as a concept-node pair conforming to the synonymous relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair meeting the synonymous relation obtained in S31, fusing candidate product concept nouns into corresponding node attributes in the existing product node system, and storing the node attribute fields of the node instances in the product library to realize node alias attribute fusion.
In this embodiment, for step S31, when training the inter-concept synonymy discrimination model, Bert is likewise used as the pre-training model to construct semantic feature vectors of concepts. Then, following the typical sequence classification task in the Bert fine-tune scenario, concept-node pairs are taken as input and whether they are synonymous as the prediction label, and the model is trained to predict whether new concept-node pairs satisfy the synonymy relationship.
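The routing logic of steps S31/S32 can be sketched as below. Note this is a hedged illustration: `is_synonym` is a stand-in for the trained Bert sequence-pair discrimination model (here replaced by a fixed toy synonym table), and the node records are hypothetical:

```python
def route_candidates(candidates, nodes, is_synonym):
    """S31/S32 routing sketch: synonymous candidates are fused into the
    matching node's alias attribute; the rest become new product
    concepts to be hung into the system in step S4."""
    new_concepts = []
    for concept in candidates:
        match = next((n for n in nodes if is_synonym(concept, n["name"])), None)
        if match is not None:
            if concept not in match["aliases"]:
                match["aliases"].append(concept)  # node alias attribute fusion
        else:
            new_concepts.append(concept)
    return new_concepts

# Toy stand-in for the trained discrimination model: a fixed synonym table.
synonyms = {("automobile", "car"), ("cell phone", "mobile phone")}
nodes = [{"name": "car", "aliases": []}, {"name": "mobile phone", "aliases": []}]
new = route_candidates(["automobile", "graphene film"], nodes,
                       lambda c, n: (c, n) in synonyms)
# new == ["graphene film"]; nodes[0]["aliases"] == ["automobile"]
```

In the actual system the alias fields would be persisted to the node instances in the product library, as step S32 describes.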
Fig. 3 illustrates the modeling method of the query-node concept pair (i.e., node-node pair) upper-lower relationship judgment model involved in step S4. The upper-lower relationship model between a new node and the existing nodes is trained and applied as follows:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing the feature vectors of the query product concept and of each product node in the node graph structure with the domain language model obtained in S1; propagating, fusing and iteratively updating the node features in the node graph structure with a GNN graph neural network model to obtain the respective word embedding representations of the query and the node; inputting these word embedding representations into a binary classification model and training it to predict whether the node in a query-node concept pair is the direct parent node of the query, thereby obtaining the upper-lower relation classification judgment model;
S43, for the new product concepts obtained in S3, judging the upper-lower relation between each new product concept and each existing node in the product node system one by one with the upper-lower relation classification judgment model, and taking the existing node with the highest matching degree as the direct parent node, so as to attach the new concept and expand the product node system.
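The attachment rule in S43 — score every existing node with the trained classifier and take the best match as the direct parent — can be sketched as follows; the function names and toy scores are illustrative assumptions standing in for the trained model.

```python
def attach_new_concept(new_concept, existing_nodes, score_fn):
    """Score every existing node with the (assumed) upper-lower relation
    classifier `score_fn` and attach the new concept under the
    highest-scoring node, as in step S43."""
    return max(existing_nodes, key=lambda node: score_fn(new_concept, node))

# Toy stand-in for the trained classifier's matching score.
scores = {("electric scooter", "vehicle"): 0.9,
          ("electric scooter", "food"): 0.1}
parent = attach_new_concept("electric scooter", ["vehicle", "food"],
                            lambda q, n: scores[(q, n)])
# parent == "vehicle"
```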
For ease of understanding, the graph structure data set constructed in step S41 and the feature vector generation method on the graph structure of step S42 are described in detail below for this embodiment.
As shown in FIG. 3, part 1 is the graph structure corresponding to node $n_i$; the graph structure comprehensively considers the parent nodes and sibling nodes of the node. For a node $n_i$, assume it is node b in the left graph and denote its set of adjacent nodes by N(b). To express the GNN calculation formula, define $\tilde{N}(b) = \{b\} \cup N(b)$ as the union of node b and its adjacent node set. The vector update formula of node b at the k-th iteration is:

$$h_b^{(k)} = \mathrm{Agg}^{(k)}\left(\left\{ h_u^{(k-1)} : u \in \tilde{N}(b) \right\}\right) \tag{1}$$

where $h_b^{(0)}$ is the word embedding representation in the initial state, $h_b^{(k)}$ is the word embedding representation of node b at the k-th iteration, and $\mathrm{Agg}^{(k)}$ performs propagation-aggregation over all nodes in $\tilde{N}(b)$ based on the result of the previous iteration. Commonly used Agg functions on graph structures are the graph convolutional network GCN and the graph attention network GAT, where GCN defines the Agg function as:

$$h_b^{(k)} = \rho\left( \sum_{u \in \tilde{N}(b)} \frac{1}{c_{bu}} W^{(k)} h_u^{(k-1)} \right) \tag{2}$$
where ρ is a nonlinear activation function (e.g., ReLU), W is a parameter learned by the model, and $c_{bu} = \sqrt{|\tilde{N}(b)|\,|\tilde{N}(u)|}$ is the normalization coefficient, whose value is the same in every participating iteration.
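A minimal numpy sketch of one GCN propagation step with the symmetric normalization coefficient described above; the matrix shapes, the toy two-node graph, and the choice of ReLU for ρ are assumptions for illustration.

```python
import numpy as np

def gcn_step(H, adj, W):
    """One GCN propagation step per formula (2): for each node b, sum
    (1/c_bu) * W h_u over u in Ñ(b), with the symmetric normalization
    c_bu = sqrt(|Ñ(b)| |Ñ(u)|), then apply ReLU. Self-loops implement
    the union Ñ(b) = {b} ∪ N(b)."""
    A = adj + np.eye(adj.shape[0])           # Ñ(b): add self-loops
    d = A.sum(axis=1)                        # degrees |Ñ(·)|
    norm = A / np.sqrt(np.outer(d, d))       # the 1/c_bu coefficients
    return np.maximum(norm @ H @ W.T, 0.0)   # ρ = ReLU

adj = np.array([[0.0, 1.0], [1.0, 0.0]])     # two connected nodes
H = np.eye(2)                                # initial word embeddings
W = np.eye(2)                                # toy weight matrix
H1 = gcn_step(H, adj, W)
```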
If the normalization coefficient is regarded as an importance weight between two nodes, then GCN treats every adjacent node equally in the iterative calculation: only the structural characteristics of the nodes are considered, and the weight information between a node and its adjacent nodes is ignored. The embodiment of the invention therefore adopts an optimized GAT graph attention model. GAT redefines the coefficient $1/c_{bu}$ so that node information is propagated taking into account not only structural information but also the information of the nodes themselves:

$$\alpha_{bu} = \mathrm{softmax}_{u \in \tilde{N}(b)}\left( \gamma\left( z^{\top} \left[ W h_b^{(k-1)} \,\|\, W h_u^{(k-1)} \right] \right) \right) \tag{3}$$

In the above formula, z and W are parameters to be learned, γ is a nonlinear activation function, and ∥ represents the splicing operation. Substituting (3) into (2) in place of $1/c_{bu}$ yields a single GAT model; the embodiment adopts a multi-head attention mechanism and splices the results of the heads to obtain the final feature vector of each node.
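The GAT attention weights can be sketched in numpy as follows, for a single head and a single node b; the use of LeakyReLU for γ and the toy inputs are assumptions for illustration (multi-head GAT would concatenate several such heads).

```python
import numpy as np

def gat_attention(h, W, z, neighbors):
    """Attention weights α_bu per formula (3) for node b (index 0):
    score = LeakyReLU(z · [W h_b ∥ W h_u]), then softmax over Ñ(b)."""
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)
    hb = W @ h[0]
    scores = np.array([leaky_relu(z @ np.concatenate([hb, W @ h[u]]))
                       for u in neighbors])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return e / e.sum()

h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy node features
W = np.eye(2)                                       # toy weight matrix
z = np.ones(4)                                      # toy attention vector
alpha = gat_attention(h, W, z, neighbors=[0, 1, 2])
# alpha sums to 1; node 2 (largest features) gets the largest weight
```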
As shown in part 2 of FIG. 3, considering that the same node may be located at different positions in different graph structures, the embodiment also embeds a position-information representation in the iterative process, trained together with the GAT propagation. The position feature of node b at the k-th iteration is denoted $p_b^{(k)}$, and the $h_u^{(k-1)}$ in formula (2) is replaced by the matrix calculation after position embedding and splicing:

$$h_b^{(k)} = \rho\left( \sum_{u \in \tilde{N}(b)} \alpha_{bu} \left[ W \,\|\, O^{(k-1)} \right] \left[ h_u^{(k-1)} \,\|\, p_u^{(k-1)} \right] \right) \tag{4}$$

where ∥ represents the splicing operation and $O^{(k-1)}$ is the part of the training parameters that must be aligned and appended to the trained matrix W after feature splicing.
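The position-embedded update — splice each node feature with its position feature and multiply by the correspondingly spliced parameter matrix — can be sketched as below; the toy dimensions, the uniform attention matrix, and ReLU for ρ are assumptions for illustration.

```python
import numpy as np

def gat_pos_step(H, P, alpha, W, O):
    """Position-augmented update per formula (4): splice each node's
    feature h_u with its position feature p_u, multiply by the spliced
    parameter [W ∥ O], and aggregate with attention weights alpha
    (assumed precomputed per formula (3))."""
    HP = np.concatenate([H, P], axis=1)      # [h_u ∥ p_u]
    WO = np.concatenate([W, O], axis=1)      # [W ∥ O], aligned to splice
    return np.maximum(alpha @ HP @ WO.T, 0)  # ρ = ReLU

H = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy node features
P = np.array([[0.5], [0.5]])                 # toy 1-d position features
alpha = np.array([[0.5, 0.5], [0.5, 0.5]])   # uniform attention weights
W = np.eye(2)
O = np.zeros((2, 1))                         # position part of parameters
H1 = gat_pos_step(H, P, alpha, W, O)
```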
The final output feature vector Embed of the node feature map uses the weighted mean of the position-embedded vectors of all nodes in the graph, calculated as:

$$\mathrm{Embed} = \sum_{b} w_b \left[ h_b^{(K)} \,\|\, p_b^{(K)} \right], \qquad \sum_{b} w_b = 1 \tag{5}$$

where K is the final iteration and the weights $w_b$ are learned.
As shown in part 3 of FIG. 3, the feature vectors of the query and the node are spliced and then input into an MLP model for binary classification prediction, determining whether the product node is the direct parent of the query product concept word. The embodiment of the invention adopts the InfoNCE loss function for optimization, defined as:

$$\mathcal{L} = -\sum_{X_i \in X} \log \frac{\exp\left(f\left(x_i^{+}\right)\right)}{\sum_{j=1}^{N+1} \exp\left(f\left(x_{i,j}\right)\right)} \tag{6}$$

In the above formula, $X_i$ represents the set of 1 positive example $x_i^{+}$ and N negative examples generated from an upper-lower relationship edge $\langle n_P, n_C \rangle$ ($n_P$ is the direct parent node of $n_C$); X is the sample set generated from all edges; j traverses, from 1 to N+1, all the samples generated for $X_i$; and $f(\cdot)$ is the matching score output by the MLP.
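A numpy sketch of the 1-positive/N-negative contrastive loss described above; the convention of placing the positive sample at index 0 of each score group is an assumption for illustration.

```python
import numpy as np

def info_nce_loss(scores_per_edge):
    """InfoNCE-style loss per formula (6): each row holds the MLP
    matching scores for one edge's sample group, positive first,
    then N negatives; loss = -log softmax(positive), averaged."""
    total = 0.0
    for scores in scores_per_edge:
        e = np.exp(scores - scores.max())    # numerically stable softmax
        total += -np.log(e[0] / e.sum())     # index 0 = positive sample
    return total / len(scores_per_edge)

# One edge group: the positive scores 2.0, two negatives score 0.0.
loss = info_nce_loss([np.array([2.0, 0.0, 0.0])])
```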
FIG. 4 is a schematic diagram of the iterative updating of the product knowledge system in step S5. The updating can be realized with a human-machine interaction verification tool: the product concept extraction results, product alias fusion results, and product system growth results from steps S2, S3 and S4 are manually checked and verified; if verification passes, the update is applied, otherwise it is not. Meanwhile, the sample data sets used to train each model in steps S2, S3 and S4 are continuously accumulated and updated according to the verification results, iteratively improving model performance and continuously constructing a high-quality all-field product node system.
Taking the checking and iterative updating of the upper-lower relation discrimination and node system attachment as an example, the method comprises the following steps:
In part 1, for the newly added product nodes generated in step S42, a corresponding knowledge-checking interface tool can be designed and developed so that the upper-lower relationships in the system can be checked manually. If the relationship is correct, the self-growth update of the product node is confirmed; if adjustment is required, the hierarchical position of the product node is adjusted manually and the update of the new product node system is then confirmed;
In part 2, the database of product upper-lower hierarchical relationship node pairs is updated synchronously according to the manual verification results, expanding the training sample set;
In part 3, based on the new model training samples, the upper-lower relation judgment model is retrained periodically once the accumulated samples reach a sufficient magnitude, the model version is updated, and the model performance is continuously improved through iteration.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A dynamic fusion and growth method of a full-field product node system is characterized by comprising the following steps:
S1, taking a general product classification system required by building a full-field product system as an upper layer framework of a product node system, and further performing fine adjustment on a data set of the general product classification system by utilizing a pre-training language model to obtain a field language model, wherein the field language model is used for obtaining word embedding representation of each node in the product node system;
S2, extracting product concepts from unstructured text data containing the product concepts by utilizing a pre-trained product concept extraction model, extracting the product concepts on semi-structured text data containing the product concepts based on rules, continuously and dynamically updating both the unstructured text data and the semi-structured text data so as to continuously extract vocabulary and phrases of the product concepts from the unstructured text data and the semi-structured text data, and combining the vocabulary and phrases to form a candidate product concept set;
S3, training a synonym concept judgment model by using a product concept alias library, judging a synonym relationship between a candidate product concept in the candidate product concept set and a node in the existing product node system, fusing the product concept conforming to the synonym relationship with the node as a concept-node pair to obtain a node system after alias expansion, and simultaneously taking the product concept which does not conform to the synonym relationship with any node as a new product concept;
S4, based on the domain language model obtained in the S1, constructing a node-node pair training set conforming to the upper and lower relation according to the existing product node system, and training to obtain an upper and lower relation classification judgment model, so that the node classification judgment model can judge the direct father level node of the node concept, further predicting the father level node of each new product concept obtained in the S3 by using the trained upper and lower relation classification judgment model, and hanging and expanding the new product concept into the product node system according to the prediction result;
S5, respectively transmitting the candidate product concept set obtained in the S2 and the node systems expanded in the S3 and the S4 to a manual auditing end for verification, and finally updating the product node system according to a verification result, and simultaneously updating training samples of all models used in the S2-S4 to improve the performance of all models, thereby realizing continuous dynamic construction of the product node system in the whole field;
The step S1 specifically comprises the following steps:
S11, according to the construction requirements of a product system in the whole field, taking a general product classification system HS code as a seed node system to form an upper layer framework of the product node system, and further obtaining an upper-lower relationship data set in the product node system;
S12, performing fine tuning training on a description text of a seed node system by using a Bert pre-training language model, learning semantic features in a field text expression to obtain a field language model, and obtaining a feature vector of each node concept in a product node system by using the field language model;
The step S2 specifically comprises the following steps:
S21, carrying out rule-based structural analysis and extraction on product concepts in text for continuously acquired semi-structured text data containing the product concepts to generate a first candidate product concept set;
S22, for unstructured text data containing product concepts, which is obtained through continuous collection, a training sample set containing product concept sequences is obtained through manual labeling, then a product concept extraction model is trained on the basis of the training sample set by using an NLP sequence labeling model, and a product concept sequence is extracted on new unstructured text data through continuous collection through the product concept extraction model, so that a second candidate product concept set is generated;
S23, merging the first candidate product concept set and the second candidate product concept set into a candidate product concept set, wherein the candidate product concept set is used as a basis for expanding an existing product node system;
the step S3 specifically comprises the following steps:
S31, constructing a synonymous concept sample set conforming to a synonymous relation of a product according to product concept alias information, training a synonymous concept discrimination model based on the synonymous concept sample set by utilizing a sequence classification task in a Bert pre-training language model application scene, further predicting the synonymous concept relation between each candidate product concept in the candidate product concept set and each node in the existing product node system by utilizing the synonymous concept discrimination model according to the candidate product concept set, and taking one candidate product concept in the candidate product concept set and one node in the existing product node system as a concept-node pair conforming to the synonymous relation if the two candidate product concepts conform to the synonymous concept relation; if one candidate product concept in the candidate product concept set does not accord with the synonymous concept relation with any node in the existing product node system, the candidate product concept is used as a new product concept and added into the new product concept candidate set;
S32, aiming at the concept-node pair which accords with the synonymous relation and is obtained in the S31, fusing candidate product concept nouns in the concept-node pair into corresponding node attributes in the existing product node system, and storing the candidate product concept nouns into alias attribute fields of node instances in a product library to realize node alias attribute fusion;
the step S4 specifically comprises the following steps:
S41, constructing a query-node concept pair training set by utilizing the upper and lower relationships of the existing nodes in a product node system, wherein in each query-node concept pair, a query represents a product concept to be hung, a node represents a product node in the product node system, product node information formed by all nodes is represented by a node graph structure, a training set label is set to be 1 or 0, wherein 1 represents a node as a direct parent node of the query, and 0 is opposite;
S42, initializing feature vectors of each product node in a query product concept and node graph structure by using the domain language model obtained in the S1, carrying out propagation fusion and iterative updating on each node feature by adopting a GNN graph neural network model in the node graph structure to obtain respective word embedding representation of the query and the node, inputting the word embedding representation into a two-class model, and training the two-class model to enable the node in the query-node concept pair to be a direct father node of the query or not so as to obtain an upper-lower relationship classification judgment model;
S43, for the new product concepts obtained in the S3, judging the upper and lower relationship of each new product concept and each existing node in the product node system one by utilizing an upper and lower relationship classification judgment model, and calculating the existing node with the highest matching degree as a direct father level node so as to carry out the hanging expansion of the product node system.
2. The method for dynamically fusing and growing a full-field product node system according to claim 1, wherein the semi-structured text data containing product concepts is an enterprise annual report.
3. The method for dynamically fusing and growing a full-field product node system according to claim 1, wherein the unstructured text data containing product concepts is patent text data and/or paper text data.
4. The dynamic fusion and growth method of the full-field product node system according to claim 1, wherein in the step S5, a verification tool of man-machine interaction is utilized to perform manual verification and verification on the product concept extraction result, the product alias fusion result and the product system growth result in the step S2, the step S3 and the step S4, and meanwhile, sample data sets used for training each model in the step S2, the step S3 and the step S4 are continuously deposited and updated according to the verification result, so that model performance is iteratively improved, and a high-quality full-field product node system is continuously constructed.
CN202111166990.2A 2021-10-01 2021-10-01 Dynamic fusion and growth method for product node system in all fields Active CN113987197B (en)

Publications (2)

Publication Number Publication Date
CN113987197A CN113987197A (en) 2022-01-28
CN113987197B true CN113987197B (en) 2024-04-23


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491469A (en) * 2018-03-07 2018-09-04 浙江大学 Introduce the neural collaborative filtering conceptual description word proposed algorithm of concepts tab
KR20200064880A (en) * 2018-11-29 2020-06-08 부산대학교 산학협력단 System and Method for Word Embedding using Knowledge Powered Deep Learning based on Korean WordNet
CN113191152A (en) * 2021-06-30 2021-07-30 杭州费尔斯通科技有限公司 Entity identification method and system based on entity extension


