CN116955639A - Method and device for constructing futures industry chain knowledge graph and computer equipment - Google Patents

Method and device for constructing futures industry chain knowledge graph and computer equipment

Info

Publication number
CN116955639A
CN116955639A
Authority
CN
China
Prior art keywords
entity
extraction result
data
original data
extraction
Prior art date
Legal status
Pending
Application number
CN202310449192.3A
Other languages
Chinese (zh)
Inventor
吴福文
康维鹏
唐逐时
杨胜利
Current Assignee
Zheshang Futures Co ltd
Original Assignee
Zheshang Futures Co ltd
Priority date
Filing date
Publication date
Application filed by Zheshang Futures Co., Ltd.
Priority to CN202310449192.3A
Publication of CN116955639A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/04 - Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention discloses a method, a device and computer equipment for constructing a futures industry chain knowledge graph. The method comprises the following steps: acquiring original data of the futures industry chain in various heterogeneous forms; performing text conversion and extraction on the original data to obtain an extraction result; extracting futures industry chain entity relationships from the extraction result to obtain entities and entity-pair attribute relationships; and constructing the futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships. By implementing the method provided by the embodiment of the invention, a futures industry chain knowledge graph can be extracted and constructed from multi-source heterogeneous data, so as to form systematic knowledge logic.

Description

Method and device for constructing futures industry chain knowledge graph and computer equipment
Technical Field
The invention relates to knowledge graph construction, and in particular to a method, a device and computer equipment for constructing a futures industry chain knowledge graph.
Background
In the field of futures finance there is a huge amount of data in various forms, including news, information, web pages, tables, PDF research reports, and audio and video. Extracting and constructing a futures industry chain knowledge graph from such multi-source heterogeneous data to form systematic knowledge logic would be helpful for analyzing the development trend of the futures industry chain; however, no method is currently available for extracting and constructing a futures industry chain knowledge graph from multi-source heterogeneous data to form systematic knowledge logic.
Therefore, a new method needs to be designed to extract and construct a futures industry chain knowledge graph from multi-source heterogeneous data and form systematic knowledge logic.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method, a device and computer equipment for constructing a futures industry chain knowledge graph.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for constructing a futures industry chain knowledge graph comprises the following steps:
acquiring original data of the futures industry chain in various heterogeneous forms;
performing text conversion and extraction on the original data to obtain an extraction result;
extracting futures industry chain entity relationships from the extraction result to obtain entities and entity-pair attribute relationships;
and constructing a futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships.
The further technical scheme is as follows: performing text conversion and extraction on the original data to obtain an extraction result comprises the following steps:
extracting text content from the WORD documents and PDF documents in the original data to obtain a first extraction result;
extracting text from the web pages and tables in the original data to obtain a second extraction result;
extracting text from the audio and video data in the original data to obtain a third extraction result;
extracting text data from the picture data in the original data to obtain a fourth extraction result;
wherein the extraction result comprises the first extraction result, the second extraction result, the third extraction result and the fourth extraction result.
The further technical scheme is as follows: extracting text content from the WORD documents and PDF documents in the original data to obtain a first extraction result comprises the following steps:
converting the WORD documents in the original data into PDF documents;
extracting text content information from the PDF documents in the original data and the PDF documents obtained by converting the WORD documents, so as to obtain first text content;
and judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain the first extraction result.
The further technical scheme is as follows: extracting text from the web pages and tables in the original data to obtain a second extraction result comprises the following steps:
performing rule-based extraction and identification on specific source data in the web pages and tables in the original data by utilizing web page structural features and tools such as XPath, so as to obtain a first identification result;
identifying irregularly structured web pages among the web pages and tables in the original data by adopting a FreeDOM model, so as to obtain a second identification result;
and combining the first identification result and the second identification result to obtain the second extraction result.
The further technical scheme is as follows: extracting text from the audio and video data in the original data to obtain a third extraction result comprises the following steps:
converting the video data in the original data into audio files by adopting FFmpeg;
performing speech-to-text recognition on the audio data in the original data and the converted audio files, and suppressing potential noise in the audio, so as to obtain processed audio files;
segmenting the processed audio files to obtain segmented audio files;
and performing speech-to-text recognition on the segmented audio files through DeepSpeech, and splicing the recognized text content to obtain the third extraction result.
The further technical scheme is as follows: extracting text data from the picture data in the original data to obtain a fourth extraction result comprises the following steps:
performing image-text recognition on the picture data in the original data by adopting a deep convolutional recurrent neural network (CRNN), so as to obtain the fourth extraction result.
The further technical scheme is as follows: extracting futures industry chain entity relationships from the extraction result to obtain entities and entity-pair attribute relationships comprises the following steps:
processing the extraction result to obtain a first sample set;
performing fine-tuning training for entity recognition on a BERT pre-trained base model from Hugging Face, using the PyTorch framework, with the first sample set, so as to obtain a futures industry chain entity recognition model;
preprocessing the extraction result to obtain a second sample set;
training a BERT pre-trained model with the second sample set to obtain a BERT attribute relationship discrimination model;
and determining the entities and entity-pair attribute relationships according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
The further technical scheme is as follows: constructing a futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships comprises the following steps:
incrementally fusing and consolidating the entities and the entity-pair attribute relationships into the existing knowledge graph to obtain the futures industry chain knowledge graph.
The invention also provides a futures industry chain knowledge graph construction device, which comprises:
a data acquisition unit, used for acquiring original data of the futures industry chain in various heterogeneous forms;
an extraction unit, used for performing text conversion and extraction on the original data to obtain an extraction result;
a relationship extraction unit, used for extracting futures industry chain entity relationships from the extraction result to obtain entities and entity-pair attribute relationships;
and a graph construction unit, used for constructing a futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships.
The invention also provides a computer device which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects: the invention performs heterogeneous data content extraction, entity relationship extraction, and knowledge fusion and consolidation on massive corpora of various data to construct a futures industry chain knowledge graph, thereby extracting and constructing the futures industry chain knowledge graph from multi-source heterogeneous data and forming systematic knowledge logic.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 3 is a schematic sub-flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 4 is a schematic sub-flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 5 is a schematic sub-flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 6 is a schematic sub-flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 7 is a schematic sub-flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention;
Fig. 8 is a schematic diagram of the BERT attribute relationship discrimination model provided by an embodiment of the present invention;
Fig. 9 is a schematic diagram of feature fusion provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of the JAKET pre-training model for learning a knowledge fusion representation provided by an embodiment of the present invention;
Fig. 11 is a schematic block diagram of the futures industry chain knowledge graph construction device provided by an embodiment of the present invention;
Fig. 12 is a schematic block diagram of computer equipment provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention, and Fig. 2 is a schematic flowchart of the method. The futures industry chain knowledge graph construction method is applied to a server. The server performs data interaction with a terminal and, for unstructured data in various heterogeneous forms such as news, information, web pages, tables, PDF research reports, and audio and video, trains corresponding deep learning models according to the data characteristics, so as to extract and build a structured futures industry chain knowledge graph, thereby extracting and constructing the futures industry chain knowledge graph from multi-source heterogeneous data to form systematic knowledge logic.
Fig. 2 is a schematic flowchart of the futures industry chain knowledge graph construction method provided by an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S110 to S140.
S110, acquiring original data of the futures industry chain in various heterogeneous forms.
In this embodiment, the original data refers to data information in various heterogeneous forms, such as news, information, web pages, tables, PDF research reports and videos. The original data sources mainly include mainstream websites in fields such as futures exchanges, financial portals and financial institution homepages, as well as the company's own research reports and transaction information. Original data collection is mainly performed with big data tools and methods such as Flume, DataX, Sqoop, Kafka, API (application programming interface) requests and intelligent crawlers. Specifically, structured data such as futures trading quotations are mainly collected from third-party data vendors (for example Bloomberg), third-party exchanges and internally generated sources, using means such as ETL (extract-transform-load) tools and API interfaces; semi-structured data with financial normalization requirements, such as directional announcements and logs, are collected via API interfaces; data such as news, information, tables, pictures, and audio and video are generally collected directionally using intelligent crawler technology.
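As an illustrative sketch (not part of the original disclosure), directional collection of announcement data over an API interface might look like the following; the endpoint URL, query parameters and response field names are all hypothetical assumptions:

```python
import requests

API_URL = "https://example-exchange.com/api/announcements"  # hypothetical endpoint

def fetch_announcements(start_date: str, end_date: str) -> list[dict]:
    """Directionally collect semi-structured announcement data via an API request."""
    params = {"startDate": start_date, "endDate": end_date, "pageSize": 100}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    # Each record is kept as raw JSON and later routed to the text-extraction pipeline.
    return resp.json().get("items", [])

if __name__ == "__main__":
    for item in fetch_announcements("2023-01-01", "2023-01-31"):
        print(item.get("title"), item.get("url"))
```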
S120, carrying out text conversion and extraction on the original data to obtain an extraction result.
In this embodiment, the extraction result refers to the information content extracted from the various types of data in the futures industry chain, with the various heterogeneous data converted into unified text data.
Because the original data includes multi-source heterogeneous forms such as news, information, web pages, tables, PDF research reports and videos, the data in different forms need to be uniformly converted into and extracted as text, so that the futures industry chain knowledge graph can then be constructed on the textualized futures knowledge graph data.
In one embodiment, referring to fig. 3, the step S120 may include steps S121 to S124.
S121, extracting text contents of the WORD document and the PDF document in the original data to obtain a first extraction result.
In this embodiment, the first extraction result refers to information such as the title, time, source, publisher and text content in the collected WORD and PDF documents.
In one embodiment, referring to fig. 4, the step S121 may include steps S1211 to S1213.
S1211, converting the WORD document in the original data into a PDF document.
In this embodiment, the WORD files are uniformly converted into PDF documents.
S1212, extracting text content information of the PDF document in the original data and the PDF document obtained by converting the WORD document so as to obtain first text content.
In this embodiment, the first text content refers to information such as the character content, character encoding, font size, start and stop positions, and picture display size of the PDF document objects.
Specifically, a PDF parsing tool such as PDFBox or iText is first used to parse the document; the PDF document objects (including text, pictures, tables, attachments, etc.) are traversed and read according to the page numbers of the document, and the text object contents are extracted, including character content, character encoding, font size, and the start and stop position information of the document objects.
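The parsing tools named above are Java libraries; as an illustrative sketch under the assumption that a Python toolchain is acceptable, pdfplumber exposes the same per-character content, font and position information:

```python
import pdfplumber

def extract_pdf_objects(path: str) -> list[dict]:
    """Traverse a PDF page by page and collect character content, font name/size and positions."""
    records = []
    with pdfplumber.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for ch in page.chars:  # one dict per character object
                records.append({
                    "page": page_no,
                    "text": ch["text"],
                    "font": ch["fontname"],
                    "size": ch["size"],
                    "bbox": (ch["x0"], ch["top"], ch["x1"], ch["bottom"]),
                })
    return records
```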
S1213, judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain the first extraction result.
In this embodiment, a SENet (squeeze-and-excitation network) is used to determine the content type; specific content parts such as the table of contents, titles, hierarchical headings, subtitles, headers, footers and body text in the PDF document are identified mainly according to feature information such as character form, font color, size, boldness, leading chapter symbols or chapter numbers, title lines, lines not filled to full width, and the absence of punctuation. In addition, research report documents released as DOC, DOCX, etc. are converted into PDF form with a Python tool and then extracted and processed in the above manner.
The key architectural element of the squeeze-and-excitation network is the SE block, which is critical to the performance of the SENet network. An SE block can be divided into three main parts: squeeze, excitation and reweighting (scaling). Its aim is to improve the quality of the representations produced by the network by explicitly modeling the interdependencies between the feature channels of a CNN convolutional network. 1) Squeeze: the SE block performs global average pooling over the output feature map of a CNN layer, in effect averaging over all activations in the spatial dimensions (H x W), giving one activation output per channel; this produces an aggregate feature map over the spatial dimensions, i.e. a channel descriptor, allowing information from the global receptive field of the network to be used by all of its layers. 2) Excitation: to fully capture the channel-wise correlations aggregated from the spatial map, a weight is generated for each feature channel by learned parameters that explicitly model the correlations between the feature channels, through a mechanism similar to the gates in an RNN recurrent neural network. 3) Reweight (scale): the normalized weights output by the excitation step are regarded as the importance of each feature channel after feature selection, and are multiplied channel by channel onto the previous features, thereby completing the recalibration of the original features in the channel dimension. In this embodiment, the SENet neural network is used to recognize the complex layout of PDF pages and determine information such as titles, body text, annotations, headers and footers.
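A minimal PyTorch sketch of the SE block described above (an illustrative re-implementation, not code from the patent; the reduction ratio of 16 is an assumption):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: squeeze (global average pool), excite (two FC layers), scale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling over H x W
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # reweight: recalibrate channels

# usage: recalibrate a feature map from a CNN layer
feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```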
In addition, the web-page-like structural features of the PDF document can be utilized, and tools such as XPath can be adopted for rule-based extraction and identification. Where the XPath rule mode cannot cover the extraction of regulatory content with complex table layouts, the PDF content is recognized in a complementary manner using tools such as Trecs, so as to complete high-quality recognition.
S122, extracting texts from the webpages and the tables in the original data to obtain a second extraction result.
In this embodiment, the second extraction result refers to useful data such as the titles and body text of the web pages and tables.
In one embodiment, referring to fig. 5, the step S122 may include steps S1221 to S1223.
S1221, performing rule-based extraction and identification on the specific source data in the web pages and tables in the original data by utilizing the web page structural features and tools such as XPath, so as to obtain a first identification result.
Specifically, the captured web page data includes useful data such as titles and body text, together with a large amount of noise data such as tags, styles, links, JS code and comments, and needs to be processed to obtain the useful data.
In this embodiment, the first identification result refers to useful data such as the title and body text of the specific source data.
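As an illustrative sketch (the page structure and XPath expressions are hypothetical), rule-based extraction for a known source can be written with lxml:

```python
from lxml import html

def extract_article(page_html: str) -> dict:
    """Rule-based extraction of title and body text from a page with a known, fixed structure."""
    tree = html.fromstring(page_html)
    title = tree.xpath('string(//h1[@class="article-title"])').strip()
    paragraphs = tree.xpath('//div[@class="article-body"]//p//text()')
    body = "\n".join(p.strip() for p in paragraphs if p.strip())
    return {"title": title, "body": body}
```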
S1222, identifying the webpage in the original data and the irregular webpage in the form by adopting a FreeDOM model so as to obtain a second identification result.
In this embodiment, the second identification result refers to useful data such as the title and body text of the irregularly structured web pages.
Specifically, the FreeDOM model mainly comprises three steps. In the first step, modeling and learning are mainly performed using the path information of the DOM tree of the web page markup language, and representation learning is carried out on the internal local information of each node tag of the web page DOM tree. This local information consists of three parts: the tag of the node (such as <a>, <li>, <td>), the tag text (including the preceding and following text), and the tag attribute values. These features are encoded with word embeddings, so as to obtain a vectorized representation of the web page DOM tree nodes. Based on the vectorized representation of a node, a Softmax classification model can be used to judge whether the node is a fragment containing text content, thereby obtaining the potential text-content nodes.
The second step mainly characterizes the node relationships in the web page DOM tree: it learns the structural dependency relationships among HTML web page nodes and characterizes, at the web page text view level, the content block distribution and auxiliary information among the nodes of the learned web page DOM tree. At the algorithm level, the current node is mainly used as the anchor, and the dependency relationships among nodes are modeled pair-wise; specifically, each node pair (Node-Pair) is encoded and the relationship types Value-Value, Value-None, None-Value and None-None are learned, so as to judge whether a given node pair is in a Value-Value relationship. The information representation of a node pair is also made up of three parts: the vectorized representation of the nodes themselves obtained in the previous step; the HTML tag path relationship of the nodes, mainly characterized as a tag path sequence using XPath; and the position encoding information of the nodes, mainly including information such as the width and height of the tags and their sequence number within the web page body. Finally, the three kinds of information are concatenated by vector splicing, and a fully connected classification model is adopted to judge the relationship type of the node pair.
Finally, after the above two stages of processing, information such as the node representations and node relationships of the web page DOM tree is obtained; the nodes are assembled into a node sequence according to the front-to-back order of the DOM tree, and a BiLSTM deep sequential semantic learning model is then adopted to predict and label the text content of the nodes, thereby achieving the final text extraction.
Because this step adopts a view-structured modeling approach, information such as the row/column blocks, individual contents and character styles of tags such as <tr>, <li> and <td> can be extracted to determine information such as table headers and table descriptions in the hypertext markup language (HTML); noise data such as HTML tags, font styles, picture links, structural layout styles, JS code, comments and hyperlinks can be effectively identified, so that text extraction from HTML web pages is completed better.
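The following is only an illustrative PyTorch sketch of the pair-wise node-relationship classifier described above; the four relationship classes follow the description, while the feature dimensions and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

RELATION_TYPES = ["Value-Value", "Value-None", "None-Value", "None-None"]

class NodePairClassifier(nn.Module):
    """Concatenate node embeddings, XPath-sequence embeddings and position features, then classify."""
    def __init__(self, node_dim=128, xpath_dim=64, pos_dim=8):
        super().__init__()
        in_dim = 2 * (node_dim + xpath_dim + pos_dim)  # two nodes per pair
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, len(RELATION_TYPES)),
        )

    def forward(self, node_a, node_b):
        # each argument is the spliced feature vector [node_repr | xpath_repr | position] of one node
        return self.classifier(torch.cat([node_a, node_b], dim=-1))

# usage with random features for two nodes of a DOM tree
a, b = torch.randn(1, 200), torch.randn(1, 200)
logits = NodePairClassifier()(a, b)
print(logits.softmax(dim=-1))
```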
S1223, combining the first identification result and the second identification result to obtain a second extraction result.
S123, extracting text from the audio and video data in the original data to obtain a third extraction result.
In this embodiment, the third extraction result refers to the text content in the audio/video files. The processing of audio and video data mainly identifies the text content in the files, and speech recognition can be carried out by combining the two open-source tools FFmpeg and DeepSpeech.
In one embodiment, referring to fig. 6, the step S123 may include steps S1231 to S1234.
S1231, converting the video data in the original data into an audio file by adopting FFmpeg.
Specifically, FFmpeg is a set of open-source computer programs that can be used to record and convert digital audio and video and turn them into streams, providing a complete solution for audio and video processing.
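A minimal sketch of invoking FFmpeg from Python to drop the video stream and produce mono audio; the 16 kHz sample rate and the file names are assumptions chosen to match DeepSpeech's usual input:

```python
import subprocess

def video_to_wav(video_path: str, wav_path: str) -> None:
    """Convert a video file to 16 kHz mono WAV audio with FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",            # drop the video stream
         "-ac", "1",       # mono
         "-ar", "16000",   # 16 kHz sample rate
         wav_path],
        check=True,
    )

video_to_wav("research_report.mp4", "research_report.wav")  # hypothetical file names
```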
S1232, performing speech-to-text recognition on the audio data in the original data and the converted audio files, and suppressing potential noise in the audio, so as to obtain processed audio files.
In this embodiment, a processed audio file refers to the result obtained by performing speech-to-text recognition preprocessing on the audio data and audio files and suppressing the potential noise in the audio.
S1233, segment cutting is carried out on the processed audio file so as to obtain a segmented audio file.
In this embodiment, a segmented audio file refers to an audio file no longer than 5 minutes.
The long audio files are segmented and cut into segments of no more than 5 minutes, mainly to satisfy the input requirements of the DeepSpeech tool/model.
S1234, performing speech-to-text recognition on the segmented audio files through DeepSpeech, and splicing the recognized text content to obtain the third extraction result.
In this embodiment, DeepSpeech uses an end-to-end LSTM-CTC (Connectionist Temporal Classification) model structure based on pre-trained model technology, and adopts the LSTM-CTC end-to-end method for acoustic modeling, which offers considerable advantages for Chinese speech recognition.
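An illustrative sketch using the open-source deepspeech Python package; the model and scorer file names are assumptions, and a Chinese acoustic model would be needed for the corpus described here:

```python
import wave
import numpy as np
from deepspeech import Model

def transcribe_segments(segment_paths: list[str]) -> str:
    """Run DeepSpeech on each audio segment and splice the recognized text together."""
    model = Model("deepspeech-model.pbmm")           # hypothetical acoustic model file
    model.enableExternalScorer("deepspeech.scorer")  # hypothetical language-model scorer
    texts = []
    for path in segment_paths:
        with wave.open(path, "rb") as w:
            audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        texts.append(model.stt(audio))
    return " ".join(texts)
```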
S124, extracting text data from the picture data in the original data to obtain a fourth extraction result.
In this embodiment, the fourth extraction result refers to text data in the picture.
Specifically, image-text recognition is performed on the picture data in the original data by adopting a deep convolutional recurrent neural network (CRNN), so as to obtain the fourth extraction result.
In the futures field there is a large amount of picture data. Content extraction from picture data mainly means recognizing the text data in the pictures, and the text is mainly recognized and processed using a deep convolutional recurrent neural network CRNN (Convolutional Recurrent Neural Network). The specific process is as follows:
Firstly, the picture is preprocessed, including picture graying, picture size scaling and normalization, picture tilt/rotation correction and text region localization, converting the picture into standardized input sub-images. For tilt/rotation correction, a Hough Transform straight-line detection algorithm is first used to detect and calculate the rotation angle of the picture, and the picture is then rotated back to the correct position according to that angle. Text region localization uses a multi-layer CNN convolutional network for feature extraction, determines the position information of the text blocks in the picture, and converts the picture into a series of text character regions. A text character region is mainly represented by 4 coordinate points, that is, a standard rectangular block; if a region is a parallelogram, it is converted into a standard rectangular block and the newly added part is filled with blank pixel information.
Next, character recognition is performed for each text character region. In this embodiment, final text recognition is performed using a CRNN network combined with a language model. CRNN is a currently popular text recognition network: CNN convolutional features are extracted from the picture along sliding windows, so as to obtain an abstract feature representation of the character patterns in each sliding window of the text rectangle, which is input into a BiLSTM (RNN) network for joint extraction of context features; sequential character classification and recognition is then performed on the feature information in each small sliding window (for Chinese, more than 4000 Chinese characters are classified and recognized). The BiLSTM can effectively learn the contextual relationships between character images and improve the accuracy of image-text recognition. In addition, the characters are predicted using vector representations that fuse information such as character form and word-combination semantics, which alleviates the insufficient capture of contextual information in the vanilla CRNN model and the recognition problems caused by picture wrinkles or blurring. After the characters are recognized, the center position of the sliding window is recorded, and the position in the original image where the characters are located is restored.
Finally, text splicing is performed according to the up/down/left/right relationships of the characters in each text character region block, and the text content information in the picture is finally recognized. Tabular information in the picture is identified according to information such as the row/column arrangement, coordinate positions and character types, and the potential characters are converted into tabular data.
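A compact PyTorch sketch of the CRNN idea (CNN feature extraction over sliding windows, BiLSTM context modeling, per-step character classification); the layer sizes and the CTC training objective are assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN backbone -> BiLSTM -> per-timestep character logits (trainable with nn.CTCLoss)."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((img_height // 4, 1), (img_height // 4, 1)),  # collapse height to 1
        )
        self.rnn = nn.LSTM(256, 128, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)  # 2 * 128 hidden units -> character classes

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)                   # (B, C, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', C): one feature per sliding window
        ctx, _ = self.rnn(feats)                   # BiLSTM context features
        return self.fc(ctx)                        # (B, W', num_classes) logits per timestep

# usage: grayscale text-line images; >4000 Chinese characters plus a CTC blank class
model = CRNN(num_classes=4001)
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # torch.Size([2, 32, 4001])
```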
The extraction results comprise a first extraction result, a second extraction result, a third extraction result and a fourth extraction result.
Through the above processing, the information content extraction of the various kinds of data of the futures industry chain is completed, and the various heterogeneous data are converted into unified text data.
S130, extracting futures industry chain entity relationships from the extraction result to obtain entities and entity-pair attribute relationships.
In this embodiment, entities refer to the entities in the extraction result, and entity-pair attribute relationships refer to the attribute relationships of the entity pairs in the extraction result.
Specifically, attribute relationship extraction is a core technology for constructing the knowledge graph, and futures industry chain entity relationship extraction includes two aspects: futures industry chain entity recognition and futures industry chain attribute relationship extraction. Because Chinese has complex grammar and sentence patterns, traditional neural network models extract limited features and have weak semantic characterization capability, which affects the performance of Chinese entity relationship extraction. This embodiment adopts a multi-feature-fusion entity relationship extraction method based on the BERT pre-trained model: first the corpora are preprocessed and used to train the entity recognition model; then, on the basis of the entity recognition model, keyword information and entity information are extracted, and the fused information is used for the entity relationship extraction model. The fused information can strengthen the semantic learning capability of the BERT model and greatly reduce the loss of semantic features; finally, relationship classification is performed by a Softmax classifier.
In one embodiment, referring to fig. 7, the step S130 may include steps S131 to S135.
S131, processing the extraction result to obtain a first sample set.
In this embodiment, the first sample set refers to the data used for training the futures industry chain entity recognition model.
The obtained raw corpus data of the futures-field industry chain needs to be further processed for recognition training. This mainly converts the data into the corpus format required by the BERT model and prepares high-quality training data; the preprocessing mainly includes data sorting and labeling, and external feature extraction and fusion. Corpus data sorting and labeling mainly produces the corpus format required for model training; the extraction and fusion of external features is the key point of the whole data preprocessing and is analyzed in detail below.
Firstly, the text corpus is segmented at the paragraph and sentence level. Because of the input-length limit of the BERT model, each prepared sentence segment does not exceed N characters (N generally takes the value 512); that is, for part of the corpus, the articles are uniformly segmented into several independent article fragments at the paragraph and sentence level. Then entity and word-segmentation labeling is performed on these article fragments; since the main entities in the futures industry chain include futures variety (Futures), plate (Plate), place (Loc), organization (Org), person (Per), index (Idx), behavior (Act) and specific event (Evt), the entity corpus is prepared in the BIO format.
In order to accelerate the preparation efficiency and accuracy of the pre-training data, the corpus is preliminarily labeled by combining dictionary matching with a general model, and then manual review, correction and alignment are carried out.
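A minimal sketch of the dictionary-matching pre-labeling step in BIO format; the example dictionary entries and label names are assumptions for illustration:

```python
ENTITY_DICT = {
    "铁矿石": "Futures",   # futures variety
    "螺纹钢": "Futures",
    "大连商品交易所": "Org",
}

def bio_prelabel(sentence: str) -> list[tuple[str, str]]:
    """Greedy dictionary matching that emits character-level BIO tags for later manual review."""
    tags = ["O"] * len(sentence)
    for surface, etype in sorted(ENTITY_DICT.items(), key=lambda kv: -len(kv[0])):
        start = sentence.find(surface)
        while start != -1:
            if all(t == "O" for t in tags[start:start + len(surface)]):
                tags[start] = f"B-{etype}"
                for i in range(start + 1, start + len(surface)):
                    tags[i] = f"I-{etype}"
            start = sentence.find(surface, start + 1)
    return list(zip(sentence, tags))

for ch, tag in bio_prelabel("铁矿石价格上涨带动螺纹钢走强"):
    print(ch, tag)
```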
S132, performing fine-tuning training for entity recognition on the BERT pre-trained base model from Hugging Face, using the PyTorch framework, with the first sample set, so as to obtain the futures industry chain entity recognition model.
In this embodiment, the futures industry chain entity identification model is a model for extracting futures industry chain entities.
Fine-tuning training for entity recognition is performed with the PyTorch framework using a BERT pre-trained base model from Hugging Face. Since entity recognition classifies text at the token level, a linear layer added on top of the BERT model acts as the token-level classifier.
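An illustrative fine-tuning sketch with the Hugging Face transformers library; the model name bert-base-chinese, the label subset and the hyper-parameters are assumptions:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["O", "B-Futures", "I-Futures", "B-Org", "I-Org"]  # subset for illustration

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(LABELS))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(text: str, char_labels: list[int]) -> float:
    """One fine-tuning step: BERT encodes the tokens, a linear head classifies each token."""
    enc = tokenizer(list(text), is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=512)
    # [CLS] and [SEP] positions are ignored in the loss via the -100 label
    labels = torch.tensor([[-100] + char_labels[:enc["input_ids"].shape[1] - 2] + [-100]])
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

print(training_step("铁矿石上涨", [1, 2, 2, 0, 0]))
```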
S133, preprocessing the extraction result to obtain a second sample set.
In this embodiment, the second sample set refers to data used to train the BERT pre-training model.
A futures industry chain entity recognition model based on the BERT pre-trained model has thus been obtained, which can effectively recognize, extract and analyze the entities in the text corpus. On this basis, the relationships between the entities in the text corpus need to be classified and recognized. This embodiment also adopts a BERT pre-trained model to realize the futures industry chain entity attribute relationship extraction task; in addition to the original text input, the model comprises 4 layers: a data preprocessing layer, a model training layer, a classification layer and a final output layer.
The corpus preprocessing layer mainly constructs a corpus feature set from the original text and converts the original text sequence into a corpus feature sequence; the features adopted in this embodiment mainly include keywords, entity types and entity-pair information, and the corpus feature extraction described below is performed. The model training layer mainly trains on the corpus feature sequence with the BERT pre-trained model to obtain the semantic word-vector feature representation of the corpus feature sequence. The classification layer splices and fuses the various semantic feature vectors according to the semantic word-vector feature representations, obtains the final vector representation through a fully connected layer, and finally performs relationship classification with Softmax. The final output layer performs visual result mapping output and other business processing according to the Softmax classification result. With this construction of the BERT attribute relationship discrimination model, the original text can be fed directly into the BERT pre-trained model without word-vector training in advance; the model automatically trains feature vectors with richer semantics, which are then used directly for relationship discrimination. The structure of the BERT attribute relationship discrimination model is shown in Fig. 8.
S134, training the BERT pre-training model by using the second sample set to obtain a BERT attribute relationship judging model.
Specifically, the data preprocessing layer is mainly used for preprocessing the original text corpus to construct the corpus feature set; in this embodiment the features mainly comprise keyword and entity-type information. Keywords are an explicit feature, namely words with high discriminative power. For the futures industry chain knowledge graph, the relationships mainly include upstream-downstream, supply-demand, substitution, co-reference and generic relationships, and the keywords used can be sorted into relation keywords per relationship type. Keywords can compensate for the insufficient capture of character-level training features by the BERT word model and strengthen the capture of semantic information; therefore, the more keywords are extracted, the higher the relation extraction performance of the whole model. Entity type is a shallow semantic feature; it exploits the restriction that a specific entity relationship places on entity types and more fully captures the specific semantic role information in entity relationship classification and recognition. For example, both sides of an upstream-downstream relationship in the futures industry chain must be constrained to the entity type "commodity"; for instance, the upstream-downstream entity pair iron ore and steel are both of type "commodity". During model training, the candidate entities in the corpus are replaced with the corresponding entity types, and the whole sentence is then input into the BERT model for learning, so as to extract the semantic features.
When the BERT model is trained, the extracted features are further spliced and fused, so that the model's output feature vector expresses as much semantic information as possible and the entity relationship extraction effect is further improved. The whole feature fusion process is shown in Fig. 9 and comprises two parts of feature fusion: the entity types and keywords, and the original text corpus. For keywords, they are ranked by weight information and roughly the top 15% are selected for corpus serialization and splicing, mainly because keywords with weak weights have little influence on correct discrimination of the whole relationship and, on the contrary, easily introduce noise. The Softmax relationship classification and discrimination layer takes the output of the BERT pre-trained network model and splices and fuses the vectors of the entity pair whose relationship is to be discriminated, thereby obtaining the final relationship category between the entity pair.
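An illustrative sketch of the fusion-and-classify step under assumed dimensions and relation labels (not the exact model of Fig. 8/9): the keyword features and the entity-type-substituted sentence are encoded by BERT, and the sentence vector is spliced with the two entity-span vectors before the Softmax layer:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

RELATIONS = ["upstream-downstream", "supply-demand", "substitution", "co-reference", "generic", "none"]

class BertRelationClassifier(nn.Module):
    def __init__(self, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        # fused vector = [CLS] representation + mean-pooled vectors of the two entity spans
        self.classifier = nn.Linear(hidden * 3, len(RELATIONS))

    def forward(self, input_ids, attention_mask, e1_mask, e2_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = out[:, 0]
        e1 = (out * e1_mask.unsqueeze(-1)).sum(1) / e1_mask.sum(1, keepdim=True)
        e2 = (out * e2_mask.unsqueeze(-1)).sum(1) / e2_mask.sum(1, keepdim=True)
        return self.classifier(torch.cat([cls, e1, e2], dim=-1))  # Softmax applied in the loss

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# keyword features spliced in front of the entity-type-substituted sentence (illustrative input)
enc = tokenizer("供应 需求 ， 商品 价格上涨 带动 商品 走强", return_tensors="pt")
e1_mask = torch.zeros_like(enc["input_ids"], dtype=torch.float); e1_mask[0, 6:8] = 1
e2_mask = torch.zeros_like(enc["input_ids"], dtype=torch.float); e2_mask[0, 14:16] = 1
logits = BertRelationClassifier()(enc["input_ids"], enc["attention_mask"], e1_mask, e2_mask)
print(logits.softmax(dim=-1))
```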
S135, determining the entities and entity-pair attribute relationships according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
Using the futures industry chain entity recognition model and the BERT attribute relationship discrimination model, the entities and entity-pair attribute relationships are determined for the data processed in step S120, for use in the subsequent graph construction.
S140, constructing a futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships.
Specifically, the entities and entity-pair attribute relationships are incrementally fused and consolidated into the existing knowledge graph to obtain the futures industry chain knowledge graph.
In this embodiment, the knowledge graph is mainly composed of entities, entity attributes and entity relationships. A preliminary graph, that is, the initial industry chain knowledge graph, can be obtained from the data extracted in the earlier stage in the form of triples; the knowledge data extracted later is added gradually and iteratively, so the newly extracted knowledge data and the preliminary graph need to be fused.
A large number of related entities and entity-pair attribute relationships of the futures industry chain are acquired, and the entities and entity-relationship/attribute pairs are incrementally fused and consolidated into the existing industry chain knowledge graph, finally constructing the futures industry chain knowledge graph.
The knowledge graph adopted in this embodiment can be divided, in its logical structure, into a schema layer and a data layer. The data layer mainly stores, as its units, the knowledge facts obtained by extraction, such as the entities, entity attributes and entity relationships. The facts are expressed as triples (entity 1, relation, entity 2) and (entity, attribute, attribute value), and a graph database can be selected as the storage medium, mainly using the open-source Neo4j knowledge graph storage framework. The schema layer is built on top of the data layer, and the series of fact expressions of the data layer is normalized mainly through the ontology library. The ontology is the conceptual template of the structured knowledge base; a knowledge base built on an ontology library has a strong hierarchical structure and little redundancy.
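A minimal sketch of writing the extracted triples into Neo4j with the official Python driver; the connection details, node label and relationship naming are assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed

def write_triple(tx, head: str, relation: str, tail: str):
    """MERGE keeps the write idempotent, so repeated extraction runs only add new knowledge."""
    tx.run(
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[r:REL {type: $relation}]->(t)",
        head=head, tail=tail, relation=relation,
    )

with driver.session() as session:
    session.execute_write(write_triple, "iron ore", "upstream-downstream", "steel")
driver.close()
```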
Because the newly mined entities and relationships may conflict with the entity relationships already in the existing knowledge graph library, the optimal consolidation and fusion method must be determined, namely: whether to update, keep the original, or submit for manual review. This embodiment adopts JAKET (Joint pre-training of knowledge graph and language understanding), a joint pre-training technique fusing language text understanding with the knowledge graph, for graph conflict detection and fusion determination.
Firstly, meta information such as entities, entity attributes and entity relationships in potential event pairs in the context is extracted according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model of the previous steps. Then, on the one hand, combined with the constructed futures industry chain graph, the nodes and attribute information of the meta information in the graph are determined, and the graph information is semantically vectorized with a GCN (Graph Convolution Network); on the other hand, language model technology such as a Transformer is adopted to vectorize the original context text, so as to obtain the semantic vector representation of the relationship between the events. Finally, the semantic vectorized representation from the graph is combined with that of the original text and fed into the language model LM2 to obtain the contextual semantic relationship of the entity relationship; classification then determines whether the newly identified entity relationship pair conflicts with the existing knowledge stock, together with a confidence score for the conflict. Fig. 10 shows the JAKET pre-training model flow for learning the knowledge fusion representation. In the figure, the text corpus mainly describes text about the influence of an event on natural gas, while the subgraph in the existing knowledge graph describes an example of an industry chain supply-demand relationship. JAKET performs joint enhanced pre-training of the two NLU tasks, text understanding and knowledge graph, on the co-occurring text corpus of the two events and the existing knowledge graph; a Softmax classification function is applied to the output of the JAKET model and fine-tuned for the conflict-type detection classification task, similar to a general classification task based on a BERT pre-trained language model. Through conflict-type detection and classification analysis, the supply relationship between region A and natural gas and the generic relationship between natural gas and energy are finally determined, and the knowledge is finally consolidated into the graph.
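The following is only an illustrative sketch of the final fusion-and-classify idea (combining a graph-side vector with a text-side vector and scoring conflict types); the dimensions, class names and the simple linear fusion are assumptions and do not reproduce the JAKET architecture:

```python
import torch
import torch.nn as nn

CONFLICT_TYPES = ["no-conflict", "update", "keep-original", "manual-review"]

class ConflictClassifier(nn.Module):
    """Fuse a graph-side embedding with a text-side embedding and score conflict types."""
    def __init__(self, graph_dim=128, text_dim=768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(graph_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, len(CONFLICT_TYPES)),
        )

    def forward(self, graph_vec, text_vec):
        logits = self.fuse(torch.cat([graph_vec, text_vec], dim=-1))
        return logits.softmax(dim=-1)  # confidence score per conflict type

# graph_vec would come from a GCN over the existing graph, text_vec from a Transformer encoder
scores = ConflictClassifier()(torch.randn(1, 128), torch.randn(1, 768))
print(dict(zip(CONFLICT_TYPES, scores.squeeze(0).tolist())))
```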
According to the futures industry chain knowledge graph construction method above, heterogeneous data content extraction, entity relationship extraction, and knowledge fusion and consolidation are performed on massive corpora of various data to construct the futures industry chain knowledge graph, so that the futures industry chain knowledge graph is extracted and constructed from multi-source heterogeneous data and systematic knowledge logic is formed.
Fig. 11 is a schematic block diagram of a futures industry chain knowledge graph construction apparatus 300 provided by an embodiment of the present invention. As shown in Fig. 11, the present invention further provides a futures industry chain knowledge graph construction apparatus 300 corresponding to the above futures industry chain knowledge graph construction method. The futures industry chain knowledge graph construction apparatus 300 includes units for performing the above futures industry chain knowledge graph construction method, and the apparatus may be configured in a server. Specifically, referring to Fig. 11, the futures industry chain knowledge graph construction apparatus 300 includes a data acquisition unit 301, an extraction unit 302, a relationship extraction unit 303 and a graph construction unit 304.
The data acquisition unit 301 is configured to acquire original data of the futures industry chain in various heterogeneous forms; the extraction unit 302 is configured to perform text conversion and extraction on the original data to obtain an extraction result; the relationship extraction unit 303 is configured to extract futures industry chain entity relationships from the extraction result, so as to obtain entities and entity-pair attribute relationships; and the graph construction unit 304 is configured to construct a futures industry chain knowledge graph according to the entities and the entity-pair attribute relationships.
In one embodiment, the extraction unit 302 includes a first extraction subunit, a second extraction subunit, a third extraction subunit, and a fourth extraction subunit.
The first extraction subunit is used for extracting text contents of the WORD document and the PDF document in the original data to obtain a first extraction result; the second extraction subunit is used for extracting texts from the webpages and the tables in the original data to obtain a second extraction result; the third extraction subunit is used for extracting the text from the audio and video data in the original data to obtain a third extraction result; a fourth extraction subunit, configured to extract text data from the picture data in the original data, so as to obtain a fourth extraction result; the extraction results comprise a first extraction result, a second extraction result, a third extraction result and a fourth extraction result.
In an embodiment, the first extraction subunit includes a conversion module, a text content extraction module, and a discrimination module.
The conversion module is used for converting the WORD documents in the original data into PDF documents; the text content extraction module is used for extracting text content information from the PDF documents in the original data and the PDF documents obtained by converting the WORD documents, so as to obtain first text content; and the discrimination module is used for judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain a first extraction result.
In an embodiment, the second extraction subunit includes an extraction identification module, a second identification module, and a combination identification module.
The extraction and identification module is used for performing rule-based extraction and identification on the specific source data in the web pages and tables in the original data by utilizing the web page structural features and tools such as XPath, so as to obtain a first identification result; the second identification module is used for identifying the irregularly structured web pages among the web pages and tables in the original data by adopting the FreeDOM model, so as to obtain a second identification result; and the combination identification module is used for combining the first identification result and the second identification result to obtain a second extraction result.
In an embodiment, the third extraction subunit comprises: the device comprises a conversion module, a word processing module, a cutting module and a splicing module.
The conversion module is used for converting the video data in the original data into audio files by adopting FFmpeg; the word processing module is used for performing speech-to-text recognition on the audio data in the original data and the converted audio files and suppressing potential noise in the audio, so as to obtain processed audio files; the cutting module is used for segmenting the processed audio files to obtain segmented audio files; and the splicing module is used for performing speech-to-text recognition on the segmented audio files through DeepSpeech and splicing the recognized text content, so as to obtain a third extraction result.
In an embodiment, the fourth extraction subunit is configured to perform image-text recognition on the picture data in the original data by using a deep convolutional recurrent neural network (CRNN) to obtain a fourth extraction result.
In an embodiment, the relation extraction unit 303 includes a first processing subunit, a first training subunit, a second processing subunit, and a determining subunit.
The first processing subunit is configured to process the extraction result to obtain a first sample set; the first training subunit is configured to perform fine-tuning training for entity recognition on a BERT pre-trained base model from Hugging Face, using the PyTorch framework, with the first sample set, so as to obtain a futures industry chain entity recognition model; the second processing subunit is configured to preprocess the extraction result to obtain a second sample set; the second training subunit is configured to train a BERT pre-trained model with the second sample set, so as to obtain a BERT attribute relationship discrimination model; and the determining subunit is configured to determine the entities and entity-pair attribute relationships according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
In an embodiment, the graph construction unit 304 is configured to incrementally fuse and consolidate the entities and entity-pair attribute relationships into the existing knowledge graph to obtain the futures industry chain knowledge graph.
It should be noted that, as those skilled in the art can clearly understand, the specific implementation process of the futures industry chain knowledge graph construction apparatus 300 and each unit may refer to the corresponding description in the foregoing method embodiments; for convenience and brevity of description, details are not repeated here.
The futures industry chain knowledge graph construction apparatus 300 described above may be implemented in the form of a computer program that can run on a computer device as shown in fig. 12.
Referring to fig. 12, fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, where the server may be a stand-alone server or may be a server cluster formed by a plurality of servers.
With reference to FIG. 12, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a futures industry chain knowledge graph construction method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a futures industry chain knowledge graph construction method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device 500 to which the solution is applied; a particular computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring original data of the futures industry chain in various heterogeneous forms; performing text conversion and extraction on the original data to obtain an extraction result; performing entity relationship extraction of the futures industry chain on the extraction result to obtain entities and entity-to-attribute relationships; and constructing a futures industry chain knowledge graph according to the entities and the entity-to-attribute relationships.
In one embodiment, when the step of performing the text conversion and extraction on the raw data to obtain the extraction result, the processor 502 specifically performs the following steps:
extracting text contents of the WORD document and the PDF document in the original data to obtain a first extraction result; extracting text from the web pages and the tables in the original data to obtain a second extraction result; extracting text from the audio and video data in the original data to obtain a third extraction result; extracting text data from the picture data in the original data to obtain a fourth extraction result;
the extraction results comprise a first extraction result, a second extraction result, a third extraction result and a fourth extraction result.
In an embodiment, when implementing the step of extracting text content from the WORD document and PDF document in the original data to obtain the first extraction result, the processor 502 specifically implements the following steps:
converting the WORD document in the original data into a PDF document; extracting text content information from the PDF document in the original data and the PDF document converted from the WORD document to obtain first text content; and judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain a first extraction result.
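The disclosure names only a squeeze-and-excitation (SE) network for the content-type judgment; a minimal PyTorch sketch of an SE block over a 1-D text feature map is given below, where the 1-D formulation and the reduction factor are assumptions:

```python
# Sketch of a squeeze-and-excitation block: channel statistics are "squeezed" by
# global pooling and "excited" by a small gating MLP that reweights each channel.
# The 1-D variant and reduction factor are illustrative assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, L)
        w = self.gate(self.pool(x).squeeze(-1))           # (B, C) channel weights
        return x * w.unsqueeze(-1)                         # reweighted features
```

In a content-type classifier, such blocks would typically sit between the convolutional layers of a text encoder, before a final pooling and classification head.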
In an embodiment, when the processor 502 performs the step of extracting text from the web pages and tables in the original data to obtain the second extraction result, the following steps are specifically implemented:
performing rule-based extraction and identification on data from specific sources among the web pages and tables in the original data, exploiting the structural features of the web pages and adopting tools such as XPath, so as to obtain a first recognition result; recognizing the irregular web pages among the web pages and tables in the original data by adopting a FreeDOM model, so as to obtain a second recognition result; and combining the first recognition result and the second recognition result to obtain a second extraction result.
In an embodiment, when implementing the step of extracting the text from the audio and video data in the original data to obtain the third extraction result, the processor 502 specifically implements the following steps:
converting the video data in the original data into audio files by adopting FFmpeg; performing speech-to-text recognition on the audio data in the original data and the converted audio files, and removing potential noise from the audio to obtain processed audio files; segmenting the processed audio files to obtain segmented audio files; and performing speech-to-text recognition on the segmented audio files through DeepSpeech, and splicing the recognized text contents to obtain a third extraction result.
In an embodiment, when the processor 502 performs the step of extracting text data from the picture data in the original data to obtain the fourth extraction result, the following steps are specifically implemented:
performing image-text recognition on the picture data in the original data by adopting a deep convolutional recurrent network (CRNN) model, so as to obtain a fourth extraction result.
In one embodiment, when implementing the step of performing entity relationship extraction of the futures industry chain on the extraction result to obtain the entities and the entity-to-attribute relationships, the processor 502 specifically implements the following steps:
processing the extraction result to obtain a first sample set; fine-tuning, on the first sample set, a BERT pre-trained base model from Hugging Face under the PyTorch framework for entity recognition, so as to obtain a futures industry chain entity recognition model; preprocessing the extraction result to obtain a second sample set; training the BERT pre-trained model with the second sample set to obtain a BERT attribute relationship discrimination model; and determining the entities and the entity-to-attribute relationships according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
In one embodiment, when implementing the step of constructing the futures industry chain knowledge graph according to the entities and the entity-to-attribute relationships, the processor 502 specifically implements the following steps:
incrementally fusing the entities and the entity-to-attribute relationships into the existing knowledge graph, so as to obtain the futures industry chain knowledge graph.
It should be appreciated that, in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program includes program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the method embodiments described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring original data of the futures industry chain in various heterogeneous forms; performing text conversion and extraction on the original data to obtain an extraction result; performing entity relationship extraction of the futures industry chain on the extraction result to obtain entities and entity-to-attribute relationships; and constructing a futures industry chain knowledge graph according to the entities and the entity-to-attribute relationships.
In one embodiment, when the processor executes the computer program to implement the step of textually converting and extracting the raw data to obtain an extraction result, the following steps are specifically implemented:
extracting text contents of the WORD document and the PDF document in the original data to obtain a first extraction result; extracting text from the web pages and the tables in the original data to obtain a second extraction result; extracting text from the audio and video data in the original data to obtain a third extraction result; extracting text data from the picture data in the original data to obtain a fourth extraction result;
The extraction results comprise a first extraction result, a second extraction result, a third extraction result and a fourth extraction result.
In one embodiment, when the processor executes the computer program to implement the text content extraction on the WORD document and the PDF document in the original data to obtain the first extraction result, the following steps are specifically implemented:
converting the WORD document in the original data into a PDF document; extracting text content information from the PDF document in the original data and the PDF document converted from the WORD document to obtain first text content; and judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain a first extraction result.
In one embodiment, when the processor executes the computer program to extract text from the web pages and tables in the original data to obtain the second extraction result, the following steps are specifically implemented:
performing rule-based extraction and identification on data from specific sources among the web pages and tables in the original data, exploiting the structural features of the web pages and adopting tools such as XPath, so as to obtain a first recognition result; recognizing the irregular web pages among the web pages and tables in the original data by adopting a FreeDOM model, so as to obtain a second recognition result; and combining the first recognition result and the second recognition result to obtain a second extraction result.
In an embodiment, when the processor executes the computer program to extract the text from the audio and video data in the original data to obtain the third extraction result, the following steps are specifically implemented:
converting the video data in the original data into audio files by adopting FFmpeg; performing speech-to-text recognition on the audio data in the original data and the converted audio files, and removing potential noise from the audio to obtain processed audio files; segmenting the processed audio files to obtain segmented audio files; and performing speech-to-text recognition on the segmented audio files through DeepSpeech, and splicing the recognized text contents to obtain a third extraction result.
In an embodiment, when the processor executes the computer program to extract text data from the picture data in the original data to obtain a fourth extraction result, the following steps are specifically implemented:
performing image-text recognition on the picture data in the original data by adopting a deep convolutional recurrent network (CRNN) model, so as to obtain a fourth extraction result.
In one embodiment, when the processor executes the computer program to implement the step of performing entity relationship extraction of the futures industry chain on the extraction result to obtain the entities and the entity-to-attribute relationships, the following steps are specifically implemented:
processing the extraction result to obtain a first sample set; fine-tuning, on the first sample set, a BERT pre-trained base model from Hugging Face under the PyTorch framework for entity recognition, so as to obtain a futures industry chain entity recognition model; preprocessing the extraction result to obtain a second sample set; training the BERT pre-trained model with the second sample set to obtain a BERT attribute relationship discrimination model; and determining the entities and the entity-to-attribute relationships according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
In one embodiment, when the processor executes the computer program to implement the step of constructing the futures industry chain knowledge graph according to the entities and the entity-to-attribute relationships, the following steps are specifically implemented:
incrementally fusing the entities and the entity-to-attribute relationships into the existing knowledge graph, so as to obtain the futures industry chain knowledge graph.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for constructing a futures industry chain knowledge graph, characterized by comprising:
acquiring original data of various heterogeneous forms of futures industry chains;
carrying out text conversion and extraction on the original data to obtain an extraction result;
extracting the entity relation of the futures industry chain by using the extraction result to obtain the entity and the entity-to-attribute relation;
and constructing a futures industry chain knowledge graph according to the entity and the entity-to-attribute relationship.
2. The futures industry chain knowledge graph construction method according to claim 1, wherein the text conversion and extraction are performed on the original data to obtain an extraction result, and the method comprises the following steps:
extracting text contents of the WORD document and the PDF document in the original data to obtain a first extraction result;
extracting text from the web pages and the tables in the original data to obtain a second extraction result;
extracting text from the audio and video data in the original data to obtain a third extraction result;
extracting text data from the picture data in the original data to obtain a fourth extraction result;
the extraction results comprise a first extraction result, a second extraction result, a third extraction result and a fourth extraction result.
3. The method for constructing a futures industry chain knowledge graph according to claim 2, wherein extracting text content from the WORD document and the PDF document in the original data to obtain a first extraction result comprises:
converting the WORD document in the original data into a PDF document;
extracting text content information from the PDF document in the original data and the PDF document obtained by converting the WORD document to obtain first text content;
and judging the content type of the first text content by adopting a squeeze-and-excitation network, so as to obtain a first extraction result.
4. The method for constructing the futures industry chain knowledge graph according to claim 2, wherein the extracting text from the web pages and tables in the original data to obtain the second extraction result comprises:
performing rule-based extraction and identification on data from specific sources among the web pages and tables in the original data, exploiting the structural features of the web pages and adopting tools such as XPath, so as to obtain a first recognition result;
recognizing the irregular web pages among the web pages and tables in the original data by adopting a FreeDOM model, so as to obtain a second recognition result;
And combining the first recognition result and the second recognition result to obtain a second extraction result.
5. The futures industry chain knowledge graph construction method of claim 2, wherein the extracting text from the audio and video data in the original data to obtain a third extraction result comprises:
converting the video data in the original data into an audio file by adopting FFmpeg;
performing speech-to-text recognition on the audio data in the original data and the converted audio file, and removing potential noise from the audio to obtain a processed audio file;
segmenting and cutting the processed audio file to obtain a segmented audio file;
and performing speech-to-text recognition on the segmented audio file through DeepSpeech, and splicing the recognized text contents to obtain a third extraction result.
6. The futures industry chain knowledge graph construction method of claim 2, wherein the extracting text data from the picture data in the original data to obtain a fourth extraction result comprises:
performing image-text recognition on the picture data in the original data by adopting a deep convolutional recurrent network (CRNN) model, so as to obtain a fourth extraction result.
7. The method for constructing a futures industry chain knowledge graph according to claim 2, wherein extracting the entity relationship of the futures industry chain by using the extraction result to obtain the entity and the entity-to-attribute relationship comprises:
processing the extraction result to obtain a first sample set;
fine-tuning, on the first sample set, a BERT pre-trained base model from Hugging Face under the PyTorch framework for entity recognition, so as to obtain a futures industry chain entity recognition model;
preprocessing the extraction result to obtain a second sample set;
training a BERT pre-training model by using the second sample set to obtain a BERT attribute relationship discrimination model;
and determining the entity and the entity-to-attribute relationship according to the futures industry chain entity recognition model and the BERT attribute relationship discrimination model.
8. The method for constructing a futures industry chain knowledge graph according to claim 1, wherein the constructing the futures industry chain knowledge graph according to the entity and the entity-to-attribute relationship comprises:
incrementally fusing the entity and the entity-to-attribute relationship into the existing knowledge graph to obtain the futures industry chain knowledge graph.
9. A futures industry chain knowledge graph construction apparatus, characterized by comprising:
the data acquisition unit is used for acquiring the original data of various heterogeneous forms of the futures industry chain;
the extraction unit is used for carrying out text conversion and extraction on the original data so as to obtain an extraction result;
the relation extraction unit is used for performing entity relationship extraction of the futures industry chain on the extraction result, so as to obtain the entity and the entity-to-attribute relationship;
and the map construction unit is used for constructing a futures industry chain knowledge graph according to the entity and the entity-to-attribute relationship.
10. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-8.
CN202310449192.3A 2023-04-24 2023-04-24 Method and device for constructing future industry chain knowledge graph and computer equipment Pending CN116955639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310449192.3A CN116955639A (en) 2023-04-24 2023-04-24 Method and device for constructing future industry chain knowledge graph and computer equipment

Publications (1)

Publication Number Publication Date
CN116955639A true CN116955639A (en) 2023-10-27

Family

ID=88448156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310449192.3A Pending CN116955639A (en) 2023-04-24 2023-04-24 Method and device for constructing future industry chain knowledge graph and computer equipment

Country Status (1)

Country Link
CN (1) CN116955639A (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156365A (en) * 2016-08-03 2016-11-23 北京智能管家科技有限公司 A kind of generation method and device of knowledge mapping
CN109255034A (en) * 2018-08-08 2019-01-22 数据地平线(广州)科技有限公司 A kind of domain knowledge map construction method based on industrial chain
CN109446341A (en) * 2018-10-23 2019-03-08 国家电网公司 The construction method and device of knowledge mapping
CN110489395A (en) * 2019-07-27 2019-11-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatically the method for multi-source heterogeneous data knowledge is obtained
CN112765485A (en) * 2021-01-18 2021-05-07 深圳市网联安瑞网络科技有限公司 Network social event prediction method, system, terminal, computer device and medium
CN112860908A (en) * 2021-01-27 2021-05-28 云南电网有限责任公司电力科学研究院 Knowledge graph automatic construction method based on multi-source heterogeneous power equipment data
CN112949312A (en) * 2021-03-26 2021-06-11 中国美术学院 Product knowledge fusion method and system
CN113139068A (en) * 2021-05-10 2021-07-20 内蒙古工业大学 Knowledge graph construction method and device, electronic equipment and storage medium
CN113157940A (en) * 2021-03-29 2021-07-23 上海融盈数据科技有限公司 Knowledge fusion method of industry chain knowledge graph
CN113220951A (en) * 2021-05-17 2021-08-06 鞠悦 Medical clinical support method and system based on intelligent content
CN113946684A (en) * 2021-09-16 2022-01-18 国网四川省电力公司 Electric power capital construction knowledge graph construction method
CN114461744A (en) * 2020-11-09 2022-05-10 中移(上海)信息通信科技有限公司 Knowledge graph construction method and device, electronic equipment and computer storage medium
CN114817481A (en) * 2022-06-08 2022-07-29 中星智慧云企(山东)科技有限责任公司 Big data-based intelligent supply chain visualization method and device
CN114896423A (en) * 2022-06-17 2022-08-12 哈尔滨工业大学 Construction method and system of enterprise basic information knowledge graph
CN115269877A (en) * 2022-08-10 2022-11-01 深圳市网联安瑞网络科技有限公司 Method, system and equipment for constructing domain entity and event double-center knowledge graph
CN115358201A (en) * 2022-08-03 2022-11-18 浙商期货有限公司 Processing method and system for delivery and research report in futures field
CN115994230A (en) * 2022-12-29 2023-04-21 南京烽火星空通信发展有限公司 Intelligent archive construction method integrating artificial intelligence and knowledge graph technology


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination