CN115391569B

CN115391569B - Method for automatically constructing industry chain map from research report and related equipment

Info

Publication number: CN115391569B
Application number: CN202211325252.2A
Authority: CN
Inventors: 陈清财; 杨新兰; 李东方
Original assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Current assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-03-24
Anticipated expiration: 2042-10-27
Also published as: CN115391569A

Abstract

The invention discloses a method for automatically constructing an industry chain map from research reports, and related equipment. The method comprises the following steps: loading a research-report-oriented industry chain map schema; acquiring an original research report set and preprocessing each original research report in the set to obtain a target text; simultaneously extracting target triples and target independent entities from a sentence sequence by using an entity relationship synchronous extraction model, each target triple being a target first triple or an initial second triple; extracting target attribute pairs from sentence sequences containing index descriptions by using an index attribute extraction model; matching and aligning the obtained one or more target attribute pairs with the initial second triple to obtain a target second triple; and adding the target first triple and the target second triple to the target industry chain map. The method can effectively meet the need to automatically construct large-scale industry chain maps from research reports in complex situations, and reduces labor and time costs.

Description

Method for automatically constructing industry chain map from research report and related equipment

Technical Field

The invention relates to the technical field of text processing, and in particular to a method for automatically constructing an industry chain map from research reports and to related equipment.

Background

With the development of financial technology and the continuous expansion of the global capital market, the financial field generates a large amount of industry information every day, and this data contains abundant valuable information. A knowledge graph describes and stores the knowledge contained in data in a structured form. It can express information from the Internet in a form closer to human cognition, has a strong capacity for organizing, managing and understanding massive information, supports association mining and reasoning over the graph, and is widely used in academia and industry. An industry chain map is built on industry chain data at the level of subdivided industry products. It can better describe upstream and downstream relationships, product hierarchy relationships, the main-business relationships between companies and products, and the relationships between economic indexes and companies, products and industries. An industry chain map can provide clients with accurate and timely solutions, helps relevant personnel capture the internal dynamics of an industry, and brings economic benefits to enterprises. By reasoning along the industry chain map, potential risks and investment opportunities can be discovered, which assists people in making intelligent investment decisions and thereby supports practical financial business scenarios such as investment, risk control and marketing services.

However, the financial field still lacks a large-scale, open-source panoramic knowledge graph of industry chains. Research that systematically describes how to construct industry chain maps is relatively scarce, and most existing studies fail to pay effective attention to the complex index attributes of relationships. Traditional key information extraction based on manual work cannot meet the need to process massive information quickly, and it is costly in labor and time.

Thus, there is a need for improvements and enhancements in the art.

Disclosure of Invention

In view of the defects in the prior art, a method for automatically constructing an industry chain map from research reports and related equipment are provided. They aim to solve the problems that the prior art offers few systematic methods for automatically constructing an industry chain map and cannot pay effective attention to the complex index attributes of relationships.

In a first aspect of the present invention, there is provided a method for automatically constructing an industry chain map from a research report, comprising:

loading a research-report-oriented industry chain map schema containing target entity types, target relationship types and target attribute types, wherein the schema predefines the entity type information to be extracted and the triple type information to be extracted; each triple is a first triple or a second triple, and both have the structure head entity type-relationship type-tail entity type; in a second triple, the relationship type further has at least one corresponding attribute pair; an attribute pair is a simple attribute pair or a complex attribute pair; a simple attribute pair comprises a first attribute and a second attribute, the first attribute being the name of the attribute pair and the second attribute being its value; a complex attribute pair comprises a first attribute and a plurality of second attributes, the second attributes comprising the value of the complex attribute pair and at least one constraint on it;

acquiring an original research report set, and preprocessing each original research report in the set to obtain a target text, wherein the target text consists of a non-empty sentence sequence;

simultaneously extracting a target triple and a target independent entity in the sentence sequence by adopting an entity relation synchronous extraction model, wherein the target triple is a target first triple or an initial second triple;

extracting a target attribute pair in a sentence sequence containing index description by adopting an index attribute extraction model, wherein the target attribute pair comprises a target first attribute and a target second attribute;

for a sentence sequence containing attribute pairs, matching and aligning the obtained one or more target attribute pairs with the initial second triple to obtain a target second triple, wherein the target second triple contains the initial second triple and one or more target attribute pairs corresponding to the initial second triple;

adding the target first triple and the target second triple to a target industry chain graph.

The method for automatically constructing an industry chain map from research reports, wherein the preprocessing of each original research report in the original research report set comprises:

performing text recognition on the original research report by an optical character recognition technology to obtain a first text that is convenient to read and write;

performing text cleaning on the first text and removing noise characters from it to obtain a second text, wherein the noise characters are characters that contribute nothing to the actual content of the text;

and performing sentence segmentation on the second text, dividing it into a non-empty sentence sequence to obtain the target text.

The method for automatically constructing an industry chain map from research reports, wherein the entity relationship synchronous extraction model comprises a sentence sequence coding module, a subtask feature selection module and a subtask target information prediction module;

the sentence sequence coding module encodes the sentence sequence to obtain a target vector, using a general pre-trained model that is based on a training set and a validation set annotated with entity and relationship information;

the subtask feature selection module is used for acquiring feature information corresponding to an entity extraction subtask and a relation prediction subtask respectively, and the entity extraction subtask is used for extracting a target entity fragment in the sentence sequence according to the target vector;

the subtask target information prediction module judges whether the type of the target entity fragment belongs to the target entity type or not based on the characteristic information of the entity extraction subtask, if so, the target entity fragment is reserved, and if not, the target entity fragment is discarded;

the subtask target information prediction module is also used for judging the relationship between the entity pairs based on the characteristic information of the relationship prediction subtask to obtain the characteristic representation of the target relationship, judging whether the type of the target relationship belongs to the target relationship type or not according to the characteristic representation of the target relationship, if so, keeping the target relationship, and if not, discarding the target relationship;

and obtaining a target triple from the target entity fragments and the target relationship corresponding to them, wherein a target entity fragment without a corresponding relationship is a target independent entity.

The method for automatically constructing an industry chain map from research reports, wherein the extracting of the target attribute pair from the sentence sequence containing the index description comprises the following steps:

judging whether the sentence sequence contains indexes, if so, extracting a target attribute pair in the sentence sequence by adopting the index attribute extraction model;

the target attribute pair is a simple attribute pair or a complex attribute pair.

The method for automatically constructing an industry chain atlas from a research report, wherein the matching and aligning the obtained one or more target attribute pairs with the initial second triplet includes:

matching and aligning the target second attributes in the obtained target attribute pair with the corresponding initial second triple, wherein some of the target second attributes are aligned with the relationship of the corresponding initial second triple, and the values of the remaining target second attributes are matched and aligned with the head entity or the tail entity of the triple, so that a target second triple is obtained, the target second triple comprising the initial second triple and the attribute information corresponding to it.
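The matching and alignment step can be pictured with a small sketch. It is only one possible reading of the paragraph above, not the patented procedure: the names `align_attributes` and `relation_attrs`, the tuple layout of an attribute pair, and the rule that a constraint counts as an entity mention when it equals an entity of some triple in the same sentence are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

TripleT = Tuple[str, str, str]                 # (head entity, relation, tail entity)
AttrPair = Tuple[str, str, Tuple[str, ...]]    # (name, value, constraints)


def align_attributes(
    initial_triples: List[TripleT],
    attr_pairs: List[AttrPair],
    relation_attrs: Dict[str, Set[str]],
) -> List[Tuple[TripleT, List[AttrPair]]]:
    """Attach each attribute pair to the initial second triples it matches."""
    # Assumption: a constraint is treated as an entity mention when it equals
    # the head or tail of any triple extracted from the same sentence.
    sentence_entities = {e for h, _, t in initial_triples for e in (h, t)}
    aligned = []
    for head, rel, tail in initial_triples:
        attached = []
        for name, value, constraints in attr_pairs:
            # Relation alignment: the index must be defined for this relation.
            if name not in relation_attrs.get(rel, set()):
                continue
            # Entity alignment: entity-like constraints must name head or tail.
            mentioned = [c for c in constraints if c in sentence_entities]
            if mentioned and not any(c in (head, tail) for c in mentioned):
                continue
            attached.append((name, value, constraints))
        aligned.append(((head, rel, tail), attached))
    return aligned


# Hypothetical sentence-level inputs, for illustration only.
triples = [("Company A", "produces", "chips"), ("Company B", "produces", "batteries")]
pairs = [
    ("output", "1 million units", ("2021", "Company A")),
    ("output", "0.5 million units", ("Company B",)),
]
result = align_attributes(triples, pairs, {"produces": {"output"}})
```

In this sketch each initial second triple ends up with the attribute pairs whose entity-like constraint names its head or tail; a triple that receives an empty list has no aligned index in the sentence.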

The method for automatically constructing an industry chain map from research reports, wherein the list of the target entity types is dynamically adjusted according to the target text and the requirements of the target task scenario;

the list of the target relation types is dynamically adjusted according to the target entity types and the target texts;

and the list of the target attribute types is dynamically adjusted according to the target attribute types and the target text.

In a second aspect of the present invention, there is provided an apparatus for automatically constructing an industry chain map from a research report, comprising:

an industry chain map schema loading module, configured to load a research-report-oriented industry chain map schema containing target entity types, target relationship types and target attribute types, wherein the schema predefines the entity type information to be extracted and the triple type information to be extracted; each triple is a first triple or a second triple, and both have the structure head entity type-relationship type-tail entity type; in a second triple, the relationship type further has at least one corresponding attribute pair; an attribute pair is a simple attribute pair or a complex attribute pair; a simple attribute pair comprises a first attribute and a second attribute, the first attribute being the name of the attribute pair and the second attribute being its value; a complex attribute pair comprises a first attribute and a plurality of second attributes, the second attributes comprising the value of the complex attribute pair and at least one constraint on it;

a target text acquisition module, configured to acquire an original research report set and preprocess each original research report in the set to obtain a target text, wherein the target text consists of a non-empty sentence sequence;

the entity relationship synchronous extraction module is used for simultaneously extracting a target triple and a target independent entity in the sentence sequence by adopting an entity relationship synchronous extraction model, wherein the target triple is a target first triple or an initial second triple;

the index attribute extraction module is configured to extract, by using an index attribute extraction model, a target attribute pair from a sentence sequence containing an index description, wherein the target attribute pair comprises a target first attribute and a target second attribute;

the attribute-relationship alignment module is configured to, for a sentence sequence including an attribute pair, perform matching alignment on the obtained one or more target attribute pairs and the initial second triple to obtain a target second triple, where the target second triple includes the initial second triple and one or more target attribute pairs corresponding to the initial second triple;

a target industry chain map obtaining module, configured to add the target first triple and the target second triple to a target industry chain map.

In a third aspect of the present invention, a terminal is provided, which comprises a processor and a storage medium communicatively connected to the processor, wherein the storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the storage medium to execute the steps of the above method for automatically constructing an industry chain map from research reports.

In a fourth aspect of the present invention, there is provided a storage medium storing one or more programs executable by one or more processors to implement the steps of any one of the above methods for automatically constructing an industry chain graph from a research report.

Beneficial effects: the method loads a research-report-oriented industry chain map schema containing target entity types, target relationship types and target attribute types, and then preprocesses each original research report in the original research report set to obtain a target text. An entity relationship synchronous extraction model simultaneously extracts target triples and target independent entities from the sentence sequence, each target triple being a target first triple or an initial second triple. An index attribute extraction model extracts target attribute pairs, each comprising a target first attribute and a target second attribute, from sentence sequences containing index descriptions. For a sentence sequence containing attribute pairs, the obtained one or more target attribute pairs are matched and aligned with the initial second triple to obtain a target second triple, which contains the initial second triple and its corresponding one or more target attribute pairs. The target first triple and the target second triple are then added to the target industry chain map. The method can thus meet the need to automatically construct a large-scale industry chain map from research reports in complex situations, pays effective attention to the complex index attributes of relationships, meets the extraction requirements for entities, relationships and related attributes with a more accurate and efficient model, and reduces labor and time costs.

Drawings

FIG. 1 is a flow chart of an embodiment of a method for automatically constructing an industry chain graph from a research report provided by the present invention;

FIG. 2 is a flowchart illustrating a preprocessing process of an original research report according to an embodiment of the method for automatically constructing an industry chain map from the research report of the present invention;

FIG. 3 is a schematic structural diagram of an entity relationship synchronous extraction model in an embodiment of a method for automatically constructing an industry chain graph from a research report according to the present invention;

FIG. 4 is a flowchart illustrating the extraction of index attributes according to an embodiment of the method for automatically constructing an industry chain graph from a research report;

FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for automatically constructing an industry chain map from a research report according to the present invention;

FIG. 6 is a schematic structural diagram of an embodiment of a terminal provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The method for automatically constructing an industry chain map from research reports provided by the invention can be applied to a terminal with computing capability. By executing the method, the terminal extracts the target first triples and target second triples from the original research report set and constructs the target industry chain map.

Example one

In this embodiment, a method for automatically constructing an industry chain map from research reports is provided. As shown in FIG. 1, the method comprises the following steps:

S100, loading a research-report-oriented industry chain map schema containing target entity types, target relationship types and target attribute types, wherein the schema predefines the entity type information to be extracted and the triple type information to be extracted; each triple is a first triple or a second triple, and both have the structure head entity type-relationship type-tail entity type; in a second triple, the relationship type further has at least one corresponding attribute pair; an attribute pair is a simple attribute pair or a complex attribute pair; a simple attribute pair comprises a first attribute and a second attribute, the first attribute being the name of the attribute pair and the second attribute being its value; a complex attribute pair comprises a first attribute and a plurality of second attributes, the second attributes comprising the value of the complex attribute pair and at least one constraint on it.

Specifically, before the research reports are processed, a research-report-oriented industry chain map schema containing the entity, relationship and attribute types and their definitions must first be loaded. Besides simple definitions of relationships between entities, the schema covers overlapping triples, and it also defines the relationship attributes needed to describe the large amount of index data in research report text.

Loading the research-report-oriented industry chain map schema containing target entity types, target relationship types and target attribute types comprises the following:

The predefined target entity types are loaded according to the requirements of the target task scenario and an analysis of the research report content. They include, but are not limited to, companies, persons, brands, products, industries, regions, services and risk events. The list of target entity types is dynamically adjusted according to the target text and the requirements of the target task scenario.

The predefined target relationship types between the target entity types are loaded according to the requirements of the target task scenario. They include, but are not limited to, upstream and downstream relationships among industries and businesses, and production and sales relationships between companies and products. The list of target relationship types is dynamically adjusted according to the target entity types and the target text.

The predefined target attribute types are loaded according to the requirements of the target task scenario, and the list of target attribute types is dynamically adjusted according to the target attribute types and the target text. A target attribute refers to index data in the research report text, which describes attributes shared by the entities and the relationship between them. Each target attribute is divided into a first attribute and a second attribute: the first attribute is the specific name of the index corresponding to the second triple, and the second attribute is the value of that index together with any other descriptions that constrain it. Because the first attribute contains only the index name, dividing the target attribute into a first attribute and a second attribute makes it possible to describe sentences that contain several indexes.

The industry chain map schema further predefines the entity type information to be extracted and the triple type information to be extracted. Each triple is a first triple or a second triple, and both are relationship triples with the structure "head entity type-relationship type-tail entity type". The first triple is a simple triple and contains no attribute information. The second triple is a complex triple, in which the relationship type further has at least one corresponding attribute pair. An attribute pair is a simple attribute pair or a complex attribute pair. A simple attribute pair comprises a first attribute and a second attribute, the first attribute being the name of the attribute pair and the second attribute being its value. A complex attribute pair comprises a first attribute and a plurality of second attributes, the second attributes comprising the value of the complex attribute pair and at least one constraint on it.
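The schema described in step S100 can be sketched as a few data classes. This is a minimal illustration under stated assumptions, not the format used by the invention: the class names `AttributePair`, `Triple` and `Schema`, the relation names `upstream_of` and `produces`, and the attribute name `market_share` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class AttributePair:
    name: str                          # first attribute: the index name
    value: str                         # second attribute: the index value
    constraints: Tuple[str, ...] = ()  # further second attributes constraining the index

    @property
    def is_complex(self) -> bool:
        # A complex attribute pair carries at least one constraint besides its value.
        return len(self.constraints) > 0


@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    attributes: List[AttributePair] = field(default_factory=list)

    @property
    def kind(self) -> str:
        # First triple: no attribute information; second triple: at least one pair.
        return "second" if self.attributes else "first"


@dataclass(frozen=True)
class Schema:
    entity_types: FrozenSet[str]
    # Allowed (head entity type, relationship type, tail entity type) patterns.
    triple_types: FrozenSet[Tuple[str, str, str]]
    attribute_types: FrozenSet[str]

    def allows(self, head_type: str, relation: str, tail_type: str) -> bool:
        return (head_type, relation, tail_type) in self.triple_types


# Hypothetical schema content, for illustration only.
SCHEMA = Schema(
    entity_types=frozenset({"company", "product", "industry"}),
    triple_types=frozenset({
        ("industry", "upstream_of", "industry"),
        ("company", "produces", "product"),
    }),
    attribute_types=frozenset({"market_share"}),
)

plain = Triple("Company A", "produces", "Product B")
rich = Triple("Company A", "produces", "Product B",
              [AttributePair("market_share", "30%", ("2021",))])
```

Keeping the index name separate from its value and constraints lets one relationship carry several indexes, which is the situation the text describes for sentences containing multiple indexes.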

After the industry chain map schema is loaded, an original research report set is acquired and preprocessed.

S200, acquiring an original research report set, and preprocessing each original research report in the set to obtain a target text, wherein the target text consists of a non-empty sentence sequence.

Referring to fig. 2, the preprocessing each original research report in the original research report set includes:

S210, performing text recognition on the original research report through an optical character recognition technology to obtain a first text that is convenient to read and write.

In this embodiment, the original research report is an industry research report document, which is converted into the first text by Optical Character Recognition (OCR).

S220, performing text cleaning on the first text and removing noise characters from it to obtain a second text, wherein the noise characters are characters that contribute nothing to the actual content of the text.

Further, during text cleaning, redundant spaces, special identifiers and runs of more than 6 consecutive solid dots are uniformly removed from the first text to obtain the second text.
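The cleaning rule can be sketched with a few regular expressions. This is an illustrative sketch only: the text does not list the "special identifiers" or say which characters count as solid dots, so the character classes below (control characters, the Unicode replacement character, and ".", "·", "•") are assumptions, as is the function name `clean_text`.

```python
import re

# Assumption: "special identifiers" are approximated by control characters
# and the Unicode replacement character; the source does not enumerate them.
_SPECIAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]")
# Runs of more than 6 consecutive dots, e.g. table-of-contents leaders.
# Assumption: ".", "·" and "•" all count as solid dots.
_DOT_RUN = re.compile(r"[.·•]{7,}")
# Two or more consecutive spaces, tabs or full-width spaces.
_SPACES = re.compile(r"[ \t\u3000]{2,}")


def clean_text(first_text: str) -> str:
    """Return the second text: the first text with noise characters removed."""
    text = _SPECIAL.sub("", first_text)
    text = _DOT_RUN.sub(" ", text)   # replace the run by one space so words stay apart
    text = _SPACES.sub(" ", text)    # collapse redundant whitespace
    return text.strip()
```

A run of exactly 6 dots is left untouched here, matching the "more than 6" wording of the text.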

S230, performing sentence segmentation on the second text, dividing it into a non-empty sentence sequence to obtain the target text.

The principle of sentence segmentation is to ensure, as far as possible, that entities contained in a sentence are not split. First, the second text is divided into a sequence of sentences at common sentence separators, including but not limited to "。", "！", "……" and "；". Any sentence that still exceeds 512 characters after this division is segmented a second time at further separators, again following the segmentation principle, to obtain the target text.
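The two-stage segmentation can be sketched as follows. The primary separators and the 512-character threshold come from the text; the secondary separator is not preserved in the text, so the full-width comma used here is an assumption, as are the names `segment`, `PRIMARY` and `SECONDARY`. Splitting only at punctuation, never at a fixed offset, is one way to respect the principle that entities should not be cut.

```python
import re
from typing import List

MAX_LEN = 512  # length threshold stated in the text

# Lazily match up to and including a primary separator (or the end of text).
PRIMARY = re.compile(r".+?(?:……|[。！；]|$)", re.S)
# Assumption: the full-width comma stands in for the unspecified secondary separator.
SECONDARY = re.compile(r".+?(?:，|$)", re.S)


def segment(second_text: str, max_len: int = MAX_LEN) -> List[str]:
    """Split the cleaned text into a non-empty sentence sequence."""
    sentences: List[str] = []
    for sent in PRIMARY.findall(second_text):
        sent = sent.strip()
        if not sent:
            continue
        if len(sent) <= max_len:
            sentences.append(sent)
            continue
        # Secondary segmentation: pack comma-delimited pieces into chunks
        # of at most max_len characters, never cutting inside a piece.
        chunk = ""
        for piece in SECONDARY.findall(sent):
            if chunk and len(chunk) + len(piece) > max_len:
                sentences.append(chunk)
                chunk = ""
            chunk += piece
        if chunk:
            sentences.append(chunk)
    return sentences
```

In this sketch a single comma-free piece longer than `max_len` is kept whole rather than cut, which favors entity integrity over the length limit; a real system would need its own policy for that case.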

The method for automatically constructing an industry chain map from research reports further comprises the following steps:

S300, simultaneously extracting a target triple and a target independent entity in the sentence sequence by adopting an entity relationship synchronous extraction model, wherein the target triple is a target first triple or an initial second triple.

Referring to fig. 3, the entity relationship synchronous extraction model includes a sentence sequence encoding module, a subtask feature selection module and a subtask target information prediction module;

S310, the sentence sequence coding module encodes the sentence sequence to obtain a target vector, using a general pre-trained model that is based on a training set and a validation set annotated with entity and relationship information.

Specifically, a training set and a validation set are manually annotated and fed into a general pre-trained model, which is fine-tuned on them to obtain a sentence sequence coding model suited to the method for automatically constructing an industry chain map from research reports. The sentence sequences in the target text are then encoded with the fine-tuned model to obtain the target vector.

S320, the subtask feature selection module acquires the feature information corresponding to an entity extraction subtask and a relationship prediction subtask respectively, wherein the entity extraction subtask extracts target entity fragments from the sentence sequence according to the target vector.

From the obtained target vector, the feature information specific to the entity extraction subtask and to the relationship prediction subtask is captured, and the feature information shared between the two subtasks is calculated, thereby partitioning the features by task. The entity extraction subtask extracts target entity fragments from the sentence sequence according to the target vector.

The shared feature information and the subtask-specific feature information are then recombined to obtain new feature information for each subtask, which promotes bidirectional information interaction between the subtasks and avoids interference from redundant features.

This feature selection and recombination mechanism promotes bidirectional information interaction between the entity extraction subtask and the relationship prediction subtask and mitigates the loss of precision and efficiency caused by error propagation and redundant computation. It also copes effectively with nested entities and with the complex extraction scenarios of single-entity overlap and entity-pair overlap in the triple overlap problem.
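The text gives no formulas for feature selection and recombination, so the following is only a toy sketch of one way such a mechanism could work: two gates per feature dimension split the target vector into an entity-specific part, a relationship-specific part and a shared part, and each subtask then receives its own part plus the shared part. The gating scheme, the function names and the idea of passing gate logits in directly (rather than computing them with learned layers) are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def partition_features(
    h: Sequence[float],
    entity_logits: Sequence[float],
    relation_logits: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Split h into entity-specific, relation-specific and shared parts."""
    g_e = [_sigmoid(x) for x in entity_logits]    # how much each dim serves entities
    g_r = [_sigmoid(x) for x in relation_logits]  # how much each dim serves relations
    ent_only = [e * (1.0 - r) * v for e, r, v in zip(g_e, g_r, h)]
    rel_only = [(1.0 - e) * r * v for e, r, v in zip(g_e, g_r, h)]
    shared = [e * r * v for e, r, v in zip(g_e, g_r, h)]
    # Dimensions wanted by neither gate fall into no partition and are dropped.
    return ent_only, rel_only, shared


def recombine(
    ent_only: Sequence[float],
    rel_only: Sequence[float],
    shared: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Give each subtask its specific features plus the shared ones."""
    entity_feat = [a + s for a, s in zip(ent_only, shared)]
    relation_feat = [b + s for b, s in zip(rel_only, shared)]
    return entity_feat, relation_feat


# Toy input: dim 0 is entity-specific, dim 1 relation-specific, dim 2 shared.
parts = partition_features([1.0, 2.0, 3.0], [50, -50, 50], [-50, 50, 50])
entity_feat, relation_feat = recombine(*parts)
```

In a trained model the gate logits would come from learned layers over the encoder output; they are fixed by hand here only to make the partition visible.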

S330, the subtask target information prediction module judges whether the type of the target entity fragment belongs to the target entity type based on the characteristic information of the entity extraction subtask, if so, the target entity fragment is reserved, and if not, the target entity fragment is discarded.

Specifically, based on the feature information of the entity extraction subtask, the subtask target information prediction module extracts a target entity fragment in the sentence sequence according to the target vector. The character-level features of the start position and the end position of the fragment are concatenated with the sentence-level features to obtain the target entity fragment and its feature representation, and whether the target entity fragment is an entity of type k is predicted from this feature representation, where k ranges over the full set of target entity types predefined in the industry chain map schema of this embodiment.
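As a rough illustration of step S330, the sketch below concatenates the start-character, end-character and sentence-level features of a candidate fragment, scores the result against each entity type, and keeps the fragment only if the predicted type is one of the target entity types. The linear dot-product scorer, the threshold, the toy feature values and the function names are assumptions; the description only states that the type is predicted from the concatenated representation.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def span_features(char_feats: Sequence[Sequence[float]], sent_feat: Sequence[float],
                  start: int, end: int) -> List[float]:
    """Concatenate start-character, end-character and sentence-level features."""
    return list(char_feats[start]) + list(char_feats[end]) + list(sent_feat)


def predict_type(features: Sequence[float], type_weights: Dict[str, Sequence[float]],
                 threshold: float = 0.0) -> Optional[str]:
    """Return the best-scoring type, or None when no score exceeds the threshold."""
    best_type, best_score = None, threshold
    for name, weights in type_weights.items():
        score = sum(f * w for f, w in zip(features, weights))
        if score > best_score:
            best_type, best_score = name, score
    return best_type


def filter_spans(spans: List[Tuple[int, int]], char_feats, sent_feat,
                 type_weights, target_types: Set[str]) -> List[Tuple[int, int, str]]:
    """Keep a fragment only if its predicted type is a target entity type."""
    kept = []
    for start, end in spans:
        predicted = predict_type(span_features(char_feats, sent_feat, start, end), type_weights)
        if predicted in target_types:
            kept.append((start, end, predicted))
    return kept


char_feats = [[1.0], [0.0], [1.0]]   # one toy feature per character
sent_feat = [1.0]                    # one toy sentence-level feature
type_weights = {"Company": [1.0, 1.0, 0.0], "Other": [0.0, 0.0, 0.5]}
kept = filter_spans([(0, 2), (1, 1)], char_feats, sent_feat, type_weights, {"Company"})
```

The fragment covering positions 0 to 2 is retained as a `Company`, while the fragment at position 1 is predicted as `Other` and discarded because that type is not in the target set.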

S340, the subtask target information prediction module further judges the relationship between the entity pairs based on the feature information of the relationship prediction subtask to obtain the feature representation of the target relationship, judges whether the type of the target relationship belongs to the target relationship type according to the feature representation of the target relationship, if so, retains the target relationship, and if not, discards the target relationship.

Specifically, the subtask target information prediction module judges the relationship between entity pairs based on the feature information of the relation prediction subtask, and refines this judgment into a type judgment between the corresponding start positions, and between the corresponding end positions, of the head entity and the tail entity. Taking the start positions as an example, the character-level features of the head entity and the tail entity are taken and concatenated with the sentence-level features, the relationship between the entities is judged according to the target relation prediction features to obtain a feature representation of the target relationship, and whether the target relationship is a relationship of type l is predicted from this feature representation, where l ranges over the full set of target relation types predefined in the industry chain map schema of this embodiment. The relation type between the end positions of the entity pair is calculated in the same way.

S350, obtaining a target triple according to the target entity fragment and its corresponding target relationship, wherein a target entity fragment without a corresponding relationship is target independent entity information.

The target entity fragments are combined with their corresponding target relationships: target entity fragments that have a corresponding relationship are combined into a triple, which is a relation triple with the structure head entity-relationship-tail entity, and target entity fragments without a corresponding relationship are the target independent entity information. The target triple is a target first triple or an initial second triple.
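Step S350 can be sketched as follows, assuming the retained entity fragments and the retained relationships are already available as plain Python values. The `Triple` type, the `assemble` function and the example entity and relation names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def assemble(entities: List[str],
             relations: List[Tuple[str, str, str]]) -> Tuple[List[Triple], List[str]]:
    """Combine entity fragments and relationships into triples.

    Entities that take part in no relationship are returned separately as
    independent entities.
    """
    triples = [Triple(head, relation, tail) for head, relation, tail in relations]
    linked = {t.head for t in triples} | {t.tail for t in triples}
    independent = [e for e in entities if e not in linked]
    return triples, independent


triples, independent = assemble(
    entities=["lithium carbonate", "power battery", "separator"],
    relations=[("lithium carbonate", "upstream_of", "power battery")],
)
```

Here `separator` has no corresponding relationship, so it is kept as independent entity information rather than being dropped.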

Referring again to fig. 1, the method for automatically constructing an industry chain map from a research report further comprises the steps of:

S400, extracting a target attribute pair in the sentence sequence containing the index description by adopting an index attribute extraction model, wherein the target attribute pair comprises a target first attribute and a target second attribute.

Referring to fig. 4, the extracting a target attribute pair in a sentence sequence containing an index description includes:

S410, judging whether the sentence sequence contains indexes, if so, extracting a target attribute pair in the sentence sequence by adopting the index attribute extraction model;

S420, the target attribute pair is a simple attribute pair or a complex attribute pair.

Specifically, a text classification model judges whether the sentence sequence contains indexes; if so, the index attribute extraction model extracts the target attribute pairs in the sentence sequence. There may be one or more target attribute pairs, and each is a target simple attribute pair or a target complex attribute pair. The target simple attribute pair comprises a target first attribute and a target second attribute, where the target first attribute is the name of the index corresponding to the target attribute pair and the target second attribute is the value of that index. The target complex attribute pair comprises a target first attribute and a plurality of target second attributes, and the target second attributes in the target complex attribute pair comprise the value of the target complex attribute pair and at least one constraint on the target complex attribute pair.
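One possible way to represent simple and complex attribute pairs, and to gate extraction on the classification result, is sketched below. The `AttributePair` class, the constraint keys, and the two stub functions standing in for the text classification model and the index attribute extraction model are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AttributePair:
    name: str    # target first attribute: the name of the index
    value: str   # target second attribute: the value of the index
    # Further second attributes of a complex pair: constraints on the value.
    constraints: Dict[str, str] = field(default_factory=dict)

    @property
    def is_complex(self) -> bool:
        """A pair is complex when it carries at least one constraint."""
        return bool(self.constraints)


def extract_attribute_pairs(sentence: str,
                            contains_index: Callable[[str], bool],
                            extract: Callable[[str], List[AttributePair]]) -> List[AttributePair]:
    """Run the extractor only on sentences judged to contain an index."""
    if not contains_index(sentence):
        return []
    return extract(sentence)


def has_digit(sentence: str) -> bool:
    """Stub classifier (assumption): treat any digit as an index mention."""
    return any(ch.isdigit() for ch in sentence)


def stub_extract(sentence: str) -> List[AttributePair]:
    """Stub extractor (assumption): returns a fixed complex pair."""
    return [AttributePair("market share", "30%", {"time": "2021"})]


pairs = extract_attribute_pairs("In 2021 the market share reached 30%.", has_digit, stub_extract)
skipped = extract_attribute_pairs("The industry chain is long.", has_digit, stub_extract)
```

Modelling the simple pair as a complex pair with no constraints keeps a single type for both cases, which is a design choice of this sketch rather than something the description prescribes.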

S500, for a sentence sequence containing attribute pairs, matching and aligning the obtained one or more target attribute pairs with the initial second triple to obtain a target second triple, wherein the target second triple contains the initial second triple and one or more target attribute pairs corresponding to the initial second triple.

The matching and aligning the obtained one or more target attribute pairs with the initial second triple includes:

For a complex sentence sequence containing an index description, the initial second triple and its corresponding target attribute pairs are obtained through the entity relationship synchronous extraction model and the index attribute extraction model respectively. Each target attribute pair comprises the target first attribute and the target second attribute. The target first attribute and the target second attribute are matched and aligned with the triple, with some of the target second attributes matched to the head and tail entities of the triple, so that the alignment between attributes and relationships is completed, the information expressed by the initial second triple is enriched, and the target second triple is obtained.
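The matching and alignment of step S500 can be sketched as below. The rule used here, attaching an attribute pair to a triple when one of the pair's constraint values names the triple's head or tail entity, is an assumed simplification of "matching some of the target second attributes with the head and tail entities"; the class and function names and the example data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship, tail entity)


@dataclass
class AttributePair:
    name: str
    value: str
    constraints: Dict[str, str] = field(default_factory=dict)


@dataclass
class AttributedTriple:
    """An initial second triple together with its aligned attribute pairs."""
    triple: Triple
    attributes: List[AttributePair] = field(default_factory=list)


def align(triples: List[Triple], pairs: List[AttributePair]) -> List[AttributedTriple]:
    """Attach each attribute pair to every triple whose head or tail it mentions."""
    result = [AttributedTriple(t) for t in triples]
    for pair in pairs:
        mentioned = set(pair.constraints.values())
        for item in result:
            head, _, tail = item.triple
            if head in mentioned or tail in mentioned:
                item.attributes.append(pair)
    return result


aligned = align(
    triples=[("Company A", "supplies", "Company B"), ("Company C", "supplies", "Company D")],
    pairs=[AttributePair("supply share", "30%", {"supplier": "Company A", "year": "2021"})],
)
```

Only the first triple receives the `supply share` pair, because the pair's `supplier` constraint names its head entity; the second triple is left without attributes.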

S600, adding the target first triple and the target second triple to a target industry chain map.

A target industry chain map is constructed by adding the obtained target first triples and target second triples to the industry chain map. Beyond simple entity-relation definitions, the resulting target industry chain map covers complex situations such as triple overlapping and carries the relation attributes needed to describe the large amount of index data in research report text.

The extracted target independent entities facilitate subsequent reasoning and evolution of the map: when more research reports are added to jointly construct the target industry chain map, more related features can be extracted from the newly added research reports more quickly.
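A minimal in-memory sketch of step S600 follows, assuming the map is held as an adjacency structure in which a target first triple is an edge without attributes, a target second triple is an edge carrying its aligned attribute pairs, and independent entities are stored as unconnected nodes. The `IndustryChainGraph` class and the example data are hypothetical; a production system would more likely use a graph database.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, Dict[str, str]]  # (relationship, tail entity, attributes)


class IndustryChainGraph:
    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.edges: Dict[str, List[Edge]] = defaultdict(list)

    def add_triple(self, head: str, relation: str, tail: str,
                   attributes: Optional[Dict[str, str]] = None) -> None:
        """Add a first triple (no attributes) or a second triple (with attributes)."""
        self.nodes.update((head, tail))
        edge = (relation, tail, dict(attributes or {}))
        if edge not in self.edges[head]:   # skip exact duplicates
            self.edges[head].append(edge)

    def add_independent_entity(self, name: str) -> None:
        """Keep an entity with no relationship yet, for later reports to link."""
        self.nodes.add(name)

    def neighbours(self, head: str) -> List[str]:
        return [tail for _, tail, _ in self.edges.get(head, [])]


graph = IndustryChainGraph()
graph.add_triple("lithium carbonate", "upstream_of", "power battery")
graph.add_triple("Company A", "supplies", "Company B", {"supply share": "30%"})
graph.add_independent_entity("separator")
```

Because `separator` is already a node, a relationship found for it in a later research report only needs a new edge.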

This embodiment provides a method for automatically constructing an industry chain map from a research report, which can automatically convert long natural-language text containing industry chain knowledge into entities with attributes and relationship links in the map. The method uses an entity relationship synchronous extraction model that promotes bidirectional information interaction between tasks, alleviates the loss of precision and efficiency caused by error propagation and redundant computation, and can effectively handle nested entities and the single-entity-overlap and entity-pair-overlap cases of the triple overlapping problem. In addition, attribute extraction mines the useful information contained in the large amount of index data in research report text. Further, by aligning the index attributes to the corresponding relationships, a target industry chain map composed of target triples with more complete information and target independent entity information is finally obtained.

In summary, this embodiment provides a method for automatically constructing an industry chain map from a research report. After a research-report-oriented industry chain map schema containing a target entity type, a target relationship type, and a target attribute type is loaded, an original research report set is acquired, and each original research report in the set is preprocessed to obtain a target text. An entity relationship synchronous extraction model then simultaneously extracts the target triples and target independent entities in the sentence sequence, where each target triple is a target first triple or an initial second triple. Next, an index attribute extraction model extracts the target attribute pairs in sentence sequences containing an index description, where each target attribute pair includes the target first attribute and the target second attribute. For a sentence sequence containing attribute pairs, the obtained one or more target attribute pairs are matched and aligned with the initial second triple to obtain a target second triple, which contains the initial second triple and its one or more corresponding target attribute pairs. Finally, the target first triple and the target second triple are added to the industry chain map. The method can effectively meet the requirement of automatically constructing a large-scale industry chain map from research reports in complex situations, attends to the complex index attributes of relationships, meets the extraction requirements for entity relationships and related attributes with a more accurate and efficient model, and reduces labor and time costs.

It should be understood that, although the steps in the flowcharts shown in the drawings of the present specification are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Example two

Based on the above embodiment, the present invention further provides an apparatus for automatically constructing an industry chain map from a research report, a schematic diagram of functional modules of the apparatus is shown in fig. 5, and the apparatus for automatically constructing an industry chain map from a research report includes:

an industry chain map schema loading module, configured to load a research-report-oriented industry chain map schema containing a target entity type, a target relationship type and a target attribute type, wherein entity type information to be extracted and triple type information to be extracted are predefined in the industry chain map schema, the triple is a first triple or a second triple, the first triple and the second triple both have the structure head entity type-relationship type-tail entity type, in the second triple the relationship type further comprises at least one attribute pair corresponding to the relationship type, the attribute pair is a simple attribute pair or a complex attribute pair, the simple attribute pair comprises a first attribute and a second attribute, the first attribute is the name of the attribute pair, and the second attribute is the value of the attribute pair; the complex attribute pair comprises a first attribute and a plurality of second attributes, and the second attributes in the complex attribute pair comprise a value of the complex attribute pair and at least one constraint on the complex attribute pair, as described in embodiment one;

a target text acquisition module, configured to acquire an original research report set and respectively preprocess each original research report in the original research report set to obtain a target text, where the target text is composed of a non-empty sentence sequence, as described in embodiment one;

an entity relationship synchronous extraction module, configured to extract a target triple and a target independent entity in the sentence sequence simultaneously by using an entity relationship synchronous extraction model, where the target triple is a target first triple or an initial second triple, as described in embodiment one;

an index attribute extraction module, configured to extract, by using an index attribute extraction model, a target attribute pair in a sentence sequence containing an index description, where the target attribute pair includes a target first attribute and a target second attribute, as described in embodiment one;

an attribute-relationship alignment module, configured to, for a sentence sequence including an attribute pair, perform matching alignment on the obtained one or more target attribute pairs and the initial second triple to obtain a target second triple, where the target second triple includes the initial second triple and one or more target attribute pairs corresponding to the initial second triple, as described in embodiment one;

a target industry chain map obtaining module, configured to add the target first triple and the target second triple to a target industry chain map, as described in embodiment one.
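The way the modules listed above could be wired together is sketched below. Every module is passed in as a plain callable so that trained models can be swapped for stubs; the `IndustryChainMapBuilder` class, its field names, the dictionary used as the map, and the stub lambdas are all assumptions made for illustration and are not the apparatus as claimed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class IndustryChainMapBuilder:
    """Wires the modules together; every field is a pluggable callable."""
    load_schema: Callable[[], Any]
    acquire_text: Callable[[List[str]], List[str]]
    # (sentence, schema) -> (first triples, initial second triples, independent entities)
    extract_triples: Callable[[str, Any], Tuple[List[Triple], List[Triple], List[str]]]
    extract_attributes: Callable[[str], List[Dict[str, str]]]
    align: Callable[[List[Triple], List[Dict[str, str]]], List[Any]]

    def build(self, reports: List[str]) -> Dict[str, List[Any]]:
        schema = self.load_schema()
        graph: Dict[str, List[Any]] = {"first": [], "second": [], "independent": []}
        for sentence in self.acquire_text(reports):
            first, initial_second, independent = self.extract_triples(sentence, schema)
            pairs = self.extract_attributes(sentence)
            graph["first"].extend(first)
            graph["second"].extend(self.align(initial_second, pairs))
            graph["independent"].extend(independent)
        return graph


# Stub modules standing in for the trained models (illustrative only).
builder = IndustryChainMapBuilder(
    load_schema=lambda: {"relation_types": ["supplies"]},
    acquire_text=lambda reports: [s.strip() for r in reports for s in r.split(".") if s.strip()],
    extract_triples=lambda sentence, schema: (
        ([], [("Company A", "supplies", "Company B")], []) if "30%" in sentence
        else ([], [], ["separator"])
    ),
    extract_attributes=lambda sentence: (
        [{"name": "share", "value": "30%"}] if "30%" in sentence else []
    ),
    align=lambda triples, pairs: [(t, pairs) for t in triples],
)
graph = builder.build(["Company A supplies 30% of Company B's cells. Separator demand is rising."])
```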

Example three

Based on the method for automatically constructing the industry chain map from the research report in the first embodiment, the invention also provides a terminal, a schematic block diagram of which is shown in fig. 6. The terminal comprises a memory 10 and a processor 20, wherein the memory 10 stores a program for automatically constructing an industry chain map from a research report, and the processor 20 executes the program to realize at least the following steps:

Wherein the preprocessing each original research report in the original research report set includes:

The entity relationship synchronous extraction model comprises a sentence sequence coding module, a subtask feature selection module and a subtask target information prediction module;

The extracting of the target attribute pair in the sentence sequence containing the index description comprises the following steps:

Wherein the matching and aligning the obtained one or more target attribute pairs with the initial second triple includes:

The list of the target entity types is dynamically adjusted according to the target text and the target task scene requirements;

Example four

The present invention also provides a storage medium storing one or more programs executable by one or more processors to implement the steps of the method for automatically constructing an industry chain graph from a research report according to the above-described embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for automatically constructing an industry chain map from a research report, comprising:

loading a research-report-oriented industry chain map schema containing a target entity type, a target relationship type and a target attribute type, wherein entity type information to be extracted and triple type information to be extracted are predefined in the industry chain map schema, the triple is a first triple or a second triple, the first triple and the second triple both have the structure head entity type-relationship type-tail entity type, in the second triple the relationship type further comprises at least one attribute pair corresponding to the relationship type, the attribute pair is a simple attribute pair or a complex attribute pair, the simple attribute pair comprises a first attribute and a second attribute, the first attribute is a name of the attribute pair, and the second attribute is a value of the attribute pair; the complex attribute pair comprises a first attribute and a plurality of second attributes, and the second attributes in the complex attribute pair comprise a value of the complex attribute pair and at least one constraint on the complex attribute pair;

acquiring an original research report set, and respectively preprocessing each original research report in the original research report set to obtain a target text, wherein the target text consists of a non-empty sentence sequence;

the entity relation synchronous extraction model comprises a sentence sequence coding module, a subtask feature selection module and a subtask target information prediction module;

obtaining a target triple according to the target entity fragment and a target relation corresponding to the target entity fragment, wherein the target entity fragment without the corresponding relation is target independent entity information;

2. The method for automatically constructing an industry chain map from a research report of claim 1, wherein the preprocessing of each original research report in the original research report set comprises:

performing text cleaning on the first text to remove noise characters from the first text and obtain a second text, wherein the noise characters in the first text are characters that contribute no actual description to the real text;

3. The method for automatically constructing an industry chain map from a research report of claim 1, wherein the extracting of target attribute pairs in a sentence sequence containing index descriptions comprises:

4. The method for automatically constructing an industry chain map from a research report of claim 1, wherein the matching and aligning of the obtained one or more target attribute pairs with the initial second triple comprises:

5. The method for automatically constructing an industry chain map from a research report of claim 1, wherein the list of target entity types is dynamically adjusted according to the target text and the target task scenario requirements;

6. An apparatus for automatically constructing an industry chain map from a research report, the apparatus comprising:

a target text acquisition module, configured to acquire an original research report set and respectively preprocess each original research report in the original research report set to obtain a target text, wherein the target text consists of a non-empty sentence sequence;

7. A terminal, characterized in that the terminal comprises: a processor and a storage medium communicatively coupled to the processor, the storage medium being adapted to store a plurality of instructions, and the processor being adapted to invoke the instructions in the storage medium to perform the steps of the method for automatically constructing an industry chain map from a research report of any of claims 1-5.

8. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to perform the steps of the method for automatically constructing an industry chain map from a research report as recited in any of claims 1-5.