CN117473036B - Enterprise information extraction method, map construction method and system and storage medium

Info

Publication number
CN117473036B
Authority
CN
China
Prior art keywords
current
financial
enterprise information
information
enterprise
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311552385.8A
Other languages
Chinese (zh)
Other versions
CN117473036A (en)
Inventor
王建
孙昕
王佐成
李�浩
吕孝忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Space Research Institute
Original Assignee
Data Space Research Institute
Application filed by Data Space Research Institute
Priority to CN202311552385.8A
Publication of CN117473036A
Application granted
Publication of CN117473036B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of information processing, and particularly relates to an enterprise information extraction method, a map construction method, a system and a storage medium. The extraction method comprises the following steps: financial information is captured, de-duplicated and stored in a local database, and the local database sends newly added financial information to a text preprocessing module to obtain financial texts; a requirement is sent to a specific task unit in an adjustment module, the specific task unit generates a specific task based on the requirement and on an optimization unit in the adjustment module, and a large language model extracts keywords from the financial texts based on the specific task to form enterprise information; a screening module calculates the similarity between the enterprise information and the current specific task and outputs the enterprise information whose similarity is above a set value both to a technician and to the optimization unit of the adjustment module, which calculates a joint loss function and optimizes the specific task and the parameters of the large language model based on that joint loss function. The invention can accurately and efficiently extract enterprise information from financial information.

Description

Enterprise information extraction method, map construction method and system and storage medium
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to an enterprise information extraction method, a map construction method, a system and a storage medium.
Background
With the development of the information age, a large amount of financial information is continuously emerging. This information covers not only the development state of each field, but also the development state of the enterprises in each field, the upstream and downstream relationships among enterprises, and so on. Effective content is integrated from the financial information to form enterprise information, and the enterprise information is used to construct an enterprise map that effectively assists market analysis and business layout.
In the prior art, large amounts of financial information are collected manually; according to the direction of the requirement, enterprise information meeting the requirement is extracted from the financial information by reading and summarizing it, and an enterprise map is then constructed manually from the summarized enterprise information.
However, obtaining enterprise information and constructing enterprise maps in this way is time-consuming and labor-intensive. Because different personnel differ in their familiarity with different fields and apply different standards, the obtained enterprise information is highly subjective, uneven in quality and low in accuracy, which directly affects the quality of the enterprise map constructed afterwards. In particular, gaining a deep understanding of the market in a specific field or developing a potential market requires a large amount of accurate enterprise information to construct the enterprise map, and the richness of that information must also be guaranteed. Manually extracting a large amount of enterprise information not only takes an extremely long time but may also miss enterprise information containing potential market content; the richness of the enterprise information therefore cannot be guaranteed, which in turn affects the market research and judgment that technicians subsequently carry out based on the enterprise map.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides an enterprise information extraction method which can accurately and efficiently extract enterprise information meeting requirements from financial information.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
S1, financial information captured from each major platform is de-duplicated and stored in a local database, and the local database sends newly added financial information to a text preprocessing module for preprocessing to obtain financial texts;
S2, a requirement A is sent to a specific task unit in an adjustment module, the specific task unit generates a specific task based on the requirement A and on an optimization unit in the adjustment module, and a large language model, based on the specific task, retrieves financial texts from the text preprocessing module in sequence and extracts keywords from them to form structured enterprise information;
one piece of enterprise information consists of a plurality of keywords;
S3, each piece of enterprise information enters a screening module; the screening module calculates the similarity between each piece of enterprise information and the current specific task, and the enterprise information whose similarity is above a set value is output to technicians as qualified enterprise information; meanwhile, the screening module sends the qualified enterprise information to the optimization unit in the adjustment module to calculate a joint loss function, and the method returns to S2: the optimization unit optimizes the specific task and the parameters of the large language model based on the joint loss function, and the large language model with optimized parameters extracts keywords from new financial texts based on the optimized specific task to form structured enterprise information.
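The control flow of S1 to S3 can be sketched as a simple loop. This is only an illustrative skeleton under stated assumptions: the callables `extract`, `similarity` and `optimize` are hypothetical stand-ins for the large language model, the screening module and the optimization unit; the inclusive comparison with the set value is an assumption; and the sketch updates only the task, whereas the described method also optimizes the model parameters.

```python
from typing import Callable, Iterable, List, Tuple

Info = Tuple[str, ...]  # one piece of enterprise information = a tuple of keywords


def run_extraction(
    texts: Iterable[str],
    extract: Callable[[str, str], List[Info]],   # (task, text) -> candidate information
    similarity: Callable[[Info, str], float],    # (information, task) -> similarity
    optimize: Callable[[str, List[Info]], str],  # (task, qualified) -> updated task
    task: str,
    set_value: float,
) -> List[Info]:
    """Loop S2 -> S3 -> S2 over the financial texts and collect qualified information."""
    qualified_all: List[Info] = []
    for text in texts:
        candidates = extract(task, text)  # S2: keyword extraction
        # S3: screening; ">=" is an assumed reading of "above a set value".
        qualified = [c for c in candidates if similarity(c, task) >= set_value]
        qualified_all.extend(qualified)
        if qualified:
            task = optimize(task, qualified)  # feedback before the next text
    return qualified_all
```

Passing the three stages in as callables keeps the skeleton independent of any particular model or similarity measure.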
Preferably, the step S1 further comprises the following substeps:
S11, when the acquired financial information is in a foreign language or contains letter abbreviations, the foreign-language financial information or the letter abbreviations are translated into Chinese, and the Chinese financial information is compared with the financial information stored in the local database; if the currently acquired financial information duplicates financial information already stored in the local database, it is discarded; otherwise, it is sent to the local database for storage;
the financial information comprises financial news, enterprise annual reports and enterprise business information;
S12, the local database sends the newly added financial information to the text preprocessing module, and the text preprocessing module performs synonym replacement on each piece of financial information, that is, the words in each piece of financial information are replaced by standard words according to a synonym standard word list;
S13, the text preprocessing module applies regular-expression processing to the financial information after synonym replacement: noise content is removed by matching the financial information against regular expression templates to form financial texts, and the financial texts are numbered in time order and then stored;
the noise content includes blank characters, underlines, charts, formulas, image addresses and web page tags.
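Sub-steps S11 to S13 might be sketched as follows. This is a minimal illustration, not the patented implementation: the synonym table, the noise patterns and the hash-based exact-match de-duplication are all assumptions made for the example, the translation step of S11 is omitted, and whitespace tokenization is used only because the sample text is English.

```python
import hashlib
import re

# Hypothetical synonym standard word list: variant -> standard word.
SYNONYMS = {"firm": "company", "corporation": "company"}

# Hypothetical regular expression templates for some of the listed noise types.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),                                # web page tags
    re.compile(r"https?://\S+\.(?:png|jpe?g|gif)", re.I),  # image addresses
    re.compile(r"_+"),                                     # underlines
]


def is_duplicate(item: str, seen: set) -> bool:
    """S11 (simplified): exact-match de-duplication via a content hash."""
    digest = hashlib.sha256(item.strip().encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False


def replace_synonyms(item: str) -> str:
    """S12: replace each whitespace-separated token by its standard word."""
    return " ".join(SYNONYMS.get(token, token) for token in item.split())


def remove_noise(item: str) -> str:
    """S13: strip noise with the templates, then collapse blank characters."""
    for pattern in NOISE_PATTERNS:
        item = pattern.sub(" ", item)
    return re.sub(r"\s+", " ", item).strip()


def preprocess(items):
    """Return (number, financial text) pairs, numbered in arrival order."""
    seen, texts = set(), []
    for item in items:
        if is_duplicate(item, seen):
            continue
        texts.append(remove_noise(replace_synonyms(item)))
    return list(enumerate(texts, start=1))
```

For Chinese text, the token-level synonym lookup would need a word segmenter in place of `str.split`.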
Preferably, the step S2 further comprises the following substeps:
S21, a plurality of financial texts are selected manually and the keywords used to form enterprise information are marked in them; the financial texts with manually marked keywords are taken as a pre-training data set, and after the large language model is pre-trained on this data set, a Prompt module in the large language model constructs a preliminary prompt text template;
in the large language model, the Prompt module generates a prompt text T1 according to the prompt text template and the financial text, and the large language model extracts keywords from each financial text based on the prompt text T1 to form structured enterprise information;
S22, a technician sets a requirement A, and the requirement A is sent to the specific task unit in the adjustment module; the specific task unit generates a specific task according to the requirement A and the output of the optimization unit in the adjustment module; based on the current specific task, the parameters of the prompt text template of the Prompt module in the large language model are adjusted by the P-Tuning fine-tuning method; meanwhile, the current specific task is introduced into the Prompt module, and the Prompt module uses the adjusted prompt text template to generate a prompt text T2 with specific constraints according to the current specific task; the large language model retrieves financial texts from the text preprocessing module in numbering order and performs keyword extraction on the currently retrieved financial text based on the prompt text T2 to form structured enterprise information.
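The difference between the prompt texts T1 and T2 can be illustrated with a plain text template. The sketch below only composes strings: the template wording is hypothetical (the source does not disclose it), and the P-Tuning adjustment of template parameters is not modeled here.

```python
from string import Template

# Hypothetical template wording, used for illustration only.
PROMPT_TEMPLATE = Template(
    "Extract entity names, relationships between entities and relationship "
    "time periods from the financial text below.\n"
    "${constraint}"
    "Financial text No. ${number}:\n${text}"
)


def build_prompt(number: int, text: str, task: str = "") -> str:
    """Return T1 when no specific task is given, otherwise the constrained T2."""
    constraint = f"Only keep information relevant to: {task}\n" if task else ""
    return PROMPT_TEMPLATE.substitute(constraint=constraint, number=number, text=text)
```

Calling `build_prompt(1, text)` yields an unconstrained prompt in the role of T1, while `build_prompt(1, text, task="chip suppliers")` adds the task constraint in the role of T2.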
Preferably, the step S3 further comprises the following substeps:
S31, the large language model binds each piece of enterprise information to the number of the financial text from which it was obtained and sends it to the screening module; any piece of enterprise information is denoted W(i), and the b keywords contained in W(i) are written $W(i) = \{W_{i1}, \dots, W_{ia}, \dots, W_{ib}\}$, where $1 \le a \le b$ and a and b are positive integers; according to the financial text number, the screening module retrieves the financial text x of the current enterprise information W(i) from the text preprocessing module;
meanwhile, the screening module retrieves the current specific task k from the specific task unit and, using a machine-learning approach that combines a pre-trained language model with a conditional random field, segments the current specific task k into words and removes repeated words to obtain n task words, which form the task phrase R(k) of the specific task k: $R(k) = \{R_{k1}, \dots, R_{kd}, \dots, R_{kn}\}$, where $R_{kd}$ denotes the d-th task word in the current task phrase R(k), $1 \le d \le n$, and d and n are positive integers;
S32, the screening module calculates the word frequency of each keyword of the enterprise information W(i) in the corresponding financial text x and the inverse document frequency of each keyword, and from these calculates the weight of each keyword of the current enterprise information W(i) in the current financial text x; meanwhile, the screening module calculates the word frequency and the inverse document frequency of each task word in the current task phrase R(k) and from these calculates the weight of each task word; the similarity between the current task phrase R(k) and the enterprise information W(i) is then calculated from the keyword weights and the task word weights, and the enterprise information whose similarity is above the set value is output to technicians and to the optimizing unit as qualified enterprise information;
S33, the optimization unit calculates a joint loss function $L_{total}(\theta)$ in real time from the received qualified enterprise information:
$L_{total}(\theta) = L_{pretrain}(\theta) + \lambda \cdot L_{task}(\theta)$
where $L_{pretrain}(\theta)$ is the large language model loss function calculated the previous time from qualified enterprise information, $L_{task}(\theta)$ is the task-specific large language model loss function calculated from the current qualified enterprise information, $\lambda$ is a manually set task-specific loss hyperparameter, and $\theta$ denotes the parameter set of the large language model;
the specific task in the specific task unit and the parameters of the large language model, which include among others the parameters of the prompt text template of the Prompt module, are optimized along the descent direction of the gradient of the joint loss function $L_{total}(\theta)$; the method then returns to S22, and the optimized specific task and the large language model with optimized parameters are used to extract keywords from new financial texts, forming enterprise information that conforms better to the requirement A.
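The joint loss and the gradient-descent update can be illustrated with a toy numeric example. The quadratic losses and the scalar parameter below are assumptions chosen so the result can be checked by hand; they do not represent the actual model losses.

```python
def joint_loss(l_pretrain: float, l_task: float, lam: float) -> float:
    """L_total = L_pretrain + lambda * L_task."""
    return l_pretrain + lam * l_task


def descend(theta: float, a: float, b: float, lam: float,
            lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on toy losses L_pretrain=(theta-a)^2 and L_task=(theta-b)^2."""
    for _ in range(steps):
        # d/dtheta of the joint loss for the two toy quadratics.
        grad = 2.0 * (theta - a) + lam * 2.0 * (theta - b)
        theta -= lr * grad  # step in the direction of decreasing joint loss
    return theta
```

For these toy losses the minimizer is (a + λ·b) / (1 + λ), which shows how λ balances the previous loss against the task-specific loss: a larger λ pulls the parameter closer to the task optimum b.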
Preferably, in S32, the following sub-steps are further included:
S321, using a machine-learning approach that combines a pre-trained language model with a conditional random field, the screening module segments each financial text in the text preprocessing module to obtain the word set of each financial text; the word set corresponding to financial text x is denoted $P(x) = \{P_{x1}, \dots, P_{xc}, \dots, P_{xd}\}$, where $1 \le c \le d$, c and d are positive integers, and $P_{xc}$ denotes the c-th word in the word set P(x),
the screening module keeps each repeated word of the word set P(x) only once to form the corresponding de-duplicated word set $F(x) = \{F_{x1}, \dots, F_{xc}, \dots, F_{xm}\}$, where $F_{xc}$ denotes the c-th candidate keyword in F(x), $1 \le c \le m \le f$, and c, m and f are positive integers,
the word frequency of each keyword of the enterprise information W(i) in the corresponding financial text x is calculated:
$TF(W_{ia}) = E_{ia} / d$,
where $TF(W_{ia})$ denotes the word frequency with which keyword $W_{ia}$ of the enterprise information W(i) occurs in financial text x, $E_{ia}$ denotes the number of occurrences of keyword $W_{ia}$ in the corresponding word set P(x), and d denotes the total number of words contained in P(x),
the screening module records the total number g of word sets in the current text preprocessing module and, for each keyword of the current enterprise information W(i), the number of those g word sets that contain it; the number of word sets containing keyword $W_{ia}$ is denoted $H_{ia}$, and the inverse document frequency $IDF(W_{ia})$ of the current keyword $W_{ia}$ is calculated:
$IDF(W_{ia}) = \log_2(g) - \log_2(H_{ia})$,
the weight of each keyword of the current enterprise information W(i) in the current financial text x is then calculated:
$TF\text{-}IDF(W_{ia}) = TF(W_{ia}) \times IDF(W_{ia}) = (E_{ia}/d) \times [\log_2(g) - \log_2(H_{ia})]$,
where $TF\text{-}IDF(W_{ia})$ denotes the weight of keyword $W_{ia}$ of the current enterprise information W(i) in financial text x,
the current enterprise information W(i) is vectorized: $V[W(i)] = \{V_{wi1}, \dots, V_{wia}, \dots, V_{wib}\}$,
where V[W(i)] is the vectorized representation of the current enterprise information W(i), the number of dimensions of V[W(i)] equals the number of keywords in W(i), the dimensions of V[W(i)] correspond one-to-one to the keywords of W(i), and $V_{wia}$ denotes the value of the a-th dimension of V[W(i)], which is the weight $TF\text{-}IDF(W_{ia})$ of keyword $W_{ia}$ in the corresponding financial text x;
meanwhile, the screening module records the total number g of word sets in the current text preprocessing module, the number of those g word sets that contain each task word of the current task phrase R(k), and the number of times each task word of R(k) occurs in each word set, and calculates the word frequency of each task word of the current task phrase R(k):
$TF[R_{kd} \mid P(y)] = E[R_{kd} \mid P(y)] / f[P(y)]$,
where P(y) denotes the word set corresponding to financial text y and is one of the g word sets contained in the current text preprocessing module, $TF[R_{kd} \mid P(y)]$ denotes the word frequency with which task word $R_{kd}$ occurs in the word set P(y), $E[R_{kd} \mid P(y)]$ denotes the number of occurrences of task word $R_{kd}$ in P(y), and $f[P(y)]$ denotes the total number of words contained in P(y),
the inverse document frequency of each task word of the current task phrase R(k) is calculated:
$IDF(R_{kd}) = \log_2(g) - \log_2[H(R_{kd})]$,
where $IDF(R_{kd})$ denotes the inverse document frequency of task word $R_{kd}$ over the g word sets contained in the current text preprocessing module, and $H(R_{kd})$ denotes the number of those g word sets that contain task word $R_{kd}$,
the weight of each task word of the current task phrase R(k) in each of the g word sets contained in the current text preprocessing module is calculated:
$TF\text{-}IDF[R_{kd} \mid P(y)] = TF[R_{kd} \mid P(y)] \times IDF(R_{kd})$,
where $TF\text{-}IDF[R_{kd} \mid P(y)]$ denotes the weight of task word $R_{kd}$ of the current task phrase R(k) in the word set P(y),
the current task phrase R(k) is vectorized:
$V[R(k) \mid P(y)] = \{V[R_{k1} \mid P(y)], \dots, V[R_{kd} \mid P(y)], \dots, V[R_{kn} \mid P(y)]\}$,
where $V[R(k) \mid P(y)]$ is the vectorized representation of the current task phrase R(k) based on the word set P(y); its number of dimensions equals the number of task words in R(k), that is, it is n-dimensional; its dimensions correspond one-to-one to the task words of R(k); and $V[R_{kd} \mid P(y)]$ denotes the value of its d-th dimension, which is the weight $TF\text{-}IDF[R_{kd} \mid P(y)]$ of task word $R_{kd}$ in the word set P(y);
S322, the screening module calculates the similarity Simi[R(k), W(i)] between the current task phrase R(k) and the current enterprise information W(i) from the vectors V[W(i)] and V[R(k)|P(y)], where G denotes all word sets contained in the current text preprocessing module,
if Simi[R(k), W(i)] is above the set value, the current enterprise information W(i) is qualified enterprise information, and the screening module outputs the qualified enterprise information to the technician and to the optimizing unit.
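The weighting in S321 can be checked with a small worked example. The sketch below follows the TF, IDF and TF-IDF definitions given above; the toy corpus, the error raised when a word occurs in no word set, and the inclusive threshold in `screen` are assumptions. The similarity measure itself is not implemented here, so `screen` takes precomputed similarity values.

```python
import math
from typing import List, Sequence, Tuple


def tf(word: str, tokens: Sequence[str]) -> float:
    """TF = E / d: occurrences of `word` over the total number of words in one word set."""
    return tokens.count(word) / len(tokens)


def idf(word: str, corpus: Sequence[Sequence[str]]) -> float:
    """IDF = log2(g) - log2(H): g word sets in total, H of them contain `word`."""
    g = len(corpus)
    h = sum(1 for tokens in corpus if word in tokens)
    if h == 0:
        # Assumption: the source does not say how to handle H = 0.
        raise ValueError(f"{word!r} occurs in no word set, so log2(H) is undefined")
    return math.log2(g) - math.log2(h)


def tfidf_vector(words: Sequence[str], tokens: Sequence[str],
                 corpus: Sequence[Sequence[str]]) -> List[float]:
    """One TF-IDF weight per word, computed in the word set `tokens`."""
    return [tf(w, tokens) * idf(w, corpus) for w in words]


def screen(scored: Sequence[Tuple[str, float]], set_value: float) -> List[str]:
    """Keep the enterprise information whose similarity reaches the set value."""
    return [info for info, simi in scored if simi >= set_value]


corpus = [
    ["acme", "supplies", "chips", "chips"],
    ["weather", "report"],
    ["acme", "builds", "cars"],
    ["rain", "today"],
]
# "chips": TF = 2/4, IDF = log2(4) - log2(1) = 2, weight = 1.0
# "acme":  TF = 1/4, IDF = log2(4) - log2(2) = 1, weight = 0.25
print(tfidf_vector(["chips", "acme"], corpus[0], corpus))  # [1.0, 0.25]
```

The same `tfidf_vector` function serves both for the keywords of W(i) in text x and for the task words of R(k) in a word set P(y), since both use the same TF and IDF definitions.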
Preferably, each piece of enterprise information is composed of a plurality of keywords; the keywords comprise entity names, the relationships between entities and the time periods over which those relationships last; the entities comprise enterprises, products, technologies and fields; and the relationships between entities comprise the field to which an enterprise belongs, the products an enterprise produces, the technologies an enterprise produces, the suppliers of an enterprise, the fields an enterprise is potentially involved in and the potential suppliers of an enterprise.
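One possible in-memory representation of a piece of structured enterprise information is sketched below. Modeling it as a single (head, relation, tail) triple with an optional time period is an assumption for illustration; the field names and the ISO date strings are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EnterpriseInfo:
    head: str                    # entity name, e.g. an enterprise
    relation: str                # e.g. "supplier of", "produces", "belongs to field"
    tail: str                    # enterprise, product, technology or field
    start: Optional[str] = None  # start of the relationship period (assumed ISO date)
    end: Optional[str] = None    # end of the relationship period (assumed ISO date)

    def keywords(self) -> Tuple[str, ...]:
        """Return the keywords that make up this piece of enterprise information."""
        parts = (self.head, self.relation, self.tail, self.start, self.end)
        return tuple(p for p in parts if p is not None)
```

Making the class frozen allows instances to be used as dictionary keys or set members, which is convenient when identical information has to be merged.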
The invention also provides an enterprise map construction method, which builds on the enterprise information extraction method described above and further comprises S4:
S4, the same entities in the output qualified enterprise information are merged to form an enterprise map, and the enterprise map is continuously updated with newly generated qualified enterprise information.
Preferably, the following are specifically included in S4:
Combining the same entities in all the qualified enterprise information to form an enterprise map;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise map but the relationship between them differs from the relationship between the corresponding entities in the current enterprise map, the relationship between the entities in the current qualified enterprise information and/or the time period of that relationship is added to the corresponding entities in the current enterprise map;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise map and the relationship between them is the same as the relationship between the corresponding entities in the current enterprise map, it is judged whether the time period of the relationship in the current qualified enterprise information is the same as that in the current enterprise map. If the termination time of the time period in the current enterprise map is not later than the termination time of the time period in the current qualified enterprise information, the time period in the current enterprise map is replaced by the time period in the current qualified enterprise information so as to update the current enterprise map; if the termination time of the time period in the current enterprise map is later than the termination time of the time period in the current qualified enterprise information, or if no time period exists in the current qualified enterprise information, the current qualified enterprise information is discarded.
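The update rules of S4 can be sketched with a dictionary keyed by (head entity, relation, tail entity), so that identical entities and relations are merged automatically. The data layout is an assumption, as is the handling of a stored relation that has no time period yet (the sketch adopts the new period in that case).

```python
from datetime import date
from typing import Dict, Optional, Tuple

Period = Tuple[date, date]   # (start, end) of a relationship
Edge = Tuple[str, str, str]  # (head entity, relation, tail entity)
EnterpriseMap = Dict[Edge, Optional[Period]]


def update_map(graph: EnterpriseMap, edge: Edge, period: Optional[Period]) -> bool:
    """Apply one piece of qualified enterprise information; return True if the map changed."""
    if edge not in graph:
        graph[edge] = period  # new entities or a new relation: add it
        return True
    if period is None:
        return False          # same relation, no period in the new information: discard
    current = graph[edge]
    if current == period:
        return False          # identical period: nothing to update
    if current is None or current[1] <= period[1]:
        graph[edge] = period  # stored period does not end later: replace it
        return True
    return False              # stored period ends later: discard
```

A usage example: adding the same supplier relation twice with a later end date replaces the stored period, while a subsequent piece of information with an earlier end date is discarded.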
The invention also provides an enterprise atlas construction system, which comprises:
A grabbing module, a local database, a text preprocessing module, an adjusting module, a large language model, a screening module and a map construction module,
The grabbing module is used for grabbing financial information from each major platform, removing financial information that duplicates what is already in the local database, and sending the remainder to the local database for storage; the local database is used for storing the financial information sent by the grabbing module and sending newly added financial information to the text preprocessing module; the text preprocessing module processes financial information into financial texts; the adjusting module comprises a specific task unit and an optimizing unit: requirements enter the specific task unit, which forms or modifies the specific task based on the requirements and on the joint loss function calculated by the optimizing unit and sends the specific task to the Prompt module in the large language model, while the optimizing unit calculates the joint loss function from the qualified enterprise information output by the screening module and optimizes the specific task and the parameters of the large language model based on that joint loss function; the Prompt module in the large language model, based on the specific task from the specific task unit, extracts keywords from the financial texts retrieved from the text preprocessing module to form structured enterprise information and sends it to the screening module; the screening module screens out qualified enterprise information, outputs it to a technician, and at the same time sends it to the optimizing unit and the map construction module; the map construction module constructs an enterprise map based on the qualified enterprise information;
the modules, units are programmed or configured to perform the steps of an enterprise atlas construction method as described above.
The present invention also provides a computer readable storage medium storing a computer program programmed or configured to perform an enterprise atlas construction method as described above.
The invention has the beneficial effects that:
(1) The enterprise information extraction method is based on a large language model, with an adjustment module creatively arranged before it and a screening module arranged after it, so that enterprise information is extracted from financial information efficiently and accurately. The adjustment module not only directly receives the requirements of technicians but also calculates a joint loss function from the output of the screening module, and adjusts the content of the specific task and the parameters of the large language model based on the joint loss function and the content of the requirements. The specific task helps the large language model understand the requirements better, so that keywords that better match the requirements are extracted from the financial texts to form enterprise information. Based on word frequency and inverse document frequency, the screening module calculates the similarity between each piece of enterprise information that the large language model generates by keyword extraction and the current specific task, and uses this similarity to further screen out the qualified enterprise information that matches the specific task. Because the similarity set value for qualified enterprise information is defined by technicians according to their requirements, the output of qualified enterprise information is highly flexible: when technicians need to construct a brief and accurate enterprise map in a short time, only a small amount of enterprise information meeting the requirements is needed, and the similarity set value can be raised; when technicians want an in-depth understanding of an industry or field or want to develop a potential market, a detailed and accurate enterprise map must be constructed, a large amount of enterprise information meeting the requirements is needed, and the similarity set value can be lowered. The screening module thus allows the requirements of technicians to be met better.
(2) The invention extracts the keywords to form enterprise information, and sends the qualified enterprise information output to the technician as a verification set to an optimization unit in an adjustment module to calculate a joint loss function, namely, the invention optimizes the content of a specific task and the parameters of a large language model while generating the qualified enterprise information, so that the subsequently generated enterprise information is generated under the optimized specific task and the large language model, and the invention meets the current requirement better. According to the method, the verification set is not required to be additionally arranged, the large language model parameters and the specific tasks are continuously optimized in the process of extracting the keywords to generate the enterprise information, and the enterprise information which meets the requirements can be output in a short time, so that the method is accurate and efficient. The enterprise information extraction process can be self-adjusted to adapt to extraction of different types of enterprise information in different fields.
(3) The joint loss function is the basis for adjusting the specific task and the parameters of the large language model. In the optimization unit, the loss function $L_{task}(\theta)$ is calculated with the qualified enterprise information as a verification set, multiplied by the task-specific loss hyperparameter $\lambda$, and used as one part of the joint loss function; the joint loss function also includes the large language model loss function $L_{pretrain}(\theta)$ calculated the previous time from qualified enterprise information. In other words, the joint loss function is not determined solely by the loss function $L_{task}(\theta)$ that the task-specific large language model yields for the current qualified enterprise information, so that over-fitting and under-fitting, which would cause keywords to be missed or reduce the accuracy of the enterprise information formed from the extracted keywords, are avoided as far as possible.
(4) Compared with the prior art that the enterprise information is extracted from the financial information according to the requirements by manpower, the method and the device automatically extract the enterprise information meeting the requirements from the financial information, save a great deal of labor cost, have higher extraction speed than the manpower, avoid the condition that the quality of an extraction result is influenced by the subjectivity of individuals in the process of extracting the enterprise information by the manpower, and have high accuracy and stability. The invention can carry out exhaustive enterprise information extraction on financial information of each large platform, does not miss any piece of enterprise information based on requirements, avoids the possible missing situation of manual work when carrying out enterprise information extraction, and has extremely high timeliness because the invention can carry out enterprise information extraction meeting the requirements on the latest financial information in time, thereby providing powerful guarantee for subsequent market layout and potential market excavation of enterprises.
(5) The requirements of technicians can be changed at any time, the specific tasks in the invention can be changed according to the change of the requirements, and the screening module can output qualified enterprise information meeting the information requirements in extremely short time, namely, the invention can adjust the requirements at any time and has high flexibility.
(6) Compared with manual construction, the enterprise atlas constructed by the invention has the advantages of accuracy, high efficiency, low cost, high quality and high satisfaction degree of requirements, and the constructed enterprise atlas has the advantages of strong structure, convenient viewing, and real-time updating function, namely extremely high timeliness, and provides a solid foundation for market development of enterprises or industrial layout analysis of bidding companies.
Drawings
FIG. 1 is a flow chart of an enterprise information extraction method according to the present invention;
FIG. 2 is a flow chart of an enterprise atlas construction method according to the present invention;
FIG. 3 is a schematic diagram of the structure and data flow of an enterprise atlas building system according to the present invention;
FIG. 4 is a graph of time-consuming comparison between an enterprise information extraction method and manual extraction according to the present invention;
fig. 5 is a graph of accuracy contrast between an enterprise information extraction method and manual extraction according to the present invention.
Detailed Description
In order to make the technical solution of the invention clearer, the invention is described clearly and completely below with reference to the accompanying drawings. Solutions obtained by a person of ordinary skill in the art, without creative effort, through equivalent replacement of technical features of the technical solution of the invention or through conventional reasoning fall within the protection scope of the invention.
Example 1
An enterprise information extraction method as shown in fig. 1, comprising the following steps:
S1, financial information captured from each large platform is de-duplicated and stored in a local database, and the local database sends newly added financial information to a text preprocessing module for preprocessing to obtain financial texts;
S2, sending the requirement A into a specific task unit in the adjustment module, generating a specific task by the specific task unit based on the requirement A and an optimization unit in the adjustment module, sequentially calling financial texts from the text preprocessing module by the large language model based on the specific task, and extracting keywords from the financial texts to form structured enterprise information; one piece of enterprise information consists of a plurality of keywords;
S3, each piece of enterprise information enters a screening module, and after the screening module calculates the similarity between each piece of enterprise information and the current specific task, the enterprise information with the similarity above a set value is used as qualified enterprise information to be output to technicians; meanwhile, the qualified enterprise information is sent to an optimizing unit in the adjusting module by the screening module to calculate a joint loss function, and then the system returns to S2, the optimizing unit optimizes various parameters in the specific task and the large language model based on the joint loss function, and the large language model after optimizing the parameters extracts keywords from new financial texts based on the optimized specific task to form structured enterprise information.
Optionally, in S2, when the requirement a changes, the large language model retrieves financial texts from the text preprocessing module again based on the new specific task, and performs keyword extraction on the financial texts to form structured enterprise information.
In S1 the following sub-steps are also included:
S11, when the acquired financial information is in a foreign language or contains letter abbreviations, the foreign-language financial information or the letter abbreviations are translated into Chinese; the Chinese financial information is then compared with the financial information stored in the local database; if the currently acquired financial information duplicates financial information already stored in the local database, it is discarded; otherwise it is sent to the local database for storage;
Only the Chinese financial information is stored in the local database;
the financial information includes financial news, enterprise annual reports, enterprise industrial and commercial information and the like;
S12, the local database sends the newly added financial information into a text preprocessing module, and the text preprocessing module carries out synonym replacement on each financial information, namely, the words in each financial information are replaced by standard words according to a synonym standard word list;
in this embodiment, the synonym standard word list is set by a technician according to the Xinhua Dictionary and/or the industry standards of the target enterprise; replacing the words in each piece of financial information with unified standard words improves the accuracy of the subsequent keyword extraction;
S13, the text preprocessing module applies regular-expression matching to the financial information after synonym replacement: the financial information is matched against regular expression templates to remove noise content, forming financial texts, which are numbered in chronological order and then stored;
noise content includes whitespace characters, underscores, charts, formulas, image addresses, web page tags, and the like.
Removing noise content with regular expressions is prior art and is not described in detail herein.
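The sub-steps S11 to S13 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the synonym table, the noise patterns and the function names are hypothetical, translation is omitted, and de-duplication is reduced to an exact-match check against an in-memory set standing in for the local database.

```python
import re

# Hypothetical synonym standard word list: variant -> standard word.
SYNONYMS = {"NEV": "new energy vehicle", "e-car": "new energy vehicle"}

# Hypothetical regular expression templates for noise content.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),                             # web page tags
    re.compile(r"https?://\S+\.(?:png|jpg|jpeg|gif)"),  # image addresses
    re.compile(r"_+"),                                  # underscores
]


def replace_synonyms(text, table):
    """S12: replace each variant word with its standard word."""
    for variant, standard in table.items():
        text = text.replace(variant, standard)
    return text


def strip_noise(text):
    """S13: remove noise content, then collapse runs of whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(new_items, stored):
    """S11-S13: skip items already in `stored`, clean the rest, number them in order."""
    texts = []
    for item in new_items:
        if item in stored:  # duplicate of stored financial information: discard
            continue
        stored.add(item)
        texts.append(strip_noise(replace_synonyms(item, SYNONYMS)))
    return list(enumerate(texts, start=1))


if __name__ == "__main__":
    database = set()
    raw = ["<p>Z1 makes  NEV__cars</p>", "<p>Z1 makes  NEV__cars</p>"]
    print(preprocess(raw, database))
```

A real deployment would replace the in-memory set with the local database and would need a more tolerant duplicate check than exact string equality.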
S2, the following substeps are also included:
S21, a plurality of financial texts are manually selected and the keywords used to form enterprise information are manually marked in them; the financial texts with manually marked keywords serve as a pre-training data set; after the large language model is pre-trained on this data set, the Prompt module in the large language model constructs a preliminary prompt text template.
In the large language model, the Prompt module generates prompt text T1 from the prompt text template and the financial texts, and the large language model extracts keywords from each financial text under the guidance of prompt text T1 to form structured enterprise information;
In the invention, a financial text can obtain a plurality of pieces of enterprise information, each piece of enterprise information is structured information formed by a plurality of keywords, the keywords forming the structured enterprise information comprise entity names, relationships among the entities, the duration time period of the relationships among the entities and the like, and the entities in the invention comprise enterprises, products, technologies, fields and the like, and the relationships among the entities comprise the fields of the enterprises, products produced by the enterprises, technologies produced by the enterprises, suppliers of the enterprises, the fields potentially involved by the enterprises, potential suppliers of the enterprises and the like;
Examples of a piece of structured enterprise information are "C company-manufacture-car-since 1983", "C company-main business-new energy automobile field" and "D company-supply-tire-C company-since 2020";
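One way to hold such a piece of structured enterprise information in code is sketched below. The field names (`subject`, `relation`, `obj`, `target`, `period`) are assumptions introduced for illustration; the patent only states that a piece of enterprise information consists of several keywords such as entity names, the relationship between entities and the duration period of the relationship.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EnterpriseInfo:
    """One piece of structured enterprise information (hypothetical field names)."""
    subject: str                  # entity, e.g. an enterprise
    relation: str                 # relationship between entities
    obj: str                      # product, technology or field
    target: Optional[str] = None  # second enterprise, if any
    period: Optional[str] = None  # duration period of the relationship, if any

    def keywords(self) -> List[str]:
        """Return the keywords that are present, in order."""
        fields = (self.subject, self.relation, self.obj, self.target, self.period)
        return [value for value in fields if value is not None]

    def __str__(self) -> str:
        return "-".join(self.keywords())


if __name__ == "__main__":
    info = EnterpriseInfo("D company", "supply", "tire", "C company", "since 2020")
    print(info)             # D company-supply-tire-C company-since 2020
    print(info.keywords())
```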
S22, a technician sets a requirement A, the requirement A is sent to a specific task unit in an adjustment module, the specific task unit generates a specific task according to the requirement A and the output of an optimization unit in the adjustment module, parameters of a Prompt text template of a Prompt module in a large language model are adjusted through a P-Tuning fine adjustment method based on the current specific task, meanwhile, the current specific task is introduced into the Prompt module, and the Prompt module generates a Prompt text T2 with specific constraint according to the current specific task by using the adjusted Prompt text template; the large language model retrieves financial texts from the text preprocessing module according to the serial number sequence, and extracts keywords from the currently retrieved financial texts based on the prompt text T2 to form structured enterprise information;
A financial text may contain no enterprise information, or may contain one or more pieces of enterprise information;
when the requirement is generated for the first time, that is, when the specific task unit in the adjustment module generates the specific task for the first time, the optimization unit in the adjustment module has not yet produced any output, so the first specific task is generated from requirement A alone; afterwards, the specific task unit continuously adjusts the content of the specific task on the basis of requirement A and according to the output of the optimization unit; the specific task mainly consists of a number of characters or character strings contained in requirement A.
Requirement A is set manually and describes the structured enterprise information that the technician wants to obtain, for example "the upstream and downstream relationships between enterprises in the artificial intelligence field", "the downstream relationships between brands owned by the commercial group to which brand B belongs" or "the potential market of company C". For example, when requirement A is "the upstream and downstream relationships between enterprises in the artificial intelligence field", the specific task generated for the first time is "enterprises related to artificial intelligence", and the specific task unit subsequently adjusts it to "(artificial intelligence enterprises, or products/technologies produced by artificial intelligence enterprises, or sales relationships among artificial intelligence enterprises) and (the enterprise has been established for more than 1 year)".
In the present invention, the prompt text T2 with specific constraints directs the large language model to focus on those constraints, and keywords are extracted from the financial texts according to the constraints to form structured enterprise information meeting requirement A.
In S22, after generating the Prompt text T2 with specific constraints according to the current specific task by using the adjusted Prompt text template, the Prompt module converts the Prompt text T2 into a text format that can be understood by the large language model, and encodes the text, and then embeds the encoded Prompt text T2 into the financial text currently called from the text preprocessing module in a splicing manner, and the large language model extracts keywords in the current financial text based on the Prompt text T2 to form structured enterprise information.
In S3 the following sub-steps are also included:
S31, the large language model binds each piece of enterprise information to the number of the financial text from which it was obtained and then sends it to the screening module. Any piece of enterprise information is denoted $W(i)$, and the $b$ keywords it contains are written $W(i)=\{W_{i1},\dots,W_{ia},\dots,W_{ib}\}$, where $1\le a\le b$ and $a$, $b$ are positive integers. The screening module retrieves the financial text $x$ of the current enterprise information $W(i)$ from the text preprocessing module according to the financial text number;
Meanwhile, the screening module retrieves the current specific task $k$ from the specific task unit and, based on machine learning, uses a pre-trained language model combined with a conditional random field to segment the current specific task $k$ into words and remove repeated words, obtaining $n$ task words that form the task phrase $R(k)$ of the specific task $k$: $R(k)=\{R_{k1},\dots,R_{kd},\dots,R_{kn}\}$, where $R_{kd}$ denotes the $d$-th task word in the current task phrase $R(k)$, $1\le d\le n$, and $d$, $n$ are positive integers;
Word segmentation with a pre-trained language model combined with a conditional random field is prior art and is not described in detail herein;
S32, the screening module calculates word frequency of each keyword in the enterprise information W (i) in the corresponding financial text x and inverse document frequency of each keyword, and calculates weight of each keyword in the current enterprise information W (i) in the current financial text x based on the word frequency and the inverse document frequency; meanwhile, the screening module calculates word frequency and inverse document frequency of each task word in the current task word group R (k), calculates weight of each task word in the current task word group R (k) based on the word frequency and the inverse document frequency, calculates similarity between the current task word group R (k) and the enterprise information W (i) according to the weight of each keyword in the enterprise information W (i) and the weight of each task word in the current task word group R (k), and outputs enterprise information with the similarity above a set value to technicians and an optimizing unit as qualified enterprise information;
S33, the optimization unit calculates the joint loss function $L_{total}(\theta)$ in real time from the received qualified enterprise information:
$$L_{total}(\theta)=L_{pretrain}(\theta)+\lambda\cdot L_{task}(\theta)$$
where $L_{pretrain}(\theta)$ is the large language model loss function calculated from the qualified enterprise information of the previous round, $L_{task}(\theta)$ is the task-specific large language model loss function calculated from the current qualified enterprise information, $\lambda$ is the specific loss hyperparameter, which is set manually, and $\theta$ denotes the parameter set of the large language model.
The specific task in the specific task unit and the parameters of the large language model are optimized along the direction of gradient descent of the joint loss function $L_{total}(\theta)$; the parameters of the large language model include the parameters of the prompt text template of the Prompt module and the like. The process then returns to S22, and the optimized specific task and the large language model with optimized parameters extract keywords from new financial texts to form enterprise information that better meets requirement A.
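The arithmetic of the joint loss and of one descent step can be sketched as follows. This is a toy illustration rather than the patented optimizer: the losses and gradients are passed in as plain numbers, and the learning rate `lr` is an assumption that does not appear in the description.

```python
def joint_loss(l_pretrain, l_task, lam):
    """L_total = L_pretrain + lambda * L_task."""
    return l_pretrain + lam * l_task


def gradient_step(theta, grad_pretrain, grad_task, lam, lr):
    """One gradient-descent step on L_total.

    The gradient of L_total is the same weighted sum of the two gradients.
    `lr` (learning rate) is a hypothetical parameter, not named in the source.
    """
    return [
        t - lr * (gp + lam * gt)
        for t, gp, gt in zip(theta, grad_pretrain, grad_task)
    ]


if __name__ == "__main__":
    print(joint_loss(0.8, 0.5, lam=0.4))
    print(gradient_step([1.0], [2.0], [4.0], lam=0.5, lr=0.1))
```

A larger $\lambda$ weights the current task-specific loss more heavily relative to the loss carried over from the previous round.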
In S32 the following sub-steps are also included:
S321, based on machine learning, the screening module segments each financial text in the text preprocessing module using a pre-trained language model combined with a conditional random field, obtaining a word set for each financial text. The word set corresponding to financial text $x$ is denoted $P(x)=\{P_{x1},\dots,P_{xc},\dots,P_{xd}\}$, where $1\le c\le d$, $c$ and $d$ are positive integers, and $P_{xc}$ denotes the $c$-th word in the word set $P(x)$.
The screening module keeps each repeated word in the word set $P(x)$ only once to form the corresponding candidate keyword set $F(x)=\{F_{x1},\dots,F_{xc},\dots,F_{xm}\}$, where $F_{xc}$ denotes the $c$-th candidate keyword in $F(x)$, $1\le c\le m\le F$, and $c$, $m$ and $F$ are positive integers.
The word frequency of each keyword of the enterprise information $W(i)$ in the corresponding financial text $x$ is calculated:
$$TF(W_{ia})=E_{ia}/d$$
where $TF(W_{ia})$ denotes the word frequency of keyword $W_{ia}$ of the enterprise information $W(i)$ in financial text $x$, $E_{ia}$ denotes the number of occurrences of keyword $W_{ia}$ in the corresponding word set $P(x)$, and $d$ denotes the total number of words contained in the word set $P(x)$.
The screening module records the total number $g$ of word sets in the current text preprocessing module and, for each keyword of the current enterprise information $W(i)$, the number of those $g$ word sets in which the keyword appears; for keyword $W_{ia}$ this number is denoted $H_{ia}$. The inverse document frequency $IDF(W_{ia})$ of the current keyword $W_{ia}$ is calculated:
$$IDF(W_{ia})=\log_2(g)-\log_2(H_{ia})$$
The $g$ word sets correspond to $g$ financial texts, so $H_{ia}$ is also the number of financial texts, among the $g$ financial texts, in which keyword $W_{ia}$ appears.
The weight of each keyword of the current enterprise information $W(i)$ in the current financial text $x$ is calculated:
$$TF\text{-}IDF(W_{ia})=TF(W_{ia})\times IDF(W_{ia})=(E_{ia}/d)\times[\log_2(g)-\log_2(H_{ia})]$$
where $TF\text{-}IDF(W_{ia})$ denotes the weight of keyword $W_{ia}$ of the current enterprise information $W(i)$ in financial text $x$.
The current enterprise information $W(i)$ is vectorized: $V[W(i)]=\{V_{wi1},\dots,V_{wia},\dots,V_{wib}\}$,
where $V[W(i)]$ is the vectorized representation of the current enterprise information $W(i)$; the number of dimensions of $V[W(i)]$ equals the number of keywords in $W(i)$, and the dimensions correspond one to one to the keywords in $W(i)$; $V_{wia}$ denotes the value of the $a$-th dimension of $V[W(i)]$ and equals the weight $TF\text{-}IDF(W_{ia})$ of keyword $W_{ia}$ in the corresponding financial text $x$;
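The keyword weighting above can be followed step by step in a short Python sketch. It assumes the financial texts have already been segmented into word lists (the segmentation model itself is not shown), and the toy corpus and function names are hypothetical.

```python
from math import log2


def tf(word, words_x):
    """TF = E / d: occurrences of `word` in P(x) divided by the size of P(x)."""
    return words_x.count(word) / len(words_x)


def idf(word, all_word_sets):
    """IDF = log2(g) - log2(H), with g word sets in total and H containing `word`."""
    g = len(all_word_sets)
    h = sum(1 for words in all_word_sets if word in words)
    if h == 0:
        # A keyword extracted from text x is expected to occur in at least one set.
        raise ValueError(f"{word!r} does not occur in any word set")
    return log2(g) - log2(h)


def vectorize_info(keywords, words_x, all_word_sets):
    """V[W(i)]: one TF-IDF weight per keyword of the enterprise information."""
    return [tf(k, words_x) * idf(k, all_word_sets) for k in keywords]


if __name__ == "__main__":
    # Hypothetical segmented financial texts (g = 4).
    corpus = [
        ["Z2", "supply", "tire", "Z1"],
        ["Z3", "supply", "rubber", "Z2"],
        ["Z1", "make", "car", "car"],
        ["Z4", "sign", "deal", "Z1"],
    ]
    # "Z2" occurs in 2 of 4 sets, "tire" in 1 of 4.
    print(vectorize_info(["Z2", "tire"], corpus[0], corpus))  # [0.25, 0.5]
```

In the toy corpus, "tire" receives the larger weight because it occurs in only one of the four word sets.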
Meanwhile, the screening module records the total number $g$ of word sets in the current text preprocessing module, the number of those $g$ word sets in which each task word of the current task phrase $R(k)$ appears, and the number of times each task word of $R(k)$ appears in each word set, and calculates the word frequency of each task word of the current task phrase $R(k)$:
$$TF[R_{kd}|P(y)]=E[R_{kd}|P(y)]/f[P(y)]$$
where $P(y)$ denotes the word set corresponding to financial text $y$ and is one of the $g$ word sets contained in the current text preprocessing module, $TF[R_{kd}|P(y)]$ denotes the word frequency of task word $R_{kd}$ in the word set $P(y)$, $E[R_{kd}|P(y)]$ denotes the number of occurrences of task word $R_{kd}$ in $P(y)$, and $f[P(y)]$ denotes the total number of words contained in $P(y)$.
That is, for the $n$ task words in the current task phrase $R(k)$, a word frequency must be calculated in each of the $g$ word sets contained in the current text preprocessing module, giving $n\times g$ word frequencies in total;
The inverse document frequency of each task word of the current task phrase $R(k)$ is calculated:
$$IDF(R_{kd})=\log_2(g)-\log_2[H(R_{kd})]$$
where $IDF(R_{kd})$ denotes the inverse document frequency of task word $R_{kd}$ over the $g$ word sets contained in the current text preprocessing module, and $H(R_{kd})$ denotes the number of those $g$ word sets that contain task word $R_{kd}$.
The weight of each task word of the current task phrase $R(k)$ in each of the $g$ word sets contained in the current text preprocessing module is calculated:
$$TF\text{-}IDF[R_{kd}|P(y)]=TF[R_{kd}|P(y)]\times IDF(R_{kd})$$
where $TF\text{-}IDF[R_{kd}|P(y)]$ denotes the weight of task word $R_{kd}$ of the current task phrase $R(k)$ in the word set $P(y)$ contained in the current text preprocessing module.
That is, a total of $n\times g$ weights need to be calculated;
The current task phrase $R(k)$ is vectorized:
$$V[R(k)|P(y)]=\{V[R_{k1}|P(y)],\dots,V[R_{kd}|P(y)],\dots,V[R_{kn}|P(y)]\}$$
where $V[R(k)|P(y)]$ is the vectorized representation of the current task phrase $R(k)$ based on the word set $P(y)$; the number of dimensions of $V[R(k)|P(y)]$ equals the number of task words in $R(k)$, namely $n$; the dimensions correspond one to one to the task words in $R(k)$; $V[R_{kd}|P(y)]$ denotes the value of the $d$-th dimension of $V[R(k)|P(y)]$ and equals the weight $TF\text{-}IDF[R_{kd}|P(y)]$ of task word $R_{kd}$ in the word set $P(y)$ contained in the current text preprocessing module;
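The task-phrase side differs from the keyword side in that every task word is weighted against every word set, which yields $g$ vectors of $n$ dimensions each. A minimal sketch follows, again on pre-segmented word lists with hypothetical names. Assigning an IDF of 0 to a task word that occurs in no word set is an assumption made here, because $\log_2(0)$ is undefined and the description does not cover that case.

```python
from math import log2


def task_vectors(task_words, all_word_sets):
    """Return g vectors V[R(k)|P(y)], each with one TF-IDF weight per task word."""
    g = len(all_word_sets)
    idf = {}
    for r in task_words:
        h = sum(1 for words in all_word_sets if r in words)
        # Assumption: a task word absent from every word set gets IDF 0.
        idf[r] = log2(g) - log2(h) if h else 0.0
    return [
        [words.count(r) / len(words) * idf[r] for r in task_words]
        for words in all_word_sets
    ]


if __name__ == "__main__":
    corpus = [
        ["Z2", "supply", "tire", "Z1"],
        ["Z3", "supply", "rubber", "Z2"],
        ["Z1", "make", "car", "car"],
        ["Z4", "sign", "deal", "Z1"],
    ]
    for vector in task_vectors(["supply", "car"], corpus):
        print(vector)
```

With $n=2$ task words and $g=4$ word sets, the sketch computes the $n\times g=8$ weights described above.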
S322, the screening module calculates the similarity $Simi[R(k),W(i)]$ between the current task phrase $R(k)$ and the current enterprise information $W(i)$ from the vectors $V[R(k)|P(y)]$ and $V[W(i)]$, where $G$ denotes all word sets contained in the current text preprocessing module; when $V[R(k)|P(y)]$ and $V[W(i)]$ have different numbers of dimensions, the vector with fewer dimensions is padded with 0 in the missing dimensions.
If $Simi[R(k),W(i)]$ is above the set value, the current enterprise information $W(i)$ is qualified enterprise information, and the screening module outputs the qualified enterprise information to the technician and the optimization unit;
In this embodiment, the set value of $Simi[R(k),W(i)]$ above which the current enterprise information $W(i)$ is determined to be qualified enterprise information is 0.92.
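The zero-padding rule and the threshold test can be illustrated with a small sketch. The similarity formula itself is not reproduced in this text, so the cosine measure below is an assumption chosen for illustration, as is the reading of "above the set value" as greater than or equal. How the per-word-set results are combined over $G$ is also left open here.

```python
from math import sqrt


def pad(u, v):
    """Pad the shorter vector with 0 so both have the same number of dimensions."""
    n = max(len(u), len(v))
    return u + [0.0] * (n - len(u)), v + [0.0] * (n - len(v))


def cosine(u, v):
    """Cosine similarity after zero-padding (assumed measure, not from the source)."""
    u, v = pad(list(u), list(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_qualified(similarity, set_value=0.92):
    """Threshold test; 0.92 is the set value used in this embodiment."""
    return similarity >= set_value


if __name__ == "__main__":
    score = cosine([0.25, 0.5], [0.25, 0.5, 0.0])
    print(score, is_qualified(score))
```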
For ease of understanding, the following description is provided in connection with specific examples:
The technician sets the requirement as "the upstream and downstream relationships between enterprises in the new energy automobile field".
The first piece of financial information is "Z1 company has worked intensively in the new energy automobile field for more than twenty years and has now become a leading new energy automobile manufacturer".
The second piece of financial information is "Z2 company signed a tire supply agreement with Z1 company in 2022, under which it will supply tires for the new energy automobiles of Z1 company".
The third piece of financial information is "Z3 company has long supplied Z2 company with natural rubber for producing high-grade tires".
The fourth piece of financial information is "Z4 company signed a long-term ternary lithium battery supply agreement with Z1 company in February 2023".
The fifth piece of financial information is "Z5 company brought its latest lithium battery to the conference of Z1 company; the latest lithium battery of Z5 company charges faster and has a longer range than the lithium batteries in common use worldwide".
With the enterprise information extraction method of the present invention, the structured enterprise information extracted from the five pieces of financial information is, respectively, "Z1 company-manufacture-new energy automobile", "Z2 company-supply-tire-Z1 company-since 2022", "Z3 company-supply-natural rubber-Z2 company", "Z4 company-supply-ternary lithium battery-Z1 company-since February 2023" and "Z5 company-potential supply-ternary lithium battery-Z1 company".
According to the enterprise information extraction method of the invention, an adjustment module and a screening module are arranged before and after the large language model respectively, so that enterprise information is extracted from financial information efficiently and accurately. The adjustment module not only receives the technician's requirement directly but also calculates a joint loss function from the output of the screening module, and adjusts the content of the specific task and the parameters of the large language model based on the joint loss function and the content of the requirement. The specific task helps the large language model understand the requirement better, so that keywords that better match the requirement are extracted from the financial texts to form enterprise information. The screening module calculates, based on word frequency and inverse document frequency, the similarity between each piece of enterprise information generated by the large language model and the current specific task, and uses the similarity to screen out the qualified enterprise information that satisfies the specific task. Because the similarity set value for qualified enterprise information is defined by the technician according to the requirement, the screening is highly flexible. When the technician needs to construct a brief and accurate enterprise map in a short time, and therefore needs only a small amount of enterprise information that closely meets the requirement, the similarity set value can be raised. When the technician wants an in-depth understanding of an industry or field for potential market development, and therefore needs to construct a detailed enterprise map from a large amount of enterprise information, the similarity set value can be lowered so that more enterprise information is output to the technician. In other words, the screening module allows the invention to meet the technician's requirements more flexibly.
The invention extracts the keywords to form enterprise information, and sends the qualified enterprise information output to the technician as a verification set to an optimization unit in an adjustment module to calculate a joint loss function, namely, the invention optimizes the content of a specific task and the parameters of a large language model while generating the qualified enterprise information, so that the subsequently generated enterprise information is generated under the optimized specific task and the large language model, and the invention meets the current requirement better. According to the method, the verification set is not required to be additionally arranged, the large language model parameters and the specific tasks are continuously optimized in the process of extracting the keywords to generate the enterprise information, and the enterprise information which meets the requirements can be output in a short time, so that the method is accurate and efficient. The enterprise information extraction process can be self-adjusted to adapt to extraction of different types of enterprise information in different fields.
The joint loss function is the basis for adjusting the specific task and the parameters of the large language model. In the optimization unit, the loss function $L_{task}(\theta)$ is calculated with the qualified enterprise information as a verification set, multiplied by the specific loss hyperparameter $\lambda$, and used as one part of the joint loss function. The joint loss function also includes the large language model loss function $L_{pretrain}(\theta)$ calculated from the qualified enterprise information of the previous round. In other words, the joint loss function is not determined solely by the task-specific loss function $L_{task}(\theta)$ calculated from the current qualified enterprise information, so over-fitting and under-fitting are avoided as far as possible, which prevents keywords from being missed and prevents the accuracy of the enterprise information formed from the extracted keywords from being reduced.
Compared with the prior art that the enterprise information is extracted from the financial information according to the requirements by manpower, the method and the device automatically extract the enterprise information meeting the requirements from the financial information, save a great deal of labor cost, have higher extraction speed than the manpower, avoid the condition that the quality of an extraction result is influenced by the subjectivity of individuals in the process of extracting the enterprise information by the manpower, and have high accuracy and stability. The invention can carry out exhaustive enterprise information extraction on financial information of each large platform, does not miss any piece of enterprise information based on requirements, avoids the possible missing situation of manual work when carrying out enterprise information extraction, and has extremely high timeliness because the invention can carry out enterprise information extraction meeting the requirements on the latest financial information in time, thereby providing powerful guarantee for subsequent market layout and potential market excavation of enterprises.
The requirements of technicians can be changed at any time, the specific tasks in the invention can be changed according to the change of the requirements, and the screening module can output qualified enterprise information meeting the information requirements in extremely short time, namely, the invention can adjust the requirements at any time and has high flexibility.
Enterprise information was extracted for the same requirement using both the enterprise information extraction method of the invention and manual extraction, with high-quality, repeatedly verified manual checking results taken as the standard, giving the time-consumption comparison graph shown in FIG. 4 and the accuracy comparison graph shown in FIG. 5. As can be seen from FIG. 4, when extracting enterprise information from the same 5000 pieces of financial information, the method of the invention takes markedly less time than manual extraction, and the time taken by manual extraction grows as the quantity of financial information increases. As can be seen from FIG. 5, for the same 5000 pieces of financial information, the accuracy of the method of the invention rises as the quantity of financial information increases and gradually exceeds that of manual extraction; manual extraction is more accurate than the method of the invention only when the quantity of financial information is small, and its accuracy falls sharply as the quantity increases. That is, the enterprise information extraction method of the invention can extract highly accurate enterprise information from a large quantity of financial information.
Example 2
An enterprise atlas construction method, as shown in FIG. 2, is applied after the enterprise information extraction method described in Embodiment 1 and further includes S4:
S4, merging identical entities in the output qualified enterprise information to form an enterprise atlas, and continuously updating the enterprise atlas with newly generated qualified enterprise information.
S4 further includes the following:
Merging identical entities across all qualified enterprise information to form the enterprise atlas;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise atlas, but the relationship between them differs from the relationship between the corresponding entities in the current enterprise atlas, the relationship between the entities in the current qualified enterprise information and/or the duration period of that relationship is added to the corresponding entities in the current enterprise atlas;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise atlas and the relationship between them is the same as the relationship between the corresponding entities in the current enterprise atlas, it is judged whether the duration period of the relationship in the current qualified enterprise information is the same as that in the current enterprise atlas. If the termination time of the duration period in the current enterprise atlas is not later than the termination time of the duration period in the current qualified enterprise information, the duration period in the current enterprise atlas is replaced by the duration period in the current qualified enterprise information so as to update the current enterprise atlas; if the termination time of the duration period in the current enterprise atlas is later than that in the current qualified enterprise information, or the current qualified enterprise information contains no duration period for the relationship, the current qualified enterprise information is discarded.
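The update rules of S4 can be illustrated with a minimal Python sketch. The names `Fact` and `EnterpriseGraph` and their fields are hypothetical and do not appear in the patent. The sketch assumes that identical entities are recognized by exact name match, and that a stored relationship with no termination time may be replaced by one that has a termination time; the text does not specify either point.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One piece of qualified enterprise information (hypothetical layout)."""
    head: str                      # first entity, e.g. an enterprise
    relation: str                  # relationship between the two entities
    tail: str                      # second entity, e.g. a product or supplier
    start: Optional[date] = None   # start of the relationship's duration period
    end: Optional[date] = None     # termination time of the duration period


Key = Tuple[str, str, str]
Period = Tuple[Optional[date], Optional[date]]


class EnterpriseGraph:
    def __init__(self) -> None:
        # Keying edges by entity names merges identical entities implicitly.
        self.edges: Dict[Key, Period] = {}

    def update(self, fact: Fact) -> str:
        key = (fact.head, fact.relation, fact.tail)
        if key not in self.edges:
            # New entities, or a new relationship between known entities: add it.
            self.edges[key] = (fact.start, fact.end)
            return "added"
        _, known_end = self.edges[key]
        if fact.end is None:
            # Same relationship but no duration period in the new information.
            return "discarded"
        if known_end is not None and known_end > fact.end:
            # The atlas already holds a later termination time.
            return "discarded"
        # Atlas termination time is not later (or missing, by assumption): replace.
        self.edges[key] = (fact.start, fact.end)
        return "replaced"
```

Calling `update` once per piece of qualified enterprise information keeps the atlas current; the returned string only reports which rule fired.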
Compared with manual construction, the enterprise atlas constructed by the invention is accurate, efficient, low in cost, high in quality, and closely matched to requirements. The constructed atlas is well structured, easy to view, and updated in real time, so it is highly timely, and it provides a solid foundation for an enterprise's market development or for industrial layout analysis of bidding companies.
Example 3
The invention also provides an enterprise atlas construction system, as shown in fig. 3, comprising:
A grabbing module, a local database, a text preprocessing module, an adjusting module, a large language model, a screening module and a map construction module,
The grabbing module is used for grabbing financial information from major platforms, discarding financial information that duplicates what is already stored in the local database, and then sending the remaining financial information to the local database for storage;
The local database is used for storing the financial information sent by the grabbing module and sending the newly added financial information into the text preprocessing module;
The text preprocessing module processes financial information into financial text;
The adjustment module comprises a specific task unit and an optimization unit; the requirement enters the specific task unit, the specific task unit forms or modifies the specific task based on the requirement and the joint loss function calculated by the optimization unit, and the specific task unit sends the specific task to the prompt module in the large language model,
The optimization unit calculates a joint loss function using the qualified enterprise information output by the screening module, and optimizes the specific task and the parameters in the large language model based on the joint loss function;
The prompt module in the large language model, based on the specific task from the specific task unit, extracts keywords from the financial texts retrieved from the text preprocessing module to form structured enterprise information, and sends the structured enterprise information to the screening module;
The screening module screens out qualified enterprise information and outputs it to the technician, and at the same time sends the qualified enterprise information to the optimization unit and the map construction module;
The map construction module constructs an enterprise atlas based on the qualified enterprise information;
The modules and units are programmed or configured to perform the steps of the enterprise information extraction method described in Embodiment 1 or the steps of the enterprise atlas construction method described in Embodiment 2.
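The data flow between the modules can be sketched as follows. This is only an illustration under stated assumptions: `run_pipeline` and its parameters are hypothetical names, the large language model and the screening module are passed in as plain callables supplied by the caller, and the optimization unit and map construction module are omitted.

```python
from typing import Callable, Iterable, List, Set


def run_pipeline(
    raw_items: Iterable[str],
    preprocess: Callable[[str], str],
    extract: Callable[[str], List[str]],
    is_qualified: Callable[[List[str]], bool],
) -> List[List[str]]:
    """Pass grabbed financial information through the modules in order."""
    seen: Set[str] = set()          # stands in for the local database
    qualified: List[List[str]] = []
    for item in raw_items:          # output of the grabbing module
        if item in seen:            # drop information already stored
            continue
        seen.add(item)
        text = preprocess(item)     # text preprocessing module
        info = extract(text)        # large language model, stubbed by the caller
        if is_qualified(info):      # screening module
            # In the described system this would also be sent to the
            # optimization unit and the map construction module.
            qualified.append(info)
    return qualified
```

For instance, `run_pipeline(items, str.strip, str.split, lambda kw: len(kw) == 3)` treats whitespace splitting as a stand-in extractor and keeps only three-keyword results.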
The present invention also provides a computer-readable storage medium storing a computer program programmed or configured to perform an enterprise information extraction method as described in embodiment 1 or an enterprise atlas construction method as described in embodiment 2.
Techniques, shapes, and structural parts of the present invention that are not described in detail are known in the art.

Claims (8)

1. An enterprise information extraction method is characterized by comprising the following steps:
S1, de-duplicating financial information captured from major platforms and storing it in a local database, the local database sending newly added financial information to a text preprocessing module for preprocessing to obtain financial texts;
S2, sending a requirement A to a specific task unit in an adjustment module, the specific task unit generating a specific task based on the requirement A and an optimization unit in the adjustment module, and a large language model sequentially retrieving financial texts from the text preprocessing module based on the specific task and extracting keywords from the financial texts to form structured enterprise information;
one piece of enterprise information consists of a plurality of keywords;
S3, each piece of enterprise information enters a screening module; after the screening module calculates the similarity between each piece of enterprise information and the current specific task, enterprise information whose similarity is at or above a set value is output to the technician as qualified enterprise information; meanwhile, the screening module sends the qualified enterprise information to the optimization unit in the adjustment module to calculate a joint loss function, and the method returns to S2, where the optimization unit optimizes the specific task and the parameters in the large language model based on the joint loss function, and the large language model with optimized parameters extracts keywords from new financial texts based on the optimized specific task to form structured enterprise information;
S2 further includes the following sub-steps:
S21, manually selecting a plurality of financial texts and labeling the keywords in them that are used to form enterprise information, taking the financial texts with manually labeled keywords as a pre-training data set, and, after the large language model is pre-trained on the pre-training data set, a Prompt module in the large language model constructing a preliminary prompt text template,
In the large language model, the Prompt module generates a prompt text T1 from the prompt text template and the financial texts, and the large language model extracts keywords from each financial text as prompted by the prompt text T1 to form structured enterprise information;
S22, a technician sets a requirement A, and the requirement A is sent to the specific task unit in the adjustment module; the specific task unit generates a specific task from the requirement A and the output of the optimization unit in the adjustment module; based on the current specific task, the parameters of the prompt text template of the Prompt module in the large language model are adjusted by the P-Tuning fine-tuning method, and the current specific task is introduced into the Prompt module; the Prompt module uses the adjusted prompt text template to generate a prompt text T2 with specific constraints according to the current specific task; the large language model retrieves financial texts from the text preprocessing module in order of their serial numbers and extracts keywords from the currently retrieved financial text based on the prompt text T2 to form structured enterprise information;
S3 further includes the following sub-steps:
S31, the large language model binds the enterprise information to the number of the financial text from which the current enterprise information was obtained and then sends it to the screening module; any piece of enterprise information is denoted W(i), and the b keywords contained in W(i) are denoted $W(i) = \{W_{i1}, \ldots, W_{ia}, \ldots, W_{ib}\}$, where $1 \le a \le b$ and a and b are positive integers; the screening module retrieves the financial text x of the current enterprise information W(i) from the text preprocessing module according to the financial text number;
Meanwhile, the screening module retrieves the current specific task k from the specific task unit and, based on machine learning, uses a pre-trained language model combined with a conditional random field to segment the current specific task k into words and remove repeated words, obtaining n task words that form the task phrase R(k) of the specific task k: $R(k) = \{R_{k1}, \ldots, R_{kd}, \ldots, R_{kn}\}$, where $R_{kd}$ represents the d-th task word in the current task phrase R(k), $1 \le d \le n$, and d and n are both positive integers;
S32, the screening module calculates the word frequency of each keyword of the enterprise information W(i) in the corresponding financial text x and the inverse document frequency of each keyword, and from these calculates the weight of each keyword of the current enterprise information W(i) in the current financial text x; meanwhile, the screening module calculates the word frequency and inverse document frequency of each task word in the current task phrase R(k) and from these calculates the weight of each task word in the current task phrase R(k); it then calculates the similarity between the current task phrase R(k) and the enterprise information W(i) from the weights of the keywords in W(i) and the weights of the task words in R(k), and outputs enterprise information whose similarity is at or above a set value to the technician and the optimization unit as qualified enterprise information;
S33, the optimization unit calculates a joint loss function $L_{total}(\theta)$ in real time from the received qualified enterprise information:
$$L_{total}(\theta) = L_{pretrain}(\theta) + \lambda \cdot L_{task}(\theta)$$
where $L_{pretrain}(\theta)$ is the large language model loss function calculated using the previous qualified enterprise information, $L_{task}(\theta)$ is the large language model loss function calculated using the current qualified enterprise information and based on the specific task, $\lambda$ is a manually set task-specific loss hyperparameter, and $\theta$ denotes the parameter set of the large language model,
The specific task in the specific task unit and the parameters in the large language model are optimized in the direction of gradient descent of the joint loss function $L_{total}(\theta)$, the parameters in the large language model including, among others, the parameters of the prompt text template of the Prompt module; that is, the method returns to S22, and keywords are extracted from new financial texts using the optimized specific task and the large language model with optimized parameters to form enterprise information that better meets the requirement A.
2. The method for extracting enterprise information as claimed in claim 1, wherein S1 further comprises the sub-steps of:
S11, when the acquired financial information is in a foreign language or contains letter abbreviations, translating the foreign-language financial information or the letter abbreviations into Chinese and comparing the Chinese financial information with the financial information stored in the local database; if the currently acquired financial information duplicates financial information stored in the local database, discarding it, and otherwise sending it to the local database for storage;
The financial information comprises financial news, enterprise annual reports and enterprise business information;
S12, the local database sends the newly added financial information to the text preprocessing module, and the text preprocessing module performs synonym replacement on each piece of financial information, that is, replaces the words in each piece of financial information with standard words according to a synonym standard word list;
S13, the text preprocessing module applies regular-expression processing to the synonym-replaced financial information, which removes noise content by matching the financial information against regular expression templates, to form financial texts; the financial texts are numbered in chronological order and then stored;
The noise content includes blank characters, underscores, charts, formulas, image addresses and web page tags.
3. The method for extracting enterprise information as claimed in claim 2, further comprising the sub-steps of:
S321, based on machine learning, the screening module uses a pre-trained language model combined with a conditional random field to segment each financial text in the text preprocessing module into words, obtaining a vocabulary set for each financial text; the vocabulary set corresponding to financial text x is denoted $P(x) = \{P_{x1}, \ldots, P_{xc}, \ldots, P_{xd}\}$, where $1 \le c \le d$, c and d are positive integers, and $P_{xc}$ represents the c-th word in the vocabulary set P(x),
The screening module keeps each repeated word in the vocabulary set P(x) only once, forming the corresponding word segmentation set $F(x) = \{F_{x1}, \ldots, F_{xc}, \ldots, F_{xm}\}$, where $F_{xc}$ represents the c-th candidate keyword in the word segmentation set F(x), $1 \le c \le m \le f$, and c, m and f are positive integers,
The word frequency of each keyword of the enterprise information W(i) in the corresponding financial text x is calculated:
$$TF(W_{ia}) = \frac{E_{ia}}{d},$$
where $TF(W_{ia})$ represents the word frequency of keyword $W_{ia}$ of the enterprise information W(i) in the financial text x, $E_{ia}$ represents the number of occurrences of keyword $W_{ia}$ in the corresponding vocabulary set P(x), and d represents the total number of words contained in the vocabulary set P(x),
The screening module records the total number g of vocabulary sets in the current text preprocessing module and, for each keyword in the current enterprise information W(i), the number of the g vocabulary sets in which that keyword appears; the number for keyword $W_{ia}$ is denoted $H_{ia}$, and the inverse document frequency $IDF(W_{ia})$ of the current keyword $W_{ia}$ is calculated:
$$IDF(W_{ia}) = \log_2(g) - \log_2(H_{ia}),$$
The weight of each keyword of the current enterprise information W(i) in the current financial text x is calculated:
$$\text{TF-IDF}(W_{ia}) = TF(W_{ia}) \times IDF(W_{ia}) = \frac{E_{ia}}{d} \times \left[\log_2(g) - \log_2(H_{ia})\right],$$
where $\text{TF-IDF}(W_{ia})$ represents the weight of keyword $W_{ia}$ of the current enterprise information W(i) in the financial text x,
The current enterprise information W(i) is vectorized: $V[W(i)] = \{V_{w_{i1}}, \ldots, V_{w_{ia}}, \ldots, V_{w_{ib}}\}$,
where $V[W(i)]$ is the vectorized representation of the current enterprise information W(i), the number of dimensions of $V[W(i)]$ equals the number of keywords in the current enterprise information W(i), the dimensions of $V[W(i)]$ correspond one-to-one to the keywords in W(i), $V_{w_{ia}}$ represents the value of the a-th dimension of $V[W(i)]$, and the value of $V_{w_{ia}}$ is the weight $\text{TF-IDF}(W_{ia})$ of keyword $W_{ia}$ in the corresponding financial text x;
Meanwhile, the screening module records the total number g of vocabulary sets in the current text preprocessing module, the number of the g vocabulary sets in which each task word of the current task phrase R(k) appears, and the number of times each task word of R(k) appears in each vocabulary set in the current text preprocessing module, and calculates the word frequency of each task word in the current task phrase R(k):
$$TF[R_{kd} \mid P(y)] = \frac{E[R_{kd} \mid P(y)]}{f[P(y)]},$$
where P(y) represents the vocabulary set corresponding to the financial text y and is one of the g vocabulary sets contained in the current text preprocessing module, $TF[R_{kd} \mid P(y)]$ represents the word frequency of task word $R_{kd}$ in the vocabulary set P(y), $E[R_{kd} \mid P(y)]$ represents the number of occurrences of task word $R_{kd}$ in the vocabulary set P(y), and $f[P(y)]$ represents the total number of words contained in the vocabulary set P(y),
The inverse document frequency of each task word in the current task phrase R(k) is calculated:
$$IDF(R_{kd}) = \log_2(g) - \log_2[H(R_{kd})],$$
where $IDF(R_{kd})$ represents the inverse document frequency of task word $R_{kd}$ over the g vocabulary sets contained in the current text preprocessing module, and $H(R_{kd})$ represents the number of those g vocabulary sets that contain task word $R_{kd}$,
The weight of each task word of the current task phrase R(k) in each of the g vocabulary sets contained in the current text preprocessing module is calculated:
$$\text{TF-IDF}[R_{kd} \mid P(y)] = TF[R_{kd} \mid P(y)] \times IDF(R_{kd}),$$
where $\text{TF-IDF}[R_{kd} \mid P(y)]$ represents the weight of task word $R_{kd}$ of the current task phrase R(k) in the vocabulary set P(y) contained in the current text preprocessing module,
The current task phrase R(k) is vectorized:
$$V[R(k) \mid P(y)] = \{V[R_{k1} \mid P(y)], \ldots, V[R_{kd} \mid P(y)], \ldots, V[R_{kn} \mid P(y)]\},$$
where $V[R(k) \mid P(y)]$ is the vectorized representation of the current task phrase R(k) based on the vocabulary set P(y); the number of dimensions of $V[R(k) \mid P(y)]$ equals the number of task words in the current task phrase R(k), so $V[R(k) \mid P(y)]$ is n-dimensional; the dimensions of $V[R(k) \mid P(y)]$ correspond one-to-one to the task words in R(k); $V[R_{kd} \mid P(y)]$ represents the value of the d-th dimension of $V[R(k) \mid P(y)]$, and its value is the weight $\text{TF-IDF}[R_{kd} \mid P(y)]$ of task word $R_{kd}$ in the vocabulary set P(y) contained in the current text preprocessing module;
S322, the screening module calculates the similarity between the current task phrase R(k) and the current enterprise information W(i):
where $Simi[R(k), W(i)]$ represents the similarity between the current task phrase R(k) and the current enterprise information W(i), and G represents all vocabulary sets contained in the current text preprocessing module,
If $Simi[R(k), W(i)]$ is at or above the set value, the current enterprise information W(i) is qualified enterprise information, and the screening module outputs the qualified enterprise information to the technician and the optimization unit.
4. An enterprise information extraction method according to any one of claims 1-3, characterized in that: each piece of enterprise information is composed of a plurality of keywords; the keywords comprise entity names, relationships between entities, and the duration periods of the relationships between entities; the entities comprise enterprises, products, technologies and fields; and the relationships between entities comprise the field an enterprise belongs to, products produced by an enterprise, technologies produced by an enterprise, suppliers of an enterprise, fields an enterprise is potentially involved in, and potential suppliers of an enterprise.
5. An enterprise atlas construction method, characterized in that it is applied after the enterprise information extraction method according to claim 4 and further comprises S4:
S4, merging identical entities in the output qualified enterprise information to form an enterprise atlas, and continuously updating the enterprise atlas with newly generated qualified enterprise information.
6. The enterprise atlas construction method according to claim 5, wherein S4 specifically further comprises the following steps:
Merging identical entities across all qualified enterprise information to form the enterprise atlas;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise atlas, but the relationship between them differs from the relationship between the corresponding entities in the current enterprise atlas, adding the relationship between the entities in the current qualified enterprise information and/or the duration period of that relationship to the corresponding entities in the current enterprise atlas;
If the entities contained in newly generated qualified enterprise information already exist in the current enterprise atlas and the relationship between them is the same as the relationship between the corresponding entities in the current enterprise atlas, judging whether the duration period of the relationship in the current qualified enterprise information is the same as that in the current enterprise atlas; if the termination time of the duration period in the current enterprise atlas is not later than the termination time of the duration period in the current qualified enterprise information, replacing the duration period in the current enterprise atlas with the duration period in the current qualified enterprise information so as to update the current enterprise atlas; and if the termination time of the duration period in the current enterprise atlas is later than that in the current qualified enterprise information, or the current qualified enterprise information contains no duration period for the relationship, discarding the current qualified enterprise information.
7. An enterprise atlas construction system, comprising:
A grabbing module, a local database, a text preprocessing module, an adjusting module, a large language model, a screening module and a map construction module,
The grabbing module is used for grabbing financial information from major platforms, discarding financial information that duplicates what is already stored in the local database, and then sending the remaining financial information to the local database for storage;
The local database is used for storing the financial information sent by the grabbing module and sending the newly added financial information into the text preprocessing module;
The text preprocessing module processes financial information into financial text;
The adjustment module comprises a specific task unit and an optimization unit; the requirement enters the specific task unit, the specific task unit forms or modifies the specific task based on the requirement and the joint loss function calculated by the optimization unit, and the specific task unit sends the specific task to the prompt module in the large language model,
The optimization unit calculates a joint loss function using the qualified enterprise information output by the screening module, and optimizes the specific task and the parameters in the large language model based on the joint loss function;
The prompt module in the large language model, based on the specific task from the specific task unit, extracts keywords from the financial texts retrieved from the text preprocessing module to form structured enterprise information, and sends the structured enterprise information to the screening module;
The screening module screens out qualified enterprise information and outputs it to the technician, and at the same time sends the qualified enterprise information to the optimization unit and the map construction module;
The map construction module constructs an enterprise atlas based on the qualified enterprise information;
The modules and units are programmed or configured to perform the steps of the enterprise atlas construction method according to claim 6.
8. A computer-readable storage medium, characterized in that: a computer program programmed or configured to perform the enterprise atlas construction method according to claim 6 is stored on the computer-readable storage medium.
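The keyword weighting defined in claim 3, TF = E/d and IDF = log2(g) - log2(H), can be sketched in Python as below. The function names are hypothetical, and word segmentation is assumed to have been done already, so each vocabulary set P(x) is passed as a plain list of words. The similarity Simi[R(k), W(i)] is not implemented because its formula is not reproduced in this text.

```python
import math
from typing import List


def tf(word: str, tokens: List[str]) -> float:
    """TF = E / d: occurrences of the word in P(x) over the total words in P(x)."""
    if not tokens:
        raise ValueError("vocabulary set is empty")
    return tokens.count(word) / len(tokens)


def idf(word: str, corpus: List[List[str]]) -> float:
    """IDF = log2(g) - log2(H): g vocabulary sets, H of them contain the word."""
    g = len(corpus)
    h = sum(1 for tokens in corpus if word in tokens)
    if h == 0:
        raise ValueError(f"{word!r} does not occur in any vocabulary set")
    return math.log2(g) - math.log2(h)


def keyword_vector(
    keywords: List[str], tokens: List[str], corpus: List[List[str]]
) -> List[float]:
    """V[W(i)]: one TF-IDF weight per keyword, in keyword order."""
    return [tf(w, tokens) * idf(w, corpus) for w in keywords]
```

With four vocabulary sets, a keyword that appears twice in a four-word text and in no other set gets weight (2/4) × (log2 4 - log2 1) = 1.0, while a keyword present in every set gets weight 0.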
CN202311552385.8A 2023-11-17 2023-11-17 Enterprise information extraction method, map construction method and system and storage medium Active CN117473036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311552385.8A CN117473036B (en) 2023-11-17 2023-11-17 Enterprise information extraction method, map construction method and system and storage medium

Publications (2)

Publication Number Publication Date
CN117473036A CN117473036A (en) 2024-01-30
CN117473036B true CN117473036B (en) 2024-08-23

Family

ID=89634669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311552385.8A Active CN117473036B (en) 2023-11-17 2023-11-17 Enterprise information extraction method, map construction method and system and storage medium

Country Status (1)

Country Link
CN (1) CN117473036B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 A kind of method of business entity's Relation extraction based on convolutional neural networks
CN115203570A (en) * 2022-07-25 2022-10-18 广东省华南技术转移中心有限公司 Prediction model training method, expert recommendation matching method, device and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126676B2 (en) * 2018-12-10 2021-09-21 Sap Se Influence rank generation system for enterprise community using social graph
US20220019905A1 (en) * 2020-07-20 2022-01-20 Microsoft Technology Licensing, Llc Enterprise knowledge graph building with mined topics and relationships
CN115618017A (en) * 2022-10-26 2023-01-17 同济大学 Enterprise upstream and downstream relation prediction method oriented to industry knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant