CN112528028A - Investment and financing information mining method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN112528028A (CN202011584208.4A)
- Authority
- CN
- China
- Prior art keywords
- financing
- information
- text
- target
- investment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/06—Asset management; Financial planning or analysis
Abstract
The invention provides an investment and financing information mining method and device, an electronic device, and a storage medium. The method comprises: determining candidate information texts, and selecting, as target information texts, the candidate information texts whose domain type matches the target-domain investment and financing type; extracting the financing information segments in the target information text, performing entity recognition on each financing information segment to obtain the financing entities it contains, and performing financing round analysis on each financing information segment to obtain its financing round; and determining the investment and financing information of the target information text based on the financing entities and the financing round of each financing information segment in the target information text. The method, device, electronic device, and storage medium provided by the invention improve the accuracy and reliability of investment and financing information acquisition. Because the mining is executed by machine, they also avoid the operating errors and subjective bias that manual information mining may introduce, and ensure that information mining is timely and objective.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an investment and financing information mining method and device, an electronic device, and a storage medium.
Background
With the rapid development of the internet, news and information of all kinds emerge endlessly, their content covers every aspect of life, and their volume is growing explosively.
Timely and diverse investment and financing news gives users rich information, but it also carries a large amount of irrelevant or redundant content. Users therefore have to spend considerable time and effort screening the information and extracting the important parts, a process that is tedious and inefficient.
Disclosure of Invention
The invention provides an investment and financing information mining method and device, an electronic device, and a storage medium, to solve the problem that existing investment and financing information mining must be carried out manually by the user, which is tedious and inefficient.
The invention provides a method for mining investment and financing information, which comprises the following steps:
determining candidate information texts, and selecting, as target information texts, the candidate information texts whose domain type matches the target-domain investment and financing type;
extracting the financing information segments in the target information text, performing entity recognition on each financing information segment to obtain the financing entities it contains, and performing financing round analysis on each financing information segment to obtain its financing round;
and determining the investment and financing information of the target information text based on the financing entities and the financing round of each financing information segment in the target information text.
According to the method for mining investment and financing information provided by the invention, selecting, as target information texts, the candidate information texts whose domain type matches the target-domain investment and financing type comprises:
inputting the title text and each segment of the candidate information text into a domain classification model to obtain the domain classification result output by the domain classification model, wherein the domain classification model is trained on sample information texts and their sample domain classification results;
and determining each candidate information text whose domain classification result is the target-domain investment and financing type as a target information text.
According to the method for mining investment and financing information provided by the invention, extracting the financing information segments in the target information text comprises:
concatenating the publication time of the target information text with each segment of the target information text, and inputting each concatenation into a financing segment classification model to obtain the financing classification result of each segment output by the model, wherein the financing segment classification model is trained on the publication time of sample target information texts, the sample segments in those texts, and the financing information labels of those segments;
and taking each segment whose financing classification result indicates that it contains financing information as a financing information segment.
According to the method for mining investment and financing information provided by the invention, performing entity recognition on the financing information segment to obtain the financing entities it contains comprises:
inputting the financing information segment into a financing entity recognition model to obtain the financing entities and entity types output by the model, wherein each entity type is either financing party or financing amount;
the financing entity recognition model is trained on sample financing information segments and on the sample financing entities and entity type labels contained in those segments.
According to the investment and financing information mining method provided by the invention, performing financing round analysis on the financing information segment to obtain its financing round comprises:
inputting the title text of the target information text together with the financing information segment into a financing round generation model to obtain the financing round output by the model;
the financing round generation model is built on an encoder-decoder model and is trained on the sample title text of a sample information text, the sample financing information segments of that text, and the financing round labels of those segments.
According to the investment and financing information mining method provided by the invention, after determining the investment and financing information of the target information text, the method further comprises:
matching the target information text against existing information texts;
if the matching succeeds, collapsing the target information text under the matched existing information text for display;
otherwise, displaying the target information text according to the priority of its information publisher.
According to the method for mining investment and financing information provided by the invention, matching the target information text against existing information texts comprises:
matching the investment and financing information of the target information text against the investment and financing information of the existing information text;
and, if that matching succeeds, performing text similarity matching between the target information text and the existing information text.
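The two-stage matching described above can be sketched as follows. This is only an illustrative sketch: the dictionary layout (`info`, `text`), the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions, since the text does not specify how similarity is computed.

```python
from difflib import SequenceMatcher


def is_duplicate(new_item, existing_item, threshold=0.8):
    """Two-stage match: structured financing info first, then text similarity.

    Each item is a dict with hypothetical keys:
      "info": the mined investment and financing information (any comparable value)
      "text": the full information text
    """
    # Stage 1: the mined investment and financing information must agree.
    if new_item["info"] != existing_item["info"]:
        return False
    # Stage 2: only then compare the texts themselves (assumed similarity measure).
    ratio = SequenceMatcher(None, new_item["text"], existing_item["text"]).ratio()
    return ratio >= threshold
```

Running the cheap structured comparison first means the more expensive text comparison is only performed for texts that already report the same financing event.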
The invention also provides a device for mining investment and financing information, which comprises:
a text screening unit, configured to determine candidate information texts and to select, as target information texts, the candidate information texts whose domain type matches the target-domain investment and financing type;
an information mining unit, configured to extract the financing information segments in the target information text, perform entity recognition on each financing information segment to obtain the financing entities it contains, and perform financing round analysis on each financing information segment to obtain its financing round;
and an information fusion unit, configured to determine the investment and financing information of the target information text based on the financing entities and the financing round of each financing information segment in the target information text.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the investment and financing information mining methods described above.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the investment and financing information mining methods described above.
With the investment and financing information mining method and device, electronic device, and storage medium provided by the invention, target information texts are screened from the candidate information texts, which automates the screening of massive amounts of information, keeps irrelevant information from interfering with the acquisition of the target investment and financing information, and reduces the subsequent amount of computation. Entity recognition and financing round recognition are performed on the financing information segments of the target information text, so that different types of financing information are mined in different ways, which effectively improves the accuracy and reliability of the acquired investment and financing information. In addition, machine execution avoids the operating errors and subjective bias that manual information mining may introduce, and ensures that information mining is timely and objective.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for mining investment and financing information provided by the present invention;
FIG. 2 is a schematic structural diagram of an investment and financing information mining device provided by the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
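FIG. 1 is a schematic flow chart of the investment and financing information mining method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, determining candidate information texts, and selecting, as target information texts, the candidate information texts whose domain type matches the target-domain investment and financing type.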
Here, the candidate information texts are collected from major information websites by web crawlers at regular intervals. Because the number of directly collected candidate information texts is huge, while the user actually cares only about investment and financing information in a specific domain, the candidate information texts can be screened to obtain the target information texts that belong to the target-domain investment and financing type.
The target-domain investment and financing type means that the text content belongs to the target domain and is related to investment and financing. The target domain, that is, the domain the user actually cares about, may be set in advance; for example, it may be the medical domain or the publishing domain.
Selecting the target information texts requires determining two things about each candidate information text: the domain it belongs to, and whether it is related to investment and financing. If a candidate information text belongs to the target domain and is related to investment and financing, it can be taken directly as a target information text. Both determinations can be made by a classification model trained in advance.
Step 120, extracting the financing information segments in the target information text, performing entity recognition on each financing information segment to obtain the financing entities it contains, and performing financing round analysis on each financing information segment to obtain its financing round.
Specifically, the body of the target information text may include a number of segments, some of which contain content related to financing information and some of which do not. To reduce the amount of computation, each segment of the target information text can be classified to determine whether it contains content related to financing information, and the segments that do are selected as financing information segments.
Investment and financing information generally includes the financing party, the financing amount, the financing round, and so on. Because these kinds of information appear in a financing information segment in different ways, each can be extracted in a correspondingly different way. Information that usually appears in the segment as an entity, namely a financing entity, can be obtained by entity recognition; the financing entity referred to here may be a financing party and/or a financing amount. Information that is not necessarily stated explicitly in the text and may have to be inferred from context, such as the financing round, can be obtained by applying a text generation algorithm to the semantics of the financing information segment.
Specifically, a target information text may contain several financing information segments whose financing entities or financing rounds differ. In that case the financing entities and financing rounds of all the segments are combined to determine the investment and financing information of the target information text as a whole, which is the final mined result.
For example, if the financing rounds of the segments differ, the recognized rounds can be sorted according to a preset chronological order of financing rounds, and the round closest to the current stage is selected as the financing round finally determined for the target information text. Here, the financing rounds from latest to earliest may be "merger", "reimbursement", "IPO", "Pre-IPO", "R round", "I round", etc.
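The selection rule in the example above can be sketched as follows. The ordering list simply reuses the round labels quoted in the text; the function name and the handling of unrecognized labels are assumptions made for illustration.

```python
# Preset order of financing rounds, latest stage first (labels taken from the text).
ROUND_ORDER_LATEST_FIRST = [
    "merger", "reimbursement", "IPO", "Pre-IPO", "R round", "I round",
]


def select_final_round(rounds):
    """Return the recognized round that is latest in the preset order.

    Labels missing from the preset order are ignored (an assumption);
    None is returned when no known round was recognized.
    """
    rank = {name: i for i, name in enumerate(ROUND_ORDER_LATEST_FIRST)}
    known = [r for r in rounds if r in rank]
    if not known:
        return None
    # The smallest rank is the latest stage.
    return min(known, key=rank.__getitem__)
```

For instance, if three segments of one text yield "Pre-IPO", "IPO", and "I round", the sketch returns "IPO" as the round of the whole text.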
In the method provided by this embodiment of the invention, target information texts are screened from the candidate information texts, which automates the screening of massive amounts of information, keeps irrelevant information from interfering with the acquisition of the target investment and financing information, and reduces the subsequent amount of computation. Entity recognition and financing round recognition are performed on the financing information segments of the target information text, so that different types of financing information are mined in different ways, which effectively improves the accuracy and reliability of the acquired investment and financing information. In addition, machine execution avoids the operating errors and subjective bias that manual information mining may introduce, and ensures that information mining is timely and objective.
Based on the above embodiment, step 110 includes:
inputting the title text and each segment of the candidate information text into the domain classification model to obtain the domain classification result output by the domain classification model, wherein the domain classification model is trained on sample information texts and their sample domain classification results;
and determining each candidate information text whose domain classification result is the target-domain investment and financing type as a target information text.
Specifically, the domain classification model determines both the domain of the input candidate information text and whether the text is related to investment and financing, and outputs a domain classification result. The result is one of three types: the target-domain investment and financing type, the other-domain investment and financing type, and the non-investment-and-financing type. The target-domain investment and financing type means that the candidate information text belongs to the target domain and is related to investment and financing; a text of this type can be taken directly as a target information text. The other-domain investment and financing type means that the text belongs to a non-target domain but is related to investment and financing. The non-investment-and-financing type means that the text is unrelated to investment and financing. Texts of the latter two types are not needed when mining investment and financing information for the target domain.
Furthermore, the input to the domain classification model may include the title of the candidate information text and all segments of its body. The two parts can be explicitly marked with "title:" and "paragraphs:", so that the domain classification model learns the boundary between the title text and the body segments more distinctly and therefore classifies the domain more accurately.
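As an illustration of the explicit marking described above, the following sketch assembles one input string from a title and a list of body segments. The function name, the exact marker spelling, and the whitespace handling are assumptions; the text only states that the title and the segments are marked explicitly.

```python
def build_domain_input(title, paragraphs,
                       title_marker="title:", body_marker="paragraphs:"):
    """Join a title and body segments into one marked input string.

    The marker strings are parameters because their exact spelling
    is an assumption of this sketch.
    """
    # Drop empty segments and normalize surrounding whitespace.
    body = " ".join(p.strip() for p in paragraphs if p.strip())
    return f"{title_marker} {title.strip()} {body_marker} {body}"
```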
Accordingly, the domain classification model may be trained before step 110 is performed. Training may proceed as follows: a large number of sample information texts are collected, and each is labeled according to whether it belongs to the target domain and whether it is related to investment and financing; this label is the sample domain classification result of the sample information text. An initial model is then trained on the sample information texts and their sample domain classification results to obtain the domain classification model. Preferably, the initial model may be an mBERT model, which is applicable to multiple languages.
For example, when the method is applied to mining investment and financing information in the medical domain, the sample domain classification result of a sample information text may be "medical", "non-medical", or "non-related", where "medical" means the text belongs to the medical domain and is related to investment and financing, "non-medical" means it belongs to another domain and is related to investment and financing, and "non-related" means it is unrelated to investment and financing. Because the body of a sample information text may be too long and would otherwise be forcibly truncated by the domain classification model, a sliding window (sliding_window) parameter may be set in advance. The window length may be expressed as the upper limit of the input sequence length (max_seq_length) × the stride rate (stride), for example 512 × 0.8; that is, the sample information text is sliced into subsequences of length 512 × 0.8. When the domain classification model is trained with the sliding window enabled, each subsequence is automatically assigned the label of the original sequence.
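The slicing step can be sketched as below, following the description literally: the slice length is max_seq_length × stride, rounded down, and every slice inherits the label of the whole text. Whether consecutive slices overlap is not stated in the text; this sketch assumes consecutive, non-overlapping slices, and the function name is hypothetical.

```python
def slice_with_labels(tokens, label, max_seq_length=512, stride=0.8):
    """Slice a token sequence into subsequences that each carry the original label."""
    window = int(max_seq_length * stride)  # 512 * 0.8 gives 409 tokens per slice
    if window <= 0:
        raise ValueError("max_seq_length * stride must be at least 1")
    # Assumption: slices are consecutive and do not overlap.
    return [(tokens[i:i + window], label) for i in range(0, len(tokens), window)]
```

With the default parameters, a 1000-token text labeled "medical" yields three slices of 409, 409, and 182 tokens, all labeled "medical".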
The training samples used for the domain classification model are shown in the following table:
text | labels
title: (an English information title) paragraphs: (all paragraph text) | medical
title: (a Chinese information title) paragraphs: (all paragraph text) | non-medical
title: (a Chinese information title) paragraphs: (all paragraph text) | non-related
Here, text is the sample information text and labels is the sample domain classification result. In actual training, the test precision of the domain classification model reached 99.5% after 3 rounds of training.
Correspondingly, when the domain classification model is applied, it assigns a classification result to each subsequence sliced from the candidate information text, and the mode of the classification results of all subsequences is taken as the final domain classification result.
The method provided by this embodiment of the invention can, by machine reading, accurately identify information related to investment and financing in the target domain among information articles of differing content and language, and effectively overcomes the difficulty that a title alone carries too little information for accurate classification.
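Taking the mode of the per-subsequence results can be sketched with the standard library as follows. The function name is hypothetical, and the tie-breaking behavior (the label seen first wins) is a property of this sketch rather than something the text specifies.

```python
from collections import Counter


def aggregate_by_mode(sub_results):
    """Return the most frequent classification result among the subsequences."""
    if not sub_results:
        raise ValueError("at least one subsequence result is required")
    # most_common(1) returns the label with the highest count;
    # among equal counts, the label encountered first is returned.
    return Counter(sub_results).most_common(1)[0][0]
```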
Based on any of the above embodiments, in step 120, extracting the financing information segments in the target information text includes:
concatenating the publication time of the target information text with each segment of the target information text, and inputting each concatenation into a financing segment classification model to obtain the financing classification result of each segment output by the model, wherein the financing segment classification model is trained on the publication time of sample target information texts, the sample segments in those texts, and the financing information labels of those segments;
and taking each segment whose financing classification result indicates that it contains financing information as a financing information segment.
Specifically, the financing information segments are extracted by classifying each segment of the target information text to determine whether it contains content related to financing information. A financing information segment may also be called an ira segment, where ira is an acronym for investee, round, and amount.
Some segments of the target information text contain not only financing information but also the time at which the financing took place. To judge whether that time is close to or consistent with the publication time of the target information text, that is, whether the financing information in the segment is current news or past news, the publication time of the target information text is concatenated with the segment and input into the financing segment classification model. The model analyzes both whether the segment contains content related to financing information and whether the time mentioned in the segment is consistent with the input publication time, and outputs the financing classification result of the segment.
The financing segment classification model may be trained before this step is performed. Training may proceed as follows: a large number of sample target information texts, each belonging to the target domain and related to investment and financing, are collected. Each sample segment in a sample target information text is then labeled according to whether it contains content related to financing information; this label is the financing information label of the sample segment. An initial model is then trained on the publication time of each sample target information text, the sample segments in that text, and their financing information labels to obtain the financing segment classification model.
The training sample of the financing segment classification model may be embodied as follows:
(Chinese) "The publication date of this article is xxxx year xx month xx day." + "paragraph text";
(English) "This article was published on Month, Day, Year." + "paragraph text";
specifically, it can be expressed in the form shown in the following table:
Here, text is the concatenation of the publication time and the segment; a labels value of 0 indicates that the segment contains no financing information, and a labels value of 1 indicates that it does.
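The sample construction described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names, the single space used to join the two parts, and the example sentence are assumptions, while the keys text and labels follow the description.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def build_classifier_input(published: date, segment: str) -> str:
    """Prefix a segment with a sentence stating the publication date."""
    prefix = (
        f"This article was published on "
        f"{MONTHS[published.month - 1]} {published.day}, {published.year}."
    )
    # Assumption: the two parts are joined with a single space.
    return f"{prefix} {segment}"


def build_training_sample(published: date, segment: str, has_financing: bool) -> dict:
    """Return one sample: text is the combined input, labels is 1 or 0."""
    return {
        "text": build_classifier_input(published, segment),
        "labels": 1 if has_financing else 0,
    }


sample = build_training_sample(
    date(2020, 12, 28), "Acme raised $5 million in Series A.", True
)
```

Because the publication date is part of the input text, the classifier can learn to treat a segment that reports an older financing event as not containing current financing information.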
Based on any of the above embodiments, in step 120, the entity identification is performed on the financing information language segment to obtain the financing entity included in the financing information language segment, including:
inputting the financing information language segment into a financing entity identification model to obtain a financing entity and an entity type output by the financing entity identification model, wherein the entity type is a financing party or financing amount; the financing entity identification model is obtained by training based on the sample financing information language fragment and the sample financing entity and entity type label contained in the sample financing information language fragment.
Specifically, the financing entity identification model identifies the financing-related entities contained in the financing information segment and outputs those entities together with their entity types. The financing entities may be labeled using the BIO tagging scheme. For example, for an entity whose type is the financing party, the first word may be tagged "B-investee" and each subsequent word "I-investee"; for an entity whose type is the financing amount, the first word may be tagged "B-amount" and each subsequent word "I-amount". The remaining words in the financing information segment, which belong to neither a financing party nor a financing amount, may be tagged "O".
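To show how BIO tags map back to entities, the following sketch groups tagged words into (entity, type) pairs. The tag spellings "B-investee", "I-investee", "B-amount" and "I-amount", the whitespace tokenization, and the example sentence are assumptions made for illustration; the tag sequence would in practice come from the financing entity identification model.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["Acme", "Robotics", "raised", "5", "million", "dollars"]
tags = ["B-investee", "I-investee", "O", "B-amount", "I-amount", "I-amount"]
decoded = decode_bio(tokens, tags)
```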
Before this step is executed, the financing entity identification model may be obtained through pre-training. The training may specifically include the following. A large number of sample target information texts are collected, segments containing financing information are extracted from them as sample financing information segments, and the sample financing entities and their entity type labels are marked in each sample financing information segment. An initial model is then trained based on the sample financing information segments and the sample financing entities and entity type labels they contain, thereby obtaining the financing entity identification model. Preferably, the initial model here may be an mBERT model. In actual training, the test precision of the financing entity identification model can reach 95% after 4 rounds of training.
Based on any of the above embodiments, in step 120, performing financing turn analysis on the financing information segments to obtain financing turns of the financing information segments, including:
inputting the title text of the target information text and the financing information segment into the financing round generation model to obtain the financing round output by the financing round generation model; the financing round generation model is constructed based on a codec model and is obtained by training based on a sample title text of a sample information text, a sample financing information segment, and a financing round label of the sample financing information segment.
Specifically, because the financing round is not necessarily stated directly and explicitly in the financing information segment and may have to be derived from context, a generative algorithm may be applied to extract the financing round indicated by the segment. Using a codec (encoder-decoder) model makes the round expressions that can be predicted more diverse and also allows new round expressions that are not contained in the training set to be generated.
Before this step is executed, the financing round generation model may be obtained through pre-training. The training may specifically include the following. A large number of sample target information texts are collected, segments containing financing information are extracted from them as sample financing information segments, and the financing round indicated by each sample financing information segment is marked as its financing round label. The codec model is then trained based on the sample title text of each sample information text, the sample financing information segments, and their financing round labels, thereby obtaining the financing round generation model. Preferably, the codec model here may be an mBERT model. In actual training, the test precision of the financing round generation model can reach 95.5% after 3 rounds of training.
The training sample of the financing round generation model may be specifically represented in the form shown in the following table:
Here, input_text is the combination of the title text of the target information text and the financing information segment, and output_text is the financing round corresponding to that financing information segment.
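A training sample of this shape can be assembled as sketched below. The keys input_text and output_text follow the description; the function name, the " [SEP] " separator between title and segment, and the example strings are assumptions, since the description does not state how the two parts are combined.

```python
def build_round_sample(title: str, segment: str, round_label: str,
                       sep: str = " [SEP] ") -> dict:
    """Combine title and segment into the input, with the round as the target."""
    # Assumption: the separator string is hypothetical, not given in the text.
    return {
        "input_text": f"{title}{sep}{segment}",
        "output_text": round_label,
    }


round_sample = build_round_sample(
    "Acme Robotics closes new funding",
    "The company said the Series B round was led by existing investors.",
    "Series B",
)
```

Including the title gives the generation model context for cases where the segment alone does not name the round.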
Based on any of the above embodiments, step 120 further includes:
matching the target information text with the existing information text;
if the matching is successful, folding the target information text to the matched existing information text for display; otherwise, displaying the target information text based on the priority of the information publisher of the target information text.
Specifically, the same information may be published successively by different information publishers, and repeatedly pushing homogeneous information greatly degrades the user experience. The target information text can therefore be matched against existing information texts. Here, an existing information text is an information text that has already been published, pushed, and displayed within a time period adjacent to the publication time of the target information text, for example within 7 days of the publication date of the target information text.
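The selection of existing information texts inside the adjacent time period can be sketched as a simple date filter. The 7-day window comes from the example in the description; the function names and the (publication date, text) tuple layout are assumptions.

```python
from datetime import date


def within_window(target_date: date, existing_date: date, days: int = 7) -> bool:
    """True if the two publication dates are at most `days` apart."""
    return abs((target_date - existing_date).days) <= days


def select_existing(target_date: date, existing, days: int = 7):
    """Keep only the (publication_date, text) pairs inside the window."""
    return [item for item in existing if within_window(target_date, item[0], days)]


existing_texts = [
    (date(2020, 12, 22), "Acme raises Series A"),
    (date(2020, 12, 10), "Acme launches product"),
]
candidates = select_existing(date(2020, 12, 28), existing_texts)
```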
If the target information text matches an existing information text, the two can be considered to express homogeneous information. In this case, to avoid repeated pushing, the target information text can be folded under the existing information text for display, so that homogeneous information is grouped together. Compared with directly deleting the information texts that carry homogeneous information, folding retains the texts that use different wording or come from different information publishers; the user can still view the folded texts when needed, and the completeness of the information is preserved.
If the target information text does not match any existing information text, the two express different information. In this case, target information texts that are mined at the same time and are homogeneous with one another can be pushed and displayed according to a preset priority. The priority can be set per information publisher; when homogeneous target information texts from higher-priority and lower-priority publishers coexist, the text from the highest-priority publisher can be pushed to the user and displayed directly.
The method provided by the embodiment of the invention folds homogeneous information for display, which preserves the information from each data source, largely removes the interference of homogeneous information for staff, and further improves data processing efficiency.
Based on any of the above embodiments, the matching the target information text with the existing information text includes:
matching the investment and financing information of the target information text with the investment and financing information of the existing information text;
if the matching is successful, the text similarity matching is performed between the target information text and the existing information text.
Specifically, the matching between the target information text and an existing information text can be divided into two levels. First, the investment and financing information of the target information text is compared with that of the existing information text. The investment and financing information may include the financing party, the financing round, and the financing amount of the information text, so matching the investment and financing information means matching these three fields. If all three fields match, the process enters the next level and performs similarity matching at the text level; if the three fields do not match, the target information text is directly determined not to match the existing information text.
The text-level similarity matching can be implemented by a pre-trained synonymous text recognition model. Specifically, the target information text and the existing information text are input into the synonymous text recognition model, which computes their text-level similarity and outputs whether the two are similar texts as the text similarity matching result. If the two are similar texts, the target information text and the existing information text are successfully matched; otherwise, they do not match.
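The two-level matching can be sketched as below. This is only an illustration under stated assumptions: the FinancingInfo structure, the exact-equality comparison of the three fields, and the 0.8 threshold are hypothetical, and the standard-library difflib.SequenceMatcher is used purely as a stand-in for the trained synonymous text recognition model, which it does not reproduce.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class FinancingInfo:
    investee: str
    round: str
    amount: str


def texts_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in for the synonymous text recognition model (assumption)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def is_match(target_info, target_text, existing_info, existing_text,
             similar=texts_similar) -> bool:
    """Level 1: compare the three fields. Level 2: compare the texts."""
    if target_info != existing_info:
        # Field mismatch: no text-level comparison is needed.
        return False
    return similar(target_text, existing_text)


info_a = FinancingInfo("Acme", "Series A", "$5 million")
info_b = FinancingInfo("Acme", "Series B", "$5 million")
same = is_match(info_a, "Acme raises $5 million", info_a, "Acme raises $5 million")
diff_fields = is_match(info_a, "Acme raises $5 million", info_b, "Acme raises $5 million")
diff_text = is_match(info_a, "abc", info_a, "xyz")
```

Checking the three fields first means the more expensive text comparison only runs for pairs that already agree on financing party, round, and amount.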
Before this, the synonymous text recognition model may also be obtained through pre-training. The training may specifically include the following. A large number of sample information texts are collected and grouped in pairs, and each pair is labeled according to whether its two texts are similar. An initial model is then trained based on the paired sample information texts and their similarity labels to obtain the synonymous text recognition model. Preferably, the initial model here may be an mBERT model.
The training sample of the synonymous text recognition model may be specifically represented in the form shown in the following table:
Here, text_a represents a target information text, text_b represents an existing information text, and Labels represents the text matching result of the two, where 1 means similar and 0 means dissimilar.
Based on any of the above embodiments, the folding the target information text to the matched existing information text for displaying includes:
and folding and displaying the target information text and the matched existing information text based on the priority of the information issuing party of the target information text and the matched existing information text.
Specifically, when the texts are displayed in folded form, the priority of the information publisher of each text can be used to decide which of the target information text and its matched existing information texts is shown unfolded at the top and which are folded. The text published by the information publisher with the highest priority can be used directly as the unfolded text at the top, and the texts published by the remaining information publishers are folded beneath it.
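The choice of the unfolded top text can be sketched as a sort by publisher priority. The publisher names, the numeric priority table (a lower number meaning a higher priority), and the returned dictionary layout are assumptions for illustration only.

```python
def fold_group(texts, priority):
    """Pick the top text by publisher priority and fold the rest under it.

    texts: list of (publisher, text) pairs that matched one another.
    priority: dict mapping publisher to rank; lower rank means higher priority.
    """
    # Publishers missing from the table are treated as lowest priority.
    ranked = sorted(texts, key=lambda item: priority.get(item[0], float("inf")))
    return {"top": ranked[0], "folded": ranked[1:]}


group = fold_group(
    [("BlogX", "Acme gets funding"), ("WireA", "Acme raises Series A")],
    {"WireA": 0, "BlogX": 5},
)
```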
Based on any one of the embodiments, the investment and financing information mining method comprises the following steps:
First, candidate information texts are acquired at preset time intervals. The title text and the segments of each acquired candidate information text are input into the domain classification model to obtain a domain classification result for each candidate information text. Candidate information texts whose domain classification result is the target domain investment and financing type are selected as target information texts, and candidate information texts with other domain classification results are deleted.
Then, the publication time of each target information text is concatenated with each segment of that text and input into the financing segment classification model to obtain the financing classification result of each segment. Segments whose financing classification result indicates that they contain financing information are taken as the financing information segments of the corresponding target information text.
Next, each financing information segment is input into the financing entity identification model to obtain the financing party and financing amount in the segment, and the title text of the target information text together with the financing information segment is input into the financing round generation model to obtain the financing round of the segment.
And then, fusing the financing party, the financing amount and the financing turn of each financing information field in the target information text to obtain the investment and financing information of the target information text.
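The description does not state how the per-segment results are fused. As one possible sketch, the most frequent non-empty value of each field across the segments can be taken; this majority-vote rule, the dictionary layout, and the field names are all assumptions rather than the fusion method of the embodiment.

```python
from collections import Counter


def fuse_segments(segment_results):
    """Fuse per-segment results by taking the most common value per field."""
    fused = {}
    for field in ("investee", "amount", "round"):
        # Ignore segments where the field is missing or empty.
        values = [r[field] for r in segment_results if r.get(field)]
        fused[field] = Counter(values).most_common(1)[0][0] if values else None
    return fused


fused = fuse_segments([
    {"investee": "Acme", "amount": "$5 million", "round": "Series A"},
    {"investee": "Acme", "amount": None, "round": "Series A"},
    {"investee": "Acme Robotics", "amount": "$5 million", "round": None},
])
```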
After the investment and financing information of each target information text is obtained, the target information text can be matched with the existing information text; if the matching is successful, folding the target information text to the matched existing information text for display; otherwise, the text display is carried out based on the priority of the information publisher of the target information text.
The investment and financing information mining device provided by the invention is described below. The device described below and the investment and financing information mining method described above may be read with reference to each other.
Based on any of the above embodiments, fig. 2 is a schematic structural diagram of an investment and financing information mining device provided by the present invention, as shown in fig. 2, the device includes:
the text screening unit 210 is configured to determine candidate information texts, and select a candidate information text having a domain type consistent with a target domain investment and financing type as a target information text;
the information mining unit 220 is configured to extract financing information fields in the target information text, perform entity identification on the financing information fields to obtain financing entities included in the financing information fields, and perform financing round analysis on the financing information fields to obtain financing rounds of the financing information fields;
the information fusion unit 230 is configured to determine the financing information of the target information text based on the financing entity and the financing turn included in each financing information field in the target information text.
According to the device provided by the embodiment of the invention, target information texts are screened from the candidate information texts, which realizes automatic screening of massive information, prevents irrelevant information from interfering with the acquisition of the target investment and financing information, and reduces the subsequent amount of calculation. Entity identification and financing round identification are performed on the financing information segments in the target information text, so that different mining modes are applied to different types of investment and financing information, which can effectively improve the accuracy and reliability of investment and financing information acquisition. At the same time, machine execution effectively avoids the operational errors and subjective interference that may occur in manual information mining, ensuring the real-time performance and objectivity of investment and financing information mining.
Based on any of the above embodiments, the text filtering unit 210 is configured to:
inputting the title text and each language segment of the candidate information text into a domain classification model to obtain a domain classification result output by the domain classification model; the domain classification model is obtained by training based on the sample information text and the sample domain classification result;
and determining the candidate information text of which the domain classification result is the target domain financing type as the target information text.
Based on any of the above embodiments, the information mining unit 220 is configured to:
respectively splicing the issuing time of the target information text with each language section in the target information text and then inputting the spliced issuing time into a financing language section classification model to obtain a financing classification result of each language section output by the financing language section classification model; the financing field classification model is obtained by training based on the release time of a sample target information text, and each sample field and a financing information label in the sample target information text;
and taking the financing classification result as a language segment containing financing information as the financing information language segment.
Based on any of the above embodiments, the information mining unit 220 is configured to:
inputting the financing information language segment into a financing entity identification model to obtain a financing entity and an entity type output by the financing entity identification model, wherein the entity type is a financing party or financing amount;
the financing entity identification model is obtained by training based on sample financing information language fragment and sample financing entity and entity type label contained in the sample financing information language fragment.
Based on any of the above embodiments, the information mining unit 220 is configured to:
inputting the title text of the target information text and the financing information language section into a financing turn generation model to obtain a financing turn output by the financing turn generation model;
the financing turn generation model is constructed based on a codec model and is obtained by training a sample title text, a sample financing information field and a financing turn label of the sample financing information field based on a sample information text.
Based on any embodiment above, the apparatus further comprises a display unit, configured to:
matching the target information text with the existing information text;
if the matching is successful, folding the target information text to the matched existing information text for display;
otherwise, displaying the target information text based on the priority of the information publisher of the target information text.
Based on any one of the above embodiments, the display unit is configured to:
matching the investment and financing information of the target information text with the investment and financing information of the existing information text;
and if the matching is successful, performing text similarity matching on the target information text and the existing information text.
Fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor (processor)310, a communication Interface (communication Interface)320, a memory (memory)330 and a communication bus 340, wherein the processor 310, the communication Interface 320 and the memory 330 communicate with each other via the communication bus 340. Processor 310 may invoke logic instructions in memory 330 to perform a method of financing information mining comprising: determining candidate information texts, and selecting the candidate information texts with the field types consistent with the target field investment and financing types as target information texts; extracting financing information fields in the target information text, performing entity identification on the financing information fields to obtain financing entities contained in the financing information fields, and performing financing round analysis on the financing information fields to obtain financing rounds of the financing information fields; and determining the investment and financing information of the target information text based on the financing entity and the financing turn contained in each financing information field in the target information text.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the investment and financing information mining method provided above, the method comprising: determining candidate information texts, and selecting the candidate information texts with the field types consistent with the target field investment and financing types as target information texts; extracting financing information fields in the target information text, performing entity identification on the financing information fields to obtain financing entities contained in the financing information fields, and performing financing round analysis on the financing information fields to obtain financing rounds of the financing information fields; and determining the investment and financing information of the target information text based on the financing entity and the financing turn contained in each financing information field in the target information text.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the investment and financing information mining method provided above, the method comprising: determining candidate information texts, and selecting the candidate information texts with the field types consistent with the target field investment and financing types as target information texts; extracting financing information fields in the target information text, performing entity identification on the financing information fields to obtain financing entities contained in the financing information fields, and performing financing round analysis on the financing information fields to obtain financing rounds of the financing information fields; and determining the investment and financing information of the target information text based on the financing entity and the financing turn contained in each financing information field in the target information text.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for mining investment and financing information is characterized by comprising the following steps:
determining candidate information texts, and selecting the candidate information texts with the field types consistent with the target field investment and financing types as target information texts;
extracting financing information fields in the target information text, performing entity identification on the financing information fields to obtain financing entities contained in the financing information fields, and performing financing round analysis on the financing information fields to obtain financing rounds of the financing information fields;
and determining the investment and financing information of the target information text based on the financing entity and the financing turn contained in each financing information field in the target information text.
2. The mining method of investment and financing information as claimed in claim 1, wherein the selecting of the candidate information text having the domain type consistent with the investment and financing type of the target domain as the target information text comprises:
inputting the title text and each language segment of the candidate information text into a domain classification model to obtain a domain classification result output by the domain classification model; the domain classification model is obtained by training based on the sample information text and the sample domain classification result;
and determining the candidate information text of which the domain classification result is the target domain financing type as the target information text.
3. The method as claimed in claim 1, wherein the extracting financing information segments from the target information text comprises:
respectively splicing the issuing time of the target information text with each language section in the target information text and then inputting the spliced issuing time into a financing language section classification model to obtain a financing classification result of each language section output by the financing language section classification model; the financing field classification model is obtained by training based on the release time of a sample target information text, and each sample field and a financing information label in the sample target information text;
and taking the financing classification result as a language segment containing financing information as the financing information language segment.
4. The mining method of investment and financing information according to claim 1, wherein the step of performing entity identification on the financing information corpus to obtain financing entities contained in the financing information corpus comprises:
inputting the financing information language segment into a financing entity identification model to obtain a financing entity and an entity type output by the financing entity identification model, wherein the entity type is a financing party or financing amount;
the financing entity identification model is obtained by training based on sample financing information language fragment and sample financing entity and entity type label contained in the sample financing information language fragment.
5. The mining method of investment and financing information according to claim 1, wherein the financing turn analysis of the financing information corpus to obtain the financing turn of the financing information corpus comprises:
inputting the title text of the target information text and the financing information language section into a financing turn generation model to obtain a financing turn output by the financing turn generation model;
the financing turn generation model is constructed based on a codec model and is obtained by training a sample title text, a sample financing information field and a financing turn label of the sample financing information field based on a sample information text.
6. The method as claimed in any one of claims 1 to 5, wherein the determining of the investment information of the target information text further comprises:
matching the target information text with the existing information text;
if the matching is successful, folding the target information text to the matched existing information text for display;
otherwise, displaying the target information text based on the priority of the information publisher of the target information text.
7. The method of claim 6, wherein the matching of the target information text with the existing information text comprises:
matching the investment and financing information of the target information text with the investment and financing information of the existing information text;
and if the matching is successful, performing text similarity matching on the target information text and the existing information text.
8. An investment and financing information mining device, comprising:
a text screening unit, configured to determine candidate information texts and to select, as the target information texts, the candidate information texts whose field types are consistent with the investment and financing types of the target field;
an information mining unit, configured to extract financing information fields from the target information text, to perform entity identification on the financing information fields to obtain the financing entities contained in the financing information fields, and to perform financing round analysis on the financing information fields to obtain the financing rounds of the financing information fields;
and an information fusion unit, configured to determine the investment and financing information of the target information text based on the financing entities and the financing rounds contained in each financing information field in the target information text.
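The three units of claim 8 can be pictured as a small pipeline. The sketch below is only a structural illustration: a regular expression and a toy entity list stand in for the trained models, sentences stand in for financing information fields, and majority voting stands in for the fusion rule. None of these stand-ins, nor the function names, come from the patent.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Stand-ins for trained models (assumptions, not the patented method).
ROUND_PATTERN = re.compile(r"\b(Series [A-E]|angel round|seed round)\b", re.IGNORECASE)
KNOWN_ENTITIES = ("Acme Bio", "Beta Capital")  # toy gazetteer instead of entity recognition


def screen_texts(candidates: List[Dict[str, str]], target_type: str) -> List[Dict[str, str]]:
    """Text screening unit: keep candidates whose field type matches the target type."""
    return [c for c in candidates if c["field_type"] == target_type]


def mine_fields(text: str) -> List[Dict[str, object]]:
    """Information mining unit: find entities and a round in each financing field."""
    mined = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        round_match = ROUND_PATTERN.search(sentence)
        if round_match is None:
            continue  # sentence mentions no round, so it is not treated as a financing field
        entities = [e for e in KNOWN_ENTITIES if e in sentence]
        mined.append({"field": sentence, "entities": entities, "round": round_match.group(1)})
    return mined


def fuse(mined: List[Dict[str, object]]) -> Dict[str, Optional[object]]:
    """Information fusion unit: merge per-field results into one record per text."""
    entities = sorted({e for m in mined for e in m["entities"]})
    rounds = Counter(m["round"] for m in mined)
    top_round = rounds.most_common(1)[0][0] if rounds else None  # assumed majority vote
    return {"entities": entities, "round": top_round}


candidates = [
    {"field_type": "financing",
     "text": "Acme Bio announced a Series B. Beta Capital led the Series B. "
             "The firm is based in Shanghai."},
    {"field_type": "sports", "text": "The match ended 2-1."},
]
targets = screen_texts(candidates, "financing")
result = fuse(mine_fields(targets[0]["text"]))
```

Keeping the units separate mirrors the claim's structure and would let each stand-in be replaced by a trained model without touching the others.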
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the investment and financing information mining method according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the investment and financing information mining method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011584208.4A CN112528028A (en) | 2020-12-28 | 2020-12-28 | Investment and financing information mining method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112528028A (en) | 2021-03-19 |
Family
ID=74976922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011584208.4A Pending CN112528028A (en) | 2020-12-28 | 2020-12-28 | Investment and financing information mining method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112528028A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129072A (en) * | 2021-04-30 | 2021-07-16 | 上海药慧信息技术有限公司 | Enterprise valuation determination method and device based on investment and financing information |
CN114036949A (en) * | 2021-11-08 | 2022-02-11 | 中国银行股份有限公司 | Investment strategy determination method and device based on information analysis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933800A (en) * | 2016-11-29 | 2017-07-07 | 首都师范大学 | A kind of event sentence abstracting method of financial field |
CN110781299A (en) * | 2019-09-18 | 2020-02-11 | 平安科技(深圳)有限公司 | Asset information identification method and device, computer equipment and storage medium |
CN110929134A (en) * | 2019-12-04 | 2020-03-27 | 深圳市新国都金服技术有限公司 | Investment and financing data management method and device, computer equipment and storage medium |
CN111241298A (en) * | 2020-01-08 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Information processing method, apparatus and computer readable storage medium |
CN111639183A (en) * | 2020-05-19 | 2020-09-08 | 民生科技有限责任公司 | Financial industry consensus public opinion analysis method and system based on deep learning algorithm |
Similar Documents
Publication | Title |
---|---|
CN106919673B (en) | Text mood analysis system based on deep learning |
Bucur | Using opinion mining techniques in tourism |
Kumar et al. | Text mining: concepts, process and applications |
Batool et al. | Precise tweet classification and sentiment analysis |
WO2020243846A1 (en) | System and method for automated file reporting |
US20170061285A1 (en) | Data analysis system, data analysis method, program, and storage medium |
CN111783518A (en) | Training sample generation method and device, electronic equipment and readable storage medium |
CN113254574A (en) | Method, device and system for auxiliary generation of customs official documents |
Alsaqer et al. | Movie review summarization and sentiment analysis using rapidminer |
CN112528028A (en) | Investment and financing information mining method and device, electronic equipment and storage medium |
Raza et al. | Detecting cyberbullying in social commentary using supervised machine learning |
Rauf et al. | Logical structure extraction from software requirements documents |
CN114239588A (en) | Article processing method and device, electronic equipment and medium |
Martinez Mateo et al. | The modular assessment pack: A new approach to translation quality assessment at the Directorate General for Translation |
KR102185733B1 (en) | Server and method for automatically generating profile |
Yatim et al. | A corpus-based lexicon building in Indonesian political context through Indonesian online news media |
CN103823868A (en) | Event recognition method and event relation extraction method oriented to on-line encyclopedia |
WO2021012684A1 (en) | Method and system for establishing market sentiment monitoring system |
Guadie et al. | Amharic text summarization for news items posted on social media |
Heidari et al. | Financial footnote analysis: developing a text mining approach |
WO2023180343A1 (en) | Analysing communications data |
Mohsen et al. | Enhancing bug localization using phase-based approach |
KR102298397B1 (en) | Citation Relationship Analysis Method and System Based on Citation Type |
Smirnova et al. | Evaluation of embedding models for automatic extraction and classification of acknowledged entities in scientific documents |
Alqahtani et al. | Customer Sentiments Toward Saudi Banks During the Covid-19 Pandemic |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Country or region after: China; Address after: 201210 3rd floor, building 1, No.400, Fangchun Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai; Applicant after: Shanghai Huabin Licheng Technology Co.,Ltd.; Address before: 102200 c2040, 2 / F, building 16, courtyard 37, Chaoqian Road, science and Technology Park, Changping District, Beijing; Applicant before: Beijing Huabin Licheng Technology Co.,Ltd.; Country or region before: China |