CN108416034B

CN108416034B - Information acquisition system based on financial heterogeneous big data and control method thereof

Info

Publication number: CN108416034B
Application number: CN201810201458.1A
Authority: CN
Inventors: 孙善辉
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2018-03-12
Filing date: 2018-03-12
Publication date: 2021-11-16
Anticipated expiration: 2038-03-12
Also published as: CN108416034A

Abstract

The invention discloses an information acquisition system based on financial heterogeneous big data and a control method thereof. The information acquisition system comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, which are connected in sequence. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, and the heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory. The invention can collect heterogeneous documents on financial products in real time and extract from them the data that is of interest to the user, thereby ensuring the validity of the financial data provided and solving the problem that heterogeneous information is inconvenient to collect in the traditional financial field.

Description

Information acquisition system based on financial heterogeneous big data and control method thereof

Technical Field

The invention relates to the technical field of information acquisition systems, in particular to an information acquisition system based on financial heterogeneous big data.

Background

With the development of information technology, more and more financial activity takes place on the internet. A large amount of financial information is published on the internet at every moment. Because the network carries a huge volume of information, its information sources are not fixed and its content is expressed predominantly as text, financial information on the internet is at present still published mainly in semi-structured form. Compared with structured data, such heterogeneous information is easy to publish and collect, but it is noisy, highly redundant and inconvenient to read and understand, so effective information extraction becomes crucial.

Disclosure of Invention

The invention aims to solve the problems of noisy information acquisition, high information redundancy and inconvenient reading and understanding in the prior art in the financial field, and provides an information acquisition system based on financial heterogeneous big data and a control method thereof.

In order to achieve the purpose, the invention adopts the following technical scheme:

An information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background service system, a Web client program system and a client terminal, and is characterized in that these four parts are connected in sequence. The Linux background service system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are connected in sequence. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are connected in sequence. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit connected to each other; the rule synthesis unit comprises a matcher, a judger and a generalizer, which are connected in sequence. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.

Preferably, the crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.

Preferably, the generalizer adopts a rule generalizing method based on a heuristic function, and adopts Laplacian error estimation as a heuristic function.
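The text does not state the formula behind the Laplacian error estimation. As a minimal sketch, assuming one commonly used form in rule learning, (e + 1) / (n + 1), where n is the number of extractions a rule makes on the training set and e is the number of erroneous ones, a generalizer could rank candidate rules as follows. The `RuleStats` type, the rule names and the counts are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Hypothetical counts for one candidate rule on the labeled training set."""
    name: str
    extractions: int  # n: how many extractions the rule made
    errors: int       # e: how many of those extractions were wrong


def laplacian_error(stats: RuleStats) -> float:
    """Assumed form of the Laplacian expected error: (e + 1) / (n + 1)."""
    if stats.errors < 0 or stats.errors > stats.extractions:
        raise ValueError("errors must be between 0 and extractions")
    return (stats.errors + 1) / (stats.extractions + 1)


def pick_best(candidates: list[RuleStats]) -> RuleStats:
    """Choose the candidate generalization with the lowest estimated error."""
    return min(candidates, key=laplacian_error)


# Illustrative counts only; they are not taken from the patent.
candidates = [
    RuleStats("narrow_rule", extractions=3, errors=0),
    RuleStats("broad_rule", extractions=40, errors=2),
    RuleStats("noisy_rule", extractions=40, errors=15),
]
best = pick_best(candidates)
print(best.name, round(laplacian_error(best), 4))
```

Under this assumed form, a rule with few extractions is penalized even when it makes no errors, so a broad rule with a small number of errors can still be preferred over a narrow, error-free one.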

Preferably, the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall and the F-measure.

Preferably, the operation is carried out as follows:

The first step: the system uses the crawler URL parser to search the internet information source for newly released financial products. When a PDF document that cannot be processed is encountered, the crawler URL parser searches for a substitute Web page through the search engine retriever. The heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information; it is responsible for parsing the heterogeneous documents, extracting text information from them and storing that text for subsequent processing.

The second step: in the extraction rule generation module, the system generates a rule set from the labeled training samples; the rule set is clustered and synthesized, and the result is imported into the final rule base.

The third step: through the information extraction evaluation module, the system applies the rule base to extract information from unknown data. The system runs iteratively: the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when a given extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
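The fallback described in the first step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the PDF parser and the search engine retriever are passed in as hypothetical callables (`parse_pdf`, `find_substitute_page`), a `ValueError` is assumed to signal a PDF that cannot be processed, and the HTML parser is built on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Reduce a web page to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def collect_text(
    pdf_bytes: bytes,
    parse_pdf: Callable[[bytes], str],
    find_substitute_page: Callable[[], Optional[str]],
) -> Optional[str]:
    """Try the PDF parser first; on failure fall back to a substitute web page."""
    try:
        return parse_pdf(pdf_bytes)
    except ValueError:
        html = find_substitute_page()
        return html_to_text(html) if html is not None else None


def broken_pdf_parser(_: bytes) -> str:
    # Stand-in for a PDF that cannot be processed.
    raise ValueError("unsupported PDF")


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Fund A</h1><p>Yield: 4.2%</p></body></html>"
)
text = collect_text(b"%PDF-", broken_pdf_parser, lambda: page)
print(text)
```

When neither route yields text, `collect_text` returns `None`, which a caller could use to record the document for a later processing round.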

Compared with the prior art, the invention provides an information acquisition system based on financial heterogeneous big data and a control method thereof, and the information acquisition system has the following beneficial effects:

1. In the information acquisition system based on financial heterogeneous big data and the control method thereof, the Linux background server system is responsible for collecting heterogeneous information on financial products from the internet information source and extracting structured data from it. The structured data is provided to the Web client program system, which can analyze and study the data and supply it to the client terminal.

2. In the heterogeneous information collection and preprocessing module, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them as PDF documents, which the PDF parser then parses into processable plain text data. When a document that cannot be processed is encountered, the crawler URL parser obtains web page data through the search engine retriever, and the HTML parser parses that data into plain text. Because the module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing.

3. In the extraction rule generation module, the rule classification unit classifies the rules that target the same entity in different documents, so as to obtain a rule subset for each target. A heuristic learning method is applied to each subset, and the rule synthesis unit synthesizes the rules belonging to separate documents into a canonical rule form, so that information can later be extracted smoothly from documents whose structure and wording are unknown. Specifically, the matcher is applied to the labeled corpus and matches the rule subset against the training samples: the system tries to cover the entities of the labeled samples with the existing generalized rules. When a target can be covered, the judger judges whether a training set exists. If no training set exists, the system completes rule generation for the rule subset and finally forms the rule base; if a training set exists, the system repeats the matching of the rule subset against the training samples. When the generalized rules cannot cover the entity of a labeled sample, the rule that produces that entity is added to the rule subset, and the generalizer generalizes the existing rules together with this rule. In this way a generalized rule representation is learned from the labeled corpus, which improves on the traditional approach in which domain experts must formulate the extraction rules.

4. According to the information acquisition system based on the financial heterogeneous big data and the management control method thereof, in the information extraction evaluation module, the first data comparator compares three parameters of accuracy, recall rate and F-measure in the first database with three reference values preset in the second database so as to evaluate the information extraction effect.
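The comparison in item 4 can be illustrated with a short sketch. It assumes that "accuracy" denotes precision in the usual information-extraction sense and that the F-measure is the balanced harmonic mean of precision and recall; the patent states neither. The `Metrics` type, the sample items and the reference values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float
    f_measure: float


def score(extracted: set[str], gold: set[str]) -> Metrics:
    """Compute precision, recall and balanced F-measure over item sets."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return Metrics(precision, recall, f_measure)


def meets_reference(measured: Metrics, reference: Metrics) -> bool:
    """Role of the first data comparator: every value must reach its reference."""
    return (
        measured.precision >= reference.precision
        and measured.recall >= reference.recall
        and measured.f_measure >= reference.f_measure
    )


# Illustrative data only: four labeled items, three extracted, two of them correct.
gold = {"issuer=Bank X", "term=90 days", "yield=4.2%", "min=50000"}
extracted = {"issuer=Bank X", "term=90 days", "yield=4.5%"}

measured = score(extracted, gold)                               # "first database" values
reference = Metrics(precision=0.6, recall=0.6, f_measure=0.6)   # "second database" values
ok = meets_reference(measured, reference)
print(measured, ok)
```

A failed comparison corresponds to the case in which an extraction task does not meet the preset requirements and the document is recorded for the next processing round.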

Drawings

FIG. 1 is a system diagram of an information acquisition system based on financial heterogeneous big data according to the present invention;

FIG. 2 is a system diagram of a Linux background server system of an information acquisition system based on financial heterogeneous big data, which is provided by the invention;

FIG. 3 is a system diagram of a heterogeneous information collecting and preprocessing module of an information collecting system based on financial heterogeneous big data according to the present invention;

FIG. 4 is a system diagram of an extraction rule generating module of an information collecting system based on financial heterogeneous big data according to the present invention;

FIG. 5 is a system diagram of an information extraction and evaluation module of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 6 is a system diagram of a crawler URL parser of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 7 is a system diagram of an analysis module of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 8 is a system diagram of a computer analysis unit of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 9 is a system diagram of a rule synthesis unit of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 10 is a system diagram of a heterogeneous information processing procedure of an information collection system based on financial heterogeneous big data according to the present invention;

FIG. 11 is a system diagram of the rule generation algorithm of an information acquisition system based on financial heterogeneous big data according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Referring to FIGS. 1-11, an information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, which are connected in sequence. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are connected in sequence. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are connected in sequence. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit connected to each other; the rule synthesis unit comprises a matcher, a judger and a generalizer, which are connected in sequence. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.

The crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.

The generalizer adopts a rule generalization method based on a heuristic function, and uses Laplacian error estimation as the heuristic function.

The first database comprises three parameters of accuracy, recall rate and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall rate and the F-measure.

The method comprises the following steps:

In use, the Linux background server system is responsible for collecting heterogeneous information on financial products from the internet information source and extracting structured data from it. Specifically, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them as PDF documents, which the PDF parser then parses into processable plain text data. When a document that cannot be processed is encountered, the crawler URL parser obtains web page data through the search engine retriever, and the HTML parser parses that data into plain text. Because the heterogeneous information collection and preprocessing module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing.

In the extraction rule generation module, the matcher is applied to the labeled corpus and matches the rule subset against the training samples: the system tries to cover the entities of the labeled samples with the existing generalized rules. When a target can be covered, the judger judges whether a training set exists. If no training set exists, the system completes rule generation for the rule subset and finally forms the rule base; if a training set exists, the system repeats the matching of the rule subset against the training samples. When the generalized rules cannot cover the entity of a labeled sample, the rule that produces that entity is added to the rule subset, and the generalizer generalizes the existing rules together with this rule. In this way a generalized rule representation is learned from the labeled corpus, which improves on the traditional approach in which domain experts must formulate the extraction rules.

In the information extraction evaluation module, the first data comparator compares the three parameters of accuracy, recall rate and F-measure in the first database with the three reference values preset in the second database, so as to evaluate the information extraction effect.
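The matcher, judger and generalizer flow described above resembles a covering loop, which can be sketched as follows. The rule representation is an assumption made purely for illustration: each rule is a tuple of context tokens preceding the target entity, with `*` as a wildcard, and generalization replaces differing tokens with the wildcard. The patent does not specify its rule format, and a fuller version would score each candidate generalization with the heuristic function before accepting it.

```python
from typing import Optional

WILDCARD = "*"
Rule = tuple[str, ...]  # hypothetical rule format: context tokens before the entity


def matches(rule: Rule, context: Rule) -> bool:
    """Matcher: a rule covers a sample if every token is equal or a wildcard."""
    return len(rule) == len(context) and all(
        r == WILDCARD or r == c for r, c in zip(rule, context)
    )


def generalize(existing: Rule, seed: Rule) -> Optional[Rule]:
    """Generalizer: merge two same-length rules, wildcarding the differences."""
    if len(existing) != len(seed):
        return None
    return tuple(a if a == b else WILDCARD for a, b in zip(existing, seed))


def generate_rules(contexts: list[Rule]) -> list[Rule]:
    rules: list[Rule] = []
    # Judger: the loop ends once no training samples remain.
    for context in contexts:
        if any(matches(rule, context) for rule in rules):
            continue  # already covered by an existing generalized rule
        seed = context  # the rule that produces this sample's entity
        for i, rule in enumerate(rules):
            merged = generalize(rule, seed)
            if merged is not None:
                rules[i] = merged  # generalize an existing rule with the seed
                break
        else:
            rules.append(seed)  # nothing to merge with: keep the seed as a new rule
    return rules


# Illustrative labeled contexts only; they are not taken from the patent.
samples = [
    ("expected", "yield", "of"),
    ("annual", "yield", "of"),
    ("term", "is"),
    ("expected", "yield", "of"),
]
rule_base = generate_rules(samples)
print(rule_base)
```

In this sketch the second sample is not covered by the first rule, so the two are generalized into one wildcard rule, and the fourth sample is then covered without any further change.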

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. The information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background service system, a Web client program system and a client terminal, and is characterized in that the internet information source, the Linux background service system, the Web client program system and the client terminal are sequentially connected, the Linux background service system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, the heterogeneous information collection and preprocessing module, the extraction rule generation module and the information extraction evaluation module are sequentially connected, the heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, the crawler URL parser, the PDF parser, the search engine retriever, the HTML parser and the data memory are sequentially connected, the extraction rule generation module comprises a rule classification unit and a rule synthesis unit, the rule classification unit is connected with the rule synthesis unit, the rule synthesis unit comprises a matcher, a judger and a generalizer, the matcher, the judger and the generalizer are sequentially connected, the information extraction evaluation module comprises a first database, a second database and a first data comparator, the first database and the second database are both connected with the first data comparator,

the crawler URL parser is used for searching newly issued financial bulletin information from an internet information source and parsing the newly issued financial bulletin information into a PDF document form or processing web page data through a search engine retriever;

the PDF parser is used for processing the PDF document into plain text data in a processable form;

the HTML parser is used for parsing the webpage data into plain text data;

the rule classification unit is used for classifying rules aiming at the same target entity in different documents so as to obtain a rule subset of the same target;

the matcher is used for matching a rule subset to a training sample;

the judger is used for judging whether a training set exists or not, the system can finish the rule generation of the rule subset when the training set does not exist and finally form a rule base, and the system can repeat the matching of the rule subset on the training sample when the training set exists;

the generalizer is used for generalizing the existing rules.

2. The information acquisition system based on the financial heterogeneous big data as claimed in claim 1, wherein the crawler URL parser comprises a controller module, a parsing module and a resource library module, the parsing module comprises a web page capturing unit, a web page information feature extraction unit, a web page information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the web page capturing unit, the web page information feature extraction unit and the web page information classification modeling unit are connected in sequence, the web page information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.

3. The financial heterogeneous big data based information acquisition system according to claim 1, wherein the generalizer adopts a rule generalization method based on heuristic functions and adopts Laplacian error estimation as a heuristic function.

4. The information acquisition system based on financial heterogeneous big data according to claim 1, wherein the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, recall and F-measure.

5. The control method of the information collection system based on the financial heterogeneous big data according to claim 1, characterized by comprising the following steps:

the first step: the system uses the crawler URL parser to search the internet information source for newly released financial products; when a PDF document that cannot be processed is encountered, the crawler URL parser searches for a substitute Web page through the search engine retriever; the heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information, and is responsible for parsing the heterogeneous documents, extracting text information from them and storing that text for subsequent processing;

the second step: in the extraction rule generation module, the system generates a rule set from the labeled training samples; the rule set is clustered and synthesized, and the result is imported into the final rule base;