CN108416034B - Information acquisition system based on financial heterogeneous big data and control method thereof - Google Patents
Information acquisition system based on financial heterogeneous big data and control method thereof
- Publication number
- CN108416034B
- Authority
- CN
- China
- Prior art keywords
- information
- rule
- heterogeneous
- data
- financial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information acquisition system based on financial heterogeneous big data and a control method thereof. The information acquisition system comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, which are sequentially connected. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module; the heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory. The invention can acquire heterogeneous documents of financial products in real time and extract from them the data of interest to the user, thereby ensuring the validity of the financial data provided and solving the problem that heterogeneous information is inconvenient to collect in the traditional financial field.
Description
Technical Field
The invention relates to the technical field of information acquisition systems, in particular to an information acquisition system based on financial heterogeneous big data.
Background
With the development of information technology, more and more financial activity takes place on the internet. A large amount of financial information is published online at every moment. Because the volume of information on the network is huge, its sources are not fixed, and it is expressed largely as text, financial information on the internet is still published mainly in semi-structured form. Compared with structured data, such heterogeneous information is easy to publish and collect, but it is noisy, highly redundant and inconvenient to read and understand, so effective information extraction becomes crucial.
Disclosure of Invention
The invention aims to solve the problems of high noise, large information redundancy and inconvenient reading and understanding that affect information acquisition in the existing financial field, and provides an information acquisition system based on financial heterogeneous big data and a control method thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
An information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, and is characterized in that the internet information source, the Linux background server system, the Web client program system and the client terminal are sequentially connected. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are sequentially connected. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are sequentially connected. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit connected to it; the rule synthesis unit comprises a matcher, a judger and a generalizer, which are sequentially connected. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.
Preferably, the crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
Preferably, the generalizer adopts a rule generalizing method based on a heuristic function, and adopts Laplacian error estimation as a heuristic function.
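The patent names Laplacian error estimation as the heuristic function but does not state a formula. A minimal sketch follows, assuming the form commonly used in rule-learning literature, (e + 1) / (n + 1), where n is the number of extractions a rule makes on the training set and e is the number of those that are wrong. The function names and the candidate tuple layout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def laplacian_error(extractions: int, errors: int) -> float:
    """Laplacian expected error (e + 1) / (n + 1).

    Assumed formula: the source only names the heuristic.
    """
    if extractions < 0 or errors < 0 or errors > extractions:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


def pick_best_rule(candidates: List[Tuple[str, int, int]]) -> Tuple[str, int, int]:
    """Return the candidate (name, extractions, errors) with the lowest estimate."""
    return min(candidates, key=lambda c: laplacian_error(c[1], c[2]))
```

Under this estimate a rule that has never fired scores 1.0, so a generalizer guided by it would tend to prefer rules that cover many training samples with few errors over rules that are merely untested.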
Preferably, the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall and the F-measure.
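The comparison between measured values and preset reference values could look roughly like the sketch below. It assumes that the translated term "accuracy" corresponds to precision in the usual information-extraction sense, that the F-measure is the balanced harmonic mean of precision and recall, and that an extraction run passes only when all three values reach their reference values. The `Metrics` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Metrics:
    precision: float  # assumed meaning of the patent's "accuracy"
    recall: float
    f_measure: float


def evaluate(extracted: Set[str], gold: Set[str]) -> Metrics:
    """Compute the three parameters for one extraction run."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return Metrics(precision, recall, f_measure)


def meets_reference(measured: Metrics, reference: Metrics) -> bool:
    """Comparator step: every measured value must reach its reference value."""
    return (
        measured.precision >= reference.precision
        and measured.recall >= reference.recall
        and measured.f_measure >= reference.f_measure
    )
```

Here `evaluate` plays the role of filling the first database, a `Metrics` instance holding thresholds plays the role of the second database, and `meets_reference` stands in for the first data comparator.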
Preferably, the operation is carried out as follows:
The first step: the system uses the crawler URL parser to search the internet information source for newly released financial products. When a PDF document that cannot be processed is encountered, the crawler URL parser looks for a substitute Web page through the search engine retriever. The heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information; it parses the heterogeneous documents, extracts text information from them and stores it for subsequent processing.
The second step: in the extraction rule generation module, the system generates a rule set from the labeled training samples, and after clustering and synthesis the resulting rules are imported into the final rule base.
The third step: through the information extraction evaluation module, the system applies the rule base to extract information from unseen data. The system runs iteratively: the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when an extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
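The control flow of the three steps can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the parsers, the search-engine lookup, the extractor and the requirement check are all passed in as plain callables, and every name here (`UnparseableDocument`, `collect_text`, `run_iteration`) is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class UnparseableDocument(Exception):
    """Raised by a PDF parser that cannot process a document (assumed convention)."""


def collect_text(
    url: str,
    parse_pdf: Callable[[str], str],
    find_web_substitute: Callable[[str], Optional[str]],
    parse_html: Callable[[str], str],
) -> Optional[str]:
    """Step 1: try the PDF parser, fall back to a substitute Web page."""
    try:
        return parse_pdf(url)
    except UnparseableDocument:
        substitute = find_web_substitute(url)
        if substitute is None:
            return None
        return parse_html(substitute)


def run_iteration(
    urls: Iterable[str],
    collect: Callable[[str], Optional[str]],
    extract: Callable[[str], object],
    meets_requirements: Callable[[object], bool],
) -> Tuple[Dict[str, object], List[str]]:
    """Step 3: apply the rule base; record documents that need another round."""
    results: Dict[str, object] = {}
    retry_queue: List[str] = []
    for url in urls:
        text = collect(url)
        if text is None:
            retry_queue.append(url)
            continue
        extracted = extract(text)
        if meets_requirements(extracted):
            results[url] = extracted
        else:
            retry_queue.append(url)  # recorded for the next processing round
    return results, retry_queue
```

Step 2, rule generation, is assumed to have produced the `extract` callable beforehand. Returning the retry queue separately keeps each iteration stateless, so a caller can feed the queue into the next round.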
Compared with the prior art, the invention provides an information acquisition system based on financial heterogeneous big data and a control method thereof, and the information acquisition system has the following beneficial effects:
1. In the information acquisition system based on financial heterogeneous big data and the control method thereof, the Linux background server system is responsible for collecting heterogeneous information about financial products from the internet information source and extracting structured data from it. The structured data is provided to the Web client program system, which can analyze and study the data and provide it to the client terminal.
2. In the heterogeneous information collection and preprocessing module, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them as PDF documents, which the PDF parser then converts into processable plain text. When a document that cannot be processed is encountered, the crawler URL parser obtains corresponding web page data through the search engine retriever, and the HTML parser converts it into plain text. Because the module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing.
3. In the extraction rule generation module, the rule classification unit classifies rules that target the same entity in different documents, yielding a rule subset for each target. A heuristic learning method is applied to each subset, and the rule synthesis unit synthesizes the rules derived from separate documents into a rule paradigm, so that information can later be extracted smoothly from documents of unknown structure and expression. Specifically, the matcher is applied to the labeled corpus and matches the rule subset against the training samples: the rule subsystem tries to cover the entity of each labeled sample with the existing generalized rules. When the target can be covered, the judger checks whether a training set still exists, that is, whether training samples remain. If none remain, rule generation for the subset is complete and the rule base is finally formed; if some remain, the system repeats the matching of the rule subset on the training samples. When the generalized rules cannot cover the entity of a labeled sample, a rule generated from that sample is added to the rule subset, and the generalizer generalizes the existing rules together with it. The method thus learns a generalized rule representation from the labeled corpus, improving on the traditional approach in which domain experts must formulate the extraction rules.
4. In the information extraction evaluation module, the first data comparator compares the three parameters in the first database (accuracy, recall and F-measure) with the three reference values preset in the second database, so as to evaluate the information extraction effect.
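The matcher, judger and generalizer loop described in item 3 can be illustrated with a toy covering algorithm. Everything specific here is an assumption made for the sketch: a rule is modeled as a (prefix, suffix) pair of literal context strings, generalization keeps only the context two rules share, and a fixed minimum context length stands in for the heuristic check that the patent assigns to Laplacian error estimation.

```python
import os
from typing import List, Tuple

Rule = Tuple[str, str]    # (text before the entity, text after the entity)
Sample = Tuple[str, str]  # (document text, labeled entity)


def seed_rule(sample: Sample) -> Rule:
    """Build the most specific rule for one labeled sample."""
    text, entity = sample
    start = text.index(entity)
    return text[:start], text[start + len(entity):]


def covers(rule: Rule, sample: Sample) -> bool:
    """Matcher: does the rule's context surround the labeled entity?"""
    prefix, suffix = rule
    text, entity = sample
    return prefix + entity + suffix in text


def generalize(a: Rule, b: Rule) -> Rule:
    """Generalizer: keep the shared end of the prefixes and shared start of the suffixes."""
    shared_prefix = os.path.commonprefix([a[0][::-1], b[0][::-1]])[::-1]
    shared_suffix = os.path.commonprefix([a[1], b[1]])
    return shared_prefix, shared_suffix


def learn_rules(samples: List[Sample], min_context: int = 3) -> List[Rule]:
    """Covering loop; min_context is an arbitrary stand-in for the heuristic."""
    rules: List[Rule] = []
    for sample in samples:  # judger: continue while training samples remain
        if any(covers(rule, sample) for rule in rules):
            continue  # already covered by an existing generalized rule
        new_rule = seed_rule(sample)
        for i, rule in enumerate(rules):
            merged = generalize(rule, new_rule)
            if len(merged[0]) >= min_context and len(merged[1]) >= min_context:
                rules[i] = merged  # merged rule still covers earlier samples
                break
        else:
            rules.append(new_rule)  # nothing to merge with: keep it as is
    return rules
```

For example, the samples `("Annual yield: 4.5% per year", "4.5%")` and `("Expected yield: 3.2% per annum", "3.2%")` collapse into the single rule `(" yield: ", " per ")`, which then also covers a third sample such as `("Target yield: 5.0% per quarter", "5.0%")` without further learning.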
Drawings
FIG. 1 is a system diagram of an information acquisition system based on financial heterogeneous big data according to the present invention;
FIG. 2 is a system diagram of a Linux background server system of an information acquisition system based on financial heterogeneous big data, which is provided by the invention;
FIG. 3 is a system diagram of a heterogeneous information collecting and preprocessing module of an information collecting system based on financial heterogeneous big data according to the present invention;
FIG. 4 is a system diagram of an extraction rule generating module of an information collecting system based on financial heterogeneous big data according to the present invention;
FIG. 5 is a system diagram of an information extraction and evaluation module of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 6 is a system diagram of a crawler URL parser of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 7 is a system diagram of an analysis module of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 8 is a system diagram of a computer analysis unit of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 9 is a system diagram of a rule synthesis unit of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 10 is a system diagram of a heterogeneous information processing procedure of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 11 is a system diagram of an information acquisition system based on financial heterogeneous big data and a rule generation algorithm thereof according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to FIGS. 1-11, an information acquisition system based on financial heterogeneous big data and a control method thereof comprise an internet information source, a Linux background server system, a Web client program system and a client terminal, and are characterized in that the internet information source, the Linux background server system, the Web client program system and the client terminal are sequentially connected. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are sequentially connected. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are sequentially connected. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit connected to it; the rule synthesis unit comprises a matcher, a judger and a generalizer, which are sequentially connected. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.
The crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
The generalization device adopts a rule generalization method based on a heuristic function, and adopts Laplacian error estimation as a heuristic function.
The first database comprises three parameters of accuracy, recall rate and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall rate and the F-measure.
The method comprises the following steps:
The first step: the system uses the crawler URL parser to search the internet information source for newly released financial products. When a PDF document that cannot be processed is encountered, the crawler URL parser looks for a substitute Web page through the search engine retriever. The heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information; it parses the heterogeneous documents, extracts text information from them and stores it for subsequent processing.
The second step: in the extraction rule generation module, the system generates a rule set from the labeled training samples, and after clustering and synthesis the resulting rules are imported into the final rule base.
The third step: through the information extraction evaluation module, the system applies the rule base to extract information from unseen data. The system runs iteratively: the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when an extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
When the system is used, the Linux background server system is responsible for collecting heterogeneous information about financial products from the internet information source and extracting structured data from it. Specifically, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them as PDF documents, which the PDF parser then converts into processable plain text. When a document that cannot be processed is encountered, the crawler URL parser obtains corresponding web page data through the search engine retriever, and the HTML parser converts it into plain text. Because the heterogeneous information collection and preprocessing module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing. In the extraction rule generation module, the matcher is applied to the labeled corpus and matches the rule subset against the training samples: the rule subsystem tries to cover the entity of each labeled sample with the existing generalized rules. When the target can be covered, the judger checks whether a training set still exists, that is, whether training samples remain. If none remain, rule generation for the subset is complete and the rule base is finally formed; if some remain, the system repeats the matching of the rule subset on the training samples. When the generalized rules cannot cover the entity of a labeled sample, a rule generated from that sample is added to the rule subset, and the generalizer generalizes the existing rules together with it.
The method thus learns a generalized rule representation from the labeled corpus, improving on the traditional approach in which domain experts must formulate the extraction rules. In the information extraction evaluation module, the first data comparator compares the three parameters in the first database (accuracy, recall and F-measure) with the three reference values preset in the second database, so as to evaluate the information extraction effect.
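The HTML parser's role of turning web page data into plain text can be sketched with Python's standard library alone. This is an illustrative stand-in, not the parser the patent describes: the class and function names are hypothetical, and skipping `script` and `style` content is an assumption about what counts as noise. A PDF counterpart is omitted because it would need a third-party library.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and not self._skip_depth:
            self._chunks.append(stripped)

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Parse one page of web data into a single plain-text string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string is the kind of plain text that the data memory would hold for the rule generation and extraction stages.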
The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.
Claims (5)
1. The information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background service system, a Web client program system and a client terminal, and is characterized in that the internet information source, the Linux background service system, the Web client program system and the client terminal are sequentially connected, the Linux background service system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, the heterogeneous information collection and preprocessing module, the extraction rule generation module and the information extraction evaluation module are sequentially connected, the heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, the crawler URL parser, the PDF parser, the search engine retriever, the HTML parser and the data memory are sequentially connected, the extraction rule generation module comprises a rule classification unit and a rule synthesis unit, the rule classification unit is connected with the rule synthesis unit, the rule synthesis unit comprises a matcher, a comparator and a generalization device, the matcher, a judger and the generalization device are sequentially connected, the information extraction evaluation module comprises a first database, a second database and a first data comparator, the first database and the second database are both connected with the first data comparator,
the crawler URL parser is used for searching newly issued financial bulletin information from an internet information source and parsing the newly issued financial bulletin information into a PDF document form or processing web page data through a search engine retriever;
the PDF parser is used for processing the PDF document into plain text data in a processable form;
the HTML parser is used for parsing the webpage data into plain text data;
the rule classification unit is used for classifying rules aiming at the same target entity in different documents so as to obtain a rule subset of the same target;
the matcher is used for matching a rule subset to a training sample;
the judger is used for judging whether a training set exists or not, the system can finish the rule generation of the rule subset when the training set does not exist and finally form a rule base, and the system can repeat the matching of the rule subset on the training sample when the training set exists;
the generalizer is used for generalizing the existing rules.
2. The information acquisition system based on the financial heterogeneous big data as claimed in claim 1, wherein the crawler URL parser comprises a controller module, a parsing module and a resource library module, the parsing module comprises a web page capturing unit, a web page information feature extraction unit, a web page information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the web page capturing unit, the web page information feature extraction unit and the web page information classification modeling unit are connected in sequence, the web page information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
3. The financial heterogeneous big data based information acquisition system according to claim 1, wherein the generalizer adopts a rule generalization method based on heuristic functions and adopts Laplacian error estimation as a heuristic function.
4. The information acquisition system based on financial heterogeneous big data according to claim 1, wherein the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, recall and F-measure.
5. The control method of the information collection system based on the financial heterogeneous big data according to claim 1, characterized by comprising the following steps:
the first step: the system uses the crawler URL parser to search the internet information source for newly released financial products; when a PDF document that cannot be processed is encountered, the crawler URL parser looks for a substitute Web page through the search engine retriever; the heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information, and is responsible for parsing the heterogeneous documents, extracting text information from them and storing it for subsequent processing;
the second step: in the extraction rule generation module, the system generates a rule set from the labeled training samples, and after clustering and synthesis the resulting rules are imported into the final rule base;
the third step: through the information extraction evaluation module, the system applies the rule base to extract information from unseen data; the system runs iteratively, the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when an extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810201458.1A CN108416034B (en) | 2018-03-12 | 2018-03-12 | Information acquisition system based on financial heterogeneous big data and control method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810201458.1A CN108416034B (en) | 2018-03-12 | 2018-03-12 | Information acquisition system based on financial heterogeneous big data and control method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108416034A CN108416034A (en) | 2018-08-17 |
CN108416034B CN108416034B (en) | 2021-11-16 |
Family ID=63131071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810201458.1A Active CN108416034B (en) | 2018-03-12 | 2018-03-12 | Information acquisition system based on financial heterogeneous big data and control method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108416034B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635252A (en) * | 2018-10-25 | 2019-04-16 | 北京中关村科金技术有限公司 | Method, apparatus and system for parsing key information of insurance products based on PDF format |
CN110889632B (en) * | 2019-11-27 | 2023-10-13 | 国网能源研究院有限公司 | Data monitoring and analyzing system of company image lifting system |
CN111209322B (en) * | 2019-12-26 | 2023-12-15 | 上海大智慧财汇数据科技有限公司 | Financial information acquisition processing system and method |
CN112035837B (en) * | 2020-07-31 | 2023-06-20 | 中国人民解放军战略支援部队信息工程大学 | Malicious PDF document detection system and method based on mimicry defense |
CN113253659A (en) * | 2021-06-04 | 2021-08-13 | 厦门致上信息科技有限公司 | Financial big data automatic acquisition and intelligent analysis system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609512A (en) * | 2012-02-07 | 2012-07-25 | 北京中机科海科技发展有限公司 | System and method for heterogeneous information mining and visual analysis |
CN104881488A (en) * | 2015-06-05 | 2015-09-02 | 焦点科技股份有限公司 | Relational table-based extraction method of configurable information |
CN104933095A (en) * | 2015-05-22 | 2015-09-23 | 中国电子科技集团公司第十研究所 | Heterogeneous information universality correlation analysis system and analysis method thereof |
CN106294885A (en) * | 2016-10-09 | 2017-01-04 | 华东师范大学 | Data collection and annotation method for heterogeneous web pages |
CN106354843A (en) * | 2016-08-31 | 2017-01-25 | 虎扑(上海)文化传播股份有限公司 | Web crawler system and method |
CN106649260A (en) * | 2016-10-19 | 2017-05-10 | 中国计量大学 | Product feature structure tree construction method based on comment text mining |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7370381B2 (en) * | 2004-11-22 | 2008-05-13 | Truveo, Inc. | Method and apparatus for a ranking engine |
CN101582075B (en) * | 2009-06-24 | 2011-05-11 | 大连海事大学 | Web information extraction system |
CN102750316B (en) * | 2012-04-25 | 2015-10-28 | 北京航空航天大学 | Conceptual relation label extraction method based on semantic co-occurrence patterns |
CN102708096B (en) * | 2012-05-29 | 2014-10-15 | 代松 | Network intelligence public sentiment monitoring system based on semantics and work method thereof |
CN103049575B (en) * | 2013-01-05 | 2015-08-19 | 华中科技大学 | Topic-adaptive academic conference search system |
CN104794211A (en) * | 2015-04-24 | 2015-07-22 | 清华大学 | Method and system for extracting sentiment inducements and analyzing inducement elements based on microblog text |
CN106202044A (en) * | 2016-07-07 | 2016-12-07 | 武汉理工大学 | Entity relation extraction method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN108416034A (en) | 2018-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108416034B (en) | Information acquisition system based on financial heterogeneous big data and control method thereof | |
CN103914478B (en) | Webpage training method and system, webpage Forecasting Methodology and system | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN102073726B (en) | Structured data import method and device for search engine system | |
CN102073725B (en) | Method for searching structured data and search engine system for implementing same | |
US8185530B2 (en) | Method and system for web document clustering | |
CN104951539A (en) | Internet data center harmful information monitoring system | |
CN106776567B (en) | Internet big data analysis and extraction method and system | |
CN105279277A (en) | Knowledge data processing method and device | |
CN104899324A (en) | Sample training system based on IDC (internet data center) harmful information monitoring system | |
CN103530429A (en) | Webpage content extracting method | |
CN110956021A (en) | Original article generation method, device, system and server | |
CN105069112A (en) | Industry vertical search engine system | |
CN109948154A (en) | A kind of personage's acquisition and relationship recommender system and method based on name | |
CN113918794B (en) | Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium | |
US11334592B2 (en) | Self-orchestrated system for extraction, analysis, and presentation of entity data | |
CN112035723A (en) | Resource library determination method and device, storage medium and electronic device | |
Wang et al. | Multi-modal transformer using two-level visual features for fake news detection | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
CN113111645A (en) | Media text similarity detection method | |
CN115801455B (en) | Method and device for detecting counterfeit website based on website fingerprint | |
CN114238735B (en) | Intelligent internet data acquisition method | |
KR101880474B1 (en) | Keyword-based service provide method for high value added content information service and method and recording medium storing program for executing the same and recording medium storing program for executing the same | |
CN100357942C (en) | Mobile internet intelligent information retrieval engine based on key-word retrieval | |
Zhang et al. | Research on keyword extraction and sentiment orientation analysis of educational texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||