CN108416034B - Information acquisition system based on financial heterogeneous big data and control method thereof - Google Patents

Publication number
CN108416034B
Authority
CN
China
Prior art keywords
information
rule
heterogeneous
data
financial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810201458.1A
Other languages
Chinese (zh)
Other versions
CN108416034A (en)
Inventor
孙善辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-03-12
Filing date
2018-03-12
Publication date
2021-11-16
Application filed by Suzhou University
Priority to CN201810201458.1A
Publication of CN108416034A
Application granted
Publication of CN108416034B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information acquisition system based on financial heterogeneous big data and a control method thereof. The information acquisition system comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, which are connected in sequence. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory. The invention can collect heterogeneous documents on financial products in real time and extract from them the data that is of interest to the user, thereby ensuring the validity of the financial data provided and solving the problem that heterogeneous information is inconvenient to collect in the traditional financial field.

Description

Information acquisition system based on financial heterogeneous big data and control method thereof
Technical Field
The invention relates to the technical field of information acquisition systems, in particular to an information acquisition system based on financial heterogeneous big data.
Background
With the development of information technology, more and more financial activity takes place on the internet. A large amount of financial information is published online at every moment. Because the volume of information on the network is huge, the information sources are not fixed, and the content is expressed largely as free text, financial information on the internet is still published mainly in semi-structured form. Compared with structured data, such heterogeneous information is easy to publish and collect, but it is noisy, highly redundant and inconvenient to read and understand, so effective information extraction becomes crucial.
Disclosure of Invention
The invention aims to solve the problems of high noise, large information redundancy and inconvenient reading and understanding that affect information acquisition in the existing financial field, and provides an information acquisition system based on financial heterogeneous big data and a control method thereof.
To achieve this purpose, the invention adopts the following technical solution:
An information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, and is characterized in that the internet information source, the Linux background server system, the Web client program system and the client terminal are connected in sequence. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are connected in sequence. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are connected in sequence. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit, the rule classification unit being connected with the rule synthesis unit. The rule synthesis unit comprises a matcher, a judger and a generalizer, which are connected in sequence. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.
Preferably, the crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
Preferably, the generalizer adopts a rule generalizing method based on a heuristic function, and adopts Laplacian error estimation as a heuristic function.
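The text names Laplacian error estimation as the heuristic but does not give a formula. As a minimal sketch, the snippet below assumes one form commonly used in covering-style rule learners, (e + 1) / (n + 1), where n is the number of extractions a candidate rule makes on the training set and e is the number of those that are wrong. The `RuleStats` record, the candidate names and the counts are all hypothetical and serve only to illustrate how such a score could rank candidate generalizations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Hypothetical bookkeeping for one candidate rule on the training set."""
    name: str
    extractions: int  # n: how many spans the rule extracted
    errors: int       # e: how many of those spans were wrong


def laplacian_error(stats: RuleStats) -> float:
    """Laplacian expected error, assumed here to be (e + 1) / (n + 1)."""
    if not 0 <= stats.errors <= stats.extractions:
        raise ValueError("errors must lie between 0 and extractions")
    return (stats.errors + 1) / (stats.extractions + 1)


def best_candidate(candidates: list[RuleStats]) -> RuleStats:
    """Pick the candidate generalization with the lowest estimated error."""
    return min(candidates, key=laplacian_error)


# Illustrative counts only; they are not taken from the patent.
candidates = [
    RuleStats("narrow", extractions=3, errors=0),
    RuleStats("broad", extractions=40, errors=4),
    RuleStats("too_broad", extractions=60, errors=30),
]
chosen = best_candidate(candidates)
```

Under this assumed formula, a rule with few extractions is penalized even when it makes no errors, which tends to favor rules that generalize well while staying accurate.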
Preferably, the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall and the F-measure.
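To make the comparison of measured values against preset reference values concrete, here is a small sketch. It assumes that "accuracy" is meant in the sense of precision (the share of extracted items that are correct), since F-measure is conventionally the harmonic mean of precision and recall; that reading, the `Metrics` record, the sample items and the reference thresholds are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float  # assumed reading of "accuracy" in the text
    recall: float
    f_measure: float


def score(extracted: set[str], gold: set[str]) -> Metrics:
    """Compute precision, recall and F-measure for one extraction run."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return Metrics(precision, recall, f_measure)


def meets_reference(measured: Metrics, reference: Metrics) -> bool:
    """Play the role of the first data comparator: every value must reach its reference."""
    return (measured.precision >= reference.precision
            and measured.recall >= reference.recall
            and measured.f_measure >= reference.f_measure)


# Hypothetical labeled items and extraction output.
gold = {"rate=4.5%", "term=90d", "issuer=BankA", "min=50000"}
extracted = {"rate=4.5%", "term=90d", "issuer=BankB"}

measured = score(extracted, gold)                                  # "first database" values
reference = Metrics(precision=0.6, recall=0.5, f_measure=0.55)     # "second database" values
ok = meets_reference(measured, reference)
```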
Preferably, the control method comprises the following steps:
Step 1: The system uses the crawler URL parser to search the internet information source for newly released financial products. When a PDF document that cannot be processed is encountered, the crawler URL parser uses the search engine retriever to find a Web page as a substitute. The heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information; it parses the heterogeneous documents, extracts text information from them and stores the text for subsequent processing.
Step 2: In the extraction rule generation module, the system generates a rule set from the labeled training samples; after clustering and synthesis, the resulting rules are imported into the final rule base.
Step 3: Through the information extraction evaluation module, the system applies the rule base to extract information from unknown data. The system runs iteratively: the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when an extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
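The control flow of the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the parsers, the search engine retriever and the rule-based extractor are passed in as plain functions, and every function name, URL and stub below is hypothetical rather than part of the described system.

```python
class UnparseablePDF(Exception):
    """Raised by the (hypothetical) PDF parser when a document cannot be processed."""


def collect_text(url, parse_pdf, find_substitute_page, parse_html):
    """Step 1: try the PDF parser, fall back to a substitute Web page."""
    try:
        return parse_pdf(url)
    except UnparseablePDF:
        page = find_substitute_page(url)
        return parse_html(page) if page is not None else None


def run_round(urls, collect, extract, meets_requirements):
    """Step 3: extract from each document and record the ones that fall short."""
    results, retry = {}, []
    for url in urls:
        text = collect(url)
        fields = extract(text) if text is not None else None
        if fields is not None and meets_requirements(fields):
            results[url] = fields
        else:
            retry.append(url)  # queued for the next processing round
    return results, retry


# Stubs standing in for the real components (all hypothetical).
def fake_parse_pdf(url):
    if url.endswith("scan.pdf"):
        raise UnparseablePDF(url)
    if url.endswith("notice.pdf"):
        return "no structured fields here"
    return "rate: 4.5%"


def fake_find_page(url):
    return "<p>rate: 3.9%</p>"


def fake_parse_html(page):
    return page.replace("<p>", "").replace("</p>", "")


def fake_extract(text):
    key, _, value = text.partition(":")
    return {key.strip(): value.strip()}


base = "https://example.invalid/"
urls = [base + "a.pdf", base + "scan.pdf", base + "notice.pdf"]
results, retry = run_round(
    urls,
    collect=lambda u: collect_text(u, fake_parse_pdf, fake_find_page, fake_parse_html),
    extract=fake_extract,
    meets_requirements=lambda fields: "rate" in fields,
)
```

Step 2 (rule generation) is deliberately left out here; `fake_extract` simply stands in for applying a finished rule base.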
Compared with the prior art, the invention provides an information acquisition system based on financial heterogeneous big data and a control method thereof, and the information acquisition system has the following beneficial effects:
1. In the information acquisition system based on financial heterogeneous big data and the control method thereof, the Linux background server system collects heterogeneous information on financial products from the internet information source and extracts structured data from it. The structured data is provided to the Web client program system, which can analyze and study the data and provide it to the client terminal.
2. In the heterogeneous information collection and preprocessing module, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them in PDF document form, and the PDF parser then converts them into processable plain text data. When a document that cannot be processed is encountered, the crawler URL parser obtains web page data through the search engine retriever, and the HTML parser converts it into plain text data. Because the module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing.
3. In the extraction rule generation module, the rule classification unit classifies the rules that target the same entity in different documents, yielding a rule subset for each target. A heuristic learning method is applied to each subset, and the rule synthesis unit synthesizes the rules belonging to separate documents into a normal-form rule, so that information can later be extracted from documents whose structure and wording are unknown. Specifically, the matcher is applied to the labeled corpus and matches the rule subset against the training samples, attempting to cover the entity of each labeled sample with the existing generalized rules. When the target can be covered, the judger determines whether a training set still exists: if not, rule generation for the rule subset is complete and the rule base is formed; if so, the system repeats the matching of the rule subset on the training samples. When the generalized rules cannot cover the entity of a labeled sample, a rule generated from that sample is added to the rule subset, and the generalizer generalizes the existing rules together with this new rule. The method thus obtains a generalized rule representation from the labeled corpus and improves on the traditional approach, in which domain experts must formulate the extraction rules.
4. In the information extraction evaluation module, the first data comparator compares the three parameters of accuracy, recall and F-measure in the first database with the three reference values preset in the second database, so as to evaluate the information extraction effect.
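The covering loop described in item 3 (match, check coverage, add a rule for uncovered samples, generalize) can be illustrated with a deliberately simplified sketch. The patent does not specify a rule representation, so the example assumes a rule is just a pair of literal context strings around the entity. It also replaces the Laplacian heuristic with a plain consistency check: a generalized rule is accepted only if it still extracts the right entity from every sample seen so far. All names and sample sentences are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    prefix: str  # literal text expected immediately before the entity
    suffix: str  # literal text expected immediately after the entity


def apply_rule(rule: Rule, text: str) -> Optional[str]:
    """Matcher: return the span between prefix and suffix, or None."""
    start = text.find(rule.prefix)
    if start < 0:
        return None
    start += len(rule.prefix)
    if rule.suffix:
        end = text.find(rule.suffix, start)
        if end < 0:
            return None
    else:
        end = len(text)
    return text[start:end]


def common_suffix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
        i += 1
    return a[len(a) - i:]


def common_prefix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def generalize(old: Rule, new: Rule) -> Rule:
    """Generalizer: keep only the context that both rules share."""
    return Rule(common_suffix(old.prefix, new.prefix),
                common_prefix(old.suffix, new.suffix))


def learn_rules(samples: list[tuple[str, str]]) -> list[Rule]:
    rules: list[Rule] = []
    seen: list[tuple[str, str]] = []
    for text, entity in samples:          # loop ends when no samples remain
        seen.append((text, entity))
        if any(apply_rule(r, text) == entity for r in rules):
            continue                      # already covered by an existing rule
        idx = text.index(entity)
        seed = Rule(text[:idx], text[idx + len(entity):])
        for i, old in enumerate(rules):
            candidate = generalize(old, seed)
            if all(apply_rule(candidate, t) == e for t, e in seen):
                rules[i] = candidate      # accept the generalized rule
                break
        else:
            rules.append(seed)            # no safe generalization found
    return rules


samples = [
    ("Product A expected yield: 4.5% per year", "4.5%"),
    ("Product B expected yield: 3.9% per year", "3.9%"),
    ("Product C expected yield: 5.1% per year", "5.1%"),
]
rules = learn_rules(samples)
unseen = apply_rule(rules[0], "Fund Z expected yield: 2.0% per year")
```

In this toy run the first two samples are merged into one rule with the shared context, and the third sample is already covered, so no further rule is created.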
Drawings
FIG. 1 is a system diagram of an information acquisition system based on financial heterogeneous big data according to the present invention;
FIG. 2 is a system diagram of a Linux background server system of an information acquisition system based on financial heterogeneous big data, which is provided by the invention;
FIG. 3 is a system diagram of a heterogeneous information collecting and preprocessing module of an information collecting system based on financial heterogeneous big data according to the present invention;
FIG. 4 is a system diagram of an extraction rule generating module of an information collecting system based on financial heterogeneous big data according to the present invention;
FIG. 5 is a system diagram of an information extraction and evaluation module of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 6 is a system diagram of a crawler URL parser of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 7 is a system diagram of an analysis module of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 8 is a system diagram of a computer analysis unit of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 9 is a system diagram of a rule synthesis unit of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 10 is a system diagram of a heterogeneous information processing procedure of an information collection system based on financial heterogeneous big data according to the present invention;
FIG. 11 is a diagram of the rule generation algorithm of an information acquisition system based on financial heterogeneous big data according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to FIGS. 1-11, an information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, which are connected in sequence. The Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, which are connected in sequence. The heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, which are connected in sequence. The extraction rule generation module comprises a rule classification unit and a rule synthesis unit, the rule classification unit being connected with the rule synthesis unit. The rule synthesis unit comprises a matcher, a judger and a generalizer, which are connected in sequence. The information extraction evaluation module comprises a first database, a second database and a first data comparator, and the first database and the second database are both connected with the first data comparator.
The crawler URL parser comprises a controller module, a parsing module and a resource library module, wherein the parsing module comprises a webpage grabbing unit, a webpage information feature extraction unit, a webpage information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the webpage grabbing unit, the webpage information feature extraction unit and the webpage information classification modeling unit are sequentially connected, the webpage information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
The generalizer adopts a rule generalization method based on a heuristic function, and adopts Laplacian error estimation as the heuristic function.
The first database comprises three parameters of accuracy, recall rate and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, the recall rate and the F-measure.
The method comprises the following steps:
Step 1: The system uses the crawler URL parser to search the internet information source for newly released financial products. When a PDF document that cannot be processed is encountered, the crawler URL parser uses the search engine retriever to find a Web page as a substitute. The heterogeneous information collection and preprocessing module is provided with parsers for PDF documents and Web information; it parses the heterogeneous documents, extracts text information from them and stores the text for subsequent processing.
Step 2: In the extraction rule generation module, the system generates a rule set from the labeled training samples; after clustering and synthesis, the resulting rules are imported into the final rule base.
Step 3: Through the information extraction evaluation module, the system applies the rule base to extract information from unknown data. The system runs iteratively: the heterogeneous information collection and preprocessing module continuously supplies text information to the subsequent modules, and when an extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
In use, the Linux background server system collects heterogeneous information on financial products from the internet information source and extracts structured data from it. Specifically, the crawler URL parser searches the internet information source for newly issued financial announcements and obtains them in PDF document form, and the PDF parser then converts them into processable plain text data. When a document that cannot be processed is encountered, the crawler URL parser obtains web page data through the search engine retriever, and the HTML parser converts it into plain text data. Because the heterogeneous information collection and preprocessing module is provided with parsers for both PDF documents and Web information, it can parse a variety of heterogeneous documents, extract structured text information from them and store it in the data memory for subsequent processing.
In the extraction rule generation module, the matcher is applied to the labeled corpus and matches the rule subset against the training samples, attempting to cover the entity of each labeled sample with the existing generalized rules. When the target can be covered, the judger determines whether a training set still exists: if not, rule generation for the rule subset is complete and the rule base is formed; if so, the system repeats the matching of the rule subset on the training samples. When the generalized rules cannot cover the entity of a labeled sample, a rule generated from that sample is added to the rule subset, and the generalizer generalizes the existing rules together with this new rule. The method thus obtains a generalized rule representation from the labeled corpus and improves on the traditional approach, in which domain experts must formulate the extraction rules. In the information extraction evaluation module, the first data comparator compares the three parameters of accuracy, recall and F-measure in the first database with the three reference values preset in the second database, so as to evaluate the information extraction effect.
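As an illustration of the HTML-to-plain-text step mentioned above, the following sketch uses only Python's standard-library `html.parser`. It is one possible approach under simple assumptions (visible text is everything outside `script` and `style` elements), not the parser used by the described system; the class name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical announcement page.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Notice</h1><p>Expected yield: 4.5%</p>"
    "<script>track()</script></body></html>"
)
plain = html_to_text(page)
```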
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (5)

1. The information acquisition system based on financial heterogeneous big data comprises an internet information source, a Linux background server system, a Web client program system and a client terminal, and is characterized in that the internet information source, the Linux background server system, the Web client program system and the client terminal are sequentially connected, the Linux background server system comprises a heterogeneous information collection and preprocessing module, an extraction rule generation module and an information extraction evaluation module, the heterogeneous information collection and preprocessing module, the extraction rule generation module and the information extraction evaluation module are sequentially connected, the heterogeneous information collection and preprocessing module comprises a crawler URL parser, a PDF parser, a search engine retriever, an HTML parser and a data memory, the crawler URL parser, the PDF parser, the search engine retriever, the HTML parser and the data memory are sequentially connected, the extraction rule generation module comprises a rule classification unit and a rule synthesis unit, the rule classification unit is connected with the rule synthesis unit, the rule synthesis unit comprises a matcher, a judger and a generalizer, the matcher, the judger and the generalizer are sequentially connected, the information extraction evaluation module comprises a first database, a second database and a first data comparator, the first database and the second database are both connected with the first data comparator,
the crawler URL parser is used for searching the internet information source for newly issued financial announcement information and obtaining it in PDF document form, or obtaining web page data through the search engine retriever;
the PDF parser is used for processing the PDF document into plain text data in a processable form;
the HTML parser is used for parsing the webpage data into plain text data;
the rule classification unit is used for classifying rules aiming at the same target entity in different documents so as to obtain a rule subset of the same target;
the matcher is used for matching a rule subset to a training sample;
the judger is used for judging whether a training set still exists: when no training set exists, the system completes rule generation for the rule subset and finally forms a rule base; when a training set exists, the system repeats the matching of the rule subset on the training samples;
the generalizer is used for generalizing the existing rules.
2. The information acquisition system based on the financial heterogeneous big data as claimed in claim 1, wherein the crawler URL parser comprises a controller module, a parsing module and a resource library module, the parsing module comprises a web page capturing unit, a web page information feature extraction unit, a web page information classification modeling unit, a data storage unit, a computer analysis unit and a computer display unit, the web page capturing unit, the web page information feature extraction unit and the web page information classification modeling unit are connected in sequence, the web page information classification modeling unit and the data storage unit are both connected with the computer analysis unit, and the computer analysis unit is connected with the computer display unit; the computer analysis unit includes a data extractor, a data receiver, and a second data comparator.
3. The financial heterogeneous big data based information acquisition system according to claim 1, wherein the generalizer adopts a rule generalization method based on heuristic functions and adopts Laplacian error estimation as a heuristic function.
4. The information acquisition system based on financial heterogeneous big data according to claim 1, wherein the first database comprises three parameters of accuracy, recall and F-measure, and the second database stores three preset reference values respectively corresponding to the accuracy, recall and F-measure.
5. A control method of the information acquisition system based on financial heterogeneous big data according to claim 1, characterized by comprising the following steps:
Step one: the system uses the crawler URL parser to search Internet information sources for newly released financial products; when a PDF document that cannot be processed is encountered, the crawler URL parser finds a substitute Web page through the search engine retriever; the heterogeneous information acquisition and preprocessing module, which is provided with parsers for PDF documents and Web information, parses the heterogeneous documents, extracts text information from them and stores it as data for subsequent processing;
Step two: in the extraction rule generation module, the system generates a rule set from the labeled training samples, and after clustering and synthesis the resulting rules are imported into the final rule base;
Step three: through the information extraction evaluation module, the system applies the rule base to extract information from unseen data; the system runs iteratively, with the heterogeneous information acquisition and preprocessing module continuously supplying text information to the subsequent modules, and when a given extraction task fails to meet the preset requirements, the document is recorded and queued for the next round of heterogeneous information processing.
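
The claims name a Laplacian error estimate (claim 3) and a comparison of accuracy, recall and F-measure against preset reference values (claim 4), but give no formulas. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the Laplacian estimate is taken in the form (e + 1) / (n + 1) used by some rule learners, "accuracy" is read as the fraction of extracted items that are correct, and every function name, class name and threshold is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


def laplacian_error(errors: int, extractions: int) -> float:
    """Assumed form of the Laplacian error estimate: (e + 1) / (n + 1).

    errors      -- wrong extractions a candidate rule makes on training data
    extractions -- all extractions the rule makes on training data
    Lower is better; rules with little supporting evidence are penalised.
    """
    if errors < 0 or extractions < errors:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


@dataclass(frozen=True)
class Metrics:
    accuracy: float   # assumed: correct extractions / all extractions
    recall: float     # correct extractions / all relevant items
    f_measure: float  # harmonic mean of the two values above


def evaluate(correct: int, extracted: int, relevant: int) -> Metrics:
    """Compute the three parameters of claim 4 from raw counts."""
    accuracy = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    total = accuracy + recall
    f_measure = 2 * accuracy * recall / total if total else 0.0
    return Metrics(accuracy, recall, f_measure)


def meets_reference(measured: Metrics, reference: Metrics) -> bool:
    """True when every measured value reaches its preset reference value."""
    return (measured.accuracy >= reference.accuracy
            and measured.recall >= reference.recall
            and measured.f_measure >= reference.f_measure)


def documents_to_retry(results: Dict[str, Metrics],
                       reference: Metrics) -> List[str]:
    """Documents whose extraction missed the reference values (step three)."""
    return [doc_id for doc_id, measured in results.items()
            if not meets_reference(measured, reference)]
```

A caller might rank candidate rules by `laplacian_error`, keep the lowest-scoring ones, and pass per-document results to `documents_to_retry` to decide which documents re-enter the next processing round. The reference values themselves would come from the second database described in claim 4.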

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810201458.1A CN108416034B (en) 2018-03-12 2018-03-12 Information acquisition system based on financial heterogeneous big data and control method thereof

Publications (2)

Publication Number Publication Date
CN108416034A (en) 2018-08-17
CN108416034B (en) 2021-11-16

Family

ID=63131071

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201810201458.1A Active CN108416034B (en) 2018-03-12 2018-03-12 Information acquisition system based on financial heterogeneous big data and control method thereof

Country Status (1)

Country Link
CN (1) CN108416034B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635252A (en) * 2018-10-25 2019-04-16 北京中关村科金技术有限公司 Method, apparatus and system for parsing key information of insurance products based on PDF format
CN110889632B (en) * 2019-11-27 2023-10-13 国网能源研究院有限公司 Data monitoring and analyzing system of company image lifting system
CN111209322B (en) * 2019-12-26 2023-12-15 上海大智慧财汇数据科技有限公司 Financial information acquisition processing system and method
CN112035837B (en) * 2020-07-31 2023-06-20 中国人民解放军战略支援部队信息工程大学 Malicious PDF document detection system and method based on mimicry defense
CN113253659A (en) * 2021-06-04 2021-08-13 厦门致上信息科技有限公司 Financial big data automatic acquisition and intelligent analysis system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609512A (en) * 2012-02-07 2012-07-25 北京中机科海科技发展有限公司 System and method for heterogeneous information mining and visual analysis
CN104881488A (en) * 2015-06-05 2015-09-02 焦点科技股份有限公司 Relational table-based extraction method of configurable information
CN104933095A (en) * 2015-05-22 2015-09-23 中国电子科技集团公司第十研究所 Heterogeneous information universality correlation analysis system and analysis method thereof
CN106294885A (en) * 2016-10-09 2017-01-04 华东师范大学 Data collection and annotation method for heterogeneous web pages
CN106354843A (en) * 2016-08-31 2017-01-25 虎扑(上海)文化传播股份有限公司 Web crawler system and method
CN106649260A (en) * 2016-10-19 2017-05-10 中国计量大学 Product feature structure tree construction method based on comment text mining

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7370381B2 (en) * 2004-11-22 2008-05-13 Truveo, Inc. Method and apparatus for a ranking engine
CN101582075B (en) * 2009-06-24 2011-05-11 大连海事大学 Web information extraction system
CN102750316B (en) * 2012-04-25 2015-10-28 北京航空航天大学 Concept relation label extraction method based on semantic co-occurrence patterns
CN102708096B (en) * 2012-05-29 2014-10-15 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN103049575B (en) * 2013-01-05 2015-08-19 华中科技大学 Topic-adaptive academic conference search system
CN104794211A (en) * 2015-04-24 2015-07-22 清华大学 Method and system for extracting sentiment inducements and analyzing inducement elements based on microblog text
CN106202044A (en) * 2016-07-07 2016-12-07 武汉理工大学 Entity relation extraction method based on deep neural network

Also Published As

Publication number Publication date
CN108416034A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN103914478B (en) Webpage training method and system, webpage prediction method and system
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN102073726B (en) Structured data import method and device for search engine system
CN102073725B (en) Method for searching structured data and search engine system for implementing same
US8185530B2 (en) Method and system for web document clustering
CN104951539A (en) Internet data center harmful information monitoring system
CN106776567B (en) Internet big data analysis and extraction method and system
CN105279277A (en) Knowledge data processing method and device
CN104899324A (en) Sample training system based on IDC (internet data center) harmful information monitoring system
CN103530429A (en) Webpage content extracting method
CN110956021A (en) Original article generation method, device, system and server
CN105069112A (en) Industry vertical search engine system
CN109948154A (en) Person information acquisition and relationship recommendation system and method based on names
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
US11334592B2 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
Wang et al. Multi-modal transformer using two-level visual features for fake news detection
CN104778232B (en) Searching result optimizing method and device based on long query
CN113111645A (en) Media text similarity detection method
CN115801455B (en) Method and device for detecting counterfeit website based on website fingerprint
CN114238735B (en) Intelligent internet data acquisition method
KR101880474B1 (en) Keyword-based service providing method for high value-added content information service, and recording medium storing program for executing the same
CN100357942C (en) Mobile internet intelligent information retrieval engine based on key-word retrieval
Zhang et al. Research on keyword extraction and sentiment orientation analysis of educational texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant