CN104598536A - Structured processing method of distributed network information - Google Patents

Info

Publication number
CN104598536A
Authority
CN
China
Prior art keywords
webpage
file
network information
structuring
doc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410840847.0A
Other languages
Chinese (zh)
Other versions
CN104598536B (en)
Inventor
常鹏飞
伍赛
陈珂
寿黎但
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201410840847.0A
Publication of CN104598536A
Application granted
Publication of CN104598536B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a structured processing method for distributed network information. The method comprises the following steps: configuring a network information acquisition task and saving the webpages a user is interested in, by category, as target webpages; acquiring the network information by having multiple map/reduce processes cooperatively crawl webpages, performing structured processing, and saving the results in the Hadoop Distributed File System (HDFS); structurally clustering the processed webpages using tree edit distance; and performing structured extraction on the clustered webpage information and saving it in a database. Because a distributed architecture is adopted, very large volumes of network data can be processed using the computing and storage capacity of a cluster of inexpensive computers; webpages are classified effectively; and the network information is extracted and saved in structured form, which facilitates its further analysis and processing.

Description

A structured processing method for distributed network information
Technical field
The present invention relates to a web information processing method in the field of network information acquisition, and in particular to a distributed method for the structured acquisition and processing of network information.
Background
A distributed system organizes a cluster of inexpensive computers so that it can perform large-scale data computation and storage.
Unlike a single-machine system, a distributed system that uses a computer cluster for computation and storage must balance the computing power of individual nodes against the cost of communication between nodes. It must also account for problems such as the loss of system efficiency caused by node failures within the cluster and the recoverability of data. Hadoop distributed processing and the HDFS distributed file system are open-source distributed computing and storage systems designed and developed on the basis of the Map/Reduce computation model proposed by Google. Because they effectively solve these problems of distributed systems and have a simple, general architecture, they have been widely applied in many fields.
Structural clustering is one kind of clustering. Unlike methods that cluster by content, structural clustering groups items according to their structure, which requires a different similarity measure. Tree edit distance is a measure of the structural similarity of trees: converting one tree into another requires a series of node insertions, deletions and substitutions, and each operation incurs a certain cost. If two trees differ greatly in structure, the total operation cost is high; a low cost indicates that the structural difference is small. The edit distance between two trees is therefore the minimum cost required to convert one tree into the other.
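As an illustration of the definition above, here is a minimal Python sketch of tree edit distance for ordered, labelled trees. It is not taken from the patent: it assumes unit cost for every insertion, deletion and substitution, represents a tree as a (label, children) tuple, and uses the textbook rightmost-root forest recursion with memoization, which is only practical for small trees. The names `forest_distance` and `tree_edit_distance` are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits (assumed cost model) turning f1 into f2."""
    if not f1:
        return size(f2)              # insert every node of f2
    if not f2:
        return size(f1)              # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    rest1, rest2 = f1[:-1], f2[:-1]
    return min(
        # delete the rightmost root of f1; its children move up one level
        forest_distance(rest1 + kids1, f2) + 1,
        # insert the rightmost root of f2
        forest_distance(f1, rest2 + kids2) + 1,
        # match the two rightmost roots, substituting if the labels differ
        forest_distance(kids1, kids2) + forest_distance(rest1, rest2)
        + (label1 != label2),
    )

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

news = ("html", (("body", (("h1", ()), ("p", ()))),))
relabelled = ("html", (("body", (("h1", ()), ("div", ()))),))  # p -> div
flattened = ("html", (("h1", ()), ("p", ())))                   # body removed
```

With these sample trees, `tree_edit_distance(news, relabelled)` and `tree_edit_distance(news, flattened)` each cost one operation, while a tree compared with itself costs nothing. A production system would more likely use a polynomial-time algorithm such as Zhang-Shasha; the recursion here is chosen only because it mirrors the definition directly.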
Network information acquisition, usually carried out by a web crawler, is widely used by Internet search engines and similar websites to obtain or update their content and retrieval methods. A crawler automatically collects all the page content it can access, for further processing by a search engine or similar system.
The information captured by existing web crawlers is stored in the form of original webpages. This form of storage has the following shortcomings: first, storing original webpages requires a large amount of storage space; second, the stored information contains a large amount of irrelevant material, such as advertisements; third, a webpage is a semi-structured form of storage which, compared with structured storage, hinders further use of the information.
Summary of the invention
The object of the invention is to address the deficiencies of existing network information acquisition technology by providing a distributed method for the structured acquisition and processing of network information.
The technical solution adopted by the invention comprises the following steps:
1) configure a network information acquisition task, and save the webpages the user is interested in, by category, as target webpages;
2) acquire the network information: multiple map/reduce processes cooperatively crawl webpages and perform structured processing, and the results are saved in the HDFS file system;
3) structurally cluster the processed webpages using tree edit distance;
4) perform structured extraction on the clustered webpage information and save it in a database.
Step 2) specifically comprises:
2.1) obtain the URL seed file and save its set of URLs to the to-crawl folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-crawl folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) using a map/reduce process, crawl the webpages corresponding to the URLs held in the HDFS file system and save them in a webpage storage file in the HDFS file system; the webpage storage file holds the raw webpages;
2.4) using another map/reduce process, parse the crawled webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) using a map/reduce process, optimize the temporary folder by filtering its URLs and removing duplicates, then update the to-crawl folder in the HDFS file system with the result;
2.6) increment the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise jump to step 2.2);
2.8) using a map/reduce process, merge the multiple webpage storage files obtained in the above steps into a single webpage storage file and remove duplicate webpages.
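The control flow of the crawl steps listed above can be sketched in plain Python. This is a single-process, in-memory illustration under several assumptions, not the patent's Hadoop implementation: each map/reduce job is replaced by an ordinary loop, the HDFS folders by Python sets and dicts, and `fetch` and `extract_links` are hypothetical callables supplied by the caller. Skipping URLs that were already crawled is one assumed reading of "filter URLs and remove duplicates", and an empty to-crawl set is treated as ending the loop and going straight to the merge.

```python
def crawl(seed_urls, fetch, extract_links, depth):
    """Round-based crawl limited to `depth` rounds; returns {url: page}."""
    to_crawl = set(seed_urls)        # 2.1) the "to-crawl folder"
    sequence = 1                     # 2.1) initial sequence number
    seen = set(to_crawl)
    storage_files = []               # one "webpage storage file" per round
    while to_crawl:                  # 2.2) an empty to-crawl set ends the loop
        # 2.3) crawl every URL of this round and store the raw pages
        pages = {url: fetch(url) for url in sorted(to_crawl)}
        pages = {url: page for url, page in pages.items() if page is not None}
        storage_files.append(pages)
        # 2.4) parse the stored pages for new URLs (the "temporary folder")
        found = [link for page in pages.values() for link in extract_links(page)]
        # 2.5) filter and deduplicate, then refresh the to-crawl set
        to_crawl = set(found) - seen
        seen |= to_crawl
        sequence += 1                # 2.6)
        if sequence > depth:         # depth check
            break
    merged = {}                      # final merge into one storage file
    for pages in storage_files:
        merged.update(pages)
    return merged

# Hypothetical five-page site: a "page" is just its list of outgoing links.
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
shallow = crawl(["a"], FAKE_WEB.get, lambda page: page, depth=2)
full = crawl(["a"], FAKE_WEB.get, lambda page: page, depth=10)
```

With `depth=2` only the seed page and the pages it links to are collected; with a larger depth the loop stops on its own once no new URLs are found.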
Step 3) performs the clustering with a map/reduce process, as follows:
3.1) in the map stage, for each webpage in the merged webpage storage file obtained in step 2), use the tree edit distance method to compute the tree edit distance DIS_Ci between the tag tree TREE_x of that webpage and the tag tree TREE_Ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and generate the key-value pairs <C_i, WEB>; then select the minimum tree edit distance DIS_Cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, merge the webpages that share a key into one file DOC_Ci as webpages of the same class, and save it in the result folder of the HDFS file system; each file DOC_Ci holds webpages with the same page structure, which yields the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completes the structural clustering of the webpages.
In step 4), each class of webpage in the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3) is processed, and the information in the webpages is extracted and saved in a database.
The nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.
Different classes of webpage use different extraction rules.
Step 1), the configuration of the network information acquisition task, is the interactive interface. Unlike a search engine crawler, the main purpose of the invention is to monitor specific information sources on the network. The webpages a user is interested in, as information sources, can be divided by content type into text, picture and video information, among others, and by content attribute into news, advertising and so on. The update frequency also differs between sources. Configuring the acquisition task determines which information sources are collected, so that different acquisition methods can be applied to different kinds of source.
Step 2) uses Hadoop distributed processing and the HDFS distributed file system to perform a distributed crawl for the acquisition task defined in step 1).
Step 3) performs structural clustering of the webpages. The webpages crawled in step 2) are still stored in HDFS as semi-structured documents and cannot be extracted directly into a database. Step 3) therefore clusters the webpages by structure using tree edit distance. Webpages are written in HTML, and a single HTML file can be abstracted as a tag tree; webpages carrying the same kind of information have tag trees with similar or even identical structure. To measure the similarity of tag trees, the invention uses tree edit distance and classifies the webpages crawled in step 2) accordingly.
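To make the abstraction of an HTML file into a tag tree concrete, the following Python sketch builds such a tree with the standard-library `html.parser` module. It is only an illustration under stated assumptions: the patent does not say how tag trees are built, text and attributes are discarded here, a synthetic `#document` root is added, and the class name `TagTreeBuilder` and the (label, children) tuple format are hypothetical choices.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TagTreeBuilder(HTMLParser):
    """Collects element names into a nested [label, children] structure."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def freeze(node):
    """Convert nested lists into hashable (label, children) tuples."""
    label, children = node
    return (label, tuple(freeze(child) for child in children))

def tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)

tree = tag_tree("<html><body><h1>Title</h1><p>Text<br>more</p></body></html>")
```

Two pages generated from the same template but with different text produce the same tuple, which is the property the structural clustering relies on.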
Step 4) performs structured extraction of webpage information: according to the structural clustering result of step 3) and the extraction rule defined for each class of webpage in step 1), the information in the webpages is extracted and saved in a database.
The beneficial effects of the invention are as follows:
The invention adopts a distributed architecture and uses the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it adopts structural clustering of webpages based on tree edit distance, which classifies webpages effectively; and it extracts and saves the network information in structured form, which facilitates further analysis and processing of that information.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the invention.
Fig. 2 shows the webpage tag tree used in step 3.1) of the invention.
Detailed description
The invention is further described below with reference to the drawings and an embodiment.
As shown in Fig. 1, the invention comprises the following steps:
1) Configure a network information acquisition task, and save the webpages the user is interested in, by category, as target webpages. This yields the set of target webpages {C_1, C_2, C_3, ..., C_n} used for clustering in the later steps. When target webpages are saved by category, different information types within the same website can be distinguished; for example, the information on a single website may be divided into news, product data, pictures and so on.
2) Acquire the network information: multiple map/reduce processes cooperatively crawl webpages and perform structured processing, and the results are saved in the HDFS file system.
Step 2) uses Hadoop distributed processing and the HDFS distributed file system to perform a distributed crawl for the acquisition task defined in step 1).
2.1) obtain the URL seed file and save its set of URLs to the to-crawl folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-crawl folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) using a map/reduce process, crawl the webpages corresponding to the URLs held in the HDFS file system and save them in a webpage storage file in the HDFS file system; the webpage storage file holds the raw webpages;
2.4) using another map/reduce process, parse the crawled webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) using a map/reduce process, optimize the temporary folder by filtering its URLs and removing duplicates, then update the to-crawl folder in the HDFS file system with the result;
2.6) increment the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise jump to step 2.2), so that steps 2.2) to 2.6) repeat until the current sequence number is greater than Depth;
2.8) using a map/reduce process, merge the multiple webpage storage files obtained in the above steps into a single webpage storage file according to webpage hash value, and remove duplicate webpages; the merged webpage storage file can be kept in the webpage folder of the HDFS file system, which holds the crawled webpages.
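The final merge by webpage hash value can be sketched as follows. The patent does not name a hash function or say what is hashed, so this sketch assumes SHA-256 over the UTF-8 page source; `merge_storage_files` and the {url: html} layout of a storage file are hypothetical.

```python
import hashlib

def merge_storage_files(storage_files):
    """Merge several {url: html} mappings, keeping one page per content hash."""
    merged = {}
    for pages in storage_files:
        for url, html in pages.items():
            # Assumed hashing scheme: SHA-256 of the page source.
            digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
            merged.setdefault(digest, (url, html))  # first occurrence wins
    return list(merged.values())

round_1 = {"http://example.com/": "<html><body>home</body></html>"}
round_2 = {
    "http://example.com/index.html": "<html><body>home</body></html>",  # same content
    "http://example.com/news": "<html><body>news</body></html>",
}
merged_pages = merge_storage_files([round_1, round_2])
```

Keying on the content hash rather than the URL means that the same page reached through two different URLs is stored only once.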
3) Structurally cluster the processed webpages using tree edit distance.
Step 3) performs the clustering with a map/reduce process, as follows:
3.1) in the map stage, take each webpage in the merged webpage storage file obtained in step 2) together with the webpages of interest to the user from step 1), for example one page from the news section of a website; the map stage converts both the target webpages and the crawled webpages into tag trees, as shown in Fig. 2. Using the tree edit distance method, compute the tree edit distance DIS_Ci between the tag tree TREE_x of each webpage and the tag tree TREE_Ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and generate the key-value pairs <C_i, WEB>; then select the minimum tree edit distance DIS_Cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, merge the webpages that share a key into one file DOC_Ci as webpages of the same class, and save it in the result folder of the HDFS file system; the result folder contains the files DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn, and each file DOC_Ci holds webpages with the same page structure, which yields the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completes the structural clustering of the webpages.
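The division of work between the map stage and the reduce stage can be illustrated with a small in-memory Python sketch. It is not Hadoop code: `map_stage` and `reduce_stage` are hypothetical plain functions, and the distance function is passed in as a parameter. To keep the example short, the demonstration uses `toy_distance`, a stand-in that compares bags of tag names; it is not a tree edit distance, and a real run would pass a tree edit distance over tag trees instead.

```python
from collections import Counter, defaultdict

def map_stage(page_id, page, targets, distance):
    """Emit <C_min, WEB>: the key of the nearest target page and the page id."""
    nearest = min(targets, key=lambda key: distance(page, targets[key]))
    return nearest, page_id

def reduce_stage(pairs):
    """Group pages that share a key; each group plays the role of a file DOC_Ci."""
    docs = defaultdict(list)
    for key, page_id in pairs:
        docs[key].append(page_id)
    return dict(docs)

def toy_distance(tags_a, tags_b):
    """Stand-in distance (assumption): number of tags not shared by the two pages."""
    a, b = Counter(tags_a), Counter(tags_b)
    return sum(((a - b) + (b - a)).values())

# Hypothetical target pages C1 (news) and C2 (parameter table), as tag lists.
TARGETS = {
    "C1": ("html", "body", "h1", "p", "p"),
    "C2": ("html", "body", "table", "tr", "td"),
}
PAGES = {
    "news-1": ("html", "body", "h1", "p"),
    "spec-1": ("html", "body", "table", "tr", "td", "td"),
    "news-2": ("html", "body", "h1", "p", "p", "p"),
}
clusters = reduce_stage(
    map_stage(page_id, page, TARGETS, toy_distance)
    for page_id, page in PAGES.items()
)
```

Because each page is compared only with the n target pages, the map stage is independent per page, which is what makes it suitable for parallel execution across a cluster.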
4) Perform structured extraction on the clustered webpage information and save it in a database.
In step 4), each class of webpage in the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3) is processed, and the information in the webpages is extracted and saved in a database; extraction can be done by mapping nodes of the webpage's tag tree to the corresponding fields in the database.
Different classes of webpage use different extraction rules; an extraction rule can be defined for each class, giving {R_1, R_2, R_3, ..., R_n}, and used to extract the information in the webpages and save it in the database.
An embodiment of the invention is as follows:
For an electric vehicle website, the user wants to obtain its news pages and its vehicle model parameter pages. For each of these two classes, one typical page is taken as the target webpage, forming the target webpage set {C_1, C_2}.
In step 2), a distributed crawl of the website obtains its page data. The time this takes depends on the size of the website to be crawled and on the size of the cluster performing the crawl; for a ten-node cluster, the crawl rate at full load can reach 100,000 pages per hour.
In step 3), the webpages obtained from the website are divided by clustering into two classes, news pages and vehicle model parameter pages; the clustering accuracy of this process can exceed 95%.
In step 4), structured extraction is performed on these two classes. For news pages, information such as the title, body text, publication date and source is extracted and saved in the database; for vehicle model parameter pages, the parameters of the different models are extracted and saved in the database.
One form of extraction rule R_i maps database fields to positions in the webpage: PubTitle, Content, PublicationDate and DCSource are fields in the database, and each field is followed by an XPath path that defines the position of that field in the webpage. For each class of webpage file DOC_Ci, the corresponding extraction rule R_i is used to extract the matching data from the webpages into the corresponding database fields.
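A rule of this kind might look like the following Python sketch, which pairs each database field with an XPath path and applies it with the standard-library `xml.etree.ElementTree` module. The field names come from the text above; everything else is an assumption made for illustration: the paths, the sample page, the `extract` helper, and the requirement that the page be well-formed XHTML (ElementTree supports only a subset of XPath and does not parse arbitrary HTML).

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rule for news pages: database field -> XPath path.
NEWS_RULE = {
    "PubTitle": "./body/h1",
    "Content": "./body/div[@class='content']",
    "PublicationDate": "./body/span[@class='date']",
    "DCSource": "./body/span[@class='source']",
}

def extract(page_source, rule):
    """Return one database record (a dict) for a well-formed page."""
    root = ET.fromstring(page_source)
    record = {}
    for field, path in rule.items():
        node = root.find(path)
        # Join all text below the node; a missing node yields None.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record

# Made-up sample page in the assumed news layout.
PAGE = """<html><body>
<h1>New model announced</h1>
<span class="date">2015-01-01</span>
<span class="source">Example Newsroom</span>
<div class="content"><p>Body text.</p></div>
</body></html>"""

record = extract(PAGE, NEWS_RULE)
```

Each key of the resulting `record` would correspond to one column of the database table for that class of webpage; a second rule with different paths would serve the vehicle model parameter pages.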
As can be seen, throughout the crawl the user only needs to provide a typical page for each of the two classes of interest; the method then extracts the corresponding types of information from the target website in structured form and saves them in the database. The whole process achieves a high information processing rate while maintaining high accuracy (more than 95% in the embodiment).
The invention is thus based on the HDFS file system and addresses the semi-structured nature of network information: it uses tree edit distance to cluster webpages according to page structure and then, on the basis of the clustering result, extracts information from each class of webpage according to its extraction rule and saves it in a database, thereby achieving structured acquisition of webpage information with a significant technical effect.

Claims (6)

1. A structured processing method for distributed network information, characterized in that the method comprises the following steps:
1) configure a network information acquisition task, and save the webpages the user is interested in, by category, as target webpages;
2) acquire the network information: multiple map/reduce processes cooperatively crawl webpages and perform structured processing, and the results are saved in the HDFS file system;
3) structurally cluster the processed webpages using tree edit distance;
4) perform structured extraction on the clustered webpage information and save it in a database.
2. The structured processing method for distributed network information according to claim 1, characterized in that step 2) specifically comprises:
2.1) obtain the URL seed file and save its set of URLs to the to-crawl folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-crawl folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) using a map/reduce process, crawl the webpages corresponding to the URLs held in the HDFS file system and save them in a webpage storage file in the HDFS file system; the webpage storage file holds the raw webpages;
2.4) using another map/reduce process, parse the crawled webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) using a map/reduce process, optimize the temporary folder by filtering its URLs and removing duplicates, then update the to-crawl folder in the HDFS file system with the result;
2.6) increment the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise jump to step 2.2);
2.8) using a map/reduce process, merge the multiple webpage storage files obtained in the above steps into a single webpage storage file and remove duplicate webpages.
3. The structured processing method for distributed network information according to claim 1, characterized in that:
step 3) performs the clustering with a map/reduce process, as follows:
3.1) in the map stage, for each webpage in the merged webpage storage file obtained in step 2), use the tree edit distance method to compute the tree edit distance DIS_Ci between the tag tree TREE_x of that webpage and the tag tree TREE_Ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and generate the key-value pairs <C_i, WEB>; then select the minimum tree edit distance DIS_Cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, merge the webpages that share a key into one file DOC_Ci as webpages of the same class, and save it in the result folder of the HDFS file system; each file DOC_Ci holds webpages with the same page structure, which yields the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completes the structural clustering of the webpages.
4. The structured processing method for distributed network information according to claim 1, characterized in that: in step 4), each class of webpage in the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3) is processed, and the information in the webpages is extracted and saved in a database.
5. The structured processing method for distributed network information according to claim 4, characterized in that: the nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.
6. The structured processing method for distributed network information according to claim 4, characterized in that: different classes of webpage use different extraction rules.
CN201410840847.0A 2014-12-29 2014-12-29 Structured processing method of distributed network information Active CN104598536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410840847.0A CN104598536B (en) 2014-12-29 2014-12-29 Structured processing method of distributed network information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410840847.0A CN104598536B (en) 2014-12-29 2014-12-29 Structured processing method of distributed network information

Publications (2)

Publication Number Publication Date
CN104598536A true CN104598536A (en) 2015-05-06
CN104598536B CN104598536B (en) 2017-10-20

Family

ID=53124321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410840847.0A Active CN104598536B (en) 2014-12-29 2014-12-29 Structured processing method of distributed network information

Country Status (1)

Country Link
CN (1) CN104598536B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667201A (en) * 2009-09-18 2010-03-10 浙江大学 Integration method of Deep Web query interface based on tree merging
US20110313973A1 (en) * 2010-06-19 2011-12-22 Srivas Mandayam C Map-Reduce Ready Distributed File System
US20120150836A1 (en) * 2010-12-08 2012-06-14 Microsoft Corporation Training parsers to approximately optimize ndcg
CN103534700A (en) * 2011-05-20 2014-01-22 惠普发展公司,有限责任合伙企业 System and method for configuration policy extraction
US20130091414A1 (en) * 2011-10-11 2013-04-11 Omer BARKOL Mining Web Applications
CN103257975A (en) * 2012-02-21 2013-08-21 腾讯科技(深圳)有限公司 Search method, search device and search system
CN104217020A (en) * 2014-09-25 2014-12-17 浪潮(北京)电子信息产业有限公司 Webpage clustering method and system based on MapReduce framework

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
宫丽娜 等: "基于树编辑距离的聚类算法数据记录抽取", 《赤峰学院学报(自然科学版)》 *
聂卉 等: "树编辑距离在Web信息抽取中的应用与实现", 《现代图书情报技术》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815196A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Soft text represents number of times statistical method and device
CN106815196B (en) * 2015-11-27 2020-07-31 北京国双科技有限公司 Soft text display frequency statistical method and device
CN106293465A (en) * 2016-08-09 2017-01-04 Tcl移动通信科技(宁波)有限公司 The Web page management method of a kind of mobile terminal and system
CN107451224A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 A kind of clustering method and system based on big data parallel computation
CN109829094A (en) * 2019-01-23 2019-05-31 钟祥博谦信息科技有限公司 Distributed reptile system
CN112115164A (en) * 2019-06-19 2020-12-22 北京金山云网络技术有限公司 Data processing method and device, data query method and device, and network equipment
CN112115164B (en) * 2019-06-19 2024-09-03 北京金山云网络技术有限公司 Data processing method and device, data query method and device and network equipment
CN111177301A (en) * 2019-11-26 2020-05-19 云南电网有限责任公司昆明供电局 Key information identification and extraction method and system
CN111177301B (en) * 2019-11-26 2023-05-26 云南电网有限责任公司昆明供电局 Method and system for identifying and extracting key information
CN113220943A (en) * 2021-06-04 2021-08-06 上海天旦网络科技发展有限公司 Target information positioning method and system in semi-structured flow data

Also Published As

Publication number Publication date
CN104598536B (en) 2017-10-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant