CN104598536A - Structured processing method of distributed network information - Google Patents
- Publication number
- CN104598536A
- Authority
- CN
- China
- Prior art keywords
- webpage
- file
- network information
- structuring
- doc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a structured processing method for distributed network information. The method comprises the following steps: configuring a network information acquisition task, and saving the webpages a user is interested in by category as target webpages; acquiring the network information by collecting webpages cooperatively through multiple map/reduce processes, performing structured processing, and saving the result in HDFS (Hadoop Distributed File System); performing structural clustering on the processed webpages using tree edit distance; and performing structured extraction on the clustered webpage information and saving it in a database. Because a distributed architecture is adopted, very large volumes of network data can be processed using the computing and storage capacity of a cluster of inexpensive computers; the webpages are classified effectively; and the network information is extracted and saved in structured form, which facilitates further analysis and processing.
Description
Technical field
The present invention relates to a webpage information processing method in the field of network information acquisition, and in particular to a distributed method for the structured acquisition and processing of network information.
Background
A distributed system organizes a cluster of inexpensive computers so that it can perform large-scale data computation and storage.
Unlike a single-machine system, a system that computes and stores data on a computer cluster must balance the computing power of individual nodes against the cost of communication between nodes, and must also handle the loss of efficiency and the recoverability of data when nodes in the cluster fail. Hadoop distributed processing and the HDFS distributed file system are open-source distributed computing and storage systems designed and developed on the basis of the Map/Reduce computation model proposed by Google. Because they solve these problems effectively and their architecture is simple and general, they are widely used in many fields.
Structural clustering is one kind of clustering method. Unlike methods that cluster by content, structural clustering groups items according to their structure, which requires a different similarity measure. Tree edit distance is a measure of the similarity of tree structures. Converting one tree into another requires a series of node insertions, deletions and replacements, and each operation has a cost. A large structural difference between two trees means a high total operation cost; a low cost indicates that the structures differ little. The edit distance between two trees is therefore the minimum total cost required to convert one into the other.
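The definition above can be illustrated with a minimal sketch. It assumes ordered trees written as `(label, children)` tuples and a unit cost for every insertion, deletion and replacement; the representation, the cost model and the name `tree_edit_distance` are illustrative assumptions, not details taken from the patent. The recursion is the plain forest recursion with memoization, not an optimized algorithm.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees. A forest is a tuple of trees. All edit costs are 1.


def tree_edit_distance(a, b):
    """Return the ordered tree edit distance between trees a and b."""

    @lru_cache(maxsize=None)
    def forest_dist(f, g):
        if not f and not g:
            return 0
        if not f:
            # Insert the rightmost root of g; its children take its place.
            _, kids = g[-1]
            return forest_dist(f, g[:-1] + kids) + 1
        if not g:
            # Delete the rightmost root of f; its children take its place.
            _, kids = f[-1]
            return forest_dist(f[:-1] + kids, g) + 1
        (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
        return min(
            # Delete the rightmost root of f.
            forest_dist(f[:-1] + kids_f, g) + 1,
            # Insert the rightmost root of g.
            forest_dist(f, g[:-1] + kids_g) + 1,
            # Match the two rightmost roots, replacing the label if needed.
            forest_dist(kids_f, kids_g)
            + forest_dist(f[:-1], g[:-1])
            + (label_f != label_g),
        )

    return forest_dist((a,), (b,))
```

With this sketch, two trees that differ only by one missing leaf are at distance 1, and identical trees are at distance 0. Real costs could be weighted per operation or per tag; that choice is left open here.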
Network information acquisition, usually carried out by a web crawler, is widely used by internet search engines and similar websites to obtain or update their content and retrieval methods. A crawler automatically collects all the page content it can access so that a search engine or similar system can process it further.
The information captured by existing web crawlers is stored in the form of original webpages. This storage mode has three shortcomings: first, storing original webpages requires a large amount of storage space; second, the stored information contains much irrelevant material, such as advertisements; third, a webpage is a semi-structured form, and compared with structured storage, semi-structured storage hinders further use of the information.
Summary of the invention
The object of the invention is to address the deficiencies of existing network information acquisition technology by providing a distributed method for the structured acquisition and processing of network information.
The technical solution adopted by the invention comprises the following steps:
1) configure a network information acquisition task, and save the webpages the user is interested in by category as target webpages;
2) acquire the network information: collect webpages cooperatively through multiple map/reduce processes, perform structured processing, and save the result in the HDFS file system;
3) perform structural clustering on the processed webpages using tree edit distance;
4) perform structured extraction on the clustered webpage information and save it in a database.
Step 2) specifically comprises:
2.1) obtain the URL seed file and save its set of URLs to the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) through a map/reduce process, fetch the webpage corresponding to each URL in the to-be-crawled folder and save it in a webpage storage file in the HDFS file system; the webpage storage file holds the original webpages;
2.4) through a further map/reduce process, parse the webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) through a map/reduce process, filter the URLs in the temporary folder, remove duplicates, and then update the to-be-crawled folder in the HDFS file system with the result;
2.6) increase the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, go to step 2.8); otherwise, jump to step 2.2);
2.8) through a map/reduce process, merge the webpage storage files obtained in the above steps into a single webpage storage file and remove duplicate webpages.
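The control flow of the crawl loop in step 2) can be sketched on a single machine as follows. This is only an illustration under stated assumptions: in-memory sets and dictionaries stand in for the HDFS folders and files, plain function calls stand in for the map/reduce jobs, and `fetch` and `extract_links` are hypothetical callables supplied by the caller. Excluding already-fetched URLs in step 2.5) and deduplicating pages by SHA-256 content hash in the final merge are also assumptions about how "filtering" and "removing duplicates" might be done.

```python
import hashlib


def crawl(seed_urls, depth, fetch, extract_links):
    """Depth-limited crawl that mirrors steps 2.1) to the final merge."""
    to_crawl = set(seed_urls)   # 2.1) to-be-crawled folder
    seq = 1                     # 2.1) initial sequence number
    storage_files = []          # one webpage storage file per round
    fetched = set()

    while to_crawl:             # 2.2) stop early when nothing is left
        # 2.3) fetch every URL in the to-be-crawled folder
        batch = {url: fetch(url) for url in sorted(to_crawl)}
        storage_files.append(batch)
        fetched.update(batch)

        # 2.4) parse new URLs into the temporary folder
        temp = set()
        for html in batch.values():
            temp.update(extract_links(html))

        # 2.5) filter and deduplicate, then update the to-be-crawled folder
        to_crawl = temp - fetched

        seq += 1                # 2.6)
        if seq > depth:         # 2.7) compare with the crawl depth
            break

    # Final merge: one storage file, duplicates removed by content hash.
    merged, seen = {}, set()
    for batch in storage_files:
        for url, html in batch.items():
            digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                merged[url] = html
    return merged
```

With a crawl depth of 1, only the seed URLs are fetched; each additional level of depth follows the links one round further.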
Step 3) performs clustering through a map/reduce process, as follows:
3.1) in the map stage, for each webpage in the merged webpage storage file obtained in step 2), use the tree edit distance method to calculate the tree edit distance DIS_ci between the webpage's tag tree TREE_x and the tag tree TREE_ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_c1, DIS_c2, DIS_c3, ..., DIS_cn} and generating key-value pairs <C_i, WEB>; then select the minimum tree edit distance DIS_cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, merge the webpages that share the same key into one file DOC_ci as one class of webpages, and save it in the result folder of the HDFS file system; each file DOC_ci holds webpages with the same webpage structure, giving the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} and completing the structural clustering of the webpages.
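The map and reduce stages of step 3) can be sketched in plain Python rather than as Hadoop jobs. The assumptions are: each webpage is a dictionary with a `"tree"` entry, `targets` maps a target name to its tag tree, and `distance` is any tree edit distance function passed in by the caller. The names `map_page` and `reduce_pages` are illustrative, and ties between equally close targets are resolved by dictionary order, which the patent does not specify.

```python
from collections import defaultdict


def map_page(web, targets, distance):
    """Map stage: emit the pair (C_min, WEB) for the closest target webpage."""
    # DIS_ci for every target C_i
    distances = {
        name: distance(web["tree"], target_tree)
        for name, target_tree in targets.items()
    }
    # Key of the minimum distance DIS_cmin
    c_min = min(distances, key=distances.get)
    return c_min, web


def reduce_pages(pairs):
    """Reduce stage: merge webpages that share a key into one DOC_ci group."""
    docs = defaultdict(list)
    for key, web in pairs:
        docs["DOC_" + key].append(web)
    return dict(docs)
```

Applying `map_page` to every captured webpage and feeding the resulting pairs to `reduce_pages` yields one group per target webpage, corresponding to the clustering result {DOC_c1, ..., DOC_cn}.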
In step 4), according to the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3), each class of webpages is processed: the information in the webpages is extracted and saved in the database.
The nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.
Different classes of webpages use different extraction patterns.
Step 1) configures the network information acquisition task and serves as the interactive interface. Unlike a search engine crawler, the main purpose of the invention is to monitor specific information sources on the network. The webpages the user is interested in, as information sources, can be divided by content type into text, picture and video information, among others, or by content attribute into news information, advertising information and so on. The update frequency of different information sources also varies. Configuring the network information acquisition task determines which information sources are collected, so that different acquisition methods can be applied to different kinds of information source.
Step 2) is based on Hadoop distributed processing and the HDFS distributed file system, and performs a distributed crawl for the network acquisition task defined in step 1).
Step 3) performs structural clustering of the webpages. The webpages captured in step 2) are still stored in HDFS as semi-structured documents and cannot be extracted directly into a database. Step 3) therefore clusters the webpages by structure using tree edit distance. Webpages are written in HTML, and a single HTML file can be abstracted into a tag tree; webpages carrying the same kind of information have tag trees with similar or even identical structures. To measure the similarity of tag trees, the invention uses tree edit distance to classify the webpages captured in step 2).
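How an HTML file might be abstracted into a tag tree can be sketched with Python's standard `html.parser` module. The sketch keeps only tag names and nesting and discards text and attributes. The `(label, children)` tuple form, the synthetic `#document` root, the short list of void tags and the helper names are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (an assumed, partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTreeBuilder(HTMLParser):
    """Collect the nesting of tags while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def freeze(node):
    """Convert the mutable [label, children] lists into nested tuples."""
    label, kids = node
    return (label, tuple(freeze(kid) for kid in kids))


def tag_tree(html):
    """Return the tag tree of an HTML string as (label, children) tuples."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)
```

Two pages generated from the same template but with different text produce identical trees under this abstraction, which is the property that structural clustering relies on.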
Step 4) performs structured extraction of webpage information: according to the structural clustering result of step 3) and the extraction pattern defined in step 1) for each class of webpages, the information in the webpages is extracted and saved in the database.
The beneficial effects of the invention are:
The invention adopts a distributed architecture and uses the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it clusters webpages by structure using tree edit distance, which classifies the webpages effectively; and it extracts and saves network information in structured form, which facilitates further analysis and processing of that information.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the invention.
Fig. 2 shows the webpage tag tree of step 3.1) of the invention.
Detailed description
The invention is further described below with reference to the drawings and an embodiment.
As shown in Fig. 1, the invention comprises the following steps:
1) configure a network information acquisition task, and save the webpages the user is interested in by category as target webpages; this yields the set of target webpages {C_1, C_2, C_3, ..., C_n} used for clustering in the subsequent steps. When the target webpages are saved by category, different information types within the same website can be classified separately; for example, the pages of one website may be divided into a news class, a product data class, a picture class and so on.
2) acquire the network information: collect webpages cooperatively through multiple map/reduce processes, perform structured processing, and save the result in the HDFS file system.
Step 2) is based on Hadoop distributed processing and the HDFS distributed file system, and performs a distributed crawl for the network acquisition task defined in step 1).
2.1) obtain the URL seed file and save its set of URLs to the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) through a map/reduce process, fetch the webpage corresponding to each URL in the to-be-crawled folder and save it in a webpage storage file in the HDFS file system; the webpage storage file holds the original webpages;
2.4) through a further map/reduce process, parse the webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) through a map/reduce process, filter the URLs in the temporary folder, remove duplicates, and then update the to-be-crawled folder in the HDFS file system with the result;
2.6) increase the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, go to step 2.8); otherwise, jump to step 2.2), repeating steps 2.2) to 2.6) until the current sequence number is greater than Depth;
2.8) through a map/reduce process, merge the webpage storage files obtained in the above steps into a single webpage storage file according to the hash values of the webpages, and remove duplicate webpages; the merged webpage storage file can be saved in the webpage folder of the HDFS file system, which holds the captured webpages.
3) perform structural clustering on the processed webpages using tree edit distance;
Step 3) performs clustering through a map/reduce process, as follows:
3.1) in the map stage, the target webpages from step 1) (for example, one page from the news class of a website) and each captured webpage in the merged webpage storage file obtained in step 2) are abstracted into tag trees, as shown in Fig. 2. The tree edit distance method is then used to calculate, for each webpage, the tree edit distance DIS_ci between its tag tree TREE_x and the tag tree TREE_ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_c1, DIS_c2, DIS_c3, ..., DIS_cn} and generating key-value pairs <C_i, WEB>; the minimum tree edit distance DIS_cmin is then selected from the set, and the corresponding key-value pair <C_min, WEB> is passed to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, the webpages that share the same key are merged into one file DOC_ci as one class of webpages and saved in the result folder of the HDFS file system. The result folder holds the files DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn; each file DOC_ci holds webpages with the same webpage structure, giving the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} and completing the structural clustering of the webpages.
4) perform structured extraction on the clustered webpage information and save it in a database.
In step 4), according to the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3), each class of webpages is processed: the information in the webpages is extracted and saved in the database. Extraction can be done by extracting the nodes of a webpage's tag tree into the corresponding database fields.
Different classes of webpages use different extraction patterns. An extraction pattern can be defined for each class of webpages, giving {R_1, R_2, R_3, ..., R_n}, according to which the information in the webpages is extracted and saved in the database.
An embodiment of the invention is as follows:
For an electric vehicle website, the user wants to obtain its news webpages and its vehicle model parameter webpages. For each of the two classes, one typical webpage is taken as the target webpage, forming the target webpage set {C_1, C_2}.
In step 2), a distributed crawl of the electric vehicle website obtains its webpage data. The time this takes depends on the size of the website being crawled and on the size of the cluster performing the crawl; for a ten-node cluster, the crawl speed at full load can reach 100,000 per hour.
In step 3), the webpages obtained from the electric vehicle website are divided by clustering into two classes, news and vehicle model parameters; the clustering accuracy can exceed 95%.
In step 4), structured extraction is performed for the two classes: for news webpages, the title, body text, publication date, source and other information are extracted and saved in the database; for vehicle model parameter webpages, the parameters of each model are extracted and saved in the database.
An extraction pattern R_i specifies database fields and their positions in the page: PubTitle, Content, PublicationDate and DCSource are fields in the database, and each field is followed by an XPath path that defines the position of that field in the webpage. For each class of webpage files DOC_ci, the corresponding extraction pattern R_i is used to extract the data from the webpages into the corresponding database fields.
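A pattern of this kind can be sketched as a mapping from database field to XPath path. The field names PubTitle, Content, PublicationDate and DCSource come from the text above; the paths in `NEWS_PATTERN`, the table layout and the helper names are hypothetical. The sketch uses the limited XPath support of `xml.etree.ElementTree` and therefore assumes well-formed markup; real HTML would need an HTML-aware parser. SQLite stands in for the unspecified database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extraction pattern for news webpages: field -> XPath path.
NEWS_PATTERN = {
    "PubTitle": "./body/h1",
    "Content": "./body/div[@class='content']",
    "PublicationDate": "./body/span[@class='date']",
    "DCSource": "./body/span[@class='source']",
}


def extract(page_xml, pattern):
    """Apply an extraction pattern to one well-formed page."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, path in pattern.items():
        node = root.find(path)
        # Join all nested text; a missing node yields None.
        record[field] = (
            "".join(node.itertext()).strip() if node is not None else None
        )
    return record


def save(conn, table, record):
    """Insert one extracted record into the given table."""
    columns = ", ".join(record)
    marks = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})",
        tuple(record.values()),
    )
```

Each class of webpages would have its own pattern and, typically, its own table, so that news pages and model parameter pages end up in differently shaped records.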
As can be seen, throughout the crawling process the user only needs to provide one typical webpage for each of the two classes of interest; the method then extracts the corresponding types of information from the target website in structured form and saves them in the database. The whole process achieves a high information processing speed while maintaining high accuracy (more than 95% in the embodiment).
Thus, based on the HDFS file system and addressing the semi-structured nature of network information, the invention clusters webpages by structure using tree edit distance, extracts information from each class of webpages according to its extraction pattern on the basis of the clustering result, and saves it in a database, thereby achieving structured acquisition of webpage information with a significant technical effect.
Claims (6)
1. A structured processing method for distributed network information, characterized in that the method comprises the following steps:
1) configure a network information acquisition task, and save the webpages the user is interested in by category as target webpages;
2) acquire the network information: collect webpages cooperatively through multiple map/reduce processes, perform structured processing, and save the result in the HDFS file system;
3) perform structural clustering on the processed webpages using tree edit distance;
4) perform structured extraction on the clustered webpage information and save it in a database.
2. The structured processing method for distributed network information according to claim 1, characterized in that step 2) specifically comprises:
2.1) obtain the URL seed file and save its set of URLs to the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled, and set the initial sequence number to 1;
2.2) check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, proceed to step 2.3);
2.3) through a map/reduce process, fetch the webpage corresponding to each URL in the to-be-crawled folder and save it in a webpage storage file in the HDFS file system; the webpage storage file holds the original webpages;
2.4) through a further map/reduce process, parse the webpages in the webpage storage file to extract new URLs, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) through a map/reduce process, filter the URLs in the temporary folder, remove duplicates, and then update the to-be-crawled folder in the HDFS file system with the result;
2.6) increase the sequence number by 1;
2.7) check the sequence number: if the current sequence number is greater than the crawl depth value Depth, go to step 2.8); otherwise, jump to step 2.2);
2.8) through a map/reduce process, merge the webpage storage files obtained in the above steps into a single webpage storage file and remove duplicate webpages.
3. The structured processing method for distributed network information according to claim 1, characterized in that:
step 3) performs clustering through a map/reduce process, as follows:
3.1) in the map stage, for each webpage in the merged webpage storage file obtained in step 2), use the tree edit distance method to calculate the tree edit distance DIS_ci between the webpage's tag tree TREE_x and the tag tree TREE_ci of each target webpage C_i, obtaining the set of tree edit distances {DIS_c1, DIS_c2, DIS_c3, ..., DIS_cn} and generating key-value pairs <C_i, WEB>; then select the minimum tree edit distance DIS_cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) in the reduce stage, according to the keys of the key-value pairs <C_i, WEB>, merge the webpages that share the same key into one file DOC_ci as one class of webpages, and save it in the result folder of the HDFS file system; each file DOC_ci holds webpages with the same webpage structure, giving the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} and completing the structural clustering of the webpages.
4. The structured processing method for distributed network information according to claim 1, characterized in that: in step 4), according to the structural clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3), each class of webpages is processed, and the information in the webpages is extracted and saved in the database.
5. The structured processing method for distributed network information according to claim 4, characterized in that: the nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.
6. The structured processing method for distributed network information according to claim 4, characterized in that: different classes of webpages use different extraction patterns.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410840847.0A CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410840847.0A CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104598536A true CN104598536A (en) | 2015-05-06 |
CN104598536B CN104598536B (en) | 2017-10-20 |
Family
ID=53124321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410840847.0A Active CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104598536B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN104598536B (en) | 2017-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||