CN103377260A - Analysis method and device of URLs (Uniform Resource Locator) of weblog - Google Patents

Analysis method and device of URLs (Uniform Resource Locator) of weblog

Info

Publication number
CN103377260A
Authority
CN
China
Prior art keywords
url
deduplication
regular expression
number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101331708A
Other languages
Chinese (zh)
Other versions
CN103377260B (en)
Inventor
张清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taobao China Software Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201210133170.8A
Publication of CN103377260A
Application granted
Publication of CN103377260B
Status: Active
Anticipated expiration

Abstract

The invention provides a method and device for analyzing the URLs (Uniform Resource Locators) of a web log. The method comprises: extracting the URLs in the web log; deduplicating the URLs; matching the deduplicated URLs against a plurality of preset regular expressions in turn, and extracting the numbers of the regular expressions that match each deduplicated URL; for each URL before deduplication, copying the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers; and compiling statistics on the different regular expression numbers corresponding to the URLs before deduplication. The method and device reduce the amount of computation required for regular expression matching and thereby reduce the computation cost.

Description

Method and device for analyzing web log URLs
Technical field
The present application relates to the technical field of data processing, and in particular to a method and device for analyzing web log URLs.
Background art
In business analysis, massive web logs are often subjected to various kinds of analysis and mining. The URLs in a web log carry important information about visitor access, so the URLs usually need to be matched against regular expressions, and business analysis is then carried out according to the category to which each matched regular expression belongs.
In the prior art, the URL processing of web logs is divided into three steps:
1. collect the massive web logs and store the raw data;
2. match the URLs against regular expressions; each URL may match several regular expressions (generally in the range of 1 to 10);
3. according to the business category corresponding to each regular expression, output data for the subsequent metric analysis of that business category.
Suppose the original web log has n records and there are m regular expressions to match; the matching process then involves n × m match operations.
The problem with the prior art is that URL regular expression matching is relatively complex, and the number of records in the web logs of a large Internet site is massive. Applying many regular expression rules one by one to a massive number of URLs requires a very large amount of computation, and the computation cost is high.
Therefore, the technical problem to be solved by the present application is to provide a mechanism for analyzing web log URLs that reduces the amount of computation required for regular expression matching and lowers the computation cost.
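The n × m cost of the prior-art procedure described above can be made concrete with a minimal Python sketch. The function name `match_every_url`, the example.com URLs, and the two patterns are hypothetical and chosen only for illustration; they do not come from the application.

```python
import re


def match_every_url(urls, patterns):
    """Prior-art style matching: test every URL record against every pattern.

    Returns the (url, number) pairs that matched and the number of match attempts.
    """
    matched = []
    attempts = 0
    for url in urls:
        for number, pattern in patterns.items():
            attempts += 1  # one regular expression evaluation per (record, pattern) pair
            if re.search(pattern, url):
                matched.append((url, number))
    return matched, attempts


# Hypothetical data: n = 3 log records, m = 2 regular expressions.
urls = ["http://example.com/a", "http://example.com/a", "http://example.com/b"]
patterns = {1: r"/a$", 2: r"/b$"}

matched, attempts = match_every_url(urls, patterns)
print(attempts)  # 6, i.e. n * m = 3 * 2
```

Note that the repeated record is evaluated against both patterns twice, even though the outcome cannot differ between the two copies.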
Summary of the invention
The technical problem to be solved by the present application is to provide a method for analyzing web log URLs, so as to reduce the amount of computation required for regular expression matching and lower the computation cost.
The present application also provides a device for analyzing web log URLs, to ensure the application and implementation of the above method in practice.
To solve the above problem, the present application discloses a method for analyzing web log URLs, comprising:
extracting the URLs in the web log;
deduplicating the URLs;
matching the deduplicated URLs against a plurality of preset regular expressions in turn, and extracting the numbers of the regular expressions that match each deduplicated URL;
for each URL before deduplication, copying the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
compiling statistics on the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the step of finding, for every URL before deduplication, the regular expressions corresponding to the identical URL among the deduplicated URLs, and taking them as its corresponding regular expressions, comprises:
converting the data of the second table from rows to columns;
performing an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
Preferably, the regular expression numbers corresponding to the URLs before deduplication are added to the first table in correspondence with those URLs.
Preferably, the regular expression numbers corresponding to the URLs before deduplication replace the corresponding URLs in the first table.
Preferably, the step of compiling statistics on the different regular expression numbers corresponding to the URLs before deduplication is: counting, for each different regular expression number, the number of times it occurs among all the URLs before deduplication.
Preferably, the number of a regular expression is the number of the business category to which it belongs.
The present application also provides a device for analyzing web log URLs, comprising:
a URL extraction module, configured to extract the URLs in the web log;
a URL deduplication module, configured to deduplicate the URLs;
a regular expression matching module, configured to match the deduplicated URLs against a plurality of preset regular expressions in turn, and to extract the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module, configured to copy, for each URL before deduplication, the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
a statistics module, configured to compile statistics on the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the matching result copying module comprises:
a row-to-column conversion submodule, configured to convert the data of the second table from rows to columns;
an equi-join submodule, configured to perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
Compared with the prior art, the present application has the following advantages:
According to the present application, for the URLs in massive web logs, duplicate URLs are first removed and regular expression matching is then performed on the deduplicated URLs. Because URLs are accessed repeatedly a great many times in massive logs, after deduplication the cost of regular expression matching is incurred only once for identical URLs, and the regular expressions corresponding to all identical URLs can be obtained from the matching result of the deduplicated URL. The computation cost of URL regular expression matching can therefore be very effectively minimized.
The present application can store the URLs before and after deduplication in tables and perform an equi-join on their URL columns to find the correspondence between all URLs before deduplication and their regular expressions; compared with the non-equi join that regular expression matching amounts to, this reduces the computation cost. Moreover, when performing the equi-join, the regular expression numbers may be chosen to replace the corresponding URLs in the table, so that the resulting table contains only regular expression numbers; compared with a table that retains the URLs, this greatly reduces the column width of the table and consumes fewer resources.
Description of drawings
Fig. 1 is a flowchart of an embodiment of a method for analyzing web log URLs according to the present application;
Fig. 2 is a structural block diagram of an embodiment of a device for analyzing web log URLs according to the present application.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present application more apparent, the present application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of an embodiment of a method for analyzing web log URLs according to the present application is shown, which may specifically comprise the following steps:
Step 101: extract the URLs in the web log.
A web log is a file ending in .log that records various raw information such as the requests received and processed by the web server and runtime errors; more precisely, it is a server log. The web log contains the URLs of the web pages that visitors requested.
A URL consists of three parts: protocol, domain name, and request address. A complete URL uniquely determines the resource of a request, which may be a page, a content module, a file, a multimedia resource, and so on. For a website, the purpose of a URL is to locate a resource uniquely, so it can take many forms: a unique description of the resource (resource name or abbreviation, etc.), a unique identifier of the resource (ID, numeric tag, etc.), or dynamic parameters. Therefore, extracting the information in a URL reveals which web content a visitor accessed, and analyzing the URLs in massive logs reveals how the various web resources are accessed, for example the number and frequency of visits.
Step 102: deduplicate the URLs.
A URL may be accessed many times in one day, so massive web logs contain a large number of duplicate URLs. Deduplication removes the duplicate addresses from the web log so that the retained URLs are all different. During deduplication, the distinct URLs may be extracted from all the URLs; alternatively, the URLs may be put into a table one by one, checking before each insertion whether the same address already exists in the table: if it does not, the URL is added, and if it does, it is not added.
In a preferred embodiment of the present application, the URLs before and after deduplication may be stored as columns in a first table and a second table, respectively, as in the following example.
The first table is:
A
http://men.taobao.com/123456
http://men.taobao.com/123456
http://men.taobao.com/123456
http://women.taobao.com/123456
http://women.taobao.com/123456
http://women.taobao.com/123456
Here the URL http://men.taobao.com/123456 appears 3 times, and the URL http://women.taobao.com/123456 also appears 3 times; the second table obtained after deduplication is therefore:
D
http://men.taobao.com/123456
http://women.taobao.com/123456
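The second deduplication variant described in step 102 (check the table before storing each URL) can be sketched in Python as follows. This is only an illustration using the six sample URLs of the first table; the function name `deduplicate` and the use of a set for the membership check are assumptions, not part of the application.

```python
def deduplicate(urls):
    """Return the distinct URLs in first-seen order."""
    seen = set()       # addresses already stored in the table
    distinct = []
    for url in urls:
        if url not in seen:   # check the table before storing
            seen.add(url)
            distinct.append(url)
    return distinct


MEN = "http://men.taobao.com/123456"
WOMEN = "http://women.taobao.com/123456"

first_table = [MEN, MEN, MEN, WOMEN, WOMEN, WOMEN]   # column A
second_table = deduplicate(first_table)              # column D
print(second_table)
```

The set gives a constant-time membership check on average, so the pass over the log stays linear in the number of records.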
Step 103: match the deduplicated URLs against a plurality of preset regular expressions in turn, and extract the numbers of the regular expressions that match each deduplicated URL.
As is well known, a regular expression is a tool for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits, while metacharacters have special meanings. Regular expression matching can be understood as searching a given string for the parts that conform to a given regular expression. A string may contain more than one part that satisfies the regular expression, in which case each such part is called a match. Here, the URL is matched against preset regular expressions that contain keywords: a match indicates that the URL contains the keyword in the regular expression, and no match indicates that it does not. By matching a URL against multiple regular expressions, the information contained in the URL, or the category of that information, can be determined.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the deduplicated URLs may be stored correspondingly in the second table. Specifically, the number of a regular expression may be the number of the business category to which it belongs.
Continuing the example above, the result after matching is shown in the following table:
D E
http://men.taobao.com/123456 men
http://women.taobao.com/123456 men
http://women.taobao.com/123456 women
Matching http://men.taobao.com/123456 against the preset regular expressions shows that it matches the regular expression numbered men. http://women.taobao.com/123456 contains both the keyword men and the keyword women, so it matches the regular expressions numbered men and women.
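The matching result in the table above can be reproduced with a short Python sketch. The application names only the numbers men and women; the concrete patterns below (plain keyword patterns) are an assumption made for illustration, as are the names `PATTERNS` and `match_numbers`.

```python
import re

# Assumed keyword patterns; only the numbers "men" and "women" come from the example.
PATTERNS = {
    "men": re.compile(r"men"),
    "women": re.compile(r"women"),
}


def match_numbers(url, patterns=PATTERNS):
    """Return the numbers of all preset regular expressions that match the URL."""
    return [number for number, rx in patterns.items() if rx.search(url)]


second_table = [
    "http://men.taobao.com/123456",
    "http://women.taobao.com/123456",
]

# One (URL, number) row per match, as in columns D and E.
match_rows = [(url, number) for url in second_table for number in match_numbers(url)]
for row in match_rows:
    print(row)
```

Because the keyword men is a substring of women, the second URL yields two rows, which is exactly the one-URL-to-many-numbers case that the later row-to-column conversion handles.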
Step 104: for each URL before deduplication, copy the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers.
Every URL before deduplication has an identical URL among the deduplicated URLs; therefore, a URL before deduplication can take the regular expression numbers of the identical deduplicated URL as its own. Because the present application performs regular expression matching only on the deduplicated URLs, the workload is greatly reduced compared with the prior art, which matches every URL one by one. In the example above, the prior art needs to match all 6 URLs one by one, whereas after deduplication only 2 URLs need to be matched, and the matching results are then associated with the 6 URLs.
In a specific implementation, after the URLs before and after deduplication and the matching results have been put into tables, step 104 may comprise:
Substep S11: convert the data of the second table from rows to columns.
Substep S12: perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
After regular expression matching, the regular expression numbers corresponding to each URL are stored in column form, one number per row. The numbers corresponding to a URL can instead be stored, in order of size, within a single row, as follows:
D E F
http://men.taobao.com/123456 men
http://women.taobao.com/123456 men women
Then column A is equi-joined with column D, so that the regular expression numbers in columns E, F, and G (where present) are associated with the URLs in column A.
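Substeps S11 and S12 can be sketched in Python with plain dictionaries standing in for the tables. This is an illustration under assumptions: the function names `pivot_rows_to_columns` and `equi_join` are hypothetical, and a dictionary lookup is used here in place of whatever join operator a data warehouse would provide.

```python
from collections import defaultdict

MEN = "http://men.taobao.com/123456"
WOMEN = "http://women.taobao.com/123456"

first_table = [MEN, MEN, MEN, WOMEN, WOMEN, WOMEN]                 # column A
match_rows = [(MEN, "men"), (WOMEN, "men"), (WOMEN, "women")]      # columns D, E


def pivot_rows_to_columns(rows):
    """S11: gather the numbers of each URL into one row, sorted by size."""
    grouped = defaultdict(list)
    for url, number in rows:
        grouped[url].append(number)
    return {url: sorted(numbers) for url, numbers in grouped.items()}


def equi_join(urls, pivoted):
    """S12: join on URL equality so every pre-deduplication URL gets its numbers."""
    return [(url, pivoted.get(url, [])) for url in urls]


pivoted = pivot_rows_to_columns(match_rows)   # columns D, E, F
joined = equi_join(first_table, pivoted)
for url, numbers in joined:
    print(url, numbers)
```

The join compares URLs for equality only, so no regular expression is evaluated at this stage; that is the cost difference from matching every record directly.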
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may be added to the first table; for example, the numbers are added in the columns to the right of the URL column in the first table, in correspondence with the URLs.
In another preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may replace the corresponding URLs in the first table; that is, for each URL in the first table, the numbers of the regular expressions corresponding to the identical URL in the second table are added to the first table in place of the original URL.
For massive URLs, the present application performs regular expression matching only on the distinct URLs, and can therefore very effectively minimize the computation cost of URL regular expression matching.
Step 105: compile statistics on the different regular expression numbers corresponding to the URLs before deduplication.
In a specific implementation, step 105 may be: count, for each different regular expression number, the number of times it occurs among all the URLs before deduplication. According to the keywords or categories corresponding to the different regular expressions, various information about visitor access to the website can then be compiled.
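The counting in step 105 can be sketched in Python with `collections.Counter`. The input below is the joined result for the six sample URLs, written out by hand; the function name `count_numbers` is hypothetical.

```python
from collections import Counter

MEN = "http://men.taobao.com/123456"
WOMEN = "http://women.taobao.com/123456"

# Every pre-deduplication URL together with its copied regular expression numbers.
joined = [(MEN, ["men"])] * 3 + [(WOMEN, ["men", "women"])] * 3


def count_numbers(rows):
    """Count how often each regular expression number occurs over all URLs."""
    counts = Counter()
    for _url, numbers in rows:
        counts.update(numbers)
    return counts


counts = count_numbers(joined)
print(counts["men"], counts["women"])  # 6 3
```

The counts are taken over the table before deduplication, so each repeated visit still contributes to the statistics even though its URL was matched only once.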
In a specific implementation, the present application may be implemented on data warehouse platforms such as Hadoop or Hive.
In summary, according to the present application, for the URLs in massive web logs, duplicate URLs are first removed and regular expression matching is then performed on the deduplicated URLs. Because URLs are accessed repeatedly a great many times in massive web logs, after deduplication the cost of regular expression matching is incurred only once for identical URLs, and the regular expressions corresponding to all identical URLs can be obtained from the matching result of the deduplicated URL. The computation cost of URL regular expression matching can therefore be very effectively minimized.
The present application can store the URLs before and after deduplication in tables and perform an equi-join on their URL columns to find the correspondence between all URLs before deduplication and their regular expressions; compared with the non-equi join that regular expression matching amounts to, this reduces the computation cost. Moreover, when performing the equi-join, the regular expression numbers may be chosen to replace the corresponding URLs in the table, so that the resulting table contains only regular expression numbers; compared with a table that retains the URLs, this greatly reduces the column width of the table and consumes fewer resources.
For simplicity of description, the method embodiment is expressed as a series of combined actions; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Referring to Fig. 2, a structural block diagram of an embodiment of a device for analyzing web log URLs according to the present application is shown, which may specifically comprise the following modules:
a URL extraction module 201, configured to extract the URLs in the web log;
a URL deduplication module 202, configured to deduplicate the URLs;
a regular expression matching module 203, configured to match the deduplicated URLs against a plurality of preset regular expressions in turn, and to extract the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module 204, configured to copy, for each URL before deduplication, the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
a statistics module 205, configured to compile statistics on the different regular expression numbers corresponding to the URLs before deduplication.
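The way modules 201 to 205 hand data to one another can be sketched as a small Python class with one method per module. This is a structural illustration only: the class name `UrlLogAnalyzer`, the log line layout, the URL-extraction pattern, and the keyword patterns are all assumptions and are not specified by the application.

```python
import re
from collections import Counter


class UrlLogAnalyzer:
    """One method per module of Fig. 2 (names and log format are hypothetical)."""

    def __init__(self, patterns):
        self.patterns = {number: re.compile(p) for number, p in patterns.items()}
        self.url_rx = re.compile(r"https?://\S+")   # assumed shape of a URL in a log line

    def extract(self, log_lines):                   # module 201
        urls = []
        for line in log_lines:
            found = self.url_rx.search(line)
            if found:
                urls.append(found.group(0))
        return urls

    def deduplicate(self, urls):                    # module 202
        return list(dict.fromkeys(urls))            # keeps first-seen order

    def match(self, distinct_urls):                 # module 203
        return {
            url: [n for n, rx in self.patterns.items() if rx.search(url)]
            for url in distinct_urls
        }

    def copy_results(self, urls, matched):          # module 204
        return [(url, matched[url]) for url in urls]

    def count(self, joined):                        # module 205
        return Counter(n for _url, numbers in joined for n in numbers)

    def analyze(self, log_lines):
        urls = self.extract(log_lines)
        matched = self.match(self.deduplicate(urls))
        return self.count(self.copy_results(urls, matched))


# Hypothetical log lines; only the URLs are taken from the example tables.
log_lines = [
    "10.0.0.1 GET http://men.taobao.com/123456 200",
    "10.0.0.2 GET http://men.taobao.com/123456 200",
    "10.0.0.3 GET http://women.taobao.com/123456 200",
]
analyzer = UrlLogAnalyzer({"men": r"men", "women": r"women"})
result = analyzer.analyze(log_lines)
print(result["men"], result["women"])  # 3 1
```

Keeping the matching module separate from the copying module makes the saving visible in the code: `match` sees only the distinct URLs, while `copy_results` and `count` work on the full list.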
In a preferred embodiment of the present application, the URLs before deduplication and the URLs after deduplication may be stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs may be stored correspondingly in the second table.
In a preferred embodiment of the present application, the matching result copying module may comprise:
a row-to-column conversion submodule, configured to convert the data of the second table from rows to columns;
an equi-join submodule, configured to perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may be added to the first table in correspondence with those URLs.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may replace the corresponding URLs in the first table.
In a preferred embodiment of the present application, the statistics module may be:
a counting module, configured to count, for each different regular expression number, the number of times it occurs among all the URLs before deduplication.
In a preferred embodiment of the present application, the number of a regular expression may be the number of the business category to which it belongs.
Because the device embodiment essentially corresponds to the foregoing method embodiment illustrated in Figs. 1 and 2, for parts not detailed in the description of this embodiment, refer to the relevant description in the foregoing embodiment; they are not repeated here.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.
The method and device for analyzing web log URLs provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, the specific embodiments and the scope of application may vary according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for analyzing web log URLs, characterized by comprising:
extracting the URLs in the web log;
deduplicating the URLs;
matching the deduplicated URLs against a plurality of preset regular expressions in turn, and extracting the numbers of the regular expressions that match each deduplicated URL;
for each URL before deduplication, copying the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
compiling statistics on the different regular expression numbers corresponding to the URLs before deduplication.
2. The method of claim 1, characterized in that the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
3. The method of claim 2, characterized in that the step of finding, for every URL before deduplication, the regular expressions corresponding to the identical URL among the deduplicated URLs, and taking them as its corresponding regular expressions, comprises:
converting the data of the second table from rows to columns;
performing an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
4. The method of claim 2, characterized in that the regular expression numbers corresponding to the URLs before deduplication are added to the first table in correspondence with those URLs.
5. The method of claim 2, characterized in that the regular expression numbers corresponding to the URLs before deduplication replace the corresponding URLs in the first table.
6. The method of claim 1, characterized in that the step of compiling statistics on the different regular expression numbers corresponding to the URLs before deduplication is: counting, for each different regular expression number, the number of times it occurs among all the URLs before deduplication.
7. The method of any one of claims 1 to 6, characterized in that the number of a regular expression is the number of the business category to which it belongs.
8. A device for analyzing web log URLs, characterized by comprising:
a URL extraction module, configured to extract the URLs in the web log;
a URL deduplication module, configured to deduplicate the URLs;
a regular expression matching module, configured to match the deduplicated URLs against a plurality of preset regular expressions in turn, and to extract the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module, configured to copy, for each URL before deduplication, the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
a statistics module, configured to compile statistics on the different regular expression numbers corresponding to the URLs before deduplication.
9. The device of claim 8, characterized in that the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
10. The device of claim 9, characterized in that the matching result copying module comprises:
a row-to-column conversion submodule, configured to convert the data of the second table from rows to columns;
an equi-join submodule, configured to perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
CN201210133170.8A 2012-04-28 2012-04-28 The analysis method and device of a kind of network log URL Active CN103377260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210133170.8A CN103377260B (en) 2012-04-28 2012-04-28 The analysis method and device of a kind of network log URL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210133170.8A CN103377260B (en) 2012-04-28 2012-04-28 The analysis method and device of a kind of network log URL

Publications (2)

Publication Number Publication Date
CN103377260A true CN103377260A (en) 2013-10-30
CN103377260B CN103377260B (en) 2017-05-31

Family

ID=49462386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210133170.8A Active CN103377260B (en) 2012-04-28 2012-04-28 The analysis method and device of a kind of network log URL

Country Status (1)

Country Link
CN (1) CN103377260B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617198A (en) * 2013-11-14 2014-03-05 北京国双科技有限公司 Page merging method and page merging device
CN103986606A (en) * 2014-05-27 2014-08-13 重庆邮电大学 Method for parallel recognition and statistics of webpage URLs based on MapReduce algorithm
CN104252532A (en) * 2014-09-11 2014-12-31 北京优特捷信息技术有限公司 Website information statistic method and device
CN104933056A (en) * 2014-03-18 2015-09-23 腾讯科技(深圳)有限公司 Uniform resource locator (URL) de-duplication method and device
CN105005600A (en) * 2015-07-02 2015-10-28 焦点科技股份有限公司 Preprocessing method of URL (Uniform Resource Locator) in access log
CN105516114A (en) * 2015-12-01 2016-04-20 珠海市君天电子科技有限公司 Method and device for scanning vulnerability based on webpage hash value and electronic equipment
CN105591836A (en) * 2015-09-09 2016-05-18 杭州华三通信技术有限公司 Data flow detection method and device
CN105790967A (en) * 2014-12-18 2016-07-20 华为技术有限公司 Weblog processing method and device
CN107145542A (en) * 2017-04-25 2017-09-08 上海斐讯数据通信技术有限公司 The high efficiency extraction subscription client ID method and system from URL
CN109791741A (en) * 2016-09-27 2019-05-21 日本电信电话株式会社 Secret equivalence connection system, secret equivalent attachment device, secret equivalent connection method, program
CN109995784A (en) * 2019-04-03 2019-07-09 杭州汉领信息科技有限公司 A kind of data extraction accelerated method based on UDP
CN110012010A (en) * 2019-04-03 2019-07-12 杭州汉领信息科技有限公司 A kind of WAF defence method based on targeted sites self study modeling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133540A1 (en) * 2006-12-01 2008-06-05 Websense, Inc. System and method of analyzing web addresses
US20080172390A1 (en) * 2007-01-16 2008-07-17 Microsoft Corporation Associating security trimmers with documents in an enterprise search system
CN101727447A (en) * 2008-10-10 2010-06-09 浙江搜富网络技术有限公司 Generation method and device of regular expression based on URL
CN101937469A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Information capture method of video website

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617198B (en) * 2013-11-14 2017-10-27 北京国双科技有限公司 Page merging method and device
CN103617198A (en) * 2013-11-14 2014-03-05 北京国双科技有限公司 Page merging method and page merging device
CN104933056A (en) * 2014-03-18 2015-09-23 腾讯科技(深圳)有限公司 Uniform resource locator (URL) de-duplication method and device
CN104933056B (en) * 2014-03-18 2019-08-13 腾讯科技(深圳)有限公司 Uniform resource locator (URL) de-duplication method and device
CN103986606B (en) * 2014-05-27 2017-03-29 重庆邮电大学 Method for parallel recognition and statistics of webpage URLs based on MapReduce algorithm
CN103986606A (en) * 2014-05-27 2014-08-13 重庆邮电大学 Method for parallel recognition and statistics of webpage URLs based on MapReduce algorithm
CN104252532A (en) * 2014-09-11 2014-12-31 北京优特捷信息技术有限公司 Website information statistic method and device
CN105790967A (en) * 2014-12-18 2016-07-20 华为技术有限公司 Weblog processing method and device
CN105005600A (en) * 2015-07-02 2015-10-28 焦点科技股份有限公司 Preprocessing method of URL (Uniform Resource Locator) in access log
CN105005600B (en) * 2015-07-02 2017-05-24 焦点科技股份有限公司 Preprocessing method of URL (Uniform Resource Locator) in access log
CN105591836A (en) * 2015-09-09 2016-05-18 杭州华三通信技术有限公司 Data flow detection method and device
CN105591836B (en) * 2015-09-09 2019-03-15 新华三技术有限公司 Data-flow detection method and apparatus
CN105516114A (en) * 2015-12-01 2016-04-20 珠海市君天电子科技有限公司 Method and device for scanning vulnerability based on webpage hash value and electronic equipment
CN105516114B (en) * 2015-12-01 2018-12-14 珠海市君天电子科技有限公司 Method and device for scanning vulnerability based on webpage hash value and electronic equipment
CN109791741A (en) * 2016-09-27 2019-05-21 日本电信电话株式会社 Secret equi-join system, secret equi-join device, secret equi-join method, and program
CN109791741B (en) * 2016-09-27 2022-01-18 日本电信电话株式会社 Secret equivalence connection system, connection device, connection method, and recording medium
CN107145542A (en) * 2017-04-25 2017-09-08 上海斐讯数据通信技术有限公司 Method and system for efficiently extracting user client IDs from URLs
CN110012010A (en) * 2019-04-03 2019-07-12 杭州汉领信息科技有限公司 WAF defense method based on target-site self-learning modeling
CN110012010B (en) * 2019-04-03 2021-09-17 杭州汉领信息科技有限公司 Target site self-learning modeling-based WAF defense method
CN109995784A (en) * 2019-04-03 2019-07-09 杭州汉领信息科技有限公司 UDP-based data extraction acceleration method
CN109995784B (en) * 2019-04-03 2022-02-11 杭州汉领信息科技有限公司 UDP-based data extraction acceleration method

Also Published As

Publication number Publication date
CN103377260B (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN103377260A (en) Analysis method and device of URLs (Uniform Resource Locator) of weblog
CN102171689B (en) Method and system for providing search results
US9264505B2 (en) Building a semantics graph for an enterprise communication network
US8832102B2 (en) Methods and apparatuses for clustering electronic documents based on structural features and static content features
US20170177729A1 (en) Search engine and link-based ranking algorithm for the semantic web
US20100211609A1 (en) Method and system to process unstructured data
RU2012144649A (en) PRODUCT SYNTHESIS FROM MULTIPLE SOURCES
US8515986B2 (en) Query pattern generation for answers coverage expansion
US20150169734A1 (en) Building features and indexing for knowledge-based matching
CN101963965A (en) Document indexing method, data query method and server based on search engine
US9514113B1 (en) Methods for automatic footnote generation
CN102710795A (en) Hotspot collecting method and device
CN105975459A (en) Lexical item weight labeling method and device
CN102591965A (en) Method and device for detecting black chain
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
CN106168968A (en) Website classification method and device
CN106874368B (en) RTB bidding advertisement position value analysis method and system
US20210109945A1 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
CN106033444B (en) Text content clustering method and device
WO2014071100A1 (en) Search service including indexing text containing numbers in part using one or more number index structures
CN103744970A (en) Method and device for determining subject term of picture
Prakash et al. Content extraction issues in online web education
CN109213972B (en) Method, device, equipment and computer storage medium for determining document similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK; Ref legal event code: DE; Ref document number: 1186804; Country of ref document: HK

GR01 Patent grant
REG Reference to a national code
Ref country code: HK; Ref legal event code: GR; Ref document number: 1186804; Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20211103
Address after: Room 554, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province
Patentee after: TAOBAO (CHINA) SOFTWARE CO.,LTD.
Address before: P.O. Box 847, 4th Floor, Capital Building, Grand Cayman, British Cayman Islands
Patentee before: ALIBABA GROUP HOLDING Ltd.