CN108073607B - URL processing method and device - Google Patents

Info

Publication number
CN108073607B
Authority
CN
China
Prior art keywords
url
key
source page
characters
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610996918.5A
Other languages
Chinese (zh)
Other versions
CN108073607A (en)
Inventor
包佳杰
施维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201610996918.5A
Publication of CN108073607A
Application granted
Publication of CN108073607B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/20 Software design
    • G06F8/22 Procedural

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a URL processing method and a URL processing device, which improve the success rate of collecting website data accessed by a user. The method comprises the following steps: acquiring a URL of a source page of an accessed page; if the number of the characters of the URL of the source page is larger than a first threshold value, extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page; if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.

Description

URL processing method and device
Technical Field
The invention relates to the field of big data analysis, in particular to a URL processing method and device.
Background
In the field of big data analysis, mainstream page analysis tools collect the access behavior data of page visitors as follows. When a user visits a page in a browser, a JavaScript Tracker (installed on the user client) records the URL (Uniform Resource Locator) of the source page of the page currently being visited, fills the URL of the source page into the request URL of a Get request of the browser, and sends the Get request to a data collection server, so that the user's access path through the pages can be analyzed. A URL is a compact representation of the location and access method of a resource available on the Internet, and is the address of a standard resource on the Internet.
However, current browsers limit the length of the request URL in a Get request. For example, the IE (Internet Explorer) browser limits the request URL to a maximum of 2083 characters, and Google Chrome limits it to a maximum of 8182 characters. Once the maximum length is exceeded, the browser directly discards the request URL, so the page access data of the user cannot be acquired.
Disclosure of Invention
In view of the above problems, the present invention provides a URL processing method and device that overcome or at least partially solve the above problems and improve the success rate of collecting data on the websites accessed by a user.
The invention provides a URL processing method, which comprises the following steps:
acquiring a URL of a source page of an accessed page;
if the number of the characters of the URL of the source page is larger than a first threshold value, extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page;
if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
Preferably, the URL of the source page includes a key-value pair, where the key-value pair includes a key and a value;
extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page comprises the following steps:
extracting a preset number of characters from the values of the key-value pairs to obtain the processed URL of the source page, wherein the preset number is smaller than the maximum total number of the characters of the values of the key-value pairs; and/or,
if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
Preferably, the key-value pairs comprise a first key-value pair and a second key-value pair;
extracting a preset number of characters from the values of the key-value pairs to obtain a processed URL of a source page comprises:
extracting a first preset number of characters from the value of the first key-value pair to obtain a first processed URL of the source page;
if the number of the characters of the first processed URL of the source page is larger than the first threshold, extracting a second preset number of characters from the value of the second key value pair to obtain a second processed URL of the source page, wherein the number of the characters of the second processed URL of the source page is smaller than or equal to the first threshold.
Preferably, the number of characters of the value of the first key-value pair and the number of characters of the value of the second key-value pair are both greater than or equal to a second threshold value.
Preferably, if there are a plurality of key-value pairs included in the URL of the source page, extracting a part of the key-value pairs includes:
deleting the key-value pairs one at a time, in order from back to front;
each time one key-value pair is deleted, judging whether the number of characters of the remaining URL of the source page is larger than the first threshold value; if not, stopping the deletion; if so, continuing to delete the next key-value pair.
Preferably, if there are a plurality of key-value pairs included in the URL of the source page, extracting a part of the key-value pairs includes:
arranging the key-value pairs in ascending order of the number of characters of their values, and extracting the first M key-value pairs, wherein M is less than the total number of the key-value pairs.
The invention also provides a URL processing device, which comprises an acquisition unit, an extraction unit and a generation unit;
the acquisition unit is used for acquiring the URL of the source page of the accessed page;
the extraction unit is used for extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page if the number of the characters of the URL of the source page is greater than a first threshold;
the generation unit is used for generating a request URL by using the processed URL of the source page if the number of the characters of the processed URL of the source page is less than or equal to the first threshold, wherein the number of the characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
Preferably, the URL of the source page includes a key-value pair, where the key-value pair includes a key and a value;
the extraction unit is specifically configured to:
extracting a preset number of characters from the values of the key-value pairs to obtain the processed URL of the source page, wherein the preset number is smaller than the maximum total number of the characters of the values of the key-value pairs; and/or,
if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
Preferably, the key-value pairs comprise a first key-value pair and a second key-value pair;
the extraction unit is specifically configured to:
extracting a first preset number of characters from the value of the first key-value pair to obtain a first processed URL of the source page;
if the number of the characters of the first processed URL of the source page is larger than the first threshold, extracting a second preset number of characters from the value of the second key value pair to obtain a second processed URL of the source page, wherein the number of the characters of the second processed URL of the source page is smaller than or equal to the first threshold.
Preferably, if there are a plurality of key-value pairs included in the URL of the source page, the extracting unit is specifically configured to:
deleting the key-value pairs one at a time, in order from back to front;
each time one key-value pair is deleted, judging whether the number of characters of the remaining URL of the source page is larger than the first threshold value; if not, stopping the deletion; if so, continuing to delete the next key-value pair.
By means of the technical scheme, when the number of the characters of the URL of the source page is larger than a first threshold value, partial character strings are extracted from the URL of the source page according to a preset rule, namely, the URL of the source page is compressed, so that a request URL in a Get request generated according to the processed URL of the source page meets the requirement of a browser for limiting the maximum length of the request URL, the generated Get request is sent to a data acquisition server, collection of website data accessed by a user is achieved, and the success rate of collecting the website data accessed by the user is improved.
The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the invention clearer, and to make the above and other objects, features and advantages of the invention easier to understand, embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a URL processing method according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating a source page according to an embodiment of the invention;
FIG. 3 is a diagram illustrating a page accessed in one embodiment of the invention;
FIG. 4 is a flowchart illustrating a URL processing method according to a second embodiment of the present invention;
FIG. 5 is a block diagram illustrating a URL processing apparatus according to a third embodiment of the present invention.
Detailed Description
In the prior art, once the number of characters of a request URL in a Get request exceeds a certain threshold, the Get request is discarded by a browser and cannot be transmitted to a data acquisition server, which results in missing of page data accessed by a user.
In the course of overcoming this technical problem, the inventors found that an overly long source-page URL inside the request URL is usually the main reason why the number of characters of the request URL fails to meet the limit. The URL of the source page is therefore processed so that the number of characters of the processed URL allows the request URL in the Get request to meet the browser's requirement, which improves the success rate with which the data acquisition server acquires the data of the pages accessed by the user.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Embodiment one
Referring to fig. 1, a flowchart of a URL processing method according to an embodiment of the present invention is shown.
The URL processing method provided in this embodiment includes the following steps:
Step S101: Obtain the URL of the source page of the accessed page.
In this embodiment, the page currently visited by the user is referred to as the accessed page. If the user jumps to the accessed page through a link on some other page, that other page is called the source page of the accessed page, and the URL of the source page identifies the page from which the accessed page was reached.
For example, referring to FIG. 2 and FIG. 3, assume that a user opens the Baidu page and searches for "国双" (Gridsum) in the Baidu search box; a page of search results for "国双" then appears.
Next, the user clicks the second search result, i.e., the link titled "Gridsum Technology Co., Ltd. (Gridsum) - Focus on data, create value", to enter the home page of the Gridsum official website.
The home page of the Gridsum official website in FIG. 3 is then the accessed page, the page of search results for "国双" in FIG. 2 is the source page of the page in FIG. 3, and the URL of the source page in FIG. 2 is: https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E5%9B%BD%E5%8F%8C&rsv_pq=b2404d1b000078d9&rsv_t=...&rqlang=cn&rsv_enter=1&rsv_sug3=1&rsv_sug1=1&rsv_sug7=100&...
As described in the Background, a JavaScript Tracker records the URL of the source page of the page the user is currently accessing in the browser and stores it on the client, so the URL of the source page of the accessed page can be obtained in this way.
Step S102: if the number of the characters of the URL of the source page is larger than a first threshold value, extracting partial character strings from the URL of the source page according to a preset rule to obtain the processed URL of the source page.
The length of the request URL in the Get request must not exceed the maximum length limit, and the component most likely to push the request URL over that limit is the URL of the source page. The number of characters of the URL of the source page therefore needs to be kept within a first threshold. For a given browser, the first threshold may be a fixed value. For example, if the maximum length of the request URL is 2083 characters and the other fields of the request URL generally do not exceed 1383 characters, the first threshold may be fixed at 700 characters.
The first threshold may also vary as a function of the length of other fields of the request URL. If the length of the other fields in the request URL is 1300 characters, then the first threshold may be 783 characters; if the other fields in the request URL are 1400 characters in length, then the first threshold may be 683 characters.
In addition, different browsers may impose different maximum lengths on the request URL, and the first threshold may be the same or different across browsers. If a single first threshold is used, it should preferably be the one corresponding to the smallest maximum request-URL length, so that it suits all browsers. For example, if the IE browser has the smallest maximum request-URL length of all browsers, the first threshold suited to the IE browser is selected as the fixed threshold. Alternatively, when the accessed page is loaded by the IE browser, the first threshold may be 700 characters; when the accessed page is loaded by Google Chrome, the first threshold may be 2800 characters.
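The threshold choices described above can be sketched as follows. This is only an illustration under stated assumptions: the function and dictionary names are hypothetical, and the numbers are the example figures quoted in the text, not measured browser limits.

```python
from __future__ import annotations

# Maximum request-URL lengths quoted in the Background section.
MAX_REQUEST_URL_LEN = {"ie": 2083, "chrome": 8182}

# Example per-browser first thresholds quoted in the text.
PER_BROWSER_THRESHOLD = {"ie": 700, "chrome": 2800}


def dynamic_first_threshold(max_request_len: int, other_fields_len: int) -> int:
    """First threshold that varies with the length of the other request-URL fields."""
    return max(max_request_len - other_fields_len, 0)


def shared_first_threshold(per_browser: dict[str, int]) -> int:
    """Single fixed threshold for all browsers: the most restrictive one."""
    return min(per_browser.values())
```

With the figures from the text, 2083 characters minus 1300 characters of other fields leaves a first threshold of 783, and the shared threshold for IE and Chrome is the IE value of 700.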
In this embodiment, when the number of characters of the URL of the source page is greater than the first threshold, a part of the character strings is extracted from the URL of the source page according to a preset rule to obtain a processed URL of the source page. In other words, the URL of the source page is compressed so that the number of characters of the processed URL can be brought within the first threshold.
The preset rule is not limited in the present invention, and those skilled in the art can design the rule according to the specific situation. In the present embodiment, the preset rule is described by way of example.
Typically, the URL of the source page includes at least one key-value pair, each consisting of a key and a value in the form "key=value". Adjacent key-value pairs are separated by the "&" symbol. Taking the URL "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E5%9B%BD%E5%8F%8C&rsv_pq=b2404d1b000078d9&rsv_t=...&rqlang=cn&rsv_enter=1&rsv_sug3=1&rsv_sug1=1&rsv_sug7=100&..." as an example, the URL includes the following key-value pairs: ie=utf-8, f=8, rsv_bp=0, rsv_idx=1, tn=baidu, wd=%E5%9B%BD%E5%8F%8C, rsv_pq=b2404d1b000078d9, rsv_t=..., rqlang=cn, rsv_enter=1, rsv_sug3=1, rsv_sug1=1, rsv_sug7=100, and so on. Taking the key-value pair "ie=utf-8" as an example, "ie" is the key of the key-value pair, and "utf-8" is the value of the key-value pair.
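A minimal sketch of splitting a source-page URL into its key-value pairs is shown below. The helper names `split_query` and `join_query` are hypothetical. The sketch assumes every item has the form "key=value" and that the URL has no fragment part. Values are deliberately left percent-encoded so that character counts match the raw URL.

```python
from __future__ import annotations


def split_query(url: str) -> tuple[str, list[tuple[str, str]]]:
    """Split a URL into its base and an ordered list of (key, value) pairs."""
    base, _, query = url.partition("?")
    pairs = []
    for item in query.split("&"):
        if item:  # skip empty items such as a trailing "&"
            key, _, value = item.partition("=")
            pairs.append((key, value))
    return base, pairs


def join_query(base: str, pairs: list[tuple[str, str]]) -> str:
    """Rebuild a URL from its base and its key-value pairs."""
    if not pairs:
        return base
    return base + "?" + "&".join(f"{key}={value}" for key, value in pairs)
```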
The preset rule may be to extract a preset number of characters from the values of the key-value pair to obtain a processed URL of the source page.
The preset number may be a fixed value, for example 10. If the number of characters of the value of a key-value pair is less than or equal to this fixed value, the whole value is retained; if the number of characters of the value is greater than this fixed value, the first N characters may be extracted, where N equals the fixed value. The first N characters are extracted, rather than the last N characters or N random characters, because the leading characters generally carry more important information than the trailing ones, and taking N random characters would completely destroy the information carried by the original value.
Taking the key-value pair "tn=baidu" as an example, the value has 5 characters, so after the first 10 characters are extracted the key-value pair is still "tn=baidu". Taking the key-value pair "rsv_pq=b2404d1b000078d9" as an example, the value has 16 characters, so after the first 10 characters are extracted the key-value pair is "rsv_pq=b2404d1b00".
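The fixed preset number can be illustrated with a short sketch. The function name `truncate_values` is hypothetical, and the key-value pairs are assumed to be already split into (key, value) tuples.

```python
from __future__ import annotations


def truncate_values(pairs: list[tuple[str, str]], preset_number: int = 10) -> list[tuple[str, str]]:
    """Keep at most the first `preset_number` characters of every value."""
    # Slicing leaves values that are already short enough unchanged.
    return [(key, value[:preset_number]) for key, value in pairs]


pairs = [("tn", "baidu"), ("rsv_pq", "b2404d1b000078d9")]
shortened = truncate_values(pairs)
```

Here `shortened` holds `("tn", "baidu")` unchanged and `("rsv_pq", "b2404d1b00")`, matching the two examples in the text.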
The preset number may also vary with the number of characters of the value of the key-value pair. For example, the preset number may be two thirds, one half, one third, and so on, of the number of characters of the value, in which case the first two thirds, first half, first third, and so on, of the value is extracted. Taking the key-value pair "rsv_pq=b2404d1b000078d9" as an example, after the first half of the characters of the value is extracted, the key-value pair becomes "rsv_pq=b2404d1b". In addition, for the same source page URL, the preset number corresponding to each key-value pair may be the same or different. For example, when the number of characters of the value of a key-value pair is within a first interval range (e.g., [20,50]), the first two thirds of the value is extracted; when the number of characters of the value is within a second interval range (e.g., [100,200]), the first half of the value is extracted.
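A proportional preset number might be sketched as follows. The rule table `RULES` and the function names are hypothetical. Rounding down and leaving values outside every interval untouched are assumptions, since the text does not specify either.

```python
from fractions import Fraction

# (low, high, fraction kept), built from the interval examples in the text.
RULES = [(20, 50, Fraction(2, 3)), (100, 200, Fraction(1, 2))]


def truncate_by_fraction(value: str, fraction: Fraction) -> str:
    """Keep the leading `fraction` of the value, rounding down (assumption)."""
    return value[: int(len(value) * fraction)]


def keep_count(value_len: int, rules=RULES) -> int:
    """Number of leading characters to keep for a value of the given length."""
    for low, high, fraction in rules:
        if low <= value_len <= high:
            return int(value_len * fraction)
    return value_len  # outside every interval: keep the whole value (assumption)
```

For instance, `truncate_by_fraction("b2404d1b000078d9", Fraction(1, 2))` keeps the 8 leading characters `b2404d1b`, as in the example above.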
Regardless of whether the preset number is a fixed value or a dynamically changing value, the preset number must be less than the maximum total number of characters of the value of the key-value pair.
In addition, if the first threshold value changes according to different browsers, the preset number also correspondingly changes in a negative correlation manner. That is, when the first threshold is small, the preset number should be large; when the first threshold is larger, the preset number should be smaller, so that the original information of the URL of the source page can be retained as much as possible.
If the value of every key-value pair has been truncated once to the preset number of characters and the number of characters of the resulting processed URL of the source page is still greater than the first threshold, the processed URL is processed again according to the preset rule, and so on, until the number of characters of the processed URL is smaller than or equal to the first threshold.
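The repeated application of the rule can be sketched as a loop. The sketch assumes a proportional rule that keeps the first half of every value on each pass, because a fixed preset number would change nothing on a second pass. The function names are hypothetical, and the loop stops once no value can shrink further.

```python
from __future__ import annotations


def build_url(base: str, pairs: list[tuple[str, str]]) -> str:
    """Rebuild the URL from its base and key-value pairs."""
    return base + "?" + "&".join(f"{key}={value}" for key, value in pairs)


def compress_until_fits(base: str, pairs: list[tuple[str, str]], first_threshold: int) -> str:
    """Halve every value repeatedly until the URL fits or nothing is left to cut."""
    url = build_url(base, pairs)
    while len(url) > first_threshold:
        shorter = [(key, value[: len(value) // 2]) for key, value in pairs]
        if shorter == pairs:  # all values are empty: no further progress possible
            break
        pairs = shorter
        url = build_url(base, pairs)
    return url
```

The caller still has to compare the returned URL against the first threshold, since the base and the keys alone may already exceed it.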
Besides compressing the URL of the source page by extracting a preset number of characters from the values of the key-value pairs, when the URL of the source page includes a plurality of key-value pairs the preset rule may instead be to extract only some of the key-value pairs. Specifically, the key-value pairs may be extracted from the URL of the source page at random, or according to a certain rule. For example, the key-value pairs are arranged in ascending order of the number of characters of their values, and the first M key-value pairs are extracted, where M is smaller than the total number of the key-value pairs.
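The "first M key-value pairs by value length" rule might look like the sketch below. The function name is hypothetical, and pairs whose values have the same length keep their original relative order because Python's sort is stable.

```python
from __future__ import annotations


def keep_shortest_pairs(pairs: list[tuple[str, str]], m: int) -> list[tuple[str, str]]:
    """Sort pairs by value length, ascending, and keep the first m of them."""
    ranked = sorted(pairs, key=lambda pair: len(pair[1]))
    return ranked[:m]


pairs = [("a", "xxxxx"), ("b", "x"), ("c", "xxx")]
kept = keep_shortest_pairs(pairs, 2)
```

In this usage `kept` contains the pairs for `b` and `c`, and the pair with the longest value is dropped.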
The number of the extracted partial key-value pairs can be a fixed value, can also be in positive correlation with the number of characters of the URL of the source page, and can also be in negative correlation with the size of the first threshold. That is, under the condition that the first threshold is fixed, when the number of characters of the URL of the source page is large, the number of the extracted partial key-value pairs may be relatively large; when the number of characters of the URL of the source page is small, the number of the extracted partial key-value pairs may also be relatively small. Under the condition that the number of characters of the URL of the source page is certain, when the first threshold value is larger, the number of the extracted partial key-value pairs is less; when the first threshold is smaller, the number of the extracted partial key-value pairs should be larger. Thereby, the original information of the URL of the source page can be kept as much as possible.
In addition, the preset rule may also be a combination of the above two ways, see the following embodiments.
Step S103: if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
The URL of the source page is compressed in step S102. If the number of characters of the resulting processed URL is less than or equal to the first threshold, the request URL of the Get request is generated using the processed URL of the source page. The request URL is generated in essentially the same way as the prior art generates a request URL from an unprocessed URL of the source page, and details are not repeated here.
In this embodiment, when the number of the characters of the URL of the source page is greater than the first threshold, a part of the character strings is extracted from the URL of the source page according to a preset rule, that is, the URL of the source page is compressed, so that a request URL in a Get request generated according to the processed URL of the source page meets a requirement of a browser for maximum length limitation, and thus, the generated Get request is sent to a data acquisition server, thereby implementing collection of data of websites accessed by a user.
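Steps S101 to S103 can be tied together in a rough sketch. Several things here are assumptions rather than part of the described method: the collector address `https://collector.example/t.gif`, the `ref` parameter name, the percent-encoding of the source URL, and returning `None` when the limits cannot be met (the text does not say what happens in that case). The rule used is the fixed truncation of values to 10 characters.

```python
from __future__ import annotations

from urllib.parse import quote


def truncate_values_rule(url: str, n: int = 10) -> str:
    """Example preset rule: keep the first n characters of every value."""
    base, _, query = url.partition("?")
    if not query:
        return url
    parts = []
    for item in query.split("&"):
        key, _, value = item.partition("=")
        parts.append(f"{key}={value[:n]}")
    return base + "?" + "&".join(parts)


def make_request_url(source_url: str, first_threshold: int, rule=truncate_values_rule,
                     collector: str = "https://collector.example/t.gif",
                     max_request_len: int = 2083) -> str | None:
    """S102 and S103: compress the source URL if needed, then build the request URL."""
    processed = source_url
    if len(processed) > first_threshold:      # S102: apply the preset rule
        processed = rule(processed)
    if len(processed) > first_threshold:      # still too long: give up (assumption)
        return None
    request_url = f"{collector}?ref={quote(processed, safe='')}"  # S103
    return request_url if len(request_url) <= max_request_len else None
```

Because percent-encoding lengthens the source URL inside the request URL, the sketch checks the final length against the browser limit as well as checking the processed URL against the first threshold.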
Embodiment two
In this embodiment, the two approaches, extracting a preset number of characters from the values of the key-value pairs and extracting some of the key-value pairs, are combined to process the URL of the source page. Specifically, when the number of characters of the URL of the source page is greater than the first threshold, a preset number of characters is extracted from the values in the key-value pairs of the URL of the source page. If the number of characters of the URL of the source page after this extraction is still greater than the first threshold and the URL of the source page includes a plurality of key-value pairs, some of the key-value pairs are then extracted to compress the URL of the source page further.
Referring to fig. 4, it is a flowchart of a URL processing method according to a second embodiment of the present invention.
The URL processing method provided in this embodiment includes the following steps:
Step S201: Obtain the URL of the source page of the accessed page.
This step is the same as step S101 in the first embodiment, and will not be described in detail here.
Step S202: Determine whether the number of characters of the URL of the source page is greater than a first threshold; if so, perform step S203.
The URL of the source page comprises key-value pairs, and the key-value pairs comprise keys and values.
Step S203: Extract a preset number of characters from the values of the key-value pairs, wherein the preset number is smaller than the maximum total number of the characters of the values of the key-value pairs.
In this embodiment, the key-value pairs of the URL of the source page are divided into important key-value pairs and non-important key-value pairs. An important key-value pair is not compressed, so that all of its information is retained; a non-important key-value pair is compressed, so as to shorten the URL of the source page to within the first threshold.
Specifically, a preset list may be established. Since the same browser may use the same key to express information with the same meaning when generating the URLs of different source pages, the preset list includes the keys of the important key-value pairs. If the key of a key-value pair of the URL of the source page exists in the preset list, that key-value pair is not compressed; if the key does not exist in the preset list, the following compression steps are executed. It is understood that the preset list may include not only the keys of the important key-value pairs corresponding to one browser, but also the keys of important key-value pairs corresponding to a plurality of browsers.
In addition, in practical applications, the step of extracting a preset number of characters from the value may be executed on every non-important key-value pair of the URL of the source page. However, in order to retain as much of the original information of the URL of the source page as possible while still bringing the processed URL within the first threshold, this embodiment may also execute the extraction step on the non-important key-value pairs one at a time. That is, each time the extraction step has been executed on one non-important key-value pair, it is determined whether the number of characters of the URL of the source page is greater than the first threshold; if not, the extraction step is not executed on the next non-important key-value pair; if so, the extraction step is executed on the next non-important key-value pair.
For example, assume that the non-important key-value pairs include a first key-value pair and a second key-value pair. A first preset number of characters is first extracted from the value of the first key-value pair to obtain a first processed URL of the source page. If the number of characters of the first processed URL of the source page is greater than the first threshold, a second preset number of characters is extracted from the value of the second key-value pair to obtain a second processed URL of the source page, wherein the number of characters of the second processed URL of the source page is smaller than or equal to the first threshold. The first preset number may be the same as or different from the second preset number; for the reasons, see the description of the preset number in the first embodiment.
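The one-pair-at-a-time extraction can be sketched as below. The function names are hypothetical, and a single preset number is used for every pair for brevity, although the text allows the first and second preset numbers to differ.

```python
from __future__ import annotations


def build_url(base: str, pairs: list[tuple[str, str]]) -> str:
    """Rebuild the URL from its base and key-value pairs."""
    return base + "?" + "&".join(f"{key}={value}" for key, value in pairs)


def compress_sequentially(base: str, pairs: list[tuple[str, str]], non_important_keys: set[str],
                          first_threshold: int, preset_number: int = 10) -> str:
    """Truncate non-important values one pair at a time, re-checking the length each time."""
    pairs = list(pairs)  # work on a copy
    for index, (key, value) in enumerate(pairs):
        if len(build_url(base, pairs)) <= first_threshold:
            break  # already short enough: leave the remaining pairs untouched
        if key in non_important_keys:
            pairs[index] = (key, value[:preset_number])
    return build_url(base, pairs)
```

Stopping as soon as the URL fits means later pairs keep their full values, which is the point of processing the pairs in sequence.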
The first key-value pair and the second key-value pair may be regarded as representatives of the non-important key-value pairs in the "chain" of non-important key-value pairs. The order of the non-important key-value pairs in this "chain" may be random or preset.
If the order is random, then to improve the compression efficiency of the URL of the source page, only the non-important key-value pairs whose values have a number of characters greater than or equal to a second threshold (e.g., 100 characters) may be selected for ordering. That is, the number of characters of the value of the first key-value pair and the number of characters of the value of the second key-value pair are both greater than or equal to the second threshold.
If the order is preset, then to improve the compression efficiency of the URL of the source page, key-value pairs whose values have more characters may be placed first and key-value pairs whose values have fewer characters may be placed later. Specifically, the key-value pairs may be sorted by the number of characters of their values from large to small, and each non-important key-value pair may then be compressed in turn according to the method provided above.
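The two ordering options can be illustrated with a short sketch. The helper name `order_for_truncation` and its parameters are hypothetical; the function covers the preset descending order, and, when `second_threshold` is given, first applies the filter described for the random case.

```python
def order_for_truncation(pairs, second_threshold=None):
    """Order (key, value) pairs so that the longest values are compressed first.

    If second_threshold is given, only pairs whose value has at least that many
    characters are kept as candidates.
    """
    candidates = [
        (key, value)
        for key, value in pairs
        if second_threshold is None or len(value) >= second_threshold
    ]
    # sorted() is stable, so pairs with equally long values keep their URL order.
    return sorted(candidates, key=lambda kv: len(kv[1]), reverse=True)
```

Processing the longest values first tends to bring the URL under the first threshold after fewer extraction steps, which is the efficiency gain the paragraph refers to.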
It should be noted that the above preferred ways of processing the key-value pairs in sequence and of ordering them by the number of characters of their values from large to small do not depend on the existence of non-important key-value pairs; they can also be adopted if the key-value pairs are not divided into important and non-important key-value pairs.
Step S204: determining whether the number of characters of the URL of the source page after extraction is greater than the first threshold; if so, performing step S205.
Step S205: if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
In this embodiment, a part of the key-value pairs may be extracted from the URL of the source page at random or according to a preset rule. For example, the extraction may be performed on the non-important key-value pairs so as to retain the information of the important key-value pairs. Further, since key-value pairs positioned later in the URL of the source page are less important, the non-important key-value pairs may be deleted one by one from back to front. Each time a non-important key-value pair is deleted, it is determined whether the number of characters of the remaining URL of the source page is greater than the first threshold; if not, the deletion stops; if so, the next non-important key-value pair is deleted. Deleting the non-important key-value pairs one by one in this way retains as much of their information in the URL of the source page as possible.
Of course, it can be understood that the way of sequentially deleting key-value pairs does not depend on non-important key-value pairs, and if the key-value pairs are not divided into important and non-important key-value pairs, the way can be adopted to retain the information of the key-value pairs in the URL of the source page.
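The back-to-front deletion can be sketched as below. This is an illustration under assumptions: the function name `drop_pairs_from_back` and the `important_keys` parameter are hypothetical, and leaving `important_keys` empty corresponds to the case where the key-value pairs are not divided into important and non-important ones.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_pairs_from_back(url, first_threshold, important_keys=frozenset()):
    """Delete key-value pairs from last to first until the URL fits the threshold."""
    parts = urlsplit(url)
    pairs = [p for p in parts.query.split("&") if p]

    def rebuild():
        return urlunsplit(
            (parts.scheme, parts.netloc, parts.path, "&".join(pairs), parts.fragment)
        )

    current = url
    # Walk the positions from the last pair to the first.
    for i in range(len(pairs) - 1, -1, -1):
        # Re-check the length after every deletion and stop as soon as it fits.
        if len(current) <= first_threshold:
            break
        key = pairs[i].partition("=")[0]
        if key in important_keys:
            continue  # important pairs are never deleted in this sketch
        del pairs[i]
        current = rebuild()
    return current
```

If only important pairs remain and the URL is still too long, the function returns a URL above the threshold, so a caller would still need the final length check of step S206.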
Step S206: if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
In this embodiment, a preset number of characters is extracted from the values of the key-value pairs, and if the number of characters of the URL of the source page after this extraction is still greater than the first threshold, a part of the key-value pairs is extracted, thereby combining the two ways of compressing the URL of the source page. Of course, in practical applications, the compression operation of extracting a part of the key-value pairs may be performed first, followed by the operation of extracting a preset number of characters from the values of the key-value pairs. Specifically, a certain number of key-value pairs may be extracted first, and when the number of characters of the URL of the source page after this extraction is found to be still greater than the first threshold, a preset number of characters may be extracted from the values of the key-value pairs, so as to compress the URL of the source page to within the first threshold.
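The combination of the two compression modes, in the order of steps S203 to S206, might look like the following sketch. The function `compress_query` and its parameters are hypothetical, "extracting" a value is again assumed to mean keeping its first `keep_chars` characters, and the query is passed in as an already parsed list of `(key, value)` pairs to keep the example short.

```python
def compress_query(base, pairs, first_threshold, keep_chars):
    """Truncate values first; if that is not enough, drop pairs from the back.

    base is the URL without its query string; pairs is a list of (key, value).
    Returns the processed URL, or None if it still exceeds the threshold.
    """
    def build(ps):
        return base + ("?" + "&".join(f"{k}={v}" for k, v in ps) if ps else "")

    pairs = list(pairs)
    # Stage 1: shorten one value at a time, re-checking the length each time.
    for i, (key, value) in enumerate(pairs):
        if len(build(pairs)) <= first_threshold:
            break
        pairs[i] = (key, value[:keep_chars])
    # Stage 2: still too long, so delete whole pairs from back to front.
    while pairs and len(build(pairs)) > first_threshold:
        pairs.pop()
    url = build(pairs)
    # Only a URL within the threshold is handed on to request generation.
    return url if len(url) <= first_threshold else None
```

Swapping the two stages gives the alternative order mentioned above, in which key-value pairs are removed first and values are shortened only if the URL is still too long.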
EXAMPLE III
Referring to fig. 5, a block diagram of a URL processing apparatus according to a third embodiment of the present invention is shown.
The URL processing apparatus provided in this embodiment includes: an acquisition unit 101, an extraction unit 102, and a generation unit 103;
the acquiring unit 101 is configured to acquire a URL of a source page of an accessed page;
the extracting unit 102 is configured to, if the number of characters of the URL of the source page is greater than a first threshold, extract a partial character string from the URL of the source page according to a preset rule to obtain a processed URL of the source page;
the generating unit 103 is configured to generate a request URL by using the processed URL of the source page if the number of characters of the processed URL of the source page is less than or equal to the first threshold, where the number of characters of the request URL satisfies a maximum length restriction condition of the request URL by a browser.
The URL processing apparatus includes a processor and a memory, the acquiring unit 101, the extracting unit 102, the generating unit 103, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
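As a rough illustration of how the three program units could cooperate, the sketch below models them as methods of one class. Everything here is an assumption made for the example: the class name `UrlProcessor`, the collector endpoint, the `ref` parameter name, and the use of plain prefix truncation as the "preset rule" are not taken from the embodiment.

```python
from urllib.parse import quote


class UrlProcessor:
    """Hypothetical stand-in for units 101 (acquire), 102 (extract) and 103 (generate)."""

    def __init__(self, first_threshold, collector_endpoint):
        self.first_threshold = first_threshold
        self.collector_endpoint = collector_endpoint  # assumed data acquisition server URL

    def acquire(self, page_context):
        # Acquiring unit: read the source page URL from a caller-supplied mapping.
        return page_context.get("source_url", "")

    def extract(self, source_url):
        # Extracting unit: placeholder preset rule that keeps a leading prefix.
        if len(source_url) <= self.first_threshold:
            return source_url
        return source_url[: self.first_threshold]

    def generate(self, processed_url):
        # Generating unit: build the request URL only if the processed URL fits.
        if len(processed_url) > self.first_threshold:
            return None
        return f"{self.collector_endpoint}?ref={quote(processed_url, safe='')}"
```

In practice the first threshold would have to be chosen so that the endpoint, the parameter name and the percent-encoding overhead still leave the final request URL within the browser's limit.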
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the success rate of collecting the data of websites accessed by users is improved by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
According to the URL processing device provided by this embodiment, when the number of characters of the URL of the source page is greater than the first threshold, a partial character string is extracted from the URL of the source page according to a preset rule; that is, the URL of the source page is compressed, so that the request URL in the GET request generated from the processed URL of the source page satisfies the browser's maximum length limit for the request URL. The generated GET request can therefore be sent to the data acquisition server, so that the user's website access data is collected and the success rate of collecting that data is improved.
Optionally, the URL of the source page includes a key-value pair, where the key-value pair includes a key and a value;
the extraction unit is specifically configured to:
extracting a preset number of characters from the values of the key-value pairs to obtain the processed URL of the source page, wherein the preset number is smaller than the maximum total number of characters of the values of the key-value pairs; and/or,
if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
Optionally, the key-value pair includes a first key-value pair and a second key-value pair;
the extraction unit is specifically configured to:
extracting a first preset number of characters from the value of the first key-value pair to obtain a first processed URL of the source page;
if the number of the characters of the first processed URL of the source page is larger than the first threshold, extracting a second preset number of characters from the value of the second key value pair to obtain a second processed URL of the source page, wherein the number of the characters of the second processed URL of the source page is smaller than or equal to the first threshold.
Optionally, if there are a plurality of key-value pairs included in the URL of the source page, the extracting unit is specifically configured to:
deleting the key-value pairs one by one in order from back to front;
each time a key-value pair is deleted, determining whether the number of characters of the remaining URL of the source page is greater than the first threshold; if not, stopping the deletion; if so, continuing to delete the next key-value pair.
The present application further provides a computer program product adapted, when executed on a data processing device, to execute program code initialized with the following method steps:
acquiring a URL of a source page of an accessed page;
if the number of the characters of the URL of the source page is larger than a first threshold value, extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page;
if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for URL processing, the method comprising:
acquiring a URL of a source page of an accessed page;
if the number of the characters of the URL of the source page is larger than a first threshold value, extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page;
if the number of characters of the processed URL of the source page is smaller than or equal to the first threshold, generating a request URL by using the processed URL of the source page, wherein the number of characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
2. The method of claim 1, wherein the URL of the source page includes key-value pairs, the key-value pairs including a key and a value;
extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page comprises the following steps:
extracting a preset number of characters from the values of the key-value pairs to obtain the processed URL of the source page, wherein the preset number is smaller than the maximum total number of characters of the values of the key-value pairs; and/or,
if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
3. The method of claim 2, wherein the key-value pair comprises a first key-value pair and a second key-value pair;
extracting a preset number of characters from the values of the key-value pairs to obtain a processed URL of a source page comprises:
extracting a first preset number of characters from the value of the first key-value pair to obtain a first processed URL of the source page;
if the number of the characters of the first processed URL of the source page is larger than the first threshold, extracting a second preset number of characters from the value of the second key value pair to obtain a second processed URL of the source page, wherein the number of the characters of the second processed URL of the source page is smaller than or equal to the first threshold.
4. The method of claim 3, wherein the number of characters of the value of the first key-value pair and the number of characters of the value of the second key-value pair are each greater than or equal to a second threshold value.
5. The method of claim 2, wherein if there are a plurality of key-value pairs included in the URL of the source page, the extracting the partial key-value pair comprises:
deleting the key-value pairs one by one in order from back to front;
each time a key-value pair is deleted, determining whether the number of characters of the remaining URL of the source page is greater than the first threshold; if not, stopping the deletion; if so, continuing to delete the next key-value pair.
6. The method of claim 2, wherein if there are a plurality of key-value pairs included in the URL of the source page, the extracting the partial key-value pair comprises:
arranging the key-value pairs in order of the number of characters of their values from small to large, and extracting the first M key-value pairs, wherein M is less than the total number of the key-value pairs.
7. An apparatus for URL processing, the apparatus comprising: the device comprises an acquisition unit, an extraction unit and a generation unit;
the acquisition unit is used for acquiring the URL of the source page of the accessed page;
the extraction unit is used for extracting partial character strings from the URL of the source page according to a preset rule to obtain a processed URL of the source page if the number of the characters of the URL of the source page is greater than a first threshold;
and the generating unit is used for generating a request URL by using the processed URL of the source page if the number of the characters of the processed URL of the source page is less than or equal to the first threshold, wherein the number of the characters of the request URL meets the maximum length limiting condition of a browser for the request URL.
8. The apparatus of claim 7, wherein the URL of the source page comprises key-value pairs, wherein the key-value pairs comprise a key and a value;
the extraction unit is specifically configured to:
extracting a preset number of characters from the values of the key-value pairs to obtain the processed URL of the source page, wherein the preset number is smaller than the maximum total number of characters of the values of the key-value pairs; and/or,
if the URL of the source page comprises a plurality of key value pairs, extracting partial key value pairs to obtain the processed URL of the source page.
9. The apparatus of claim 8, wherein the key-value pair comprises a first key-value pair and a second key-value pair;
the extraction unit is specifically configured to:
extracting a first preset number of characters from the value of the first key-value pair to obtain a first processed URL of the source page;
if the number of the characters of the first processed URL of the source page is larger than the first threshold, extracting a second preset number of characters from the value of the second key value pair to obtain a second processed URL of the source page, wherein the number of the characters of the second processed URL of the source page is smaller than or equal to the first threshold.
10. The apparatus according to claim 8, wherein if there are a plurality of key-value pairs included in the URL of the source page, the extracting unit is specifically configured to:
deleting the key-value pairs one by one in order from back to front;
each time a key-value pair is deleted, determining whether the number of characters of the remaining URL of the source page is greater than the first threshold; if not, stopping the deletion; if so, continuing to delete the next key-value pair.
CN201610996918.5A 2016-11-08 2016-11-08 URL processing method and device Active CN108073607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610996918.5A CN108073607B (en) 2016-11-08 2016-11-08 URL processing method and device

Publications (2)

Publication Number Publication Date
CN108073607A CN108073607A (en) 2018-05-25
CN108073607B true CN108073607B (en) 2020-03-06

Family

ID=62153739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610996918.5A Active CN108073607B (en) 2016-11-08 2016-11-08 URL processing method and device

Country Status (1)

Country Link
CN (1) CN108073607B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114285834A (en) * 2021-12-24 2022-04-05 山石网科通信技术股份有限公司 Message transmission method and device and terminal equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102843271A (en) * 2011-11-14 2012-12-26 哈尔滨安天科技股份有限公司 Formalization detection method and system for malicious URL (uniform resource locator)
CN105099829A (en) * 2015-08-30 2015-11-25 大连理工大学 Electronic resource service availability automatic monitoring method based on HTTP (Hyper Text Transfer Protocol) protocol

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258016A1 (en) * 2010-04-14 2011-10-20 Optify, Inc. Systems and methods for generating lead intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
网络信息构造与用户行为结合分析研究;高城;《中国优秀硕士学位论文全文数据库》;20150715(第07期);第二章第2.1-2.2节 *

Also Published As

Publication number Publication date
CN108073607A (en) 2018-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant