WO2017107708A1 - Method and device for mining uniform resource locator prefixes of an adaptive user agent - Google Patents

Method and device for mining uniform resource locator prefixes of an adaptive user agent

Info

Publication number
WO2017107708A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
adaptive
version
website
prefix
Prior art date
Application number
PCT/CN2016/106250
Other languages
English (en)
Chinese (zh)
Inventor
孙键
李毅
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司
Publication of WO2017107708A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • The present invention relates to the field of electronic technologies, and in particular to a method and apparatus for mining uniform resource locator (URL) prefixes of an adaptive user agent (UA).
  • The screen size of a personal computer (for example, a notebook computer or a desktop computer) differs considerably from that of a mobile terminal such as a mobile phone.
  • As a result, the mobile version of a webpage differs considerably in display style and content from the computer version of the same webpage.
  • More and more websites have both a computer version and a mobile version of their webpages.
  • Whether it is accessed from a computer browser or a mobile browser, such a website uses one unified URL: if the access request comes from a browser on a computer, the website returns the computer version of the webpage; if the access request comes from a mobile browser, the website returns the mobile version of the webpage.
  • Such a unified Uniform Resource Locator (URL) is often referred to as an adaptive UA URL.
  • The URLs of some websites, however, are not adaptive UA URLs.
  • For such websites, the mobile terminal needs to be provided with the URL of a webpage suited to display in a mobile browser, which can usually be achieved by mining the conversion rules between the two versions. For example, for a webpage A suited to display in a computer browser, the conversion rule between webpage A and the corresponding webpage B suited to display in a mobile browser is mined. When a request from a mobile browser to access webpage A is received, the request is redirected to webpage B based on the conversion rule and webpage B is displayed to the mobile user, so that the website can respond normally to access requests from both computers and mobile terminals.
  • An adaptive UA webpage, however, can already respond correctly to access requests from both computer browsers and mobile browsers. Mining conversion rules for such webpages increases the workload, and the mined conversion rules may be incorrect, causing the mobile terminal to display errors such as garbled characters or dead links.
  • Embodiments of the present invention provide an adaptive UA URL prefix mining method and device that can mine the URL prefixes of adaptive UAs. Based on these prefixes, adaptive UA webpages can be effectively excluded from conversion rule mining, which reduces the mining workload and improves the accuracy of the mined rules.
  • the first aspect of the embodiments of the present invention provides a method for mining a uniform resource locator URL prefix of an adaptive user agent UA, including:
  • the webpage of the adaptive UA includes a webpage capable of correctly responding to an access request of a browser of a computer end and an access request of a mobile browser;
  • the homepage of the website is an adaptive UA webpage
  • the first level prefix that satisfies the condition of the adaptive UA is determined to be the URL prefix of the adaptive UA.
  • the determining whether the homepage of the website is an adaptive UA webpage includes:
  • the calculating the similarity between the first version webpage and the second version webpage includes:
  • Meta-information similarity and/or page structure similarity of the first version webpage and the second version webpage are calculated.
  • the calculating meta-information similarity of the first version webpage and the second version webpage includes:
  • parsing the first version webpage and the second version webpage, and extracting first meta information of the first version webpage and second meta information of the second version webpage;
  • the meta information includes the information under the <meta> tag of the webpage;
  • the calculating the page structure similarity of the first version webpage and the second version webpage includes:
  • parsing the first version webpage and the second version webpage, and extracting first page structure information of the first version webpage and second page structure information of the second version webpage;
  • the calculating the similarity between the first version webpage and the second version webpage includes:
  • the similarity of the first version webpage and the second version webpage is calculated by means of edit distance or cosine similarity.
  • the determining whether the first-level prefix corresponding to the website address included in the website meets the conditions of the adaptive UA includes:
  • the first-level prefix is determined to satisfy the adaptive UA condition.
  • the method further includes:
  • the webpage corresponding to the URL prefix of the adaptive UA is stored in a preset webpage library.
  • the second aspect of the embodiments of the present invention further provides an adaptive UA URL prefix mining apparatus, including:
  • a first determining unit configured to determine whether the homepage of the website is an adaptive UA webpage, where an adaptive UA webpage is a webpage capable of correctly responding to both an access request of a computer browser and an access request of a mobile browser;
  • a second determining unit configured to determine, when the determination result of the first determining unit is yes, whether each first level prefix corresponding to the website address included in the website satisfies the condition of the adaptive UA;
  • a determining unit configured to determine, according to the determination result of the second determining unit, that the first level prefix that satisfies the condition of the adaptive UA is a URL prefix of the adaptive UA.
  • a third aspect of the embodiments of the present invention further provides an apparatus for mining uniform resource locator (URL) prefixes of an adaptive user agent (UA), including a memory and one or more programs.
  • the one or more programs are stored in the memory and configured to be executed by one or more processors.
  • the one or more programs include instructions for:
  • the webpage of the adaptive UA includes a webpage capable of correctly responding to an access request of a browser of a computer end and an access request of a mobile browser;
  • the homepage of the website is an adaptive UA webpage
  • the first level prefix that satisfies the condition of the adaptive UA is determined to be the URL prefix of the adaptive UA.
  • first determining whether the homepage of the website is an adaptive UA webpage and determining, for the homepage that is an adaptive UA website, which URL first-level prefixes included in the website meet the conditions of the adaptive UA, and The first-level prefix included in the website that satisfies the adaptive UA condition is determined as the URL prefix of the adaptive UA.
  • the embodiment of the present application can mine the URL prefix that meets the adaptive UA condition included in each common website. Based on this, it is possible to effectively remove the adaptive UA webpage in the conversion rule mining work, and avoid the defect of increasing the workload or obtaining the wrong conversion rule caused by the conversion rule of the adaptive UA URL in the prior art.
  • FIG. 1 is a flowchart of a method for mining a URL prefix of an adaptive UA according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for providing a webpage URL according to an embodiment of the present invention
  • FIG. 3 is a block diagram of an adaptive UA URL prefix mining apparatus according to an embodiment of the present invention.
  • FIG. 4 is a block diagram of an apparatus 800 for adaptive URL prefix mining of a UA, according to an exemplary embodiment
  • FIG. 5 is a schematic structural diagram of a server 1900 for adaptive URL prefix mining of a UA according to an exemplary embodiment.
  • The embodiment of the invention provides an adaptive UA URL prefix mining method and device that can mine the URL prefixes of adaptive UAs. Based on these prefixes, adaptive UA webpages can be effectively excluded from conversion rule mining, which reduces the mining workload and improves the accuracy of the mined rules.
  • FIG. 1 is a flowchart of a mining method according to an embodiment of the present invention. The method includes:
  • S1: collecting website homepages crawled by the web crawler of a computer-side search engine;
  • S2: determining whether the homepage of the website is an adaptive UA webpage, where the homepage is an adaptive UA webpage when the webpage corresponding to the homepage can correctly respond to both an access request of a computer browser and an access request of a mobile browser.
  • That the webpage corresponding to the homepage can correctly respond to the access request of a computer browser may mean that the desktop version of the webpage is returned to the computer browser in response to its access request and that the webpage can be accessed normally;
  • that the webpage corresponding to the homepage can correctly respond to the access request of a mobile browser may mean that the mobile version of the webpage is returned to the mobile browser in response to its access request and that the webpage can be accessed normally.
  • the homepage of the website is an adaptive UA webpage
  • For a website whose homepage is an adaptive UA webpage, it is determined which first-level URL prefixes included in the website satisfy the adaptive UA condition, and each first-level prefix that satisfies the condition is determined to be a URL prefix of the adaptive UA.
  • the embodiment of the present application can mine the URL prefix that meets the adaptive UA condition included in each common website. Based on this, it is possible to effectively remove the adaptive UA webpage in the conversion rule mining work, and avoid the defect of increasing the workload or obtaining the wrong conversion rule caused by the conversion rule of the adaptive UA URL in the prior art.
  • In addition, for an adaptive UA webpage, the mobile terminal can obtain the mobile version of the webpage simply by being sent the hyperlink corresponding to the search result, without the webpage suited to a computer browser first being converted, according to conversion rules, into the webpage suited to a mobile browser. This saves the work of finding the mobile version of the webpage and sending it to the mobile terminal, and also provides the web crawler of a mobile search engine with a source of webpages suited to display in a mobile browser.
  • The web crawler of the computer-side search engine, also called a web spider or web robot, is a program that automatically fetches webpages. It downloads webpages from the World Wide Web for the search engine and is an important component of the search engine.
  • A traditional crawler starts from the URLs of one or several initial webpages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and places them in a queue until a stop condition of the system is met.
  • All webpages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval; the results of this analysis may also provide feedback and guidance for future crawling.
  • A webpage crawled by the web crawler of the computer-side search engine refers to a webpage collected by the crawler that is suited to display in a computer browser, rather than a webpage suited to display in a mobile browser.
  • the homepage of the website crawled by the web crawler is WWW.XXXX.COM.
  • The mining method provided by the embodiment of the present invention then proceeds to S2, that is, determining whether the homepage of the website is an adaptive UA webpage.
  • When a browser sends an access request to a website, it sends its own UA to the website. After receiving the UA, the website sends the data corresponding to that UA to the browser for display.
  • If the website can return the desktop version of the homepage to a computer browser in response to its access request, with the webpage accessible normally, and can return the mobile version of the homepage to a mobile browser in response to its access request, with the webpage accessible normally, the homepage of the website is an adaptive UA webpage.
  • the step S2 of determining whether the home page of the website is an adaptive UA web page may include the following steps S21 to S22:
  • Step S21: fetching, with a computer-side browser UA, the first version webpage corresponding to the homepage of the website, and fetching, with a mobile browser UA, the second version webpage corresponding to the homepage of the website.
  • The computer-side browser UA may be, for example, the UA of the Firefox browser: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0;
  • the mobile browser UA may be, for example, the UA of the Android browser: Mozilla/5.0 (Linux; Android 4.4; Nexus 5 Build/BuildID) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36.
  • other computer-side browser UA or mobile browser UA may also be used.
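  • As an illustration of step S21, the following minimal Python sketch fetches the same URL twice, once with each of the UA strings quoted above. The function names (build_request, fetch, fetch_both_versions) and the timeout value are assumptions made for this example and do not come from the patent; a failed fetch is reported as None so that a caller can treat it as a "dead link".

```python
import http.client
from urllib.request import Request, urlopen

# UA strings quoted in the text; any other desktop or mobile UA could be used.
DESKTOP_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) "
              "Gecko/20100101 Firefox/21.0")
MOBILE_UA = ("Mozilla/5.0 (Linux; Android 4.4; Nexus 5 Build/BuildID) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
             "Chrome/30.0.0.0 Mobile Safari/537.36")


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given UA."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10.0):
    """Return the page HTML as text, or None if the fetch fails."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError, http.client.HTTPException):
        # Network errors, HTTP errors, malformed URLs and unknown charsets
        # are all treated as a failed fetch in this sketch.
        return None


def fetch_both_versions(url):
    """Return (first_version, second_version) for one URL (hypothetical helper)."""
    return fetch(url, DESKTOP_UA), fetch(url, MOBILE_UA)
```

  • A caller would pass a full URL including the scheme, for example fetch_both_versions("http://WWW.XXXX.COM/"), and skip the similarity step when either element of the result is None.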
  • Step S22: after the first version webpage has been fetched with the computer-side browser UA and the second version webpage has been fetched with the mobile browser UA, calculating the similarity between the first version webpage and the second version webpage.
  • Calculating the similarity between the first version webpage and the second version webpage may include calculating the meta-information similarity and/or the page structure similarity between the two.
  • In step S22, if fetching the website homepage with the computer-side browser UA and/or the mobile browser UA fails, for example by returning a "dead link" or an "invalid page", this indicates that the page is not an adaptive UA page. If multiple computer-side browser UAs and multiple mobile browser UAs are used for fetching, the fetch success rate can be used for the judgment: if it is lower than a certain value, the page is not an adaptive UA page. Other criteria may also be used; for example, if the number of UAs for which fetching failed exceeds a preset number, the page is not an adaptive UA page. No limitation is imposed here.
  • the similarity between the first version of the webpage and the second version of the webpage is calculated in step S22, and the calculation may be performed by the following steps:
  • Step S22a: parsing the first version webpage and the second version webpage, extracting first meta information and first page structure information of the first version webpage, and extracting second meta information and second page structure information of the second version webpage;
  • the first version webpage and the second version webpage may be parsed simultaneously or in any order; no restriction is imposed here.
  • Taking a homepage written in HTML (HyperText Markup Language), for example HTML5, as an example: the page can be parsed with XPath or dom4j. XPath is the XML (Extensible Markup Language) Path Language, a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language), and HTML is closely related to XML. dom4j is an XML application programming interface (API) for Java, an object-oriented programming language for writing cross-platform applications.
  • the meta information of each version webpage is extracted.
  • the meta information may be the information under the <meta> tag of the webpage.
  • the information under the <meta> tag is used to record meta information of an HTML webpage document, such as author, date and time, webpage description, keywords, and the like.
  • extracting the page structure information may include: removing the attribute and content of the page tag from the html source code of each version of the webpage, and retaining only the tag name, thereby leaving the page structure information of the version of the webpage.
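  • The extraction described above can be sketched with Python's standard html.parser module instead of the XPath or dom4j tools named in the text; this substitution, the class name PageFeatureParser, the function name extract_features, and the "key=value;..." layout of the meta string are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collects <meta> information and the sequence of tag names."""

    def __init__(self):
        super().__init__()
        self.meta = []  # (name-like attribute, content) pairs from <meta> tags
        self.tags = []  # tag names in document order, attributes and text dropped

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property") or a.get("http-equiv") or ""
            self.meta.append((key, a.get("content") or ""))


def extract_features(html):
    """Return (meta_string, tag_names) for one version of a page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    meta_string = ";".join(f"{k}={v}" for k, v in parser.meta)
    return meta_string, parser.tags
```

  • The meta string can then be compared with an edit-distance measure and the tag list with a frequency-based measure, in line with the similarity calculations described in the text.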
  • Step S22b calculating meta-information similarity of the first meta-information and the second meta-information, and calculating page structure similarity of the first page structure information and the second page structure information.
  • the meta-information similarity and the page structure similarity may be calculated by means of edit distance or cosine similarity.
  • EditDistance() is the unnormalized edit distance
  • len() is the length
  • max() is the maximum
  • p1 and p2 are the two strings.
  • p1 may be a meta information string extracted from the first version webpage
  • p2 may be a meta information string extracted from the second version webpage.
  • EditDistance(p1, p2) represents the minimum number of editing operations required to convert from p1 to p2
  • max(len(p1), len(p2)) represents the larger of the string lengths of p1 and p2. It should be noted that EditDistance(p1, p2) can equally be obtained by calculating the minimum number of editing operations required to convert p2 into p1.
  • p1 may be a page structure information string extracted from the first version of the webpage
  • p2 may be a page structure information string extracted from the second version of the webpage.
  • EditDistance(p1, p2) represents the minimum number of editing operations required to convert from p1 to p2 or p2 to p1
  • max(len(p1), len(p2)) represents the larger of the string lengths of p1 and p2.
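  • The quantities named above can be sketched in Python as follows. The formula that combines them is not reproduced in this text, so the normalization used here, 1 - EditDistance(p1, p2) / max(len(p1), len(p2)), is an assumption, chosen so that a larger value means more similar; the function names are likewise illustrative.

```python
def edit_distance(p1, p2):
    """Unnormalized edit (Levenshtein) distance between two sequences."""
    prev = list(range(len(p2) + 1))
    for i, c1 in enumerate(p1, 1):
        curr = [i]
        for j, c2 in enumerate(p2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(p1, p2):
    """Assumed normalization: 1 - EditDistance / max(len(p1), len(p2))."""
    longest = max(len(p1), len(p2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(p1, p2) / longest
```

  • Because the functions only index and compare elements, p1 and p2 may be meta-information strings or lists of tag names.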
  • The page structure information may refer to the information obtained by removing the attributes and content of the page tags from the HTML source code and retaining only the tag names.
  • For a given webpage, removing the attributes and content of the page tags from the HTML source code and retaining only the tag names yields a text made up of tags such as the paragraph tag <p>, the heading tag <h>, the text formatting tag <b>, the hyperlink tag <a>, and so on.
  • The page structure information of the first version webpage may be taken as a first set and that of the second version webpage as a second set. The frequency of each tag in each of the two sets is calculated, and a vector is generated for each set from these tag frequencies.
  • The cosine similarity of the two vectors is then calculated, which gives the cosine similarity between the first version webpage and the second version webpage.
  • The page structure information of the first version webpage and that of the second version webpage may each be stored in a "primary key-value" storage structure, in which the primary key is the tag name and the value is the frequency of that tag.
  • Computing the structure information of the two versions and loading it into such storage tables yields the vectors corresponding to the two versions; the cosine similarity between the two vectors is the cosine similarity between the first version webpage and the second version webpage. The larger the value, the more similar the two webpages.
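  • A minimal Python sketch of this tag-frequency comparison follows. collections.Counter plays the role of the "primary key-value" table (tag name mapped to frequency); the function name cosine_similarity and the choice to return 0.0 for an empty tag list are assumptions of this example.

```python
import math
from collections import Counter


def cosine_similarity(tags1, tags2):
    """Cosine similarity of the tag-frequency vectors of two pages."""
    f1, f2 = Counter(tags1), Counter(tags2)  # tag name -> frequency
    dot = sum(f1[t] * f2[t] for t in f1.keys() & f2.keys())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a page without tags is treated as sharing no structure
    return dot / (norm1 * norm2)
```

  • Tags that occur in only one of the two pages contribute to that page's norm but not to the dot product, which lowers the similarity.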
  • the similarity of the webpage can also be calculated by means of text similarity and the like, and will not be described again here.
  • Step S22c: when the meta-information similarity calculated in step S22b is less than the first threshold, and/or the page structure similarity is less than the second threshold, the website homepage may be determined to be an adaptive UA webpage.
  • The similarity between the first version webpage and the second version webpage may be judged with both the meta-information similarity and the page structure similarity as reference, or with either one of them.
  • Taking the case where both are used as reference, with the first threshold set to 0.9 and the second threshold set to 0.9: if the meta-information similarity between the first version webpage and the second version webpage is less than 0.9, and the page structure similarity between them is also less than 0.9, the similarity of the two versions is small.
  • That is, the webpage returned to a computer-side browser for the homepage and the webpage returned to a mobile browser differ substantially, and the homepage can be determined to be an adaptive UA webpage; otherwise, it can be determined that the homepage is not an adaptive UA webpage.
  • Taking the case where either one is used as reference, again with the first threshold set to 0.9 and the second threshold set to 0.9: as long as the meta-information similarity between the first version webpage and the second version webpage is less than 0.9, or the page structure similarity between them is less than 0.9, the similarity between the two versions can be determined to be small, so the homepage can be determined to be an adaptive UA webpage.
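  • The two threshold variants described above can be written as one small Python function. The name is_adaptive_page and the require_both flag are assumptions of this sketch; the default thresholds of 0.9 are the example values given in the text.

```python
def is_adaptive_page(meta_sim, structure_sim,
                     first_threshold=0.9, second_threshold=0.9,
                     require_both=True):
    """Decide whether two fetched versions differ enough to indicate adaptive UA.

    require_both=True  -> both similarities must fall below their thresholds.
    require_both=False -> one similarity below its threshold is enough.
    """
    meta_low = meta_sim < first_threshold
    structure_low = structure_sim < second_threshold
    if require_both:
        return meta_low and structure_low
    return meta_low or structure_low
```

  • For example, is_adaptive_page(0.5, 0.95) is False under the "both" variant, while is_adaptive_page(0.5, 0.95, require_both=False) is True.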
  • The mining method provided by the embodiment of the present invention then proceeds to S3, that is, determining whether each first-level prefix corresponding to the URLs included in the website satisfies the adaptive UA condition.
  • For example, the URL of a webpage included in the website may be WWW.XXXX.COM/yyyy/zzz.html, where WWW.XXXX.COM/yyyy is a first-level prefix included in the website.
  • If WWW.XXXX.COM/yyyy/zzz.html is an adaptive UA URL, all URLs under the first-level prefix WWW.XXXX.COM/yyyy to which it belongs may be considered adaptive UA URLs. That is, WWW.XXXX.COM/yyyy/aaa.asp is an adaptive UA URL, WWW.XXXX.COM/yyyy/bbb.asp is also an adaptive UA URL, and so on.
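  • Reading the first-level prefix as "host plus first path directory", as the WWW.XXXX.COM/yyyy example suggests, a possible Python sketch is shown below. The function name first_level_prefix, the handling of URLs without a scheme, and the decision to return None for pages that sit directly under the host are assumptions of this example.

```python
from urllib.parse import urlsplit


def first_level_prefix(url):
    """Return 'host/first-directory' for a URL, or None if there is no directory."""
    # urlsplit only recognizes the host when the URL contains '//'.
    parts = urlsplit(url if "//" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    # The last segment is a page name unless the path ends with '/'.
    directories = segments if parts.path.endswith("/") else segments[:-1]
    if not directories:
        return None
    return f"{parts.netloc}/{directories[0]}"
```

  • With this sketch, WWW.XXXX.COM/yyyy/aaa.asp and WWW.XXXX.COM/yyyy/bbb.asp both map to WWW.XXXX.COM/yyyy, so URLs can be grouped by prefix before sampling.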
  • Determining whether each first-level prefix corresponding to the URLs included in the website satisfies the adaptive UA condition may include: randomly extracting a plurality of URLs under a given first-level prefix; determining whether each of the extracted URLs satisfies the adaptive UA condition; and, if the proportion of extracted URLs under the first-level prefix that satisfy the adaptive UA condition exceeds a third preset threshold, considering the first-level prefix to satisfy the adaptive UA condition.
  • That is, a plurality of webpage URLs under a given first-level prefix of the website may be randomly selected to determine whether the corresponding webpages are adaptive UA webpages.
  • The method for determining whether such a webpage is an adaptive UA webpage is the same as the method used in step S2 for the website homepage, which has been described in detail above and is not repeated here.
  • If the ratio of the number of adaptive UA URLs among the extracted URLs under the first-level prefix to the total number of extracted URLs is greater than a third threshold (for example, 0.9), the first-level prefix is a URL prefix of the adaptive UA.
  • Other criteria may also be used. For example, if the number of adaptive UA URLs among the URLs under the first-level prefix exceeds a first preset number, the first-level prefix may be determined to be an adaptive UA URL prefix; or, if the number of URLs under the first-level prefix that are not adaptive UA URLs exceeds a second preset number, the first-level prefix may be determined to be a non-adaptive UA URL prefix. Details are not repeated here.
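  • The ratio test for a first-level prefix can be sketched in Python as follows. The function name prefix_is_adaptive, the sample size of 20, and the is_adaptive_url callback (standing in for the per-page check of step S2) are hypothetical; the 0.9 default is the example value of the third threshold given in the text.

```python
import random


def prefix_is_adaptive(urls_under_prefix, is_adaptive_url,
                       sample_size=20, third_threshold=0.9, rng=None):
    """Sample URLs under one first-level prefix and apply the ratio test."""
    urls = list(urls_under_prefix)
    if not urls:
        return False  # nothing to sample, so the prefix cannot be confirmed
    rng = rng or random.Random()
    k = min(sample_size, len(urls))
    sample = rng.sample(urls, k)  # random extraction without replacement
    adaptive = sum(1 for url in sample if is_adaptive_url(url))
    return adaptive / k > third_threshold
```

  • The comparison is strict, so with the default threshold a sample in which exactly 90% of the URLs pass the check does not qualify the prefix.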
  • the mining method provided by the embodiment of the present invention further includes: storing the webpage corresponding to the first-level prefix of the adaptive UA into the preset webpage library.
  • the preset webpage library is used to store a webpage capable of adapting to the UA.
  • After the URL prefixes of the adaptive UA have been mined, the webpages corresponding to those URL prefixes can be stored in the preset webpage library, which provides the web crawler of a mobile search engine with a source of webpages suited to display in a mobile browser and with crawlable seeds.
  • FIG. 2 is a flowchart of a method for providing a webpage URL according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
  • S100 providing a corresponding search result by using a mobile search engine based on a search request of the mobile terminal for the webpage;
  • when the search result includes a computer version webpage provided by the computer-side search engine, determining whether the computer version webpage belongs to the preset webpage library, where the webpages in the preset webpage library are obtained with the mining method provided by the embodiment of the present invention;
  • When a user searches through the browser of a mobile terminal such as a smartphone or a tablet computer, the browser can display the corresponding search results provided by the mobile search engine.
  • The search results generally include computer version webpages provided by a computer-side search engine; at this point it can be determined whether such a computer version webpage belongs to the preset webpage library, whose webpages are obtained by the foregoing mining method according to the embodiment of the present invention.
  • If it does, the mobile terminal is directly provided with the URL of the computer version webpage, so that when the user accesses that URL through the mobile terminal, the mobile version of the webpage is obtained directly.
  • In the prior art, the mobile version webpage corresponding to the computer version webpage has to be found and then returned to the user. With the method provided by the embodiment of the present invention, if the webpage corresponding to a search result is an adaptive UA webpage and is included in the preset webpage library, the hyperlink of that webpage can be provided directly, and the user automatically obtains the mobile version when accessing it through the mobile terminal, which saves the work of finding the mobile version of the webpage and sending it specifically to the mobile terminal.
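  • The lookup described above might be sketched in Python as follows. Modelling the preset webpage library as a set of mined first-level prefixes is an assumption of this example, as are the function name choose_link and the convert callback, which stands in for whatever conversion-rule step would otherwise be applied.

```python
def choose_link(desktop_url, adaptive_prefixes, convert):
    """Return the URL to hand to a mobile terminal for one search result."""
    bare = desktop_url.split("://", 1)[-1]  # drop the scheme, if any
    for prefix in adaptive_prefixes:
        if bare == prefix or bare.startswith(prefix + "/"):
            # Adaptive UA page: the same URL already serves the mobile version.
            return desktop_url
    # Not in the library: fall back to the hypothetical conversion step.
    return convert(desktop_url)
```

  • Matching on prefix + "/" rather than on the bare prefix avoids treating WWW.XXXX.COM/yyyyy as if it were under WWW.XXXX.COM/yyyy.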
  • the homepage of the website is an adaptive UA webpage
  • For a website whose homepage is an adaptive UA webpage, it is determined which first-level URL prefixes included in the website satisfy the adaptive UA condition, and each first-level prefix that satisfies the condition is determined to be a URL prefix of the adaptive UA.
  • the embodiment of the present application can mine the URL prefix that meets the adaptive UA condition included in each common website. Based on this, it is possible to effectively remove the adaptive UA webpage in the conversion rule mining work, and avoid the defect of increasing the workload or obtaining the wrong conversion rule caused by the conversion rule of the adaptive UA URL in the prior art.
  • FIG. 3 is a block diagram of a mining device according to an embodiment of the present invention.
  • the collecting unit 201 is configured to collect a webpage crawled by a web crawler of a computer-side search engine
  • the first determining unit 202 is configured to determine whether the homepage of the website is an adaptive UA webpage, where the webpage of the adaptive UA includes a webpage capable of correctly responding to an access request of the browser of the computer and an access request of the mobile browser;
  • the second determining unit 203 is configured to determine, when the determination result of the first determining unit 202 is YES, whether the first level prefix corresponding to the website address included in the website satisfies the condition of the adaptive UA;
  • the determining unit 204 is configured to determine, according to the determination result of the second determining unit 203, that the first level prefix that satisfies the condition of the adaptive UA is the URL prefix of the adaptive UA.
  • the first determining unit 202 may include: a first crawling subunit, a second crawling subunit, a first calculating subunit, and a first determining subunit.
  • the first crawling sub-unit is configured to obtain the first version webpage corresponding to the homepage of the website by using the computer-side browser UA;
  • the second crawling sub-unit is configured to obtain the second version webpage corresponding to the homepage of the website by using the mobile browser UA;
  • a first calculating subunit configured to calculate a similarity between the first version webpage and the second version webpage
  • the first determining subunit is configured to determine that the website homepage is an adaptive UA webpage when the similarity is less than a preset first threshold.
  • the first computing subunit is configured to calculate meta information similarity and/or page structure similarity of the first version webpage and the second version webpage.
  • the first computing subunit includes: a first webpage parsing subunit and a meta information similarity calculating subunit;
  • a first webpage parsing subunit configured to parse the first version webpage and the second version webpage, and extract first meta information of the first version webpage and second meta information of the second version webpage;
  • the meta information includes the information under the <meta> tag of the webpage;
  • the meta information similarity calculation subunit is configured to calculate the similarity between the first meta information and the second meta information as the meta information similarity of the first version webpage and the second version webpage.
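One way to picture the meta information comparison is the sketch below. It assumes the meta information is the set of attributes on each `<meta>` tag and that similarity is measured with `difflib.SequenceMatcher`; the helper names `MetaCollector`, `extract_meta`, and `meta_similarity` are hypothetical and do not come from the patent.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Sort by attribute name so attribute order does not affect the result.
            ordered = sorted(attrs, key=lambda kv: kv[0])
            self.meta.append(
                " ".join(f"{k}={v}" for k, v in ordered if v is not None))


def extract_meta(html: str) -> str:
    """Return one line of text per <meta> tag found in `html`."""
    parser = MetaCollector()
    parser.feed(html)
    return "\n".join(parser.meta)


def meta_similarity(first_html: str, second_html: str) -> float:
    """Similarity in [0, 1] between the meta information of two page versions."""
    return SequenceMatcher(
        None, extract_meta(first_html), extract_meta(second_html)).ratio()
```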
  • the first computing subunit includes: a second webpage parsing subunit and a page similarity calculating subunit;
  • a second webpage parsing subunit configured to parse the first version webpage and the second version webpage, and extract the first page structure information of the first version webpage and the second page structure information of the second version webpage;
  • the page similarity calculation subunit is configured to calculate the similarity between the first page structure information and the second page structure information as the page structure similarity between the first version webpage and the second version webpage.
  • the first calculating subunit is configured to calculate the similarity between the first version webpage and the second version webpage by means of edit distance or cosine similarity.
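The two similarity measures named above can be illustrated on page structure information. In this sketch the page structure is assumed to be the sequence of opening tags in document order, which is a simplification chosen here and not a definition taken from the patent; all function names are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records opening tags in document order as a crude page structure."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def page_structure(html):
    parser = TagSequence()
    parser.feed(html)
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalised to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine_similarity(a, b):
    """Cosine of the angle between the tag-count vectors of two sequences."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0
```

Edit distance is sensitive to the order of tags, whereas cosine similarity over tag counts ignores order, so the two can disagree on pages that reuse the same elements in a different layout.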
  • the second determining unit 203 may include: an extracting subunit, a URL determining subunit, and a second determining subunit;
  • a URL determining subunit configured to respectively determine whether at least one URL under the extracted first-level prefix satisfies the condition of the adaptive UA;
  • a second determining subunit configured to determine that the first-level prefix satisfies the condition of the adaptive UA if, according to the determination result of the URL determining subunit, the ratio of the number of URLs under the extracted first-level prefix that satisfy the condition of the adaptive UA to the total number of extracted URLs is greater than a preset threshold.
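The prefix-level decision can be sketched as below. Two points are assumptions of this sketch and are not stated in the text: a first-level prefix is taken to be the scheme and host plus the first path directory, and the threshold defaults to 0.8. The `is_adaptive` callable stands in for whatever per-URL check is used.

```python
from collections import defaultdict
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def first_level_prefix(url: str) -> str:
    """Scheme and host plus the first path directory, if the URL has one."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    head = f"{parts.scheme}://{parts.netloc}/"
    # The first segment counts as a directory only if something follows it
    # or the path ends with a slash.
    has_directory = len(segments) > 1 or (
        bool(segments) and parts.path.endswith("/"))
    return head + segments[0] + "/" if has_directory else head


def adaptive_prefixes(urls: Iterable[str],
                      is_adaptive: Callable[[str], bool],
                      threshold: float = 0.8) -> List[str]:
    """Return the prefixes whose share of adaptive UA URLs exceeds `threshold`."""
    groups = defaultdict(list)
    for url in urls:
        groups[first_level_prefix(url)].append(url)
    result = []
    for prefix, members in groups.items():
        hits = sum(1 for u in members if is_adaptive(u))
        if hits / len(members) > threshold:
            result.append(prefix)
    return result
```

Deciding per prefix instead of per URL means only a sample of URLs under each prefix needs to be checked, at the cost of misclassifying prefixes whose pages behave inconsistently.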
  • the device may further include a storage unit 205, where the storage unit 205 is specifically configured to store the webpage corresponding to the URL prefix of the adaptive UA into the preset webpage library.
  • the URL prefix mining device of the adaptive UA provided by the embodiment of the present invention and the URL prefix mining method of the adaptive UA introduced in the foregoing section are two aspects of the same inventive concept.
  • since the specific process of the URL prefix mining method has been described in detail in the foregoing section, the URL prefix mining device of the adaptive UA is not described in further detail here, for brevity of the specification.
  • a third aspect of the embodiments of the present invention further provides an apparatus for uniform resource locator URL prefix mining of an adaptive user agent UA, including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for:
  • the webpage of the adaptive UA includes a webpage capable of correctly responding to the access request of the browser of the computer and the access request of the mobile browser;
  • determining whether the first-level prefix corresponding to each URL included in the website satisfies the condition of the adaptive UA;
  • the first level prefix that satisfies the condition of the adaptive UA is determined to be the URL prefix of the adaptive UA.
  • FIG. 4 is a block diagram of an apparatus 800 for URL prefix mining of an adaptive UA, according to an exemplary embodiment.
  • device 800 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • apparatus 800 can include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
  • Processing component 802 typically controls the overall operation of device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • Processing component 802 can include one or more processors 820 to execute instructions to perform all or part of the steps of the above described methods.
  • processing component 802 can include one or more modules to facilitate interaction between component 802 and other components.
  • processing component 802 can include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
  • Memory 804 is configured to store various types of data to support operation at device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power component 806 provides power to various components of device 800.
  • Power component 806 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 800.
  • the multimedia component 808 includes a screen that provides an output interface between the device 800 and the user.
  • the screen can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 810 is configured to output and/or input an audio signal.
  • the audio component 810 includes a microphone (MIC) that is configured to receive an external audio signal when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode.
  • the received audio signal may be further stored in memory 804 or transmitted via communication component 816.
  • the audio component 810 also includes a speaker for outputting an audio signal.
  • the I/O interface 812 provides an interface between the processing component 802 and the peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
  • Sensor component 814 includes one or more sensors for providing device 800 with status assessments of various aspects.
  • For example, sensor component 814 can detect an open/closed state of device 800 and the relative positioning of components, such as the display and keypad of device 800. Sensor component 814 can also detect a change in position of device 800 or of one component of device 800, the presence or absence of user contact with device 800, the orientation or acceleration/deceleration of device 800, and temperature changes of device 800.
  • Sensor component 814 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 can also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 816 is configured to facilitate wired or wireless communication between device 800 and other devices.
  • the device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
  • In an exemplary embodiment, there is also provided a non-transitory computer readable storage medium comprising instructions, such as a memory 804 comprising instructions, which are executable by processor 820 of apparatus 800 to perform the above method.
  • For example, the non-transitory computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
  • a non-transitory computer readable storage medium, wherein the instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a uniform resource locator URL prefix mining method of an adaptive user agent UA,
  • the method includes:
  • the webpage of the adaptive UA includes a webpage capable of correctly responding to an access request of a browser of a computer end and an access request of a mobile browser;
  • the homepage of the website is an adaptive UA webpage
  • the first level prefix that satisfies the condition of the adaptive UA is determined to be the URL prefix of the adaptive UA.
  • FIG. 5 is a schematic structural diagram of a server 1900 for URL prefix mining of an adaptive UA according to an exemplary embodiment.
  • the server 1900 can vary considerably depending on configuration or performance, and can include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 storing data 1942 or data 1944 (e.g., one or more mass storage devices).
  • the memory 1932 and the storage medium 1930 may be short-term storage or persistent storage.
  • the program stored on storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations in the server.
  • central processor 1922 can be configured to communicate with storage medium 1930 and to execute, on server 1900, the series of instruction operations in storage medium 1930.
  • Server 1900 may also include one or more power sources 1926, one or more wired or wireless network interfaces 1950, one or more input and output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, for example, Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

Abstract

Disclosed are a method and apparatus for mining uniform resource locator (URL) prefixes of an adaptive user agent (UA). The method comprises: collecting a website homepage crawled by a web crawler of a computer-side search engine; determining whether the website homepage is an adaptive UA webpage; when the website homepage is an adaptive UA webpage, determining whether each first-level prefix corresponding to a web address included in the website satisfies the adaptive UA condition; and determining that a first-level prefix satisfying the adaptive UA condition is an adaptive UA URL prefix. By means of the embodiments of the present invention, URL prefixes that satisfy the adaptive UA condition and are included in various mainstream websites can be mined. On this basis, adaptive UA webpages can be effectively excluded during the mining of transformation rules, which avoids the prior-art drawback of increased workload or erroneous transformation rules when mining URL transformation rules.
PCT/CN2016/106250 2015-12-25 2016-11-17 Method and apparatus for mining uniform resource locator prefixes of an adaptive user agent WO2017107708A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510996559.9A CN105630987B (zh) 2015-12-25 2015-12-25 自适应用户代理的统一资源定位符前缀挖掘方法和装置
CN201510996559.9 2015-12-25

Publications (1)

Publication Number Publication Date
WO2017107708A1 (fr) 2017-06-29

Family

ID=56045920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/106250 WO2017107708A1 (fr) 2015-12-25 2016-11-17 Method and apparatus for mining uniform resource locator prefixes of an adaptive user agent

Country Status (2)

Country Link
CN (1) CN105630987B (fr)
WO (1) WO2017107708A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804532A (zh) * 2018-05-03 2018-11-13 腾讯科技(深圳)有限公司 一种查询意图的挖掘和查询意图的识别方法、装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630987B (zh) * 2015-12-25 2019-06-21 北京搜狗科技发展有限公司 自适应用户代理的统一资源定位符前缀挖掘方法和装置
CN107357766B (zh) * 2017-07-19 2018-06-26 掌阅科技股份有限公司 基于电子书的页面编辑方法、电子设备及计算机存储介质
CN110968770B (zh) * 2018-09-29 2023-09-05 北京国双科技有限公司 一种终止爬虫工具爬取的方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008097201A (ja) * 2006-10-10 2008-04-24 Nec Corp ブラウザデータ共有システム、サーバ、方法およびプログラム
CN103577447A (zh) * 2012-07-30 2014-02-12 百度在线网络技术(北京)有限公司 一种用于确定目标页面的页面类型信息的方法和设备
CN104392009A (zh) * 2014-12-19 2015-03-04 北京奇虎科技有限公司 获取移动站点链接地址的方法和装置
CN104504100A (zh) * 2014-12-29 2015-04-08 北京奇虎科技有限公司 一种确定pc网页与移动网页自适应关系的系统及方法
CN104881453A (zh) * 2015-05-18 2015-09-02 百度在线网络技术(北京)有限公司 一种识别网页类型的方法和装置
CN105630987A (zh) * 2015-12-25 2016-06-01 北京搜狗科技发展有限公司 自适应用户代理的统一资源定位符前缀挖掘方法和装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572931B (zh) * 2014-12-29 2016-06-22 北京奇虎科技有限公司 一种确定pc网页与移动网页自适应关系的系统及方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804532A (zh) * 2018-05-03 2018-11-13 腾讯科技(深圳)有限公司 一种查询意图的挖掘和查询意图的识别方法、装置
CN108804532B (zh) * 2018-05-03 2020-06-26 腾讯科技(深圳)有限公司 一种查询意图的挖掘和查询意图的识别方法、装置

Also Published As

Publication number Publication date
CN105630987A (zh) 2016-06-01
CN105630987B (zh) 2019-06-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16877519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16877519

Country of ref document: EP

Kind code of ref document: A1