WO2023282848A1 - Webpage classification method, apparatus, storage medium, and electronic device - Google Patents

Info

Publication number
WO2023282848A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
information
classified
web page
category
Prior art date
Application number
PCT/SG2022/050381
Other languages
English (en)
French (fr)
Inventor
汪罕
熊泓宇
冯一琦
刘臻
张皓程
刘宾
Original Assignee
脸萌有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 脸萌有限公司
Publication of WO2023282848A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure provides a method for classifying webpages, including: acquiring feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from the rendered image result of rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified according to each piece of feature information; and determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.
  • the present disclosure provides a device for classifying webpages, including: a first acquisition module, configured to acquire feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from the rendered image result of rendering the webpage to be classified; a prediction module, configured to predict a candidate webpage category of the webpage to be classified according to each piece of feature information; and a determination module, configured to determine, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method for classifying webpages described in any one of the above-mentioned first aspects are implemented.
  • the present disclosure provides an electronic device, including: a storage device, on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device, so as to implement the steps of the webpage classification method described in any one of the above first aspect.
  • various kinds of feature information of the webpage to be classified are used to predict candidate webpage categories, and the target webpage category is then determined from those candidates, which improves the accuracy of webpage classification. Moreover, the feature information used for classification is selected from search engine optimization information, webpage sharing information, webpage advertisement information and webpage rendering information; classifying with webpage-related feature information from different dimensions can fundamentally improve the accuracy of webpage classification.
  • FIG. 1 is a flow chart of a method for classifying webpages according to an exemplary embodiment of the present disclosure.
  • Fig. 2 is a block diagram of an apparatus for classifying webpages according to an exemplary embodiment of the present disclosure.
  • Fig. 3 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

DETAILED DESCRIPTION

  • embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings.
  • Fig. 1 is a flow chart showing a method for classifying webpages according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the method for classifying webpages includes: Step 101, obtaining feature information of a webpage to be classified.
  • the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from the rendered image result of rendering the webpage to be classified.
  • Step 102, predicting a candidate webpage category of the webpage to be classified according to each piece of feature information.
  • Step 103, determining the target webpage category to which the webpage to be classified belongs from all the candidate webpage categories.
  • various kinds of feature information of the webpage to be classified are used to predict candidate webpage categories, and the target webpage category is then determined from those candidates, which improves the accuracy of webpage classification. Moreover, the feature information used for classification is selected from search engine optimization information, webpage sharing information, webpage advertisement information and webpage rendering information; classifying with webpage-related feature information from different dimensions can fundamentally improve the accuracy of webpage classification.
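  • Steps 101 and 102 can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: the names `predict_candidates` and `toy_predictor` and the keyword rule are assumptions, since the text does not fix how each predictor works.

```python
from typing import Callable, Dict

# Hypothetical type: a predictor maps one kind of feature information
# (here simplified to text) to a candidate webpage category.
Predictor = Callable[[str], str]


def predict_candidates(features: Dict[str, str],
                       predictors: Dict[str, Predictor]) -> Dict[str, str]:
    """Step 102: predict one candidate category per piece of feature information."""
    if len(features) < 2:
        raise ValueError("at least two kinds of feature information are required")
    return {name: predictors[name](info) for name, info in features.items()}


def toy_predictor(text: str) -> str:
    # Toy keyword rule, for illustration only.
    return "sports" if "football" in text.lower() else "news"


# Step 101: feature information of the webpage to be classified (made-up data).
features = {
    "seo": "Football scores and fixtures",
    "sharing": "Latest football transfer rumours",
}
predictors = {"seo": toy_predictor, "sharing": toy_predictor}
candidates = predict_candidates(features, predictors)
print(candidates)
```

  • Keeping one predictor per kind of feature information mirrors the wording of Step 102, where each piece of feature information yields its own candidate category before Step 103 picks the target.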
  • the above steps are described in detail below with examples. Firstly, it needs to be explained that the categories of web pages may be sports, novels, shopping, news, etc.
  • search engine optimization refers to adjusting and optimizing a website internally and externally, based on an understanding of the natural ranking mechanism of the search engine, so as to improve the natural ranking of keywords in the search engine and obtain more traffic.
  • the search engine optimization information may be information in header elements, keyword tags, and description tags in the hypertext markup language that constructs the webpage to be classified.
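  • As an illustration, such tags could be read with Python's standard `html.parser` module. The sketch assumes that "header elements" means the `<title>` element and that the keyword and description tags are `<meta name="keywords">` and `<meta name="description">`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SeoInfoParser(HTMLParser):
    """Collects the <title> text and the keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.info = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.info[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.info["title"] += data


def extract_seo_info(html: str) -> dict:
    parser = SeoInfoParser()
    parser.feed(html)
    return parser.info


page = """<html><head><title>Match Day</title>
<meta name="keywords" content="football, scores">
<meta name="description" content="Live football scores."></head></html>"""
seo = extract_seo_info(page)
print(seo)
```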
  • the webpage sharing information may be information such as URLs of webpages to be classified, webpage titles of webpages to be classified, and webpage contents of webpages to be classified.
  • Web page advertisement information is an advertisement delivered to users. Webpage advertisement information may be placed on platforms such as webpages, applications, or other digital environments. Exemplarily, the webpage advertisement information may be information delivered in formats such as video, picture, and audio.
  • Web page rendering refers to the whole process in which a browser converts the hypertext markup language of a web page into an image that users can see intuitively.
  • the web page rendering information may refer to information extracted from a rendered image result of rendering a web page to be classified. It can be understood that the extracted information may be text information or picture information, which is not limited in this embodiment.
  • an OCR (Optical Character Recognition)
  • the prominent area refers to the central position of the webpage to be classified as shown on the display. It can be understood that the information most closely related to the webpage is located at the central position of the display interface. Therefore, the webpage rendering information can be extracted from the central position of the display.
  • the center position can be determined by the resolution of the display.
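  • One way to derive such a central area from the resolution is sketched below. The `fraction` parameter is an assumption introduced for illustration; the text only states that the center position can be determined from the display resolution.

```python
def central_region(width: int, height: int, fraction: float = 0.5):
    """Return (left, top, right, bottom) of a box centred on the display.

    `fraction` is a hypothetical parameter: the share of each dimension
    that the box covers.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    box_w, box_h = int(width * fraction), int(height * fraction)
    left, top = (width - box_w) // 2, (height - box_h) // 2
    return left, top, left + box_w, top + box_h


box = central_region(1920, 1080)
print(box)
```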
  • the NLP (Natural Language Processing)
  • the step of determining the target webpage category of the webpage to be classified from all candidate webpage categories in FIG. 1 may include: determining the confidence of each piece of feature information; normalizing all the confidences; and, when the largest of the normalized confidences is greater than or equal to a first preset threshold, determining the candidate webpage category corresponding to the feature information with the largest confidence as the target webpage category to which the webpage to be classified belongs.
  • the confidence of information represents the credibility of the information, and correspondingly, the higher the credibility, the higher the accuracy of the category.
  • the normalization process maps different data into the range of 0 to 1, so as to facilitate comparison between different data.
  • the first preset threshold may be set according to actual conditions, which is not limited in this embodiment.
  • the credibility of the feature information of the webpage will be affected to varying degrees.
  • many website developers resort to cheating, adjusting webpages and adding content irrelevant to the page in order to improve their search ranking. Therefore, with the above method, a confidence is calculated for each piece of feature information, and the candidate webpage category corresponding to the feature information with the highest confidence is determined as the target webpage category to which the webpage to be classified belongs. Since the selected feature information is highly reliable, the accuracy of webpage classification can be further improved.
  • the preset category is determined as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.
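  • The selection logic can be sketched as below, under two stated assumptions: the confidences are normalized by dividing each by their sum (the text only requires mapping into the range 0 to 1), and the preset low-quality category is used when no normalized confidence reaches the first preset threshold. All names and values are hypothetical.

```python
from typing import Dict

LOW_QUALITY = "low-quality"  # hypothetical label for the preset category


def select_target_category(candidates: Dict[str, str],
                           confidences: Dict[str, float],
                           first_threshold: float) -> str:
    # Assumption: normalize by the sum so every value falls in [0, 1].
    total = sum(confidences.values())
    if total <= 0:
        return LOW_QUALITY
    normalized = {name: value / total for name, value in confidences.items()}
    best = max(normalized, key=normalized.get)
    if normalized[best] >= first_threshold:
        return candidates[best]
    # Assumption: fall back to the preset category below the threshold.
    return LOW_QUALITY


candidates = {"seo": "shopping", "sharing": "sports", "rendering": "sports"}
confidences = {"seo": 0.2, "sharing": 0.9, "rendering": 0.4}
target = select_target_category(candidates, confidences, first_threshold=0.5)
print(target)
```

  • Sum normalization is used here instead of min-max scaling because min-max would always map the largest confidence to 1, which would make the threshold comparison trivial.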
  • the higher the ranking of a webpage in the search engine, the higher its confidence.
  • when the ranking is less than a certain threshold (that is, the ranking is in the top 10), the search engine optimization information is considered credible, and the confidence of the search engine optimization information can then be set to a preset confidence.
  • the preset reliability can be set according to the actual situation.
  • the corresponding preset reliability can be further set according to the specific rank when the first ranking value is within the top preset number. For example, an association table of multiple ranking values and the preset reliability corresponding to each ranking value is set up; when the first ranking value is within the top preset number, the association table is queried and the preset reliability found there is determined as the confidence of the search engine optimization information.
  • the preset number may be 5 or 10, which is not limited in this embodiment.
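  • A sketch of the association-table lookup, assuming a preset number of 5. The table values are invented for illustration and are not taken from the disclosure.

```python
from typing import Optional

PRESET_NUMBER = 5  # the text mentions 5 or 10 as possible values

# Hypothetical association table: ranking value -> preset confidence.
RANK_TO_CONFIDENCE = {1: 1.0, 2: 0.95, 3: 0.9, 4: 0.85, 5: 0.8}


def seo_confidence_from_rank(first_ranking_value: int) -> Optional[float]:
    """Return the preset confidence for a rank within the top PRESET_NUMBER.

    Returns None for lower ranks, which the disclosure handles separately
    with auxiliary webpages and a second search engine.
    """
    if 1 <= first_ranking_value <= PRESET_NUMBER:
        return RANK_TO_CONFIDENCE[first_ranking_value]
    return None


print(seo_confidence_from_rank(3))
```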
  • the auxiliary webpage is a webpage of the same category as the webpage category corresponding to the search engine optimization information, and the auxiliary webpage is used to assist in calculating the confidence level of the search engine optimization information. It should be noted that the first search engine and the second search engine are different search engines.
  • the confidence level of the webpage sharing information is determined by: obtaining a first number of users who share the webpage to be classified to a third-party website and a second number of users who visit the webpage to be classified; and determining the confidence level of the webpage sharing information according to the first number of users and the second number of users.
  • the first number of users represents the number of users who share the webpage to be classified (shared to a third-party website)
  • the second number of users represents the number of users who visit the webpage to be classified.
  • the first number of users and the second number of users may be obtained through web crawling technology. For example, the ratio of the first number of users to the second number of users can be determined as the confidence level of the information shared by the webpage.
  • the ratio of the first number of users to the second number of users can be represented as the sharing rate of users.
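  • The sharing rate described above reduces to a single ratio. In the sketch below, the handling of a zero visitor count and the clamp to 1.0 are assumptions added to keep the function total; the counts are made-up numbers.

```python
def sharing_confidence(first_user_count: int, second_user_count: int) -> float:
    """Ratio of users who shared the page to users who visited it."""
    if second_user_count <= 0:
        return 0.0  # assumption: no visitors means no evidence
    # Assumption: clamp to 1.0 in case the crawled counts are inconsistent.
    return min(first_user_count / second_user_count, 1.0)


rate = sharing_confidence(30, 120)
print(rate)
```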
  • the data representing user behavior (the sharing rate) is used to feed back on the classification result of the webpage. Since user behavior data can reflect the authenticity of the data to a certain extent, using it to further characterize the confidence of the feature information gives the system the ability to self-correct when the classification result of a webpage is wrong, thereby improving the accuracy of webpage classification.
  • the click-through rate of the advertisement is the ratio of the actual number of clicks on the advertisement to the number of times the advertisement is displayed.
  • the bounce rate of an advertisement is the percentage of visits that leave after viewing only the entry page, out of the total visits generated. Equivalently, it is the ratio of the number of visits that leave the website after visiting a single page to the total number of visits to the website (which includes multiple webpages).
  • the webpage to be classified is a webpage under the website.
  • the exit rate of the advertisement is the percentage of page visits in which the user exits from the webpage to be classified, relative to the page visits that enter the webpage to be classified.
  • the number of page visits exiting from the webpage to be classified includes the number of times that the user jumps out when browsing a single page (webpage to be classified) during one visit, and also includes the number of times that the user jumps out from the webpage to be classified after browsing multiple pages.
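  • The three rates defined above can be computed from raw counts as sketched below. The field names and sample numbers are hypothetical, and the sketch stops at the rates themselves; it does not attempt the confidence formula that combines them.

```python
from dataclasses import dataclass


@dataclass
class AdStats:
    clicks: int              # actual clicks on the advertisement
    impressions: int         # times the advertisement was displayed
    single_page_visits: int  # visits that left after viewing one page
    total_visits: int        # all visits to the website
    page_exits: int          # page visits exiting from the webpage to be classified
    page_entries: int        # page visits entering the webpage to be classified


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of dividing by zero when there is no traffic.
    return numerator / denominator if denominator else 0.0


def ad_metrics(stats: AdStats) -> dict:
    return {
        "click_through_rate": _ratio(stats.clicks, stats.impressions),
        "bounce_rate": _ratio(stats.single_page_visits, stats.total_visits),
        "exit_rate": _ratio(stats.page_exits, stats.page_entries),
    }


metrics = ad_metrics(AdStats(50, 1000, 200, 800, 300, 600))
print(metrics)
```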
  • the preset website parameters are related to the scale and size of the website, and can be set through manual designation or supervised learning, which is not limited in this embodiment.
  • the confidence level of the webpage rendering information is determined by: extracting a preset number of pieces of rendered partial information at different positions in the rendered image result; determining, for each piece of rendered partial information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and determining the confidence level of the webpage rendering information according to the preset number and the number of pieces of rendered partial information that are related to that candidate webpage category.
  • the different positions may be different text positions and different picture positions in the rendered image result.
  • determining whether each piece of rendered partial information is related to the candidate webpage category corresponding to the webpage rendering information may be: determining whether the keyword information of each piece of rendered partial information corresponds to the keyword information of the candidate webpage category corresponding to the webpage rendering information. If it corresponds, the rendered partial information is determined to be related to that candidate webpage category; if not, it is determined to be unrelated.
  • the steps of determining whether each piece of rendered partial information is related to the candidate webpage category corresponding to the webpage rendering information are further explained below by taking the webpage category as a sports category as an example.
  • if the rendered partial information corresponds to a keyword of the candidate webpage category corresponding to the webpage rendering information, the rendered partial information is related to that candidate webpage category. It can be understood that the greater the proportion of related pieces among all the extracted rendered partial information, the higher the credibility of the webpage rendering information.
  • the ratio of the quantity of rendered partial information related to the candidate webpage category corresponding to the webpage rendering information to the preset quantity may be determined as the confidence degree of the webpage rendering information.
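  • A minimal sketch of this ratio, assuming the rendered partial information has already been reduced to text and that relatedness is tested by simple keyword overlap. The keyword set and sample strings are hypothetical.

```python
from typing import Iterable, Set


def rendering_confidence(partial_infos: Iterable[str],
                         category_keywords: Set[str],
                         preset_number: int) -> float:
    """Share of the extracted pieces that match the category keywords."""
    if preset_number <= 0:
        raise ValueError("preset_number must be positive")
    related = 0
    for text in list(partial_infos)[:preset_number]:
        words = set(text.lower().split())
        if words & category_keywords:  # a keyword of the category appears
            related += 1
    return related / preset_number


sports_keywords = {"football", "basketball", "match"}  # hypothetical keywords
pieces = ["Football match tonight", "Buy cheap shoes",
          "Basketball finals", "Weather update"]
conf = rendering_confidence(pieces, sports_keywords, preset_number=4)
print(conf)
```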
  • the step of determining the confidence of each piece of feature information may include: for every two candidate webpage categories among all the candidate webpage categories, determining the similarity between the two candidate webpage categories; and, when at least one of all the similarities is smaller than a second preset threshold, determining the confidence of each piece of feature information.
  • a similarity calculation method in the related art may be used to calculate the similarity between every two candidate web page categories, which will not be described in detail here in this embodiment.
  • the second preset threshold may be set according to actual conditions, which is not limited in this embodiment.
  • if the similarities between the candidate webpage categories predicted from the feature information of the webpage to be classified are all relatively large, it is not necessary to find the most reliable feature information through confidence calculation and to determine the candidate webpage category corresponding to that feature information as the target webpage category. Therefore, with the above method, the step of determining the confidence of each piece of feature information is performed only if at least one of the similarities is smaller than the second preset threshold, which reduces the amount of calculation and improves the efficiency of webpage classification.
  • any one candidate webpage category among all candidate webpage categories is determined as the target webpage category to which the webpage to be classified belongs.
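  • The pairwise check can be sketched as follows. The similarity measure is left open by the text, so `difflib.SequenceMatcher` on the category names is used here purely as a stand-in; the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List


def string_similarity(a: str, b: str) -> float:
    # Stand-in measure; the disclosure leaves the similarity method open.
    return SequenceMatcher(None, a, b).ratio()


def needs_confidence_step(categories: List[str],
                          second_threshold: float,
                          similarity: Callable[[str, str], float] = string_similarity) -> bool:
    """True if at least one pairwise similarity is below the threshold."""
    return any(similarity(a, b) < second_threshold
               for a, b in combinations(categories, 2))


agree = needs_confidence_step(["sports", "sports", "sports"], 0.8)
disagree = needs_confidence_step(["sports", "shopping"], 0.8)
print(agree, disagree)
```

  • When the function returns False, any one of the candidate categories can be taken as the target, and the confidence calculation is skipped.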
  • Fig. 2 is a block diagram of an apparatus for classifying webpages according to an exemplary embodiment of the present disclosure. Referring to FIG. 2, the apparatus 200 for classifying webpages includes: a first obtaining module 201, configured to obtain feature information of a webpage to be classified, the feature information including at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from the rendered image result of rendering the webpage to be classified; a prediction module 202, configured to predict a candidate webpage category of the webpage to be classified according to each piece of feature information; and a determination module 203, configured to determine the target webpage category to which the webpage to be classified belongs from all the candidate webpage categories.
  • the determination module 203 includes: a confidence determination submodule, configured to determine the confidence of each piece of feature information; a normalization submodule, configured to normalize all the confidences; and a first determining submodule, configured to, when the largest of the normalized confidences is greater than or equal to the first preset threshold, determine the candidate webpage category corresponding to the feature information with the largest confidence as the target webpage category to which the webpage to be classified belongs.
  • the determination module 203 further includes: a second determining submodule, configured to determine a preset category as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.
  • the apparatus 200 further includes: a first ranking determination module, configured to determine a first ranking value of the webpage to be classified in a first search engine according to the search engine optimization information; a preset determination module, configured to determine that the confidence level of the search engine optimization information is a preset confidence level when the first ranking value is within the top preset number; a webpage determination module, configured to determine an auxiliary webpage of the webpage to be classified when the first ranking value is beyond the top preset number, where the auxiliary webpage is a webpage of the same category as the webpage category corresponding to the search engine optimization information; a second ranking determination module, configured to determine second ranking values of the webpage to be classified and the auxiliary webpage in a second search engine; an average ranking determination module, configured to determine an average ranking value of the webpage to be classified and the auxiliary webpage according to their second ranking values in the second search engine; and a first calculation module, configured to calculate the confidence level of the search engine optimization information by using the following formula:
  • the apparatus 200 further includes: a second acquiring module, configured to acquire a first number of users who share the webpage to be classified to the third-party website and a second number of users who visit the webpage to be classified; and a second calculation module, configured to determine the confidence level of the webpage sharing information according to the first number of users and the second number of users.
  • the apparatus 200 further includes: a third acquisition module, configured to acquire the click-through rate, bounce rate, and exit rate of the advertisement corresponding to the webpage advertisement information; and a third calculation module, configured to calculate the confidence level of the webpage advertisement information by using the following formula:
  • the apparatus 200 further includes: an extraction module, configured to extract a preset number of pieces of rendered partial information at different positions in the rendered image result; a judging module, configured to determine, for each piece of rendered partial information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and a fourth calculation module, configured to determine the confidence level of the webpage rendering information according to the preset number and the number of pieces of rendered partial information related to the candidate webpage category corresponding to the webpage rendering information.
  • the confidence determination submodule is specifically configured to determine, for every two candidate webpage categories among all the candidate webpage categories, the similarity between the two candidate webpage categories, and to determine the confidence of each piece of feature information if at least one of the similarities is smaller than a second preset threshold. In a possible manner, the confidence determination submodule is further configured to determine any one of the candidate webpage categories as the target webpage category to which the webpage to be classified belongs when all the similarities are greater than or equal to the second preset threshold.
  • FIG. 3 shows a schematic structural diagram of an electronic device 300 (for example, the terminal device or server in FIG. 1) suitable for implementing embodiments of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 3 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure. As shown in FIG. 3, an electronic device 300 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data necessary for the operation of the electronic device 300.
  • the processing device 301, the ROM 302 and the RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • the following devices can be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309.
  • the communication means 309 may allow the electronic device 300 to perform wireless or wired communication with other devices to exchange data. While FIG. 3 shows electronic device 300 having various means, it should be understood that implementing or possessing all of the illustrated means is not a requirement. More or fewer means may alternatively be implemented or provided.
  • the processes described above with reference to the flowcharts can be implemented as computer software programs.
  • the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via the communication means 309, or installed from the storage means 308, or installed from the ROM 302.
  • when the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more conductors, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program codes are carried. This propagated data signal can employ a variety of forms, including but not limited to electromagnetic signals, optical signals or any suitable combination of the above.
  • A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer readable medium can be transmitted by any appropriate medium, including but not limited to: electric wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • The client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or it may exist independently without being assembled into the electronic device.
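  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: acquires feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicts a candidate webpage category of the webpage to be classified according to each piece of feature information; and determines, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.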
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical function.
  • In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation on the module itself under certain circumstances, for example, the first obtaining module may also be described as "a module for obtaining at least two Internet Protocol addresses".
  • the functions described herein above may be performed at least in part by one or more hardware logic components.
  • Exemplary types of hardware logic components that can be used include, without limitation: field programmable gate array (FPGA), application specific integrated circuit (ASIC), application specific standard product (ASSP), system on chip (SOC), complex programmable logic device (CPLD), and so on.
  • A machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • Example 1 provides a webpage classification method, including: acquiring feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified according to each piece of feature information; and determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.
  • Example 2 provides the method of Example 1, where determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs includes: determining the confidence of each piece of feature information; normalizing all the confidences; and, when the largest of all the normalized confidences is greater than or equal to a first preset threshold, determining the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.
  • Example 3 provides the method of Example 2, and the method further includes: when the largest of all the normalized confidences is less than the first preset threshold, determining a preset category as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.
  • Example 4 provides the method of Example 2, where the feature information includes the search engine optimization information, and the confidence of the search engine optimization information is determined as follows: determining, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; when the first ranking value is within the top preset number, determining that the confidence of the search engine optimization information is a preset confidence; when the first ranking value is outside the top preset number, determining an auxiliary webpage of the webpage to be classified, where the auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information; determining second ranking values of the webpage to be classified and the auxiliary webpage in a second search engine; determining the average ranking value of the webpage to be classified and the auxiliary webpage according to their second ranking values in the second search engine; and calculating the confidence of the search engine optimization information using the following formula:
  • $Con_1 = \mathrm{sigmoid}\left(\frac{M+T}{R} + \frac{K-R}{M}\right)$;
  • where $Con_1$ is the confidence of the search engine optimization information,
  • $M$ is the lowest ranking value of the webpage to be classified and the auxiliary webpage in the second search engine,
  • $T$ is the preset number,
  • $K$ is the average ranking value, and
  • $R$ is the first ranking value of the webpage to be classified.
  • Example 5 provides the method of Example 2, where the feature information includes the webpage sharing information, and the confidence of the webpage sharing information is determined as follows: acquiring a first number of users, who share the webpage to be classified to the third-party website, and a second number of users, who visit the webpage to be classified; and determining the confidence of the webpage sharing information according to the first number of users and the second number of users.
  • Example 7 provides the method of Example 2, where the feature information includes the webpage rendering information, and the confidence of the webpage rendering information is determined as follows: extracting a preset number of pieces of rendered local information from different positions in the rendered image result; determining, for each piece of rendered local information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and determining the confidence of the webpage rendering information according to the preset number and the number of pieces of rendered local information that are related to the candidate webpage category corresponding to the webpage rendering information.
  • Example 8 provides the method of any one of Examples 2 to 7, where determining the confidence of each piece of feature information includes: for every two candidate webpage categories among all the candidate webpage categories, determining the similarity between the two; and, when at least one of all the similarities is less than a second preset threshold, determining the confidence of each piece of feature information.
  • Example 9 provides the method of Example 8, and the method further includes: when all the similarities are greater than or equal to the second preset threshold, determining any one of the candidate webpage categories as the target webpage category to which the webpage to be classified belongs.
  • Example 10 provides a webpage classification apparatus, including: a first acquisition module, configured to acquire feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; a prediction module, configured to predict a candidate webpage category of the webpage to be classified according to each piece of feature information; and a determination module, configured to determine, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.
  • Example 11 provides the apparatus of Example 10, where the determination module includes: a confidence determination submodule, configured to determine the confidence of each piece of feature information; a normalization submodule, configured to normalize all the confidences; and a first determination submodule, configured to, when the largest of all the normalized confidences is greater than or equal to a first preset threshold, determine the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.
  • Example 12 provides the apparatus of Example 11, where the determination module further includes: a second determination submodule, configured to, when the largest of all the normalized confidences is less than the first preset threshold, determine a preset category as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.
  • Example 13 provides the apparatus of Example 11, and the apparatus further includes: a first ranking determination module, configured to determine, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; a preset determination module, configured to determine that the confidence of the search engine optimization information is a preset confidence when the first ranking value is within the top preset number; a webpage determination module, configured to determine an auxiliary webpage of the webpage to be classified when the first ranking value is outside the top preset number, where the auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information; a second ranking determination module, configured to determine second ranking values of the webpage to be classified and the auxiliary webpage in a second search engine; an average ranking determination module, configured to determine the average ranking value of the webpage to be classified and the auxiliary webpage according to their second ranking values in the second search engine; and a first calculation module, configured to calculate the confidence of the search engine optimization information using the formula $Con_1 = \mathrm{sigmoid}\left(\frac{M+T}{R} + \frac{K-R}{M}\right)$, where $Con_1$ is the confidence of the search engine optimization information, $M$ is the lowest ranking value of the webpage to be classified and the auxiliary webpage in the second search engine, $T$ is the preset number, $K$ is the average ranking value, and $R$ is the first ranking value of the webpage to be classified.
  • Example 14 provides the apparatus of Example 11, and the apparatus further includes: a second acquisition module, configured to acquire a first number of users, who share the webpage to be classified to the third-party website, and a second number of users, who visit the webpage to be classified; and a second calculation module, configured to determine the confidence of the webpage sharing information according to the first number of users and the second number of users.
  • Example 16 provides the apparatus of Example 11, and the apparatus further includes: an extraction module, configured to extract a preset number of pieces of rendered local information from different positions in the rendered image result; a judging module, configured to determine, for each piece of rendered local information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and a fourth calculation module, configured to determine the confidence of the webpage rendering information according to the preset number and the number of pieces of rendered local information that are related to the candidate webpage category corresponding to the webpage rendering information.
  • Example 17 provides the apparatus of any one of Examples 11 to 16, where the confidence determination submodule is specifically configured to: for every two candidate webpage categories among all the candidate webpage categories, determine the similarity between the two; and, when at least one of all the similarities is less than a second preset threshold, determine the confidence of each piece of feature information.
  • Example 18 provides the apparatus of Example 17, where the confidence determination submodule is further configured to, when all the similarities are greater than or equal to the second preset threshold, determine any one of the candidate webpage categories as the target webpage category to which the webpage to be classified belongs.
  • Example 19 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in Examples 1 to 9 are implemented.
  • Example 20 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the steps of any one of the methods described in Examples 1 to 9.
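The per-feature confidence calculations listed above (Examples 4, 5 and 7, plus the advertisement formula given in the Description) could be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not the applicant's implementation: all function and parameter names are hypothetical, "lowest ranking value" is read here as the numerically smallest rank, the garbled weight in the advertisement formula is treated as a single scalar `site_param`, and relatedness of rendered snippets is approximated by a simple keyword match.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def seo_confidence(first_rank: int, preset_number: int,
                   second_ranks: Sequence[int],
                   preset_confidence: float = 1.0) -> float:
    """Con1 = sigmoid((M + T) / R + (K - R) / M), or a preset value for top-ranked pages.

    second_ranks holds the second-search-engine ranks of the page to be
    classified and of its auxiliary pages (hypothetical input layout).
    """
    if first_rank <= preset_number:              # within the top preset number
        return preset_confidence
    m = min(second_ranks)                        # assumption: "lowest ranking value" = smallest rank
    k = sum(second_ranks) / len(second_ranks)    # average ranking value K
    return sigmoid((m + preset_number) / first_rank + (k - first_rank) / m)


def share_confidence(sharing_users: int, visiting_users: int) -> float:
    """Share rate: users who shared the page divided by users who visited it."""
    return sharing_users / visiting_users if visiting_users else 0.0


def ad_confidence(ctr: float, bounce_rate: float, exit_rate: float,
                  site_param: float) -> float:
    """Con2 = CTR / (bounce rate + site_param * exit rate)."""
    return ctr / (bounce_rate + site_param * exit_rate)


def render_confidence(snippets: Sequence[str],
                      category_keywords: Sequence[str]) -> float:
    """Fraction of rendered snippets that mention any keyword of the candidate category."""
    if not snippets:
        return 0.0
    related = sum(1 for s in snippets if any(kw in s for kw in category_keywords))
    return related / len(snippets)
```

For instance, with the bounce rate 5/10 and exit rate 7/12 from the page-a example in the Description, `ad_confidence(0.1, 0.5, 7 / 12, 1.0)` evaluates the advertisement formula for an assumed click-through rate of 0.1 and an assumed site parameter of 1.0.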

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

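The present disclosure relates to a webpage classification method and apparatus, a storage medium, and an electronic device. The method includes: acquiring feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified according to each piece of feature information; and determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs. Multiple kinds of feature information of the webpage to be classified are used to predict its candidate webpage categories, and the target webpage category is then determined from those candidates, which improves the accuracy of webpage classification.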
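The final selection step of the method summarized in the abstract (choosing the target category from the per-feature candidates) might look roughly like the Python sketch below. It follows the decision rules stated elsewhere in this document: skip the confidence calculation when all candidates are similar enough, otherwise normalize the confidences, compare the largest with a threshold, and fall back to a low-quality category. The names `choose_target_category`, `compute_confidences`, `similarity` and `LOW_QUALITY` are hypothetical, and normalizing by the sum is an assumption, since the source only says that values are mapped into the range 0 to 1.

```python
from itertools import combinations
from typing import Callable, Dict

LOW_QUALITY = "low-quality webpage"  # hypothetical label for the preset category


def choose_target_category(
    candidates: Dict[str, str],                         # feature name -> candidate category
    compute_confidences: Callable[[], Dict[str, float]],  # lazily yields feature name -> confidence
    similarity: Callable[[str, str], float],            # hypothetical category similarity in [0, 1]
    first_threshold: float,
    second_threshold: float,
) -> str:
    categories = list(candidates.values())
    pairs = list(combinations(categories, 2))

    # If every pair of candidate categories is similar enough, any candidate
    # will do and the confidence calculation is skipped entirely.
    if pairs and all(similarity(a, b) >= second_threshold for a, b in pairs):
        return categories[0]

    # Otherwise compute the confidences and normalize them so they are comparable.
    confidences = compute_confidences()
    total = sum(confidences.values())
    if total <= 0:
        return LOW_QUALITY
    normalized = {name: value / total for name, value in confidences.items()}

    # Trust the feature with the largest normalized confidence if it is large enough.
    best = max(normalized, key=lambda name: normalized[name])
    if normalized[best] >= first_threshold:
        return candidates[best]
    return LOW_QUALITY
```

Passing the confidences as a callable is a design choice of this sketch: it mirrors the stated goal of avoiding the confidence calculation when the candidate categories already agree.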

Description

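Webpage classification method and apparatus, storage medium, and electronic device

Cross-Reference to Related Applications

This application claims priority to Chinese patent application No. 202110768852.5, filed on July 7, 2021 and entitled "Webpage classification method and apparatus, storage medium, and electronic device", the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to the technical field of classification, and in particular to a webpage classification method and apparatus, a storage medium, and an electronic device.

Background

Classification is a very important and universally relevant problem. Classifying things correctly helps people understand the world and brings order to a chaotic reality.

In the related art, webpage classification information is widely used in Internet fields such as search and advertising, and how to classify webpages accurately has long been a subject of research and exploration in the computer field.

Summary

This Summary is provided to introduce, in a brief form, concepts that are described in detail in the Detailed Description below. This Summary is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.

In a first aspect, the present disclosure provides a webpage classification method, including: acquiring feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified according to each piece of feature information; and determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

In a second aspect, the present disclosure provides a webpage classification apparatus, including: a first acquisition module, configured to acquire feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; a prediction module, configured to predict a candidate webpage category of the webpage to be classified according to each piece of feature information; and a determination module, configured to determine, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of the webpage classification method of any one of the first aspect.

In a fourth aspect, the present disclosure provides an electronic device, including: a storage device on which a computer program is stored; and a processing device, configured to execute the computer program in the storage device to implement the steps of the webpage classification method of any one of the first aspect.

With the above technical solution, multiple kinds of feature information of the webpage to be classified are used to predict its candidate webpage categories, and the target webpage category is then determined from those candidates, which improves the accuracy of webpage classification. In addition, the feature information used for classification is selected from search engine optimization information, webpage sharing information, webpage advertisement information, and webpage rendering information, so the classification draws on feature information related to the webpage along different dimensions, which can fundamentally improve the accuracy of webpage classification.

Other features and advantages of the present disclosure are described in detail in the Detailed Description below.

Brief Description of the Drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment of the present disclosure.

FIG. 2 is a block diagram of a webpage classification apparatus according to an exemplary embodiment of the present disclosure.

FIG. 3 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and its variants as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.

It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".

The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.

As stated in the Background, how to classify webpages accurately has long been a subject of research and exploration in the computer field. In the related art, webpages are classified on the basis of the webpage text alone, which makes it difficult to accurately distinguish webpages of different categories.

In view of the above, the present disclosure provides a webpage classification method that classifies webpages using feature information related to the webpage along different dimensions, which can fundamentally improve the accuracy of webpage classification.

FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the webpage classification method includes:

Step 101: acquire feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified.

Step 102: predict a candidate webpage category of the webpage to be classified according to each piece of feature information.

Step 103: determine, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

In the above manner, multiple kinds of feature information of the webpage to be classified are used to predict its candidate webpage categories, and the target webpage category is then determined from those candidates, which improves the accuracy of webpage classification. In addition, the feature information used for classification is selected from search engine optimization information, webpage sharing information, webpage advertisement information, and webpage rendering information, so the classification draws on feature information related to the webpage along different dimensions, which can fundamentally improve the accuracy of webpage classification.

To help those skilled in the art better understand the webpage classification method provided by the present disclosure, the above steps are illustrated in detail below.

First, it should be noted that the category of a webpage may be sports, fiction, shopping, news, and so on. The present disclosure imposes no limitation here.

In the present disclosure, search engine optimization may refer to adjusting and optimizing a website internally and externally, on the basis of an understanding of how search engines rank results organically, so as to improve the website's organic keyword ranking in search engines, obtain more traffic, and thereby achieve the website's sales and brand-building goals. For example, the search engine optimization information may be the information in the head element, the keywords tag, and the description tag of the HyperText Markup Language from which the webpage to be classified is built.

Some webpages establish a connection with a third-party website so that a user can quickly share the webpage to the third-party website, allowing the user to share the webpage in real time without logging in to the third-party website. For example, the webpage sharing information may be information such as the URL of the webpage to be classified, its page title, and its page content.

Webpage advertisement information is advertising delivered to users. It can be placed on platforms such as webpages, applications, or other digital environments. For example, the webpage advertisement information may be information delivered in formats such as video, image, or audio.

Webpage rendering refers to the whole process by which a browser turns the HyperText Markup Language of a webpage into an image that the user can see directly. For example, the webpage rendering information may refer to information extracted from the rendered image result of rendering the webpage to be classified. It can be understood that the extracted information may be text information or image information; this embodiment imposes no limitation here.

In a possible manner, OCR (Optical Character Recognition) technology may be used to extract the webpage rendering information from the salient region of the rendered image result. The salient region refers to the central position at which the webpage to be classified is displayed on the display. It can be understood that information closely associated with a webpage tends to be located at the center of the display interface, so the webpage rendering information can be extracted at the central position shown on the display. In addition, the central position can be determined from the resolution of the display.

In a possible manner, NLP (Natural Language Processing) technology from the related art may be used to predict the candidate webpage category corresponding to each kind of feature information, which is not described in detail in this embodiment.

In a possible manner, the step in FIG. 1 of determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs may include: determining the confidence of each piece of feature information; normalizing all the confidences; and, when the largest of all the normalized confidences is greater than or equal to a first preset threshold, determining the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.

It should be noted that the confidence of a piece of information characterizes how credible that information is; correspondingly, the higher the credibility, the more accurate the category derived from it.

In the present disclosure, normalization maps different data into the range 0 to 1 so that they can be compared with one another.

In the present disclosure, the first preset threshold can be set according to the actual situation; this embodiment imposes no limitation here.

As website structures become increasingly complex, the credibility of a webpage's feature information is affected to varying degrees. For example,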
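many website developers resort to cheating when adjusting their webpages, adding content unrelated to the page in order to raise its search ranking. Therefore, in the above manner, a confidence is calculated for each kind of feature information, and the candidate webpage category corresponding to the feature information with the highest confidence is determined as the target webpage category to which the webpage to be classified belongs. Because the selected feature information is highly credible, the accuracy of webpage classification can be further improved.

In a possible manner, when the largest of all the normalized confidences is less than the first preset threshold, a preset category is determined as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.

In the above manner, when the confidence of every kind of feature information is low, the quality of the webpage to be classified is considered not to be very high, so in this case the webpage to be classified can be assigned to the low-quality webpage category. In other application fields of webpage classification, for example webpage recommendation, this avoids recommending low-quality webpages to users and thereby safeguards the quality of the recommendations.

The calculation of the confidence of each kind of feature information is explained further below.

In a possible manner, the confidence of the search engine optimization information is determined as follows: determine, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; when the first ranking value is within the top preset number, determine that the confidence of the search engine optimization information is a preset confidence; when the first ranking value is outside the top preset number, determine auxiliary webpages of the webpage to be classified, and determine second ranking values of the webpage to be classified and the auxiliary webpages in a second search engine; determine the average ranking value of the webpage to be classified and the auxiliary webpages according to their second ranking values in the second search engine; and calculate the confidence of the search engine optimization information using the following formula:

$Con_1 = \mathrm{sigmoid}\left(\frac{M+T}{R} + \frac{K-R}{M}\right)$

where $Con_1$ is the confidence of the search engine optimization information, $M$ is the lowest ranking value of the webpage to be classified and the auxiliary webpages in the second search engine, $T$ is the preset number, $K$ is the average ranking value, and $R$ is the first ranking value of the webpage to be classified.

It is worth noting that the higher a webpage ranks in a search engine on the basis of its search engine optimization information, the higher its confidence. When the ranking is below a certain threshold (that is, the ranking is within the top 10), the search engine optimization information is considered credible, and its confidence can be set to the preset confidence. In the present disclosure, the preset confidence can be set according to the actual situation.

In another possible manner, a corresponding preset confidence may further be set according to the specific position of the first ranking value within the top preset number. For example, an association table of multiple ranking values and the preset confidence corresponding to each ranking value is set up, and when the first ranking value is within the top preset number, the association table is queried to determine the preset confidence that serves as the confidence of the search engine optimization information.

For example, the preset number may be 5 or 10; this embodiment imposes no limitation here.

In the present disclosure, an auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information, and the auxiliary webpage is used to assist in calculating the confidence of the search engine optimization information.

It should be noted that the first search engine and the second search engine are different search engines.

In a possible manner, the confidence of the webpage sharing information is determined as follows: acquire a first number of users, who share the webpage to be classified to the third-party website, and a second number of users, who visit the webpage to be classified; and determine the confidence of the webpage sharing information according to the first number of users and the second number of users.

It should be noted that the first number of users represents the number of users who share the webpage to be classified (to the third-party website), and the second number of users represents the number of users who visit the webpage to be classified. Both numbers can be obtained through web crawling.

For example, the ratio of the first number of users to the second number of users may be determined as the confidence of the webpage sharing information. It can be understood that this ratio represents the users' share rate.

In the above manner, data characterizing user behavior (the share rate) is used to give feedback on the webpage classification result. Because user behavior data reflects the authenticity of the data to a certain extent, when a classification result is wrong, using data that characterizes user behavior to further characterize the confidence of the feature information gives the system the ability to correct its own classification errors, which improves the accuracy of webpage classification.

In a possible manner, the confidence of the webpage advertisement information is determined as follows: acquire the click-through rate, bounce rate, and exit rate of the advertisement corresponding to the webpage advertisement information; and calculate the confidence of the webpage advertisement information using the following formula:

$Con_2 = \frac{CTR}{\mathit{bounce\ rate} + \lambda \cdot \mathit{exit\ rate}}$

where $Con_2$ is the confidence of the webpage advertisement information, $CTR$ is the click-through rate, $\mathit{bounce\ rate}$ is the bounce rate, $\mathit{exit\ rate}$ is the exit rate, and $\lambda$ is a preset website parameter.

It should be noted that the click-through rate of an advertisement characterizes its click arrival rate, that is, the ratio of the actual number of clicks on the advertisement to the number of impressions of the advertisement.

The bounce rate of an advertisement characterizes the percentage of visits that leave immediately after reaching the entry page out of the total visits generated. It is also equivalent to the ratio of the number of times visitors leave the website after visiting a single page to the total number of visits to the website (which contains multiple webpages). It can be understood that the webpage to be classified is one webpage of that website.

The exit rate of an advertisement characterizes the percentage of page visits that exit from the webpage to be classified out of the page visits that enter the webpage to be classified. The page visits that exit from the webpage to be classified include the visits in which the user views only a single page (the webpage to be classified) and then leaves, as well as the visits in which the user leaves from the webpage to be classified after viewing several pages. The page visits that enter the webpage to be classified include the times a user views the webpage to be classified repeatedly.

For example, 10 visits arrive at page a; 5 of them leave directly from page a, 3 go on to page b, and 2 go on to page c and then leave directly; of the 3 visits that went to page b, 2 return to page a and finally leave from page a. The bounce rate of page a is then (5/10) × 100%, and the exit rate of page a is ((5+2)/(10+2)) × 100%.

The preset website parameter is related to the scale and size of the website and can be set manually or by supervised learning; this embodiment imposes no limitation here.

In a possible manner, the confidence of the webpage rendering information is determined as follows: extract a preset number of pieces of rendered local information from different positions in the rendered image result; determine, for each piece of rendered local information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and determine the confidence of the webpage rendering information according to the preset number and the number of pieces of rendered local information that are related to the candidate webpage category corresponding to the webpage rendering information.

For example, the different positions may be different pieces of text and different images in the rendered image result.

In the present disclosure, determining whether each piece of rendered local information is related to the candidate webpage category corresponding to the webpage rendering information may be done as follows: determine whether the keyword information of the piece of rendered local information corresponds to the keyword information of the candidate webpage category corresponding to the webpage rendering information; if it does, the piece of rendered local information is determined to be related to that candidate webpage category, and if it does not, the piece is determined to be unrelated to it. This step is explained further below, taking the sports category as an example.

For example, when the extracted rendered local information is sneakers and the candidate webpage category corresponding to the webpage rendering information is sports, the keywords of sports can evidently include sneakers, sportswear, sports equipment, and so on. The rendered local information therefore corresponds to the keywords of the candidate webpage category corresponding to the webpage rendering information, which indicates that the rendered local information is related to that candidate webpage category.

It can be understood that the larger the share of related pieces among all the extracted pieces of rendered local information, the more credible the webpage rendering information is. Therefore, the ratio of the number of pieces of rendered local information related to the candidate webpage category corresponding to the webpage rendering information to the preset number may be determined as the confidence of the webpage rendering information.

In a possible implementation, the step of determining the confidence of each piece of feature information may include: for every two candidate webpage categories among all the candidate webpage categories, determining the similarity between the two; and, when at least one of all the similarities is less than a second preset threshold, determining the confidence of each piece of feature information.

For example, a similarity calculation method from the related art may be used to calculate the similarity between every two candidate webpage categories, which is not described in detail in this embodiment.

It should be noted that the second preset threshold can be set according to the actual situation; this embodiment imposes no limitation here.

When the candidate webpage categories predicted from the feature information are all highly similar to one another, there is no need to calculate confidences in order to find the most credible feature information and take its candidate category as the target webpage category. Therefore, in the above manner, the step of determining the confidence of each piece of feature information is performed only when at least one of all the similarities is less than the second preset threshold, which reduces the amount of computation and improves the efficiency of webpage classification.

In a possible implementation, when all the similarities are greater than or equal to the second preset threshold, any one of the candidate webpage categories is determined as the target webpage category to which the webpage to be classified belongs.

When the candidate webpage categories predicted from the feature information are all highly similar to one another, any of the candidate webpage categories can serve as the target webpage category of the webpage to be classified. In this case, therefore, any one of the candidate webpage categories is directly determined as the target webpage category, which improves the efficiency of webpage classification.

An embodiment of the present disclosure further provides a webpage classification apparatus, which can be implemented as part or all of an electronic device through software, hardware, or a combination of the two. FIG. 2 is a block diagram of a webpage classification apparatus according to an exemplary embodiment of the present disclosure. Referring to FIG. 2, the webpage classification apparatus 200 includes: a first acquisition module 201,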
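configured to acquire feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; a prediction module 202, configured to predict a candidate webpage category of the webpage to be classified according to each piece of feature information; and a determination module 203, configured to determine, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

In a possible manner, the determination module 203 includes: a confidence determination submodule, configured to determine the confidence of each piece of feature information; a normalization submodule, configured to normalize all the confidences; and a first determination submodule, configured to, when the largest of all the normalized confidences is greater than or equal to a first preset threshold, determine the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.

In a possible manner, the determination module 203 further includes: a second determination submodule, configured to, when the largest of all the normalized confidences is less than the first preset threshold, determine a preset category as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.

In a possible manner, the apparatus 200 further includes: a first ranking determination module, configured to determine, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; a preset determination module, configured to determine that the confidence of the search engine optimization information is a preset confidence when the first ranking value is within the top preset number; a webpage determination module, configured to determine an auxiliary webpage of the webpage to be classified when the first ranking value is outside the top preset number, where the auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information; a second ranking determination module, configured to determine second ranking values of the webpage to be classified and the auxiliary webpage in a second search engine; an average ranking determination module, configured to determine the average ranking value of the webpage to be classified and the auxiliary webpage according to their second ranking values in the second search engine; and a first calculation module, configured to calculate the confidence of the search engine optimization information using the following formula:

$Con_1 = \mathrm{sigmoid}\left(\frac{M+T}{R} + \frac{K-R}{M}\right)$

where $Con_1$ is the confidence of the search engine optimization information, $M$ is the lowest ranking value of the webpage to be classified and the auxiliary webpage in the second search engine, $T$ is the preset number, $K$ is the average ranking value, and $R$ is the first ranking value of the webpage to be classified.

In a possible manner, the apparatus 200 further includes: a second acquisition module, configured to acquire a first number of users, who share the webpage to be classified to the third-party website, and a second number of users, who visit the webpage to be classified; and a second calculation module, configured to determine the confidence of the webpage sharing information according to the first number of users and the second number of users.

In a possible manner, the apparatus 200 further includes: a third acquisition module, configured to acquire the click-through rate, bounce rate, and exit rate of the advertisement corresponding to the webpage advertisement information; and a third calculation module, configured to calculate the confidence of the webpage advertisement information using the following formula: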
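$Con_2 = \frac{CTR}{\mathit{bounce\ rate} + \lambda \cdot \mathit{exit\ rate}}$

where $Con_2$ is the confidence of the webpage advertisement information, $CTR$ is the click-through rate, $\mathit{bounce\ rate}$ is the bounce rate, $\mathit{exit\ rate}$ is the exit rate, and $\lambda$ is a preset website parameter.

In a possible manner, the apparatus 200 further includes: an extraction module, configured to extract a preset number of pieces of rendered local information from different positions in the rendered image result; a judging module, configured to determine, for each piece of rendered local information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and a fourth calculation module, configured to determine the confidence of the webpage rendering information according to the preset number and the number of pieces of rendered local information that are related to the candidate webpage category corresponding to the webpage rendering information.

In a possible manner, the confidence determination submodule is specifically configured to: for every two candidate webpage categories among all the candidate webpage categories, determine the similarity between the two; and, when at least one of all the similarities is less than a second preset threshold, determine the confidence of each piece of feature information.

In a possible manner, the confidence determination submodule is further configured to, when all the similarities are greater than or equal to the second preset threshold, determine any one of the candidate webpage categories as the target webpage category to which the webpage to be classified belongs.

Reference is now made to FIG. 3, which shows a schematic structural diagram of an electronic device (for example, the terminal device or server in FIG. 1) 300 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (for example, a vehicle-mounted navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in FIG. 3 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 3, the electronic device 300 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 301, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing device 301, the ROM 302, and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

Generally, the following devices can be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), speakers, a vibrator, etc.; storage devices 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 3 shows the electronic device 300 with various devices, it should be understood that implementing or having all of the illustrated devices is not required. More or fewer devices may alternatively be implemented or provided.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 309, installed from the storage device 308, or installed from the ROM 302. When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted by any suitable medium,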
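including but not limited to: electric wire, optical cable, RF (radio frequency), and so on, or any suitable combination of the above.

In some implementations, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or it may exist separately without being assembled into the electronic device.

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: acquires feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicts a candidate webpage category of the webpage to be classified according to each piece of feature information; and determines, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments described in the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module for acquiring at least two Internet Protocol addresses".

The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: field programmable gate array (FPGA), application specific integrated circuit (ASIC), application specific standard product (ASSP), system on chip (SOC), complex programmable logic device (CPLD), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

According to one or more embodiments of the present disclosure, Example 1 provides a webpage classification method, including: acquiring feature information of a webpage to be classified, where the feature information includes at least two of search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information related to the webpage to be classified that the website corresponding to the webpage to be classified places on a platform, and webpage rendering information extracted from a rendered image result of rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified according to each piece of feature information; and determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs.

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, where determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs includes: determining the confidence of each piece of feature information; normalizing all the confidences; and, when the largest of all the normalized confidences is greater than or equal to a first preset threshold, determining the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, and the method further includes: when the largest of all the normalized confidences is less than the first preset threshold, determining a preset category as the target webpage category to which the webpage to be classified belongs, where the preset category includes a low-quality webpage category.

According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 2, where the feature information includes the search engine optimization information, and the confidence of the search engine optimization information is determined as follows: determining, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; when the first ranking value is within the top preset number, determining that the confidence of the search engine optimization information is a preset confidence; when the first ranking value is outside the top preset number, determining an auxiliary webpage of the webpage to be classified, where the auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information; determining second ranking values of the webpage to be classified and the auxiliary webpage in a second search engine; determining the average ranking value of the webpage to be classified and the auxiliary webpage according to their second ranking values in the second search engine; and calculating the confidence of the search engine optimization information using the following formula: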
Conl=sigmoid((M+T)/R+(K-R)/M); 其中, 所述 Coni为所述搜索引擎优化信息的置信 度, 所述 M为所述待分类网页和所述辅助网页在所述第二搜索引擎中的最低排名值, 所述 T 为所述预设数量, 所述 K为所述平均排名值, 所述 R为所述待分类网页的第一排名值。 根据本公开 的一个或多个实施例, 示例 5提供了示例 2的方法, 所述特征信息包括所 述网页分享信息, 通过以下方式确定所述网页分享信息的置信度: 获取从所述第三方网站 分享到所述待分类网页的第一用户数量和访问所述待分类网页的第二用户数量; 根据所述 第一用户数量和所述第二用户数量, 确定所述网页分享信息的置信度。 根据本公开 的一个或多个实施例, 示例 6提供了示例 2的方法, 所述特征信息包括所 述网页广告信息, 通过以下方式确定所述网页广告信息的置信度: 获取所述网页广告信息 对应的广告的点击通过率、 跳出率和退出率; 采用以下公式计算所述网页广告信息的置信 度: Coni = CTR / {bounce rate + A* exiterate ); 其中,所述 Co«2为所述网页广告信息的置信度, 所述 CZR为所述点击通过率, 所述 ^ 为所述跳出率, 所述
Figure imgf000015_0001
为所述退出率, 所 述 d为预设网站参数。 根据本公开 的一个或多个实施例, 示例 7提供了示例 2的方法, 所述特征信息包括所 述网页渲染信息, 通过以下方式确定所述网页渲染信息的置信度: 在所述渲染图像结果中 的不同位置处提取预设数量的渲染局部信息; 根据各所述渲染局部信息, 确定各所述渲染 局部信息是否与所述网页渲染信息对应的候选网页类别相关; 根据与所述网页渲染信息对 应的候选网页类别相关的渲染局部信息的数量与所述预设数量, 确定所述网页渲染信息的 置信度。 根据本 公开的一个或多个实施例, 示例 8提供了示例 2-7的方法, 所述确定各所述特 征信息的置信度, 包括: 针对所有所述候选网页类别中每两个所述候选网页类别, 确定该 两个所述候选网页类别之间的相似度; 在所有所述相似度中至少存在一个相似度小于第二 预设阈值的情况下, 确定各所述特征信息的置信度。 根据本公开 的一个或多个实施例, 示例 9提供了示例 8的方法, 所述方法还包括: 在 所有所述相似度均大于或等于所述第二预设阈值的情况下, 将所有所述候选网页类别中任 意一个候选网页类别确定为所述待分类网页所属目标网页类别。 根据本公开的一个或多个实施例, 示例 10提供了一种网页分类装置, 包括: 第一获取 模块, 用于获取待分类网页的特征信息, 所述特征信息包括搜索引擎优化信息、 从所述待 分类网页分享到第三方网站的网页分享信息、 与所述待分类网页对应的网站在平台投放的 与所述待分类网页相关的网页广告信息以及从渲染所述待分类网页的渲染图像结果中提取 的网页渲染信息中的至少两种; 预测模块, 用于根据各所述特征信息分别预测所述待分类 网页的候选网页类别; 确定模块, 用于从所有所述候选网页类别中确定所述待分类网页所 属的目标网页类别。 根据本公开的一个或多个实施例, 示例 11提供了示例 10的装置, 所述确定模块包括: 置信度确定子模块, 用于确定各所述特征信息的置信度; 归一化子模块, 用于对所有所述 置信度进行归一化处理; 第一确定子模块, 用于在所有经过归一化处理的置信度中最大的 置信度大于或等于第一预设阈值的情况下, 将与该最大的置信度对应的特征信息所对应的 候选网页类别确定为所述待分类网页所属的目标网页类别。 根据本公开 的一个或多个实施例, 示例 12提供了示例 11的装置, 所述确定模块还包 括: 第二确定子模块, 用于在所有经过归一化处理的置信度中最大的置信度小于所述第一 预设阈值的情况下, 将预设类别确定为所述待分类网页所属的目标网页类别, 其中, 所述 预设类别包括低质量网页类别。 根据本公开 的一个或多个实施例, 示例 13提供了示例 11的装置, 所述装置还包括: 第一排名确定模块, 用于根据所述搜索引擎优化信息, 确定所述待分类网页在第一搜索引 擎中的第一排名值; 预设确定模块, 用于在所述第一排名值位于前预设数量之内时, 确定 所述搜索引擎优化信息的置信度为预设置信度; 网页确定模块, 用于在所述第一排名值位 于前预设数量之外时, 确定所述待分类网页的辅助网页, 其中, 所述辅助网页为与所述搜 索引擎优化信息对应的网页类别所属类别相同的网页; 第二排名确定模块, 用于确定所述 待分类网页和所述辅助网页在第二搜索引擎的第二排名值; 平均排名确定模块, 用于根据 所述待分类网页和所述辅助网页在所述第二搜索引擎的第二排名值, 确定所述待分类网页 和所述辅助网页的平均排名值; 第一计算模块, 用于采用以下公式计算所述搜索引擎优化 信息的置信度: Conl=sigmoid((M+T)/R+(K-R)/M); 其中, 所述 Coni为所述搜索引擎优化 信息的置信度, 所述 M为所述待分类网页和所述辅助网页在所述第二搜索引擎中的最低排 名值, 所述 T为所述预设数量, 所述 K为所述平均排名值, 所述 R为所述待分类网页的第 一排名值。 根据本公开 的一个或多个实施例, 示例 14提供了示例 11的装置, 所述装置还包括: 第二获取模块, 用于获取从所述第三方网站分享到所述待分类网页的第一用户数量和访问 所述待分类网页的第二用户数量; 第二计算模块, 用于根据所述第一用户数量和所述第二 用户数量, 确定所述网页分享信息的置信度。 根据本公开 的一个或多个实施例, 示例 15提供了示例 11的装置, 所述装置还包括: 第三获取模块, 用于获取所述网页广告信息对应的广告的点击通过率、 跳出率和退出率; 第 三计 算模 块, 用于采用 以下公式 计算 所 述网 页广 告信 息的 置信 度: Coni = CTR/ (bounce rate +A * exiterate ); 其中, 所述 Co«2为所述网页广告信息的置信度, 所
Figure imgf000017_0001
所述退出率, 所述 d 为预设网站参数。 根据本公开 的一个或多个实施例, 示例 16提供了示例 11的装置, 所述装置还包括: 提取模块, 用于在所述渲染图像结果中的不同位置处提取预设数量的渲染局部信息; 判断 模块, 用于根据各所述渲染局部信息, 确定各所述渲染局部信息是否与所述网页渲染信息 对应的候选网页类别相关; 第四计算模块, 用于根据与所述网页渲染信息对应的候选网页 类别相关的渲染局部信息的数量与所述预设数量, 确定所述网页渲染信息的置信度。 根据本公开的一个或多个实施例, 示例 17提供了示例 11-16的装置, 所述置信度确定 子模块具体用于针对所有所述候选网页类别中每两个所述候选网页类别, 确定该两个所述 候选网页类别之间的相似度; 在所有所述相似度中至少存在一个相似度小于第二预设阈值 的情况下, 确定各所述特征信息的置信度。 根据本公开 的一个或多个实施例, 示例 18提供了示例 17的装置, 所述置信度确定子 模块还用于在所有所述相似度均大于或等于所述第二预设阈值的情况下, 将所有所述候选 网页类别中任意一个候选网页类别确定为所述待分类网页所属目标网页类别。 根据本公开的一个或多个实施例, 示例 19提供了一种计算机可读介质, 其上存储有计 算机程序, 该程序被处理装置执行时实现示例 1至 9任一项所述方法的步骤。 根据本公开的一个或多个实施例, 示例 20提供了一种电子设备, 包括: 存储装置, 其上存储有计算机程序; 处理装置 , 用于执行所述存储装置中的所述计算机程序, 以实现示例 1至 9任一项所 述方法的步骤。 以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。 本领域技术人员应 当理解, 本公开中所涉及的公开范围, 并不限于上述技术特征的特定组合而成的技术方案, 同时也应涵盖在不脱离上述公开构思的情况下, 由上述技术特征或其等同特征进行任意组 合而形成的其它技术方案。 例如上述特征与本公开中公开的 (但不限于) 具有类似功能的 技术特征进行互相替换而形成的技术方案。 此外 , 虽然采用特定次序描绘了各操作, 但是这不应当理解为要求这些操作以所示出 的特定次序或以顺序次序执行来执行。 在一定环境下, 多任务和并行处理可能是有利的。 同样地, 虽然在上面论述中包含了若干具体实现细节, 但是这些不应当被解释为对本公开 的范围的限制。 在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施 例中。 相反地, 在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子 组合的方式实现在多个实施例中。 尽管 已经采用特定于结构特征和 /或方法逻辑动作的语言描述了本主题, 但是应当理解 所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。 相反, 上面所描 述的特定特征和动作仅仅是实现权利要求书的示例形式。 关于上述实施例中的装置, 其中 各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述, 此处将不做 详细阐述说明。

Claims
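1. A webpage classification method, characterized by comprising: acquiring feature information of a webpage to be classified, the feature information comprising at least two of: search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information that relates to the webpage to be classified and is placed on a platform by the website corresponding to the webpage to be classified, and webpage rendering information extracted from a rendered image result obtained by rendering the webpage to be classified; predicting a candidate webpage category of the webpage to be classified separately from each piece of the feature information; and determining, from all the candidate webpage categories, a target webpage category to which the webpage to be classified belongs.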
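The first two steps of claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `predict_candidates`, the feature keys, and the idea of passing one predictor callable per feature type are all assumptions made here for clarity.

```python
from typing import Any, Callable, Dict

# Hypothetical type: a predictor turns one kind of feature information
# into a candidate webpage category label.
Predictor = Callable[[Any], str]


def predict_candidates(
    features: Dict[str, Any],
    predictors: Dict[str, Predictor],
) -> Dict[str, str]:
    """Predict one candidate category per kind of feature information.

    `features` maps a feature name (for example "seo", "share", "ads",
    "render"; these keys are illustrative) to its raw feature data.
    """
    # Claim 1 requires at least two kinds of feature information.
    if len(features) < 2:
        raise ValueError("at least two kinds of feature information are required")
    # Each feature is classified independently of the others.
    return {name: predictors[name](data) for name, data in features.items()}
```

The resulting mapping of feature name to candidate category is the input to the final step of the claim, which picks the target category.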
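2. The method according to claim 1, characterized in that determining, from all the candidate webpage categories, the target webpage category to which the webpage to be classified belongs comprises: determining a confidence for each piece of the feature information; normalizing all the confidences; and, when the largest of the normalized confidences is greater than or equal to a first preset threshold, determining the candidate webpage category corresponding to the feature information that has the largest confidence as the target webpage category to which the webpage to be classified belongs.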
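3. The method according to claim 2, characterized in that the method further comprises: when the largest of the normalized confidences is less than the first preset threshold, determining a preset category as the target webpage category to which the webpage to be classified belongs, wherein the preset category includes a low-quality webpage category.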
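The selection rule of claims 2 and 3 can be sketched as below. The text only says the confidences are "normalized"; dividing by their sum is an assumption made here, as are the function name, the `LOW_QUALITY` label, and the threshold values used in the usage note.

```python
from typing import Dict, Tuple

LOW_QUALITY = "low_quality"  # hypothetical label for the preset category


def select_target_category(
    candidates: Dict[str, Tuple[str, float]],
    first_threshold: float,
) -> str:
    """Pick the target category from per-feature (category, confidence) pairs."""
    total = sum(conf for _, conf in candidates.values())
    if total <= 0:
        # Nothing to normalize; fall back to the preset category.
        return LOW_QUALITY
    # Assumption: sum-to-one normalization. The source does not name a scheme.
    normalized = [(cat, conf / total) for cat, conf in candidates.values()]
    best_category, best_conf = max(normalized, key=lambda item: item[1])
    # Claim 2: accept the most confident feature's category above the threshold.
    if best_conf >= first_threshold:
        return best_category
    # Claim 3: otherwise use the preset (low-quality) category.
    return LOW_QUALITY
```

For instance, raw confidences of 0.9, 0.3, and 0.3 normalize to 0.6, 0.2, and 0.2, so with a threshold of 0.4 the category of the first feature is selected; four equal confidences normalize to 0.25 each and fall through to the preset category.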
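4. The method according to claim 2, characterized in that the feature information includes the search engine optimization information, and the confidence of the search engine optimization information is determined as follows: determining, according to the search engine optimization information, a first ranking value of the webpage to be classified in a first search engine; when the first ranking value is within the top preset number, determining the confidence of the search engine optimization information to be a preset confidence; when the first ranking value is outside the top preset number, determining auxiliary webpages for the webpage to be classified, wherein an auxiliary webpage is a webpage belonging to the same category as the webpage category corresponding to the search engine optimization information; determining second ranking values of the webpage to be classified and the auxiliary webpages in a second search engine; determining, according to those second ranking values, an average ranking value of the webpage to be classified and the auxiliary webpages; and calculating the confidence of the search engine optimization information using the following formula:
$$\mathrm{Con1} = \mathrm{sigmoid}\left(\frac{M + T}{R} + \frac{K - R}{M}\right)$$ where Con1 is the confidence of the search engine optimization information, M is the lowest ranking value of the webpage to be classified and the auxiliary webpages in the second search engine, T is the preset number, K is the average ranking value, and R is the first ranking value of the webpage to be classified.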
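The formula of claim 4 can be evaluated as in the sketch below. Two points are assumptions rather than statements from the source: "lowest ranking value" is read as the worst position, that is, the numerically largest rank, and the preset confidence defaults to 1.0. The function and parameter names are illustrative.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def seo_confidence(
    first_rank: int,
    second_ranks: Sequence[int],
    preset_number: int,
    preset_confidence: float = 1.0,  # assumed value; the source leaves it open
) -> float:
    """Con1 = sigmoid((M + T) / R + (K - R) / M).

    `second_ranks` holds the second-search-engine ranks (1 = best) of the
    page to be classified together with its auxiliary pages.
    """
    # Within the top T results of the first engine: use the preset confidence.
    if first_rank <= preset_number:
        return preset_confidence
    m = max(second_ranks)                      # M: worst rank (assumed reading)
    k = sum(second_ranks) / len(second_ranks)  # K: average rank
    r = first_rank                             # R: rank in the first engine
    return sigmoid((m + preset_number) / r + (k - r) / m)
```

With R = 20, T = 10, and second-engine ranks 10, 20, and 30, the argument is 40/20 + 0/30 = 2, giving a confidence of roughly 0.88.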
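5. The method according to claim 2, characterized in that the feature information includes the webpage sharing information, and the confidence of the webpage sharing information is determined as follows: acquiring a first number of users who reach the webpage to be classified through shares on the third-party website and a second number of users who visit the webpage to be classified; and determining the confidence of the webpage sharing information according to the first number of users and the second number of users.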
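Claim 5 does not state how the two user counts are combined. One plausible reading, shown here purely as an assumption, is the fraction of visitors who arrived through third-party shares, clamped to the range 0 to 1. The function name is hypothetical.

```python
def share_confidence(shared_users: int, visiting_users: int) -> float:
    """Assumed form: share of visitors that arrived via third-party shares.

    The source only says the confidence is determined "according to" the
    two counts; this ratio is an illustrative choice, not the claimed formula.
    """
    if visiting_users <= 0:
        return 0.0
    ratio = shared_users / visiting_users
    # Clamp in case the two counts come from inconsistent measurement windows.
    return min(1.0, max(0.0, ratio))
```

Under this reading, 30 share-driven users out of 120 visitors gives a confidence of 0.25.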
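6. The method according to claim 2, characterized in that the feature information includes the webpage advertisement information, and the confidence of the webpage advertisement information is determined as follows: acquiring the click-through rate, bounce rate, and exit rate of the advertisement corresponding to the webpage advertisement information; and calculating the confidence of the webpage advertisement information using the following formula:
$$\mathrm{Con2} = \frac{\mathrm{CTR}}{\text{bounce rate} + \lambda \cdot \text{exit rate}}$$ where Con2 is the confidence of the webpage advertisement information, CTR is the click-through rate, bounce rate is the bounce rate, exit rate is the exit rate, and λ is a preset website parameter.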
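The advertisement confidence of claim 6 is a single ratio, sketched below. The symbol for the preset website parameter is garbled in the source text, so the parameter name `site_param` and its default of 1.0 are assumptions, as is the decision to reject a non-positive denominator.

```python
def ad_confidence(
    ctr: float,
    bounce_rate: float,
    exit_rate: float,
    site_param: float = 1.0,  # preset website parameter; default is assumed
) -> float:
    """Con2 = CTR / (bounce rate + site_param * exit rate)."""
    denominator = bounce_rate + site_param * exit_rate
    if denominator <= 0:
        # The claim does not cover this case; refusing is a choice made here.
        raise ValueError("bounce rate and exit rate give a non-positive denominator")
    return ctr / denominator
```

A higher click-through rate raises the confidence, while high bounce and exit rates lower it. The result is not bounded by 1, which is one reason the normalization step of claim 2 is applied before comparing confidences.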
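7. The method according to claim 2, characterized in that the feature information includes the webpage rendering information, and the confidence of the webpage rendering information is determined as follows: extracting a preset number of pieces of local rendering information at different positions in the rendered image result; determining, for each piece of local rendering information, whether it is related to the candidate webpage category corresponding to the webpage rendering information; and determining the confidence of the webpage rendering information according to the preset number and the number of pieces of local rendering information that are related to that candidate webpage category.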
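The rendering confidence of claim 7 depends on how many sampled regions relate to the candidate category. The sketch below assumes the confidence is the simple ratio of related regions to the preset number; the claim only says it is determined "according to" the two quantities. The relatedness check is passed in as a hypothetical callable.

```python
from typing import Callable, Sequence, TypeVar

Patch = TypeVar("Patch")  # one piece of local rendering information


def render_confidence(
    patches: Sequence[Patch],
    is_related: Callable[[Patch], bool],
) -> float:
    """Assumed form: related patches divided by the preset number of patches."""
    preset_number = len(patches)
    if preset_number == 0:
        return 0.0
    related = sum(1 for patch in patches if is_related(patch))
    return related / preset_number
```

For example, if four regions are sampled and three of them show shopping-related content for a "shopping" candidate category, the confidence is 0.75.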
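8. The method according to any one of claims 2 to 7, characterized in that determining the confidence of each piece of the feature information comprises: for every two candidate webpage categories among all the candidate webpage categories, determining the similarity between the two; and, when at least one of the similarities is less than a second preset threshold, determining the confidence of each piece of the feature information.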
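9. The method according to claim 8, characterized in that the method further comprises: when all the similarities are greater than or equal to the second preset threshold, determining any one of the candidate webpage categories as the target webpage category to which the webpage to be classified belongs.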
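Claims 8 and 9 describe a shortcut: confidences are only computed when the candidate categories disagree. A minimal sketch follows. The similarity measure is not specified in the text, so it is supplied as a callable, and the exact-match similarity in the usage note is purely illustrative.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence


def agreed_category(
    categories: Sequence[str],
    similarity: Callable[[str, str], float],
    second_threshold: float,
) -> Optional[str]:
    """Return a category when all candidates agree, otherwise None.

    None means at least one pair is too dissimilar, so the per-feature
    confidences of claim 2 must be computed (claim 8).
    """
    if not categories:
        return None
    pairs = combinations(categories, 2)
    if all(similarity(a, b) >= second_threshold for a, b in pairs):
        # Claim 9: any candidate may be chosen; the first is taken here.
        return categories[0]
    return None
```

With a similarity that returns 1.0 for identical labels and 0.0 otherwise, three "news" candidates resolve directly to "news", while "news" and "shopping" return None and trigger the confidence computation.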
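10. A webpage classification apparatus, characterized by comprising: a first acquisition module configured to acquire feature information of a webpage to be classified, the feature information comprising at least two of: search engine optimization information, webpage sharing information shared from the webpage to be classified to a third-party website, webpage advertisement information that relates to the webpage to be classified and is placed on a platform by the website corresponding to the webpage to be classified, and webpage rendering information extracted from a rendered image result obtained by rendering the webpage to be classified; a prediction module configured to predict a candidate webpage category of the webpage to be classified separately from each piece of the feature information; and a determination module configured to determine, from all the candidate webpage categories, a target webpage category to which the webpage to be classified belongs.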
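11. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processing device, implements the steps of the method according to any one of claims 1 to 9.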
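12. An electronic device, characterized by comprising: a storage device storing a computer program; and a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 9.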
PCT/SG2022/050381 2021-07-07 2022-06-02 Webpage classification method and apparatus, storage medium, and electronic device WO2023282848A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110768852.5A CN113360734B (zh) 2021-07-07 2021-07-07 Webpage classification method and apparatus, storage medium, and electronic device
CN202110768852.5 2021-07-07

Publications (1)

Publication Number Publication Date
WO2023282848A1 true WO2023282848A1 (zh) 2023-01-12

Family

ID=77538791

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050381 WO2023282848A1 (zh) 2021-07-07 2022-06-02 Webpage classification method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113360734B (zh)
WO (1) WO2023282848A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599155A (zh) * 2016-12-07 2017-04-26 北京亚鸿世纪科技发展有限公司 Webpage classification method and system
US20180189614A1 (en) * 2015-06-30 2018-07-05 Beijing Qihoo Techology Company Limited Method and device for classifying webpages
CN110516710A (zh) * 2019-07-25 2019-11-29 湖南星汉数智科技有限公司 Webpage classification method and apparatus, computer apparatus, and computer-readable storage medium
CN112749360A (zh) * 2019-10-30 2021-05-04 北京国双科技有限公司 Webpage classification method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
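CN102819597B (zh) * 2012-08-13 2015-04-22 北京星网锐捷网络技术有限公司 Webpage classification method and device
CN102902794B (zh) * 2012-09-29 2016-08-03 北京奇虎科技有限公司 Webpage classification system and method
CN110781925B (zh) * 2019-09-29 2023-03-10 支付宝(杭州)信息技术有限公司 Software page classification method and apparatus, electronic device, and storage medium
CN112925987A (zh) * 2019-11-20 2021-06-08 浙江大搜车软件技术有限公司 Page sharing method and apparatus, computer device, and storage medium
CN112100530B (zh) * 2020-08-03 2023-12-22 百度在线网络技术(北京)有限公司 Webpage classification method and apparatus, electronic device, and storage medium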

Also Published As

Publication number Publication date
CN113360734A (zh) 2021-09-07
CN113360734B (zh) 2023-05-02

Similar Documents

Publication Publication Date Title
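CN107679211B (zh) Method and apparatus for pushing information
JP6161679B2 (ja) Search engine and method for implementing the same
WO2020156389A1 (zh) Information push method and apparatus
US10579675B2 (en) Content-based video recommendation
WO2017121076A1 (zh) Information push method and apparatus
US20200045122A1 (en) Method and apparatus for pushing information
CN111368185A (zh) Data display method and apparatus, storage medium, and electronic device
WO2022247562A1 (zh) Multimodal data retrieval method and apparatus, medium, and electronic device
CN113806588B (zh) Method and apparatus for searching videos
CN112287206A (zh) Information processing method and apparatus, and electronic device
WO2021190129A1 (zh) Page processing method and apparatus, electronic device, and computer-readable storage medium
CN113204691B (zh) Information display method, apparatus, device, and medium
US7917520B2 (en) Pre-cognitive delivery of in-context related information
US11294964B2 (en) Method and system for searching new media information
WO2023018379A2 (zh) Knowledge graph construction method and apparatus, storage medium, and electronic device
WO2022228390A1 (zh) Media content processing method, apparatus, device, and storage medium
CN112765424B (zh) Data query method, apparatus, device, and computer-readable medium
WO2023000782A1 (zh) Method and apparatus for acquiring video hotspots, readable medium, and electronic device
WO2023282848A1 (zh) Webpage classification method and apparatus, storage medium, and electronic device
CN112084441A (zh) Information retrieval method and apparatus, and electronic device
WO2022245280A1 (zh) Feature construction method, content display method, and related apparatus
KR20210084641A (ko) Method and apparatus for transmitting information
CN112256719A (zh) Entity query method and apparatus, readable medium, and electronic device
CN117171433A (zh) Method and apparatus for acquiring logistics information
CN112214695A (zh) Information processing method and apparatus, and electronic device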

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22838139

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18572097

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE