WO2020000717A1 - Webpage classification method, device, and computer-readable storage medium - Google Patents

Webpage classification method, device, and computer-readable storage medium

Info

Publication number
WO2020000717A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
classified
web page
seed
word segmentation
Prior art date
Application number
PCT/CN2018/107490
Other languages
English (en)
French (fr)
Inventor
吴壮伟
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020000717A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • The present application relates to the technical field of data processing, and in particular to a webpage classification method, a device, and a computer-readable storage medium.
  • This application provides a webpage classification method, a device, and a computer-readable storage medium. Its main purpose is to classify webpages automatically by combining crawler technology with a neural network model.
  • The present application provides a webpage classification method, which includes:
  • Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
  • Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
  • Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
  • Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
  • Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.
  • The present application further provides an electronic device including a memory and a processor, where the memory includes a webpage classification program that, when executed by the processor, implements the following steps:
  • Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
  • Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
  • Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
  • Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
  • Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.
  • The present application also provides a computer-readable storage medium that includes a webpage classification program which, when executed by a processor, implements any step of the webpage classification method described above.
  • The webpage classification method, device, and computer-readable storage medium proposed in the present application obtain webpage links from a seed webpage and obtain webpage source code from the webpages to be classified that the links point to. Noise filtering is then performed on the webpage source code to obtain filtered text comprising the text of the title tag, keyword tag, and description tag. The filtered text undergoes word segmentation and stop-word removal to obtain a set of available words, core keywords are extracted from that set with the TF-IDF algorithm to obtain the core keyword set of each webpage to be classified, and the average of the core keyword word vectors of each webpage is calculated and input into the webpage classification model to obtain the classification result. Because a webpage to be classified that has obtained a classification result can serve as a new seed webpage, from which webpage links and the corresponding webpage source code are obtained again, the present application makes it possible to classify a large number of webpages automatically.
  • FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device of the present application;
  • FIG. 2 is a program module diagram of the webpage classification program in FIG. 1;
  • FIG. 3 is a flowchart of a preferred embodiment of a webpage classification method of the present application.
  • The present application provides an electronic device.
  • Referring to FIG. 1, a schematic diagram of a preferred embodiment of the electronic device 1 of the present application is shown.
  • In this embodiment, the electronic device 1 crawls webpage links and webpage source code using crawler technology, preprocesses the webpage source code to obtain available words, and from them obtains the core keyword set of each webpage to be classified. It then obtains the classification result of each webpage to be classified from the average of that webpage's core keyword word vectors and a pre-trained webpage classification model.
  • The electronic device 1 may be a terminal device with storage and computing functions, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
  • In one embodiment, when the electronic device 1 is a server, the server may be one or more of a rack server, a blade server, a tower server, or a cabinet server.
  • The electronic device 1 includes a memory 11, a processor 12, a network interface 13, and a communication bus 14.
  • The memory 11 includes at least one type of readable storage medium.
  • The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory.
  • In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1.
  • In other embodiments, the readable storage medium may also be the external memory 11 of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1.
  • In this embodiment, the readable storage medium of the memory 11 is generally used to store an operating system, the webpage classification program 10, the webpage classification model, seed webpages with webpage type labels, the webpage links corresponding to webpages to be classified that have obtained classification results, and so on.
  • The memory 11 may also be used to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is configured to run program code or process data stored in the memory 11, for example, to execute the webpage classification program 10.
  • The network interface 13 may include a standard wired interface and a wireless interface (such as a WI-FI interface). It is generally used to establish a communication connection between the electronic device 1 and other electronic devices or systems.
  • The communication bus 14 is used to implement connection and communication between the aforementioned components.
  • FIG. 1 shows only the electronic device 1 with components 11-14 and the webpage classification program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • Optionally, the electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit.
  • In some embodiments, it may be an LED display, a liquid crystal display, a touch-type liquid crystal display, an organic light-emitting diode (OLED) display, or the like.
  • The display is used to display information processed in the electronic device 1 and to display a visualized user interface.
  • Optionally, the electronic device 1 further includes a touch sensor.
  • The area provided by the touch sensor for a user to perform touch operations is referred to as the touch area.
  • The touch sensor described here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • The touch sensor includes not only contact-type touch sensors but may also include proximity-type touch sensors and the like.
  • The touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can start the webpage classification program 10 by touching the touch area.
  • The area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor.
  • Optionally, the display and the touch sensor are stacked to form a touch display screen. The device detects touch operations triggered by the user based on the touch display screen.
  • The electronic device 1 may further include a radio frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described here.
  • In the above embodiment, when the processor 12 executes the webpage classification program 10 stored in the memory 11, the following steps are implemented:
  • Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
  • Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
  • Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
  • Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
  • Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.
  • In other embodiments, the webpage classification program 10 may be divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to carry out the present application.
  • A module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function.
  • Referring to FIG. 2, a program module diagram of a preferred embodiment of the webpage classification program 10 in FIG. 1 is shown. In this embodiment, the webpage classification program 10 may be divided into an obtaining module 110, a preprocessing module 120, an extraction module 130, a calculation module 140, a model training module 150, and a model application module 160.
  • The obtaining module 110 is configured to obtain webpage links and webpage source code.
  • For example, the obtaining module 110 uses a general-purpose web crawler to obtain webpage links from a seed webpage, and obtains webpage source code from the webpages to be classified that the links point to.
  • The preprocessing module 120 is configured to preprocess the webpage source code to obtain a set of available words for each webpage to be classified.
  • In this embodiment, the preprocessing module 120 first uses regular expressions to perform noise filtering on the webpage source code and obtains the text of the title tag, keyword tag, and description tag in the webpage source code, that is, the text in <title>, <keywords>, and <description>, as the filtered text of each webpage to be classified. It then performs word segmentation and stop-word removal on the filtered text to obtain the set of available words for each webpage to be classified.
  • A regular expression, also called a rule expression, is usually used to search for and replace text that matches a certain pattern or rule.
  • Each regular expression can filter out the corresponding webpage noise, including advertisements, navigation bars, JavaScript code, CSS style code, HTML tags, punctuation marks, special symbols, and so on.
  • Word segmentation is the basis of text processing. It can use one or more of a string-matching-based method, an understanding-based method, and a statistics-based method. The string-matching-based method is also called dictionary-based word segmentation. In this embodiment, the jieba (结巴) tokenizer can be used to segment the filtered text.
  • Stop words mainly consist of function words, such as conjunctions, prepositions, auxiliary words, and modal particles, and sometimes pronouns and numerals. These function words usually carry no clear meaning by themselves and only play a role within a complete sentence, for example "then", "so", "in", "of", "ah", "this", and "that".
  • In this embodiment, stop-word removal may be performed on the filtered text against a preset stop word list to obtain the set of available words for each webpage to be classified.
  • The extraction module 130 is configured to extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified.
  • In this embodiment, the term frequency-inverse document frequency (TF-IDF) algorithm and a preset corpus (for example, a Chinese Wikipedia corpus) are used, and the available words whose TF-IDF value is greater than a preset threshold are taken as core keywords, giving the core keyword set of each webpage to be classified.
  • The TF-IDF algorithm is a statistical method used to evaluate how important a word is to one document in a document set or corpus. Specifically, in this embodiment, the TF-IDF algorithm is used to evaluate the importance of each available word of a webpage to be classified to that webpage, and the available words whose TF * IDF value is greater than a preset threshold are taken as the core keywords of the webpage.
  • Term frequency (TF) represents how often an available word appears on a webpage, that is, the number of times the available word appears on a webpage to be classified divided by the total number of occurrences of all available words on that webpage.
  • Inverse document frequency (IDF) can be regarded as a weight for the importance of an available word to a webpage to be classified: the higher the frequency of an available word in a certain type of webpage and the lower its frequency across all webpages, the larger the IDF value and the more important the word is to that webpage.
  • The calculation module 140 is configured to map core keywords to word vectors and calculate the average of the core keyword word vectors of each webpage.
  • In this embodiment, the core keyword word vectors of a webpage to be classified use a distributed representation.
  • A distributed word vector is a low-dimensional real-valued vector.
  • It maps each core keyword to a point in a low-dimensional space. The representation is not unique; it only needs to provide a certain degree of discrimination.
  • The distance between distributed word vectors can be measured with the traditional Euclidean distance or with the cosine distance. For vectors represented in this way, the distance between "mic" and "microphone" is much smaller than the distance between "mic" and "sunlight".
  • The model application module 160 uses exactly this property to classify webpages.
  • The model training module 150 is configured to train a neural network model using the average of the core keyword word vectors of each pre-selected seed webpage and the corresponding webpage type label, to obtain the webpage classification model.
  • The webpage type labels may be assigned manually or automatically. For example, when the number of pre-selected seed webpages is large, seed webpages selected from financial websites may be automatically labeled as the finance type, and seed webpages selected from sports websites may be automatically labeled as the sports type. For greater precision, the pre-selected seed webpages can be given multi-level labels manually, for example labeling a seed webpage as sports-basketball-NBA, so that webpage resources can later be used more rationally, for example to subdivide webpage types.
  • It can be understood that webpage type labeling can also combine manual and automatic labeling.
  • The neural network model may be a deep learning model based on a neural network, including but not limited to a convolutional neural network, a deep neural network, a recurrent neural network, and the like.
  • After the calculation module 140 obtains the average of the core keyword word vectors of each pre-selected seed webpage, the model training module 150 uses these averages and the corresponding webpage type labels as sample data and, through training and validation, adjusts the model parameters to obtain the trained webpage classification model.
  • The model application module 160 is configured to obtain the classification result of a webpage to be classified using the average of its core keyword word vectors and the webpage classification model.
  • In this embodiment, the calculated average of the core keyword word vectors of the webpage to be classified is used as its feature vector. The webpage classification model calculates the cosine distance between this average and the average of the core keyword word vectors of each seed webpage, and the webpage type label of the seed webpage whose cosine distance is the smallest, or is less than a threshold, is taken as the webpage type of the webpage to be classified.
  • In one embodiment, the webpage classification model includes admission models for multiple webpage types. The K seed webpages closest to the average of the core keyword word vectors of the webpage to be classified are found, and the corresponding webpage categories and their probabilities are counted. In descending order of probability, the average is input into the admission model of each category in turn, which converts the multi-class webpage classification problem into multiple binary classification problems.
  • In another embodiment, the webpage classification model is trained by another program; that is, the webpage classification program 10 need not include the model training module 150.
  • In addition, this application also provides a webpage classification method.
  • Referring to FIG. 3, a flowchart of a preferred embodiment of the webpage classification method of the present application is shown.
  • When the processor 12 of the electronic device 1 executes the webpage classification program 10 stored in the memory 11, the following steps of the webpage classification method are implemented:
  • Step S300: the obtaining module 110 obtains a webpage link from a seed webpage, and obtains webpage source code from the webpage to be classified that the webpage link points to.
  • For example, the obtaining module 110 uses a general-purpose web crawler to obtain all webpage links from a preset number of pre-selected seed webpages, and obtains webpage source code from the webpages to be classified that the links point to.
  • Step S301: the preprocessing module 120 performs noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, and performs word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified.
  • The filtered text includes the text of the title tag, keyword tag, and description tag in the webpage source code.
  • The word segmentation uses one or more of a string-matching-based method, an understanding-based method, and a statistics-based method.
  • For the process of obtaining the filtered text from the webpage source code and of segmenting the filtered text and removing stop words, refer to the detailed description of the preprocessing module 120 above, which is not repeated here.
  • Step S302: the extraction module 130 extracts core keywords from the set of available words to obtain the core keyword set of each webpage to be classified.
  • For example, the extraction module 130 uses the TF-IDF algorithm together with a Chinese Wikipedia corpus to extract the available words whose TF * IDF value is greater than a preset threshold as the core keywords of the webpage to be classified.
  • Step S303: the calculation module 140 calculates the average of the core keyword word vectors of each webpage to be classified, and the model application module 160 inputs the average into the webpage classification model trained by the model training module 150 and outputs the classification result of each webpage to be classified.
  • Step S304: each webpage to be classified that has obtained a classification result is used as a new seed webpage, and steps S300-S303 above are repeated.
  • In other embodiments, the following is further included between step S303 and step S304:
  • Set the number of times step S304 is to be executed. When the set number is reached, step S304 is no longer performed and the webpage classification operation ends.
  • For ease of description, seed webpages are divided here into first-generation, second-generation, and third-generation seed webpages, and so on.
  • Similarly, webpages to be classified can be divided into first-generation webpages to be classified, second-generation webpages to be classified, and so on.
  • The seed webpages used for model training are first-generation seed webpages. The first-generation webpages to be classified are the webpages pointed to by all webpage links in the first-generation seed webpages, and they can serve as second-generation seed webpages; the same pattern continues for later generations.
  • For example, assume the number of executions of step S304 is set to 2. After the classification result of each first-generation webpage to be classified is obtained, step S304 is executed for the first time, and the first-generation webpages to be classified are used as second-generation seed webpages.
  • After steps S300-S303 are repeated, the classification result of each second-generation webpage to be classified is obtained, and step S304 is executed a second time. Once the classification result of each third-generation webpage to be classified is obtained, step S304 is no longer performed and the webpage classification operation ends.
  • In other embodiments, the webpage links corresponding to seed webpages with webpage type labels and to webpages to be classified that have obtained classification results may also be stored in a database. When an obtained webpage link already exists in the database, subsequent operations for that webpage link are terminated.
  • For example, after the obtaining module 110 obtains a webpage link from a seed webpage, it queries the database for that link. If the query succeeds, the webpage corresponding to the link already has a classification result and the operation need not be repeated; if the query fails, the subsequent steps are performed as usual.
  • The webpage classification method proposed in this embodiment obtains webpage links from a seed webpage, obtains webpage source code from the webpages to be classified that the links point to, and performs noise filtering on the webpage source code to obtain filtered text comprising the text of the title tag, keyword tag, and description tag.
  • The filtered text undergoes word segmentation and stop-word removal to obtain the set of available words.
  • Core keywords are extracted from the set of available words using the TF-IDF algorithm to obtain the core keyword set of each webpage to be classified. The average of the core keyword word vectors of each webpage to be classified is then calculated and input into the webpage classification model to obtain the classification result, after which webpage links are obtained from the classified webpages and the above steps are repeated.
  • In addition, an embodiment of the present application further provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

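The present application provides a webpage classification method, device, and storage medium. The method obtains webpage links from a seed webpage, obtains webpage source code from the webpages to be classified that the links point to, performs noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, and performs word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified. The method then extracts core keywords from the set of available words to obtain the core keyword set of each webpage to be classified, calculates the average of the core keyword word vectors of each webpage to be classified, and inputs the average into a trained webpage classification model to obtain the classification result of each webpage to be classified. With the present application, the webpages to be classified that the webpage links of a seed webpage point to can be classified automatically.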

Description

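Webpage classification method, device, and computer-readable storage medium
Priority claim
This application claims priority to Chinese patent application No. 201810694720.0, filed with the Chinese Patent Office on June 29, 2018 and entitled "Webpage classification method, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.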
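Technical field
The present application relates to the technical field of data processing, and in particular to a webpage classification method, a device, and a computer-readable storage medium.
Background
With the rapid development of Internet and Web technologies, the number of webpages on the Internet keeps growing and data resources keep getting richer, providing potential data sources for all kinds of data-intensive applications. However, the sheer volume of information makes it difficult for people to process the data, and traditional manual information processing clearly can no longer meet the demands of large-scale data processing. Against this background, automatically obtaining the effective text content of massive numbers of webpages and classifying those webpages automatically is key to organizing and managing network resources.
Summary
In view of the above, the present application provides a webpage classification method, a device, and a computer-readable storage medium, whose main purpose is to classify webpages automatically by combining crawler technology with a neural network model.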
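To achieve the above purpose, the present application provides a webpage classification method, which includes:
Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.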
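The present application further provides an electronic device including a memory and a processor, where the memory includes a webpage classification program that, when executed by the processor, implements the following steps:
Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.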
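In addition, to achieve the above purpose, the present application also provides a computer-readable storage medium that includes a webpage classification program which, when executed by a processor, implements any step of the webpage classification method described above.
The webpage classification method, device, and computer-readable storage medium proposed in the present application obtain webpage links from a seed webpage and obtain webpage source code from the webpages to be classified that the links point to. Noise filtering is then performed on the webpage source code to obtain filtered text comprising the text of the title tag, keyword tag, and description tag. The filtered text undergoes word segmentation and stop-word removal to obtain a set of available words, core keywords are extracted from that set with the TF-IDF algorithm to obtain the core keyword set of each webpage to be classified, and the average of the core keyword word vectors of each webpage is calculated and input into the webpage classification model to obtain the classification result. Because a webpage to be classified that has obtained a classification result can serve as a new seed webpage, from which webpage links and the corresponding webpage source code are obtained again, the present application makes it possible to classify a large number of webpages automatically.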
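Brief description of the drawings
FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device of the present application;
FIG. 2 is a program module diagram of the webpage classification program in FIG. 1;
FIG. 3 is a flowchart of a preferred embodiment of a webpage classification method of the present application.
The realization of the purpose, functional features, and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.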
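Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The present application provides an electronic device. Referring to FIG. 1, a schematic diagram of a preferred embodiment of the electronic device 1 of the present application is shown. In this embodiment, the electronic device 1 crawls webpage links and webpage source code using crawler technology, preprocesses the webpage source code to obtain available words, and from them obtains the core keyword set of each webpage to be classified. It then obtains the classification result of each webpage to be classified from the average of that webpage's core keyword word vectors and a pre-trained webpage classification model.
The electronic device 1 may be a terminal device with storage and computing functions, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer. In one embodiment, when the electronic device 1 is a server, the server may be one or more of a rack server, a blade server, a tower server, or a cabinet server.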
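The electronic device 1 includes a memory 11, a processor 12, a network interface 13, and a communication bus 14.
The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the readable storage medium may also be the external memory 11 of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1.
In this embodiment, the readable storage medium of the memory 11 is generally used to store an operating system, the webpage classification program 10, the webpage classification model, seed webpages with webpage type labels, the webpage links corresponding to webpages to be classified that have obtained classification results, and so on. The memory 11 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is configured to run program code or process data stored in the memory 11, for example, to execute the webpage classification program 10.
The network interface 13 may include a standard wired interface and a wireless interface (such as a WI-FI interface). It is generally used to establish a communication connection between the electronic device 1 and other electronic devices or systems.
The communication bus 14 is used to implement connection and communication between the above components.
FIG. 1 shows only the electronic device 1 with components 11-14 and the webpage classification program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.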
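Optionally, the electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch-type liquid crystal display, an organic light-emitting diode (OLED) display, or the like. The display is used to display information processed in the electronic device 1 and to display a visualized user interface.
Optionally, the electronic device 1 further includes a touch sensor. The area provided by the touch sensor for a user to perform touch operations is referred to as the touch area. The touch sensor described here may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor includes not only contact-type touch sensors but may also include proximity-type touch sensors and the like. The touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can start the webpage classification program 10 by touching the touch area.
In addition, the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen. The device detects touch operations triggered by the user based on the touch display screen.
The electronic device 1 may further include a radio frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described here.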
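In the above embodiment, when the processor 12 executes the webpage classification program 10 stored in the memory 11, the following steps are implemented:
Obtaining step: obtain a webpage link from a seed webpage, and obtain webpage source code from the webpage to be classified that the webpage link points to;
Preprocessing step: perform noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain a set of available words for each webpage to be classified;
Extraction step: extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified;
Calculation step: calculate the average of the core keyword word vectors of each webpage to be classified, and input the average into a pre-trained webpage classification model to obtain the classification result of each webpage to be classified; and
Loop step: use each webpage to be classified that has obtained a classification result as a new seed webpage, and return to the obtaining step.
For a detailed description of the above steps, refer to the description below of FIG. 2, the program module diagram of a preferred embodiment of the webpage classification program 10, and FIG. 3, the flowchart of a preferred embodiment of the webpage classification method.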
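In other embodiments, the webpage classification program 10 may be divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to carry out the present application. A module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function.
Referring to FIG. 2, a program module diagram of a preferred embodiment of the webpage classification program 10 in FIG. 1 is shown. In this embodiment, the webpage classification program 10 may be divided into an obtaining module 110, a preprocessing module 120, an extraction module 130, a calculation module 140, a model training module 150, and a model application module 160.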
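The obtaining module 110 is configured to obtain webpage links and webpage source code. For example, the obtaining module 110 uses a general-purpose web crawler to obtain webpage links from a seed webpage, and obtains webpage source code from the webpages to be classified that the links point to.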
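The link-gathering half of the obtaining module can be sketched with Python's standard library alone. This is a minimal illustration, not the applicant's crawler: the class name `LinkCollector`, the function `extract_links`, and the choice to keep only http(s) links and strip `#fragment` parts are all assumptions made for the example. Fetching the page source over the network is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the seed page URL and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(seed_url, seed_html):
    """Return the de-duplicated links found in seed_html, in document order."""
    collector = LinkCollector(seed_url)
    collector.feed(seed_html)
    return collector.links
```

Each returned URL identifies one webpage to be classified, whose source code would then be downloaded and passed to the preprocessing module.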
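The preprocessing module 120 is configured to preprocess the webpage source code to obtain a set of available words for each webpage to be classified. In this embodiment, the preprocessing module 120 first uses regular expressions to perform noise filtering on the webpage source code and obtains the text of the title tag, keyword tag, and description tag in the webpage source code, that is, the text in <title>, <keywords>, and <description>, as the filtered text of each webpage to be classified. It then performs word segmentation and stop-word removal on the filtered text to obtain the set of available words for each webpage to be classified.
A regular expression, also called a rule expression, is usually used to search for and replace text that matches a certain pattern or rule. Each regular expression can filter out the corresponding webpage noise, including advertisements, navigation bars, JavaScript code, CSS style code, HTML tags, punctuation marks, special symbols, and so on.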
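The regular-expression filtering described above might look roughly like the following sketch. It assumes that keywords and description are carried in `<meta name="keywords" content="...">` and `<meta name="description" content="...">` elements, as is common on real pages; the application itself only names the tags `<title>`, `<keywords>`, and `<description>`. The pattern names and `extract_filtered_text` are hypothetical, and the patterns are deliberately simple rather than a complete HTML parser.

```python
import re

# Script and style blocks are treated as noise and removed before extraction.
NOISE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_RE = re.compile(
    r"""<meta\s+[^>]*name\s*=\s*["'](keywords|description)["'][^>]*>""",
    re.IGNORECASE,
)
CONTENT_RE = re.compile(r"""content\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def extract_filtered_text(html):
    """Return title, keywords, and description text joined into one string."""
    html = NOISE_RE.sub(" ", html)
    parts = []
    title = TITLE_RE.search(html)
    if title:
        # Strip any markup nested inside the title element.
        parts.append(TAG_RE.sub(" ", title.group(1)))
    for meta in META_RE.finditer(html):
        content = CONTENT_RE.search(meta.group(0))
        if content:
            parts.append(content.group(2))
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(" ".join(parts).split())
```

Removing script and style blocks first matters because inline JavaScript can itself contain strings that look like a title element.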
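Word segmentation is the basis of text processing. It can use one or more of a string-matching-based method, an understanding-based method, and a statistics-based method. The string-matching-based method is also called dictionary-based word segmentation. In this embodiment, the jieba (结巴) tokenizer can be used to segment the filtered text.
Stop words mainly consist of function words, such as conjunctions, prepositions, auxiliary words, and modal particles, and sometimes pronouns and numerals. These function words usually carry no clear meaning by themselves and only play a role within a complete sentence, for example "那么", "所以", "在", "的", "啊", "这", and "那". In this embodiment, stop-word removal may be performed on the filtered text against a preset stop word list to obtain the set of available words for each webpage to be classified.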
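As an illustration of dictionary-based (string-matching) segmentation followed by stop-word removal, here is a small forward-maximum-matching sketch. It stands in for a real tokenizer such as jieba and is not how jieba works internally; the function names, the toy dictionary, and the `max_word_len` parameter are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Dictionary-based segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop whitespace-only tokens and tokens found in the preset stop word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


dictionary = {"篮球", "比赛", "篮球比赛", "新闻"}
stop_words = {"这", "是", "的"}
tokens = forward_max_match("这是篮球比赛的新闻", dictionary)
available_words = remove_stop_words(tokens, stop_words)
```

With this toy dictionary the longest match wins, so "篮球比赛" is kept as one word rather than split into "篮球" and "比赛".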
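The extraction module 130 is configured to extract core keywords from the set of available words to obtain the core keyword set of each webpage to be classified. In this embodiment, the term frequency-inverse document frequency (TF-IDF) algorithm and a preset corpus (for example, a Chinese Wikipedia corpus) are used, and the available words whose TF-IDF value is greater than a preset threshold are taken as core keywords, giving the core keyword set of each webpage to be classified.
The TF-IDF algorithm is a statistical method used to evaluate how important a word is to one document in a document set or corpus. Specifically, in this embodiment, the TF-IDF algorithm is used to evaluate the importance of each available word of a webpage to be classified to that webpage, and the available words whose TF*IDF value is greater than a preset threshold are taken as the core keywords of the webpage. Term frequency (TF) represents how often an available word appears on a webpage, that is, the number of times the available word appears on a webpage to be classified divided by the total number of occurrences of all available words on that webpage. Inverse document frequency (IDF) can be regarded as a weight for the importance of an available word to a webpage to be classified: the higher the frequency of an available word in a certain type of webpage and the lower its frequency across all webpages, the larger the IDF value and the more important the word is to that webpage.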
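The threshold-based keyword selection can be sketched as follows. TF follows the definition in the text (occurrences of the word divided by all word occurrences on the page). The application does not give an IDF formula, so the smoothed form `log((1 + N) / (1 + df)) + 1` used here is an assumption, as are the function name `core_keywords` and the toy corpus.

```python
import math
from collections import Counter


def core_keywords(page_words, corpus_docs, threshold):
    """Return {word: tf*idf} for the page words whose score exceeds the threshold.

    page_words:  list of available words of one webpage to be classified
    corpus_docs: list of word lists, standing in for the preset corpus
    """
    n_docs = len(corpus_docs)
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(doc))  # count each word once per corpus document

    counts = Counter(page_words)
    total = len(page_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Assumed smoothed IDF; words unseen in the corpus get the largest weight.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        scores[word] = tf * idf
    return {w: s for w, s in scores.items() if s > threshold}


corpus = [["news", "today"], ["news", "sports"], ["news", "finance"]]
page = ["nba", "nba", "news", "finals"]
keywords = core_keywords(page, corpus, threshold=0.5)
```

In this toy run "news" appears in every corpus document, so its score stays low and it is filtered out, while "nba" and "finals" pass the threshold.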
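The calculation module 140 is configured to map core keywords to word vectors and calculate the average of the core keyword word vectors of each webpage. In this embodiment, the core keyword word vectors of a webpage to be classified use a distributed representation. A distributed word vector is a low-dimensional real-valued vector that maps each core keyword to a point in a low-dimensional space. The representation is not unique; it only needs to provide a certain degree of discrimination. The distance between distributed word vectors can be measured with the traditional Euclidean distance or with the cosine distance. For vectors represented in this way, the distance between "麦克" (mic) and "话筒" (microphone) is much smaller than the distance between "麦克" and "阳光" (sunlight). The model application module 160 uses exactly this property to classify webpages.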
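A minimal sketch of the averaging and distance computations follows. The two-dimensional embedding table is invented purely for illustration; in practice the vectors would come from a trained word embedding model, which the application does not name. Returning 1.0 for a zero vector is also an assumption of the sketch.

```python
import math


def average_vector(keywords, embeddings):
    """Average the word vectors of the keywords that have a known embedding."""
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        raise ValueError("no keyword has a known word vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, made up so that related words point in similar directions.
embeddings = {
    "mic": [0.9, 0.1],
    "microphone": [0.8, 0.2],
    "sunlight": [0.1, 0.9],
}
page_vector = average_vector(["mic", "microphone", "unknown-word"], embeddings)
```

With these toy vectors, `cosine_distance(embeddings["mic"], embeddings["microphone"])` is smaller than `cosine_distance(embeddings["mic"], embeddings["sunlight"])`, mirroring the example in the text.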
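The model training module 150 is configured to train a neural network model using the average of the core keyword word vectors of each pre-selected seed webpage and the corresponding webpage type label, to obtain the webpage classification model. The webpage type labels may be assigned manually or automatically. For example, when the number of pre-selected seed webpages is large, seed webpages selected from financial websites may be automatically labeled as the finance type, and seed webpages selected from sports websites may be automatically labeled as the sports type. For greater precision, the pre-selected seed webpages can be given multi-level labels manually, for example labeling a seed webpage as sports-basketball-NBA, so that webpage resources can later be used more rationally, for example to subdivide webpage types. It can be understood that webpage type labeling can also combine manual and automatic labeling.
The neural network model may be a deep learning model based on a neural network, including but not limited to a convolutional neural network, a deep neural network, a recurrent neural network, and the like. After the calculation module 140 obtains the average of the core keyword word vectors of each pre-selected seed webpage, the model training module 150 uses these averages and the corresponding webpage type labels as sample data and, through training and validation, adjusts the model parameters to obtain the trained webpage classification model.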
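The model application module 160 is configured to obtain the classification result of a webpage to be classified using the average of its core keyword word vectors and the webpage classification model. In this embodiment, the calculated average of the core keyword word vectors of the webpage to be classified is used as its feature vector. The webpage classification model calculates the cosine distance between this average and the average of the core keyword word vectors of each seed webpage, and the webpage type label of the seed webpage whose cosine distance is the smallest, or is less than a threshold, is taken as the webpage type of the webpage to be classified.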
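The nearest-seed rule in this embodiment can be sketched as below. How the "smallest" and "less than a threshold" criteria combine is not spelled out in the text, so this sketch assumes one reading: pick the nearest seed, and optionally reject it when its distance exceeds `max_distance`. The names `classify_by_nearest_seed` and `max_distance` are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def classify_by_nearest_seed(page_vector, seeds, max_distance=None):
    """Return the label of the closest seed, or None if it is farther than max_distance.

    seeds: list of (label, averaged_keyword_vector) pairs for labeled seed webpages
    """
    best_label, best_distance = None, float("inf")
    for label, seed_vector in seeds:
        distance = cosine_distance(page_vector, seed_vector)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if max_distance is not None and best_distance > max_distance:
        return None
    return best_label


seeds = [("finance", [1.0, 0.0]), ("sports", [0.0, 1.0])]
label = classify_by_nearest_seed([0.2, 0.9], seeds)
```

Returning `None` for pages that are far from every seed leaves room for a fallback, such as manual review; that fallback is not part of the described method.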
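In one embodiment, the webpage classification model includes admission models for multiple webpage types. The K seed webpages nearest to the average core keyword word vector of the webpage to be classified are found, and the corresponding webpage categories and their probabilities are tallied. In descending order of probability, the average core keyword word vector of the webpage to be classified is input into the admission model of each category in turn, which converts the multi-class webpage classification problem into multiple binary classification problems.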
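One way to read this embodiment is a K-nearest-neighbour vote followed by per-category yes/no checks. The sketch below follows that reading; the admission models are represented by hypothetical callables that return `True` or `False`, and the thresholds inside them are invented for the example.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def rank_categories(page_vector, seeds, k):
    """Return (category, probability) pairs from the K nearest seeds, most likely first."""
    nearest = sorted(seeds, key=lambda seed: cosine_distance(page_vector, seed[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return [(label, count / len(nearest)) for label, count in votes.most_common()]

def classify_with_admission(page_vector, seeds, admission_models, k=3):
    """Try each category's binary admission model in descending order of probability."""
    for label, _probability in rank_categories(page_vector, seeds, k):
        if admission_models[label](page_vector):
            return label
    return None   # no category admitted the page

SEEDS = [([0.9, 0.1], "finance"), ([0.8, 0.2], "finance"),
         ([0.1, 0.9], "sports"), ([0.2, 0.8], "sports")]
# Hypothetical admission models: one binary decision per category.
ADMISSION = {
    "finance": lambda v: v[0] > 0.6,
    "sports": lambda v: v[1] > 0.6,
}

ranking = rank_categories([0.7, 0.3], SEEDS, k=3)
admitted = classify_with_admission([0.7, 0.3], SEEDS, ADMISSION)
```

Ordering the binary checks by vote share means the most plausible category is tested first, so most pages need only one admission-model call.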
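In another embodiment, the webpage classification model is trained by another program; that is, the webpage classification program 10 need not include the model training module 150.
In addition, the present application provides a webpage classification method. FIG. 3 is a flowchart of a preferred embodiment of the webpage classification method of the present application. When the processor 12 of the electronic device 1 executes the webpage classification program 10 stored in the memory, the following steps of the webpage classification method are implemented:
Step S300: the acquisition module 110 obtains webpage links from seed webpages and obtains webpage source code from the webpages to be classified that those links point to. For example, the acquisition module 110 uses a general-purpose web crawler to obtain all webpage links from a preset number of preselected seed webpages, and obtains webpage source code from the webpages to be classified that the links point to.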
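The link-gathering half of step S300 can be sketched with the standard-library HTML parser. Fetching the seed page itself is left out so the example runs offline; the class and function names, the sample markup, and the `example.com` URLs are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)   # resolve relative links
                # Skip non-web schemes such as javascript: and drop repeats.
                if url.startswith(("http://", "https://")) and url not in self.links:
                    self.links.append(url)

def extract_links(seed_html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(seed_html)
    return collector.links

SEED_HTML = (
    '<a href="/nba/1.html">game</a>'
    '<a href="https://example.org/x">other site</a>'
    '<a href="javascript:void(0)">menu</a>'
    '<a href="/nba/1.html">same game again</a>'
)
links = extract_links(SEED_HTML, "https://example.com/sports/")
```

Each returned URL identifies a webpage to be classified, whose source code would then be downloaded and passed to the preprocessing step.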
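Step S301: the preprocessing module 120 filters noise from the webpage source code to obtain the filtered text of each webpage to be classified, and performs word segmentation and stop-word removal on the filtered text to obtain the available word set of each webpage to be classified. The filtered text includes the text in the title tag, keywords tag, and description tag of the webpage source code. The word segmentation uses one or more of a string-matching-based method, an understanding-based method, and a statistics-based method. For how the filtered text is obtained from the webpage source code and how it is segmented and stripped of stop words, see the detailed description of the preprocessing module 120 above; the details are not repeated here.
Step S302: the extraction module 130 extracts core keywords from the available word set to obtain the core keyword set of each webpage to be classified. For example, the extraction module 130 uses the TF-IDF algorithm together with the Chinese Wikipedia corpus to extract the available words whose TF*IDF value exceeds a preset threshold as the core keywords of the webpage to be classified.
Step S303: the calculation module 140 computes the average of the core keyword word vectors of each webpage to be classified, and the model application module 160 inputs the average into the webpage classification model trained by the model training module 150 and outputs the classification result of each webpage to be classified.
Step S304: the webpages to be classified for which classification results have been obtained are used as new seed webpages, and steps S300-S303 are repeated.
In other embodiments, the method further includes, between step S303 and step S304:
setting the number of times step S304 is to be executed; when the set number is reached, step S304 is no longer executed and the webpage classification operation ends.
For ease of description, seed webpages are divided here into first-generation seed webpages, second-generation seed webpages, third-generation seed webpages, and so on. Similarly, webpages to be classified are divided into first-generation webpages to be classified, second-generation webpages to be classified, and so on. The seed webpages used for model training are first-generation seed webpages. The first-generation webpages to be classified are the webpages pointed to by all webpage links in the first-generation seed webpages, and they can serve as second-generation seed webpages, and so forth.
For example, suppose the number of executions of step S304 is set to 2. After the classification result of each first-generation webpage to be classified is obtained, step S304 is executed for the first time: the first-generation webpages to be classified become second-generation seed webpages, and steps S300-S303 are repeated to obtain the classification result of each second-generation webpage to be classified. Step S304 is then executed a second time, and once the classification result of each third-generation webpage to be classified is obtained, step S304 is no longer executed and the webpage classification operation ends.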
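The generation-by-generation loop with a capped number of S304 executions can be sketched as follows. The link graph, the `get_links` and `classify` callables, and the `seen` set that prevents a page from being classified twice are all assumptions added for the example; steps S301-S303 are collapsed into the single `classify` call.

```python
def crawl_generations(first_seeds, get_links, classify, loop_count):
    """Classify pages generation by generation, executing step S304 at most loop_count times.

    Returns {url: (generation, page_type)} for every classified page.
    """
    results = {}
    seeds = list(first_seeds)
    seen = set(seeds)
    generation = 1
    while True:
        # Step S300: gather the pages linked from the current seeds.
        pages = []
        for seed in seeds:
            for url in get_links(seed):
                if url not in seen:
                    seen.add(url)
                    pages.append(url)
        # Steps S301-S303, represented here by one classify() call per page.
        for url in pages:
            results[url] = (generation, classify(url))
        if not pages or generation > loop_count:
            break
        seeds = pages          # step S304: classified pages become the new seeds
        generation += 1
    return results

# Hypothetical in-memory link graph standing in for real webpages.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": ["f"]}

results = crawl_generations(
    ["seed"],
    get_links=lambda url: LINKS.get(url, []),
    classify=lambda url: "sports",
    loop_count=2,
)
```

With `loop_count=2` the run classifies three generations of pages, matching the worked example in the text, and stops before reaching page `f`.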
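In other embodiments, the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained may also be stored in a database. When an obtained webpage link already exists in the database, subsequent operations on that link are terminated. For example, after the acquisition module 110 obtains a webpage link from a seed webpage, it looks the link up in the database. If the lookup succeeds, the webpage for that link already has a classification result and need not be processed again; if the lookup fails, the subsequent steps proceed as normal.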
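The lookup-before-processing check can be sketched with Python's built-in `sqlite3` module. The text does not name a database engine or schema, so the in-memory SQLite store, the `classified` table, and the helper names below are assumptions.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the store of links that already have a type."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classified ("
        "url TEXT PRIMARY KEY, page_type TEXT NOT NULL)"
    )
    return conn

def already_classified(conn, url):
    row = conn.execute("SELECT 1 FROM classified WHERE url = ?", (url,)).fetchone()
    return row is not None

def record(conn, url, page_type):
    conn.execute(
        "INSERT OR IGNORE INTO classified (url, page_type) VALUES (?, ?)",
        (url, page_type),
    )
    conn.commit()

def process_link(conn, url, classify):
    """Skip links that are already stored; otherwise classify and store them."""
    if already_classified(conn, url):
        return None                      # lookup succeeded: nothing more to do
    page_type = classify(url)            # lookup failed: run the normal steps
    record(conn, url, page_type)
    return page_type

conn = open_store()
record(conn, "https://example.com/seed", "sports")            # labeled seed webpage
skipped = process_link(conn, "https://example.com/seed", lambda url: "finance")
fresh = process_link(conn, "https://example.com/new", lambda url: "finance")
```

Using the URL as the primary key makes the existence check a single indexed lookup and keeps repeated inserts harmless.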
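The webpage classification method of this embodiment obtains webpage links from seed webpages; obtains webpage source code from the webpages to be classified that the links point to; filters noise from the source code to obtain filtered text comprising the text in the title, keywords, and description tags; performs word segmentation and stop-word removal on the filtered text to obtain an available word set; uses the TF-IDF algorithm to extract core keywords from the available word set, giving the core keyword set of each webpage to be classified; computes the average of the core keyword word vectors of each webpage to be classified and inputs it into the webpage classification model to obtain the classification result; and then obtains webpage links from the classified webpages and repeats the above steps. A web crawler enables deep crawling of webpage source code and webpage links to collect a large amount of webpage data, and a trained deep learning model enables automatic webpage classification. The present application can therefore classify large numbers of webpages automatically.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as that of the webpage classification method and the electronic device 1 described above; see the relevant description, which is not repeated here.
It should be noted that, in this document, the terms “include”, “comprise”, and any other variants are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article, or method. In addition, the technical solutions of the embodiments may be combined with one another, provided that a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, that combination should be deemed not to exist and falls outside the scope of protection claimed by the present application.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above and includes instructions that cause a server to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.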

Claims (20)

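  1. A webpage classification method applied to an electronic device, characterized in that the method comprises:
    an obtaining step: obtaining a webpage link from a seed webpage, and obtaining webpage source code from a webpage to be classified pointed to by the webpage link;
    a preprocessing step: performing noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, and performing word segmentation and stop-word removal on the filtered text to obtain an available word set for each webpage to be classified;
    an extraction step: extracting core keywords from the available word set to obtain a core keyword set for each webpage to be classified;
    a calculation step: calculating the average of the core keyword word vectors of each webpage to be classified, and inputting the average into a pre-trained webpage classification model to obtain a classification result for each webpage to be classified; and
    a loop step: using the webpages to be classified for which classification results have been obtained as new seed webpages, and returning to the obtaining step.
  2. The webpage classification method according to claim 1, characterized in that the training of the webpage classification model comprises:
    labeling a preset number of preselected seed webpages with webpage types;
    preprocessing the webpage source code of the seed webpages to obtain an available word set for each seed webpage;
    extracting core keywords from the available word set to obtain a core keyword set for each seed webpage;
    calculating the average of the core keyword word vectors of each seed webpage; and
    training a neural network model using the average of the core keyword word vectors of each seed webpage and the corresponding webpage type label, to obtain the webpage classification model.
  3. The webpage classification method according to claim 1, characterized in that the filtered text comprises the text in the title tag, keywords tag, and description tag of the webpage source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method, and a statistics-based segmentation method.
  4. The webpage classification method according to claim 2, characterized in that the filtered text comprises the text in the title tag, keywords tag, and description tag of the webpage source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method, and a statistics-based segmentation method.
  5. The webpage classification method according to claim 2, characterized in that the method further comprises:
    setting the number of executions of the loop step, and terminating the loop step when the set requirement is met.
  6. The webpage classification method according to claim 1, characterized in that the method further comprises:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.
  7. The webpage classification method according to claim 2, characterized in that the method further comprises:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.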
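  8. An electronic device comprising a memory and a processor, characterized in that the memory includes a webpage classification program which, when executed by the processor, implements the following steps:
    an obtaining step: obtaining a webpage link from a seed webpage, and obtaining webpage source code from a webpage to be classified pointed to by the webpage link;
    a preprocessing step: performing noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, and performing word segmentation and stop-word removal on the filtered text to obtain an available word set for each webpage to be classified;
    an extraction step: extracting core keywords from the available word set to obtain a core keyword set for each webpage to be classified;
    a calculation step: calculating the average of the core keyword word vectors of each webpage to be classified, and inputting the average into a pre-trained webpage classification model to obtain a classification result for each webpage to be classified; and
    a loop step: using the webpages to be classified for which classification results have been obtained as new seed webpages, and returning to the obtaining step.
  9. The electronic device according to claim 8, characterized in that the training of the webpage classification model comprises:
    labeling a preset number of preselected seed webpages with webpage types;
    preprocessing the webpage source code of the seed webpages to obtain an available word set for each seed webpage;
    extracting core keywords from the available word set to obtain a core keyword set for each seed webpage;
    calculating the average of the core keyword word vectors of each seed webpage; and
    training a neural network model using the average of the core keyword word vectors of each seed webpage and the corresponding webpage type label, to obtain the webpage classification model.
  10. The electronic device according to claim 8, characterized in that the filtered text comprises the text in the title tag, keywords tag, and description tag of the webpage source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method, and a statistics-based segmentation method.
  11. The electronic device according to claim 9, characterized in that the filtered text comprises the text in the title tag, keywords tag, and description tag of the webpage source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method, and a statistics-based segmentation method.
  12. The electronic device according to claim 8, characterized in that the webpage classification program, when executed by the processor, implements the following step:
    setting the number of executions of the loop step, and terminating the loop step when the set requirement is met.
  13. The electronic device according to claim 8, characterized in that the webpage classification program, when executed by the processor, further implements the following steps:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.
  14. The electronic device according to claim 9, characterized in that the webpage classification program, when executed by the processor, further implements the following steps:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.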
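  15. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a webpage classification program which, when executed by a processor, implements the following steps:
    an obtaining step: obtaining a webpage link from a seed webpage, and obtaining webpage source code from a webpage to be classified pointed to by the webpage link;
    a preprocessing step: performing noise filtering on the webpage source code to obtain the filtered text of each webpage to be classified, and performing word segmentation and stop-word removal on the filtered text to obtain an available word set for each webpage to be classified;
    an extraction step: extracting core keywords from the available word set to obtain a core keyword set for each webpage to be classified;
    a calculation step: calculating the average of the core keyword word vectors of each webpage to be classified, and inputting the average into a pre-trained webpage classification model to obtain a classification result for each webpage to be classified; and
    a loop step: using the webpages to be classified for which classification results have been obtained as new seed webpages, and returning to the obtaining step.
  16. The computer-readable storage medium according to claim 15, characterized in that the training of the webpage classification model comprises:
    labeling a preset number of preselected seed webpages with webpage types;
    preprocessing the webpage source code of the seed webpages to obtain an available word set for each seed webpage;
    extracting core keywords from the available word set to obtain a core keyword set for each seed webpage;
    calculating the average of the core keyword word vectors of each seed webpage; and
    training a neural network model using the average of the core keyword word vectors of each seed webpage and the corresponding webpage type label, to obtain the webpage classification model.
  17. The computer-readable storage medium according to claim 15, characterized in that the filtered text comprises the text in the title tag, keywords tag, and description tag of the webpage source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method, and a statistics-based segmentation method.
  18. The computer-readable storage medium according to claim 15, characterized in that the webpage classification program, when executed by the processor, implements the following step:
    setting the number of executions of the loop step, and terminating the loop step when the set requirement is met.
  19. The computer-readable storage medium according to claim 15, characterized in that the webpage classification program, when executed by the processor, further implements the following steps:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.
  20. The computer-readable storage medium according to claim 16, characterized in that the webpage classification program, when executed by the processor, further implements the following steps:
    storing in a database the webpage links of the seed webpages that have webpage type labels and of the webpages to be classified for which classification results have been obtained;
    when an obtained webpage link already exists in the database, terminating subsequent operations on that webpage link.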
PCT/CN2018/107490 2018-06-29 2018-09-26 Webpage classification method, device, and computer-readable storage medium WO2020000717A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810694720.0 2018-06-29
CN201810694720.0A CN109062972A (zh) 2018-06-29 2018-06-29 Webpage classification method, device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020000717A1 true WO2020000717A1 (zh) 2020-01-02

Family

ID=64817979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/107490 WO2020000717A1 (zh) 2018-06-29 2018-09-26 Webpage classification method, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN109062972A (zh)
WO (1) WO2020000717A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
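CN109783175B (zh) * 2019-01-16 2023-03-31 平安普惠企业管理有限公司 Application icon management method and device, readable storage medium, and terminal device
CN111797299A (zh) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Model training method, webpage classification method, device, storage medium, and equipment
CN110191096B (zh) * 2019-04-30 2023-05-09 安徽工业大学 Word-vector webpage intrusion detection method based on semantic analysis
CN110545355B (zh) * 2019-07-31 2021-04-02 努比亚技术有限公司 Intelligent reminder method, terminal, and computer-readable storage medium
CN110427628A (zh) * 2019-08-02 2019-11-08 杭州安恒信息技术股份有限公司 Web asset classification and detection method and device based on a neural network algorithm
CN110750493B (zh) * 2019-09-03 2022-08-09 平安科技(深圳)有限公司 Legal text archiving method and device, readable storage medium, and terminal device
CN110705290B (zh) * 2019-09-29 2023-06-23 新华三信息安全技术有限公司 Webpage classification method and device
CN111382385B (zh) * 2020-02-21 2024-04-12 奇安信科技集团股份有限公司 Method and device for classifying the industry to which a webpage belongs
CN111931040B (zh) * 2020-06-30 2024-01-12 深圳市世强元件网络有限公司 Method for recommending service entries of internal service entities of a network platform
CN112256987A (zh) * 2020-10-19 2021-01-22 中国互联网金融协会 Method, device, equipment, and storage medium for monitoring overseas stock trading websites
CN112860726A (zh) * 2021-02-07 2021-05-28 天云融创数据科技(北京)有限公司 Method and device for training a structured query statement classification model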

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
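CN101794311A (zh) * 2010-03-05 2010-08-04 南京邮电大学 Automatic Chinese webpage classification method based on fuzzy data mining
CN103226578A (zh) * 2013-04-02 2013-07-31 浙江大学 Method for website identification and fine-grained webpage classification in the medical domain
CN104035968A (zh) * 2014-05-20 2014-09-10 微梦创科网络科技(中国)有限公司 Method and device for constructing a training corpus based on social networks
CN106126512A (zh) * 2016-04-13 2016-11-16 北京天融信网络安全技术有限公司 Webpage classification method and device using ensemble learning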

Also Published As

Publication number Publication date
CN109062972A (zh) 2018-12-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18924108

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18924108

Country of ref document: EP

Kind code of ref document: A1