CN110633446B - Webpage column recognition model training method, using method, device and storage medium - Google Patents

Info

Publication number
CN110633446B
Authority
CN
China
Prior art keywords
url
web page
training
column
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911161584.XA
Other languages
Chinese (zh)
Other versions
CN110633446A (en
Inventor
耿雪芹
朱露
王晓斌
黄三伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Ant Software Ltd By Share Ltd
Original Assignee
Hunan Ant Software Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Ant Software Ltd By Share Ltd filed Critical Hunan Ant Software Ltd By Share Ltd
Priority to CN201911161584.XA priority Critical patent/CN110633446B/en
Publication of CN110633446A publication Critical patent/CN110633446A/en
Application granted granted Critical
Publication of CN110633446B publication Critical patent/CN110633446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for training a webpage column recognition model, which comprises the following steps: selecting a first number of web page column urls, a second number of web page content urls and a title length corresponding to each url; processing a webpage column url to obtain a corresponding first character string characteristic; processing the webpage content url to obtain a corresponding second character string characteristic; normalizing the title length to obtain title characteristics; dividing the character string characteristics and the title characteristics into a training set and a verification set; inputting the training set into the selected machine learning model for training to obtain a trained machine learning model; testing the trained machine learning model by adopting the verification set to obtain evaluation parameters; and when the evaluation parameters reach the set threshold value, determining the model as a qualified model. The invention also discloses a device for training the webpage column recognition model and a storage medium, and solves the problem that the webpage column cannot be effectively recognized in the prior art.

Description

Webpage column recognition model training method, using method, device and storage medium
Technical Field
The invention relates to the field of updating of hot topics, in particular to a method and a device for training a webpage column recognition model and a storage medium.
Background
The structure of a website comprises a plurality of navigation pages such as first-level columns, second-level columns, third-level columns and the like besides home pages and webpage contents. According to the hierarchical division, the web pages of a general website can be divided into three types: home page, column page, content page. The website home page is an entry webpage of a website, guides an internet user to browse contents of other parts of the website and is contents with a directory property; the column page is an aggregation page established according to the category of the published information, and guides the user to quickly find the interested information; the specific content page may be a detailed introduction page of an article, a commodity, etc., and the content can be displayed on one page.
At present, in a data acquisition system, web page columns are mostly identified through manual extraction. Limited by human labor, generally not all column urls are extracted. The existing solutions are to select the column urls of a website manually, or to identify column urls by rule matching on the path features of a web page's url. However, websites do not divide url paths clearly or uniformly, and url path division does not correspond completely to web page columns.
Therefore, the problem in the prior art is that a column url that does not match the set rules cannot be identified, while a content url that happens to match the set rules is mistakenly taken as a column url. This method therefore cannot effectively identify web page columns.
Disclosure of Invention
In view of the above, the present invention mainly aims to provide a method and an apparatus for training a web page column recognition model, and a storage medium, and aims to solve the problem in the prior art that a web page column cannot be effectively recognized.
In order to achieve the purpose, the technical scheme of the invention is realized as follows: the invention provides a method for training a webpage column recognition model, which comprises the following steps:
selecting a first number of web page columns url, a second number of web page contents url and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range;
processing a webpage column url to obtain a corresponding first character string characteristic;
processing the webpage content url to obtain a corresponding second character string characteristic;
carrying out normalization processing on the title length to obtain a title length characteristic;
dividing the first character string feature, the second character string feature and the title length feature into a training set and a verification set;
inputting the training set into the selected machine learning model for training to obtain a trained machine learning model;
testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall;
and determining the model as a qualified model when the evaluation parameter reaches a set threshold value.
In the above scheme, the step of selecting the first number of web page columns url, the second number of web page contents url, and the title length corresponding to each url includes:
capturing a webpage column url, a webpage content url and a title length corresponding to each url;
carrying out character string recognition on data of each webpage column url and each webpage content url to obtain a plurality of groups of data, wherein each group of data is url containing the same character string format;
from the plurality of sets of data, a first number of web page columns url and a second number of web page content url are selected.
In the above scheme, after the step of performing character string recognition on the data of each website to obtain a plurality of sets of data, the method further includes:
obtaining the url number of each group of data;
judging whether the url number is smaller than a preset threshold value or not;
if so, the set of data is deleted.
In the foregoing solution, the step of processing the web page column url to obtain the corresponding first character string feature includes:
decoding a webpage column url;
performing word segmentation processing on the decoded web page column url to obtain a word list;
and vectorizing the words in the word list to obtain a first character string characteristic.
In the above scheme, the step of capturing the web page column url, the web page content url, and the title length corresponding to each url includes:
acquiring all website home pages url from a url library of a data acquisition system;
and acquiring a web page column url and a web page content url in each website according to the website home page url.
In the above scheme, the step of performing character string recognition on the data of each website to obtain multiple sets of data includes:
and clustering or regularly matching url paths, and dividing url with the same character string format into one group to obtain multiple groups of data.
In the foregoing solution, the step of inputting the training set into the selected machine learning model for training to obtain a trained machine learning model includes:
and inputting the training set into a support vector machine or logistic regression or naive Bayes for training to obtain a trained machine learning model.
In addition, the invention also discloses a use method of the webpage column recognition model obtained based on the webpage column recognition model training method, and the use method comprises the following steps:
acquiring urls to be detected, performing word segmentation processing, and acquiring vectorized character string characteristics and normalized title length characteristics, wherein the urls to be detected are web page column urls or web page content urls, and the title length characteristics are title lengths corresponding to the urls to be detected;
inputting the vectorized character string features and the normalized title length features into a pre-trained machine learning model;
determining the classification type of the url to be detected according to the output result of the machine learning model, wherein the classification type is as follows: a web column url and a web content url.
In order to achieve the above object, the present invention further provides a web page column recognition model training device, which includes a processor and a memory connected to the processor through a communication bus; wherein,
the memory is used for storing a webpage column recognition model training program;
the processor is used for executing the webpage column recognition model training program,
selecting a first number of web page columns url, a second number of web page contents url and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range;
processing a webpage column url to obtain a corresponding first character string characteristic;
processing the webpage content url to obtain a corresponding second character string characteristic;
carrying out normalization processing on the title length to obtain a title length characteristic;
dividing the first character string feature, the second character string feature and the title length feature into a training set and a verification set;
inputting the training set into the selected machine learning model for training to obtain a trained machine learning model;
testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall;
when the evaluation parameters reach a set threshold value, determining the model as a qualified model;
and any one of the steps of the web page column identification model training method.
To achieve the above object, the present invention further provides a computer readable storage medium, which stores one or more programs, where the one or more programs are executable by one or more processors, so as to cause the one or more processors to execute the steps of the web page column recognition model training method according to any one of the above aspects.
The invention provides a method for training a webpage column recognition model, which comprises the steps of selecting a first number of webpage column urls, a second number of webpage content urls and a title length corresponding to each url; processing a webpage column url to obtain a corresponding first character string characteristic; processing the webpage content url to obtain a corresponding second character string characteristic; carrying out normalization processing on the title length to obtain the title length characteristic; dividing character string features (first character string features or second character string features) and the title length features into a training set and a verification set; inputting the training set into the selected machine learning model for training to obtain a trained machine learning model; testing the trained machine learning model by adopting a verification set to obtain evaluation parameters; and when the evaluation parameters reach the set threshold value, determining the model as a qualified model. Because the information in the url character string is learned through machine learning, rather than only matching path fields as in rule matching, the learned rules are more numerous and more comprehensive, which overcomes the limitations of manually set rules. Therefore, the embodiment of the invention can solve the problem that the webpage column cannot be effectively identified in the prior art.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for training a web page column recognition model according to an alternative embodiment of the present invention;
FIG. 2 is a schematic diagram of a web page structure according to an alternative embodiment of the present invention;
FIG. 3 is a schematic diagram of an application of an alternative embodiment of the present invention;
FIG. 4 is a schematic diagram of another application of an alternative embodiment of the present invention;
FIG. 5 is a flowchart illustrating a specific implementation of a web page column recognition model training method according to an alternative embodiment of the present invention;
FIG. 6 is a flowchart illustrating another implementation of the method for training a web page column recognition model according to an alternative embodiment of the invention;
fig. 7 is a schematic structural diagram illustrating a composition of a training apparatus for a web page column recognition model according to an alternative embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flow chart of a method for training a web page column recognition model according to an embodiment of the present invention, and referring to fig. 1, the embodiment of the present invention provides a method for training a web page column recognition model, where the method includes:
step S101: selecting a first number of web page columns url, a second number of web page contents url and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range.
As shown in fig. 2, it is a hierarchical diagram of a conventional website structure, where a website home page is an entry webpage of a website, and is a content of a directory nature for guiding an internet user to browse contents of other parts of the website; the column page is an aggregation page established according to the category of the published information, and guides the user to quickly find the interested information; the content page is a detailed introduction page of an article, a commodity and the like, and the content can be displayed on one page.
Randomly selecting column urls and content urls of a batch of websites and title lengths corresponding to the urls from a url library with crawler records in a data acquisition system as an original data set.
It should be noted that most urls of the same website share many duplicate fields; to prevent such similar urls from dominating the data set, the website data needs to be balanced.
In a specific implementation mode of the invention, a webpage column url, a webpage content url and a title length corresponding to each url are captured; carrying out character string recognition on data of each webpage column url and each webpage content url to obtain a plurality of groups of data, wherein each group of data is url containing the same character string format; from the plurality of sets of data, a first number of web page columns url and a second number of web page content url are selected.
The collected original data set is stored separately by website. The data of each website is divided into several groups by methods such as k-means clustering or regular-expression matching, and a certain number of urls are randomly selected from each group to form a new data set. As shown in fig. 3, the urls are clustered into 3 groups of url data.
In order to further balance the website data, in the embodiment of the invention, the url number of each group of data is obtained; judging whether the url number is smaller than a preset threshold value or not; if so, the set of data is deleted.
It can be understood that when the number of urls is lower than the preset threshold, the recognition degree of the website is not high, so that the url of the group may not be considered, and therefore, the urls corresponding to the websites which do not satisfy the condition are deleted.
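The grouping, threshold filtering and random selection described above can be sketched as follows. This is only a minimal illustration: the `url_format` rule (digit runs become `<d>`, letter runs become `<w>`), the function names, the example urls and the threshold values are all assumptions made for this sketch, and a simple pattern key stands in for the k-means clustering or regular matching named in the text.

```python
import random
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_format(url):
    """Map a url to a coarse string-format key (hypothetical rule):
    each run of digits in the path becomes <d>, each run of letters <w>."""
    path = urlparse(url).path
    return re.sub(
        r"\d+|[A-Za-z]+",
        lambda m: "<d>" if m.group()[0].isdigit() else "<w>",
        path,
    )


def group_and_balance(urls, min_group_size, per_group, seed=0):
    """Group urls by format, delete groups smaller than min_group_size,
    then randomly keep at most per_group urls from each remaining group."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_format(url)].append(url)

    rng = random.Random(seed)
    balanced = {}
    for key, members in groups.items():
        if len(members) < min_group_size:
            continue  # url number below the preset threshold: delete the group
        balanced[key] = rng.sample(members, min(per_group, len(members)))
    return balanced


# Hypothetical urls from one website.
urls = [
    "http://example.com/news/2019/1123/1001.html",
    "http://example.com/news/2019/1124/1002.html",
    "http://example.com/news/2019/1125/1003.html",
    "http://example.com/sports/",
    "http://example.com/finance/",
    "http://example.com/about.html",
]
balanced = group_and_balance(urls, min_group_size=2, per_group=2)
```

Here the single `about.html` url forms a group of size 1 and is dropped, while the article-style group is sampled down to two urls.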
In the actually collected urls, since the number of urls of some websites is too large and the number of urls of other websites is too small, the samples are unbalanced, and the effect of the model is directly influenced. The invention adopts a clustering grouping method to balance the training data set, and can eliminate the webpage column url group data or the webpage content url group data with less quantity or out of the preset range. On the other hand, in order to improve the recognition of the model to the web page column url and the web page content url, a quantity ratio range of the web page column url and the web page content url may be set, for example, 1: 1.
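A quantity ratio such as 1:1 between column urls and content urls could be enforced by downsampling the larger class, as in the sketch below. The function name `balance_classes`, its parameters and the sample data are hypothetical; the text only states that a ratio range may be set.

```python
import random


def balance_classes(column_urls, content_urls, ratio=1.0, seed=0):
    """Downsample so that len(column) / len(content) is about `ratio`.

    ratio=1.0 corresponds to the 1:1 example; other values are untested
    assumptions of this sketch."""
    n_content = min(len(content_urls), int(len(column_urls) / ratio))
    n_column = min(len(column_urls), int(n_content * ratio))
    rng = random.Random(seed)
    return rng.sample(column_urls, n_column), rng.sample(content_urls, n_content)


# Hypothetical data: 3 column urls and 10 content urls.
columns = ["http://example.com/c%d/" % i for i in range(3)]
contents = ["http://example.com/a/%d.html" % i for i in range(10)]
kept_columns, kept_contents = balance_classes(columns, contents, ratio=1.0)
```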
Step S102: and processing the webpage column url to obtain the corresponding first character string characteristic.
In one implementation mode of the invention, firstly, the webpage column url is decoded; performing word segmentation processing on the decoded web page column url to obtain a word list; and vectorizing the words in the word list to obtain a first character string characteristic.
Illustratively, a url may have been encoded with encodeURIComponent() or encodeURI(). encodeURIComponent() differs from encodeURI() in that it is used to encode the individual components of a url rather than the entire url. For example, symbols such as "/", "@", "&", "=", "+", "$", "," and "#", which are not encoded by encodeURI(), are all encoded by encodeURIComponent(). Both functions convert the URI string into an escaped string using the UTF-8 encoding format. The url is then decoded according to the corresponding encoding rule.
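As a rough Python analogue of the decoding step, the standard library's `urllib.parse.unquote` reverses UTF-8 percent-encoding. The example url is hypothetical, and the two `quote` calls only approximate the component-level and whole-url encoding styles of the JavaScript functions; they are not claimed to be exact equivalents.

```python
from urllib.parse import quote, unquote

raw = "http://example.com/栏目/list?page=1"

# safe="" also escapes reserved characters such as "/" and "?",
# which is similar in spirit to component-level encoding.
component_style = quote(raw, safe="")

# Keeping the reserved characters in `safe` leaves the url structure
# intact, which is closer to whole-url encoding.
whole_url_style = quote(raw, safe=":/?=&@+$,#;")

# Decoding recovers the original string in both cases.
decoded_component = unquote(component_style)
decoded_whole = unquote(whole_url_style)
```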
The decoded url can be subjected to n-gram word segmentation to obtain a word list, and the tf-idf vectorization processing is carried out on the words in the word list to obtain the character string characteristics of the url.
It should be noted that TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. TF means Term Frequency (Term Frequency), and IDF means Inverse text Frequency index (Inverse Document Frequency).
N-gram word segmentation divides a character string into sub-character strings of length N using a sliding window that moves one character at a time. An N-gram distance between two strings can be obtained from their sets of non-repeated N-grams: the sum of the sizes of the two N-gram sets minus 2 times the number of N-grams that the two strings share.
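A minimal sketch of character-level N-gram segmentation follows. The distance function uses one common definition (sizes of the two distinct N-gram sets minus twice the size of their overlap), which is assumed here rather than taken from the source; the function names are hypothetical.

```python
def ngrams(text, n=3):
    """Slide a window of length n over text, moving one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_distance(s, t, n=3):
    """Assumed definition: |G(s)| + |G(t)| - 2 * |G(s) & G(t)| over distinct n-grams."""
    gs, gt = set(ngrams(s, n)), set(ngrams(t, n))
    return len(gs) + len(gt) - 2 * len(gs & gt)


tokens = ngrams("news", 3)
distance = ngram_distance("/news/", "/news/1", 3)
```

In this example the two path strings share four tri-grams and differ in one, so strings with the same format end up close to each other.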
Step S103: and processing the webpage content url to obtain the corresponding second character string characteristic.
The execution process of step S103 is the same as step S102, and the implementation process may refer to step S102, which is not described herein again.
The conventional word segmentation method according to the url path cannot effectively extract the characteristics of the url, and the n-gram method adopted by the invention can more accurately and comprehensively extract the characteristics of the difference between the column url and the content url.
In this embodiment of the present invention, step S102 and step S103 may be executed simultaneously, or step S102 may be executed first and then step S103, or step S103 may be executed first and then step S102, which is not limited in this embodiment of the present invention.
Step S104: and carrying out normalization processing on the title length to obtain the title length characteristic.
The obtained title length was normalized by a 0-1 normalization method and then expressed as a feature.
Illustratively, when titles A, B, C and D have lengths 4, 7, 8 and 3, the sum is 4+7+8+3=22, and the normalized values for title A, title B, title C and title D are 4/22≈0.182, 7/22≈0.318, 8/22≈0.364 and 3/22≈0.136. In the same way, normalized data can be obtained when the number of titles is 10, 100, 1000 and so on, and the normalized data is taken as a feature representation, as shown for example in fig. 4.
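The worked example divides each title length by the sum of all lengths, which the sketch below reproduces. Because "0-1 normalization" is often read as min-max scaling, that variant is shown as well; which of the two the method intends is not settled by the text, and both function names are hypothetical.

```python
def normalize_by_sum(lengths):
    """Divide each length by the total, as in the worked example."""
    total = sum(lengths)
    return [x / total for x in lengths]


def normalize_min_max(lengths):
    """Alternative reading of 0-1 normalization: scale to the range [0, 1]."""
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        return [0.0 for _ in lengths]
    return [(x - lo) / (hi - lo) for x in lengths]


title_lengths = [4, 7, 8, 3]
by_sum = [round(v, 3) for v in normalize_by_sum(title_lengths)]
by_min_max = normalize_min_max(title_lengths)
```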
In view of the different ways in which urls are encoded by each browser, the present invention first decodes all urls in a data set. The invention extracts url features from two aspects: url string and url corresponding title length.
In the prior art, when the features of a url string are learned, a field extracted after parsing the url is used as the minimum processing unit, so the learned url features are shallow, local features that stay at the field level. Therefore, when the url string is processed, tri-gram segmentation is performed on it to obtain more url fragment information. The fragments capture different aspects of the url, so the learned features are more comprehensive when the information is finally aggregated. The fragment information is aggregated by tf-idf weighting: the tf-idf value of each fragment is calculated, and the fragments that best distinguish the two kinds of url are screened out as the representative features of the urls.
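To illustrate how tf-idf screens tri-gram fragments, the sketch below weights each fragment by its term frequency times an unsmoothed inverse document frequency, log(N/df). This idf variant is an assumption (libraries typically apply smoothing), and the function names and example paths are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams produced by a window that moves one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(urls, n=3):
    """Return one {fragment: weight} dict per url using tf * log(N / df)."""
    docs = [Counter(char_ngrams(u, n)) for u in urls]
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency of each fragment

    n_docs = len(docs)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in doc.items()}
        )
    return vectors


# Hypothetical column-style and content-style paths.
vectors = tfidf_vectors(["/news/", "/news/123.html"])
```

Fragments shared by every url (such as `new`) receive weight 0, while fragments that occur in only one kind of url (such as `123`) receive a positive weight, which is the screening effect described above.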
Step S105: the string features (either the first string features or the second string features) and the title length features are divided into a training set and a validation set.
The first character string feature obtained in step S102 or the second character string feature obtained in step S103, together with the title length feature obtained in step S104, is divided into a training set and a validation set. The training set and the validation set each include a number of string features (either the first string feature or the second string feature) and the corresponding title length features. Specifically, the division may be performed in a ratio of 7:3, or in other ratios; for example, the ratio of training set to validation set may be set to 9:1, 8:2, and so on.
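A 7:3 random split could look like the sketch below. The function name, the integer `train_parts`/`val_parts` parameters and the fixed seed are assumptions chosen for reproducibility.

```python
import random


def split_dataset(samples, train_parts=7, val_parts=3, seed=0):
    """Shuffle the samples and split them in the ratio train_parts:val_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder samples stand in for (features, label) pairs.
train_set, validation_set = split_dataset(list(range(10)))
```

Passing `train_parts=9, val_parts=1` or `train_parts=8, val_parts=2` gives the other ratios mentioned above.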
Step S106: and inputting the training set into the selected machine learning model for training to obtain the trained machine learning model.
The training set is used to train the classification model, i.e. the model is completely learned from the training set. The invention trains a column url recognition model by adopting a machine learning algorithm. The specific process of machine learning is as follows:
1) Selecting an algorithm: in the invention the data is labeled, so training is a supervised learning process, and the data has two features: a string feature (first string feature or second string feature) and a title length feature. Therefore, a general classification algorithm such as a support vector machine, logistic regression or naive Bayes can be selected.
2) Training a model: the training data set is input into the model, and the algorithm automatically learns the information in the data from the labels and data features, updating the model step by step until the whole training data set has been used.
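As one concrete instance of the supervised training described above, the sketch below fits a logistic regression by batch gradient descent. Logistic regression is only one of the algorithms the text allows; the learning rate, epoch count, toy feature values and the convention that label 1 means column url are all assumptions of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=1.0, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Update the model step by step using the averaged gradient.
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(model, x):
    """Return 1 (column url, by assumption) or 0 (content url)."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


# Toy features: [string feature score, normalized title length].
X_train = [[1.0, 0.1], [0.9, 0.2], [0.1, 0.8], [0.2, 0.9]]
y_train = [1, 1, 0, 0]
model = train_logistic_regression(X_train, y_train)
```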
Step S107: testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall.
After the model is trained, it needs to be further evaluated to determine whether it is good enough to be put into use. After model training is finished, the character string features and title length features of the verification set are input into the model to obtain a predicted type for each url, and three index values are calculated from the predicted type and the actual type of each url to evaluate the model: accuracy, recall and F-score.
It should be noted that F-score is a weighted combination of accuracy and recall. A model with high recall may be somewhat weaker in accuracy, while a model with high accuracy may have unsatisfactory recall because its standard is strict; F-score takes both parameters into account. The formula is expressed as follows:
$$F_\beta = \frac{(1+\beta^2)\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^2\cdot \mathrm{Precision} + \mathrm{Recall}}$$
where Recall is the recall rate, Precision is the accuracy rate, and β balances the weights of Precision and Recall in the F-score calculation. The value of β is interpreted as follows: if β is 1, Precision is as important as Recall; if β is less than 1, Precision is more important than Recall; and if β is greater than 1, Recall is more important than Precision.
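The three indexes can be computed from predicted and actual labels as sketched below. The code uses the standard F-beta definition, (1+β²)·Precision·Recall / (β²·Precision + Recall), which matches the described behaviour of β; treating label 1 as the positive (column url) class and the sample labels themselves are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision (called accuracy rate in the text) and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


# Hypothetical validation labels and predictions.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
f1 = f_beta(precision, recall, beta=1.0)
```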
Step S108: when the evaluation parameters reach the set threshold value, determining the model as a qualified model.
When all three indexes reach the set threshold values, the model is a qualified model; otherwise, the model is retrained until it reaches the set threshold values.
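The qualification check and retraining loop might be organised as in the sketch below. The threshold values, the `max_rounds` guard and the `train_once` and `evaluate` callables are hypothetical placeholders, since the text does not specify them.

```python
def is_qualified(metrics, thresholds):
    """True only if every index reaches its set threshold."""
    return all(metrics[name] >= thresholds[name] for name in thresholds)


def train_until_qualified(train_once, evaluate, thresholds, max_rounds=5):
    """Retrain until the model is qualified or max_rounds is exhausted."""
    metrics = {}
    for _ in range(max_rounds):
        model = train_once()
        metrics = evaluate(model)
        if is_qualified(metrics, thresholds):
            return model, metrics
    return None, metrics


# Stand-in callables: the second training round yields better indexes.
results = iter([
    {"precision": 0.70, "recall": 0.80, "f_score": 0.75},
    {"precision": 0.95, "recall": 0.93, "f_score": 0.94},
])
thresholds = {"precision": 0.9, "recall": 0.9, "f_score": 0.9}
model, metrics = train_until_qualified(
    train_once=lambda: "trained-model",
    evaluate=lambda m: next(results),
    thresholds=thresholds,
)
```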
By applying the embodiment of the invention, the model is trained on two important features of a url, namely the url character string and the title length corresponding to the url, so the model is constrained from two aspects and its accuracy is ensured.
By applying the embodiment of the invention, when data is crawled, the column urls at every level of a website can be automatically identified and added to the acquisition source by providing only the home page url; compared with the existing rule matching method, column urls at deeper levels can be identified automatically.
In addition, the features are extracted from the url itself, so the rules it contains are fully learned and accuracy is ensured; data for which the rules are not obvious are corrected based on the title length, which further improves accuracy.
The machine learning algorithm used by the invention becomes more comprehensive and accurate as the number of websites increases, whereas in the existing scheme the rules become more complex and conflicting as the number of websites increases. Once the recognition model is trained, it does not need to be changed, and new urls are recognized quickly. The performance of the data acquisition system is improved as a whole.
As shown in fig. 5, in a specific implementation manner of the present invention, after an algorithm is selected, the model to be trained is determined, and a training data set is input to the model for training. The verification data set is then input to the model for model evaluation to determine whether the model is qualified. If the model is evaluated as qualified, it is used as an available model for predicting urls to be detected; otherwise, a new training data set is selected for retraining the model.
A data acquisition system formed by crawlers collects data and forms a url database, and data is extracted from the database to obtain an original data set. The numbers of web page content urls and web page column urls in the original data set are balanced to obtain a balanced data set. The features corresponding to the web page content urls and the web page column urls are extracted respectively, and the resulting data set is randomly divided into a training data set and a verification data set. The selected model is trained with the training data set to obtain a classifier, and the classifier is evaluated with the verification data set: after training, the verification set is input into the model to obtain predicted labels, and three indexes are calculated from the predicted labels and the actual labels to judge whether the model is qualified. If the model is qualified, it is the final classifier; new urls are identified through the classifier to obtain the recognized column urls, and these column urls are used as acquisition seed sources in the data acquisition system.
As shown in fig. 6, an embodiment of the present invention provides a method for using a web page column recognition model, including the steps of:
Step S601: acquire a url to be detected and perform word segmentation processing to obtain a vectorized character string feature and a normalized title length feature, wherein the url to be detected is a web page column url or a web page content url.
Step S602: input the vectorized character string feature and the normalized title length feature into a pre-trained machine learning model.
The processing in step S601 is the same as that in steps S102, S103, and S104: the character string feature and the title length feature corresponding to the url to be detected are obtained and then input into the trained machine learning model to obtain its output result.
Step S603: determine the classification type of the url to be detected according to the output result of the machine learning model, wherein the classification type is either a web page column url or a web page content url.
Specifically, the output label of the machine learning model indicates whether the url to be detected is a web page column url or a web page content url.
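Steps S601 to S603 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the tokenizer, the fixed `vocabulary`, the cap-and-divide normalization of the title length (`max_title_len`), and the label convention (1 for column, 0 for content) are hypothetical, and `model` stands for any callable wrapping a previously trained classifier.

```python
import re
from urllib.parse import unquote

def extract_features(url, title, vocabulary, max_title_len=100):
    """S601: segment the url into words, count vocabulary words, and
    normalize the title length into [0, 1] (normalization scheme assumed)."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", unquote(url).lower()) if t]
    vector = [tokens.count(word) for word in vocabulary]
    title_feature = min(len(title), max_title_len) / max_title_len
    return vector, title_feature

def classify_url(url, title, vocabulary, model):
    """S602 and S603: feed the features to the model and map its label to a type."""
    vector, title_feature = extract_features(url, title, vocabulary)
    label = model(vector + [title_feature])  # assumed to return 0 or 1
    return "column" if label == 1 else "content"

def stub_model(features):
    """Arbitrary stand-in for a trained classifier, present only so the
    example runs; it says nothing about what a real model would learn."""
    return 1 if features[1] > 0 else 0
```

For example, `classify_url("http://example.com/news/list/", "News", ["news", "list", "html"], stub_model)` passes the feature vector `[1, 1, 0, 0.04]` to the stub.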
To achieve the above object, the present invention further provides a web page column recognition model training device. Referring to fig. 7, the device includes a processor 701 and a memory 702 connected to the processor 701 through a communication bus 703; the memory 702 is used for storing a web page column recognition model training program; the processor 701 is configured to execute the program to implement the following steps: selecting a first number of web page column urls, a second number of web page content urls, and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range;
processing a webpage column url to obtain a corresponding first character string characteristic;
processing the webpage content url to obtain a corresponding second character string characteristic;
carrying out normalization processing on the title length to obtain a title length characteristic;
dividing character string features (first character string features or second character string features) and the title length features into a training set and a verification set;
inputting the training set into the selected machine learning model for training to obtain a trained machine learning model;
testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall;
and determining the trained machine learning model to be a qualified model when the evaluation parameters reach a set threshold value.
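The evaluation and qualification steps can be sketched as below. The function names, the dictionary layout, and the convention that label 1 is the positive (column) class are assumptions for illustration; the description names accuracy and recall but does not fix the threshold values.

```python
def evaluate(predicted, actual, positive=1):
    """Compute accuracy and recall of predicted labels against actual labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    pairs = list(zip(predicted, actual))
    correct = sum(p == a for p, a in pairs)
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    actual_pos = sum(a == positive for a in actual)
    return {
        "accuracy": correct / len(actual),
        # Recall is undefined without positive samples; report 0.0 in that case.
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }

def is_qualified(metrics, thresholds):
    """The model is qualified only if every metric reaches its threshold."""
    return all(metrics[name] >= value for name, value in thresholds.items())
```

A call such as `is_qualified(evaluate(pred, actual), {"accuracy": 0.9, "recall": 0.9})` would then decide whether the trained model is kept; the 0.9 values are placeholders.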
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method: capturing a webpage column url, a webpage content url and a title length corresponding to each url;
carrying out character string recognition on the data of each web page column url and each web page content url to obtain a plurality of groups of data, wherein each group of data consists of urls having the same character string format;
and selecting, from the plurality of groups of data, a first number of web page column urls and a second number of web page content urls.
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method: obtaining the url number of each group of data;
judging whether the url number is smaller than a preset threshold value or not;
if so, the set of data is deleted.
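One way to picture the grouping and filtering steps is the sketch below. The notion of "character string format" used here (host plus path with every run of digits masked) is an assumption chosen for illustration, as are the names `url_format` and `group_and_filter` and the default threshold of 2.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def url_format(url):
    """Reduce a url to a coarse format: host plus path with digit runs masked.
    This definition of 'same character string format' is an assumption."""
    parts = urlparse(url)
    return parts.netloc + re.sub(r"\d+", "{n}", parts.path)

def group_and_filter(urls, min_group_size=2):
    """Group urls by format and delete groups smaller than the preset threshold."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_format(url)].append(url)
    return {fmt: members for fmt, members in groups.items()
            if len(members) >= min_group_size}
```

With this definition, `http://a.com/news/1.html` and `http://a.com/news/22.html` fall into the same group, while a format that occurs only once is dropped.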
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method:
decoding a webpage column url;
performing word segmentation processing on the decoded web page column url to obtain a word list;
and vectorizing the words in the word list to obtain a first character string characteristic.
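The decode, segment, and vectorize steps might look like the following sketch. Percent-decoding via `urllib.parse.unquote`, splitting on common url delimiters, and bag-of-words counting are assumptions; the description does not specify the decoding scheme, the segmentation rules, or the vectorization method.

```python
import re
from collections import Counter
from urllib.parse import unquote

def url_to_words(url):
    """Decode percent-escapes, then split the url on common delimiters."""
    decoded = unquote(url).lower()
    return [w for w in re.split(r"[/:.?=&_\-#]+", decoded) if w]

def build_vocabulary(urls):
    """Map every word seen in the training urls to a vector index."""
    words = sorted({w for url in urls for w in url_to_words(url)})
    return {w: i for i, w in enumerate(words)}

def vectorize(url, vocabulary):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(url_to_words(url)).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector
```

The vocabulary would be built once from the training urls and reused unchanged when vectorizing a url to be detected, so that training and detection vectors share the same layout.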
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method: acquiring all website home pages url from a url library of a data acquisition system;
and acquiring a web page column url and a web page content url in each website according to the website home page url.
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method: clustering url paths or matching them against regular expressions, and dividing urls with the same character string format into one group to obtain multiple groups of data; and judging whether the number of urls in a group is smaller than a preset threshold value, and if so, deleting the urls of that group.
Here, the processor 701 is configured to execute the web page column recognition model training program to implement the following steps of the web page column recognition model training method: inputting the training set into a support vector machine, logistic regression, or naive Bayes model for training to obtain a trained machine learning model.
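As an illustration of the training step, the sketch below fits a logistic regression, one of the three model families named above, by plain batch gradient descent. It is written out by hand only so that the block has no dependencies; a practical system would more likely rely on a library implementation. The learning rate, epoch count, feature layout (word counts followed by the normalized title length), and label convention are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the logistic loss.
    Hyperparameter values are illustrative assumptions."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            # Prediction error for this sample under the current parameters.
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Step against the mean gradient.
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias

def predict(model, x):
    """Return 1 (assumed: column url) if the decision score is non-negative, else 0."""
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias >= 0 else 0
```

Each training row here would be a character string feature vector with the title length feature appended as the last element.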
Optionally, the processor 701 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Here, the program executed by the processor 701 may be stored in the memory 702 connected to the processor 701 via the communication bus 703, and the memory 702 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 702 described in the embodiments of the invention is intended to include, without being limited to, these and any other suitable types of memory.
The memory 702 in the present embodiment is used for storing various types of data to support the operation of the processor 701. Examples of such data include: any computer programs for operation on the processor 701, such as an operating system and application programs; contact data; phonebook data; messages; pictures; videos; and the like. The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is used for implementing various basic services and processing hardware-based tasks.
To achieve the above object, the present invention further provides a computer-readable storage medium, which stores one or more programs, where the one or more programs are executable by one or more processors 701, so as to cause the one or more processors 701 to execute the steps of the web page column recognition model training method according to any one of the above aspects: selecting a first number of web page column urls, a second number of web page content urls and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range;
processing a webpage column url to obtain a corresponding first character string characteristic;
processing the webpage content url to obtain a corresponding second character string characteristic;
carrying out normalization processing on the title length to obtain a title length characteristic;
dividing the first character string feature, the second character string feature and the title length feature into a training set and a verification set;
inputting the training set into the selected machine learning model for training to obtain a trained machine learning model;
testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall;
and determining the trained machine learning model to be a qualified model when the evaluation parameters reach a set threshold value.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: capturing a webpage column url, a webpage content url and a title length corresponding to each url;
carrying out character string recognition on the data of each web page column url and each web page content url to obtain a plurality of groups of data, wherein each group of data consists of urls having the same character string format;
and selecting, from the plurality of groups of data, a first number of web page column urls and a second number of web page content urls.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: obtaining the url number of each group of data;
judging whether the url number is smaller than a preset threshold value or not;
if so, the set of data is deleted.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: decoding a webpage column url;
performing word segmentation processing on the decoded web page column url to obtain a word list;
and vectorizing the words in the word list to obtain a first character string characteristic.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: acquiring all website home pages url from a url library of a data acquisition system;
and acquiring a web page column url and a web page content url in each website according to the website home page url.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: clustering url paths or matching them against regular expressions, and dividing urls with the same character string format into one group to obtain multiple groups of data.
Optionally, the one or more programs may be executable by the one or more processors 701 to cause the one or more processors 701 to perform the following steps of the web page column recognition model training method: inputting the training set into a support vector machine, logistic regression, or naive Bayes model for training to obtain a trained machine learning model.
Optionally, the computer-readable storage medium may be a volatile memory, such as a random access memory; or a non-volatile memory, such as a read-only memory, flash memory, hard disk, or solid state disk; or may be a device, such as a mobile phone, computer, tablet device, or personal digital assistant, that includes one or any combination of the above memories.
The embodiment of the invention also provides a device for using the web page column recognition model, which comprises a processor and a memory connected with the processor through a communication bus; wherein
the memory is used for storing a web page column recognition model application program;
the processor is used for executing the web page column recognition model application program to implement the following steps:
acquiring a url to be detected, and performing word segmentation processing to obtain a vectorized character string feature and a normalized title length feature;
inputting the character string feature and the title length feature into a pre-trained machine learning model;
determining the classification type of the url to be detected according to the output result of the machine learning model, wherein the classification type is either a web page column url or a web page content url.
The present invention also provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to cause the one or more processors to perform the steps of the method for using the web page column recognition model according to any one of the above aspects.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method for training a web page column recognition model is characterized by comprising the following steps:
selecting a first number of web page column urls, a second number of web page content urls and a title length corresponding to each url, wherein the ratio of the first number to the second number is within a preset range;
processing a webpage column url to obtain a corresponding first character string characteristic;
processing the webpage content url to obtain a corresponding second character string characteristic;
carrying out normalization processing on the title length to obtain a title length characteristic;
dividing the first character string feature, the second character string feature and the title length feature into a training set and a verification set;
inputting the training set into the selected machine learning model for training to obtain a trained machine learning model;
testing the trained machine learning model by adopting a verification set to obtain evaluation parameters, wherein the evaluation parameters include but are not limited to: accuracy and recall;
and determining the trained machine learning model to be a qualified model when the evaluation parameters reach a set threshold value.
2. The method for training a web page column recognition model according to claim 1, wherein the step of selecting a first number of web page column urls, a second number of web page contents urls, and a title length corresponding to each url comprises:
capturing a webpage column url, a webpage content url and a title length corresponding to each url;
carrying out character string recognition on the data of each web page column url and each web page content url to obtain a plurality of groups of data, wherein each group of data consists of urls having the same character string format;
and selecting, from the plurality of groups of data, a first number of web page column urls and a second number of web page content urls.
3. The method for training a web page column recognition model according to claim 2, wherein after the step of performing character string recognition on the data of each website to obtain a plurality of groups of data, the method further comprises:
obtaining the url number of each group of data;
judging whether the url number is smaller than a preset threshold value or not;
if so, the set of data is deleted.
4. The method for training a web page column recognition model according to claim 2, wherein the step of processing the web page column url to obtain the corresponding first character string feature comprises:
decoding a webpage column url;
performing word segmentation processing on the decoded web page column url to obtain a word list;
and vectorizing the words in the word list to obtain a first character string characteristic.
5. The method for training a web page column recognition model according to claim 2, wherein the step of capturing the web page column url, the web page content url, and the title length corresponding to each url comprises:
acquiring all website home pages url from a url library of a data acquisition system;
and acquiring a web page column url and a web page content url in each website according to the website home page url.
6. The training method of web page column recognition models according to any one of claims 2-5, wherein the step of performing character string recognition on the data of each website to obtain a plurality of groups of data comprises:
clustering url paths or matching them against regular expressions, and dividing urls with the same character string format into one group to obtain multiple groups of data.
7. The method for training a web page column recognition model according to claim 1, wherein the step of inputting the training set into the selected machine learning model for training to obtain the trained machine learning model comprises:
inputting the training set into a support vector machine, logistic regression, or naive Bayes model for training to obtain a trained machine learning model.
8. Use of the web page column recognition model obtained by the web page column recognition model training method according to any one of claims 1 to 7, wherein the use method comprises:
acquiring a url to be detected, performing word segmentation processing, and obtaining a vectorized character string feature and a normalized title length feature, wherein the url to be detected is a web page column url or a web page content url, and the title length feature is derived from the title length corresponding to the url to be detected;
inputting the vectorized character string features and the normalized title length features into a pre-trained machine learning model;
determining the classification type of the url to be detected according to the output result of the machine learning model, wherein the classification type is either a web page column url or a web page content url.
9. A device for training a web page column recognition model, characterized by comprising a processor and a memory connected with the processor through a communication bus; wherein
the memory is used for storing a webpage column recognition model training program;
the processor is configured to execute the web page column recognition model training program to implement the steps of the web page column recognition model training method according to any one of claims 1 to 7.
10. A storage medium, in particular a computer-readable storage medium, storing one or more programs, which are executable by one or more processors to cause the one or more processors to perform the steps of the web page column recognition model training method according to any one of claims 1 to 7.
CN201911161584.XA 2019-11-25 2019-11-25 Webpage column recognition model training method, using method, device and storage medium Active CN110633446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911161584.XA CN110633446B (en) 2019-11-25 2019-11-25 Webpage column recognition model training method, using method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911161584.XA CN110633446B (en) 2019-11-25 2019-11-25 Webpage column recognition model training method, using method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110633446A CN110633446A (en) 2019-12-31
CN110633446B true CN110633446B (en) 2020-03-13

Family

ID=68979500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911161584.XA Active CN110633446B (en) 2019-11-25 2019-11-25 Webpage column recognition model training method, using method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110633446B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101038596A (en) * 2007-04-29 2007-09-19 北京搜狗科技发展有限公司 Method and system for classifying website
CN101071426A (en) * 2006-05-10 2007-11-14 北京锐科天智科技有限责任公司 Personalized webpage generating method and device
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053979B (en) * 2009-10-27 2012-12-12 华为技术有限公司 Information acquisition method and system
US20140136948A1 (en) * 2012-11-09 2014-05-15 Microsoft Corporation Taxonomy Driven Page Model

Also Published As

Publication number Publication date
CN110633446A (en) 2019-12-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant