WO2017167067A1

WO2017167067A1 - Method and device for webpage text classification, method and device for webpage text recognition

Info

Publication number: WO2017167067A1
Application number: PCT/CN2017/077489
Authority: WO
Inventors: 段秉南
Original assignee: 阿里巴巴集团控股有限公司; 段秉南
Priority date: 2016-03-30
Filing date: 2017-03-21
Publication date: 2017-10-05
Also published as: CN107291723A; TWI735543B; TW201737118A; CN107291723B

Abstract

A method and device for webpage text classification, and a method and device for webpage text recognition. The method for webpage text classification comprises: collecting text data from a webpage (101); segmenting the text data to obtain basic text segments(102); calculating a first attribute value and a second attribute value of each of the basic text segments (103); calculating a characteristic value of each of the basic text segments according to the first attribute value and the second attribute value (104); screening and selecting characteristic text segments from the basic text segments according to the characteristic value (105); calculating a weight corresponding to each of the characteristic text segments (106); treating the weight as a characteristic vector corresponding to the characteristic text segments, and utilizing the characteristic vector to train a classification model (107). The method and device of the present invention effectively ensure objectivity and accuracy in extracting a characteristic, and also take into account the influence of a characteristic on classification, thereby increasing the accuracy of webpage text classification, and further facilitating a user to accurately and timely obtain effective information from a massive amount of text.

Description

Method and device for classifying webpage text, method and device for recognizing webpage text

Technical field

The present application relates to the technical field of text classification, and in particular to a method for classifying web page text, a device for classifying web page text, a method for recognizing web page text, and a device for recognizing web page text.

Background technique

In today's information society, all forms of information have greatly enriched people's lives. Especially with the large-scale popularization of the Internet, the amount of information on the Internet is growing rapidly, such as various electronic documents, emails and web pages. Filled with the network, causing messy information. In order to find the information we need quickly, accurately and comprehensively, text categorization has become an important way to effectively organize and manage text data, and it has received more and more attention.

The webpage text classification refers to determining the category of the corresponding webpage according to the content of the massive webpage document according to the predefined theme category. The technical basis for web page text categorization is content-based plain text categorization. The basic method is to extract the content of the plain text of each webpage text in the captured webpage collection, and obtain the corresponding plain text. The extracted plain text is then combined into a new document collection, and a plain text classification algorithm is applied to the new document collection for classification. According to the correspondence between the plain text and the webpage text, the webpage text is classified, that is, the plaintext content information of the webpage is applied, and the webpage is classified.

Due to the multi-intentionality, ambiguity, and dissimilarity of massive texts, in the prior art, it is difficult to select the classification features. For example, the role of some invalid words is often exaggerated, or some The important attributes of some feature parts, resulting in extremely low accuracy of web page text classification.

Summary of the invention

In view of the above problems, embodiments of the present application have been made in order to provide an overcoming of the above problems or A method for classifying web page text, at least partially solving the above problem, a method for recognizing web page text, and a corresponding device for classifying web page text, a device for recognizing web page text.

In order to solve the above problem, the embodiment of the present application discloses a method for classifying webpage text, including:

Collect text data in a web page;

Segmenting the text data to obtain a basic participle;

Calculating a first attribute value and a second attribute value of each base participle;

Calculating a feature value of each basic participle according to the first attribute value and the second attribute value;

Extracting feature word segments from the basic participle according to the feature value;

Calculate the corresponding weight of each feature participle;

The weight is used as the feature vector of the corresponding feature word segment, and the feature model is used to train the classification model.

Preferably, the first attribute value is an information gain value of the base participle, and the second attribute value is a standard deviation of the base participle relative to a pre-defined chi-square statistic value of each category, the feature The value is the degree of discrimination of the basic participle.

Preferably, the feature values of the basic participles are calculated according to the first attribute value and the second attribute value by the following formula:

Where, score is the degree of discrimination of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base part-of-score pair with respect to the predefined respective categories, and n is the number of predefined classifications.

Preferably, the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:

Arranging the basic participle according to its corresponding feature value from high to low;

Extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.

Preferably, the step of calculating corresponding weights of each feature word segment includes:

Obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;

Counting the total number of feature word segments in the text data of the webpage;

According to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.

Preferably, the following formula is used to calculate the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding web page, and the total number of feature word segments in the text data of the web page, and calculate corresponding feature word segments. Weights:

Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.

Preferably, the step of calculating a corresponding weight of each feature word segment further includes:

The weights of the feature word segments are normalized.

Preferably, the weights of the feature word segmentation are normalized by the following formula:

Wherein, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data in the webpage, and max(weight) is the text data in the webpage. Medium maximum weight value.

The embodiment of the present application further discloses a method for text recognition of a webpage, including:

Extracting text data in the web page to be identified;

Segmenting the text data to obtain a basic participle;

Calculate the corresponding weight of each feature participle;

And inputting the weight as a feature vector into a pre-trained classification model to obtain classification information;

Marking the classification information for the to-be-identified web page.

The weights of the feature word segments are normalized.

The embodiment of the present application further discloses an apparatus for classifying webpage texts, including:

An acquisition module, configured to collect text data in a webpage;

a word segmentation module for segmenting the text data to obtain a basic participle;

a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each base participle;

An eigenvalue calculation module, configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value;

a feature extraction module, configured to filter feature segmentation words from the basic participle according to the feature value;

a feature weight allocation module, configured to calculate a corresponding weight of each feature word segmentation;

The model training module is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.

Preferably, the feature value calculation module calculates the feature values of the basic participle words according to the first attribute value and the second attribute value by using the following formula:

Preferably, the feature extraction module comprises:

a sorting sub-module for arranging the basic participle according to its corresponding feature value from highest to lowest;

And an extraction sub-module, configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.

Preferably, the feature weight allocation module comprises:

a number statistics sub-module, configured to obtain the number of occurrences of each feature word segment in the text data of the corresponding webpage;

a total number of word segmentation sub-modules for counting the total number of feature word segments in the text data of the webpage;

a calculation submodule, configured to calculate, according to the feature value of the feature word segmentation, the number of occurrences of each feature segmentation word in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, and calculate corresponding feature segmentation words Weights.

Preferably, the calculation sub-module is calculated according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage. The corresponding weights of each feature participle:

Preferably, the feature weight distribution module further includes:

The normalization submodule is configured to normalize the weight of the feature word segmentation.

Preferably, the normalization sub-module normalizes the weight of the feature word segment by the following formula:

Where norm(weight) is the weight after normalization, weight is the weight of the feature participle, and min(weight) is the minimum weight value in the text data in the webpage. Max(weight) is the maximum weight value in the text data in the webpage.

The embodiment of the present application further discloses an apparatus for text recognition of a webpage, including:

a text extraction module, configured to extract text data in the webpage to be identified;

a classification module, configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information;

a marking module, configured to mark classification information for the to-be-identified webpage.

Embodiments of the present application include the following advantages:

The embodiment of the present application improves the objectivity and accuracy of feature extraction by improving the extraction method of feature word segmentation and the calculation method of feature word segmentation weight, and also considers the influence of feature on classification, thereby improving the accuracy of web page text classification. Sex, more convenient for users to obtain valid information in a timely and accurate manner in a large amount of text.

The embodiment of the present application combines at least two feature extraction algorithms, and introduces a standard deviation in the chi-square statistics, which effectively ensures the objectivity and accuracy of the feature extraction. Moreover, by using the long tail distribution map to select the number of features, the weighting of the feature segmentation effect is adopted for the feature segmentation, so that the effective features can be further screened, so that the effect of the webpage text classification is more accurate.

DRAWINGS

1 is a flow chart showing the steps of a method for classifying web page text according to the present application;

2 is a schematic diagram of a long tail distribution in an example of the present application;

3 is a flow chart of steps of text recognition of a webpage according to the present application;

4 is a structural block diagram of an apparatus for classifying web page text according to the present application;

FIG. 5 is a structural block diagram of an apparatus for text recognition of a webpage according to the present application.

detailed description

The above described objects, features and advantages of the present application will become more apparent and understood.

Text categorization is to obtain a mapping rule between a category and an unknown text by training a certain set of texts, that is, calculating the relevance of the text and the category, and then determining the category attribution of the text according to the trained classifier.

Text categorization is a guided learning process. It finds a relational model (classifier) between text attributes (features) and text categories based on a set of training texts that have been annotated, and then uses the relational model pair obtained by this learning. The new text is judged by category. The process of text categorization can be divided into two parts: training and classification. The purpose of training is to construct a classification model for the classification by linking the new text and categories. The classification process is a process of classifying unknown texts based on training results, giving a category identification.

Referring to FIG. 1 , a flow chart of steps of a method for classifying web page texts according to the present application is shown. Specifically, the method may include the following steps:

Step 101: Collect text data in a webpage;

This step obtains the text data of the webpage used for the training of the classification model. In practice, it may be massive data. The usual processing method is to extract the plain text content of each webpage text in the captured webpage collection, thereby obtaining corresponding plain text, and then extracting the plain text into a new document collection, the document collection. That is, the webpage referred to in this application Text data in .

Step 102: Perform word segmentation on the text data to obtain a basic participle;

As we all know, English is based on words, words and words are separated by spaces, and Chinese is in words. All the words in a sentence can be combined to describe a meaning. For example, the English sentence I am a student, in Chinese is: "I am a student." The computer can easily know that student is a word by a space, but it is not easy to understand that the words "learning" and "sheng" are combined to represent a word. The Chinese character sequence is divided into meaningful words, which are Chinese word segments. For example, I am a student and the result of the participle is: I am a student.

Here are some common word segmentation methods:

1. Word segmentation based on string matching: refers to matching the Chinese character string to be analyzed with a term in a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the matching is successful ( Identify a word). The actual word segmentation system uses mechanical segmentation as a preliminary method, and further improves the accuracy of segmentation by using various other language information.

2. The word segmentation method based on feature scanning or mark segmentation: refers to prioritizing and segmenting some words with obvious features in the string to be analyzed. Using these words as breakpoints, the original string can be divided into Small strings come into mechanical participles to reduce the error rate of matching; or combine word segmentation with word class notation, use rich word class information to help segmentation decision making, and in turn, test and adjust the word segmentation results in the labeling process. , thereby improving the accuracy of the segmentation.

3. The word segmentation method based on understanding: refers to the effect of identifying words by letting the computer simulate the understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and use syntactic information and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain the syntactic and semantic information about words, sentences, etc. to judge the participle ambiguity. That is, it simulates the process of understanding people's sentences. This method of word segmentation requires a large amount of linguistic knowledge and information.

4. Statistical-based word segmentation method: It means that the frequency or probability of co-occurrence of words and words in Chinese information can better reflect the credibility of words, so each word in the corpus can be co-occurred. The frequency of the combination is counted, their mutual information is calculated, and the adjacent co-occurrence probability of the two Chinese characters X and Y is calculated. The mutual information can reflect the closeness of the relationship between Chinese characters. When the degree of tightness is above a certain threshold, the word group may be considered to constitute a word. This method only needs to count the frequency of the words in the corpus, and does not need to cut the dictionary.

The manner in which the text data is segmented by the present application is not limited, and the word segmentation is performed on the document set, and all the word segments obtained are the basic participles referred to in the present application.

In a specific implementation, before proceeding to the next step, the removal process may also be performed in advance for the invalid words in the basic participle, for example, for the stop words. Stop words usually refer to frequent occurrences in various types of text, and are therefore considered to have few high-frequency words such as pronouns, prepositions, conjunctions, etc. that help to classify any information. Those skilled in the art can also design feature words that need to be deleted before or during feature extraction according to requirements, which need not be limited in this application.

Step 103: Calculate a first attribute value and a second attribute value of each basic participle;

Step 104: Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;

Step 105: Filter feature tokens from the basic participle according to the feature value;

The above steps 103-105 relate to the processing of feature selection in text categorization. Usually the original feature space dimension is very high, and there are a lot of redundant features, so feature dimension reduction is needed. Feature selection is one of the characteristics of feature dimension reduction. Its basic idea is to score each original feature item independently according to a certain evaluation function, and then sort by the level of the score, and select several features with the highest score. Item, or a threshold is set in advance, the metric value is filtered out of the threshold feature, and the remaining candidate features are used as the feature subset of the result.

The feature selection algorithm includes algorithms such as document frequency, mutual information amount, information gain, and χ ² statistic (CHI). In the prior art, those skilled in the art usually select one of them to select the feature word segmentation. However, the use of this single algorithm has many drawbacks. Taking the information gain algorithm as an example, the information gain appears in the text through the word segmentation. The amount of information before and after the occurrence of the word segmentation is used to infer the amount of information carried by the participle, that is, the information gain value of a participle indicates the amount of information contained in the participle feature. It can be understood that the higher the information gain value, the segmentation feature can give the classifier a larger amount of information, but the existing information gain algorithm only considers the amount of information provided by the segmentation feature to the overall classifier, ignoring the different segmentation features. The degree of discrimination of each category.

Or, taking the χ ² statistic (CHI) algorithm as an example, the chi-square statistic is also used to characterize the correlation between two variables. It also considers the case when the feature appears and does not appear in a certain type of text. The larger the chi-square statistic, the more relevant it is to the class, and the more the category information is carried, but the existing χ ² statistic (CHI) algorithm over-exaggerates the role of low-frequency words.

In view of the above drawbacks, the present application proposes that no single algorithm is used, and at least two algorithms are used for feature extraction, that is, different first algorithms are used to calculate the first attribute value and the second attribute value of each basic participle, for example, using information gain. The algorithm calculates the first attribute value and uses the CHI algorithm to calculate the second attribute value.

Certainly, those skilled in the art may use other algorithms to calculate different attribute values of the word segmentation according to actual conditions, and even more than two attribute values, which are feasible, and the application does not limit this.

In a preferred embodiment of the present application, the first attribute value may be an information gain value of the basic participle, and the second attribute value may be a chi-square part of the basic participle relative to a predefined each category. The standard deviation of the statistic value, the eigenvalue may be the degree of discrimination of the basic participle, that is, the step 103 may specifically include the following sub-steps:

Sub-step 1031, calculating an information gain value of each basic participle;

Sub-step 1032, calculating a chi-square statistic value of each basic participle;

Sub-step 1033, based on the number of base participles, the standard deviation of the base participle relative to the predefined chi-square statistic of each category is counted.

In this case, the step 104 may be: obtaining the discrimination degree of each basic participle based on the product of the information gain value and the standard deviation.

More specifically, the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by the following formula:

The application combines at least two feature extraction algorithms and introduces a standard deviation in the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction.

In a preferred embodiment of the present application, the step 105 may specifically include the following sub-steps:

Sub-step 1051, the basic participle is arranged according to its corresponding feature value from high to low;

Sub-step 1052, extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.

After calculating the eigenvalues of the basic participles, it can be found that the value conforms to the long tail distribution (Qilufu's law) as shown in Fig. 2. In Fig. 2, the horizontal axis is the number of basic participles, and the vertical axis is the basic participle. The degree of discrimination, applying a preferred embodiment of the present application, may take, for example, a base participle with an abscissa greater than 0 and less than 30,000 as a feature segmentation.

By using the long tail profile to select the number of features, the present application can further screen out the effective features, thereby making the effect of web page text classification more accurate.

Step 106: Calculate corresponding weights of each feature participle;

In the text, each feature participle is given a weight indicating the importance of the feature participle in the text. The weights are generally calculated based on the frequency of the feature items. There are many calculation methods, such as Boolean weight method, word frequency weight method, TF/IDF weight method, TFC weight method, etc. There are also many disadvantages in the calculation. For example, in TF/IDF weight method, TF indicates the number of features in a single text, and IDF indicates the number of features in the entire corpus, so the influence of features on classification is completely ignored.

Thus, the present application proposes a preferred embodiment for calculating weights. In this embodiment, the step 106 may include the following sub-steps:

Sub-step 1061: obtaining the number of times each feature participle appears in the text data of the corresponding webpage;

Sub-step 1062, counting the total number of feature word segments in the text data of the webpage;

Sub-step 1063, according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding web page, and the total number of feature word segments in the text data of the web page, the corresponding weights of each feature word segment are calculated.

As an example of a specific application of the preferred embodiment of the present application, the sub-step 1063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:

In a specific implementation, it is more preferable that the step 105 further includes the following sub-steps:

Sub-step 1064, normalizing the weights of the feature word segments.

As an example of a specific application of the present application, the weight of the feature word segmentation can be normalized by the following formula:

The weights used in the examples of the present application take into account the influence of features on the classification, and thus can further improve the effectiveness of feature selection. Of course, it is feasible to use any of the weight calculation methods in this application, and the application does not need to be limited.

The corresponding weights of each feature segment calculated above (including the weight obtained in sub-step 1063 or the normalized weight obtained in sub-step 1064) can be used as a feature vector of a text, and a text classification can be selected after obtaining the feature vector. The algorithm trains the classification model.

Step 107: The weight is used as a feature vector of the corresponding feature word segment, and the feature model is used to train the classification model.

Those skilled in the art can use any text classification algorithm, such as Bayesian Probability Algorithm (Naive Bayese), support vector machine, KNN algorithm (k nearest neighbor), etc., to use the feature vector to train the classification model, and the present application is feasible. This is not a limitation.

Referring to FIG. 3, a flowchart of an embodiment of a method for text recognition of a webpage according to the present application is shown. Specifically, the method may include the following steps:

Step 301: Extract text data in the webpage to be identified;

Step 302, performing segmentation on the text data to obtain a basic participle;

Step 303: Calculate a first attribute value and a second attribute value of each basic participle;

Step 304: Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;

Step 305: Filter feature feature words from the basic participle according to the feature value;

Step 306, calculating corresponding weights of each feature participle;

Step 307: Enter the weight as a feature vector into a pre-trained classification model to obtain classification information.

Step 308: Mark classification information for the to-be-identified webpage.

In a preferred embodiment of the present application, the first attribute value may be an information gain value of the basic participle, and the second attribute value may be a chi-square part of the basic participle relative to a predefined each category. The standard deviation of the statistic value, which may be the degree of discrimination of the base participle.

As an example of the specific application of the present application, the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by using the following formula:

In a preferred embodiment of the present application, the step 305 may include the following sub-steps:

Sub-step 3051, the basic participle is arranged according to its corresponding feature value from high to low;

Sub-step 3052, extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.

In a preferred embodiment of the present application, the step 306 may include the following sub-steps:

Sub-step 3061, obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;

Sub-step 3062, counting the total number of feature word segments in the text data of the webpage;

Sub-step 3063, according to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.

As an example of a specific application of the preferred embodiment of the present application, the sub-step 3063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:

In a specific implementation, it is further preferred that the step 306 further includes the following sub-steps:

Sub-step 3064, normalizing the weights of the feature word segments.

The corresponding weights of each feature segment obtained above can be used as a feature vector of a text. After obtaining the feature vector, it can be input into the classification model pre-generated according to the process shown in Figure 1, and the current feature vector can be obtained. Classification information, and finally the current identification The corresponding categorization information on the web page mark can be used.

It should be noted that, for the method embodiments, for the sake of simple description, they are all expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described action sequence, because In accordance with embodiments of the present application, certain steps may be performed in other sequences or concurrently. In the following, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required in the embodiments of the present application.

Referring to FIG. 4, a structural block diagram of an apparatus embodiment of a webpage text classification of the present application is shown, which may specifically include the following modules:

The collecting module 401 is configured to collect text data in the webpage;

a word segmentation module 402, configured to perform segmentation on the text data to obtain a basic participle;

The word segment attribute calculation module 403 is configured to calculate a first attribute value and a second attribute value of each base participle;

The feature value calculation module 404 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;

The feature extraction module 405 is configured to filter the feature word segmentation from the basic participle according to the feature value;

a feature weight assignment module 406, configured to calculate a corresponding weight of each feature participle;

The model training module 407 is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.

As an example of the specific application of the embodiment of the present application, the feature value calculation module 404 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:

In a preferred embodiment of the present application, the feature extraction module 405 may include the following sub-modules:

a sorting sub-module 4051, configured to rank the basic participle according to its corresponding feature value from high to low;

The extracting sub-module 4052 is configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.

In a preferred embodiment of the present application, the feature weight assignment module 406 can include the following sub-modules:

The number of statistics sub-module 4061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;

a segmentation total number statistics sub-module 4062, configured to count the total number of feature word segments in the text data of the webpage;

The calculation sub-module 4063 is configured to calculate, according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage, and calculate corresponding feature wordifiers the weight of.

As an example of the specific application of the embodiment of the present application, the calculation sub-module 4063 may be based on the feature value of the feature segmentation according to the following formula, and each feature segmentation is in the corresponding webpage. The number of occurrences in the text data, and the total number of feature word segments in the text data of the web page, the corresponding weights of each feature word segment are calculated:

In a preferred embodiment of the present application, the feature weight assignment module 406 may further include the following sub-modules:

The normalization sub-module 4064 is configured to normalize the weight of the feature word segmentation.

As an example of a specific application of the embodiment of the present application, the normalization sub-module 4064 may normalize the weight of the feature word segment by the following formula:

For the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

Referring to FIG. 5, a structural block diagram of an apparatus for recognizing a webpage text of the present application is shown. Specifically, the following modules may be included:

a text extraction module 501, configured to extract text data in a webpage to be identified;

a word segmentation module 502, configured to perform segmentation on the text data to obtain a basic participle;

The word segment attribute calculation module 503 is configured to calculate a first attribute value and a second attribute value of each base participle;

The feature value calculation module 504 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;

The feature extraction module 505 is configured to filter the feature word segmentation from the basic participle according to the feature value;

a feature weight assignment module 506, configured to calculate a corresponding weight of each feature participle;

a classification module 507, configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information;

The marking module 508 is configured to mark the classification information for the to-be-identified webpage.

As an example of the specific application of the embodiment of the present application, the feature value calculation module 504 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:

In a preferred embodiment of the present application, the feature extraction module 505 can include the following sub-modules:

a sorting sub-module 5051, configured to rank the basic participle according to its corresponding feature value from high to low;

An extraction sub-module 5052, configured to extract a preset number, where the feature value is higher than a preset threshold The basic participle of the value is used as the feature participle.

In a preferred embodiment of the present application, the feature weight assignment module 506 can include the following sub-modules:

The number of statistics sub-module 5061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;

a segmentation total number statistics sub-module 5062, configured to count the total number of feature word segments in the text data of the webpage;

The calculation sub-module 5063 is configured to calculate, according to the feature value of the feature word segment, the number of occurrences of each feature segmentation word in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, and calculate corresponding feature segmentation words correspondingly the weight of.

As an example of the specific application of the embodiment of the present application, the calculation sub-module 4063 may be based on the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the webpage. The total number of feature word segments in the text data, and the corresponding weights of each feature word segment are calculated:

In a preferred embodiment of the present application, the feature weight distribution module 506 may further include the following sub-modules:

The normalization sub-module 5064 is configured to normalize the weight of the feature word segmentation.

Each embodiment in the specification is mainly described as being different from the other embodiments, and the same similar parts between the respective embodiments may be referred to each other.

Those skilled in the art will appreciate that embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium. Computer readable media includes both permanent and non-persistent, removable and non-removable media. Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic cassette tape, magnetic tape storage or other magnetic storage device or Any other non-transportable medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-persistent computer readable media, such as modulated data signals and carrier waves.

Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowchart illustrations and/or FIG. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing terminal device to produce a machine such that instructions are executed by a processor of a computer or other programmable data processing terminal device Means are provided for implementing the functions specified in one or more of the flow or in one or more blocks of the flow chart.

The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device. The instruction device implements the functions specified in one or more blocks of the flowchart or in a flow or block of the flowchart.

These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, such that the computer or other programmable terminal device The instructions executed above provide steps for implementing the functions specified in one or more blocks of the flowchart or in a block or blocks of the flowchart.

While a preferred embodiment of the embodiments of the present application has been described, those skilled in the art can make further changes and modifications to the embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and All changes and modifications to the scope of the application.

Finally, it should also be noted that in this context, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply these entities. There is any such actual relationship or order between operations. Furthermore, the terms "comprises" or "comprising" or "comprising" or any other variations are intended to encompass a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a plurality of elements includes not only those elements but also Other elements that are included, or include elements inherent to such a process, method, article, or terminal device. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element, without further limitation.

The method for classifying webpage texts provided by the present application, a device for classifying webpage texts, a method for recognizing webpage texts, and a device for recognizing webpage texts are described in detail, and specific articles are applied in the text. The principles and implementations of the present application are described in the following examples. The description of the above embodiments is only for helping to understand the method of the present application and its core ideas. Meanwhile, for those skilled in the art, according to the idea of the present application, the specific implementation is implemented. There is a change in the scope of the application and the scope of the application. In summary, the content of the specification should not be construed as limiting the application.

Claims

A method for classifying webpage texts, comprising:

Collect text data in a web page;

Segmenting the text data to obtain a basic participle;

Calculating a first attribute value and a second attribute value of each base participle;

Calculating a feature value of each basic participle according to the first attribute value and the second attribute value;

Extracting feature word segments from the basic participle according to the feature value;

Calculate the corresponding weight of each feature participle;

The weight is used as the feature vector of the corresponding feature word segment, and the feature model is used to train the classification model.
The method according to claim 1, wherein said first attribute value is an information gain value of said base participle, and said second attribute value is a chi-square of said base participle relative to a predefined respective category The standard deviation of the statistic value, the eigenvalue being the degree of discrimination of the base participle.
The method according to claim 2, wherein the feature values of the basic participles are calculated according to the first attribute value and the second attribute value by the following formula:

Among them, score is the discrimination of the basic participle, igScore is the information gain of the basic participle The value, chiScore is the cardinal statistic value of the base word segment relative to the predefined individual categories, the n being the number of predefined categories.
The method according to claim 1 or 2 or 3, wherein the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:

Arranging the basic participle according to its corresponding feature value from high to low;

Extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
The method according to claim 1 or 2 or 3, wherein the step of calculating the corresponding weight of each feature word segment comprises:

Obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;

Counting the total number of feature word segments in the text data of the webpage;

According to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.
The method according to claim 5, wherein the number of occurrences of each feature segmentation in the text data of the corresponding webpage and the feature segmentation in the text data of the webpage are determined by the following formula according to the feature value of the feature word segmentation The total number of the corresponding weights of each feature segmentation is calculated:

Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
The method according to claim 1 or 2 or 3 or 6, wherein the step of calculating the corresponding weight of each feature word segment further comprises:

The weights of the feature word segments are normalized.
The method according to claim 7, wherein the weight of the feature word segmentation is normalized by the following formula:

Wherein, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data in the webpage, and max(weight) is the text data in the webpage. Medium maximum weight value.
A method for text recognition of webpages, comprising:

Extracting text data in the web page to be identified;

Segmenting the text data to obtain a basic participle;

Calculating a first attribute value and a second attribute value of each base participle;

Calculating a feature value of each basic participle according to the first attribute value and the second attribute value;

Extracting feature word segments from the basic participle according to the feature value;

Calculate the corresponding weight of each feature participle;

And inputting the weight as a feature vector into a pre-trained classification model to obtain classification information;

Marking the classification information for the to-be-identified web page.
The method according to claim 9, wherein said first attribute value is an information gain value of said base participle, and said second attribute value is a chi-square of said base participle relative to a predefined respective category The standard deviation of the statistic value, the eigenvalue being the degree of discrimination of the base participle.
The method according to claim 9 or 10, wherein the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:

Arranging the basic participle according to its corresponding feature value from high to low;

Extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
The method according to claim 9 or 10, wherein the step of calculating the corresponding weight of each feature word segment comprises:

Obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;

Counting the total number of feature word segments in the text data of the webpage;

Calculating according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding web page, and the total number of feature word segments in the text data of the web page. The corresponding weight of each feature participle.
The method according to claim 9 or 10 or 12, wherein the step of calculating the corresponding weight of each feature word segment further comprises:

The weights of the feature word segments are normalized.
An apparatus for classifying webpage text, comprising:

An acquisition module, configured to collect text data in a webpage;

a word segmentation module for segmenting the text data to obtain a basic participle;

a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each base participle;

An eigenvalue calculation module, configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value;

a feature extraction module, configured to filter feature segmentation words from the basic participle according to the feature value;

a feature weight allocation module, configured to calculate a corresponding weight of each feature word segmentation;

The model training module is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
The apparatus according to claim 14, wherein said first attribute value is an information gain value of said base participle, and said second attribute value is relative to said base participle The standard deviation of the chi-square statistic values of the predefined individual categories, the eigenvalues being the degree of discrimination of the base participle.
The apparatus according to claim 15, wherein the feature value calculation module calculates the feature values of the basic participle words according to the first attribute value and the second attribute value by using the following formula:

Where, score is the degree of discrimination of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base part-of-score pair with respect to the predefined respective categories, and n is the number of predefined classifications.
The device according to claim 14 or 15 or 16, wherein the feature extraction module comprises:

a sorting sub-module for arranging the basic participle according to its corresponding feature value from highest to lowest;

And an extraction sub-module, configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
The device according to claim 14 or 15 or 16, wherein the feature weight distribution module comprises:

The number statistics sub-module is used to obtain each feature participle in the text data of the corresponding webpage Current number of times;

a total number of word segmentation sub-modules for counting the total number of feature word segments in the text data of the webpage;

a calculation submodule, configured to calculate, according to the feature value of the feature word segmentation, the number of occurrences of each feature segmentation word in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, and calculate corresponding feature segmentation words Weights.
The device according to claim 18, wherein the calculation sub-module according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding webpage, and the webpage The total number of feature word segments in the text data, and the corresponding weights of each feature word segment are calculated:

Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
The device according to claim 14 or 15 or 16 or 19, wherein the feature weight distribution module further comprises:

The normalization submodule is configured to normalize the weight of the feature word segmentation.
The apparatus according to claim 20, wherein said normalization sub-module normalizes weights of said feature word segments by the following formula:

Wherein, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data in the webpage, and max(weight) is the text data in the webpage. Medium maximum weight value.
An apparatus for recognizing a webpage text, comprising:

a text extraction module, configured to extract text data in the webpage to be identified;

a word segmentation module for segmenting the text data to obtain a basic participle;

a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each base participle;

An eigenvalue calculation module, configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value;

a feature extraction module, configured to filter feature segmentation words from the basic participle according to the feature value;

a feature weight allocation module, configured to calculate a corresponding weight of each feature word segmentation;

a classification module, configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information;

a marking module, configured to mark classification information for the to-be-identified webpage.