WO2017167067A1 - Method and device for webpage text classification, method and device for webpage text recognition - Google Patents

Method and device for webpage text classification, method and device for webpage text recognition Download PDF

Info

Publication number
WO2017167067A1
WO2017167067A1 · PCT/CN2017/077489 · CN2017077489W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
participle
weight
value
text data
Prior art date
Application number
PCT/CN2017/077489
Other languages
French (fr)
Chinese (zh)
Inventor
段秉南
Original Assignee
阿里巴巴集团控股有限公司
段秉南
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 段秉南 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2017167067A1 publication Critical patent/WO2017167067A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web

Definitions

  • the present application relates to the technical field of text classification, and in particular to a method for classifying web page text, a device for classifying web page text, a method for recognizing web page text, and a device for recognizing web page text.
  • the webpage text classification refers to determining the category of a webpage, according to predefined topic categories, based on the content of massive webpage documents.
  • the technical basis for web page text categorization is content-based plain text categorization.
  • the basic method is to extract the content of the plain text of each webpage text in the captured webpage collection, and obtain the corresponding plain text.
  • the extracted plain text is then combined into a new document collection, and a plain text classification algorithm is applied to the new document collection for classification.
  • the webpage text is then classified according to the correspondence between the plain text and the webpage text; that is, the plain-text content information of the webpage is used to classify the webpage.
  • in view of the above problems, embodiments of the present application are proposed to provide a method for classifying webpage text, a method for recognizing webpage text, and a corresponding device for classifying webpage text and device for recognizing webpage text, which overcome the above problems or at least partially solve them.
  • the embodiment of the present application discloses a method for classifying webpage text, including:
  • the weight is used as the feature vector of the corresponding feature participle, and the feature vector is used to train the classification model.
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value is the degree of discrimination of the base participle.
  • the feature values of the base participles are calculated according to the first attribute value and the second attribute value by the following formula:
  • score = igScore × sqrt( (1/n) × Σ_{i=1..n} ( chiScore_i - mean(chiScore) )² )
  • where score is the degree of discrimination of the base participle, igScore is the information gain value of the base participle, chiScore_i is the chi-square statistic value of the base participle with respect to the i-th predefined category, and n is the number of predefined categories.
  • the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:
  • the step of calculating corresponding weights of each feature word segment includes:
  • according to the feature values of the feature participles, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage, the corresponding weight of each feature participle is calculated.
  • the corresponding weight of each feature participle is calculated by the following formula, according to the feature value of the feature participle, the number of times the feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the step of calculating a corresponding weight of each feature word segment further includes:
  • the weights of the feature word segments are normalized.
  • the weights of the feature word segmentation are normalized by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data of the webpage.
  • the embodiment of the present application further discloses a method for text recognition of a webpage, including:
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value is the degree of discrimination of the base participle.
  • the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:
  • the step of calculating corresponding weights of each feature word segment includes:
  • according to the feature values of the feature participles, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage, the corresponding weight of each feature participle is calculated.
  • the step of calculating a corresponding weight of each feature word segment further includes:
  • the weights of the feature word segments are normalized.
  • the embodiment of the present application further discloses an apparatus for classifying webpage texts, including:
  • An acquisition module configured to collect text data in a webpage
  • a word segmentation module for segmenting the text data to obtain a basic participle
  • a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each base participle
  • An eigenvalue calculation module configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value
  • a feature extraction module configured to filter feature segmentation words from the basic participle according to the feature value
  • a feature weight allocation module configured to calculate a corresponding weight of each feature word segmentation
  • the model training module is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value is the degree of discrimination of the base participle.
  • the feature value calculation module calculates the feature values of the basic participle words according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the feature extraction module comprises:
  • a sorting sub-module for arranging the basic participle according to its corresponding feature value from highest to lowest;
  • an extraction sub-module configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the feature weight allocation module comprises:
  • a number statistics sub-module configured to obtain the number of occurrences of each feature word segment in the text data of the corresponding webpage
  • a calculation submodule configured to calculate, according to the feature value of the feature word segmentation, the number of occurrences of each feature segmentation word in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, and calculate corresponding feature segmentation words Weights.
  • the calculation sub-module calculates the corresponding weight of each feature participle by the following formula, according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the feature weight distribution module further includes:
  • the normalization submodule is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module normalizes the weight of the feature word segment by the following formula:
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • Max(weight) is the maximum weight value in the text data in the webpage.
  • the embodiment of the present application further discloses an apparatus for text recognition of a webpage, including:
  • a text extraction module configured to extract text data in the webpage to be identified
  • a word segmentation module for segmenting the text data to obtain a basic participle
  • a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each base participle
  • An eigenvalue calculation module configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value
  • a feature extraction module configured to filter feature segmentation words from the basic participle according to the feature value
  • a feature weight allocation module configured to calculate a corresponding weight of each feature word segmentation
  • a classification module configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information
  • a marking module configured to mark classification information for the to-be-identified webpage.
  • the embodiment of the present application improves the objectivity and accuracy of feature extraction by improving the method of extracting feature participles and the method of calculating feature-participle weights, and also takes into account the influence of the features on classification, thereby improving the accuracy of webpage text classification and making it more convenient for users to obtain valid information from massive text in a timely and accurate manner.
  • the embodiment of the present application combines at least two feature extraction algorithms and introduces a standard deviation into the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction. Moreover, by using the long-tail distribution map to select the number of features and by weighting the feature participles according to their classification effect, the effective features can be further screened out, so that the webpage text classification is more accurate.
  • FIG. 1 is a flow chart showing the steps of a method for classifying web page text according to the present application
  • FIG. 2 is a schematic diagram of a long tail distribution in an example of the present application.
  • FIG. 3 is a flow chart of steps of text recognition of a webpage according to the present application.
  • FIG. 4 is a structural block diagram of an apparatus for classifying web page text according to the present application.
  • FIG. 5 is a structural block diagram of an apparatus for text recognition of a webpage according to the present application.
  • Text categorization is to obtain a mapping rule between a category and an unknown text by training a certain set of texts, that is, calculating the relevance of the text and the category, and then determining the category attribution of the text according to the trained classifier.
  • Text categorization is a guided learning process. It finds a relational model (a classifier) between text attributes (features) and text categories based on a set of training texts that have already been annotated, and then uses the relational model obtained by this learning to judge the category of new text.
  • the process of text categorization can be divided into two parts: training and classification.
  • the purpose of training is to construct a classification model for the classification by linking the new text and categories.
  • the classification process is a process of classifying unknown texts based on training results, giving a category identification.
  • FIG. 1 a flow chart of steps of a method for classifying web page texts according to the present application is shown. Specifically, the method may include the following steps:
  • Step 101 Collect text data in a webpage
  • This step obtains the text data of the webpage used for the training of the classification model.
  • it may be massive data.
  • the usual processing method is to extract the plain-text content of each webpage text in the captured webpage collection, thereby obtaining the corresponding plain text, and then to combine the extracted plain text into a new document collection; this document collection is the webpage text data referred to in this application.
  • Step 102 Perform word segmentation on the text data to obtain a basic participle
  • English is based on words, and words are separated by spaces, whereas Chinese is written as a sequence of characters, and all the characters of a sentence combine to express its meaning. For example, the English sentence "I am a student" is, in Chinese, "我是一个学生". From the spaces, a computer can easily tell that "student" is one word, but it is not as easy for it to understand that the two characters "学" and "生" together form one word.
  • dividing a Chinese character sequence into meaningful words is Chinese word segmentation. For example, "I am a student" is segmented into: I / am / a / student.
  • Word segmentation based on string matching refers to matching the Chinese character string to be analyzed with a term in a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the matching is successful ( Identify a word).
  • the actual word segmentation system uses mechanical segmentation as a preliminary method, and further improves the accuracy of segmentation by using various other language information.
  • the word segmentation method based on feature scanning or mark segmentation refers to preferentially identifying and segmenting words with obvious features in the string to be analyzed; using these words as breakpoints, the original string is divided into smaller strings that then undergo mechanical segmentation, which reduces the matching error rate. Alternatively, word segmentation is combined with part-of-speech tagging, so that the rich part-of-speech information helps the segmentation decisions, while in turn the segmentation results are checked and adjusted during the tagging process, thereby improving the accuracy of the segmentation.
  • the word segmentation method based on understanding refers to the effect of identifying words by letting the computer simulate the understanding of the sentence.
  • the basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and use syntactic information and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain the syntactic and semantic information about words, sentences, etc. to judge the participle ambiguity. That is, it simulates the process of understanding people's sentences. This method of word segmentation requires a large amount of linguistic knowledge and information.
  • the statistics-based word segmentation method relies on the fact that the frequency or probability of characters co-occurring adjacently in Chinese text reflects how credible it is that they form a word. The frequency of each character combination co-occurring in the corpus can therefore be counted and the mutual information calculated for the adjacent co-occurrence of two Chinese characters X and Y; the mutual information reflects the closeness of the relationship between the characters. When this closeness is above a certain threshold, the character group may be considered to constitute a word. This method only needs to count the frequency of character groups in the corpus and does not require a segmentation dictionary.
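  • By way of illustration only (this sketch is not part of the original disclosure), the statistics-based idea above can be outlined in Python as follows: adjacent character co-occurrences are counted over a small corpus, a mutual-information style closeness score is computed for each character pair, and pairs above a chosen threshold are treated as candidate words. The toy corpus, the threshold and the exact scoring details are all illustrative assumptions.

```python
import math
from collections import Counter

def candidate_words(corpus, threshold=1.0):
    """Score adjacent character pairs by a mutual-information style closeness measure;
    pairs whose score exceeds `threshold` are treated as candidate words."""
    char_counts, pair_counts = Counter(), Counter()
    total_chars = 0
    for sentence in corpus:
        chars = list(sentence)
        char_counts.update(chars)
        total_chars += len(chars)
        pair_counts.update(zip(chars, chars[1:]))  # adjacent character pairs X, Y

    total_pairs = sum(pair_counts.values()) or 1
    candidates = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        closeness = math.log(p_xy / (p_x * p_y))  # reflects how tightly X and Y are bound
        if closeness > threshold:
            candidates[x + y] = round(closeness, 3)
    return candidates

print(candidate_words(["我是一个学生", "学生在学校学习", "我在学校"]))
```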
  • the manner in which the text data is segmented by the present application is not limited, and the word segmentation is performed on the document set, and all the word segments obtained are the basic participles referred to in the present application.
  • the removal process may also be performed in advance for the invalid words in the basic participle, for example, for the stop words.
  • Stop words usually refer to high-frequency words that appear frequently in all types of text, such as pronouns, prepositions and conjunctions, and are therefore considered to carry little information that helps classification.
  • Those skilled in the art can also design feature words that need to be deleted before or during feature extraction according to requirements, which need not be limited in this application.
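  • For concreteness, a minimal preprocessing sketch is given below. It assumes the open-source jieba tokenizer as the word segmenter and a tiny stop-word list; both are illustrative choices, since the application does not mandate any particular segmentation tool or stop-word set.

```python
import jieba  # one possible Chinese word segmenter; any segmenter could be substituted

STOP_WORDS = {"的", "了", "是", "在", "和"}  # illustrative stop-word list

def to_base_participles(text):
    """Segment webpage text and drop stop words, yielding the base participles."""
    tokens = jieba.lcut(text)
    return [tok for tok in tokens if tok.strip() and tok not in STOP_WORDS]

print(to_base_participles("我是一个学生，在学校学习文本分类。"))
```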
  • Step 103 Calculate a first attribute value and a second attribute value of each basic participle
  • Step 104 Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
  • Step 105 Filter feature tokens from the basic participle according to the feature value
  • the above steps 103-105 relate to the processing of feature selection in text categorization.
  • the original feature space dimension is very high, and there are a lot of redundant features, so feature dimension reduction is needed.
  • Feature selection is one of the methods of feature dimension reduction. Its basic idea is to score each original feature item independently according to a certain evaluation function, sort the items by score, and select the several feature items with the highest scores; alternatively, a threshold is set in advance, the feature items whose metric value falls below the threshold are filtered out, and the remaining candidate features are used as the resulting feature subset.
  • the feature selection algorithm includes algorithms such as document frequency, mutual information amount, information gain, and the χ2 statistic (CHI).
  • those skilled in the art usually select one of them to select the feature word segmentation.
  • however, using such a single algorithm has many drawbacks. Taking the information gain algorithm as an example, information gain infers the amount of information carried by a participle from the change in the amount of information before and after the participle appears in the text; that is, the information gain value of a participle indicates the amount of information contained in the participle feature.
  • a participle feature with a large information gain can give the classifier a larger amount of information, but the existing information gain algorithm only considers the amount of information the participle feature provides to the classifier as a whole, ignoring how well the feature discriminates between the individual categories.
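  • The application does not spell out its information-gain estimator; the sketch below uses the textbook definition commonly applied in text feature selection (entropy of the category distribution minus the conditional entropy given whether the participle appears), with per-category document counts that are assumed purely for illustration.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs_with_term, docs_without_term):
    """IG(t) = H(C) - P(t) * H(C|t) - P(not t) * H(C|not t).

    Both arguments map category -> document count, e.g. {"news": 30, "sports": 5}.
    """
    n_t = sum(docs_with_term.values())
    n_not = sum(docs_without_term.values())
    n = n_t + n_not
    categories = list(set(docs_with_term) | set(docs_without_term))

    p_class = [(docs_with_term.get(c, 0) + docs_without_term.get(c, 0)) / n for c in categories]
    p_class_t = [docs_with_term.get(c, 0) / n_t for c in categories] if n_t else []
    p_class_not = [docs_without_term.get(c, 0) / n_not for c in categories] if n_not else []

    return entropy(p_class) - (n_t / n) * entropy(p_class_t) - (n_not / n) * entropy(p_class_not)

# toy counts: documents of each category that do / do not contain the participle
print(round(information_gain({"news": 30, "sports": 5}, {"news": 20, "sports": 45}), 4))
```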
  • the chi-square statistic is also used to characterize the correlation between two variables; it considers both the case in which the feature appears in a certain category of text and the case in which it does not. The larger the chi-square statistic, the more relevant the feature is to the category and the more category information it carries, but the existing χ2 statistic (CHI) algorithm over-exaggerates the role of low-frequency words.
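  • Likewise, the exact chi-square variant used is not reproduced in this text; the sketch below uses the standard 2x2 contingency form of the χ2 statistic that is widely used for text feature selection, again with fabricated counts.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a participle with respect to one category.

    a: documents of the category that contain the participle
    b: documents outside the category that contain the participle
    c: documents of the category that do not contain the participle
    d: documents outside the category that do not contain the participle
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# chiScore of one participle against each predefined category (toy numbers)
chi_scores = [chi_square(40, 10, 60, 190), chi_square(5, 45, 95, 155)]
print([round(s, 2) for s in chi_scores])
```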
  • in view of this, the present application proposes not to use a single algorithm but to use at least two algorithms for feature extraction; that is, two different algorithms are used to calculate the first attribute value and the second attribute value of each base participle, for example, the information gain algorithm is used to calculate the first attribute value and the CHI algorithm is used to calculate the second attribute value.
  • in a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value may be the degree of discrimination of the base participle; that is, the step 103 may specifically include the following sub-steps:
  • Sub-step 1031 calculating an information gain value of each basic participle
  • Sub-step 1032 calculating a chi-square statistic value of each basic participle
  • Sub-step 1033, for each base participle, computing the standard deviation of its chi-square statistic values relative to the predefined categories.
  • the step 104 may be: obtaining the discrimination degree of each basic participle based on the product of the information gain value and the standard deviation.
  • the feature value of each base participle may be calculated according to the first attribute value and the second attribute value by the following formula:
  • score = igScore × sqrt( (1/n) × Σ_{i=1..n} ( chiScore_i - mean(chiScore) )² )
  • where score is the degree of discrimination of the base participle, igScore is the information gain value of the base participle, chiScore_i is the chi-square statistic value of the base participle with respect to the i-th predefined category, and n is the number of predefined categories.
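  • A minimal sketch of the discrimination-degree computation of step 104 follows: the information gain value of a base participle is multiplied by the standard deviation of its per-category chi-square values. The population (divide-by-n) standard deviation is an assumption, since the exact convention is not reproduced in this text.

```python
import math

def discrimination_degree(ig_score, chi_scores):
    """score = igScore x standard deviation of the participle's chi-square values over the n categories."""
    n = len(chi_scores)
    mean_chi = sum(chi_scores) / n
    std_chi = math.sqrt(sum((c - mean_chi) ** 2 for c in chi_scores) / n)  # assumed population std-dev
    return ig_score * std_chi

print(round(discrimination_degree(0.35, [12.4, 0.8, 3.1]), 4))
```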
  • the application combines at least two feature extraction algorithms and introduces a standard deviation in the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction.
  • the step 105 may specifically include the following sub-steps:
  • Sub-step 1051 the basic participle is arranged according to its corresponding feature value from high to low;
  • Sub-step 1052 extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • in practice, referring to the long-tail distribution of the discrimination degrees shown in FIG. 2, the base participles whose abscissa is, for example, greater than 0 and less than 30,000 may be taken as the feature participles.
  • the present application can further screen out the effective features, thereby making the effect of web page text classification more accurate.
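  • Sub-steps 1051-1052 can be sketched as below; the threshold and the cap on the number of selected features are illustrative parameters that would in practice be read off the long-tail distribution of the discrimination degrees (FIG. 2).

```python
def select_features(scores, threshold=0.0, max_features=30000):
    """scores maps base participle -> discrimination degree (feature value)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # high to low
    return [word for word, score in ranked[:max_features] if score > threshold]

print(select_features({"股票": 4.2, "的": 0.01, "球赛": 3.7, "今天": 0.2}, threshold=0.5))
```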
  • Step 106 Calculate corresponding weights of each feature participle
  • each feature participle is given a weight indicating the importance of the feature participle in the text.
  • the weights are generally calculated based on the frequency of the feature items, using calculation methods such as the Boolean weighting method, the term-frequency weighting method, the TF/IDF weighting method, the TFC weighting method, and so on.
  • in the TF/IDF weighting method, for example, TF indicates the number of occurrences of the feature in a single text and IDF reflects the distribution of the feature over the entire corpus, so the influence of the feature on classification is completely ignored.
  • the present application proposes a preferred embodiment for calculating weights.
  • the step 106 may include the following sub-steps:
  • Sub-step 1061 obtaining the number of times each feature participle appears in the text data of the corresponding webpage
  • Sub-step 1062 counting the total number of feature word segments in the text data of the webpage
  • Sub-step 1063 according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding web page, and the total number of feature word segments in the text data of the web page, the corresponding weights of each feature word segment are calculated.
  • the sub-step 1063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
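  • The exact weighting formula appears in the original only as an embedded image; as one plausible reading of sub-step 1063 (term frequency normalized by the total number of feature participles in the page, scaled by the discrimination degree), the following hypothetical sketch is offered.

```python
def feature_weight(tf, n, score):
    """Assumed form: normalized term frequency scaled by the discrimination degree.

    tf:    occurrences of the feature participle in the page's text data
    n:     total number of feature participles in the page's text data
    score: discrimination degree of the feature participle
    """
    return (tf / n) * score if n else 0.0

print(feature_weight(tf=6, n=120, score=3.7))
```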
  • preferably, the step 106 may further include the following sub-step:
  • Sub-step 1064 normalizing the weights of the feature word segments.
  • the weight of the feature participle can be normalized by the following formula:
  • norm(weight) = ( weight - min(weight) ) / ( max(weight) - min(weight) )
  • where norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
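  • Sub-step 1064 corresponds to an ordinary min-max normalization over the weights of a single webpage, as sketched below (the degenerate all-equal case is handled by returning 0.0, an implementation choice not specified in the original).

```python
def normalize_weights(weights):
    """Min-max normalize the feature-participle weights of one webpage."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    return {w: (v - lo) / span if span else 0.0 for w, v in weights.items()}

print(normalize_weights({"股票": 0.185, "球赛": 0.02, "上涨": 0.09}))
```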
  • weights used in the examples of the present application take into account the influence of features on the classification, and thus can further improve the effectiveness of feature selection.
  • the corresponding weights of the feature participles calculated above can be used as the feature vector of a text; after the feature vector is obtained, a text classification algorithm can be selected to train the classification model.
  • Step 107 The weight is used as a feature vector of the corresponding feature participle, and the feature vector is used to train the classification model.
  • the text classification algorithm may be, for example, a Bayesian probability algorithm (Naive Bayes), a support vector machine, a KNN (k-nearest neighbor) algorithm, and the like.
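  • As one concrete and purely illustrative choice, the feature vectors could be fed to a scikit-learn classifier; the MultinomialNB estimator below merely stands in for whichever algorithm (naive Bayes, SVM, KNN, etc.) is actually selected, and the tiny training matrix is fabricated for the example.

```python
from sklearn.naive_bayes import MultinomialNB

# each row is one page's normalized feature-weight vector over the selected feature participles
X_train = [
    [0.90, 0.00, 0.10],  # e.g. a finance-related page
    [0.00, 0.80, 0.20],  # e.g. a sports-related page
    [0.70, 0.10, 0.00],
]
y_train = ["finance", "sports", "finance"]

model = MultinomialNB()
model.fit(X_train, y_train)            # step 107: train the classification model
print(model.predict([[0.60, 0.05, 0.10]]))
```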
  • the embodiment of the present application improves the objectivity and accuracy of feature extraction by improving the method of extracting feature participles and the method of calculating feature-participle weights, and also takes into account the influence of the features on classification, thereby improving the accuracy of webpage text classification and making it more convenient for users to obtain valid information from massive text in a timely and accurate manner.
  • FIG. 3 a flowchart of an embodiment of a method for text recognition of a webpage according to the present application is shown. Specifically, the method may include the following steps:
  • Step 301 Extract text data in the webpage to be identified
  • Step 302 performing segmentation on the text data to obtain a basic participle
  • Step 303 Calculate a first attribute value and a second attribute value of each basic participle
  • Step 304 Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
  • Step 305 Filter feature participles from the basic participle according to the feature value
  • Step 306 calculating corresponding weights of each feature participle
  • Step 307 Enter the weight as a feature vector into a pre-trained classification model to obtain classification information.
  • Step 308 Mark classification information for the to-be-identified webpage.
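  • Putting steps 301-308 together, a hypothetical end-to-end recognition helper might look as follows; the `segment` callable, the simplified per-page weighting and the `model` object are assumptions standing in for the components described above, not part of the original disclosure.

```python
def classify_webpage(page_text, feature_words, segment, model):
    """Steps 301-308: build the page's weight vector and query a pre-trained classification model.

    segment: callable returning the page's base participles (e.g. a jieba-based tokenizer)
    model:   trained classifier exposing a scikit-learn style predict() method
    """
    tokens = segment(page_text)                           # steps 301-302: extract and segment text
    counts = {w: tokens.count(w) for w in feature_words}
    n = sum(counts.values()) or 1                         # avoid division by zero on empty pages
    vector = [counts[w] / n for w in feature_words]       # simplified stand-in for steps 303-306
    category = model.predict([vector])[0]                 # step 307: obtain classification information
    return {"category": category}                         # step 308: information used to tag the page
```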
  • in a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value may be the degree of discrimination of the base participle.
  • the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the step 305 may include the following sub-steps:
  • Sub-step 3051 the basic participle is arranged according to its corresponding feature value from high to low;
  • Sub-step 3052 extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the step 306 may include the following sub-steps:
  • Sub-step 3061 obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage
  • Sub-step 3062 counting the total number of feature word segments in the text data of the webpage
  • Sub-step 3063 according to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.
  • the sub-step 3063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • step 306 further includes the following sub-steps:
  • Sub-step 3064 normalizing the weights of the feature word segments.
  • the weight of the feature word segmentation can be normalized by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data of the webpage.
  • the corresponding weights of the feature participles obtained above can be used as the feature vector of a text; after the feature vector is obtained, it can be input into the classification model pre-generated according to the process shown in FIG. 1 to obtain the classification information of the current feature vector, and finally the corresponding classification information is marked on the webpage currently being identified.
  • FIG. 4 a structural block diagram of an apparatus embodiment of a webpage text classification of the present application is shown, which may specifically include the following modules:
  • the collecting module 401 is configured to collect text data in the webpage
  • a word segmentation module 402 configured to perform segmentation on the text data to obtain a basic participle
  • the word segment attribute calculation module 403 is configured to calculate a first attribute value and a second attribute value of each base participle;
  • the feature value calculation module 404 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
  • the feature extraction module 405 is configured to filter the feature word segmentation from the basic participle according to the feature value
  • a feature weight assignment module 406, configured to calculate a corresponding weight of each feature participle
  • the model training module 407 is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
  • in a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value may be the degree of discrimination of the base participle.
  • the feature value calculation module 404 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the feature extraction module 405 may include the following sub-modules:
  • a sorting sub-module 4051 configured to rank the basic participle according to its corresponding feature value from high to low;
  • the extracting sub-module 4052 is configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the feature weight assignment module 406 can include the following sub-modules:
  • the number of statistics sub-module 4061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
  • a segmentation total number statistics sub-module 4062 configured to count the total number of feature word segments in the text data of the webpage
  • the calculation sub-module 4063 is configured to calculate the corresponding weight of each feature participle according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
  • the calculation sub-module 4063 may calculate the corresponding weight of each feature participle by the following formula, according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the feature weight assignment module 406 may further include the following sub-modules:
  • the normalization sub-module 4064 is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module 4064 may normalize the weight of the feature word segment by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data of the webpage.
  • for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant parts reference may be made to the description of the method embodiment.
  • FIG. 5 a structural block diagram of an apparatus for recognizing a webpage text of the present application is shown. Specifically, the following modules may be included:
  • a text extraction module 501 configured to extract text data in a webpage to be identified
  • a word segmentation module 502 configured to perform segmentation on the text data to obtain a basic participle
  • the word segment attribute calculation module 503 is configured to calculate a first attribute value and a second attribute value of each base participle;
  • the feature value calculation module 504 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
  • the feature extraction module 505 is configured to filter the feature word segmentation from the basic participle according to the feature value
  • a feature weight assignment module 506, configured to calculate a corresponding weight of each feature participle
  • a classification module 507 configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information
  • the marking module 508 is configured to mark the classification information for the to-be-identified webpage.
  • the first attribute value may be an information gain value of the basic participle
  • the second attribute value may be a chi-square part of the basic participle relative to a predefined each category.
  • the standard deviation of the statistic value which may be the degree of discrimination of the base participle.
  • the feature value calculation module 504 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the feature extraction module 505 can include the following sub-modules:
  • a sorting sub-module 5051 configured to rank the basic participle according to its corresponding feature value from high to low;
  • an extraction sub-module 5052, configured to extract a preset number of base participles whose feature values are higher than a preset threshold as the feature participles.
  • the feature weight assignment module 506 can include the following sub-modules:
  • the number of statistics sub-module 5061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
  • a segmentation total number statistics sub-module 5062 configured to count the total number of feature word segments in the text data of the webpage
  • the calculation sub-module 5063 is configured to calculate the corresponding weight of each feature participle according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
  • the calculation sub-module 5063 may calculate the corresponding weight of each feature participle by the following formula, according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the feature weight distribution module 506 may further include the following sub-modules:
  • the normalization sub-module 5064 is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module 5064 may normalize the weight of the feature participle by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data of the webpage.
  • for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant parts reference may be made to the description of the method embodiment.
  • embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic cassette tape, magnetic tape storage or other magnetic storage device or Any other non-transportable medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media, as defined herein, does not include transitory computer-readable media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • the method for classifying webpage text, the device for classifying webpage text, the method for recognizing webpage text, and the device for recognizing webpage text provided by the present application have been described in detail above. The principles and implementations of the present application are described herein using specific examples; the description of the above embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Abstract

A method and device for webpage text classification, and a method and device for webpage text recognition. The method for webpage text classification comprises: collecting text data from a webpage (101); segmenting the text data to obtain basic text segments (102); calculating a first attribute value and a second attribute value of each of the basic text segments (103); calculating a characteristic value of each of the basic text segments according to the first attribute value and the second attribute value (104); screening and selecting characteristic text segments from the basic text segments according to the characteristic value (105); calculating a weight corresponding to each of the characteristic text segments (106); treating the weight as a characteristic vector corresponding to the characteristic text segments, and utilizing the characteristic vector to train a classification model (107). The method and device of the present invention effectively ensure objectivity and accuracy in extracting a characteristic, and also take into account the influence of a characteristic on classification, thereby increasing the accuracy of webpage text classification, and further facilitating a user to accurately and timely obtain effective information from a massive amount of text.

Description

Method and device for classifying webpage text, method and device for recognizing webpage text
Technical Field
The present application relates to the technical field of text classification, and in particular to a method for classifying webpage text, a device for classifying webpage text, a method for recognizing webpage text, and a device for recognizing webpage text.
Background
In today's information society, information in all its forms has greatly enriched people's lives. In particular, with the large-scale popularization of the Internet, the amount of information on the network is growing rapidly; electronic documents, e-mails and webpages of all kinds fill the network, resulting in information clutter. In order to find the information we need quickly, accurately and comprehensively, text classification has become an important way to effectively organize and manage text data, and it is receiving more and more attention.
Webpage text classification refers to determining the category of a webpage, according to predefined topic categories, based on the content of massive webpage documents. The technical basis of webpage text classification is content-based plain-text classification. The basic method is to extract the plain-text content of each webpage text in the captured webpage collection to obtain the corresponding plain text, combine the extracted plain text into a new document collection, and apply a plain-text classification algorithm to the new document collection for classification. The webpage text is then classified according to the correspondence between the plain text and the webpage text; that is, the plain-text content information of the webpage is used to classify the webpage.
Due to the ambiguity, vagueness and heterogeneity of massive text, the selection of classification features in the prior art is unsatisfactory; for example, the role of some invalid words is often exaggerated, or important attributes of some feature participles are ignored, resulting in extremely low accuracy of webpage text classification.
Summary of the Invention
In view of the above problems, embodiments of the present application are proposed to provide a method for classifying webpage text, a method for recognizing webpage text, and a corresponding device for classifying webpage text and device for recognizing webpage text, which overcome the above problems or at least partially solve them.
In order to solve the above problems, an embodiment of the present application discloses a method for classifying webpage text, including:
collecting text data in a webpage;
segmenting the text data to obtain base participles;
calculating a first attribute value and a second attribute value of each base participle;
calculating a feature value of each base participle according to the first attribute value and the second attribute value;
filtering feature participles from the base participles according to the feature values;
calculating a corresponding weight of each feature participle;
using the weights as the feature vectors of the corresponding feature participles, and training a classification model with the feature vectors.
Preferably, the first attribute value is the information gain value of the base participle, the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to the predefined categories, and the feature value is the degree of discrimination of the base participle.
Preferably, the feature value of each base participle is calculated according to the first attribute value and the second attribute value by the following formula:
score = igScore × sqrt( (1/n) × Σ_{i=1..n} ( chiScore_i - mean(chiScore) )² )
where score is the degree of discrimination of the base participle, igScore is the information gain value of the base participle, chiScore_i is the chi-square statistic value of the base participle with respect to the i-th predefined category, and n is the number of predefined categories.
Preferably, the step of filtering feature participles from the base participles according to the feature values includes:
arranging the base participles from high to low according to their corresponding feature values;
extracting a preset number of base participles whose feature values are higher than a preset threshold as the feature participles.
Preferably, the step of calculating the corresponding weight of each feature participle includes:
obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;
counting the total number of feature participles in the text data of the webpage;
calculating the corresponding weight of each feature participle according to the feature value of the feature participle, the number of occurrences of each feature participle in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
Preferably, the corresponding weight of each feature participle is calculated by the following formula, according to the feature value of the feature participle, the number of times the feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
weight = ( tf / n ) × score
where weight is the weight of the feature participle, tf is the number of times the feature participle appears in the text data of the corresponding webpage, n is the total number of feature participles in the text data of the webpage, and score is the degree of discrimination of the feature participle.
Preferably, the step of calculating the corresponding weight of each feature participle further includes:
normalizing the weights of the feature participles.
Preferably, the weights of the feature participles are normalized by the following formula:
norm(weight) = ( weight - min(weight) ) / ( max(weight) - min(weight) )
where norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
An embodiment of the present application further discloses a method for recognizing webpage text, including:
extracting text data in a webpage to be identified;
segmenting the text data to obtain base participles;
calculating a first attribute value and a second attribute value of each base participle;
calculating a feature value of each base participle according to the first attribute value and the second attribute value;
filtering feature participles from the base participles according to the feature values;
calculating a corresponding weight of each feature participle;
inputting the weights as a feature vector into a pre-trained classification model to obtain classification information;
marking the classification information for the webpage to be identified.
优选地,所述第一属性值为所述基础分词的信息增益值,所述第二属性值为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差,所述特征值为所述基础分词的区分度。Preferably, the first attribute value is an information gain value of the base participle, and the second attribute value is a standard deviation of the base participle relative to a pre-defined chi-square statistic value of each category, the feature The value is the degree of discrimination of the basic participle.
优选地,所述依据所述特征值从所述基础分词中筛选出特征分词的步骤包括:Preferably, the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:
将所述基础分词按照其对应的特征值由高至低排列;Arranging the basic participle according to its corresponding feature value from high to low;
提取预设数量的,所述特征值高于预设阈值的基础分词作为特征分词。Extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
优选地,所述计算各特征分词相应的权重的步骤包括:Preferably, the step of calculating corresponding weights of each feature word segment includes:
获取各特征分词在相应网页的文本数据中出现的次数;Obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;
统计所述网页的文本数据中特征分词的总数;Counting the total number of feature word segments in the text data of the webpage;
依据所述特征分词的特征值,各特征分词在相应网页的文本数据中出现的次数,以及,所述网页的文本数据中特征分词的总数,计算得到各特征分词相应的权重。According to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.
优选地,所述计算各特征分词相应的权重的步骤还包括:Preferably, the step of calculating a corresponding weight of each feature word segment further includes:
对所述特征分词的权重进行归一化处理。 The weights of the feature word segments are normalized.
本申请实施例还公开了一种网页文本分类的装置,包括:The embodiment of the present application further discloses an apparatus for classifying webpage texts, including:
采集模块,用于采集网页中的文本数据;An acquisition module, configured to collect text data in a webpage;
分词模块,用于对所述文本数据进行分词,获得基础分词;a word segmentation module for segmenting the text data to obtain a basic participle;
分词属性计算模块,用于计算各基础分词的第一属性值和第二属性值;a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each base participle;
特征值计算模块,用于依据所述第一属性值和第二属性值计算各基础分词的特征值;An eigenvalue calculation module, configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
特征提取模块,用于依据所述特征值从所述基础分词中筛选出特征分词;a feature extraction module, configured to filter feature segmentation words from the basic participle according to the feature value;
特征权重分配模块,用于计算各特征分词相应的权重;a feature weight allocation module, configured to calculate a corresponding weight of each feature word segmentation;
模型训练模块,用于将所述权重作为相应特征分词的特征向量,采用所述特征向量训练出分类模型。The model training module is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
优选地，所述第一属性值为所述基础分词的信息增益值，所述第二属性值为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差，所述特征值为所述基础分词的区分度。Preferably, the first attribute value is the information gain value of the base participle, the second attribute value is the standard deviation of the chi-square statistic values of the base participle with respect to the predefined categories, and the feature value is the discrimination degree of the base participle.
优选地,所述特征值计算模块通过如下公式依据所述第一属性值和第二属性值计算各基础分词的特征值:Preferably, the feature value calculation module calculates the feature values of the basic participle words according to the first attribute value and the second attribute value by using the following formula:
score = igScore × std(chiScore_1, ..., chiScore_n)
其中，score为基础分词的区分度，igScore为基础分词的信息增益值，chiScore为基础分词相对于预定义的各个分类的卡方统计量值，所述n为预定义的分类的数量。Here, score is the discrimination degree of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories, and n is the number of predefined categories.
优选地,所述特征提取模块包括:Preferably, the feature extraction module comprises:
排序子模块,用于将所述基础分词按照其对应的特征值由高至低排列; a sorting sub-module for arranging the basic participle according to its corresponding feature value from highest to lowest;
提取子模块,用于提取预设数量的,所述特征值高于预设阈值的基础分词作为特征分词。And an extraction sub-module, configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
优选地,所述特征权重分配模块包括:Preferably, the feature weight allocation module comprises:
次数统计子模块,用于获取各特征分词在相应网页的文本数据中出现的次数;a number statistics sub-module, configured to obtain the number of occurrences of each feature word segment in the text data of the corresponding webpage;
分词总数统计子模块,用于统计所述网页的文本数据中特征分词的总数;a total number of word segmentation sub-modules for counting the total number of feature word segments in the text data of the webpage;
计算子模块，用于依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重。a calculation sub-module, configured to calculate the corresponding weight of each feature participle from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
优选地，所述计算子模块通过如下公式依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重：Preferably, the calculation sub-module calculates the corresponding weight of each feature participle, by the following formula, from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
weight = (tf / n) × score
其中,weight为特征分词的权重,tf为特征分词在相应网页的文本数据中出现的次数,n为网页的文本数据中特征分词的总数,score为特征分词的区分度。Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
优选地,所述特征权重分配模块还包括:Preferably, the feature weight distribution module further includes:
归一化子模块,用于对所述特征分词的权重进行归一化处理。The normalization submodule is configured to normalize the weight of the feature word segmentation.
优选地,所述归一化子模块通过以下公式对所述特征分词的权重进行归一化处理:Preferably, the normalization sub-module normalizes the weight of the feature word segment by the following formula:
norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))
其中,norm(weight)为归一化之后的权重,weight为所述特征分词的权重,min(weight)为所述网页中文本数据中最小weight值, max(weight)为所述网页中文本数据中最大weight值。Where norm(weight) is the weight after normalization, weight is the weight of the feature participle, and min(weight) is the minimum weight value in the text data in the webpage. Max(weight) is the maximum weight value in the text data in the webpage.
本申请实施例还公开了一种网页文本识别的装置,包括:The embodiment of the present application further discloses an apparatus for text recognition of a webpage, including:
文本提取模块,用于提取待识别网页中的文本数据;a text extraction module, configured to extract text data in the webpage to be identified;
分词模块,用于对所述文本数据进行分词,获得基础分词;a word segmentation module for segmenting the text data to obtain a basic participle;
分词属性计算模块,用于计算各基础分词的第一属性值和第二属性值;a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each base participle;
特征值计算模块,用于依据所述第一属性值和第二属性值计算各基础分词的特征值;An eigenvalue calculation module, configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
特征提取模块,用于依据所述特征值从所述基础分词中筛选出特征分词;a feature extraction module, configured to filter feature segmentation words from the basic participle according to the feature value;
特征权重分配模块,用于计算各特征分词相应的权重;a feature weight allocation module, configured to calculate a corresponding weight of each feature word segmentation;
分类模块,用于将所述权重作为特征向量输入预先训练出的分类模型中,获得分类信息;a classification module, configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information;
标记模块,用于针对所述待识别网页标记分类信息。a marking module, configured to mark classification information for the to-be-identified webpage.
本申请实施例包括以下优点:Embodiments of the present application include the following advantages:
本申请实施例通过改进特征分词的提取方式，以及，特征分词权重的计算方式，不仅有效保证了特征提取的客观性与准确性，还兼顾了特征对分类影响，从而提高了网页文本分类的准确性，更方便于用户在海量的文本中及时准确地获得有效的信息。By improving the way feature participles are extracted and the way feature participle weights are calculated, the embodiments of the present application not only effectively ensure the objectivity and accuracy of feature extraction, but also take into account the influence of features on classification, thereby improving the accuracy of webpage text classification and making it more convenient for users to obtain valid information from massive amounts of text in a timely and accurate manner.
本申请实施例融合至少两种特征提取算法，并在卡方统计中引入标准差，有效保证了特征提取的客观性与准确性。并且，通过使用长尾分布图选择特征数量，针对特征分词采用兼顾了特征对分类影响的权重，因而能进一步筛选出有效特征，从而使网页文本分类的效果更精准。 The embodiments of the present application combine at least two feature extraction algorithms and introduce the standard deviation into the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction. Moreover, by using the long-tail distribution curve to select the number of features, and by assigning the feature participles weights that take into account the influence of features on classification, effective features can be further screened out, making webpage text classification more accurate.
附图说明DRAWINGS
图1是本申请的一种网页文本分类的方法的步骤流程图;1 is a flow chart showing the steps of a method for classifying web page text according to the present application;
图2是本申请一种示例中长尾分布的示意图;2 is a schematic diagram of a long tail distribution in an example of the present application;
图3是本申请的一种网页文本识别的步骤流程图;3 is a flow chart of steps of text recognition of a webpage according to the present application;
图4是本申请的一种网页文本分类的装置的结构框图;4 is a structural block diagram of an apparatus for classifying web page text according to the present application;
图5是本申请的一种网页文本识别的装置的结构框图。FIG. 5 is a structural block diagram of an apparatus for text recognition of a webpage according to the present application.
具体实施方式detailed description
为使本申请的上述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本申请作进一步详细的说明。The above described objects, features and advantages of the present application will become more apparent and understood.
文本分类是通过训练一定的文本集合,得到类别与未知文本的映射规则,即计算出文本与类别的相关度,再根据训练的分类器来决定文本的类别归属。Text categorization is to obtain a mapping rule between a category and an unknown text by training a certain set of texts, that is, calculating the relevance of the text and the category, and then determining the category attribution of the text according to the trained classifier.
文本分类是一个有指导的学习过程,它根据一个已经被标注的训练文本集合,找到文本属性(特征)和文本类别之间的关系模型(分类器),然后利用这种学习得到的关系模型对新的文本进行类别判断。文本分类的过程总体可划分为训练和分类两部分。训练的目的是通过新的文本和类别之间的联系构造分类模型,使其用于分类。分类过程是根据训练结果对未知文本进行分类,给定类别标识的过程。Text categorization is a guided learning process. It finds a relational model (classifier) between text attributes (features) and text categories based on a set of training texts that have been annotated, and then uses the relational model pair obtained by this learning. The new text is judged by category. The process of text categorization can be divided into two parts: training and classification. The purpose of training is to construct a classification model for the classification by linking the new text and categories. The classification process is a process of classifying unknown texts based on training results, giving a category identification.
参考图1,示出了本申请的一种网页文本分类的方法实施例的步骤流程图,具体可以包括如下步骤:Referring to FIG. 1 , a flow chart of steps of a method for classifying web page texts according to the present application is shown. Specifically, the method may include the following steps:
步骤101,采集网页中的文本数据;Step 101: Collect text data in a webpage;
本步骤即获取到用于进行分类模型训练的网页的文本数据，在实际中，其可能是海量数据。通常的处理方法是，在抓取到的网页集合中，对每篇网页文本进行纯文本的内容抽取，从而得到相应的纯文本，然后将抽取出的纯文本组成新的文档集合，该文档集合即为本申请所指网页中的文本数据。This step obtains the text data of the webpages used for training the classification model; in practice, this may be massive data. The usual approach is to extract the plain-text content of each webpage in the crawled webpage collection to obtain the corresponding plain text, and then combine the extracted plain texts into a new document collection; this document collection constitutes the text data of the webpages referred to in this application.
步骤102,对所述文本数据进行分词,获得基础分词;Step 102: Perform word segmentation on the text data to obtain a basic participle;
众所周知,英文是以词为单位的,词和词之间是靠空格隔开,而中文是以字为单位,句子中所有的字连起来才能描述一个意思。例如,英文句子I am a student,用中文则为:“我是一个学生”。计算机可以很简单通过空格知道student是一个单词,但是不能很容易明白“学”、“生”两个字合起来才表示一个词。把中文的汉字序列切分成有意义的词,就是中文分词。例如,我是一个学生,分词的结果是:我是一个学生。As we all know, English is based on words, words and words are separated by spaces, and Chinese is in words. All the words in a sentence can be combined to describe a meaning. For example, the English sentence I am a student, in Chinese is: "I am a student." The computer can easily know that student is a word by a space, but it is not easy to understand that the words "learning" and "sheng" are combined to represent a word. The Chinese character sequence is divided into meaningful words, which are Chinese word segments. For example, I am a student and the result of the participle is: I am a student.
下面介绍一些常用的分词方法:Here are some common word segmentation methods:
1、基于字符串匹配的分词方法:是指按照一定的策略将待分析的汉字串与一个预置的机器词典中的词条进行匹配,若在词典中找到某个字符串,则匹配成功(识别出一个词)。实际使用的分词系统,都是把机械分词作为一种初分手段,还需通过利用各种其它的语言信息来进一步提高切分的准确率。1. Word segmentation based on string matching: refers to matching the Chinese character string to be analyzed with a term in a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the matching is successful ( Identify a word). The actual word segmentation system uses mechanical segmentation as a preliminary method, and further improves the accuracy of segmentation by using various other language information.
2、基于特征扫描或标志切分的分词方法：是指优先在待分析字符串中识别和切分出一些带有明显特征的词，以这些词作为断点，可将原字符串分为较小的串再来进机械分词，从而减少匹配的错误率；或者将分词和词类标注结合起来，利用丰富的词类信息对分词决策提供帮助，并且在标注过程中又反过来对分词结果进行检验、调整，从而提高切分的准确率。2. Word segmentation based on feature scanning or token segmentation: words with obvious features are first identified and segmented out of the string to be analyzed; using these words as breakpoints, the original string can be divided into smaller strings for further mechanical segmentation, reducing the matching error rate. Alternatively, word segmentation is combined with part-of-speech tagging, using the rich part-of-speech information to aid segmentation decisions, while the tagging process in turn checks and adjusts the segmentation results, thereby improving segmentation accuracy.
3、基于理解的分词方法:是指通过让计算机模拟人对句子的理解,达到识别词的效果。其基本思想就是在分词的同时进行句法、语义分析,利用句法信息和语义信息来处理歧义现象。它通常包括三个部分:分词子系统、句法语义子系统、总控部分。在总控部分的协调下,分词子系统可以获得有关词、句子等的句法和语义信息来对分词歧义进行判断, 即它模拟了人对句子的理解过程。这种分词方法需要使用大量的语言知识和信息。3. The word segmentation method based on understanding: refers to the effect of identifying words by letting the computer simulate the understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and use syntactic information and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain the syntactic and semantic information about words, sentences, etc. to judge the participle ambiguity. That is, it simulates the process of understanding people's sentences. This method of word segmentation requires a large amount of linguistic knowledge and information.
4、基于统计的分词方法:是指,中文信息中由于字与字相邻共现的频率或概率能够较好的反映成词的可信度,所以可以对语料中相邻共现的各个字的组合的频度进行统计,计算它们的互现信息,以及计算两个汉字X、Y的相邻共现概率。互现信息可以体现汉字之间结合关系的紧密程度。当紧密程度高于某一个阈值时,便可认为此字组可能构成了一个词。这种方法只需对语料中的字组频度进行统计,不需要切分词典。4. Statistical-based word segmentation method: It means that the frequency or probability of co-occurrence of words and words in Chinese information can better reflect the credibility of words, so each word in the corpus can be co-occurred. The frequency of the combination is counted, their mutual information is calculated, and the adjacent co-occurrence probability of the two Chinese characters X and Y is calculated. The mutual information can reflect the closeness of the relationship between Chinese characters. When the degree of tightness is above a certain threshold, the word group may be considered to constitute a word. This method only needs to count the frequency of the words in the corpus, and does not need to cut the dictionary.
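As an illustration of the statistics-based approach just described, the following Python sketch estimates the mutual information of adjacent character pairs from a small corpus. It is only a minimal example of the general idea; the corpus, the probability estimates, and the threshold interpretation are illustrative assumptions, not the exact formulation of any particular segmenter.

```python
import math
from collections import Counter

def adjacent_pmi(corpus):
    """Estimate pointwise mutual information of every adjacent character pair X, Y:
    PMI(X, Y) = log( p(X, Y) / (p(X) * p(Y)) ).
    A high PMI suggests the pair tends to co-occur and may form a word."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for pair, count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        pmi[pair] = math.log(p_xy / (p_x * p_y))
    return pmi

# pairs whose PMI exceeds a chosen threshold can be treated as candidate words
scores = adjacent_pmi(["我是一个学生", "学生在学校学习"])
```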
本申请对所述文本数据进行分词的方式不作限制,在针对文档集合进行分词,所获得的所有分词即为本申请所指的基础分词。The manner in which the text data is segmented by the present application is not limited, and the word segmentation is performed on the document set, and all the word segments obtained are the basic participles referred to in the present application.
在具体实现中,在进入下一步骤前,还可以针对基础分词中的无效词,比如,针对停用词等预先进行去除处理。停用词通常指在各类文本中都频繁出现,因而被认为带有很少的有助于分类任何信息的代词、介词、连词等高频词。本领域技术人员也可以按需求设计需要在特征提取之前或特征提取过程中删除的特征词,本申请对此无需加以限制。In a specific implementation, before proceeding to the next step, the removal process may also be performed in advance for the invalid words in the basic participle, for example, for the stop words. Stop words usually refer to frequent occurrences in various types of text, and are therefore considered to have few high-frequency words such as pronouns, prepositions, conjunctions, etc. that help to classify any information. Those skilled in the art can also design feature words that need to be deleted before or during feature extraction according to requirements, which need not be limited in this application.
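The following is a minimal sketch of obtaining base participles and removing stop words. It assumes the jieba tokenizer purely for illustration (the application deliberately does not mandate any particular segmentation method), and the stop-word list is a hypothetical placeholder.

```python
import jieba  # one possible Chinese tokenizer; the application does not require this library

STOP_WORDS = {"的", "了", "是", "在", "和"}  # illustrative stop-word list

def basic_tokens(text):
    """Segment raw page text into base participles and drop stop words."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

tokens = basic_tokens("我是一个学生")  # e.g. ['我', '一个', '学生']
```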
步骤103,计算各基础分词的第一属性值和第二属性值;Step 103: Calculate a first attribute value and a second attribute value of each basic participle;
步骤104,依据所述第一属性值和第二属性值计算各基础分词的特征值;Step 104: Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
步骤105,依据所述特征值从所述基础分词中筛选出特征分词;Step 105: Filter feature tokens from the basic participle according to the feature value;
以上步骤103-105涉及文本分类中特征选择的处理。通常原始特征空间维数非常高,且存在大量冗余的特征,因此需要进行特征降维。特征选择是特征降维中的其中一类,它的基本思路:根据某种评价函数独立地对每个原始特征项进行评分,然后按分值的高低排序,从中选取若干个分值最高的特征项,或者预先设定一个阈值,把度量值小于阈值特征过滤掉,剩下的候选特征作为结果的特征子集。 The above steps 103-105 relate to the processing of feature selection in text categorization. Usually the original feature space dimension is very high, and there are a lot of redundant features, so feature dimension reduction is needed. Feature selection is one of the characteristics of feature dimension reduction. Its basic idea is to score each original feature item independently according to a certain evaluation function, and then sort by the level of the score, and select several features with the highest score. Item, or a threshold is set in advance, the metric value is filtered out of the threshold feature, and the remaining candidate features are used as the feature subset of the result.
特征选择算法包括：文档频次、互信息量、信息增益、χ2统计量（CHI）等算法。已有技术中，本领域技术人员通常会选用其中之一进行特征分词的选取，然而这种单一算法的使用存在不少弊端，以信息增益算法为例，信息增益通过分词在文本中出现和不出现前后的信息量之差来推断该分词所带的信息量，即一个分词的信息增益值表示分词特征包含的信息量。可以理解，信息增益值越高表示分词特征可以给分类器来带较大的信息量，但已有的信息增益算法只考虑分词特征对整体分类器提供的信息量，忽略了分词特征对不同的各个分类的区分度。Feature selection algorithms include document frequency, mutual information, information gain, and the χ2 statistic (CHI). In the prior art, those skilled in the art usually choose one of these to select feature participles; however, using a single algorithm has many drawbacks. Taking the information gain algorithm as an example, information gain infers the amount of information carried by a participle from the difference in information content between the cases where the participle appears in the text and where it does not; that is, the information gain value of a participle indicates how much information the participle feature contains. It can be understood that a higher information gain value means the participle feature can provide more information to the classifier, but the existing information gain algorithm only considers the amount of information the participle feature provides to the classifier as a whole and ignores how well the participle feature distinguishes between the individual categories.
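The sketch below shows the standard information-gain measure for a single participle over a labeled training collection, i.e. the class entropy minus the class entropy conditioned on whether the participle appears. Function and variable names are illustrative and not taken from the application.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(term) = H(C) - [ p(t) * H(C | t present) + p(not t) * H(C | t absent) ]."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    p_t = len(with_term) / len(docs)
    conditional = p_t * entropy(with_term) + (1 - p_t) * entropy(without_term)
    return entropy(labels) - conditional

# docs are token sets, labels are their predefined categories (toy example)
ig = information_gain("学生", [{"我", "学生"}, {"天气", "晴"}], ["education", "weather"])
```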
或者,以χ2统计量(CHI)算法为例,卡方统计也用于表征两个变量的相关性,它同时考虑了特征在某类文本中出现和不出现时的情况。卡方统计量值越大,它与该类的相关性就越大,携带的类别信息也就越多,但已有的χ2统计量(CHI)算法中过分夸大低频词的作用。Or, taking the χ 2 statistic (CHI) algorithm as an example, the chi-square statistic is also used to characterize the correlation between two variables. It also considers the case when the feature appears and does not appear in a certain type of text. The larger the chi-square statistic, the more relevant it is to the class, and the more the category information is carried, but the existing χ 2 statistic (CHI) algorithm over-exaggerates the role of low-frequency words.
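As a sketch of the chi-square statistic of a participle with respect to one class, the code below uses the classical 2x2 presence/absence contingency-table formulation commonly used for text feature selection; the application does not spell out its exact variant, so this formulation is an assumption.

```python
def chi_square(term, category, docs, labels):
    """chi2 = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D)),
    where A, B, C, D count documents by (term present?, in category?)."""
    a = b = c = d = 0
    for doc, lab in zip(docs, labels):
        present = term in doc
        if present and lab == category:
            a += 1
        elif present:
            b += 1
        elif lab == category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

chi = chi_square("学生", "education", [{"我", "学生"}, {"天气", "晴"}], ["education", "weather"])
```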
针对上述弊端,本申请提出不采用单一算法,而采用至少两种算法进行特征提取,即分别采用不同的两种算法计算各基础分词的第一属性值和第二属性值,例如,采用信息增益算法计算第一属性值,采用CHI算法计算第二属性值。In view of the above drawbacks, the present application proposes that no single algorithm is used, and at least two algorithms are used for feature extraction, that is, different first algorithms are used to calculate the first attribute value and the second attribute value of each basic participle, for example, using information gain. The algorithm calculates the first attribute value and uses the CHI algorithm to calculate the second attribute value.
当然,本领域技术人员依据实际情况采用其它算法分别计算分词不同的属性值,甚至两个以上的属性值,都是可行的,本申请对此不作限制。Certainly, those skilled in the art may use other algorithms to calculate different attribute values of the word segmentation according to actual conditions, and even more than two attribute values, which are feasible, and the application does not limit this.
在本申请的一种优选实施例中，所述第一属性值可以为所述基础分词的信息增益值，所述第二属性值可以为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差，所述特征值可以为所述基础分词的区分度，即所述步骤103具体可以包括如下子步骤：In a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the chi-square statistic values of the base participle with respect to the predefined categories, and the feature value may be the discrimination degree of the base participle; that is, step 103 may specifically include the following sub-steps:
子步骤1031,计算各基础分词的信息增益值;Sub-step 1031, calculating an information gain value of each basic participle;
子步骤1032,计算各基础分词的卡方统计量值; Sub-step 1032, calculating a chi-square statistic value of each basic participle;
子步骤1033,基于所述基础分词的数量,统计所述基础分词相对于预定义的各个分类的卡方统计量的标准差。Sub-step 1033, based on the number of base participles, the standard deviation of the base participle relative to the predefined chi-square statistic of each category is counted.
在这种情况下,所述步骤104可以为,基于所述信息增益值和标准差的乘积获得各基础分词的区分度。In this case, the step 104 may be: obtaining the discrimination degree of each basic participle based on the product of the information gain value and the standard deviation.
更具体而言,可以通过如下公式依据所述第一属性值和第二属性值计算各基础分词的特征值:More specifically, the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by the following formula:
score = igScore × std(chiScore_1, ..., chiScore_n)
其中，score为基础分词的区分度，igScore为基础分词的信息增益值，chiScore为基础分词相对于预定义的各个分类的卡方统计量值，所述n为预定义的分类的数量。Here, score is the discrimination degree of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories, and n is the number of predefined categories.
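Reading the formula above as the product of the information gain value and the standard deviation of the per-category chi-square values, as the surrounding text states, a minimal sketch is:

```python
import statistics

def discrimination_score(ig_score, chi_scores):
    """score = igScore * std(chiScore_1 .. chiScore_n): the standard deviation is taken
    over the participle's chi-square value for each of the n predefined categories."""
    return ig_score * statistics.pstdev(chi_scores)

# e.g. a participle with information gain 0.12 and per-class CHI values for 4 classes
score = discrimination_score(0.12, [35.0, 2.1, 0.8, 1.5])
```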
本申请融合至少两种特征提取算法,并在卡方统计中引入标准差,有效保证了特征提取的客观性与准确性。The application combines at least two feature extraction algorithms and introduces a standard deviation in the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction.
在本申请的一种优选实施例中,所述步骤105具体可以包括如下子步骤:In a preferred embodiment of the present application, the step 105 may specifically include the following sub-steps:
子步骤1051,将所述基础分词按照其对应的特征值由高至低排列;Sub-step 1051, the basic participle is arranged according to its corresponding feature value from high to low;
子步骤1052,提取预设数量的,所述特征值高于预设阈值的基础分词作为特征分词。Sub-step 1052, extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
在计算出各基础分词的特征值后，可以发现此值符合如图2所示的长尾分布（齐普夫定律）示意图，图2中横轴为基础分词的个数，纵轴为基础分词的区分度，应用本申请的优选实施例，可以取例如横坐标大于0小于30000的基础分词作为特征分词。After the feature value of each base participle is calculated, it can be found that these values follow the long-tail distribution (Zipf's law) illustrated in Fig. 2, where the horizontal axis is the number of base participles and the vertical axis is the discrimination degree of the base participles. Applying this preferred embodiment of the present application, the base participles whose abscissa is, for example, greater than 0 and less than 30,000 may be taken as feature participles.
本申请通过使用长尾分布图选择特征数量,可以进一步筛选出有效特征,从而使网页文本分类的效果更精准。By using the long tail profile to select the number of features, the present application can further screen out the effective features, thereby making the effect of web page text classification more accurate.
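A minimal sketch of sub-steps 1051-1052: rank the base participles by discrimination score from high to low and keep a preset number of those above a preset threshold. The cut-off of 30,000 mirrors the example read off Fig. 2; the threshold value is purely illustrative.

```python
def select_features(scores, max_count=30000, min_score=0.0):
    """Rank base participles by discrimination score (high to low) and keep at most
    max_count of those above min_score, i.e. the head of the long-tail curve."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, s in ranked if s > min_score][:max_count]

features = select_features({"学生": 3.2, "学校": 1.8, "今天": 0.01}, max_count=2, min_score=0.05)
```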
步骤106,计算各特征分词相应的权重; Step 106: Calculate corresponding weights of each feature participle;
在文本中,每一个特征分词赋予一个权重,表示这一特征分词在该文本中的重要程度。权重一般都是以特征项的频率为基础进行计算,计算方式很多,例如,布尔权值法,词频权值法,TF/IDF权值法,TFC权值法等,已有这种权重计算方法的计算也存在不少弊端,例如,TF/IDF权值法中TF表示特征在单个文本中的数量,IDF表示特征在整个语料中的数量,因此完全忽略了特征对分类的影响。In the text, each feature participle is given a weight indicating the importance of the feature participle in the text. The weights are generally calculated based on the frequency of the feature items. There are many calculation methods, such as Boolean weight method, word frequency weight method, TF/IDF weight method, TFC weight method, etc. There are also many disadvantages in the calculation. For example, in TF/IDF weight method, TF indicates the number of features in a single text, and IDF indicates the number of features in the entire corpus, so the influence of features on classification is completely ignored.
因而,本申请提出了一种用于计算权重的优选实施例,在本实施例中,所述步骤106可以包括如下子步骤:Thus, the present application proposes a preferred embodiment for calculating weights. In this embodiment, the step 106 may include the following sub-steps:
子步骤1061,获取各特征分词在相应网页的文本数据中出现的次数;Sub-step 1061: obtaining the number of times each feature participle appears in the text data of the corresponding webpage;
子步骤1062,统计所述网页的文本数据中特征分词的总数;Sub-step 1062, counting the total number of feature word segments in the text data of the webpage;
子步骤1063,依据所述特征分词的特征值,各特征分词在相应网页的文本数据中出现的次数,以及,所述网页的文本数据中特征分词的总数,计算得到各特征分词相应的权重。Sub-step 1063, according to the feature value of the feature word segment, the number of occurrences of each feature word segment in the text data of the corresponding web page, and the total number of feature word segments in the text data of the web page, the corresponding weights of each feature word segment are calculated.
作为本申请优选实施例具体应用的一种示例,所述子步骤1063具体可以通过如下公式计算各特征分词相应的权重:As an example of a specific application of the preferred embodiment of the present application, the sub-step 1063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
weight = (tf / n) × score
其中,weight为特征分词的权重,tf为特征分词在相应网页的文本数据中出现的次数,n为网页的文本数据中特征分词的总数,score为特征分词的区分度。Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
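Under the reading of the formula above as weight = (tf / n) × score, sub-steps 1061-1063 might look like the sketch below; the exact functional form is an assumption reconstructed from the variable definitions.

```python
from collections import Counter

def feature_weights(page_tokens, feature_scores):
    """weight(t) = (tf / n) * score(t): tf is t's count in this page's text, n is the total
    number of feature participles in the page, score is t's discrimination degree."""
    in_page = [t for t in page_tokens if t in feature_scores]
    tf = Counter(in_page)
    n = len(in_page)
    return {t: (c / n) * feature_scores[t] for t, c in tf.items()} if n else {}
```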
在具体实现中，更为优选的是，所述步骤106还可以包括如下子步骤：In a specific implementation, it is further preferred that step 106 further includes the following sub-step:
子步骤1064,对所述特征分词的权重进行归一化处理。Sub-step 1064, normalizing the weights of the feature word segments.
作为本申请具体应用的一种示例,可以通过以下公式对所述特征分词的权重进行归一化处理: As an example of a specific application of the present application, the weight of the feature word segmentation can be normalized by the following formula:
norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))
其中，norm(weight)为归一化之后的权重，weight为所述特征分词的权重，min(weight)为所述网页中文本数据中最小weight值，max(weight)为所述网页中文本数据中最大weight值。Here, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
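A sketch of the min-max normalization of sub-step 1064, applied to one page's weights:

```python
def normalize(weights):
    """norm(w) = (w - min) / (max - min), computed over the weights of one page."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {t: 0.0 for t in weights}
    return {t: (w - lo) / (hi - lo) for t, w in weights.items()}
```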
以上本申请的示例中所采用的权重则兼顾了特征对分类影响,因而能进一步提升特征选取的有效性。当然,本申请采用任一种权重计算方式均是可行的,对此本申请无需加以限制。The weights used in the examples of the present application take into account the influence of features on the classification, and thus can further improve the effectiveness of feature selection. Of course, it is feasible to use any of the weight calculation methods in this application, and the application does not need to be limited.
以上计算得到的各特征分词相应的权重（包括如子步骤1063得到的权重或如子步骤1064得到的归一化权重），可以作为一个文本的特征向量，得到特征向量之后可以选择某个文本分类算法训练出分类模型。The corresponding weights of each feature participle calculated above (including the weights obtained in sub-step 1063 or the normalized weights obtained in sub-step 1064) can be used as the feature vector of a text; after the feature vector is obtained, a text classification algorithm can be selected to train the classification model.
步骤107,将所述权重作为相应特征分词的特征向量,采用所述特征向量训练出分类模型。Step 107: The weight is used as a feature vector of the corresponding feature word segment, and the feature model is used to train the classification model.
本领域技术人员采用任一种文本分类算法，比如贝叶斯概率算法（Naive Bayes），支持向量机，KNN算法（k nearest neighbor）等采用特征向量训练出分类模型都是可行的，本申请对此不作限制。It is feasible for those skilled in the art to train the classification model with the feature vectors using any text classification algorithm, such as the naive Bayes algorithm, support vector machines, or the KNN (k-nearest-neighbor) algorithm; the present application does not limit this.
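As a hedged sketch of step 107, assuming scikit-learn (which the application does not name) and naive Bayes as one of the algorithms listed above: the feature vector of a page is the normalized weight of each selected feature participle in a fixed order, with 0 for absent features.

```python
from sklearn.naive_bayes import MultinomialNB

def train_model(page_weights, page_labels, features):
    """page_weights: one {participle: normalized weight} dict per training page;
    page_labels: the predefined category of each page;
    features: the selected feature participles, in a fixed order."""
    X = [[w.get(t, 0.0) for t in features] for w in page_weights]
    return MultinomialNB().fit(X, page_labels)
```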
本申请实施例通过改进特征分词的提取方式，以及，特征分词权重的计算方式，不仅有效保证了特征提取的客观性与准确性，还兼顾了特征对分类影响，从而提高了网页文本分类的准确性，更方便于用户在海量的文本中及时准确地获得有效的信息。By improving the way feature participles are extracted and the way feature participle weights are calculated, the embodiments of the present application not only effectively ensure the objectivity and accuracy of feature extraction, but also take into account the influence of features on classification, thereby improving the accuracy of webpage text classification and making it more convenient for users to obtain valid information from massive amounts of text in a timely and accurate manner.
参考图3,示出了本申请的一种网页文本识别的方法实施例的流程图,具体可以包括如下步骤:Referring to FIG. 3, a flowchart of an embodiment of a method for text recognition of a webpage according to the present application is shown. Specifically, the method may include the following steps:
步骤301,提取待识别网页中的文本数据;Step 301: Extract text data in the webpage to be identified;
步骤302,对所述文本数据进行分词,获得基础分词; Step 302, performing segmentation on the text data to obtain a basic participle;
步骤303,计算各基础分词的第一属性值和第二属性值; Step 303: Calculate a first attribute value and a second attribute value of each basic participle;
步骤304,依据所述第一属性值和第二属性值计算各基础分词的特征值;Step 304: Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
步骤305,依据所述特征值从所述基础分词中筛选出特征分词;Step 305: Filter feature feature words from the basic participle according to the feature value;
步骤306,计算各特征分词相应的权重; Step 306, calculating corresponding weights of each feature participle;
步骤307,将所述权重作为特征向量输入预先训练出的分类模型中,获得分类信息;Step 307: Enter the weight as a feature vector into a pre-trained classification model to obtain classification information.
步骤308,针对所述待识别网页标记分类信息。Step 308: Mark classification information for the to-be-identified webpage.
在本申请的一种优选实施例中，所述第一属性值可以为所述基础分词的信息增益值，所述第二属性值可以为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差，所述特征值可以为所述基础分词的区分度。In a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the chi-square statistic values of the base participle with respect to the predefined categories, and the feature value may be the discrimination degree of the base participle.
作为本申请具体应用的一种示例,可以通过如下公式依据所述第一属性值和第二属性值计算各基础分词的特征值:As an example of the specific application of the present application, the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by using the following formula:
score = igScore × std(chiScore_1, ..., chiScore_n)
其中，score为基础分词的区分度，igScore为基础分词的信息增益值，chiScore为基础分词相对于预定义的各个分类的卡方统计量值，所述n为预定义的分类的数量。Here, score is the discrimination degree of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories, and n is the number of predefined categories.
在本申请的一种优选实施例中,所述步骤305可以包括如下子步骤:In a preferred embodiment of the present application, the step 305 may include the following sub-steps:
子步骤3051,将所述基础分词按照其对应的特征值由高至低排列;Sub-step 3051, the basic participle is arranged according to its corresponding feature value from high to low;
子步骤3052,提取预设数量的,所述特征值高于预设阈值的基础分词作为特征分词。Sub-step 3052, extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
在本申请的一种优选实施例中,所述步骤306可以包括如下子步骤: In a preferred embodiment of the present application, the step 306 may include the following sub-steps:
子步骤3061,获取各特征分词在相应网页的文本数据中出现的次数;Sub-step 3061, obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage;
子步骤3062,统计所述网页的文本数据中特征分词的总数;Sub-step 3062, counting the total number of feature word segments in the text data of the webpage;
子步骤3063,依据所述特征分词的特征值,各特征分词在相应网页的文本数据中出现的次数,以及,所述网页的文本数据中特征分词的总数,计算得到各特征分词相应的权重。Sub-step 3063, according to the feature value of the feature segmentation, the number of occurrences of each feature segmentation in the text data of the corresponding webpage, and the total number of feature segmentation words in the text data of the webpage, the corresponding weights of each feature segmentation are calculated.
作为本申请优选实施例具体应用的一种示例,所述子步骤3063具体可以通过如下公式计算各特征分词相应的权重:As an example of a specific application of the preferred embodiment of the present application, the sub-step 3063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
weight = (tf / n) × score
其中,weight为特征分词的权重,tf为特征分词在相应网页的文本数据中出现的次数,n为网页的文本数据中特征分词的总数,score为特征分词的区分度。Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
在具体实现中,更为优选的是,所述步骤306还可以包括如下子步骤:In a specific implementation, it is further preferred that the step 306 further includes the following sub-steps:
子步骤3064,对所述特征分词的权重进行归一化处理。Sub-step 3064, normalizing the weights of the feature word segments.
作为本申请具体应用的一种示例,可以通过以下公式对所述特征分词的权重进行归一化处理:As an example of a specific application of the present application, the weight of the feature word segmentation can be normalized by the following formula:
norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))
其中，norm(weight)为归一化之后的权重，weight为所述特征分词的权重，min(weight)为所述网页中文本数据中最小weight值，max(weight)为所述网页中文本数据中最大weight值。Here, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
以上计算得到的各特征分词相应的权重，可以作为一个文本的特征向量，得到特征向量之后可以将其输入按图1所示的过程预先生成的分类模型中，即可获得当前特征向量所归属的分类信息，最后将当前识别的网页标记上相应的分类信息即可。The corresponding weight of each feature participle calculated above can be used as the feature vector of a text. After the feature vector is obtained, it can be input into the classification model pre-generated by the process shown in Fig. 1 to obtain the classification information to which the current feature vector belongs; finally, the corresponding classification information is marked on the webpage currently being recognized.
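Continuing the scikit-learn assumption from the training sketch above, classifying and tagging an unseen page (steps 307-308) might look like the following; the model is any estimator trained as in the earlier sketch.

```python
def classify_page(model, page_weights, features):
    """Turn an unseen page's normalized weights into a feature vector and ask the
    pre-trained model for its category label, which is then attached to the page."""
    vector = [[page_weights.get(t, 0.0) for t in features]]
    return model.predict(vector)[0]
```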
需要说明的是,对于方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请实施例所必须的。It should be noted that, for the method embodiments, for the sake of simple description, they are all expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described action sequence, because In accordance with embodiments of the present application, certain steps may be performed in other sequences or concurrently. In the following, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required in the embodiments of the present application.
参照图4,示出了本申请的一种网页文本分类的装置实施例的结构框图,具体可以包括如下模块:Referring to FIG. 4, a structural block diagram of an apparatus embodiment of a webpage text classification of the present application is shown, which may specifically include the following modules:
采集模块401,用于采集网页中的文本数据;The collecting module 401 is configured to collect text data in the webpage;
分词模块402,用于对所述文本数据进行分词,获得基础分词;a word segmentation module 402, configured to perform segmentation on the text data to obtain a basic participle;
分词属性计算模块403,用于计算各基础分词的第一属性值和第二属性值;The word segment attribute calculation module 403 is configured to calculate a first attribute value and a second attribute value of each base participle;
特征值计算模块404,用于依据所述第一属性值和第二属性值计算各基础分词的特征值;The feature value calculation module 404 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
特征提取模块405,用于依据所述特征值从所述基础分词中筛选出特征分词;The feature extraction module 405 is configured to filter the feature word segmentation from the basic participle according to the feature value;
特征权重分配模块406,用于计算各特征分词相应的权重;a feature weight assignment module 406, configured to calculate a corresponding weight of each feature participle;
模型训练模块407,用于将所述权重作为相应特征分词的特征向量,采用所述特征向量训练出分类模型。The model training module 407 is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
在本申请的一种优选实施例中，所述第一属性值可以为所述基础分词的信息增益值，所述第二属性值可以为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差，所述特征值可以为所述基础分词的区分度。 In a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the chi-square statistic values of the base participle with respect to the predefined categories, and the feature value may be the discrimination degree of the base participle.
作为本申请实施例具体应用的一种示例,所述特征值计算模块404可以通过如下公式依据所述第一属性值和第二属性值计算各基础分词的特征值:As an example of the specific application of the embodiment of the present application, the feature value calculation module 404 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
score = igScore × std(chiScore_1, ..., chiScore_n)
其中，score为基础分词的区分度，igScore为基础分词的信息增益值，chiScore为基础分词相对于预定义的各个分类的卡方统计量值，所述n为预定义的分类的数量。Here, score is the discrimination degree of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories, and n is the number of predefined categories.
在本申请的一种优选实施例中,所述特征提取模块405可以包括如下子模块:In a preferred embodiment of the present application, the feature extraction module 405 may include the following sub-modules:
排序子模块4051,用于将所述基础分词按照其对应的特征值由高至低排列;a sorting sub-module 4051, configured to rank the basic participle according to its corresponding feature value from high to low;
提取子模块4052,用于提取预设数量的,所述特征值高于预设阈值的基础分词作为特征分词。The extracting sub-module 4052 is configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
在本申请的一种优选实施例中,所述特征权重分配模块406可以包括如下子模块:In a preferred embodiment of the present application, the feature weight assignment module 406 can include the following sub-modules:
次数统计子模块4061,用于获取各特征分词在相应网页的文本数据中出现的次数;The number of statistics sub-module 4061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
分词总数统计子模块4062,用于统计所述网页的文本数据中特征分词的总数;a segmentation total number statistics sub-module 4062, configured to count the total number of feature word segments in the text data of the webpage;
计算子模块4063，用于依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重。The calculation sub-module 4063 is configured to calculate the corresponding weight of each feature participle from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
作为本申请实施例具体应用的一种示例，所述计算子模块4063可以通过如下公式依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重：As an example of a specific application of this embodiment of the present application, the calculation sub-module 4063 may calculate the corresponding weight of each feature participle, by the following formula, from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
weight = (tf / n) × score
其中,weight为特征分词的权重,tf为特征分词在相应网页的文本数据中出现的次数,n为网页的文本数据中特征分词的总数,score为特征分词的区分度。Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
在本申请的一种优选实施例中,所述特征权重分配模块406还可以包括如下子模块:In a preferred embodiment of the present application, the feature weight assignment module 406 may further include the following sub-modules:
归一化子模块4064,用于对所述特征分词的权重进行归一化处理。The normalization sub-module 4064 is configured to normalize the weight of the feature word segmentation.
作为本申请实施例具体应用的一种示例,所述归一化子模块4064可以通过以下公式对所述特征分词的权重进行归一化处理:As an example of a specific application of the embodiment of the present application, the normalization sub-module 4064 may normalize the weight of the feature word segment by the following formula:
norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))
其中，norm(weight)为归一化之后的权重，weight为所述特征分词的权重，min(weight)为所述网页中文本数据中最小weight值，max(weight)为所述网页中文本数据中最大weight值。Here, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。For the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
参照图5,示出了本申请的一种网页文本识别的装置实施例的结构框图,具体可以包括如下模块:Referring to FIG. 5, a structural block diagram of an apparatus for recognizing a webpage text of the present application is shown. Specifically, the following modules may be included:
文本提取模块501,用于提取待识别网页中的文本数据;a text extraction module 501, configured to extract text data in a webpage to be identified;
分词模块502,用于对所述文本数据进行分词,获得基础分词;a word segmentation module 502, configured to perform segmentation on the text data to obtain a basic participle;
分词属性计算模块503,用于计算各基础分词的第一属性值和第二属性值; The word segment attribute calculation module 503 is configured to calculate a first attribute value and a second attribute value of each base participle;
特征值计算模块504,用于依据所述第一属性值和第二属性值计算各基础分词的特征值;The feature value calculation module 504 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
特征提取模块505,用于依据所述特征值从所述基础分词中筛选出特征分词;The feature extraction module 505 is configured to filter the feature word segmentation from the basic participle according to the feature value;
特征权重分配模块506,用于计算各特征分词相应的权重;a feature weight assignment module 506, configured to calculate a corresponding weight of each feature participle;
分类模块507,用于将所述权重作为特征向量输入预先训练出的分类模型中,获得分类信息;a classification module 507, configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information;
标记模块508,用于针对所述待识别网页标记分类信息。The marking module 508 is configured to mark the classification information for the to-be-identified webpage.
在本申请的一种优选实施例中，所述第一属性值可以为所述基础分词的信息增益值，所述第二属性值可以为所述基础分词相对于预定义的各个分类的卡方统计量值的标准差，所述特征值可以为所述基础分词的区分度。In a preferred embodiment of the present application, the first attribute value may be the information gain value of the base participle, the second attribute value may be the standard deviation of the chi-square statistic values of the base participle with respect to the predefined categories, and the feature value may be the discrimination degree of the base participle.
作为本申请实施例具体应用的一种示例,所述特征值计算模块504可以通过如下公式依据所述第一属性值和第二属性值计算各基础分词的特征值:As an example of the specific application of the embodiment of the present application, the feature value calculation module 504 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
score = igScore × std(chiScore_1, ..., chiScore_n)
其中，score为基础分词的区分度，igScore为基础分词的信息增益值，chiScore为基础分词相对于预定义的各个分类的卡方统计量值，所述n为预定义的分类的数量。Here, score is the discrimination degree of the base participle, igScore is the information gain value of the base participle, chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories, and n is the number of predefined categories.
在本申请的一种优选实施例中,所述特征提取模块505可以包括如下子模块:In a preferred embodiment of the present application, the feature extraction module 505 can include the following sub-modules:
排序子模块5051,用于将所述基础分词按照其对应的特征值由高至低排列;a sorting sub-module 5051, configured to rank the basic participle according to its corresponding feature value from high to low;
提取子模块5052，用于提取预设数量的，所述特征值高于预设阈值的基础分词作为特征分词。An extraction sub-module 5052, configured to extract a preset number of basic participles whose feature values are higher than a preset threshold, as feature participles.
次数统计子模块5061,用于获取各特征分词在相应网页的文本数据中出现的次数;The number of statistics sub-module 5061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
分词总数统计子模块5062,用于统计所述网页的文本数据中特征分词的总数;a segmentation total number statistics sub-module 5062, configured to count the total number of feature word segments in the text data of the webpage;
计算子模块5063，用于依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重。The calculation sub-module 5063 is configured to calculate the corresponding weight of each feature participle from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
作为本申请实施例具体应用的一种示例，所述计算子模块5063可以通过如下公式依据所述特征分词的特征值，各特征分词在相应网页的文本数据中出现的次数，以及，所述网页的文本数据中特征分词的总数，计算得到各特征分词相应的权重：As an example of a specific application of this embodiment of the present application, the calculation sub-module 5063 may calculate the corresponding weight of each feature participle, by the following formula, from the feature value of the feature participle, the number of times each feature participle appears in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
weight = (tf / n) × score
其中,weight为特征分词的权重,tf为特征分词在相应网页的文本数据中出现的次数,n为网页的文本数据中特征分词的总数,score为特征分词的区分度。Where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding web page, n is the total number of feature word segments in the text data of the web page, and score is the degree of distinguishing feature word segmentation.
在本申请的一种优选实施例中,所述特征权重分配模块506还可以包括如下子模块:In a preferred embodiment of the present application, the feature weight distribution module 506 may further include the following sub-modules:
归一化子模块5064,用于对所述特征分词的权重进行归一化处理。The normalization sub-module 5064 is configured to normalize the weight of the feature word segmentation.
作为本申请实施例具体应用的一种示例，所述归一化子模块5064可以通过以下公式对所述特征分词的权重进行归一化处理：As an example of a specific application of this embodiment of the present application, the normalization sub-module 5064 may normalize the weights of the feature participles by the following formula:
norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))
其中，norm(weight)为归一化之后的权重，weight为所述特征分词的权重，min(weight)为所述网页中文本数据中最小weight值，max(weight)为所述网页中文本数据中最大weight值。Here, norm(weight) is the weight after normalization, weight is the weight of the feature participle, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。For the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.
本说明书中的每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。Each embodiment in the specification is mainly described as being different from the other embodiments, and the same similar parts between the respective embodiments may be referred to each other.
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
在一个典型的配置中,所述计算机设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或 任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非持续性的电脑可读媒体(transitory media),如调制的数据信号和载波。In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium. Computer readable media includes both permanent and non-persistent, removable and non-removable media. Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic cassette tape, magnetic tape storage or other magnetic storage device or Any other non-transportable medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-persistent computer readable media, such as modulated data signals and carrier waves.
Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The method for classifying webpage text, the device for classifying webpage text, the method for recognizing webpage text, and the device for recognizing webpage text provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (22)

  1. A method for classifying webpage text, characterized by comprising:
    collecting text data from a webpage;
    segmenting the text data to obtain basic word segments;
    calculating a first attribute value and a second attribute value of each basic word segment;
    calculating a feature value of each basic word segment according to the first attribute value and the second attribute value;
    selecting feature word segments from the basic word segments according to the feature values;
    calculating a weight corresponding to each feature word segment;
    using the weights as feature vectors of the corresponding feature word segments, and training a classification model with the feature vectors.
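    A minimal sketch of the first two steps of claim 1 (collecting text data from a webpage and segmenting it into basic word segments). The use of requests and BeautifulSoup for plain-text extraction and jieba for Chinese word segmentation is an illustrative assumption; the claim itself does not name any library.

```python
# Sketch of claim 1, steps 1-2: collect the plain text of a webpage and segment it.
# Assumed libraries (not named in the claims): requests, beautifulsoup4, jieba.
import jieba
import requests
from bs4 import BeautifulSoup

def collect_text(url: str) -> str:
    """Fetch a webpage and keep only its plain-text content."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def segment(text: str) -> list[str]:
    """Split the text data into basic word segments, dropping whitespace-only tokens."""
    return [w for w in jieba.lcut(text) if w.strip()]
```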
  2. The method according to claim 1, characterized in that the first attribute value is an information gain value of the basic word segment, the second attribute value is a standard deviation of chi-square statistic values of the basic word segment with respect to each of a set of predefined categories, and the feature value is a degree of discrimination of the basic word segment.
  3. The method according to claim 2, characterized in that the feature value of each basic word segment is calculated from the first attribute value and the second attribute value by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100001 in the published claims]
    where score is the degree of discrimination of the basic word segment, igScore is the information gain value of the basic word segment, chiScore is the chi-square statistic value of the basic word segment with respect to each predefined category, and n is the number of predefined categories.
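    The published formula of claim 3 is available only as an image, so the exact combination of igScore and chiScore is not reproduced here. The sketch below computes the two ingredients the claim names and combines them as igScore divided by the standard deviation of the per-category chi-square values; that combination, and the use of scikit-learn's mutual-information estimate as a stand-in for information gain, are assumptions for illustration only.

```python
# Per-segment information gain and per-category chi-square statistics (claims 2-3).
# score = igScore / std(chiScore) is an ASSUMED combination, not the patent's formula.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif

def discrimination_scores(docs, labels):
    """docs: list of lists of basic word segments; labels: category of each document.
    Returns {segment: score}, where score plays the role of the degree of discrimination."""
    vec = CountVectorizer(analyzer=lambda d: d)        # documents are already segmented
    X = vec.fit_transform(docs)
    ig = mutual_info_classif(X, labels, discrete_features=True)   # information-gain-like values
    classes = sorted(set(labels))
    chi_per_class = np.column_stack([
        chi2(X, [1 if y == c else 0 for y in labels])[0]           # chi-square vs. each category
        for c in classes
    ])
    chi_std = chi_per_class.std(axis=1) + 1e-12                    # std. dev. across the n categories
    scores = ig / chi_std                                          # assumed combining rule
    return dict(zip(vec.get_feature_names_out(), scores))
```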
  4. The method according to claim 1, 2 or 3, characterized in that the step of selecting feature word segments from the basic word segments according to the feature values comprises:
    arranging the basic word segments from high to low according to their corresponding feature values;
    extracting, as the feature word segments, a preset number of basic word segments whose feature values are higher than a preset threshold.
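    A short sketch of claim 4's selection step: rank the basic word segments by feature value from high to low and keep at most a preset number of those above a preset threshold. The parameter names top_k and threshold are illustrative, not taken from the claims.

```python
def select_features(scores, top_k=2000, threshold=0.0):
    """scores: {segment: feature value}. Returns the feature word segments: the
    highest-scoring segments, limited to top_k, all strictly above threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # high to low
    return [seg for seg, s in ranked[:top_k] if s > threshold]
```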
  5. The method according to claim 1, 2 or 3, characterized in that the step of calculating the weight corresponding to each feature word segment comprises:
    obtaining the number of times each feature word segment appears in the text data of the corresponding webpage;
    counting the total number of feature word segments in the text data of the webpage;
    calculating the weight corresponding to each feature word segment according to the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.
  6. The method according to claim 5, characterized in that the weight corresponding to each feature word segment is calculated, according to the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage, by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100002 in the published claims]
    where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding webpage, n is the total number of feature word segments in the text data of the webpage, and score is the degree of discrimination of the feature word segment.
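    The weight formula of claims 5 and 6 is likewise published only as an image; it combines tf, n and score as defined above. The sketch below uses weight = (tf / n) × score as one plausible instantiation, which should be read as an assumption rather than the claimed formula.

```python
# Per-page weight of each feature word segment (claims 5-6).
# weight = (tf / n) * score is an ASSUMED form of the claimed formula.
from collections import Counter

def term_weights(doc, features, scores):
    """doc: basic word segments of one webpage; features: selected feature word segments;
    scores: {segment: degree of discrimination}. Returns {feature segment: weight}."""
    counts = Counter(w for w in doc if w in features)   # tf per feature word segment
    n = sum(counts.values())                            # total feature word segments in the page
    if n == 0:
        return {f: 0.0 for f in features}
    return {f: (counts[f] / n) * scores[f] for f in features}
```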
  7. The method according to claim 1, 2, 3 or 6, characterized in that the step of calculating the weight corresponding to each feature word segment further comprises:
    normalizing the weights of the feature word segments.
  8. The method according to claim 7, characterized in that the weights of the feature word segments are normalized by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100003 in the published claims]
    where norm(weight) is the weight after normalization, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
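    The definitions under the formula image of claim 8 describe a normalization driven by the minimum and maximum weight within one page's text data. A minimal sketch, assuming the standard min-max form:

```python
def normalize_weights(weights):
    """Min-max normalize {feature segment: weight} within one webpage's text data (claims 7-8)."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:                       # all weights equal: avoid division by zero
        return {f: 0.0 for f in weights}
    return {f: (w - lo) / (hi - lo) for f, w in weights.items()}
```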
  9. A method for recognizing webpage text, characterized by comprising:
    extracting text data from a webpage to be recognized;
    segmenting the text data to obtain basic word segments;
    calculating a first attribute value and a second attribute value of each basic word segment;
    calculating a feature value of each basic word segment according to the first attribute value and the second attribute value;
    selecting feature word segments from the basic word segments according to the feature values;
    calculating a weight corresponding to each feature word segment;
    inputting the weights as feature vectors into a pre-trained classification model to obtain classification information;
    marking the classification information for the webpage to be recognized.
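    Claim 9 reuses the same processing steps at recognition time and feeds the resulting weight vector into the pre-trained classification model. The sketch below ties the earlier snippets together under the same assumptions (jieba, scikit-learn, and the helper functions segment, discrimination_scores, select_features, term_weights and normalize_weights defined in the sketches above); the choice of LinearSVC as the classification model is also an assumption.

```python
# End-to-end sketch: training (claims 1-8) and recognition (claim 9).
import numpy as np
from sklearn.svm import LinearSVC

def vectorize(doc, features, scores):
    """Weight vector for one page, in a fixed feature order (claims 5-8)."""
    w = normalize_weights(term_weights(doc, features, scores))
    return [w[f] for f in features]

def train(pages, labels, top_k=2000):
    """pages: raw text data already collected from webpages; labels: their categories."""
    docs = [segment(p) for p in pages]
    scores = discrimination_scores(docs, labels)
    features = select_features(scores, top_k=top_k)
    X = np.array([vectorize(d, features, scores) for d in docs])
    return LinearSVC().fit(X, labels), features, scores

def recognize(page_text, model, features, scores):
    """Segment the page to be recognized, build its weight vector, and obtain
    classification information from the pre-trained model (claim 9)."""
    doc = segment(page_text)
    label = model.predict([vectorize(doc, features, scores)])[0]
    return label   # this label would then be marked on the webpage record
```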
  10. The method according to claim 9, characterized in that the first attribute value is an information gain value of the basic word segment, the second attribute value is a standard deviation of chi-square statistic values of the basic word segment with respect to each of a set of predefined categories, and the feature value is a degree of discrimination of the basic word segment.
  11. The method according to claim 9 or 10, characterized in that the step of selecting feature word segments from the basic word segments according to the feature values comprises:
    arranging the basic word segments from high to low according to their corresponding feature values;
    extracting, as the feature word segments, a preset number of basic word segments whose feature values are higher than a preset threshold.
  12. The method according to claim 9 or 10, characterized in that the step of calculating the weight corresponding to each feature word segment comprises:
    obtaining the number of times each feature word segment appears in the text data of the corresponding webpage;
    counting the total number of feature word segments in the text data of the webpage;
    calculating the weight corresponding to each feature word segment according to the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.
  13. The method according to claim 9, 10 or 12, characterized in that the step of calculating the weight corresponding to each feature word segment further comprises:
    normalizing the weights of the feature word segments.
  14. A device for classifying webpage text, characterized by comprising:
    a collection module, configured to collect text data from a webpage;
    a word segmentation module, configured to segment the text data to obtain basic word segments;
    a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each basic word segment;
    a feature value calculation module, configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value;
    a feature extraction module, configured to select feature word segments from the basic word segments according to the feature values;
    a feature weight allocation module, configured to calculate a weight corresponding to each feature word segment;
    a model training module, configured to use the weights as feature vectors of the corresponding feature word segments and train a classification model with the feature vectors.
  15. The device according to claim 14, characterized in that the first attribute value is an information gain value of the basic word segment, the second attribute value is a standard deviation of chi-square statistic values of the basic word segment with respect to each of a set of predefined categories, and the feature value is a degree of discrimination of the basic word segment.
  16. The device according to claim 15, characterized in that the feature value calculation module calculates the feature value of each basic word segment from the first attribute value and the second attribute value by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100004 in the published claims]
    where score is the degree of discrimination of the basic word segment, igScore is the information gain value of the basic word segment, chiScore is the chi-square statistic value of the basic word segment with respect to each predefined category, and n is the number of predefined categories.
  17. The device according to claim 14, 15 or 16, characterized in that the feature extraction module comprises:
    a sorting submodule, configured to arrange the basic word segments from high to low according to their corresponding feature values;
    an extraction submodule, configured to extract, as the feature word segments, a preset number of basic word segments whose feature values are higher than a preset threshold.
  18. The device according to claim 14, 15 or 16, characterized in that the feature weight allocation module comprises:
    an occurrence counting submodule, configured to obtain the number of times each feature word segment appears in the text data of the corresponding webpage;
    a word segment total counting submodule, configured to count the total number of feature word segments in the text data of the webpage;
    a calculation submodule, configured to calculate the weight corresponding to each feature word segment according to the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.
  19. The device according to claim 18, characterized in that the calculation submodule calculates the weight corresponding to each feature word segment, according to the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage, by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100005 in the published claims]
    where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding webpage, n is the total number of feature word segments in the text data of the webpage, and score is the degree of discrimination of the feature word segment.
  20. The device according to claim 14, 15, 16 or 19, characterized in that the feature weight allocation module further comprises:
    a normalization submodule, configured to normalize the weights of the feature word segments.
  21. The device according to claim 20, characterized in that the normalization submodule normalizes the weights of the feature word segments by the following formula:
    [formula — shown as image PCTCN2017077489-appb-100006 in the published claims]
    where norm(weight) is the weight after normalization, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
  22. A device for recognizing webpage text, characterized by comprising:
    a text extraction module, configured to extract text data from a webpage to be recognized;
    a word segmentation module, configured to segment the text data to obtain basic word segments;
    a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each basic word segment;
    a feature value calculation module, configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value;
    a feature extraction module, configured to select feature word segments from the basic word segments according to the feature values;
    a feature weight allocation module, configured to calculate a weight corresponding to each feature word segment;
    a classification module, configured to input the weights as feature vectors into a pre-trained classification model to obtain classification information;
    a marking module, configured to mark the classification information for the webpage to be recognized.
PCT/CN2017/077489 2016-03-30 2017-03-21 Method and device for webpage text classification, method and device for webpage text recognition WO2017167067A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610195483.4 2016-03-30
CN201610195483.4A CN107291723B (en) 2016-03-30 2016-03-30 Method and device for classifying webpage texts and method and device for identifying webpage texts

Publications (1)

Publication Number Publication Date
WO2017167067A1 true WO2017167067A1 (en) 2017-10-05

Family

ID=59962602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077489 WO2017167067A1 (en) 2016-03-30 2017-03-21 Method and device for webpage text classification, method and device for webpage text recognition

Country Status (3)

Country Link
CN (1) CN107291723B (en)
TW (1) TWI735543B (en)
WO (1) WO2017167067A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053251A (en) * 2017-12-18 2018-05-18 北京小度信息科技有限公司 Information processing method, device, electronic equipment and computer readable storage medium
CN108255797A (en) * 2018-01-26 2018-07-06 上海康斐信息技术有限公司 A kind of text mode recognition method and system
CN108334630A (en) * 2018-02-24 2018-07-27 上海康斐信息技术有限公司 A kind of URL classification method and system
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN110334342A (en) * 2019-06-10 2019-10-15 阿里巴巴集团控股有限公司 The analysis method and device of word importance
CN110427628A (en) * 2019-08-02 2019-11-08 杭州安恒信息技术股份有限公司 Web assets classes detection method and device based on neural network algorithm
CN110705290A (en) * 2019-09-29 2020-01-17 新华三信息安全技术有限公司 Webpage classification method and device
CN110837735A (en) * 2019-11-17 2020-02-25 太原蓝知科技有限公司 Intelligent data analysis and identification method and system
CN111159589A (en) * 2019-12-30 2020-05-15 中国银联股份有限公司 Classification dictionary establishing method, merchant data classification method, device and equipment
CN111695353A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Method, device and equipment for identifying timeliness text and storage medium
CN111737993A (en) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 Method for extracting health state of equipment from fault defect text of power distribution network equipment
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113190682A (en) * 2021-06-30 2021-07-30 平安科技(深圳)有限公司 Method and device for acquiring event influence degree based on tree model and computer equipment
WO2023035787A1 (en) * 2021-09-07 2023-03-16 浙江传媒学院 Text data attribution description and generation method based on text character feature
CN115883912A (en) * 2023-03-08 2023-03-31 山东水浒文化传媒有限公司 Interaction method and system for internet communication demonstration
CN116248375A (en) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium
CN116564538A (en) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844553B (en) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 Text classification method and device
CN108090178B (en) * 2017-12-15 2020-08-25 北京锐安科技有限公司 Text data analysis method, text data analysis device, server and storage medium
CN110287316A (en) * 2019-06-04 2019-09-27 深圳前海微众银行股份有限公司 A kind of Alarm Classification method, apparatus, electronic equipment and storage medium
CN111476025B (en) * 2020-02-28 2021-01-08 开普云信息科技股份有限公司 Government field-oriented new word automatic discovery implementation method, analysis model and system
CN111753525B (en) * 2020-05-21 2023-11-10 浙江口碑网络技术有限公司 Text classification method, device and equipment
CN112667817B (en) * 2020-12-31 2022-05-31 杭州电子科技大学 Text emotion classification integration system based on roulette attribute selection
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055183A1 (en) * 2007-08-24 2009-02-26 Siemens Medical Solutions Usa, Inc. System and Method for Text Tagging and Segmentation Using a Generative/Discriminative Hybrid Hidden Markov Model
CN103914478A (en) * 2013-01-06 2014-07-09 阿里巴巴集团控股有限公司 Webpage training method and system and webpage prediction method and system
CN104899310A (en) * 2015-06-12 2015-09-09 百度在线网络技术(北京)有限公司 Information ranking method, and method and device for generating information ranking model
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809548B2 (en) * 2004-06-14 2010-10-05 University Of North Texas Graph-based ranking algorithms for text processing
TWI427492B (en) * 2007-01-15 2014-02-21 Hon Hai Prec Ind Co Ltd System and method for searching information
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm
CN104346459B (en) * 2014-11-10 2017-10-27 南京信息工程大学 A kind of text classification feature selection approach based on term frequency and chi
CN105224695B (en) * 2015-11-12 2018-04-20 中南大学 A kind of text feature quantization method and device and file classification method and device based on comentropy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055183A1 (en) * 2007-08-24 2009-02-26 Siemens Medical Solutions Usa, Inc. System and Method for Text Tagging and Segmentation Using a Generative/Discriminative Hybrid Hidden Markov Model
CN103914478A (en) * 2013-01-06 2014-07-09 阿里巴巴集团控股有限公司 Webpage training method and system and webpage prediction method and system
CN104899310A (en) * 2015-06-12 2015-09-09 百度在线网络技术(北京)有限公司 Information ranking method, and method and device for generating information ranking model
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, XIAOHONG: "Feature extraction methods for Chinese text classification", COMPUTER ENGINEERING AND DESIGN, vol. 30, no. 17, 31 December 2009 (2009-12-31), ISSN: 1000-7024 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053251B (en) * 2017-12-18 2021-03-02 北京小度信息科技有限公司 Information processing method, information processing device, electronic equipment and computer readable storage medium
CN108053251A (en) * 2017-12-18 2018-05-18 北京小度信息科技有限公司 Information processing method, device, electronic equipment and computer readable storage medium
CN108255797A (en) * 2018-01-26 2018-07-06 上海康斐信息技术有限公司 A kind of text mode recognition method and system
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108415959B (en) * 2018-02-06 2021-06-25 北京捷通华声科技股份有限公司 Text classification method and device
CN108334630A (en) * 2018-02-24 2018-07-27 上海康斐信息技术有限公司 A kind of URL classification method and system
CN110334342A (en) * 2019-06-10 2019-10-15 阿里巴巴集团控股有限公司 The analysis method and device of word importance
CN110334342B (en) * 2019-06-10 2024-02-09 创新先进技术有限公司 Word importance analysis method and device
CN110427628A (en) * 2019-08-02 2019-11-08 杭州安恒信息技术股份有限公司 Web assets classes detection method and device based on neural network algorithm
CN110705290A (en) * 2019-09-29 2020-01-17 新华三信息安全技术有限公司 Webpage classification method and device
CN110837735B (en) * 2019-11-17 2023-11-03 内蒙古中媒互动科技有限公司 Intelligent data analysis and identification method and system
CN110837735A (en) * 2019-11-17 2020-02-25 太原蓝知科技有限公司 Intelligent data analysis and identification method and system
CN111159589A (en) * 2019-12-30 2020-05-15 中国银联股份有限公司 Classification dictionary establishing method, merchant data classification method, device and equipment
CN111159589B (en) * 2019-12-30 2023-10-20 中国银联股份有限公司 Classification dictionary establishment method, merchant data classification method, device and equipment
CN111737993B (en) * 2020-05-26 2024-04-02 浙江华云电力工程设计咨询有限公司 Method for extracting equipment health state from fault defect text of power distribution network equipment
CN111737993A (en) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 Method for extracting health state of equipment from fault defect text of power distribution network equipment
CN111695353A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Method, device and equipment for identifying timeliness text and storage medium
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN113190682A (en) * 2021-06-30 2021-07-30 平安科技(深圳)有限公司 Method and device for acquiring event influence degree based on tree model and computer equipment
CN113190682B (en) * 2021-06-30 2021-09-28 平安科技(深圳)有限公司 Method and device for acquiring event influence degree based on tree model and computer equipment
WO2023035787A1 (en) * 2021-09-07 2023-03-16 浙江传媒学院 Text data attribution description and generation method based on text character feature
CN116248375A (en) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium
CN116248375B (en) * 2023-02-01 2023-12-15 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium
CN115883912A (en) * 2023-03-08 2023-03-31 山东水浒文化传媒有限公司 Interaction method and system for internet communication demonstration
CN116564538B (en) * 2023-07-05 2023-12-19 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data
CN116564538A (en) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data

Also Published As

Publication number Publication date
CN107291723A (en) 2017-10-24
CN107291723B (en) 2021-04-30
TW201737118A (en) 2017-10-16
TWI735543B (en) 2021-08-11

Similar Documents

Publication Publication Date Title
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
CN107193959B (en) Pure text-oriented enterprise entity classification method
WO2019218514A1 (en) Method for extracting webpage target information, device, and storage medium
US20180260860A1 (en) A computer-implemented method and system for analyzing and evaluating user reviews
CN108197109A (en) A kind of multilingual analysis method and device based on natural language processing
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN108228541A (en) The method and apparatus for generating documentation summary
CN108009135A (en) The method and apparatus for generating documentation summary
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Barua et al. Multi-class sports news categorization using machine learning techniques: resource creation and evaluation
Budhiraja et al. A supervised learning approach for heading detection
CN113111645B (en) Media text similarity detection method
Roth et al. Feature-based models for improving the quality of noisy training data for relation extraction
Thielmann et al. Coherence based document clustering
CN117216687A (en) Large language model generation text detection method based on ensemble learning
Liu Automatic argumentative-zoning using word2vec
WO2018086518A1 (en) Method and device for real-time detection of new subject
US20170293863A1 (en) Data analysis system, and control method, program, and recording medium therefor
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium
Sarı et al. Classification of Turkish Documents Using Paragraph Vector
CN111159410A (en) Text emotion classification method, system and device and storage medium
Butnaru Machine learning applied in natural language processing

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17773097

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17773097

Country of ref document: EP

Kind code of ref document: A1