WO2017167067A1 - Method and device for web page text classification, and method and device for web page text recognition - Google Patents


Info

Publication number
WO2017167067A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
participle
weight
value
text data
Prior art date
Application number
PCT/CN2017/077489
Other languages
English (en)
Chinese (zh)
Inventor
段秉南
Original Assignee
阿里巴巴集团控股有限公司
段秉南
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited) and 段秉南
Publication of WO2017167067A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web

Definitions

  • the present application relates to the technical field of text classification, and in particular to a method for classifying web page text, a device for classifying web page text, a method for recognizing web page text, and a device for recognizing web page text.
  • Webpage text classification refers to determining, according to the content of massive webpage documents, the category of each corresponding webpage from among predefined topic categories.
  • the technical basis for web page text categorization is content-based plain text categorization.
  • the basic method is to extract the content of the plain text of each webpage text in the captured webpage collection, and obtain the corresponding plain text.
  • the extracted plain text is then combined into a new document collection, and a plain text classification algorithm is applied to the new document collection for classification.
  • That is, the plain-text content information of a webpage is used to classify the webpage.
  • In view of the above problems, embodiments of the present application have been made in order to provide a method for classifying web page text and a method for recognizing web page text that overcome the above problems or at least partially solve them, together with a corresponding device for classifying web page text and a device for recognizing web page text.
  • the embodiment of the present application discloses a method for classifying webpage text, including:
  • the weight is used as the feature vector of the corresponding feature word segment, and the feature vector is used to train the classification model.
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to each predefined category, and the feature value is the degree of discrimination of the basic participle.
  • the feature values of the basic participles are calculated according to the first attribute value and the second attribute value by the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
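  • The formula itself did not survive in this text (it appears only as an image in the original patent). Based on the variable definitions above, and the later statement that the degree of discrimination is obtained from the product of the information gain value and the standard deviation, the calculation might look like the following sketch (the exact form of the standard deviation, population rather than sample, is an assumption):

```python
from math import sqrt

def discrimination_score(ig_score, chi_scores):
    """Degree of discrimination of a base participle.

    ig_score:   information gain value of the participle (igScore)
    chi_scores: chi-square statistic of the participle for each of the
                n predefined categories (chiScore for category 1..n)
    """
    n = len(chi_scores)
    mean = sum(chi_scores) / n
    # population standard deviation of the per-category chi-square values
    std = sqrt(sum((c - mean) ** 2 for c in chi_scores) / n)
    # score = igScore multiplied by the standard deviation
    return ig_score * std
```

A participle whose chi-square values are uniform across categories gets a standard deviation near zero and hence a low score, which is exactly the per-category discrimination the plain information gain algorithm ignores.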
  • the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:
  • the step of calculating corresponding weights of each feature word segment includes:
  • The corresponding weight of each feature word segment is calculated based on the number of its occurrences in the text data of the corresponding webpage and the total number of feature word segments in that text data.
  • The corresponding weight of each feature word segment is calculated by the following formula from the feature value of the feature word segment, the number of times it appears in the text data of the corresponding webpage, and the total number of feature word segments in that text data:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature word segment.
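  • The weight formula is likewise missing from this text. Given the listed variables (tf, n, score), a term-frequency weighting scaled by the discrimination degree is one plausible reading; the exact combination is an assumption, not the patent's confirmed formula:

```python
def feature_weight(tf, n, score):
    """Weight of one feature word segment in one webpage.

    tf:    occurrences of the feature word segment in the page's text data
    n:     total number of feature word segments in the page's text data
    score: degree of discrimination of the feature word segment
    """
    # term frequency scaled by the participle's discrimination degree
    # (assumed combination of the three listed variables)
    return (tf / n) * score
```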
  • the step of calculating a corresponding weight of each feature word segment further includes:
  • the weights of the feature word segments are normalized.
  • the weights of the feature word segmentation are normalized by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
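  • The normalization formula is not reproduced in this text, but the variables listed (weight, min(weight), max(weight)) match standard min-max normalization; a sketch under that assumption:

```python
def normalize_weights(weights):
    """Min-max normalization of the feature-participle weights of one page:
    norm(weight) = (weight - min(weight)) / (max(weight) - min(weight))."""
    lo, hi = min(weights), max(weights)
    if hi == lo:  # all weights equal: avoid division by zero
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]
```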
  • the embodiment of the present application further discloses a method for text recognition of a webpage, including:
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to each predefined category, and the feature value is the degree of discrimination of the basic participle.
  • the step of filtering the feature word segmentation from the basic participle according to the feature value comprises:
  • the step of calculating corresponding weights of each feature word segment includes:
  • The corresponding weight of each feature word segment is calculated based on the number of its occurrences in the text data of the corresponding webpage and the total number of feature word segments in that text data.
  • the step of calculating a corresponding weight of each feature word segment further includes:
  • the weights of the feature word segments are normalized.
  • the embodiment of the present application further discloses an apparatus for classifying webpage texts, including:
  • An acquisition module configured to collect text data in a webpage
  • a word segmentation module for segmenting the text data to obtain a basic participle
  • a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each base participle
  • An eigenvalue calculation module configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value
  • a feature extraction module configured to filter feature segmentation words from the basic participle according to the feature value
  • a feature weight allocation module configured to calculate a corresponding weight of each feature word segmentation
  • the model training module is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
  • the first attribute value is an information gain value of the base participle
  • the second attribute value is the standard deviation of the base participle's chi-square statistic values relative to each predefined category, and the feature value is the degree of discrimination of the basic participle.
  • the feature value calculation module calculates the feature values of the basic participle words according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the feature extraction module comprises:
  • a sorting sub-module for arranging the basic participle according to its corresponding feature value from highest to lowest;
  • an extraction sub-module configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the feature weight allocation module comprises:
  • a number statistics sub-module configured to obtain the number of occurrences of each feature word segment in the text data of the corresponding webpage
  • a calculation submodule configured to calculate the corresponding weight of each feature word segment according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.
  • The calculation sub-module calculates the corresponding weight of each feature word segment according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature word segments in that text data.
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature word segment.
  • the feature weight distribution module further includes:
  • the normalization submodule is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module normalizes the weight of the feature word segment by the following formula:
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
  • the embodiment of the present application further discloses an apparatus for text recognition of a webpage, including:
  • a text extraction module configured to extract text data in the webpage to be identified
  • a word segmentation module for segmenting the text data to obtain a basic participle
  • a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each base participle
  • An eigenvalue calculation module configured to calculate a feature value of each basic participle according to the first attribute value and the second attribute value
  • a feature extraction module configured to filter feature segmentation words from the basic participle according to the feature value
  • a feature weight allocation module configured to calculate a corresponding weight of each feature word segmentation
  • a classification module configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information
  • a marking module configured to mark classification information for the to-be-identified webpage.
  • The embodiment of the present application improves the objectivity and accuracy of feature extraction by improving both the method of extracting feature word segments and the method of calculating their weights, while also taking into account the influence of features on classification. This improves the accuracy of webpage text classification and makes it easier for users to obtain valid information from a large amount of text in a timely and accurate manner.
  • The embodiment of the present application combines at least two feature extraction algorithms and introduces a standard deviation into the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction. Moreover, by using the long-tail distribution to select the number of features and weighting feature word segments by their effect on classification, effective features can be further screened, making webpage text classification more accurate.
  • FIG. 1 is a flow chart showing the steps of a method for classifying web page text according to the present application
  • FIG. 2 is a schematic diagram of a long tail distribution in an example of the present application.
  • FIG. 3 is a flow chart of steps of text recognition of a webpage according to the present application.
  • FIG. 4 is a structural block diagram of an apparatus for classifying web page text according to the present application.
  • FIG. 5 is a structural block diagram of an apparatus for text recognition of a webpage according to the present application.
  • Text categorization is to obtain a mapping rule between a category and an unknown text by training a certain set of texts, that is, calculating the relevance of the text and the category, and then determining the category attribution of the text according to the trained classifier.
  • Text categorization is a supervised learning process: it finds a relational model (classifier) between text attributes (features) and text categories based on a set of training texts that have already been annotated, and then uses the relational model obtained by this learning to judge the category of new text.
  • the process of text categorization can be divided into two parts: training and classification.
  • the purpose of training is to construct a classification model for the classification by linking the new text and categories.
  • the classification process is a process of classifying unknown texts based on training results, giving a category identification.
  • FIG. 1 a flow chart of steps of a method for classifying web page texts according to the present application is shown. Specifically, the method may include the following steps:
  • Step 101 Collect text data in a webpage
  • This step obtains the text data of webpages used for training the classification model, which may be massive.
  • The usual processing method is to extract the plain-text content of each webpage text in the captured webpage collection, thereby obtaining corresponding plain text, and then combine the extracted plain text into a new document collection; this document collection is the webpage text data referred to in this application.
  • Step 102 Perform word segmentation on the text data to obtain a basic participle
  • English is based on words, with words separated by spaces, while Chinese is based on characters: all the characters in a sentence combine to describe a meaning. For example, the English sentence "I am a student" is, in Chinese, "我是一个学生". A computer can easily tell from the spaces that "student" is one word, but it is not obvious that the characters "学" and "生" combine to represent one word.
  • Dividing the Chinese character sequence into meaningful words is Chinese word segmentation. For example, segmenting "我是一个学生" ("I am a student") yields: 我 / 是 / 一个 / 学生.
  • Word segmentation based on string matching refers to matching the Chinese character string to be analyzed against the terms in a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
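  • As an illustration of the string-matching (mechanical) approach described above, a minimal forward-maximum-matching segmenter might look like the following sketch (the dictionary contents and maximum word length are hypothetical inputs, not defined by the application):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by greedily matching the longest dictionary word
    starting at the current position (mechanical segmentation)."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing matches
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words
```

For example, with a dictionary containing 我 and 学生, "我是学生" is segmented as 我 / 是 / 学生, with 是 emitted as a single-character fallback. Real segmentation systems use this mechanical pass only as a preliminary step.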
  • the actual word segmentation system uses mechanical segmentation as a preliminary method, and further improves the accuracy of segmentation by using various other language information.
  • The word segmentation method based on feature scanning or mark segmentation gives priority to identifying and splitting out words with obvious features in the string to be analyzed. Using these words as breakpoints, the original string is divided into smaller strings that then undergo mechanical segmentation, reducing the matching error rate. Alternatively, word segmentation is combined with part-of-speech tagging, using the rich part-of-speech information to assist segmentation decisions and, in turn, testing and adjusting the segmentation results during tagging, thereby improving segmentation accuracy.
  • the word segmentation method based on understanding refers to the effect of identifying words by letting the computer simulate the understanding of the sentence.
  • the basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and use syntactic information and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain the syntactic and semantic information about words, sentences, etc. to judge the participle ambiguity. That is, it simulates the process of understanding people's sentences. This method of word segmentation requires a large amount of linguistic knowledge and information.
  • The statistics-based word segmentation method relies on the fact that the frequency or probability of co-occurrence of adjacent characters in Chinese text reflects the credibility of their forming a word. The co-occurrence frequency of character combinations in the corpus is counted and their mutual information is calculated, based on the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects the closeness of the relationship between the characters; when it is above a certain threshold, the character group may be considered to constitute a word. This method only needs to count character frequencies in the corpus and does not require a segmentation dictionary.
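  • The statistics-based approach can be sketched as follows: count adjacent character co-occurrences in a corpus and compute a pointwise mutual information score for each pair. This is a minimal sketch; the exact mutual-information formulation and threshold used in practice are not specified by the application:

```python
from collections import Counter
from math import log

def adjacent_pmi(corpus):
    """Pointwise mutual information of each adjacent character pair.

    A high PMI suggests the two characters X and Y are tightly bound and
    may form a word; pairs above a chosen threshold are accepted as words.
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(chars.values())
    total_pairs = sum(pairs.values())
    pmi = {}
    for xy, count in pairs.items():
        p_xy = count / total_pairs
        p_x = chars[xy[0]] / total_chars
        p_y = chars[xy[1]] / total_chars
        pmi[xy] = log(p_xy / (p_x * p_y))
    return pmi
```

Pairs that co-occur more often than their individual character frequencies would predict receive a positive score; independent pairs score near zero.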
  • The present application does not limit the manner in which the text data is segmented. Word segmentation is performed on the document collection, and all the word segments obtained are the basic participles referred to in this application.
  • the removal process may also be performed in advance for the invalid words in the basic participle, for example, for the stop words.
  • Stop words are words that occur frequently in all types of text (pronouns, prepositions, conjunctions, and the like); these high-frequency words are therefore considered to carry little information useful for classification.
  • Those skilled in the art can also design feature words that need to be deleted before or during feature extraction according to requirements, which need not be limited in this application.
  • Step 103 Calculate a first attribute value and a second attribute value of each basic participle
  • Step 104 Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
  • Step 105 Filter feature tokens from the basic participle according to the feature value
  • the above steps 103-105 relate to the processing of feature selection in text categorization.
  • the original feature space dimension is very high, and there are a lot of redundant features, so feature dimension reduction is needed.
  • Feature selection is one approach to feature dimension reduction. Its basic idea is to score each original feature item independently with an evaluation function, sort by score, and select the several highest-scoring items; alternatively, a threshold is set in advance, features whose metric value falls below the threshold are filtered out, and the remaining candidate features are used as the resulting feature subset.
  • Feature selection algorithms include the document frequency, mutual information, information gain, and χ² statistic (CHI) algorithms.
  • those skilled in the art usually select one of them to select the feature word segmentation.
  • However, using a single such algorithm has drawbacks. Taking the information gain algorithm as an example: information gain compares the amount of information before and after a word segment appears in the text to infer the amount of information the participle carries; that is, the information gain value of a participle indicates the amount of information contained in the participle feature.
  • A segmentation feature with high information gain gives the classifier a larger amount of information, but the existing information gain algorithm only considers the amount of information the feature provides to the classifier overall, ignoring how well different segmentation features discriminate between the individual categories.
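  • As a sketch of the information gain computation described above (a standard entropy-based formulation over per-category document counts; the patent does not reproduce its exact formula, so this is an assumption):

```python
from math import log2

def information_gain(docs_per_class, docs_with_term_per_class):
    """Information gain of a participle over predefined categories.

    docs_per_class[i]:           number of training documents in category i
    docs_with_term_per_class[i]: documents in category i containing the term
    """
    total = sum(docs_per_class)
    with_t = sum(docs_with_term_per_class)
    without_t = total - with_t

    def entropy(counts, denom):
        h = 0.0
        for c in counts:
            if c > 0 and denom > 0:
                p = c / denom
                h -= p * log2(p)
        return h

    # H(C) minus the expected conditional entropy given term presence/absence
    h_c = entropy(docs_per_class, total)
    h_c_t = entropy(docs_with_term_per_class, with_t)
    absent = [n - k for n, k in zip(docs_per_class, docs_with_term_per_class)]
    h_c_not_t = entropy(absent, without_t)
    return h_c - (with_t / total) * h_c_t - (without_t / total) * h_c_not_t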
  • The chi-square statistic is also used to characterize the correlation between two variables, and it considers both the cases where the feature appears and does not appear in a certain class of text. The larger the chi-square statistic, the more relevant the feature is to the class and the more category information it carries; however, the existing χ² statistic (CHI) algorithm over-exaggerates the role of low-frequency words.
  • Therefore, the present application proposes not to rely on a single algorithm but to use at least two algorithms for feature extraction; that is, different algorithms are used to calculate the first attribute value and the second attribute value of each basic participle, for example using the information gain algorithm to calculate the first attribute value and the CHI algorithm to calculate the second attribute value.
  • the first attribute value may be an information gain value of the basic participle
  • the second attribute value may be the standard deviation of the basic participle's chi-square statistic values relative to the predefined categories, and the feature value may be the degree of discrimination of the basic participle; that is, the step 103 may specifically include the following sub-steps:
  • Sub-step 1031 calculating an information gain value of each basic participle
  • Sub-step 1032 calculating a chi-square statistic value of each basic participle
  • Sub-step 1033 Based on the number of predefined categories, compute the standard deviation of each base participle's chi-square statistic values relative to those categories.
  • the step 104 may be: obtaining the discrimination degree of each basic participle based on the product of the information gain value and the standard deviation.
  • the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the application combines at least two feature extraction algorithms and introduces a standard deviation in the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction.
  • the step 105 may specifically include the following sub-steps:
  • Sub-step 1051 the basic participle is arranged according to its corresponding feature value from high to low;
  • Sub-step 1052 extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
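  • Sub-steps 1051 and 1052 can be sketched as follows (the threshold and the preset number of features to keep are the configurable values mentioned above, chosen here arbitrarily for illustration):

```python
def select_features(scores, threshold, top_k):
    """Sort basic participles by discrimination degree (highest first),
    keep those whose score exceeds `threshold`, and truncate to `top_k`.

    scores: mapping of participle -> degree of discrimination (score)
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, s in ranked if s > threshold][:top_k]
```

The surviving participles are the feature word segments used in the later weighting step.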
  • For example, base participles whose abscissa in the long-tail distribution of discrimination degree (see FIG. 2) is greater than 0 and less than 30,000 may be taken as feature word segments.
  • the present application can further screen out the effective features, thereby making the effect of web page text classification more accurate.
  • Step 106 Calculate corresponding weights of each feature participle
  • each feature participle is given a weight indicating the importance of the feature participle in the text.
  • the weights are generally calculated based on the frequency of the feature items.
  • Common calculation methods include the Boolean weight method, the word frequency weight method, the TF/IDF weight method, the TFC weight method, etc.
  • In the TF/IDF weight method, TF indicates the frequency of a feature in a single text and IDF indicates its inverse document frequency across the entire corpus; the influence of the feature on classification is therefore completely ignored.
  • the present application proposes a preferred embodiment for calculating weights.
  • the step 106 may include the following sub-steps:
  • Sub-step 1061 obtaining the number of times each feature participle appears in the text data of the corresponding webpage
  • Sub-step 1062 counting the total number of feature word segments in the text data of the webpage
  • Sub-step 1063 According to the feature value of each feature word segment, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature word segments in that text data, calculate the corresponding weight of each feature word segment.
  • the sub-step 1063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature word segment.
  • Preferably, the step 106 further includes the following sub-steps:
  • Sub-step 1064 normalizing the weights of the feature word segments.
  • the weight of the feature word segmentation can be normalized by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
  • weights used in the examples of the present application take into account the influence of features on the classification, and thus can further improve the effectiveness of feature selection.
  • The corresponding weights of each feature word segment calculated above can be used as the feature vector of a text; after obtaining the feature vector, a text classification algorithm can be selected to train the classification model.
  • Step 107 The weight is used as the feature vector of the corresponding feature word segment, and the feature vector is used to train the classification model.
  • Suitable classification algorithms include, for example, the naive Bayes probability algorithm, support vector machines, and the KNN (k-nearest-neighbor) algorithm.
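  • The application does not fix a particular training algorithm. As a stand-in for the algorithms listed above, a minimal nearest-centroid classifier over the normalized weight vectors illustrates the train/classify split (a sketch only, not the patent's method; in practice naive Bayes, SVM, or KNN would be used):

```python
from collections import defaultdict

def train_centroid_model(vectors, labels):
    """Train by averaging the feature vectors of each category.

    vectors: list of equal-length feature-weight vectors
    labels:  category label of each vector
    """
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for v, y in zip(vectors, labels):
        sums[y] = list(v) if sums[y] is None else [a + b for a, b in zip(sums[y], v)]
        counts[y] += 1
    return {y: [a / counts[y] for a in s] for y, s in sums.items()}

def classify(model, vector):
    """Assign the category whose centroid is nearest (squared Euclidean)."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(model, key=lambda y: dist2(model[y], vector))
```

Training corresponds to Step 107 (building the model from weighted feature vectors); `classify` corresponds to the classification phase described earlier.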
  • The embodiment of the present application improves the objectivity and accuracy of feature extraction by improving both the method of extracting feature word segments and the method of calculating their weights, while also taking into account the influence of features on classification. This improves the accuracy of webpage text classification and makes it easier for users to obtain valid information from a large amount of text in a timely and accurate manner.
  • FIG. 3 a flowchart of an embodiment of a method for text recognition of a webpage according to the present application is shown. Specifically, the method may include the following steps:
  • Step 301 Extract text data in the webpage to be identified
  • Step 302 performing segmentation on the text data to obtain a basic participle
  • Step 303 Calculate a first attribute value and a second attribute value of each basic participle
  • Step 304 Calculate a feature value of each basic participle according to the first attribute value and the second attribute value;
  • Step 305 Filter feature feature words from the basic participle according to the feature value
  • Step 306 calculating corresponding weights of each feature participle
  • Step 307 Enter the weight as a feature vector into a pre-trained classification model to obtain classification information.
  • Step 308 Mark classification information for the to-be-identified webpage.
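  • Steps 301 through 308 can be sketched as a single pipeline. All helper callables here (`segment`, `score_fn`, `weight_fn`, `model_predict`) are hypothetical stand-ins for the components described elsewhere in the application, not APIs it defines:

```python
def recognize_webpage(text, segment, score_fn, weight_fn, model_predict):
    """Sketch of the webpage text recognition flow (Steps 301-308)."""
    participles = segment(text)                            # Steps 301-302
    scored = {w: score_fn(w) for w in set(participles)}    # Steps 303-304
    features = [w for w, s in sorted(scored.items(),       # Step 305
                                     key=lambda kv: kv[1],
                                     reverse=True) if s > 0]
    vector = [weight_fn(w, participles) for w in features]  # Step 306
    return model_predict(vector)                           # Steps 307-308
```

The returned label is the classification information that would be marked on the webpage to be identified.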
  • the first attribute value may be an information gain value of the basic participle
  • the second attribute value may be the standard deviation of the basic participle's chi-square statistic values relative to the predefined categories
  • the feature value may be the degree of discrimination of the basic participle.
  • the feature values of the basic participles may be calculated according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the base participle with respect to each of the predefined categories
  • n is the number of predefined classifications.
  • the step 305 may include the following sub-steps:
  • Sub-step 3051 the basic participle is arranged according to its corresponding feature value from high to low;
  • Sub-step 3052 extracting a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the step 306 may include the following sub-steps:
  • Sub-step 3061 obtaining the number of occurrences of each feature participle in the text data of the corresponding webpage
  • Sub-step 3062 counting the total number of feature word segments in the text data of the webpage
  • Sub-step 3063 According to the feature value of each feature word segment, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature word segments in that text data, calculate the corresponding weight of each feature word segment.
  • the sub-step 3063 may specifically calculate the corresponding weight of each feature word segment by using the following formula:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature word segment.
  • step 306 further includes the following sub-steps:
  • Sub-step 3064 normalizing the weights of the feature word segments.
  • the weight of the feature word segmentation can be normalized by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
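The variables norm(weight), min(weight), and max(weight) point to standard min-max normalization, although the document does not reproduce the formula itself. A sketch under that assumption:

```python
def normalize_weights(weights):
    """Min-max normalize the feature-participle weights of one webpage:
    norm(weight) = (weight - min(weight)) / (max(weight) - min(weight)).

    The subtraction form is the standard min-max reading of the variables
    the text defines; it is an assumption, since the formula itself is
    not reproduced in this text.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:                 # all weights equal: avoid division by zero
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]
```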
  • The corresponding weights of the feature participles obtained above can be used as the feature vector of a text. After the feature vector is obtained, it can be input into the classification model pre-generated according to the process shown in Figure 1 to obtain the classification information of the current feature vector, and finally the webpage to be identified can be marked with the corresponding classification information.
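The classification step can be sketched as laying the per-participle weights out in a fixed vocabulary order and handing the resulting vector to a pre-trained model. Everything here is illustrative: the vocabulary, the stub nearest-reference classifier, and its `predict` interface are assumptions standing in for the model of Figure 1, not the patent's implementation.

```python
def build_feature_vector(weights_by_word, vocabulary):
    """Lay out per-participle weights in a fixed vocabulary order,
    using 0.0 for feature participles absent from this webpage."""
    return [weights_by_word.get(w, 0.0) for w in vocabulary]

class StubModel:
    """Placeholder for the pre-trained classification model; predicts
    the label whose reference vector is nearest (illustration only)."""
    def __init__(self, reference):
        self.reference = reference          # {label: reference vector}

    def predict(self, vector):
        def dist(label):
            ref = self.reference[label]
            return sum((a - b) ** 2 for a, b in zip(vector, ref))
        return min(self.reference, key=dist)
```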
  • Referring to FIG. 4, a structural block diagram of an embodiment of an apparatus for webpage text classification of the present application is shown, which may specifically include the following modules:
  • the collecting module 401 is configured to collect text data in the webpage
  • a word segmentation module 402 configured to perform segmentation on the text data to obtain a basic participle
  • the word segment attribute calculation module 403 is configured to calculate a first attribute value and a second attribute value of each base participle;
  • the feature value calculation module 404 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
  • the feature extraction module 405 is configured to filter the feature word segmentation from the basic participle according to the feature value
  • a feature weight assignment module 406, configured to calculate a corresponding weight of each feature participle
  • the model training module 407 is configured to use the weight as a feature vector of the corresponding feature word segment, and use the feature vector to train the classification model.
  • the first attribute value may be an information gain value of the basic participle
  • the second attribute value may be the standard deviation of the chi-square statistic values of the basic participle with respect to each predefined category; the feature value may be the degree of discrimination of the basic participle.
  • the feature value calculation module 404 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the basic participle with respect to each predefined category
  • n is the number of predefined classifications.
  • the feature extraction module 405 may include the following sub-modules:
  • a sorting sub-module 4051 configured to rank the basic participle according to its corresponding feature value from high to low;
  • the extracting sub-module 4052 is configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature segmentation words.
  • the feature weight assignment module 406 can include the following sub-modules:
  • the number of statistics sub-module 4061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
  • a segmentation total number statistics sub-module 4062 configured to count the total number of feature word segments in the text data of the webpage
  • the calculation sub-module 4063 is configured to calculate the corresponding weight of each feature participle according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
  • the calculation sub-module 4063 may calculate the corresponding weight of each feature participle by using the following formula, according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the feature weight assignment module 406 may further include the following sub-modules:
  • the normalization sub-module 4064 is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module 4064 may normalize the weight of the feature word segment by the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
  • Since the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple; for relevant parts, reference may be made to the description of the method embodiment.
  • Referring to FIG. 5, a structural block diagram of an embodiment of an apparatus for webpage text recognition of the present application is shown, which may specifically include the following modules:
  • a text extraction module 501 configured to extract text data in a webpage to be identified
  • a word segmentation module 502 configured to perform segmentation on the text data to obtain a basic participle
  • the word segment attribute calculation module 503 is configured to calculate a first attribute value and a second attribute value of each base participle;
  • the feature value calculation module 504 is configured to calculate feature values of each basic participle according to the first attribute value and the second attribute value;
  • the feature extraction module 505 is configured to filter the feature word segmentation from the basic participle according to the feature value
  • a feature weight assignment module 506, configured to calculate a corresponding weight of each feature participle
  • a classification module 507 configured to input the weight as a feature vector into a pre-trained classification model to obtain classification information
  • the marking module 508 is configured to mark the classification information for the to-be-identified webpage.
  • the first attribute value may be an information gain value of the basic participle
  • the second attribute value may be the standard deviation of the chi-square statistic values of the basic participle with respect to each predefined category; the feature value may be the degree of discrimination of the basic participle.
  • the feature value calculation module 504 may calculate the feature value of each basic participle according to the first attribute value and the second attribute value by using the following formula:
  • score is the degree of discrimination of the base participle
  • igScore is the information gain value of the base participle
  • chiScore is the chi-square statistic value of the basic participle with respect to each predefined category
  • n is the number of predefined classifications.
  • the feature extraction module 505 can include the following sub-modules:
  • a sorting sub-module 5051 configured to rank the basic participle according to its corresponding feature value from high to low;
  • An extraction sub-module 5052 configured to extract a preset number of basic participles whose feature values are higher than a preset threshold as feature participles.
  • the feature weight assignment module 506 can include the following sub-modules:
  • the number of statistics sub-module 5061 is configured to obtain the number of occurrences of each feature participle in the text data of the corresponding webpage;
  • a segmentation total number statistics sub-module 5062 configured to count the total number of feature word segments in the text data of the webpage
  • the calculation sub-module 5063 is configured to calculate the corresponding weight of each feature participle according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage.
  • the calculation sub-module 5063 may calculate the corresponding weight of each feature participle by using the following formula, according to its feature value, the number of its occurrences in the text data of the corresponding webpage, and the total number of feature participles in the text data of the webpage:
  • weight is the weight of the feature word segment
  • tf is the number of times the feature word segment appears in the text data of the corresponding web page
  • n is the total number of feature word segments in the text data of the web page
  • score is the degree of discrimination of the feature participle.
  • the feature weight distribution module 506 may further include the following sub-modules:
  • the normalization sub-module 5064 is configured to normalize the weight of the feature word segmentation.
  • the normalization sub-module 5064 may normalize the weight of the feature participle by using the following formula:
  • norm(weight) is the weight after normalization
  • weight is the weight of the feature participle
  • min(weight) is the minimum weight value in the text data in the webpage
  • max(weight) is the maximum weight value in the text data in the webpage.
  • Since the apparatus embodiment is basically similar to the method embodiment, its description is relatively simple; for relevant parts, reference may be made to the description of the method embodiment.
  • Those skilled in the art should understand that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, Magnetic cassette tape, magnetic tape storage or other magnetic storage device or Any other non-transportable medium that can be used to store information that can be accessed by a computing device.
  • As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device, the instruction device implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • The method for classifying webpage text, the apparatus for classifying webpage text, the method for recognizing webpage text, and the apparatus for recognizing webpage text provided by the present application have been described in detail above. Specific examples are applied herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Abstract

Disclosed are a method and an apparatus for webpage text classification, and a method and an apparatus for webpage text recognition. The method for webpage text classification comprises: collecting text data in a webpage (101); performing word segmentation on the text data to obtain basic participles (102); calculating a first attribute value and a second attribute value of each basic participle (103); calculating a feature value of each basic participle according to the first attribute value and the second attribute value (104); filtering feature participles from the basic participles according to the feature values (105); calculating a corresponding weight of each feature participle (106); and using the weights as a feature vector of the corresponding feature participles, and using the feature vector to train a classification model (107). The method and apparatus of the present invention effectively measure the objectivity and accuracy of feature extraction while also taking into account the influence of a feature on classification, thereby improving the accuracy of webpage text classification and further helping a user obtain effective information accurately and in a timely manner from a massive amount of text.
PCT/CN2017/077489 2016-03-30 2017-03-21 Procédé et dispositif pour une classification de texte de page internet, procédé et dispositif pour une reconnaissance de texte de page internet WO2017167067A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610195483.4 2016-03-30
CN201610195483.4A CN107291723B (zh) 2016-03-30 2016-03-30 网页文本分类的方法和装置,网页文本识别的方法和装置

Publications (1)

Publication Number Publication Date
WO2017167067A1 true WO2017167067A1 (fr) 2017-10-05

Family

ID=59962602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077489 WO2017167067A1 (fr) 2016-03-30 2017-03-21 Procédé et dispositif pour une classification de texte de page internet, procédé et dispositif pour une reconnaissance de texte de page internet

Country Status (3)

Country Link
CN (1) CN107291723B (fr)
TW (1) TWI735543B (fr)
WO (1) WO2017167067A1 (fr)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053251A (zh) * 2017-12-18 2018-05-18 北京小度信息科技有限公司 信息处理方法、装置、电子设备及计算机可读存储介质
CN108255797A (zh) * 2018-01-26 2018-07-06 上海康斐信息技术有限公司 一种文本模式识别方法及系统
CN108334630A (zh) * 2018-02-24 2018-07-27 上海康斐信息技术有限公司 一种url分类方法及系统
CN108415959A (zh) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 一种文本分类方法及装置
CN110334342A (zh) * 2019-06-10 2019-10-15 阿里巴巴集团控股有限公司 词语重要性的分析方法及装置
CN110427628A (zh) * 2019-08-02 2019-11-08 杭州安恒信息技术股份有限公司 基于神经网络算法的web资产分类检测方法及装置
CN110705290A (zh) * 2019-09-29 2020-01-17 新华三信息安全技术有限公司 一种网页分类方法及装置
CN110837735A (zh) * 2019-11-17 2020-02-25 太原蓝知科技有限公司 一种数据智能分析识别方法及系统
CN111159589A (zh) * 2019-12-30 2020-05-15 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111695353A (zh) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 时效性文本的识别方法、装置、设备及存储介质
CN111737993A (zh) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN112200259A (zh) * 2020-10-19 2021-01-08 哈尔滨理工大学 一种基于分类与筛选的信息增益文本特征选择方法及分类装置
CN113190682A (zh) * 2021-06-30 2021-07-30 平安科技(深圳)有限公司 基于树模型的事件影响度获取方法、装置及计算机设备
WO2023035787A1 (fr) * 2021-09-07 2023-03-16 浙江传媒学院 Procédé de description et de génération d'attribution de données de texte basé sur une caractéristique de caractère de texte
CN115883912A (zh) * 2023-03-08 2023-03-31 山东水浒文化传媒有限公司 一种用于互联网交流演示的互动方法及系统
CN116248375A (zh) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN116564538A (zh) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 一种基于大数据的医院就医信息实时查询方法及系统

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844553B (zh) * 2017-10-31 2021-07-27 浪潮通用软件有限公司 一种文本分类方法及装置
CN108090178B (zh) * 2017-12-15 2020-08-25 北京锐安科技有限公司 一种文本数据分析方法、装置、服务器和存储介质
CN110287316A (zh) * 2019-06-04 2019-09-27 深圳前海微众银行股份有限公司 一种告警分类方法、装置、电子设备及存储介质
CN111476025B (zh) * 2020-02-28 2021-01-08 开普云信息科技股份有限公司 一种面向政府领域新词自动发现的实现方法、分析模型及其系统
CN111753525B (zh) * 2020-05-21 2023-11-10 浙江口碑网络技术有限公司 文本分类方法、装置及设备
CN112667817B (zh) * 2020-12-31 2022-05-31 杭州电子科技大学 一种基于轮盘赌属性选择的文本情感分类集成系统
CN113127595B (zh) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 研报摘要的观点详情提取方法、装置、设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055183A1 (en) * 2007-08-24 2009-02-26 Siemens Medical Solutions Usa, Inc. System and Method for Text Tagging and Segmentation Using a Generative/Discriminative Hybrid Hidden Markov Model
CN103914478A (zh) * 2013-01-06 2014-07-09 阿里巴巴集团控股有限公司 网页训练方法及系统、网页预测方法及系统
CN104899310A (zh) * 2015-06-12 2015-09-09 百度在线网络技术(北京)有限公司 信息排序方法、用于生成信息排序模型的方法及装置
CN105426360A (zh) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 一种关键词抽取方法及装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809548B2 (en) * 2004-06-14 2010-10-05 University Of North Texas Graph-based ranking algorithms for text processing
TWI427492B (zh) * 2007-01-15 2014-02-21 Hon Hai Prec Ind Co Ltd 資訊搜尋系統及方法
CN103995876A (zh) * 2014-05-26 2014-08-20 上海大学 一种基于卡方统计和smo算法的文本分类方法
CN104346459B (zh) * 2014-11-10 2017-10-27 南京信息工程大学 一种基于术语频率和卡方统计的文本分类特征选择方法
CN105224695B (zh) * 2015-11-12 2018-04-20 中南大学 一种基于信息熵的文本特征量化方法和装置及文本分类方法和装置


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, XIAOHONG: "Feature extraction methods for Chinese text classification", COMPUTER ENGINEERING AND DESIGN, vol. 30, no. 17, 31 December 2009 (2009-12-31), ISSN: 1000-7024 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053251B (zh) * 2017-12-18 2021-03-02 北京小度信息科技有限公司 信息处理方法、装置、电子设备及计算机可读存储介质
CN108053251A (zh) * 2017-12-18 2018-05-18 北京小度信息科技有限公司 信息处理方法、装置、电子设备及计算机可读存储介质
CN108255797A (zh) * 2018-01-26 2018-07-06 上海康斐信息技术有限公司 一种文本模式识别方法及系统
CN108415959A (zh) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 一种文本分类方法及装置
CN108415959B (zh) * 2018-02-06 2021-06-25 北京捷通华声科技股份有限公司 一种文本分类方法及装置
CN108334630A (zh) * 2018-02-24 2018-07-27 上海康斐信息技术有限公司 一种url分类方法及系统
CN110334342A (zh) * 2019-06-10 2019-10-15 阿里巴巴集团控股有限公司 词语重要性的分析方法及装置
CN110334342B (zh) * 2019-06-10 2024-02-09 创新先进技术有限公司 词语重要性的分析方法及装置
CN110427628A (zh) * 2019-08-02 2019-11-08 杭州安恒信息技术股份有限公司 基于神经网络算法的web资产分类检测方法及装置
CN110705290A (zh) * 2019-09-29 2020-01-17 新华三信息安全技术有限公司 一种网页分类方法及装置
CN110837735B (zh) * 2019-11-17 2023-11-03 内蒙古中媒互动科技有限公司 一种数据智能分析识别方法及系统
CN110837735A (zh) * 2019-11-17 2020-02-25 太原蓝知科技有限公司 一种数据智能分析识别方法及系统
CN111159589A (zh) * 2019-12-30 2020-05-15 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111159589B (zh) * 2019-12-30 2023-10-20 中国银联股份有限公司 分类字典建立方法、商户数据分类方法、装置及设备
CN111737993B (zh) * 2020-05-26 2024-04-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN111737993A (zh) * 2020-05-26 2020-10-02 浙江华云电力工程设计咨询有限公司 一种配电网设备的故障缺陷文本提取设备健康状态方法
CN111695353A (zh) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 时效性文本的识别方法、装置、设备及存储介质
CN112200259A (zh) * 2020-10-19 2021-01-08 哈尔滨理工大学 一种基于分类与筛选的信息增益文本特征选择方法及分类装置
CN113190682A (zh) * 2021-06-30 2021-07-30 平安科技(深圳)有限公司 基于树模型的事件影响度获取方法、装置及计算机设备
CN113190682B (zh) * 2021-06-30 2021-09-28 平安科技(深圳)有限公司 基于树模型的事件影响度获取方法、装置及计算机设备
WO2023035787A1 (fr) * 2021-09-07 2023-03-16 浙江传媒学院 Procédé de description et de génération d'attribution de données de texte basé sur une caractéristique de caractère de texte
CN116248375A (zh) * 2023-02-01 2023-06-09 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN116248375B (zh) * 2023-02-01 2023-12-15 北京市燃气集团有限责任公司 一种网页登录实体识别方法、装置、设备和存储介质
CN115883912A (zh) * 2023-03-08 2023-03-31 山东水浒文化传媒有限公司 一种用于互联网交流演示的互动方法及系统
CN116564538B (zh) * 2023-07-05 2023-12-19 肇庆市高要区人民医院 一种基于大数据的医院就医信息实时查询方法及系统
CN116564538A (zh) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 一种基于大数据的医院就医信息实时查询方法及系统

Also Published As

Publication number Publication date
TWI735543B (zh) 2021-08-11
CN107291723A (zh) 2017-10-24
TW201737118A (zh) 2017-10-16
CN107291723B (zh) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2017167067A1 (fr) Procédé et dispositif pour une classification de texte de page internet, procédé et dispositif pour une reconnaissance de texte de page internet
WO2019218514A1 (fr) Procédé permettant d'extraire des informations cibles de page web, dispositif et support d'informations
US20180260860A1 (en) A computer-implemented method and system for analyzing and evaluating user reviews
CN108197109A (zh) 一种基于自然语言处理的多语言分析方法和装置
WO2022095374A1 (fr) Procédé et appareil d'extraction de mots-clés, ainsi que dispositif terminal et support de stockage
WO2015149533A1 (fr) Procédé et dispositif de traitement de segmentation de mots en fonction d'un classement de contenus de pages web
CN110083832B (zh) 文章转载关系的识别方法、装置、设备及可读存储介质
CN108228541A (zh) 生成文档摘要的方法和装置
CN108009135A (zh) 生成文档摘要的方法和装置
CN108228612B (zh) 一种提取网络事件关键词以及情绪倾向的方法及装置
Barua et al. Multi-class sports news categorization using machine learning techniques: resource creation and evaluation
Budhiraja et al. A supervised learning approach for heading detection
CN113111645B (zh) 一种媒体文本相似性检测方法
Roth et al. Feature-based models for improving the quality of noisy training data for relation extraction
Thielmann et al. Coherence based document clustering
CN117216687A (zh) 一种基于集成学习的大语言模型生成文本检测方法
Liu Automatic argumentative-zoning using word2vec
WO2018086518A1 (fr) Procédé et dispositif de détection en temps réel d'un nouveau sujet
US20170293863A1 (en) Data analysis system, and control method, program, and recording medium therefor
CN116029280A (zh) 一种文档关键信息抽取方法、装置、计算设备和存储介质
CN113761125A (zh) 动态摘要确定方法和装置、计算设备以及计算机存储介质
Sarı et al. Classification of Turkish Documents Using Paragraph Vector
CN111159410A (zh) 一种文本情感分类方法、系统、装置及存储介质
Butnaru Machine learning applied in natural language processing
Dai et al. Improving scientific relation classification with task specific supersense

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17773097

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17773097

Country of ref document: EP

Kind code of ref document: A1