CN108108346B

CN108108346B - Method and device for extracting theme characteristic words of document

Info

Publication number: CN108108346B
Application number: CN201611062893.8A
Authority: CN
Inventors: 余虎; 张郭强; 林伟亮
Original assignee: Guangdong Eshore Technology Co Ltd
Current assignee: Guangdong Eshore Technology Co Ltd
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2021-12-24
Anticipated expiration: 2036-11-25
Also published as: CN108108346A

Abstract

The invention discloses a method and a device for extracting topic feature words of a document. The method comprises the following steps: importing a group of classified documents, wherein the documents have Chinese text data; performing word segmentation preprocessing on the Chinese text data of the documents to obtain a plurality of word segmentation phrases; performing feature selection on the word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases; and filtering the feature phrases according to preset theme features to obtain theme feature words. The technical scheme prevents irrelevant feature words from influencing the document theme, yields accurate theme feature words and makes the document easier to search. The method and the device improve the accuracy of theme feature word selection, avoid omitting or over-selecting feature words, and improve the accuracy of document searching, thereby improving the search experience of users.

Description

Method and device for extracting theme characteristic words of document

Technical Field

The invention relates to the technical field of document searching, in particular to a method and a device for extracting topic characteristic words of a document.

Background

With the continuous development of network technology, searching databases and library documents through websites has gradually replaced the manual lookup of books. To search for a document through a website, the topic feature words of the document need to be extracted. In one prior-art method for extracting the topic feature words of a document, the text of the document is segmented, and feature words are then extracted according to a feature word extraction algorithm. This scheme can only achieve fuzzy matching of feature words; the feature words obtained are not very representative and cannot fully represent the features of the topic. In another scheme, a filtering step is added after the text of the document is classified, and feature words are then extracted from the filtered result. This scheme can filter out some invalid feature words, but the filtering applies to all topics alike and cannot be targeted at a particular topic; the result may omit features of some topics, and the feature words obtained are not comprehensive enough.

Disclosure of Invention

In order to solve at least one of the above technical problems, a primary object of the present invention is to provide a method for extracting topic feature words from a document.

In order to achieve the purpose, the invention adopts a technical scheme that: a method for extracting topic characteristic words of a document is provided, which comprises the following steps:

importing a group of classified documents, wherein the documents have Chinese text data;

performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;

performing characteristic selection on the multiple word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;

and filtering the feature phrases according to preset theme features to obtain theme feature words.

Preferably, the step of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically includes:

performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;

performing part-of-speech screening according to the part of speech of the phrases to obtain phrases with strong parts of speech;

comparing the phrases with a preset stop word stock to obtain word segmentation phrases;

and outputting the word segmentation phrases.

Preferably, the step of comparing the phrases with a preset stop word stock to obtain word segmentation phrases specifically includes:

determining whether the phrase is contained in the preset stop word stock,

if the phrase is contained in the preset stop word stock, the phrase is rejected,

if the phrase is not contained in the preset stop word stock, the phrase is kept and used as a word segmentation phrase.
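The stop-word comparison above amounts to a membership test. The following is a minimal sketch, assuming the stop word stock is held in memory as a Python set; the function name `remove_stop_words` and the sample words are illustrative and do not come from the patent.

```python
def remove_stop_words(phrases, stop_words):
    """Keep only the phrases that are not found in the stop word stock."""
    stop_set = set(stop_words)  # set lookup keeps each membership test cheap
    return [p for p in phrases if p not in stop_set]


# Hypothetical sample data, for illustration only.
candidates = ["网络", "我们", "检索", "一个", "文档"]
stop_words = {"我们", "一个"}
kept = remove_stop_words(candidates, stop_words)
print(kept)
```

Phrases found in the stop word stock are rejected; the remaining ones keep their original order.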

Preferably, the step of performing feature selection on the plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically includes:

calculating the word frequency of all word segmentation phrases under each theme;

calculating mutual information of each word segmentation phrase and each theme;

and selecting feature values according to the category information of the word segmentation phrases and the calculated word frequency and mutual information of the word segmentation phrases to obtain the feature phrases.
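The first two steps above can be sketched as follows. The patent does not state how mutual information is estimated, so this sketch assumes the pointwise form log(P(t, c) / (P(t) P(c))) estimated from document counts, which is one common choice in text classification. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_frequencies(docs):
    """docs: list of (topic, [phrase, ...]). Returns {topic: Counter of phrase counts}."""
    tf = defaultdict(Counter)
    for topic, phrases in docs:
        tf[topic].update(phrases)
    return tf


def pointwise_mi(docs, phrase, topic):
    """Estimate log(P(t, c) / (P(t) * P(c))) from document counts (assumed estimator)."""
    n = len(docs)
    n_t = sum(1 for _, ps in docs if phrase in ps)   # documents containing the phrase
    n_c = sum(1 for c, _ in docs if c == topic)      # documents in the topic
    n_tc = sum(1 for c, ps in docs if c == topic and phrase in ps)
    if n_tc == 0:
        return float("-inf")  # the phrase never occurs in this topic
    return math.log((n_tc * n) / (n_t * n_c))


# Hypothetical classified documents, already segmented.
docs = [
    ("体育", ["比赛", "球队", "比赛"]),
    ("体育", ["球队", "冠军"]),
    ("财经", ["股票", "市场"]),
    ("财经", ["市场", "比赛"]),
]
tf = term_frequencies(docs)
mi_team = pointwise_mi(docs, "球队", "体育")   # occurs only in 体育 documents
mi_match = pointwise_mi(docs, "比赛", "体育")  # occurs in both topics
```

In this sample, the phrase that appears only in one topic receives a positive score, while the phrase spread evenly over both topics scores zero.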

Preferably, the step of filtering the feature phrases according to preset theme features to obtain theme feature words specifically includes:

selecting any one theme from a plurality of themes as a filtering theme;

acquiring a selected phrase to be filtered from a preset filtering word bank according to a filtering theme;

and successively traversing the feature phrases, comparing the feature phrases with the selected phrases, and deleting a feature phrase if it exists among the selected phrases, so as to screen out the theme feature words.

In order to achieve the purpose, the invention adopts another technical scheme that: provided is a document theme characteristic word extraction device, including:

the importing module is used for importing a group of classified documents, wherein the documents have Chinese text data;

the preprocessing module is used for carrying out word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;

the selecting module is used for performing characteristic selection on the word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;

and the filtering module is used for filtering the feature phrases according to the preset theme features to obtain theme feature words.

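Preferably, the preprocessing module is specifically configured to:

perform word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;

perform part-of-speech screening according to the part of speech of the phrases to obtain phrases with strong parts of speech;

compare the phrases with a preset stop word stock to obtain word segmentation phrases;

and output the word segmentation phrases.

Preferably, the preprocessing module is further configured to:

determine whether the phrase is contained in the preset stop word stock,

reject the phrase if it is contained in the preset stop word stock,

and keep the phrase as a word segmentation phrase if it is not contained in the preset stop word stock.

Preferably, the selecting module is configured to:

calculate the word frequency of all word segmentation phrases under each theme;

calculate mutual information of each word segmentation phrase and each theme;

and select feature values according to the category information of the word segmentation phrases and the calculated word frequency and mutual information of the word segmentation phrases to obtain the feature phrases.

Preferably, the filtering module is configured to:

select any one theme from a plurality of themes as a filtering theme;

acquire the selected phrases to be filtered from a preset filtering word bank according to the filtering theme;

and successively traverse the feature phrases, compare the feature phrases with the selected phrases, and delete a feature phrase if it exists among the selected phrases, so as to screen out the theme feature words.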

According to the technical scheme, the Chinese text data of the document is subjected to word segmentation processing, then the feature selection is carried out on a plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain the feature phrases, and finally the feature phrases are subjected to filtering processing according to the preset subject features to obtain the subject feature words.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for extracting topic feature words from a document according to an embodiment of the present invention;

FIG. 2 is a block diagram of a topic feature word extraction apparatus according to another embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second" and the like in the invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.

Referring to fig. 1, in the embodiment of the present invention, the method for extracting topic feature words from a document includes the following steps:

step S10, importing a group of classified documents, wherein the documents have Chinese text data;

step S20, performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;

step S30, selecting characteristics of a plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;

and step S40, filtering the feature phrases according to preset theme features to obtain theme feature words.

In the embodiment of the invention, a group of documents classified by subject is imported; each document belongs to only one subject and has Chinese text data. The word segmentation phrases are mainly noun and verb phrases, and the word segmentation preprocessing can remove auxiliary words, conjunctions, adverbs and the like. Because a large number of phrases remain after the word segmentation preprocessing, further means can be used to process them; for the specific scheme, refer to the following embodiments. Feature selection can be performed on the word segmentation phrases according to the word frequency, the category information and the mutual information, so that a smaller number of feature phrases is obtained. Finally, since many feature phrases may still remain, the feature phrases can be filtered according to preset theme features to obtain theme feature words, which can greatly improve searching accuracy and is convenient for the user.
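The import step described at the start of this paragraph can be pictured as grouping labelled texts by subject. The following is a minimal sketch, assuming the classified documents arrive as (topic, text) pairs; the function name `group_by_topic` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_by_topic(records):
    """Group (topic, text) pairs so that each text is stored under its single topic."""
    by_topic = defaultdict(list)
    for topic, text in records:
        by_topic[topic].append(text)
    return dict(by_topic)


# Hypothetical classified documents with Chinese text data.
records = [
    ("体育", "球队赢得了比赛"),
    ("财经", "股票市场今日上涨"),
    ("体育", "冠军球队返回主场"),
]
corpus = group_by_topic(records)
```

Because each record carries exactly one topic label, every document ends up under one subject only, as the embodiment requires.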

In a specific embodiment, the step S20 of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically includes:

performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;

performing part-of-speech screening according to the part of speech of the phrases to obtain phrases with strong parts of speech;

comparing the phrases with a preset stop word stock to obtain word segmentation phrases;

and outputting the word segmentation phrases.

In this embodiment, the word segmentation algorithm may be used to divide the Chinese text data into verbs, nouns, adverbs, conjunctions and the like. Phrases with weak parts of speech, such as adverbs, conjunctions and punctuation marks, may then be removed according to the part of speech of each phrase, leaving phrases with strong parts of speech, such as verbs and nouns. Because a large number of strong part-of-speech phrases are still obtained, they need to be compared with the phrases in the stop word stock, and the phrases not contained in the stop word stock are kept as word segmentation phrases.
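The part-of-speech screening can be sketched as below. The patent does not name a segmenter or a tag set; this sketch assumes the segmenter has already produced (word, tag) pairs with ICTCLAS-style tags in which noun tags start with `n` and verb tags start with `v` (a tagger such as `jieba.posseg` emits tags of this kind). The function name `screen_by_pos` and the sample data are hypothetical.

```python
STRONG_POS_PREFIXES = ("n", "v")  # assumed tag scheme: n* = noun, v* = verb


def screen_by_pos(tagged_phrases, strong_prefixes=STRONG_POS_PREFIXES):
    """Keep phrases whose part-of-speech tag marks them as a noun or a verb."""
    return [word for word, tag in tagged_phrases if tag.startswith(strong_prefixes)]


# Hypothetical pre-tagged output of a word segmenter.
tagged = [
    ("用户", "n"),
    ("通过", "p"),   # preposition
    ("网站", "n"),
    ("检索", "v"),
    ("文献", "n"),
    ("和", "c"),     # conjunction
    ("非常", "d"),   # adverb
    ("，", "x"),     # punctuation
]
strong = screen_by_pos(tagged)
```

The surviving phrases would then go through the stop-word comparison described above.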

Further, in step S20, the step of comparing the phrases with a preset stop word stock to obtain word segmentation phrases specifically includes:

determining whether the phrase is contained in the preset stop word stock;

if the phrase is contained in the preset stop word stock, rejecting the phrase;

if the phrase is not contained in the preset stop word stock, keeping the phrase as a word segmentation phrase.

In this embodiment, the phrases of the stop word stock may be set in advance. When a phrase is determined to be contained in the preset stop word stock, the phrase is rejected; if it is not contained in the preset stop word stock, the phrase is kept and used as a word segmentation phrase.

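In a specific embodiment, the step S30 of performing feature selection on the plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically includes:

calculating the word frequency of all word segmentation phrases under each theme;

calculating mutual information of each word segmentation phrase and each theme;

and selecting feature values according to the category information of the word segmentation phrases and the calculated word frequency and mutual information of the word segmentation phrases to obtain the feature phrases.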

In this embodiment, feature selection over the word segmentation phrases takes the word frequency, the category information and the mutual information into account. The category information refers to the category of a word segmentation phrase, such as place name, person name, algorithm or chemistry. Mutual information measures the mutual dependence between two objects; in the filtering problem it measures how well a feature discriminates a topic. Mutual information is a concept from information theory that expresses the relationship between pieces of information and is a measure of the statistical correlation of two random variables. Feature extraction based on mutual information rests on the assumption that a term that occurs frequently in one category but rarely in the other categories has high mutual information with that category. Mutual information is commonly used as a measure between feature words and categories, and the mutual information is largest when a feature word belongs to the category. The word frequency measures how well a word describes the document content. The formula for calculating the feature value is as follows:

$$W(t_i, c_j) = tf_i \times MI(t_i, c_j) \times \frac{N}{N_{ij}}$$

wherein: $t_i$ is the i-th word and $c_j$ is the j-th topic; $W(t_i, c_j)$ is the feature value of the word $t_i$ with respect to the topic $c_j$; $tf_i$ is the word frequency of the word $t_i$ under the topic $c_j$; $MI(t_i, c_j)$ is the mutual information of $t_i$ and the topic $c_j$; $N$ is the total number of topics; and $N_{ij}$ is the number of topics in which the word $t_i$ appears.
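The formula above can be evaluated directly once the word frequency and the mutual information are known. The following is a small sketch with made-up numbers; the function name `feature_value` and its parameter names are hypothetical.

```python
def feature_value(tf, mi, n_topics, n_topics_with_term):
    """W(t_i, c_j) = tf_i * MI(t_i, c_j) * N / N_ij."""
    if n_topics_with_term <= 0:
        raise ValueError("the term must appear in at least one topic")
    return tf * mi * n_topics / n_topics_with_term


# Hypothetical numbers: a term seen 12 times in the topic, mutual information 0.5,
# 8 topics overall, and the term present in 2 of them.
w_rare = feature_value(tf=12, mi=0.5, n_topics=8, n_topics_with_term=2)
# The same term spread over all 8 topics receives a smaller weight.
w_common = feature_value(tf=12, mi=0.5, n_topics=8, n_topics_with_term=8)
print(w_rare, w_common)  # prints "24.0 6.0"
```

The factor N / N_ij grows as the word appears in fewer topics, so with equal word frequency and mutual information a word concentrated in few topics scores higher than one spread across all of them.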

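In a specific embodiment, the step S40 of filtering the feature phrases according to the preset theme features to obtain the theme feature words specifically includes:

selecting any one theme from a plurality of themes as a filtering theme;

acquiring the selected phrases to be filtered from a preset filtering word bank according to the filtering theme;

and successively traversing the feature phrases, comparing the feature phrases with the selected phrases, and deleting a feature phrase if it exists among the selected phrases, so as to screen out the theme feature words.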

In this embodiment, after the feature phrases are obtained, they are filtered with the theme features to further reduce their number. Specifically, each feature phrase is compared with the selected phrases obtained from the filtering word bank for the filtering theme; if the feature phrase is identical to one of the selected phrases or is contained among them, it is filtered out, and the feature phrases that are not filtered out are kept as the theme feature words. In this way the scheme can set filtering phrases for a particular theme and avoid the influence of irrelevant feature words on that theme. Filtering a word for one theme does not prevent the same word from serving as a theme feature word of other themes, so the searching accuracy of the document can be greatly improved.
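The theme-specific filtering can be sketched as a lookup keyed by the filtering theme. This is a minimal sketch, assuming the filtering word bank is a mapping from theme to the phrases to be removed for that theme; the names `filter_by_topic` and `filter_lexicon` and the sample data are hypothetical.

```python
def filter_by_topic(feature_phrases, filter_lexicon, topic):
    """Drop the feature phrases listed in the filtering word bank of the chosen theme."""
    blocked = set(filter_lexicon.get(topic, ()))
    return [p for p in feature_phrases if p not in blocked]


# Hypothetical per-theme feature phrases and filtering word bank.
features = {
    "体育": ["比赛", "球队", "市场"],
    "财经": ["市场", "股票", "比赛"],
}
filter_lexicon = {
    "体育": ["市场"],  # removed only when the filtering theme is 体育
    "财经": ["比赛"],  # removed only when the filtering theme is 财经
}
topic_words = {t: filter_by_topic(ps, filter_lexicon, t) for t, ps in features.items()}
```

In this sample, 市场 is removed for 体育 yet remains a theme feature word for 财经, which illustrates that filtering one theme leaves the other themes untouched.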

Referring to fig. 2, in an embodiment of the present invention, the apparatus for extracting topic feature words from a document includes:

an importing module 10, configured to import a set of classified documents, where the documents have chinese text data;

the preprocessing module 20 is configured to perform word segmentation preprocessing on the chinese text data of the document to obtain a plurality of word segmentation phrases;

the selecting module 30 is configured to perform feature selection on the multiple word segmentation phrases according to the word frequency, the category information, and the mutual information to obtain feature phrases;

and the filtering module 40 is configured to filter the feature phrases according to preset theme features to obtain theme feature words.
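The four modules listed above can be pictured as stages chained in order. The following is a structural sketch only, assuming each module is supplied as a callable; the class name `TopicWordExtractor`, its field names and the stand-in components are hypothetical and are not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TopicWordExtractor:
    """Chains the preprocessing, selecting and filtering modules."""
    preprocess: Callable[[str], List[str]]
    select: Callable[[Dict[str, List[List[str]]]], Dict[str, List[str]]]
    filter_: Callable[[str, List[str]], List[str]]

    def run(self, documents: Dict[str, List[str]]) -> Dict[str, List[str]]:
        # Importing module: `documents` maps each topic to its raw texts.
        segmented = {
            topic: [self.preprocess(text) for text in texts]
            for topic, texts in documents.items()
        }
        features = self.select(segmented)
        return {
            topic: self.filter_(topic, phrases)
            for topic, phrases in features.items()
        }


# Stand-in components, for illustration only.
blocked = {"体育": {"市场"}}
extractor = TopicWordExtractor(
    preprocess=lambda text: text.split(),  # pretends the text is pre-segmented
    select=lambda seg: {
        t: sorted({w for doc in docs for w in doc}) for t, docs in seg.items()
    },
    filter_=lambda topic, phrases: [
        p for p in phrases if p not in blocked.get(topic, set())
    ],
)
result = extractor.run({"体育": ["球队 赢得 比赛", "球队 市场"]})
```

Passing the stages in as callables keeps each module replaceable, for example swapping the stand-in `select` for a scorer based on word frequency and mutual information.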

In the embodiment of the present invention, because a large number of phrases remain after the word segmentation preprocessing by the preprocessing module 20, further means can be used to process them; for the specific scheme, refer to the following embodiments. The selecting module 30 can perform feature selection on the word segmentation phrases according to the word frequency, the category information and the mutual information, so that a smaller number of feature phrases is obtained. Finally, since many feature phrases may still remain, the filtering module 40 can filter the feature phrases according to preset theme features to obtain theme feature words, which can greatly improve searching accuracy and is convenient for the user.

In an embodiment, the preprocessing module 20 is specifically configured to:

perform word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;

perform part-of-speech screening according to the part of speech of the phrases to obtain phrases with strong parts of speech;

compare the phrases with a preset stop word stock to obtain word segmentation phrases;

and output the word segmentation phrases.

In this embodiment, the preprocessing module 20 may use a word segmentation algorithm to divide the Chinese text data into verbs, nouns, adverbs, conjunctions and the like. Phrases with weak parts of speech, such as adverbs, conjunctions and punctuation marks, may then be removed according to the part of speech of each phrase, leaving phrases with strong parts of speech, such as verbs and nouns. Because a large number of strong part-of-speech phrases are still obtained, they need to be compared with the phrases in the stop word stock, and the phrases not contained in the stop word stock are kept as word segmentation phrases.

Further, the preprocessing module 20 is further configured to:

determine whether the phrase is contained in the preset stop word stock;

reject the phrase if it is contained in the preset stop word stock;

and keep the phrase as a word segmentation phrase if it is not contained in the preset stop word stock.

In this embodiment, the phrases of the stop word stock may be set in advance. The preprocessing module 20 is further configured to determine the relationship between a phrase and the stop word stock: if the phrase is contained in the preset stop word stock, the phrase is rejected; if it is not contained in the preset stop word stock, the phrase is kept as a word segmentation phrase.

In a specific embodiment, the selecting module 30 is configured to:

calculate the word frequency of all word segmentation phrases under each theme;

calculate mutual information of each word segmentation phrase and each theme;

and select feature values according to the category information of the word segmentation phrases and the calculated word frequency and mutual information of the word segmentation phrases to obtain the feature phrases.

In this embodiment, the selecting module 30 performs feature selection over the word segmentation phrases based on the word frequency, the category information and the mutual information. The category information refers to the category of a word segmentation phrase, such as place name, person name, algorithm or chemistry. Mutual information measures the mutual dependence between two objects; in the filtering problem it measures how well a feature discriminates a topic. Mutual information is a concept from information theory that expresses the relationship between pieces of information and is a measure of the statistical correlation of two random variables. Feature extraction based on mutual information rests on the assumption that a term that occurs frequently in one category but rarely in the other categories has high mutual information with that category. Mutual information is commonly used as a measure between feature words and categories, and the mutual information is largest when a feature word belongs to the category. The word frequency measures how well a word describes the document content.

In a specific embodiment, the filtering module 40 is configured to:

select any one theme from a plurality of themes as a filtering theme;

acquire the selected phrases to be filtered from a preset filtering word bank according to the filtering theme;

and successively traverse the feature phrases, compare the feature phrases with the selected phrases, and delete a feature phrase if it exists among the selected phrases, so as to screen out the theme feature words.

In this embodiment, after the feature phrases are obtained, the filtering module 40 filters them with the theme features to further reduce their number. Specifically, each feature phrase is compared with the selected phrases obtained from the filtering word bank for the filtering theme; if the feature phrase is identical to one of the selected phrases or is contained among them, it is filtered out, and the feature phrases that are not filtered out are kept as the theme feature words. In this way the scheme can set filtering phrases for a particular theme and avoid the influence of irrelevant feature words on that theme. Filtering a word for one theme does not prevent the same word from serving as a theme feature word of other themes, so the searching accuracy of the document can be greatly improved.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for extracting topic characteristic words of a document is characterized by comprising the following steps:

the step of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically comprises:

outputting word-segmentation phrases;

the step of comparing the phrases with a preset stop word stock to obtain word-segmentation phrases specifically comprises:

if the phrase is not the preset subset of the stop word stock, the phrase is left and is used as a word segmentation phrase;

the step of performing feature selection on the multiple word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically comprises the following steps:

calculating the word frequency of all word-separating phrases under each theme;

calculating mutual information of each word segmentation phrase and each theme;

selecting a characteristic value according to the category information of the word segmentation phrases and the calculated word frequency and mutual information of the word segmentation phrases to obtain characteristic phrases;

the calculation formula of the characteristic value is as follows:

$$W(t_i, c_j) = tf_i \times MI(t_i, c_j) \times \frac{N}{N_{ij}}$$

wherein: $t_i$ is the i-th word, $c_j$ is the j-th topic, $W(t_i, c_j)$ is the characteristic value of the word $t_i$ about the topic $c_j$, $tf_i$ is the word frequency of the word $t_i$ about the topic $c_j$, $MI(t_i, c_j)$ is the mutual information of $t_i$ and the topic $c_j$, $N$ is the total topic number, and $N_{ij}$ is the number of topics in which the word $t_i$ appears;

filtering the feature phrases according to preset theme features to obtain theme feature words;

the step of filtering the feature phrases according to the preset theme features to obtain theme feature words specifically includes:

selecting any one theme from a plurality of themes as a filtering theme;

2. A document theme feature word extraction device, comprising:

the preprocessing module is specifically configured to:

outputting word-segmentation phrases;

the preprocessing module is further configured to:

the selecting module is used for:

calculating the word frequency of all word-separating phrases under each theme;

calculating mutual information of each word segmentation phrase and each theme;

the calculation formula of the characteristic value is as follows:

$$W(t_i, c_j) = tf_i \times MI(t_i, c_j) \times \frac{N}{N_{ij}}$$

the filtering module is used for filtering the feature phrases according to preset theme features to obtain theme feature words;

the filtering module is used for:

selecting any one theme from a plurality of themes as a filtering theme;