CN108108346B - Method and device for extracting theme characteristic words of document - Google Patents

Info

Publication number
CN108108346B
Authority
CN
China
Prior art keywords
word
phrases
phrase
feature
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611062893.8A
Other languages
Chinese (zh)
Other versions
CN108108346A (en)
Inventor
余虎
张郭强
林伟亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Eshore Technology Co Ltd
Original Assignee
Guangdong Eshore Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method and a device for extracting topic feature words from a document. The method comprises the following steps: importing a group of classified documents, wherein the documents contain Chinese text data; performing word segmentation preprocessing on the Chinese text data of the documents to obtain a plurality of word segmentation phrases; performing feature selection on the word segmentation phrases according to word frequency, category information and mutual information to obtain feature phrases; and filtering the feature phrases according to preset topic features to obtain topic feature words. The technical scheme prevents irrelevant feature words from affecting the document topic, yields accurate topic feature words, and makes documents easier to search. The method and the device improve the accuracy of topic feature word selection, avoid both omitted and superfluous feature words, and improve the accuracy of document search and thereby the user's search experience.

Description

Method and device for extracting theme characteristic words of document
Technical Field
The invention relates to the technical field of document searching, in particular to a method and a device for extracting topic characteristic words of a document.
Background
With the continuous development of network technology, searching databases and library documents through websites has gradually replaced manual lookup in books. Searching for a document through a website requires that the topic feature words of the document be extracted. In one prior-art scheme, the text of the document is segmented and feature words are then extracted with a feature word extraction algorithm. This scheme supports only fuzzy matching of feature words; the feature words obtained are weakly representative and cannot fully capture the features of the topic. In another scheme, a filtering step is added after the text of the document is classified, and feature words are then extracted from the filtered result. This scheme can filter out some invalid feature words, but the filtering is applied to all topics at once and cannot be tailored to a particular topic, so the result may omit features of some topics and the feature words obtained are not comprehensive enough.
Disclosure of Invention
In order to solve at least one of the above technical problems, a primary object of the present invention is to provide a method for extracting topic feature words from a document.
In order to achieve the purpose, the invention adopts a technical scheme that: a method for extracting topic characteristic words of a document is provided, which comprises the following steps:
importing a group of classified documents, wherein the documents have Chinese text data;
performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
performing characteristic selection on the multiple word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;
and filtering the feature phrases according to preset theme features to obtain theme feature words.
Preferably, the step of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically includes:
performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
performing part-of-speech screening on the phrases to obtain phrases with a strong part of speech;
comparing the phrases with a preset stop-word list to obtain word segmentation phrases;
and outputting the word segmentation phrases.
Preferably, the step of comparing the phrases with a preset stop-word list to obtain word segmentation phrases specifically includes:
determining whether the phrase is contained in the preset stop-word list,
if the phrase is contained in the preset stop-word list, rejecting the phrase,
if the phrase is not contained in the preset stop-word list, retaining the phrase as a word segmentation phrase.
Preferably, the step of performing feature selection on the plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically includes:
calculating the word frequency of every word segmentation phrase under each topic;
calculating the mutual information between each word segmentation phrase and each topic;
and performing feature value selection according to the category information of the word segmentation phrases and the calculated word frequency and mutual information, to obtain the feature phrases.
Preferably, the step of filtering the feature phrases according to preset topic features to obtain topic feature words specifically includes:
selecting any one of a plurality of topics as the filtering topic;
acquiring, from a preset filtering lexicon, the selected phrases to be filtered for the filtering topic;
and traversing the feature phrases one by one, comparing each feature phrase with the selected phrases, and deleting the feature phrase if it appears among the selected phrases, the remaining feature phrases being the topic feature words.
In order to achieve the purpose, the invention adopts another technical scheme that: provided is a document theme characteristic word extraction device, including:
the importing module is used for importing a group of classified documents, wherein the documents have Chinese text data;
the preprocessing module is used for carrying out word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
the selecting module is used for performing characteristic selection on the word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;
and the filtering module is used for filtering the feature phrases according to the preset theme features to obtain theme feature words.
Preferably, the preprocessing module is specifically configured to:
perform word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
perform part-of-speech screening on the phrases to obtain phrases with a strong part of speech;
compare the phrases with a preset stop-word list to obtain word segmentation phrases;
and output the word segmentation phrases.
Preferably, the preprocessing module is further configured to:
determine whether the phrase is contained in the preset stop-word list,
reject the phrase if it is contained in the preset stop-word list,
and retain the phrase as a word segmentation phrase if it is not contained in the preset stop-word list.
Preferably, the selecting module is configured to:
calculate the word frequency of every word segmentation phrase under each topic;
calculate the mutual information between each word segmentation phrase and each topic;
and perform feature value selection according to the category information of the word segmentation phrases and the calculated word frequency and mutual information, to obtain the feature phrases.
Preferably, the filtering module is configured to:
select any one of a plurality of topics as the filtering topic;
acquire, from a preset filtering lexicon, the selected phrases to be filtered for the filtering topic;
and traverse the feature phrases one by one, compare each feature phrase with the selected phrases, and delete the feature phrase if it appears among the selected phrases, the remaining feature phrases being the topic feature words.
According to the technical scheme, the Chinese text data of the document is subjected to word segmentation processing, then the feature selection is carried out on a plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain the feature phrases, and finally the feature phrases are subjected to filtering processing according to the preset subject features to obtain the subject feature words.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for extracting topic feature words from a document according to an embodiment of the present invention;
FIG. 2 is a block diagram of a topic feature word extraction apparatus according to another embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description of the invention relating to "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying any relative importance or implicit indication of the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, in the embodiment of the present invention, the method for extracting topic feature words from a document includes the following steps:
step S10, importing a group of classified documents, wherein the documents have Chinese text data;
step S20, performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
step S30, selecting characteristics of a plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;
and step S40, filtering the feature phrases according to preset theme features to obtain theme feature words.
In the embodiment of the invention, a group of documents classified by topic is imported; each document belongs to exactly one topic and contains Chinese text data. The word segmentation phrases are mainly noun and verb phrases, and the preprocessing removes auxiliary words, conjunctions, adverbs and the like. Because the number of phrases remaining after word segmentation preprocessing is still large, they can be processed further, as detailed in the following embodiments. Feature selection based on word frequency, category information and mutual information reduces the word segmentation phrases to a smaller set of feature phrases. Finally, because the feature phrases may still be numerous, they are filtered according to preset topic features to obtain the topic feature words, which improves search accuracy and is convenient for the user.
According to the technical scheme, the Chinese text data of the document is subjected to word segmentation processing, then the feature selection is carried out on a plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain the feature phrases, and finally the feature phrases are subjected to filtering processing according to the preset subject features to obtain the subject feature words.
In a specific embodiment, the step S20 of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically includes:
performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
performing part-of-speech screening on the phrases to obtain phrases with a strong part of speech;
comparing the phrases with a preset stop-word list to obtain word segmentation phrases;
and outputting the word segmentation phrases.
In this embodiment, the word segmentation algorithm divides the Chinese text data into verbs, nouns, adverbs, conjunctions and the like. Phrases with a weak part of speech, such as adverbs, conjunctions and punctuation, are then removed according to their part of speech, and phrases with a strong part of speech, such as verbs and nouns, are retained. Because the retained strong-part-of-speech phrases are still numerous, they are compared with the entries of the stop-word list, and the phrases not contained in the stop-word list are kept as word segmentation phrases.
Further, the step in S20 of comparing the phrases with a preset stop-word list to obtain word segmentation phrases specifically includes:
determining whether the phrase is contained in the preset stop-word list,
if the phrase is contained in the preset stop-word list, rejecting the phrase,
if the phrase is not contained in the preset stop-word list, retaining the phrase as a word segmentation phrase.
In this embodiment, the entries of the stop-word list are set in advance. A phrase that is found in the preset stop-word list is rejected; a phrase that is not found in it is retained and used as a word segmentation phrase.
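The part-of-speech screening and stop-word comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the input is assumed to be already segmented and tagged by some word segmentation algorithm, and the tag set (`n`, `v`, `d`, `c`, `x`), the function name `preprocess`, and the sample stop-word list are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag set: "n" noun, "v" verb, "d" adverb, "c" conjunction, "x" punctuation.
STRONG_POS: Set[str] = {"n", "v"}


def preprocess(tagged_tokens: Iterable[Tuple[str, str]], stop_words: Set[str]) -> List[str]:
    """Keep strong-part-of-speech phrases that are not in the stop-word list."""
    segmented: List[str] = []
    for phrase, pos in tagged_tokens:
        if pos not in STRONG_POS:
            continue  # part-of-speech screening: drop adverbs, conjunctions, punctuation
        if phrase in stop_words:
            continue  # stop-word comparison: reject phrases found in the list
        segmented.append(phrase)
    return segmented


# Illustrative input, as if produced by a word segmentation algorithm (assumed).
tokens = [("网络", "n"), ("和", "c"), ("技术", "n"), ("不断", "d"),
          ("发展", "v"), ("是", "v"), ("。", "x")]
stop_words = {"是", "的"}
print(preprocess(tokens, stop_words))
```

Note that "是" passes the part-of-speech screen as a verb but is still rejected by the stop-word comparison, which is why the two checks are applied in sequence.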
In a specific embodiment, the step S30 of performing feature selection on the plurality of word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically includes:
calculating the word frequency of every word segmentation phrase under each topic;
calculating the mutual information between each word segmentation phrase and each topic;
and performing feature value selection according to the category information of the word segmentation phrases and the calculated word frequency and mutual information, to obtain the feature phrases.
In this embodiment, feature selection on the word segmentation phrases is based on word frequency, category information and mutual information. The category information refers to the category of a word segmentation phrase, such as place name, personal name, algorithm or chemistry. Mutual information measures the interdependence between two objects; in the filtering problem it measures how well a feature discriminates a topic. Mutual information is a concept from information theory that expresses the relationship between pieces of information and is a measure of the statistical correlation of two random variables. Feature extraction based on mutual information rests on the assumption that a term that occurs frequently in one category but rarely in the others has large mutual information with that category. Mutual information is commonly used as a measure between feature words and categories; a feature word has its largest mutual information with the category to which it belongs. The word frequency measures how well a word describes the content of a document. The feature value is calculated as follows:
$$W(t_i, c_j) = \mathrm{tf}_i \times \mathrm{MI}(t_i, c_j) \times \frac{N}{N_{ij}}$$
where $t_i$ is the $i$-th word, $c_j$ is the $j$-th topic, $W(t_i, c_j)$ is the feature value of word $t_i$ with respect to topic $c_j$, $\mathrm{tf}_i$ is the word frequency of word $t_i$ under topic $c_j$, $\mathrm{MI}(t_i, c_j)$ is the mutual information between $t_i$ and topic $c_j$, $N$ is the total number of topics, and $N_{ij}$ is the number of topics in which word $t_i$ appears.
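To make the feature value concrete, the sketch below evaluates it on a toy corpus. The text does not specify how the mutual information is estimated or how category information enters the selection, so several choices here are assumptions: mutual information is taken as pointwise mutual information over document frequencies, category information is omitted, and features are selected by keeping the top `k` values per topic. The names `feature_values` and `select_features` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Doc = Tuple[str, List[str]]  # (topic label, word segmentation phrases)


def feature_values(docs: List[Doc]) -> Dict[Tuple[str, str], float]:
    """Return W(t, c) = tf * MI(t, c) * N / N_t for each phrase t seen under topic c."""
    tf: Dict[str, Counter] = defaultdict(Counter)     # tf[c][t]: occurrences of t under c
    df_tc: Dict[str, Counter] = defaultdict(Counter)  # documents of topic c containing t
    df_t: Counter = Counter()                         # documents containing t
    docs_per_topic: Counter = Counter()
    for topic, phrases in docs:
        docs_per_topic[topic] += 1
        tf[topic].update(phrases)
        for t in set(phrases):
            df_tc[topic][t] += 1
            df_t[t] += 1

    total_docs = len(docs)
    n_topics = len(docs_per_topic)                    # N: total number of topics
    topics_with: Counter = Counter()                  # number of topics in which t appears
    for counts in tf.values():
        topics_with.update(counts.keys())

    values: Dict[Tuple[str, str], float] = {}
    for topic, counts in tf.items():
        for t, freq in counts.items():
            p_tc = df_tc[topic][t] / total_docs
            p_t = df_t[t] / total_docs
            p_c = docs_per_topic[topic] / total_docs
            mi = math.log(p_tc / (p_t * p_c))         # assumed: pointwise mutual information
            values[(t, topic)] = freq * mi * n_topics / topics_with[t]
    return values


def select_features(values: Dict[Tuple[str, str], float], topic: str, k: int) -> List[str]:
    """Assumed selection rule: keep the k phrases with the largest feature value."""
    scored = [(w, t) for (t, c), w in values.items() if c == topic]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


docs = [
    ("sports", ["比赛", "球队", "比赛"]),
    ("sports", ["球队", "报道"]),
    ("finance", ["股票", "报道"]),
    ("finance", ["股票", "市场"]),
]
values = feature_values(docs)
print(select_features(values, "sports", 2))
```

In this toy corpus "报道" occurs equally often under both topics, so its assumed mutual information with either topic is zero and it drops to the bottom of each ranking, which matches the stated intuition that topic-specific terms should score highest.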
In a specific embodiment, the step S40 of filtering the feature phrases according to the preset topic features to obtain the topic feature words specifically includes:
selecting any one of a plurality of topics as the filtering topic;
acquiring, from a preset filtering lexicon, the selected phrases to be filtered for the filtering topic;
and traversing the feature phrases one by one, comparing each feature phrase with the selected phrases, and deleting the feature phrase if it appears among the selected phrases, the remaining feature phrases being the topic feature words.
In this embodiment, after the feature phrases are obtained they are filtered by topic features to further reduce their number. Specifically, each feature phrase is compared with the selected phrases of the filtering topic; a feature phrase that matches one of the selected phrases or is included among them is filtered out, and the feature phrases that are not filtered out are kept as the topic feature words. The scheme can therefore define a set of filter phrases for a particular topic and prevent irrelevant feature words from affecting that topic. The filtering does not prevent the filtered words from serving as topic feature words of other topics, and it improves the accuracy of document search.
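The per-topic filtering can be sketched as a simple membership test. This is an illustration under assumptions: the preset filtering lexicon is modelled as a dictionary from topic to a set of phrases, comparison is exact string equality, and the names `filter_for_topic` and `filter_lexicon` are hypothetical.

```python
from typing import Dict, List, Set


def filter_for_topic(feature_phrases: List[str], topic: str,
                     filter_lexicon: Dict[str, Set[str]]) -> List[str]:
    """Drop the feature phrases listed for `topic` in the filtering lexicon."""
    selected = filter_lexicon.get(topic, set())  # selected phrases to be filtered
    # Traverse the feature phrases in order and keep those not among the selected phrases.
    return [phrase for phrase in feature_phrases if phrase not in selected]


# Illustrative lexicon: each topic has its own phrases to filter (assumed data).
filter_lexicon = {"sports": {"报道"}, "finance": {"比赛"}}
features = ["比赛", "球队", "报道"]

print(filter_for_topic(features, "sports", filter_lexicon))
print(filter_for_topic(features, "finance", filter_lexicon))
```

Because the lexicon is keyed by topic, "报道" is removed for the sports topic but survives for the finance topic, which is the behaviour the embodiment describes: filtering for one topic does not affect the same word's role under other topics.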
Referring to fig. 2, in an embodiment of the present invention, the apparatus for extracting topic feature words from a document includes:
an importing module 10, configured to import a set of classified documents, where the documents have Chinese text data;
the preprocessing module 20 is configured to perform word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
the selecting module 30 is configured to perform feature selection on the multiple word segmentation phrases according to the word frequency, the category information, and the mutual information to obtain feature phrases;
and the filtering module 40 is configured to filter the feature phrases according to preset theme features to obtain theme feature words.
In the embodiment of the present invention, because the number of phrases remaining after word segmentation preprocessing by the preprocessing module 20 is still large, they can be processed further, as detailed in the following embodiments. The selecting module 30 performs feature selection on the word segmentation phrases according to the word frequency, the category information and the mutual information, which reduces them to a smaller set of feature phrases. Finally, because the feature phrases may still be numerous, the filtering module 40 filters them according to preset topic features to obtain the topic feature words, which improves search accuracy and is convenient for the user.
In an embodiment, the preprocessing module 20 is specifically configured to:
perform word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
perform part-of-speech screening on the phrases to obtain phrases with a strong part of speech;
compare the phrases with a preset stop-word list to obtain word segmentation phrases;
and output the word segmentation phrases.
In this embodiment, the preprocessing module 20 may use a word segmentation algorithm to divide the Chinese text data into verbs, nouns, adverbs, conjunctions and the like. Phrases with a weak part of speech, such as adverbs, conjunctions and punctuation, are then removed according to their part of speech, and phrases with a strong part of speech, such as verbs and nouns, are retained. Because the retained strong-part-of-speech phrases are still numerous, they are compared with the entries of the stop-word list, and the phrases not contained in the stop-word list are kept as word segmentation phrases.
Further, the preprocessing module 20 is further configured to:
determine whether the phrase is contained in the preset stop-word list,
reject the phrase if it is contained in the preset stop-word list,
and retain the phrase as a word segmentation phrase if it is not contained in the preset stop-word list.
In this embodiment, the entries of the stop-word list are set in advance. The preprocessing module 20 checks each phrase against the stop-word list, rejects the phrase if it is found in the list, and retains it as a word segmentation phrase otherwise.
In a specific embodiment, the selecting module 30 is configured to:
calculate the word frequency of every word segmentation phrase under each topic;
calculate the mutual information between each word segmentation phrase and each topic;
and perform feature value selection according to the category information of the word segmentation phrases and the calculated word frequency and mutual information, to obtain the feature phrases.
In this embodiment, the selecting module 30 bases feature selection on word frequency, category information and mutual information. The category information refers to the category of a word segmentation phrase, such as place name, personal name, algorithm or chemistry. Mutual information measures the interdependence between two objects; in the filtering problem it measures how well a feature discriminates a topic. Mutual information is a concept from information theory that expresses the relationship between pieces of information and is a measure of the statistical correlation of two random variables. Feature extraction based on mutual information rests on the assumption that a term that occurs frequently in one category but rarely in the others has large mutual information with that category. Mutual information is commonly used as a measure between feature words and categories; a feature word has its largest mutual information with the category to which it belongs. The word frequency measures how well a word describes the content of a document.
In a specific embodiment, the filtering module 40 is configured to:
select any one of a plurality of topics as the filtering topic;
acquire, from a preset filtering lexicon, the selected phrases to be filtered for the filtering topic;
and traverse the feature phrases one by one, compare each feature phrase with the selected phrases, and delete the feature phrase if it appears among the selected phrases, the remaining feature phrases being the topic feature words.
In this embodiment, after the feature phrases are obtained, the filtering module 40 filters them by topic features to further reduce their number. Specifically, each feature phrase is compared with the selected phrases of the filtering topic; a feature phrase that matches one of the selected phrases or is included among them is filtered out, and the feature phrases that are not filtered out are kept as the topic feature words. The scheme can therefore define a set of filter phrases for a particular topic and prevent irrelevant feature words from affecting that topic. The filtering does not prevent the filtered words from serving as topic feature words of other topics, and it improves the accuracy of document search.
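The four-module structure of the device can be sketched as a small composition of callables. This is only a structural illustration: the class name `Extractor`, the field names, and the toy stand-ins used for each module are hypothetical, and the stand-ins deliberately replace real segmentation and feature selection with trivial placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, str]  # (topic label, raw text)


@dataclass
class Extractor:
    import_docs: Callable[[], List[Doc]]                              # importing module 10
    preprocess: Callable[[str], List[str]]                            # preprocessing module 20
    select: Callable[[Dict[str, List[List[str]]]], Dict[str, List[str]]]  # selecting module 30
    filter_: Callable[[str, List[str]], List[str]]                    # filtering module 40

    def run(self) -> Dict[str, List[str]]:
        """Chain the modules: import, preprocess, select, then filter per topic."""
        by_topic: Dict[str, List[List[str]]] = {}
        for topic, text in self.import_docs():
            by_topic.setdefault(topic, []).append(self.preprocess(text))
        features = self.select(by_topic)
        return {topic: self.filter_(topic, phrases) for topic, phrases in features.items()}


# Toy stand-ins (assumptions): text is pre-segmented with spaces, selection keeps
# every distinct phrase in first-seen order, and filtering uses a per-topic set.
blocked = {"sports": {"报道"}}
extractor = Extractor(
    import_docs=lambda: [("sports", "比赛 球队 报道"), ("finance", "股票 报道")],
    preprocess=str.split,
    select=lambda by_topic: {
        topic: list(dict.fromkeys(p for doc in docs for p in doc))
        for topic, docs in by_topic.items()
    },
    filter_=lambda topic, phrases: [p for p in phrases if p not in blocked.get(topic, set())],
)
result = extractor.run()
print(result)
```

Keeping each module behind a plain callable mirrors the one-module-per-step layout of the device and lets any single step be swapped out without touching the others.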
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (2)

1. A method for extracting topic characteristic words of a document is characterized by comprising the following steps:
importing a group of classified documents, wherein the documents have Chinese text data;
performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
the step of performing word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases specifically comprises:
performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
performing part-of-speech screening on the phrases to obtain phrases with a strong part of speech;
comparing the phrases with a preset stop-word list to obtain word segmentation phrases;
outputting the word segmentation phrases;
the step of comparing the phrases with a preset stop-word list to obtain word segmentation phrases specifically comprises:
determining whether the phrase is contained in the preset stop-word list,
if the phrase is contained in the preset stop-word list, rejecting the phrase,
if the phrase is not contained in the preset stop-word list, retaining the phrase as a word segmentation phrase;
performing characteristic selection on the multiple word segmentation phrases according to the word frequency, the category information and the mutual information to obtain characteristic phrases;
the step of performing feature selection on the multiple word segmentation phrases according to the word frequency, the category information and the mutual information to obtain feature phrases specifically comprises the following steps:
calculating the word frequency of every word segmentation phrase under each topic;
calculating the mutual information between each word segmentation phrase and each topic;
performing feature value selection according to the category information of the word segmentation phrases and the calculated word frequency and mutual information, to obtain the feature phrases;
the calculation formula of the feature value is as follows:
$$W(t_i, c_j) = \mathrm{tf}_i \times \mathrm{MI}(t_i, c_j) \times \frac{N}{N_{ij}}$$
wherein: $t_i$ is the $i$-th word, $c_j$ is the $j$-th topic, $W(t_i, c_j)$ is the feature value of the word $t_i$ with respect to the topic $c_j$, $\mathrm{tf}_i$ is the word frequency of the word $t_i$ under the topic $c_j$, $\mathrm{MI}(t_i, c_j)$ is the mutual information between $t_i$ and the topic $c_j$, $N$ is the total number of topics, and $N_{ij}$ is the number of topics in which the word $t_i$ appears;
filtering the feature phrases according to preset theme features to obtain theme feature words;
the step of filtering the feature phrases according to the preset theme features to obtain theme feature words specifically includes:
selecting any one of a plurality of topics as the filtering topic;
acquiring, from a preset filtering lexicon, the selected phrases to be filtered for the filtering topic;
and traversing the feature phrases one by one, comparing each feature phrase with the selected phrases, and deleting the feature phrase if it appears among the selected phrases, the remaining feature phrases being the topic feature words.
2. A document theme feature word extraction device, comprising:
an importing module, which is used for importing a group of classified documents, the documents having Chinese text data;
the preprocessing module is used for carrying out word segmentation preprocessing on the Chinese text data of the document to obtain a plurality of word segmentation phrases;
the preprocessing module is specifically configured to:
performing word segmentation on the Chinese text data of the document according to a word segmentation algorithm to obtain a plurality of phrases;
performing part-of-speech screening on the phrases according to their parts of speech to obtain phrases with strong parts of speech;
comparing the phrases with a preset stop word library to obtain word segmentation phrases;
outputting the word segmentation phrases;
the preprocessing module is further configured to:
determining whether the phrase belongs to the preset stop word library,
if the phrase is in the preset stop word library, the phrase is rejected,
if the phrase is not in the preset stop word library, the phrase is retained and used as a word segmentation phrase;
the selecting module is used for performing feature selection on the word segmentation phrases according to word frequency, category information and mutual information to obtain feature phrases;
the selecting module is used for:
calculating the word frequency of every word segmentation phrase under each theme;
calculating the mutual information of each word segmentation phrase and each theme;
performing selection by feature value, according to the category information of the word segmentation phrases and their calculated word frequency and mutual information, to obtain feature phrases;
the calculation formula of the feature value is as follows:
$W(t_i, c_j) = tf_i \times MI(t_i, c_j) \times \frac{N}{N_{ij}}$
wherein: $t_i$ is the i-th word, $c_j$ is the j-th theme, $W(t_i, c_j)$ is the feature value of the word $t_i$ with respect to the theme $c_j$, $tf_i$ is the word frequency of the word $t_i$ under the theme $c_j$, $MI(t_i, c_j)$ is the mutual information of $t_i$ and the theme $c_j$, $N$ is the total number of themes, and $N_{ij}$ is the number of themes in which the word $t_i$ appears;
the filtering module is used for filtering the feature phrases according to preset theme features to obtain theme feature words;
the filtering module is used for:
selecting any one theme from a plurality of themes as a filtering theme;
acquiring, from a preset filtering word library, the selected phrases to be filtered out according to the filtering theme;
and traversing the feature phrases in turn, comparing each feature phrase with the selected phrases, and deleting the feature phrase if it is among the selected phrases, so as to screen out the theme feature words.
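The preprocessing module of claim 2 can be sketched as follows. This is a hedged illustration only: it assumes segmentation and part-of-speech tagging have already been done by some external segmenter that yields (phrase, tag) pairs, and it assumes that "strong" parts of speech means tags beginning with `n`, `v` or `a` (nouns, verbs, adjectives). The claim specifies neither the segmenter nor the tag set, and the class name `PreprocessingModule` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreprocessingModule:
    """Part-of-speech screening followed by stop-word rejection."""

    stop_words: set
    # Assumed "strong" part-of-speech tag prefixes; not given in the claim.
    keep_pos: tuple = ("n", "v", "a")

    def run(self, tagged_phrases):
        """tagged_phrases: iterable of (phrase, pos_tag) pairs from a segmenter."""
        kept = []
        for phrase, pos in tagged_phrases:
            # Part-of-speech screening: drop phrases with weak tags.
            if not pos.startswith(self.keep_pos):
                continue
            # Stop-word check: reject phrases found in the stop word library.
            if phrase in self.stop_words:
                continue
            kept.append(phrase)
        return kept
```

A possible call, with made-up tags and a one-word stop list:

```python
module = PreprocessingModule(stop_words={"是"})
tagged = [("文档", "n"), ("的", "uj"), ("是", "v"), ("提取", "v"), ("主题", "n")]
print(module.run(tagged))  # ['文档', '提取', '主题']
```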
CN201611062893.8A 2016-11-25 2016-11-25 Method and device for extracting theme characteristic words of document Active CN108108346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611062893.8A CN108108346B (en) 2016-11-25 2016-11-25 Method and device for extracting theme characteristic words of document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611062893.8A CN108108346B (en) 2016-11-25 2016-11-25 Method and device for extracting theme characteristic words of document

Publications (2)

Publication Number Publication Date
CN108108346A (en) 2018-06-01
CN108108346B (en) 2021-12-24

Family

ID=62204652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611062893.8A Active CN108108346B (en) 2016-11-25 2016-11-25 Method and device for extracting theme characteristic words of document

Country Status (1)

Country Link
CN (1) CN108108346B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308607A (en) * 2018-09-17 2019-02-05 田歌 The method and device of book of final entry event
CN109800428B (en) * 2018-12-28 2023-01-13 东软集团股份有限公司 Method, device and equipment for labeling segmentation result for corpus and storage medium
CN110851569B (en) * 2019-11-12 2022-11-29 北京创鑫旅程网络技术有限公司 Data processing method, device, equipment and storage medium
CN113673205B (en) * 2021-08-23 2023-01-13 广东电网有限责任公司 Image character information extraction method, system and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831248A (en) * 2012-09-18 2012-12-19 北京奇虎科技有限公司 Network hotspot mining method and network hotspot mining device
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN103631779A (en) * 2012-08-21 2014-03-12 上海凌攀信息科技有限公司 Word recommending system based on socialized dictionary
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN105183813A (en) * 2015-08-26 2015-12-23 山东省计算中心(国家超级计算济南中心) Mutual information based parallel feature selection method for document classification
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Characteristic extraction method of text classification on the basis of mutual information
CN105786991A (en) * 2016-02-18 2016-07-20 中国科学院自动化研究所 Chinese emotion new word recognition method and system in combination with user emotion expression ways

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050108200A1 (en) * 2001-07-04 2005-05-19 Frank Meik Category based, extensible and interactive system for document retrieval
US8983826B2 (en) * 2011-06-30 2015-03-17 Palo Alto Research Center Incorporated Method and system for extracting shadow entities from emails
CN105488033B (en) * 2016-01-26 2018-01-02 中国人民解放军国防科学技术大学 Associate the preprocess method and device calculated
CN106021388A (en) * 2016-05-11 2016-10-12 华南理工大学 Classifying method of WeChat official accounts based on LDA topic clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
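Research on a topic feature selection method based on mutual information; Wu Shufang et al.; Journal of Intelligence (《情报杂志》); 20140430; Vol. 34 (No. 4); pp. 159-161 *
An improved mutual information algorithm based on word frequency and text category; Xie Li et al.; Journal of Jinggangshan University (Natural Science) (《井冈山大学学报(自然科学版)》); 20130531; Vol. 34 (No. 3); pp. 41-44 *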

Also Published As

Publication number Publication date
CN108108346A (en) 2018-06-01

Similar Documents

Publication Publication Date Title
CN108108346B (en) Method and device for extracting theme characteristic words of document
US8577155B2 (en) System and method for duplicate text recognition
WO2020140373A1 (en) Intention recognition method, recognition device and computer-readable storage medium
CN110738039B (en) Case auxiliary information prompting method and device, storage medium and server
CN104598532A (en) Information processing method and device
CN105630975B (en) Information processing method and electronic equipment
CN108363694B (en) Keyword extraction method and device
CN110110325B (en) Repeated case searching method and device and computer readable storage medium
CN110851714A (en) Text recommendation method and system based on heterogeneous topic model and word embedding model
RU2738335C1 (en) Method and system for classifying and filtering prohibited content in a network
CN112100470A (en) Expert recommendation method, device, equipment and storage medium based on thesis data analysis
CN110570199A (en) User identity detection method and system based on user input behaviors
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN106484672A (en) Vocabulary recognition methods and vocabulary identifying system
CN108021595A (en) Examine the method and device of knowledge base triple
CN105843890A (en) Knowledge base based big data and general data oriented data collection method and system
CN109408789B (en) Handwriting template, generation method thereof and handwriting template selection system
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
CN110619212B (en) Character string-based malicious software identification method, system and related device
CN107229654A (en) A kind of heat searches word acquisition methods and system
CN111079448A (en) Intention identification method and device
CN107844553B (en) Text classification method and device
CN105893397A (en) Video recommendation method and apparatus
CN111061924A (en) Phrase extraction method, device, equipment and storage medium
KR100670789B1 (en) Method for multi-level text filtering for blocking harmful web-sites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant