CN103246655A - Text categorizing method, device and system - Google Patents

Info

Publication number
CN103246655A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100243714A
Other languages
Chinese (zh)
Inventor
何晓宁
勇凤伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2012100243714A
Publication of CN103246655A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of internet text categorizing and provides a text categorizing method, device and system. The method includes: extracting characteristics of texts to be categorized; and categorizing the texts to be categorized according to the characteristics of the texts to be categorized to obtain normal texts and junk texts, wherein the characteristics include word properties in the texts to be categorized. By using the property of each word in the texts as the characteristics to conduct characteristic extraction and categorizing, characteristic space is greatly reduced, a relatively complex and precise categorizing model can be selected from a categorizer to categorize the texts to be categorized, and the categorizing accuracy is greatly improved.

Description

A text classification method, device and system
Technical field
The invention belongs to the technical field of internet text classification, and in particular relates to a text classification method, device and system.
Background
The openness and interactivity of the internet have brought with them the problem of junk text. Some malicious users publish large amounts of political, advertising and pornographic content over the internet, seriously harming public network security. The text that users upload therefore needs to be classified so that junk text can be filtered out.
Existing text classification methods extract features based on words. Because any language has a massive vocabulary, word-based feature extraction suffers, on the one hand, from a huge feature space, which limits the performance of the classifier; on the other hand, the amount of text that can be obtained for training is far smaller than that feature space demands. Both problems lead to unsatisfactory classification results. In addition, for encyclopedia-style text such as "searching encyclopaedia", the words that appear are highly variable and cover an extremely wide range, and every text to be classified concerns a brand-new encyclopedia entry, that is, content the classifier has never learned. If word-based feature extraction is used, the training text must therefore be updated frequently, which degrades the classification results.
Summary of the invention
The purpose of the embodiments of the invention is to provide a text classification method, intended to solve the problem that existing text classification methods, which extract features based on words, classify poorly.
The embodiments of the invention are realized as follows: a text classification method, the method comprising:
extracting features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;
classifying the text to be classified according to the features of the text to be classified, to obtain normal text and junk text.
Another purpose of the embodiments of the invention is to provide a text classification device, the device comprising:
a feature extraction module, configured to extract features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;
a classification module, configured to classify the text to be classified according to the features of the text to be classified, to obtain normal text and junk text.
Another purpose of the embodiments of the invention is to provide a text classification system, the system comprising the text classification device described above.
The embodiments of the invention use the part of speech of each word in a text as a feature for feature extraction and classification, which greatly reduces the feature space; a relatively complex and accurate classification model can therefore be selected in the classifier to classify the text to be classified, greatly improving classification accuracy.
Description of drawings
Fig. 1 is a flowchart of the text classification method provided by the first embodiment of the invention;
Fig. 2 is a flowchart of the text classification method provided by the second embodiment of the invention;
Fig. 3 is a flowchart of the classification model update method provided by the third embodiment of the invention;
Fig. 4 is a structural diagram of the text classification device provided by the fourth embodiment of the invention.
Embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The embodiments of the invention use the part of speech of each word in a text as a feature for feature extraction and classification, which greatly reduces the feature space; a relatively complex and accurate classification model can therefore be selected in the classifier to classify the text to be classified, greatly improving classification accuracy.
Fig. 1 shows the flow of the text classification method provided by the first embodiment of the invention, detailed as follows:
In step S101, features of the text to be classified are extracted, the features comprising the parts of speech involved in the text to be classified.
In this embodiment, the parts of speech of the words in the text to be classified are extracted as features; the part of speech of each word is obtained by performing word segmentation on the text to be classified. For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", word segmentation yields the six words <1949, China, people, republic, declare, establish>, whose parts of speech are respectively <time word, proper noun, noun, noun, verb, verb>. The text to be classified therefore carries the four features <time word, proper noun, noun, verb>.
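The reduction from six tagged words to four distinct features can be sketched as follows. This is a minimal illustration, assuming the words have already been segmented and tagged; the source does not name a segmenter or tagger, so the `TAGGED` list and the function name `extract_pos_features` are hypothetical.

```python
# Hypothetical pre-tagged words for the example sentence. In practice a word
# segmenter and part-of-speech tagger (not specified by the source) would
# produce these (word, part of speech) pairs.
TAGGED = [
    ("1949", "time word"),
    ("China", "proper noun"),
    ("people", "noun"),
    ("republic", "noun"),
    ("declare", "verb"),
    ("establish", "verb"),
]


def extract_pos_features(tagged_words):
    """Return the distinct parts of speech, in order of first appearance."""
    features = []
    for _word, pos in tagged_words:
        if pos not in features:
            features.append(pos)
    return features


features = extract_pos_features(TAGGED)
print(features)
```

The words themselves are discarded after tagging, which is why a text about an unseen encyclopedia entry still maps onto features the model already knows.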
In step S102, the text to be classified is classified according to its features, to obtain normal text and junk text.
In this embodiment, once its features have been extracted, the text to be classified has been converted into features that the classification model can recognize. Inputting these features into a preset classification model for feature matching therefore classifies the text, determining whether it is junk text or normal text.
In this embodiment, classified text whose type is junk text is then filtered out, which safeguards the quality of text content on the internet.
Specifically, as a refinement of the first embodiment of the invention, Fig. 2 shows the flow of the text classification method provided by the second embodiment of the invention, detailed as follows:
In step S201, the text to be classified is segmented into word strings.
For example, if the text to be classified is "The People's Republic of China declared its establishment in 1949", segmentation yields the six word strings <1949, China, people, republic, declare, establish>.
In step S202, the part of speech of each word string is extracted as a feature of the text to be classified.
For example, the six word strings <1949, China, people, republic, declare, establish> have the parts of speech <time word, proper noun, noun, noun, verb, verb> respectively; these parts of speech are the features of the text to be classified.
In step S203, the feature value of each feature in the text to be classified is calculated, the feature value being the ratio of the number of occurrences of each part of speech in the text to be classified to the total number of part-of-speech occurrences in the text to be classified.
For example, the text to be classified "The People's Republic of China declared its establishment in 1949" contains four types of part of speech. The features <time word> and <proper noun> each occur once, and the features <noun> and <verb> each occur twice, so the feature values corresponding to <time word, proper noun, noun, verb> are <1/6, 1/6, 1/3, 1/3>.
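The arithmetic of step S203 can be checked with a short sketch. It assumes the part-of-speech sequence is already available as a list; the function name `pos_feature_values` is illustrative, and exact fractions are used only so the output matches the figures in the example.

```python
from collections import Counter
from fractions import Fraction


def pos_feature_values(pos_tags):
    """Map each part of speech to (its occurrences) / (total tagged words)."""
    counts = Counter(pos_tags)
    total = len(pos_tags)
    return {pos: Fraction(n, total) for pos, n in counts.items()}


# Part-of-speech sequence of the six word strings in the example sentence.
tags = ["time word", "proper noun", "noun", "noun", "verb", "verb"]
values = pos_feature_values(tags)
for pos, value in values.items():
    print(pos, value)
```

Because every word string contributes exactly one tag, the feature values of a text always sum to 1.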
In step S204, the features of the text to be classified and their corresponding feature values are input into a preset classification model to classify the text to be classified, yielding classified text.
In this embodiment, as regards the preset classification model: a language has only a few dozen parts of speech, so the feature space based on part of speech is small and a relatively complex and accurate non-linear classification model can be adopted. In this embodiment, the classification model can therefore be trained and built with a machine learning method such as the gradient boost decision tree (GBDT). Through preliminary training on manually classified text samples, the method learns from the features, feature values and texts of the samples and produces a classification model that classifies text by feature matching. For the underlying principles of the model, refer to the relevant literature; they are not repeated here.
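One way to picture the small feature space is a fixed-length vector with one slot per part of speech. The sketch below is an assumption about how the features might be presented to a model, not something the source specifies; the `TAGSET` shown is a hypothetical, shortened tag list.

```python
# Hypothetical, shortened tag set. A real tagger would define a few dozen tags,
# but the list stays fixed no matter how many distinct words the texts contain.
TAGSET = ["time word", "proper noun", "noun", "verb", "adjective", "adverb"]


def to_vector(feature_values, tagset=TAGSET):
    """Lay the feature values out in tag-set order; absent tags get 0.0."""
    return [float(feature_values.get(tag, 0.0)) for tag in tagset]


vec = to_vector({"time word": 1 / 6, "proper noun": 1 / 6, "noun": 1 / 3, "verb": 1 / 3})
print(len(vec), vec)
```

Every text maps to a vector of the same short length, which is what makes a heavier non-linear model practical with a modest amount of training data.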
In this embodiment, after the preset classification model has classified the text to be classified, text classified as junk text can be deleted. As one embodiment of the invention, text classified as normal text can additionally be screened manually to further improve classification accuracy.
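The post-classification handling can be sketched as a simple routing step. The labels 0 and 1 follow the example convention the description gives for normal and junk text; the function names and the placeholder rule in `toy_classify` are hypothetical and stand in for the preset classification model.

```python
NORMAL, JUNK = 0, 1  # label convention taken from the description's example


def route_classified(texts, classify):
    """Split texts into a manual-review queue (normal) and a deleted list (junk)."""
    review_queue, deleted = [], []
    for text in texts:
        if classify(text) == JUNK:
            deleted.append(text)
        else:
            review_queue.append(text)
    return review_queue, deleted


def toy_classify(text):
    # Placeholder rule for demonstration only; it is not the part-of-speech model.
    return JUNK if text.startswith("AD:") else NORMAL


review_queue, deleted = route_classified(
    ["AD: cheap pills", "History of the abacus"], toy_classify
)
print(review_queue, deleted)
```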
Fig. 3 shows the flow of the classification model update method provided by the third embodiment of the invention, detailed as follows:
In step S301, the preset classification model is optimized using the normal text obtained by classification.
In this embodiment, in the process of updating the classification model, the type of each classified text (normal text or junk text) is obtained first. Specifically, the type can be obtained by reading the binary classification label of each classified text, for example with 0 and 1 indicating normal text and junk text respectively. In this embodiment, a number of classified texts whose type is normal text form the training corpus, which is used for further training of the classification model.
After the text types have been obtained, the features of the normal texts are extracted and the feature value of each feature is calculated, in order to train the preset classification model. In this embodiment the training method depends on the model adopted, and no specific training method is prescribed here. Taking GBDT as an example, the training algorithm traverses the training data to build a decision tree, then traverses all samples again and builds a second decision tree targeting the samples that the first decision tree misclassified. After N such iterations, the classification model consists of N decision trees.
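The iterative idea behind the GBDT example can be sketched in plain Python. This is a deliberately simplified stand-in, not the patent's training procedure: it assumes depth-1 trees (stumps) fitted to squared-error residuals, whereas the source does not specify tree depth or loss. The toy data, the learning rate and all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Pick the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            error = sum((r - lmean) ** 2 for r in left)
            error += sum((r - rmean) ** 2 for r in right)
            if best is None or error < best[0]:
                best = (error, j, threshold, lmean, rmean)
    if best is None:  # no usable split: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def train(X, y, n_trees=5, learning_rate=0.5):
    """Each new stump is fitted to what the previous stumps still get wrong."""
    base = sum(y) / len(y)
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0


# Invented toy data: [noun ratio, verb ratio] per text, label 1 = junk.
X = [[0.5, 0.3], [0.4, 0.4], [0.9, 0.0], [0.8, 0.1]]
y = [0, 0, 1, 1]
model = train(X, y)
print([predict(model, row) for row in X])
```

After `n_trees` rounds the model is simply the list of fitted stumps plus a base score, which mirrors the description's statement that the model consists of N decision trees.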
In step S302, the preset classification model is updated according to the optimization result.
In this embodiment, the N decision trees obtained are saved to a model file to update the preset classification model, and the classification program loads them the next time it classifies text.
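Saving and reloading the trees might look like the sketch below. The source does not describe the model file format, so the JSON layout, the tree dictionaries and the function names are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_model(trees, path):
    """Write the trees to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"trees": trees}, f)
    os.replace(tmp_path, path)  # the old model file is replaced in one step


def load_model(path):
    """Read the trees back for the next classification run."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["trees"]


# Hypothetical serialised trees (here depth-1 splits) from a training run.
trees = [
    {"feature": 0, "threshold": 0.5, "left": -0.5, "right": 0.5},
    {"feature": 0, "threshold": 0.5, "left": -0.25, "right": 0.25},
]
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(trees, path)
restored = load_model(path)
print(len(restored))
```

Writing to a temporary file and renaming it is a design choice of this sketch: a classification program that loads the file during an update sees either the old model or the new one, never a partially written file.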
The embodiments of the invention use the part of speech of each word in a text as a feature for feature extraction and classification, which greatly reduces the feature space; a relatively complex and accurate classification model can therefore be selected in the classifier to classify the text to be classified, greatly improving classification accuracy.
Fig. 4 shows the structure of the text classification device provided by the fourth embodiment of the invention; for ease of explanation, only the parts relevant to this embodiment are shown.
As shown in Fig. 4, the text classification device can run in text classification systems such as search engines and databases, as a software unit running on such a system. It specifically comprises:
Feature extraction module 41, configured to extract features of the text to be classified, the features comprising the parts of speech involved in the text to be classified.
Classification module 42, configured to classify the text to be classified according to the features of the text to be classified, to obtain normal text and junk text.
Specifically, feature extraction module 41 comprises:
Segmentation submodule 411, configured to segment the text to be classified into word strings.
Extraction submodule 412, configured to extract the part of speech of each word string as a feature of the text to be classified.
Calculation submodule 413, configured to calculate the feature value of each feature in the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to be classified to the total number of part-of-speech occurrences in the text to be classified.
Classification module 42 comprises:
Input submodule 421, configured to input the features of the text to be classified and their corresponding feature values into a preset classification model to classify the text to be classified.
Further, the text classification device also comprises:
Optimization module 43, configured to optimize the preset classification model using the normal text obtained by classification.
Update module 44, configured to update the preset classification model according to the optimization result.
Filtering module 45, configured to filter out the junk text obtained by classification.
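The module structure above can be sketched as a few small classes. This is an illustrative composition only: the class names, the whitespace segmenter, the tiny `LEXICON` and the threshold rule standing in for the preset model are all hypothetical, and optimization module 43 and update module 44 are omitted for brevity.

```python
from collections import Counter


class FeatureExtractionModule:
    """Module 41: segmentation (411), POS extraction (412), feature values (413)."""

    def __init__(self, segment, tag):
        self.segment = segment  # callable: text -> list of word strings
        self.tag = tag          # callable: word string -> part of speech

    def extract(self, text):
        tags = [self.tag(word) for word in self.segment(text)]
        total = len(tags)
        return {pos: n / total for pos, n in Counter(tags).items()}


class ClassificationModule:
    """Module 42: feeds features and feature values to a preset model (421)."""

    def __init__(self, model):
        self.model = model  # callable: feature values -> 0 (normal) or 1 (junk)

    def classify(self, feature_values):
        return self.model(feature_values)


class TextClassificationDevice:
    """Wires modules 41 and 42 together and filters junk text (module 45)."""

    def __init__(self, extractor, classifier):
        self.extractor = extractor
        self.classifier = classifier

    def filter(self, texts):
        return [t for t in texts
                if self.classifier.classify(self.extractor.extract(t)) == 0]


# Stand-ins for demonstration only: whitespace segmentation, a two-word lexicon
# that tags everything else as a noun, and an arbitrary threshold rule.
LEXICON = {"declare": "verb", "establish": "verb"}
device = TextClassificationDevice(
    FeatureExtractionModule(str.split, lambda w: LEXICON.get(w, "noun")),
    ClassificationModule(lambda fv: 1 if fv.get("noun", 0.0) > 0.8 else 0),
)
kept = device.filter(["republic declare establish people", "pills pills pills pills"])
print(kept)
```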
The embodiments of the invention use the part of speech of each word in a text as a feature for feature extraction and classification, which greatly reduces the feature space; a relatively complex and accurate classification model can therefore be selected in the classifier to classify the text to be classified, greatly improving classification accuracy.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A text classification method, characterized in that the method comprises:
extracting features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;
classifying the text to be classified according to the features of the text to be classified, to obtain normal text and junk text.
2. The method of claim 1, characterized in that the step of extracting features of the text to be classified specifically comprises:
segmenting the text to be classified into word strings;
extracting the part of speech of each word string as a feature of the text to be classified;
calculating the feature value of each feature in the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to be classified to the total number of part-of-speech occurrences in the text to be classified.
3. The method of claim 2, characterized in that the step of classifying the text to be classified according to the features of the text to be classified specifically comprises:
inputting the features of the text to be classified and their corresponding feature values into a preset classification model to classify the text to be classified.
4. The method of claim 1, characterized in that the method further comprises:
optimizing the preset classification model using the normal text obtained by classification;
updating the preset classification model according to the optimization result.
5. The method of claim 1, characterized in that the method further comprises:
filtering out the junk text obtained by classification.
6. A text classification device, characterized in that the device comprises:
a feature extraction module, configured to extract features of a text to be classified, the features comprising the parts of speech involved in the text to be classified;
a classification module, configured to classify the text to be classified according to the features of the text to be classified, to obtain normal text and junk text.
7. The device of claim 6, characterized in that the feature extraction module comprises:
a segmentation submodule, configured to segment the text to be classified into word strings;
an extraction submodule, configured to extract the part of speech of each word string as a feature of the text to be classified;
a calculation submodule, configured to calculate the feature value of each feature in the text to be classified, the feature value being the ratio of the number of occurrences of each part of speech in the text to be classified to the total number of part-of-speech occurrences in the text to be classified;
and the classification module comprises:
an input submodule, configured to input the features of the text to be classified and their corresponding feature values into a preset classification model to classify the text to be classified.
8. The device of claim 6, characterized in that the device further comprises:
an optimization module, configured to optimize the preset classification model using the normal text obtained by classification;
an update module, configured to update the preset classification model according to the optimization result.
9. The device of claim 6, characterized in that the device further comprises:
a filtering module, configured to filter out the junk text obtained by classification.
10. A text classification system, characterized in that the system comprises the text classification device of any one of claims 6 to 9.
CN2012100243714A 2012-02-03 2012-02-03 Text categorizing method, device and system Pending CN103246655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100243714A CN103246655A (en) 2012-02-03 2012-02-03 Text categorizing method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100243714A CN103246655A (en) 2012-02-03 2012-02-03 Text categorizing method, device and system

Publications (1)

Publication Number Publication Date
CN103246655A true CN103246655A (en) 2013-08-14

Family

ID=48926180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100243714A Pending CN103246655A (en) 2012-02-03 2012-02-03 Text categorizing method, device and system

Country Status (1)

Country Link
CN (1) CN103246655A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066560A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of text classification
CN107784034A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 The recognition methods of page classification and device, the device for the identification of page classification
CN107797981A (en) * 2016-08-31 2018-03-13 科大讯飞股份有限公司 A kind of target text recognition methods and device
CN108563786A (en) * 2018-04-26 2018-09-21 腾讯科技(深圳)有限公司 Text classification and methods of exhibiting, device, computer equipment and storage medium
CN108628873A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of file classification method, device and equipment
CN109213859A (en) * 2017-07-07 2019-01-15 阿里巴巴集团控股有限公司 A kind of Method for text detection, apparatus and system
CN109471938A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of file classification method and terminal
CN110705928A (en) * 2019-08-26 2020-01-17 贝壳技术有限公司 Data processing method, device, medium, and electronic apparatus

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040181581A1 (en) * 2003-03-11 2004-09-16 Michael Thomas Kosco Authentication method for preventing delivery of junk electronic mail
CN102255922A (en) * 2011-08-24 2011-11-23 山东师范大学 Intelligent multilevel junk email filtering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李新: "基于语义的文本信息安全过滤平台", 《信息化研究》 *
闫瑞: "博客数据特征提取与基于分类的垃圾博客过滤", 《中国优秀硕士学位论文全文数据库》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784034B (en) * 2016-08-31 2021-05-25 北京搜狗科技发展有限公司 Page type identification method and device for page type identification
CN107784034A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 The recognition methods of page classification and device, the device for the identification of page classification
CN107797981A (en) * 2016-08-31 2018-03-13 科大讯飞股份有限公司 A kind of target text recognition methods and device
CN108628873A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of file classification method, device and equipment
CN108628873B (en) * 2017-03-17 2022-09-27 腾讯科技(北京)有限公司 Text classification method, device and equipment
CN107066560A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of text classification
CN107066560B (en) * 2017-03-30 2019-12-06 东软集团股份有限公司 Text classification method and device
CN109213859A (en) * 2017-07-07 2019-01-15 阿里巴巴集团控股有限公司 A kind of Method for text detection, apparatus and system
CN108563786A (en) * 2018-04-26 2018-09-21 腾讯科技(深圳)有限公司 Text classification and methods of exhibiting, device, computer equipment and storage medium
CN109471938A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of file classification method and terminal
CN109471938B (en) * 2018-10-11 2023-06-16 平安科技(深圳)有限公司 Text classification method and terminal
CN110705928A (en) * 2019-08-26 2020-01-17 贝壳技术有限公司 Data processing method, device, medium, and electronic apparatus
CN110705928B (en) * 2019-08-26 2022-11-08 贝壳技术有限公司 Data processing method, device, medium and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131021

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20131021

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518044 Zhenxing Road, SEG Science Park 2 East Room 403

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130814
