CN106503266A - Document Classification Method and device - Google Patents

Document classification method and device

Info

Publication number
CN106503266A
Authority
CN
China
Prior art keywords
document
sorted
title
class
preset keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611080410.7A
Other languages
Chinese (zh)
Inventor
晋好林
于龙
陈美丽
朱涛
赵西法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JINAN ZHENGHE TECHNOLOGY Co Ltd
Original Assignee
JINAN ZHENGHE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-11-30
Filing date
2016-11-30
Publication date
2017-03-15
Application filed by JINAN ZHENGHE TECHNOLOGY Co Ltd
Priority to CN201611080410.7A
Publication of CN106503266A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document classification method and device. The method comprises: obtaining the title of a document to be classified; comparing the title of the document to be classified with a preset keyword of a preset document category; and, if they match, determining that the document to be classified belongs to the preset category. By comparing the title of the document to be classified with the preset keyword of the preset document category and assigning the document to the preset category when they match, the technical solution of the embodiments of the present invention completes classification in only a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.

Description

Document Classification Method and device
Technical field
The present invention relates to the technical field of document classification, and in particular to a document classification method and device.
Background
With the development of information technology, people can obtain a large amount of information every day, such as mail and news. To process this information efficiently, documents must be classified automatically.
The main prior-art method of classifying documents represents the characteristic information of a text document as a weighted keyword set, and represents the characteristic information of the classification catalogue using an ontology that has undergone disambiguation and ontology expansion. The ontology is converted into a weighted word-sense set by analysing its structural features. Finally, the Earth Mover's Distance method is used to calculate the semantic similarity between the keyword set of the text document and the weighted word-sense set of the ontology, from which the similarity between the document and the classification catalogue is further calculated. Text documents are then classified and ranked according to the similarity between the text document and the classification catalogue.
Although the prior-art method of classifying documents is highly general, it is relatively complex; if only one particular class of documents needs to be classified, it reduces work efficiency.
Summary of the invention
In view of this, an object of the present invention is to provide a simple, easy-to-operate document classification method and device.
To achieve the above object, the present invention provides a document classification method, including:
Obtaining the title of a document to be classified;
Comparing the title of the document to be classified with a preset keyword of a preset document category, and judging whether the title of the document to be classified contains the preset keyword;
If they match, the document to be classified belongs to the preset category.
Preferably, obtaining the title of the document to be classified includes:
Obtaining the document to be classified;
Extracting the title of the document to be classified from the document to be classified.
Preferably, comparing the title of the document to be classified with the preset keyword of the preset document category includes:
Dividing the title of the document to be classified into a leading segment and a trailing segment;
Comparing the trailing segment with the preset keyword of the preset document category.
Preferably, after comparing the title of the document to be classified with the preset keyword of the preset document category, the method further includes:
If the title of the document to be classified does not match the preset keyword of the preset document category, the document to be classified belongs to the third-class documents.
Preferably, the document to be classified is in HTML format.
The present invention also provides a document classification device, including:
An acquisition module, configured to obtain the title of a document to be classified;
A comparison module, configured to compare the title of the document to be classified with a preset keyword of a preset document category, and to judge whether the title of the document to be classified contains the preset keyword;
A classification module, configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
Preferably, the acquisition module includes:
An acquisition submodule, configured to obtain the document to be classified;
An extraction submodule, configured to extract the title of the document to be classified from the document to be classified.
Preferably, the comparison module includes:
A splitting submodule, configured to divide the title of the document to be classified into a leading segment and a trailing segment;
A comparison submodule, configured to compare the trailing segment with the preset keyword of the preset document category.
Preferably, the classification module is further configured to:
Determine that the document to be classified belongs to the third-class documents when the title of the document to be classified does not match the preset keyword of the preset document category.
Preferably, the document to be classified is in HTML format.
Compared with the prior art, the embodiments of the present invention have the following advantage: the technical solution compares the title of the document to be classified with the preset keyword of the preset document category and, when they match, assigns the document to the preset category. Classification is therefore completed in only a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
Description of the drawings
Fig. 1 is a flowchart of embodiment one of the document classification method of the present invention;
Fig. 2 is a flowchart of embodiment two of the document classification method of the present invention;
Fig. 3 is a schematic diagram of embodiment one of the document classification device of the present invention;
Fig. 4 is a schematic diagram of embodiment two of the document classification device of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention but do not limit its scope.
With the rapid development of Internet technology, the volume of network documents is growing explosively. For users in certain specific fields, obtaining the documents of that field is extremely difficult. For example, government staff, or enterprises that need to obtain policy information in time, want to obtain certain policy documents, but government websites publish many new measures and policies every day. How to find the relevant policy documents among all this information is a problem that these government staff and enterprises urgently need to solve.
The document classification methods provided by the prior art are general, but their algorithms are complex and difficult to apply, and their execution efficiency is low when only one class of documents needs to be classified. The embodiments of the present invention provide a method of classifying documents of a particular category, such as policy documents or academic documents. Keywords are preset according to the characteristics of the documents in that category, and classification is then performed according to the keywords, which greatly improves efficiency.
Fig. 1 is a flowchart of embodiment one of the document classification method of the present invention. As shown in Fig. 1, the document classification method of this embodiment may specifically include the following steps:
S101: obtain the title of the document to be classified.
Specifically, a prior-art web page capture method, such as crawler technology, can be used to obtain the information in the web document, specifically the title of the web page.
S102: compare the title of the document to be classified with the preset keywords of the preset document category, and judge whether the title of the document to be classified contains a preset keyword; if so, execute step S103; otherwise, execute step S104.
Specifically, after the title of the web document is obtained, the title can be compared with the preset keywords to determine the type of the web document. For example, a foreign-trade enterprise needs to follow policy trends, so the document type it needs to obtain is policy documents. Policy documents are generally divided into three types: declaration news flashes, policy news flashes, and third-class articles. The keywords that the title of a policy document may contain include notice, public notice, announcement, promulgation, declaration, decision, official reply, scheme, measures, policy, opinions, planning, detailed rules, and plan.
S103: the document to be classified belongs to the preset category.
Specifically, if the title of the document to be classified contains one of the above preset keywords, the document to be classified belongs to the type corresponding to that preset keyword.
S104: assign the document to be classified to the third-class documents.
Specifically, if the title of the document to be classified does not contain any of the above preset keywords, the document to be classified is assigned to the third-class articles.
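Steps S102 to S104 can be illustrated with a minimal Python sketch. The patent lists example keywords but does not say which keyword maps to which category, so the grouping in `PRESET_KEYWORDS`, the category labels, and the function name `classify_title` are all assumptions made for illustration, as is the case-insensitive substring comparison.

```python
# Hypothetical keyword table: the grouping of keywords into categories is an
# assumption; the patent only lists example keywords for policy documents.
PRESET_KEYWORDS = {
    "declaration news flash": ["declaration"],
    "policy news flash": ["notice", "announcement", "measures", "detailed rules"],
}
FALLBACK_CATEGORY = "third-class document"


def classify_title(title, preset_keywords=PRESET_KEYWORDS):
    """Return the first category whose preset keyword occurs in the title."""
    normalized = title.lower()
    for category, keywords in preset_keywords.items():
        # S102: judge whether the title contains a preset keyword.
        if any(keyword in normalized for keyword in keywords):
            # S103: the document belongs to the matching preset category.
            return category
    # S104: no preset keyword matched.
    return FALLBACK_CATEGORY
```

Because the check is a plain substring test on the title, its cost depends only on the title length and the number of keywords, which is consistent with the efficiency argument made for the method.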
The technical solution of this embodiment compares the title of the document to be classified with the preset keyword of the preset document category and, when they match, assigns the document to the preset category. Classification is therefore completed in only a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
Fig. 2 is a flowchart of embodiment two of the document classification method of the present invention. Building on embodiment one above, this embodiment describes the technical solution in further detail. As shown in Fig. 2, the document classification method of this embodiment may specifically include the following steps:
S201: obtain the document to be classified.
Specifically, the document to be classified can be obtained by connecting to the Internet and starting a browser. Generally, the document to be classified is in HTML format.
S202: extract the title of the document to be classified from the document to be classified.
Specifically, a prior-art web page capture method, such as crawler technology, can be used to obtain the information in the web document, specifically the title of the web page.
S203: divide the title of the document to be classified into a leading segment and a trailing segment.
Specifically, the keyword that indicates the type of a document is usually located at the end of its title. The title of the document to be classified can therefore be divided into a leading segment and a trailing segment, so that when the title is compared with the preset keywords the leading segment need not be compared and only the trailing segment is compared, which improves the execution efficiency of the program. As for the specific splitting method, the title can, for example, be divided evenly into a leading segment and a trailing segment according to its character count. For example, the title "State Council 2017 New Year's Day and Spring Festival Holiday Notice" can be divided into the leading segment "State Council 2017" and the trailing segment "New Year's Day and Spring Festival Holiday Notice". As another example, where the keyword is known to occupy only the last two characters of the title, the last two characters can be taken as the trailing segment and the remaining characters as the leading segment. For the same title, the leading segment is then "State Council 2017 New Year's Day and Spring Festival Holiday" and the trailing segment is "Notice" (a two-character word in the original Chinese), so that only the trailing segment is checked against the preset keywords, which further improves execution efficiency.
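The two splitting strategies described for step S203 can be sketched as follows. The function names `split_in_half` and `split_tail` are hypothetical, and lengths are counted in characters of whatever language the title is written in, so for an English title the tail length would be the length of the English keyword rather than two.

```python
def split_in_half(title):
    """Split a title into leading and trailing segments of near-equal length."""
    middle = len(title) // 2
    return title[:middle], title[middle:]


def split_tail(title, tail_length):
    """Treat the last `tail_length` characters as the trailing segment."""
    if tail_length <= 0:
        # A slice such as title[-0:] would return the whole title, so the
        # degenerate case is handled explicitly.
        return title, ""
    return title[:-tail_length], title[-tail_length:]
```

A fixed tail is the cheaper option when the keyword position is known in advance; the even split is the more tolerant choice when keywords of different lengths may appear near the end of the title.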
S204: compare the trailing segment with the preset keywords of the preset document category, and judge whether the trailing segment contains a preset keyword; if so, execute step S205; otherwise, execute step S206.
Specifically, after the title of the web document is obtained, the title can be compared with the preset keywords to determine the type of the web document. For example, a foreign-trade enterprise needs to follow policy trends, so the document type it needs to obtain is policy documents. Policy documents are generally divided into three types: declaration news flashes, policy news flashes, and third-class articles. The keywords that the title of a policy document may contain include notice, public notice, announcement, promulgation, declaration, decision, official reply, scheme, measures, policy, opinions, planning, detailed rules, and plan.
S205: the document to be classified belongs to the preset category.
Specifically, if the title of the document to be classified contains one of the above preset keywords, the document to be classified belongs to the type corresponding to that preset keyword.
S206: the document to be classified belongs to the third-class documents.
Specifically, if the title of the document to be classified does not contain any of the above preset keywords, the document to be classified is assigned to the third-class articles.
Generally, the document to be classified is in HTML format.
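Steps S202 to S206 can be sketched end to end for an HTML document using Python's standard `html.parser` module. The patent only says that the title is taken from the web page, so reading it from the `<title>` element is an assumption, as are the names `TitleParser`, `extract_title`, and `classify_html`, the fixed tail length, and the lower-casing before comparison.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    """S202: extract the title from the document to be classified."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


def classify_html(html_text, keywords, category, tail_length):
    """S203 to S206: compare only the trailing segment with the keywords."""
    title = extract_title(html_text)
    # S203: keep the last `tail_length` characters as the trailing segment.
    trailing = title[-tail_length:].lower() if tail_length > 0 else ""
    # S204: judge whether the trailing segment contains a preset keyword.
    if any(keyword in trailing for keyword in keywords):
        return category  # S205
    return "third-class document"  # S206
```

For example, `classify_html(page, ["notice"], "policy news flash", 6)` inspects only the last six characters of the page title, so a keyword that appears earlier in the title is deliberately ignored.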
The technical solution of this embodiment compares the title of the document to be classified with the preset keyword of the preset document category and, when they match, assigns the document to the preset category. Classification is therefore completed in only a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
Fig. 3 is a schematic diagram of embodiment one of the document classification device of the present invention. As shown in Fig. 3, the document classification device of this embodiment may specifically include an acquisition module 31, a comparison module 32, and a classification module 33.
The acquisition module 31 is configured to obtain the title of the document to be classified;
The comparison module 32 is configured to compare the title of the document to be classified with the preset keyword of the preset document category;
The classification module 33 is configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
The document classification device of this embodiment classifies documents with the above modules by the same mechanism as the document classification method of the embodiment shown in Fig. 1. For details, refer to the description of that embodiment, which is not repeated here.
Fig. 4 is a schematic diagram of embodiment two of the document classification device of the present invention. Building on the embodiment shown in Fig. 3, this embodiment describes the technical solution in further detail. As shown in Fig. 4, the document classification device of this embodiment may further include the following.
The acquisition module 31 includes:
An acquisition submodule 311, configured to obtain the document to be classified;
An extraction submodule 312, configured to extract the title of the document to be classified from the document to be classified.
Further, the comparison module 32 includes:
A splitting submodule 321, configured to divide the title of the document to be classified into a leading segment and a trailing segment;
A comparison submodule 322, configured to compare the trailing segment with the preset keyword of the preset document category.
Further, the classification module 33 is further configured to:
Determine that the document to be classified belongs to the third-class documents when the title of the document to be classified does not match the preset keyword of the preset document category.
The above document to be classified is in HTML format.
The document classification device of this embodiment classifies documents with the above modules by the same mechanism as the document classification method of the embodiment shown in Fig. 2. For details, refer to the description of that embodiment, which is not repeated here.
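The module structure of Fig. 4 could be mirrored in software roughly as follows. This is only one possible arrangement: the class names, the use of callables for submodules 311 and 312, and the fixed `tail_length` parameter are assumptions, since the patent describes the modules functionally and does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class AcquisitionModule:
    """Module 31: obtains the title of the document to be classified."""
    fetch_document: Callable[[str], str]  # stands in for acquisition submodule 311
    extract_title: Callable[[str], str]   # stands in for extraction submodule 312

    def get_title(self, source: str) -> str:
        return self.extract_title(self.fetch_document(source))


@dataclass
class ComparisonModule:
    """Module 32: compares the trailing segment with the preset keywords."""
    keywords: Sequence[str]
    tail_length: int  # assumed fixed trailing-segment length, in characters

    def split(self, title: str) -> Tuple[str, str]:
        # Splitting submodule 321.
        if self.tail_length <= 0:
            return title, ""
        return title[:-self.tail_length], title[-self.tail_length:]

    def matches(self, title: str) -> bool:
        # Comparison submodule 322: only the trailing segment is inspected.
        _, trailing = self.split(title)
        return any(keyword in trailing for keyword in self.keywords)


@dataclass
class ClassificationModule:
    """Module 33: maps the comparison result to a category."""
    preset_category: str
    fallback_category: str = "third-class document"

    def classify(self, matched: bool) -> str:
        return self.preset_category if matched else self.fallback_category


@dataclass
class DocumentClassifier:
    """Wires modules 31, 32, and 33 together."""
    acquisition: AcquisitionModule
    comparison: ComparisonModule
    classification: ClassificationModule

    def run(self, source: str) -> str:
        title = self.acquisition.get_title(source)
        return self.classification.classify(self.comparison.matches(title))
```

Passing the fetch and extraction steps in as callables keeps the sketch independent of any particular crawler or HTML parser, which the patent leaves open.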
The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the protection scope of the present invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the present invention within its spirit and protection scope, and such modifications or equivalent substitutions shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A document classification method, characterised by including:
Obtaining the title of a document to be classified;
Comparing the title of the document to be classified with a preset keyword of a preset document category, and judging whether the title of the document to be classified contains the preset keyword;
If they match, the document to be classified belongs to the preset category.
2. The method according to claim 1, characterised in that obtaining the title of the document to be classified includes:
Obtaining the document to be classified;
Extracting the title of the document to be classified from the document to be classified.
3. The method according to claim 1, characterised in that comparing the title of the document to be classified with the preset keyword of the preset document category includes:
Dividing the title of the document to be classified into a leading segment and a trailing segment;
Comparing the trailing segment with the preset keyword of the preset document category.
4. The method according to claim 1, characterised in that, after comparing the title of the document to be classified with the preset keyword of the preset document category, the method further includes:
If the title of the document to be classified does not match the preset keyword of the preset document category, the document to be classified belongs to the third-class documents.
5. The method according to any one of claims 1 to 4, characterised in that the document to be classified is in HTML format.
6. A document classification device, characterised by including:
An acquisition module, configured to obtain the title of a document to be classified;
A comparison module, configured to compare the title of the document to be classified with a preset keyword of a preset document category, and to judge whether the title of the document to be classified contains the preset keyword;
A classification module, configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
7. The device according to claim 6, characterised in that the acquisition module includes:
An acquisition submodule, configured to obtain the document to be classified;
An extraction submodule, configured to extract the title of the document to be classified from the document to be classified.
8. The device according to claim 6, characterised in that the comparison module includes:
A splitting submodule, configured to divide the title of the document to be classified into a leading segment and a trailing segment;
A comparison submodule, configured to compare the trailing segment with the preset keyword of the preset document category.
9. The device according to claim 6, characterised in that the classification module is further configured to:
Determine that the document to be classified belongs to the third-class documents when the title of the document to be classified does not match the preset keyword of the preset document category.
10. The device according to any one of claims 6 to 9, characterised in that the document to be classified is in HTML format.
CN201611080410.7A 2016-11-30 2016-11-30 Document Classification Method and device Pending CN106503266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611080410.7A CN106503266A (en) 2016-11-30 2016-11-30 Document Classification Method and device

Publications (1)

Publication Number Publication Date
CN106503266A true CN106503266A (en) 2017-03-15

Family

ID=58328028

Country Status (1)

Country Link
CN (1) CN106503266A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388601A (en) * 2018-02-02 2018-08-10 腾讯科技(深圳)有限公司 Sorting technique, storage medium and the computer equipment of failure
CN109002483A (en) * 2018-06-22 2018-12-14 平安科技(深圳)有限公司 Document management method, device, computer equipment and storage medium
CN110135264A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Data entry method, device, computer equipment and storage medium
CN110413569A (en) * 2019-07-30 2019-11-05 石浩灼 Archives of paper quality electronization archiving method, device and terminal device
CN111177392A (en) * 2019-12-31 2020-05-19 腾讯云计算(北京)有限责任公司 Data processing method and device
CN111626076A (en) * 2019-02-27 2020-09-04 富士通株式会社 Information processing method, information processing apparatus, and scanner

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679875A (en) * 2015-03-10 2015-06-03 杭州凡闻科技有限公司 Method for classifying information data based on digital newspaper
CN105740289A (en) * 2014-12-11 2016-07-06 阿里巴巴集团控股有限公司 Method and system for classifying text
CN105760526A (en) * 2016-03-01 2016-07-13 网易(杭州)网络有限公司 News classification method and device
CN105893467A (en) * 2016-03-28 2016-08-24 北京麒麟合盛网络技术有限公司 Information classification method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170315
