CN106503266A - Document Classification Method and device - Google Patents
Document classification method and device
- Publication number
- CN106503266A
- Authority
- CN
- China
- Prior art keywords
- document
- sorted
- title
- class
- preset keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document classification method and device. The method comprises: obtaining the title of a document to be classified; comparing the title of the document to be classified with a preset keyword of a preset document category; and, if the title contains the preset keyword, determining that the document to be classified belongs to the preset category. Because the technical solution of the embodiments of the invention only compares the title of the document with the preset keyword of a preset document category and assigns the document to that category on a match, classification is completed in a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
Description
Technical field
The present invention relates to the technical field of document classification, and more particularly to a document classification method and device.
Background art
With the development of information technology, people obtain large amounts of information every day, such as e-mail and news. To process this information efficiently, documents must be classified automatically.
In the prior art, documents are mainly classified as follows. A weighted keyword set represents the feature information of a text document. An ontology that has undergone disambiguation and expansion represents the feature information of a classification catalogue, and the ontology is converted into a weighted word-sense set by analyzing its structural features. Finally, the Earth Mover's Distance method is used to calculate the semantic similarity between the keyword set of the text document and the weighted word-sense set of the ontology, the similarity between the document and the classification catalogue is then calculated from it, and text documents are classified and ranked according to that similarity.
Although this prior-art classification method is highly general, it is also complex; if only one specific class of documents needs to be classified, it reduces working efficiency.
Summary of the invention
In view of this, an object of the present invention is to provide a simple, easy-to-operate document classification method and device.
To achieve this object, the present invention provides a document classification method, comprising:
obtaining the title of a document to be classified;
comparing the title of the document to be classified with a preset keyword of a preset document category, and judging whether the title of the document to be classified contains the preset keyword;
if so, determining that the document to be classified belongs to the preset category.
Preferably, obtaining the title of the document to be classified comprises:
obtaining the document to be classified;
extracting the title of the document to be classified from the document to be classified.
Preferably, comparing the title of the document to be classified with the preset keyword of the preset document category comprises:
dividing the title of the document to be classified into a front segment and a back segment;
comparing the back segment with the preset keyword of the preset document category.
Preferably, after comparing the title of the document to be classified with the preset keyword of the preset document category, the method further comprises:
if the title of the document to be classified does not contain the preset keyword of the preset document category, determining that the document to be classified belongs to a third category of documents.
Preferably, the document to be classified is in HTML format.
The present invention also provides a document classification device, comprising:
an acquisition module, configured to obtain the title of a document to be classified;
a comparison module, configured to compare the title of the document to be classified with a preset keyword of a preset document category, and to judge whether the title of the document to be classified contains the preset keyword;
a classification module, configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
Preferably, the acquisition module comprises:
an acquisition submodule, configured to obtain the document to be classified;
an extraction submodule, configured to extract the title of the document to be classified from the document to be classified.
Preferably, the comparison module comprises:
a splitting submodule, configured to divide the title of the document to be classified into a front segment and a back segment;
a comparison submodule, configured to compare the back segment with the preset keyword of the preset document category.
Preferably, the classification module is further configured to:
determine that the document to be classified belongs to a third category of documents when the title of the document to be classified does not contain the preset keyword of the preset document category.
Preferably, the document to be classified is in HTML format.
Compared with the prior art, the embodiments of the present invention have the following advantage: the technical solution compares the title of the document to be classified with the preset keyword of a preset document category and, on a match, assigns the document to the preset category, so that classification is completed in a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
Description of the drawings
Fig. 1 is a flowchart of embodiment one of the document classification method of the present invention;
Fig. 2 is a flowchart of embodiment two of the document classification method of the present invention;
Fig. 3 is a schematic diagram of embodiment one of the document classification device of the present invention;
Fig. 4 is a schematic diagram of embodiment two of the document classification device of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention but do not limit its scope.
With the rapid development of Internet technology, the volume of network documents is growing explosively. For users in certain fields, obtaining documents specific to their field is extremely difficult. For example, government staff, or enterprises that need timely policy information, want to obtain policy documents, but government websites publish many new measures and policies every day. How to pick out the relevant policy documents from this information is a problem that these government staff and enterprises urgently need to solve.
The document classification methods provided by the prior art are general, but their algorithms are complex and hard to apply, and they execute inefficiently when only one class of documents needs to be classified. The embodiments of the present invention provide a method for classifying documents of a particular category, such as policy documents or academic documents. Keywords are preset according to the characteristics of documents in that category, and documents are then classified by keyword, which greatly improves efficiency.
Fig. 1 is a flowchart of embodiment one of the document classification method of the present invention. As shown in Fig. 1, the document classification method of this embodiment may specifically include the following steps:
S101: obtain the title of the document to be classified.
Specifically, a prior-art web-capture method, such as crawler technology, may be used to obtain information from a web document, specifically the title of the web page.
S102: compare the title of the document to be classified with the preset keywords of the preset document category, and judge whether the title of the document to be classified contains a preset keyword; if so, execute step S103; otherwise, execute step S104.
Specifically, after the title of the web document is obtained, the title may be compared with the preset keywords to determine the type of the web document. For example, a foreign-trade enterprise needs to follow policy trends, so the document type it needs to obtain is the policy type. Policy-type documents are generally divided into three types: declaration bulletins, policy bulletins, and third-category articles. The keywords that the title of a policy-type document may contain include notice, public notice, bulletin, announcement, declaration, decision, official reply, scheme, measures, policy, opinions, planning, detailed rules, and plan.
S103: the document to be classified belongs to the preset category.
Specifically, if the title of the document to be classified contains one of the above preset keywords, the document belongs to the type corresponding to that preset keyword.
S104: the document to be classified is assigned to the third category of documents.
Specifically, if the title of the document to be classified does not contain any of the above preset keywords, the document is assigned to the third-category articles.
The technical solution of this embodiment compares the title of the document to be classified with the preset keywords of the preset document category and, on a match, assigns the document to the preset category, so that classification is completed in a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
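As a rough illustration of steps S101 to S104, the keyword check could be sketched in Python as follows. The category name, the English keyword list, and the names `PRESET_KEYWORDS`, `THIRD_CATEGORY`, and `classify_title` are assumptions made for this sketch; the patent itself does not prescribe any particular implementation or data structure.

```python
# Hypothetical keyword table: the category name and the English keywords are
# illustrative stand-ins for the preset keywords described in the text.
PRESET_KEYWORDS = {
    "policy": ["notice", "bulletin", "announcement", "decision",
               "measures", "detailed rules"],
}
THIRD_CATEGORY = "third-category"


def classify_title(title: str, preset_keywords=PRESET_KEYWORDS) -> str:
    """Return the preset category whose keyword occurs in the title
    (S102/S103), or the third category when nothing matches (S104)."""
    normalized = title.casefold()
    for category, keywords in preset_keywords.items():
        if any(keyword.casefold() in normalized for keyword in keywords):
            return category
    return THIRD_CATEGORY


if __name__ == "__main__":
    print(classify_title(
        "State Council 2017 New Year's Day and Spring Festival Holiday Notice"))
    print(classify_title("Weekend football results"))
```

The check is a plain substring test, which matches the "title contains the preset keyword" wording; case folding is an extra assumption that only matters for languages with letter case.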
Fig. 2 is a flowchart of embodiment two of the document classification method of the present invention. Building on embodiment one above, this embodiment describes the technical solution in further detail. As shown in Fig. 2, the document classification method of this embodiment may specifically include the following steps:
S201: obtain the document to be classified.
Specifically, a connection to the Internet may be established and a browser started to obtain the document to be classified. Generally, the document to be classified is in HTML format.
S202: extract the title of the document to be classified from the document to be classified.
Specifically, a prior-art web-capture method, such as crawler technology, may be used to obtain information from the web document, specifically the title of the web page.
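One possible way to perform the title extraction of step S202 on an HTML document is sketched below with Python's standard `html.parser` module. The text only says that a prior-art crawler may be used, so treating the `<title>` element as the document title, and the names `TitleParser` and `extract_title`, are assumptions of this sketch.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate the pieces.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> str:
        return "".join(self._parts).strip()


def extract_title(html_text: str) -> str:
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Fetching the page itself (step S201) is left out; any HTTP client or crawler could supply the `html_text` argument.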
S203: divide the title of the document to be classified into a front segment and a back segment.
Specifically, the keyword that indicates the type of a document generally appears at the end of its title. The title of the document to be classified can therefore be divided into a front segment and a back segment, so that when the title is compared with the preset keywords the front segment need not be compared and only the back segment is compared, which improves the execution efficiency of the program. As one concrete way of splitting, the title may be divided evenly into a front segment and a back segment according to its character count. For example, the title "State Council 2017 New Year's Day and Spring Festival Holiday Notice" may be divided into the front segment "State Council 2017" and the back segment "New Year's Day and Spring Festival Holiday Notice". As another example, where the keyword is known to occupy only the last two characters of the title, the last two characters may be taken as the back segment and the remaining characters as the front segment; for "State Council 2017 New Year's Day and Spring Festival Holiday Notice", the front segment is then "State Council 2017 New Year's Day and Spring Festival Holiday" and the back segment is "Notice" (two characters in the original Chinese). In that case only the back segment is checked against the preset keywords, which further improves execution efficiency.
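The two splitting strategies described for step S203 could be sketched as follows. The function names `split_in_half` and `split_tail` are hypothetical, and both operate on characters, which suits the Chinese titles the text has in mind; an English title would more naturally be split on words.

```python
def split_in_half(title: str) -> tuple[str, str]:
    """Divide the title evenly by character count into (front, back).

    With an odd length, the extra character goes to the back segment.
    """
    middle = len(title) // 2
    return title[:middle], title[middle:]


def split_tail(title: str, tail_length: int) -> tuple[str, str]:
    """Treat the last `tail_length` characters as the back segment."""
    if tail_length <= 0:
        return title, ""
    return title[:-tail_length], title[-tail_length:]
```

For instance, `split_tail("Holiday Notice", 6)` yields the front segment `"Holiday "` and the back segment `"Notice"`.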
S204: compare the back segment with the preset keywords of the preset document category, and judge whether the back segment contains a preset keyword; if so, execute step S205; otherwise, execute step S206.
Specifically, after the title of the web document is obtained, the title may be compared with the preset keywords to determine the type of the web document. For example, a foreign-trade enterprise needs to follow policy trends, so the document type it needs to obtain is the policy type. Policy-type documents are generally divided into three types: declaration bulletins, policy bulletins, and third-category articles. The keywords that the title of a policy-type document may contain include notice, public notice, bulletin, announcement, declaration, decision, official reply, scheme, measures, policy, opinions, planning, detailed rules, and plan.
S205: the document to be classified belongs to the preset category.
Specifically, if the title of the document to be classified contains one of the above preset keywords, the document belongs to the type corresponding to that preset keyword.
S206: the document to be classified belongs to the third category of documents.
Specifically, if the title of the document to be classified does not contain any of the above preset keywords, the document is assigned to the third-category articles.
Generally, the document to be classified is in HTML format.
The technical solution of this embodiment compares the title of the document to be classified with the preset keywords of the preset document category and, on a match, assigns the document to the preset category, so that classification is completed in a few simple steps, which improves classification efficiency while keeping classification accuracy relatively high.
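Steps S203 to S206 can be combined into a small sketch in which only the back segment is searched. The keyword table and the names `back_segment` and `classify_by_back_segment` are assumptions for illustration, and the even split by character count is just one of the two splitting options the text mentions.

```python
# Hypothetical preset keywords; the patent lists many more for policy documents.
PRESET_KEYWORDS = {"policy": ["notice", "decision", "measures"]}
THIRD_CATEGORY = "third-category"


def back_segment(title: str) -> str:
    """S203: keep the second half of the title, split by character count."""
    return title[len(title) // 2:]


def classify_by_back_segment(title: str, preset_keywords=PRESET_KEYWORDS) -> str:
    """S204 to S206: compare only the back segment with the preset keywords."""
    segment = back_segment(title).casefold()
    for category, keywords in preset_keywords.items():
        if any(keyword.casefold() in segment for keyword in keywords):
            return category  # S205: preset category
    return THIRD_CATEGORY    # S206: third category
```

One consequence of this design is worth noting: a keyword that appears only in the front segment is not seen, so a title such as "Notice: offices will close during the Spring Festival holiday period" would fall into the third category under this sketch. The approach therefore relies on the observation in the text that the type keyword usually sits at the end of the title.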
Fig. 3 is a schematic diagram of embodiment one of the document classification device of the present invention. As shown in Fig. 3, the document classification device of this embodiment may specifically include an acquisition module 31, a comparison module 32, and a classification module 33.
The acquisition module 31 is configured to obtain the title of the document to be classified;
the comparison module 32 is configured to compare the title of the document to be classified with the preset keywords of the preset document category;
the classification module 33 is configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
The mechanism by which the document classification device of this embodiment classifies documents using the above modules is the same as that of the document classification method of the embodiment shown in Fig. 1; for details, refer to the description of that embodiment, which is not repeated here.
Fig. 4 is a schematic diagram of embodiment two of the document classification device of the present invention. Building on the embodiment shown in Fig. 3, this embodiment describes the technical solution in further detail. As shown in Fig. 4, the document classification device of this embodiment may further include the following.
The acquisition module 31 includes:
an acquisition submodule 311, configured to obtain the document to be classified;
an extraction submodule 312, configured to extract the title of the document to be classified from the document to be classified.
Further, the comparison module 32 includes:
a splitting submodule 321, configured to divide the title of the document to be classified into a front segment and a back segment;
a comparison submodule 322, configured to compare the back segment with the preset keywords of the preset document category.
Further, the classification module 33 is additionally configured to:
determine that the document to be classified belongs to the third category of documents when the title of the document to be classified does not contain the preset keywords of the preset document category.
The above document to be classified is in HTML format.
The mechanism by which the document classification device of this embodiment classifies documents using the above modules is the same as that of the document classification method of the embodiment shown in Fig. 2; for details, refer to the description of that embodiment, which is not repeated here.
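The module structure of Figs. 3 and 4 might be mirrored in software roughly as below. This is only a sketch under stated assumptions: the class name `DocumentClassifier`, the helper `first_line`, and the choice to pass the acquisition step in as a callable are all hypothetical, and the patent does not say that the modules are implemented as Python objects.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def first_line(document: str) -> str:
    """Hypothetical acquisition step: treat the first line as the title."""
    return document.split("\n", 1)[0]


@dataclass
class DocumentClassifier:
    """Wires together counterparts of modules 31, 32 and 33."""
    acquire_title: Callable[[str], str]   # acquisition module 31
    keywords: dict                         # preset keywords per category
    fallback: str = "third-category"       # used when no keyword matches

    def split(self, title: str) -> str:
        # Splitting submodule 321: keep the back half of the title.
        return title[len(title) // 2:]

    def compare(self, title: str) -> Optional[str]:
        # Comparison submodule 322: search the back segment only.
        segment = self.split(title).casefold()
        for category, words in self.keywords.items():
            if any(word.casefold() in segment for word in words):
                return category
        return None

    def classify(self, document: str) -> str:
        # Classification module 33: preset category or third category.
        title = self.acquire_title(document)
        return self.compare(title) or self.fallback
```

A classifier could then be built with `DocumentClassifier(acquire_title=first_line, keywords={"policy": ["notice"]})` and applied with `classify(document_text)`; swapping in a different `acquire_title` (for example, an HTML title extractor) would not change the other modules.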
The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the scope of protection of the present invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the present invention within its spirit and scope of protection, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. A document classification method, characterized by comprising:
obtaining the title of a document to be classified;
comparing the title of the document to be classified with a preset keyword of a preset document category, and judging whether the title of the document to be classified contains the preset keyword;
if so, determining that the document to be classified belongs to the preset category.
2. The method according to claim 1, characterized in that obtaining the title of the document to be classified comprises:
obtaining the document to be classified;
extracting the title of the document to be classified from the document to be classified.
3. The method according to claim 1, characterized in that comparing the title of the document to be classified with the preset keyword of the preset document category comprises:
dividing the title of the document to be classified into a front segment and a back segment;
comparing the back segment with the preset keyword of the preset document category.
4. The method according to claim 1, characterized in that, after comparing the title of the document to be classified with the preset keyword of the preset document category, the method further comprises:
if the title of the document to be classified does not contain the preset keyword of the preset document category, determining that the document to be classified belongs to a third category of documents.
5. The method according to any one of claims 1 to 4, characterized in that the document to be classified is in HTML format.
6. A document classification device, characterized by comprising:
an acquisition module, configured to obtain the title of a document to be classified;
a comparison module, configured to compare the title of the document to be classified with a preset keyword of a preset document category, and to judge whether the title of the document to be classified contains the preset keyword;
a classification module, configured to determine that the document to be classified belongs to the preset category when the title of the document to be classified matches the preset document category.
7. The device according to claim 6, characterized in that the acquisition module comprises:
an acquisition submodule, configured to obtain the document to be classified;
an extraction submodule, configured to extract the title of the document to be classified from the document to be classified.
8. The device according to claim 6, characterized in that the comparison module comprises:
a splitting submodule, configured to divide the title of the document to be classified into a front segment and a back segment;
a comparison submodule, configured to compare the back segment with the preset keyword of the preset document category.
9. The device according to claim 6, characterized in that the classification module is further configured to:
determine that the document to be classified belongs to a third category of documents when the title of the document to be classified does not contain the preset keyword of the preset document category.
10. The device according to any one of claims 6 to 9, characterized in that the document to be classified is in HTML format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611080410.7A CN106503266A (en) | 2016-11-30 | 2016-11-30 | Document Classification Method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611080410.7A CN106503266A (en) | 2016-11-30 | 2016-11-30 | Document Classification Method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106503266A true CN106503266A (en) | 2017-03-15 |
Family
ID=58328028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611080410.7A Pending CN106503266A (en) | 2016-11-30 | 2016-11-30 | Document Classification Method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106503266A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388601A (en) * | 2018-02-02 | 2018-08-10 | 腾讯科技(深圳)有限公司 | Sorting technique, storage medium and the computer equipment of failure |
CN109002483A (en) * | 2018-06-22 | 2018-12-14 | 平安科技(深圳)有限公司 | Document management method, device, computer equipment and storage medium |
CN110135264A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Data entry method, device, computer equipment and storage medium |
CN110413569A (en) * | 2019-07-30 | 2019-11-05 | 石浩灼 | Archives of paper quality electronization archiving method, device and terminal device |
CN111177392A (en) * | 2019-12-31 | 2020-05-19 | 腾讯云计算(北京)有限责任公司 | Data processing method and device |
CN111626076A (en) * | 2019-02-27 | 2020-09-04 | 富士通株式会社 | Information processing method, information processing apparatus, and scanner |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679875A (en) * | 2015-03-10 | 2015-06-03 | 杭州凡闻科技有限公司 | Method for classifying information data based on digital newspaper |
CN105740289A (en) * | 2014-12-11 | 2016-07-06 | 阿里巴巴集团控股有限公司 | Method and system for classifying text |
CN105760526A (en) * | 2016-03-01 | 2016-07-13 | 网易(杭州)网络有限公司 | News classification method and device |
CN105893467A (en) * | 2016-03-28 | 2016-08-24 | 北京麒麟合盛网络技术有限公司 | Information classification method and apparatus |
-
2016
- 2016-11-30 CN CN201611080410.7A patent/CN106503266A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106503266A (en) | Document Classification Method and device | |
US9565236B2 (en) | Automatic genre classification determination of web content to which the web content belongs together with a corresponding genre probability | |
CN111310476B (en) | Public opinion monitoring method and system using aspect-based emotion analysis method | |
CN103309862B (en) | Webpage type recognition method and system | |
US20140180934A1 (en) | Systems and Methods for Using Non-Textual Information In Analyzing Patent Matters | |
CN101794311A (en) | Fuzzy data mining based automatic classification method of Chinese web pages | |
CN104408093A (en) | News event element extracting method and device | |
CN110334178A (en) | Data retrieval method, device, equipment and readable storage medium storing program for executing | |
CN111125343A (en) | Text analysis method and device suitable for human-sentry matching recommendation system | |
CN103914478A (en) | Webpage training method and system and webpage prediction method and system | |
US20140280014A1 (en) | Apparatus and method for automatic assignment of industry classification codes | |
CN103177036A (en) | Method and system for label automatic extraction | |
CN110134842B (en) | Information matching method and device based on information map, storage medium and server | |
CN103577462A (en) | Document classification method and document classification device | |
CN103838798A (en) | Page classification system and method | |
CN105630931A (en) | Document classification method and device | |
CN105183710A (en) | Method for automatically generating document summary | |
CN109710825A (en) | Webpage harmful information identification method based on machine learning | |
CN106156143A (en) | Page processor and web page processing method | |
CN108536664A (en) | The knowledge fusion method in commodity field | |
CN108241867A (en) | A kind of sorting technique and device | |
CN113901169A (en) | Information processing method, information processing device, electronic equipment and storage medium | |
US10504145B2 (en) | Automated classification of network-accessible content based on events | |
CN103678601A (en) | Model essay retrieval request processing method and device | |
CN103164491B (en) | The method and apparatus of a kind of data processing and retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170315 |