CN104239479A - Document classification method and system - Google Patents

Info

Publication number
CN104239479A
Authority
CN
China
Prior art keywords
document
sorted
classification
training
characteristic attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410449140.7A
Other languages
Chinese (zh)
Inventor
宗栋瑞
郭美思
吴楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-09-04
Filing date
2014-09-04
Publication date
2014-12-24
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410449140.7A
Publication of CN104239479A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The invention discloses a document classification method and a document classification system, and is applied to a Hadoop cluster comprising a Map program and a Reduce program. The method comprises the following steps that the Map program parses a training document and a document to be classified, determines a characteristic attribute according to a parsing result, and divides the characteristic attribute; the Map program generates a classifier according to the characteristic attribute of the training document and a classification result of the training document; the Reduce program classifies the document to be classified to obtain a classification result of the document to be classified by virtue of the classifier. According to the method and the system, a distributed characteristic of the Hadoop cluster is fully utilized, and the limitation of a conventional system frame is avoided; the method and the system have the characteristics of concurrency and high speed; massive documents can be rapidly classified, so that classification time is saved, and the document classification efficiency and the system performance are improved.

Description

Document classification method and system
Technical field
The present invention relates to the field of computer technology, and in particular to a document classification method and system.
Background
With the growing popularity of network technology, the volume of data on networks has increased sharply and the variety of applications has become very rich. Data mining makes full use of existing information resources to uncover hidden knowledge in massive data, and is a promising direction of development. Data mining draws on fields such as machine learning, pattern recognition, statistics, intelligent databases, data visualization and high-performance computing; its goal is to discover implicit, novel and interesting relationships and rules in massive data. Document classification is an important branch of data mining.
In the prior art, document classification is usually carried out with a conventional system framework. When massive data is processed, this leads to long classification times and poor system performance.
Summary of the invention
The present invention provides a document classification method and system to overcome the poor system performance of the prior art.
The present invention provides a document classification method applied to a Hadoop cluster comprising a Map program and a Reduce program, the method comprising the following steps:
The Map program parses training documents and documents to be classified, determines characteristic attributes according to the parsing result, and divides the characteristic attributes;
The Map program generates a classifier according to the characteristic attributes of the training documents and the classification results of the training documents;
The Reduce program uses the classifier to classify the documents to be classified, obtaining the classification results of the documents to be classified.
Optionally, after the Map program determines the characteristic attributes according to the parsing result, the method further comprises:
The Map program performs format conversion on the training documents and the documents to be classified respectively, according to the characteristic attributes, obtaining training documents and documents to be classified that conform to a preset format;
The Map program generating a classifier according to the characteristic attributes of the training documents and the classification results of the training documents is specifically:
The Map program generates the classifier according to the characteristic attributes of the format-converted training documents and the classification results of the training documents;
The Reduce program using the classifier to classify the documents to be classified and obtaining their classification results is specifically:
The Reduce program uses the classifier to classify the format-converted documents to be classified, obtaining the classification results of the documents to be classified.
Optionally, the Map program generating the classifier according to the characteristic attributes of the format-converted training documents and the classification results of the training documents is specifically:
According to the value range of each characteristic attribute of the format-converted training documents and the classification results of the training documents, the Map program calculates the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and records the frequencies of occurrence and the conditional probability estimates as the classifier.
Optionally, the Reduce program using the classifier to classify the format-converted documents to be classified and obtaining their classification results is specifically:
The Reduce program obtains the value ranges of all characteristic attributes of a format-converted document to be classified. According to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, it calculates the conditional probability that the document to be classified belongs to each class, and takes the class with the largest conditional probability as the classification result of the document to be classified.
Optionally, the Map program parsing the training documents and the documents to be classified, determining the characteristic attributes according to the parsing result, and dividing the characteristic attributes is specifically:
By parsing the training documents and the documents to be classified, the Map program obtains the attributes they contain, selects characteristic attributes from the attributes obtained by parsing, and divides each characteristic attribute into multiple value ranges.
The present invention also provides a document classification system applied to a Hadoop cluster, the system comprising:
A parsing module, configured to parse training documents and documents to be classified, determine characteristic attributes according to the parsing result, and divide the characteristic attributes;
A generation module, configured to generate a classifier according to the characteristic attributes of the training documents determined by the parsing module and the classification results of the training documents;
A classification module, configured to use the classifier generated by the generation module to classify the documents to be classified, obtaining the classification results of the documents to be classified.
Optionally, the system further comprises:
A conversion module, configured to perform format conversion on the training documents and the documents to be classified respectively, according to the characteristic attributes determined by the parsing module, obtaining training documents and documents to be classified that conform to a preset format;
The generation module is specifically configured to generate the classifier according to the characteristic attributes of the training documents converted by the conversion module and the classification results of the training documents;
The classification module is specifically configured to use the classifier generated by the generation module to classify the documents to be classified that were converted by the conversion module, obtaining the classification results of the documents to be classified.
Optionally, the generation module is specifically configured to calculate, according to the value range of each characteristic attribute of the training documents converted by the conversion module and the classification results of the training documents, the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and to record the frequencies of occurrence and the conditional probability estimates as the classifier.
Optionally, the classification module is specifically configured to obtain the value ranges of all characteristic attributes of a document to be classified that was converted by the conversion module; to calculate, according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, the conditional probability that the document to be classified belongs to each class; and to take the class with the largest conditional probability as the classification result of the document to be classified.
Optionally, the parsing module is specifically configured to obtain, by parsing the training documents and the documents to be classified, the attributes they contain, to select characteristic attributes from the attributes obtained by parsing, and to divide each characteristic attribute into multiple value ranges.
The present invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of a conventional system framework. Because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves both the efficiency of document classification and system performance.
Brief description of the drawings
Fig. 1 is a flowchart of a document classification method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a document classification system in an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another, and all such combinations fall within the scope of protection of the present invention. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
An embodiment of the present invention proposes a document classification method applied to a Hadoop cluster comprising a Map program and a Reduce program. After Hadoop commands are used to place the training documents and the documents to be classified on HDFS (Hadoop Distributed File System), the operations shown in Fig. 1 are performed:
Step 101: the Map program parses the training documents and the documents to be classified, determines characteristic attributes according to the parsing result, and divides the characteristic attributes.
Specifically, by parsing the training documents and the documents to be classified, the Map program may obtain the attributes they contain, select characteristic attributes from the attributes obtained by parsing, and divide each characteristic attribute into multiple value ranges.
The training documents and the documents to be classified may be located in different HDFS directories and organized into subdirectories: the name of each folder is a class label, and the folder contains the documents that belong to the class corresponding to that label.
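The folder-per-class layout described above can be illustrated with a short Python sketch. It assumes a hypothetical path of the form `/<root>/<class label>/<document>`; the folder and file names are made up for illustration and do not come from the patent.

```python
from pathlib import PurePosixPath


def class_label_from_path(doc_path: str) -> str:
    """Return the name of the folder that directly contains the document.

    Under the layout described in the text, that folder name is the class label.
    """
    return PurePosixPath(doc_path).parent.name


# Hypothetical HDFS-style path; "genuine" is an illustrative class label.
label = class_label_from_path("/train/genuine/account_0001.txt")
print(label)
```

Only path manipulation is shown here; reading the files from HDFS is outside the scope of the sketch.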
For example, the training documents are located in the /train directory of HDFS and the documents to be classified are located in the /test directory of HDFS. According to the result of parsing the training documents and the documents to be classified, the Map program selects 3 characteristic attributes: a, number of log entries / number of days since registration; b, number of friends / number of days since registration; c, whether a real profile picture is used. Each characteristic attribute is divided into value ranges as follows: {a<=0.05, 0.05<a<0.2, a>=0.2}; {b<=0.1, 0.1<b<0.8, b>=0.8}; {c=0 (no), c=1 (yes)}.
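The division into value ranges can be sketched in Python as follows. The thresholds are the ones given in the example; the function names, the string labels for the ranges, and the raw input fields (`log_count`, `friend_count`, `days_registered`, `uses_real_avatar`) are assumptions made for illustration.

```python
from typing import Tuple


def bin_a(a: float) -> str:
    """Value range of attribute a (log entries per day since registration)."""
    if a <= 0.05:
        return "a<=0.05"
    if a < 0.2:
        return "0.05<a<0.2"
    return "a>=0.2"


def bin_b(b: float) -> str:
    """Value range of attribute b (friends per day since registration)."""
    if b <= 0.1:
        return "b<=0.1"
    if b < 0.8:
        return "0.1<b<0.8"
    return "b>=0.8"


def bin_c(uses_real_avatar: bool) -> str:
    """Value of attribute c (1 if a real profile picture is used, else 0)."""
    return "c=1" if uses_real_avatar else "c=0"


def discretize(log_count: int, friend_count: int,
               days_registered: int, uses_real_avatar: bool) -> Tuple[str, str, str]:
    """Map raw account figures to the three value ranges (days_registered > 0 assumed)."""
    return (bin_a(log_count / days_registered),
            bin_b(friend_count / days_registered),
            bin_c(uses_real_avatar))


ranges = discretize(log_count=10, friend_count=50,
                    days_registered=100, uses_real_avatar=False)
print(ranges)
```

Representing each value range as a string keeps the later counting and lookup steps simple; any other stable key would work equally well.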
Step 102: the Map program performs format conversion on the training documents and the documents to be classified respectively, according to the determined characteristic attributes, obtaining training documents and documents to be classified that conform to a preset format.
Specifically, the Map program may invoke the PrepareTwentyNewsgroups class of Mahout from the command line to convert the training documents and the documents to be classified into training documents and documents to be classified that conform to the preset format. The preset format may be the VectorWritable format; in a document conforming to this format, the first character is the class label and the remaining characters are the characteristic attributes.
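The "label first, attributes after" record layout can be illustrated with a plain-text stand-in. The sketch below is not Mahout's VectorWritable serialization; it assumes a hypothetical tab-separated line whose first field is the class label, and the helper names `encode_record` and `decode_record` are invented for illustration.

```python
from typing import List, Tuple


def encode_record(label: str, attributes: List[str]) -> str:
    """Join the class label and the attribute value ranges into one line."""
    return "\t".join([label, *attributes])


def decode_record(line: str) -> Tuple[str, List[str]]:
    """Split a line back into (class label, attribute value ranges)."""
    label, *attributes = line.rstrip("\n").split("\t")
    return label, attributes


line = encode_record("0", ["0.05<a<0.2", "0.1<b<0.8", "c=0"])
decoded = decode_record(line)
print(repr(line))
print(decoded)
```

Putting the label in a fixed position lets the training step and the classification step share one parser; for documents to be classified, the label field could hold a placeholder.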
Step 103: the Map program generates a classifier according to the characteristic attributes of the format-converted training documents and the classification results of the training documents.
Specifically, according to the value range of each characteristic attribute of the format-converted training documents and the classification results of the training documents, the Map program may calculate the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and record these frequencies of occurrence and conditional probability estimates as the classifier.
For example, there are 10,000 training documents, classified as follows: 8,900 training documents belong to genuine accounts (that is, C=0) and 1,100 training documents belong to non-genuine accounts (that is, C=1).
The frequency of occurrence of each class in the training documents is:
P(C=0) = 8900/10000 = 0.89;
P(C=1) = 1100/10000 = 0.11;
Under each class, the conditional probability estimates for each value range of every characteristic attribute are:
P(a<=0.05 | C=0) = 0.3
P(0.05<a<0.2 | C=0) = 0.5
P(a>=0.2 | C=0) = 0.2
P(a<=0.05 | C=1) = 0.8
P(0.05<a<0.2 | C=1) = 0.1
P(a>=0.2 | C=1) = 0.1
P(b<=0.1 | C=0) = 0.1
P(0.1<b<0.8 | C=0) = 0.7
P(b>=0.8 | C=0) = 0.2
P(b<=0.1 | C=1) = 0.7
P(0.1<b<0.8 | C=1) = 0.2
P(b>=0.8 | C=1) = 0.1
P(c=0 | C=0) = 0.2
P(c=1 | C=0) = 0.8
P(c=0 | C=1) = 0.9
P(c=1 | C=1) = 0.1
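How such frequencies and conditional probability estimates could be obtained by counting is sketched below in Python. This is a single-process illustration under stated assumptions, not the patent's Map program: the function name `train`, the record layout `(label, value_ranges)`, and the four toy records are all hypothetical, and no smoothing is applied because the text does not describe any.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[int, Tuple[str, ...]]  # (class label, value range of each attribute)


def train(records: Iterable[Record]):
    """Estimate P(C) and P(value range | C) from labeled records by counting."""
    class_counts: Counter = Counter()
    value_counts: Counter = Counter()  # keyed by (label, attribute index, value range)
    for label, values in records:
        class_counts[label] += 1
        for index, value in enumerate(values):
            value_counts[(label, index, value)] += 1

    total = sum(class_counts.values())
    priors: Dict[int, float] = {c: n / total for c, n in class_counts.items()}
    cond: Dict[Tuple[int, int, str], float] = {
        key: n / class_counts[key[0]] for key, n in value_counts.items()
    }
    return priors, cond


# Toy data for illustration only; these are not the patent's 10,000 documents.
records = [
    (0, ("0.05<a<0.2", "0.1<b<0.8", "c=1")),
    (0, ("0.05<a<0.2", "b>=0.8", "c=1")),
    (0, ("a<=0.05", "0.1<b<0.8", "c=0")),
    (1, ("a<=0.05", "b<=0.1", "c=0")),
]
priors, cond = train(records)
print(priors)
print(cond[(0, 0, "0.05<a<0.2")])
```

Because every estimate is a ratio of counts keyed by (class, attribute, value range), the counting lends itself to an emit-and-sum pattern across many workers. A value range never seen under a class gets no entry here, so a practical implementation would need to decide how to handle that case.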
Step 104: the Reduce program uses the classifier to classify the format-converted documents to be classified, obtaining the classification results of the documents to be classified.
Specifically, the Reduce program may obtain the value ranges of all characteristic attributes of a format-converted document to be classified. According to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, it calculates the conditional probability that the document to be classified belongs to each class, and records the class with the largest conditional probability on HDFS as the classification result of the document to be classified.
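The decision rule described in this step can be written compactly. The notation is an assumption introduced here: C is a class, x = (x_1, ..., x_n) are the value ranges observed for the n characteristic attributes of the document to be classified, and the product form presumes that the attributes are conditionally independent given the class, which is what multiplying per-attribute estimates amounts to.

```latex
\hat{C} \;=\; \arg\max_{C} \; P(C)\, P(x \mid C)
        \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The quantity P(C) P(x | C) is proportional to the posterior P(C | x), since the normalizing factor P(x) is the same for every class; comparing the unnormalized products is therefore enough to pick the most probable class.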
For example, the value ranges of the 3 characteristic attributes of a document to be classified are 0.05<a<0.2, 0.1<b<0.8 and c=0. The conditional probability that the document to be classified belongs to a genuine account (that is, C=0) is then:
P(C=0) P(x | C=0)
= P(C=0) P(0.05<a<0.2 | C=0) P(0.1<b<0.8 | C=0) P(c=0 | C=0)
= 0.89 * 0.5 * 0.7 * 0.2
= 0.0623;
The conditional probability that the document to be classified belongs to a non-genuine account (that is, C=1) is:
P(C=1) P(x | C=1)
= P(C=1) P(0.05<a<0.2 | C=1) P(0.1<b<0.8 | C=1) P(c=0 | C=1)
= 0.11 * 0.1 * 0.2 * 0.9
= 0.00198
Because the conditional probability that the document to be classified belongs to a genuine account is the largest, the Reduce program determines that this document belongs to a genuine account.
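The arithmetic of this worked example can be checked with a few lines of Python. The probabilities are the ones listed in the example; the dictionary layout and the names `PRIORS`, `COND`, `score` and `classify` are assumptions made for the sketch.

```python
import math
from typing import Sequence

# Class frequencies and the conditional estimates needed for this document,
# copied from the worked example (0 = genuine account, 1 = non-genuine account).
PRIORS = {0: 0.89, 1: 0.11}
COND = {
    0: {"0.05<a<0.2": 0.5, "0.1<b<0.8": 0.7, "c=0": 0.2},
    1: {"0.05<a<0.2": 0.1, "0.1<b<0.8": 0.2, "c=0": 0.9},
}


def score(label: int, values: Sequence[str]) -> float:
    """Compute P(C) times the product of P(value range | C) for one class."""
    return PRIORS[label] * math.prod(COND[label][v] for v in values)


def classify(values: Sequence[str]) -> int:
    """Return the class with the largest score."""
    return max(PRIORS, key=lambda label: score(label, values))


x = ["0.05<a<0.2", "0.1<b<0.8", "c=0"]
score_genuine = score(0, x)
score_fake = score(1, x)
predicted = classify(x)
posterior_genuine = score_genuine / (score_genuine + score_fake)
print(score_genuine, score_fake, predicted, posterior_genuine)
```

Dividing each score by the sum of the two scores gives a normalized probability of roughly 0.97 for the genuine class, which leads to the same decision as comparing the raw products.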
The embodiment of the present invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of a conventional system framework. Because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves both the efficiency of document classification and system performance.
Based on the document classification method described above, an embodiment of the present invention proposes a document classification system applied to a Hadoop cluster. As shown in Fig. 2, the system comprises:
A parsing module 210, configured to parse training documents and documents to be classified, determine characteristic attributes according to the parsing result, and divide the characteristic attributes;
Specifically, the parsing module 210 is configured to obtain, by parsing the training documents and the documents to be classified, the attributes they contain, to select characteristic attributes from the attributes obtained by parsing, and to divide each characteristic attribute into multiple value ranges.
A generation module 220, configured to generate a classifier according to the characteristic attributes of the training documents determined by the parsing module 210 and the classification results of the training documents;
A classification module 230, configured to use the classifier generated by the generation module 220 to classify the documents to be classified, obtaining the classification results of the documents to be classified.
Further, the system also comprises:
A conversion module 240, configured to perform format conversion on the training documents and the documents to be classified respectively, according to the characteristic attributes determined by the parsing module 210, obtaining training documents and documents to be classified that conform to a preset format;
Correspondingly, the generation module 220 is specifically configured to generate the classifier according to the characteristic attributes of the training documents converted by the conversion module 240 and the classification results of the training documents;
The classification module 230 is specifically configured to use the classifier generated by the generation module 220 to classify the documents to be classified that were converted by the conversion module 240, obtaining the classification results of the documents to be classified.
Further, the generation module 220 is specifically configured to calculate, according to the value range of each characteristic attribute of the training documents converted by the conversion module 240 and the classification results of the training documents, the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and to record the frequencies of occurrence and the conditional probability estimates as the classifier.
Correspondingly, the classification module 230 is specifically configured to obtain the value ranges of all characteristic attributes of a document to be classified that was converted by the conversion module 240; to calculate, according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, the conditional probability that the document to be classified belongs to each class; and to take the class with the largest conditional probability as the classification result of the document to be classified.
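The data flow between the four modules can be sketched as a small Python class. This is only an illustration of how the modules hand results to one another: the class name, the callable signatures, and the deliberately trivial stand-in functions (a majority-class "classifier") are assumptions and do not represent the patent's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Record = Tuple[Any, Tuple[Any, ...]]  # (class label or None, attribute values)


@dataclass
class DocumentClassificationSystem:
    parse: Callable[[Sequence[Dict], Sequence[Dict]], List[str]]  # parsing module 210
    convert: Callable[[Dict, List[str]], Record]                   # conversion module 240
    generate: Callable[[List[Record]], Any]                        # generation module 220
    classify: Callable[[Any, Record], Any]                         # classification module 230

    def run(self, training_docs: Sequence[Dict], docs_to_classify: Sequence[Dict]) -> List[Any]:
        """Parse, convert, generate the classifier, then classify each document."""
        attributes = self.parse(training_docs, docs_to_classify)
        train_records = [self.convert(d, attributes) for d in training_docs]
        test_records = [self.convert(d, attributes) for d in docs_to_classify]
        classifier = self.generate(train_records)
        return [self.classify(classifier, r) for r in test_records]


# Trivial stand-ins, used only to show the wiring between modules.
system = DocumentClassificationSystem(
    parse=lambda train, test: ["a", "c"],
    convert=lambda doc, attrs: (doc.get("label"), tuple(doc[k] for k in attrs)),
    generate=lambda records: Counter(label for label, _ in records).most_common(1)[0][0],
    classify=lambda classifier, record: classifier,
)
training = [
    {"label": 0, "a": "0.05<a<0.2", "c": "c=1"},
    {"label": 0, "a": "a<=0.05", "c": "c=1"},
    {"label": 1, "a": "a<=0.05", "c": "c=0"},
]
result = system.run(training, [{"a": "a<=0.05", "c": "c=0"}])
print(result)
```

Passing the modules in as callables keeps the pipeline order fixed while letting each stage be replaced independently, for example by a counting-based generator and a probability-based classifier.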
The embodiment of the present invention makes full use of the distributed nature of the Hadoop cluster and avoids the limitations of a conventional system framework. Because processing is parallel and fast, massive numbers of documents can be classified quickly, which saves classification time and improves both the efficiency of document classification and system performance.
The steps of the method described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A document classification method, characterized in that it is applied to a Hadoop cluster comprising a Map program and a Reduce program, the method comprising the following steps:
The Map program parses training documents and documents to be classified, determines characteristic attributes according to the parsing result, and divides the characteristic attributes;
The Map program generates a classifier according to the characteristic attributes of the training documents and the classification results of the training documents;
The Reduce program uses the classifier to classify the documents to be classified, obtaining the classification results of the documents to be classified.
2. The method of claim 1, characterized in that, after the Map program determines the characteristic attributes according to the parsing result, the method further comprises:
The Map program performs format conversion on the training documents and the documents to be classified respectively, according to the characteristic attributes, obtaining training documents and documents to be classified that conform to a preset format;
The Map program generating a classifier according to the characteristic attributes of the training documents and the classification results of the training documents is specifically:
The Map program generates the classifier according to the characteristic attributes of the format-converted training documents and the classification results of the training documents;
The Reduce program using the classifier to classify the documents to be classified and obtaining their classification results is specifically:
The Reduce program uses the classifier to classify the format-converted documents to be classified, obtaining the classification results of the documents to be classified.
3. The method of claim 2, characterized in that the Map program generating the classifier according to the characteristic attributes of the format-converted training documents and the classification results of the training documents is specifically:
According to the value range of each characteristic attribute of the format-converted training documents and the classification results of the training documents, the Map program calculates the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and records the frequencies of occurrence and the conditional probability estimates as the classifier.
4. The method of claim 3, characterized in that the Reduce program using the classifier to classify the format-converted documents to be classified and obtaining their classification results is specifically:
The Reduce program obtains the value ranges of all characteristic attributes of a format-converted document to be classified; according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, it calculates the conditional probability that the document to be classified belongs to each class, and takes the class with the largest conditional probability as the classification result of the document to be classified.
5. The method of claim 1, characterized in that the Map program parsing the training documents and the documents to be classified, determining the characteristic attributes according to the parsing result, and dividing the characteristic attributes is specifically:
By parsing the training documents and the documents to be classified, the Map program obtains the attributes they contain, selects characteristic attributes from the attributes obtained by parsing, and divides each characteristic attribute into multiple value ranges.
6. A document classification system, characterized in that it is applied to a Hadoop cluster, the system comprising:
A parsing module, configured to parse training documents and documents to be classified, determine characteristic attributes according to the parsing result, and divide the characteristic attributes;
A generation module, configured to generate a classifier according to the characteristic attributes of the training documents determined by the parsing module and the classification results of the training documents;
A classification module, configured to use the classifier generated by the generation module to classify the documents to be classified, obtaining the classification results of the documents to be classified.
7. The system of claim 6, characterized in that it further comprises:
A conversion module, configured to perform format conversion on the training documents and the documents to be classified respectively, according to the characteristic attributes determined by the parsing module, obtaining training documents and documents to be classified that conform to a preset format;
The generation module is specifically configured to generate the classifier according to the characteristic attributes of the training documents converted by the conversion module and the classification results of the training documents;
The classification module is specifically configured to use the classifier generated by the generation module to classify the documents to be classified that were converted by the conversion module, obtaining the classification results of the documents to be classified.
8. The system of claim 7, characterized in that,
The generation module is specifically configured to calculate, according to the value range of each characteristic attribute of the training documents converted by the conversion module and the classification results of the training documents, the frequency of occurrence of each class in the training documents and the conditional probability estimate of each value range of every characteristic attribute under each class, and to record the frequencies of occurrence and the conditional probability estimates as the classifier.
9. The system of claim 8, characterized in that,
The classification module is specifically configured to obtain the value ranges of all characteristic attributes of a document to be classified that was converted by the conversion module; to calculate, according to the obtained value ranges, the frequency of occurrence of each class in the training documents, and the conditional probability estimate of each value range of every characteristic attribute under each class, the conditional probability that the document to be classified belongs to each class; and to take the class with the largest conditional probability as the classification result of the document to be classified.
10. The system of claim 6, characterized in that,
The parsing module is specifically configured to obtain, by parsing the training documents and the documents to be classified, the attributes they contain, to select characteristic attributes from the attributes obtained by parsing, and to divide each characteristic attribute into multiple value ranges.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410449140.7A CN104239479A (en) 2014-09-04 2014-09-04 Document classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410449140.7A CN104239479A (en) 2014-09-04 2014-09-04 Document classification method and system

Publications (1)

Publication Number Publication Date
CN104239479A (en) 2014-12-24

Family

ID=52227538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410449140.7A Pending CN104239479A (en) 2014-09-04 2014-09-04 Document classification method and system

Country Status (1)

Country Link
CN (1) CN104239479A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889309A (en) * 2018-09-07 2020-03-17 上海怀若智能科技有限公司 Financial document classification management system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040015557A1 (en) * 1999-07-30 2004-01-22 Eric Horvitz Methods for routing items for communications based on a measure of criticality
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN102639205A (en) * 2009-07-20 2012-08-15 Esk陶瓷有限及两合公司 Separation apparatus for tubular flow-through apparatuses
CN103455842A (en) * 2013-09-04 2013-12-18 福州大学 Credibility measuring method combining Bayesian algorithm and MapReduce

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
卫洁 等: "基于Hadoop的分布式朴素贝叶斯文本分类", 《计算机系统应用》 *
喜歌: "贝叶斯分类", 《HTTP://WWW.CNBLOGS.COM/HEXINUAA/ARTICLES/2143483.HTML》 *
董西成: "2.3.2 MapReduce编程实例", 《HTTP://BOOK.51CTO.COM/ART/201312/422139.HTM》 *

Similar Documents

Publication Publication Date Title
CN110019218B (en) Data storage and query method and equipment
CN104850633B (en) A kind of three-dimensional model searching system and method based on the segmentation of cartographical sketching component
CN102722713B (en) Handwritten numeral recognition method based on lie group structure data and system thereof
CN103679132B (en) A kind of nude picture detection method and system
WO2021109464A1 (en) Personalized teaching resource recommendation method for large-scale users
CN101446962B (en) Data conversion method, device thereof and data processing system
CN106528874B (en) The CLR multi-tag data classification method of big data platform is calculated based on Spark memory
US20170337229A1 (en) Spatial indexing for distributed storage using local indexes
TWI464604B (en) Data clustering method and device, data processing apparatus and image processing apparatus
CN105279277A (en) Knowledge data processing method and device
JP2018501579A (en) Semantic representation of image content
CN104462802A (en) Method for analyzing outlier data in large-scale data
CN103020645A (en) System and method for junk picture recognition
CN108073815A (en) Family's determination method, system and storage medium based on code slice
Azri et al. Dendrogram clustering for 3D data analytics in smart city
CN106407392A (en) A marking language-based node mapping relationship extracting method and system
CN103839074A (en) Image classification method based on matching of sketch line segment information and space pyramid
CN103473275A (en) Automatic image labeling method and automatic image labeling system by means of multi-feature fusion
CN107463624A (en) A kind of method and system that city interest domain identification is carried out based on social media data
CN110874366A (en) Data processing and query method and device
JP5765583B2 (en) Multi-class classifier, multi-class classifying method, and program
CN104239479A (en) Document classification method and system
CN104573101B (en) A kind of data flow real-time grading method and system of rule-based route
CN104008095A (en) Object recognition method based on semantic feature extraction and matching
CN113282568B (en) IOT big data real-time sequence flow analysis application technical method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141224