CN110019785A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN110019785A
Authority
CN
China
Prior art keywords
text
sorted
classification
referenced
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710910888.6A
Other languages
Chinese (zh)
Other versions
CN110019785B (en
Inventor
胡斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710910888.6A
Publication of CN110019785A
Application granted
Publication of CN110019785B
Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a text classification method and device. The method comprises: obtaining a text to be classified and a database containing reference texts of multiple different categories; evaluating the similarity between the text to be classified and each reference text in the database, to obtain a similarity metric value between the text to be classified and each reference text; obtaining, according to the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category; and determining the category corresponding to the maximum evaluation value as the category of the text to be classified. Because the categories of the reference texts in the database can be enumerated exhaustively, classification can be carried out from the similarity between the text to be classified and reference texts of known category. Any text to be classified can therefore be classified, which improves the coverage of text classification.

Description

Text classification method and device
Technical field
This application relates to the technical field of big data, and in particular to a text classification method and device.
Background
Text classification automatically labels texts with categories according to a given classification system or standard, so that users can browse texts conveniently and can also query the texts they need by category.
Current text classification generally relies on strong rule matching, implemented with mechanisms such as regular expressions or decision tree algorithms. In a specific implementation, a set of manually defined condition rules is preset before the algorithm runs. The rules are then matched in order, and the category of a text is determined by the category of the rule that the text matches.
Because different authors inevitably express the same content in different ways, and manually defined rules are limited in number, the rules needed in practice cannot be enumerated exhaustively. Existing text classification algorithms therefore cannot classify some texts, and their coverage is incomplete.
Summary of the invention
In view of the above problems, this application provides a text classification method and device that overcome, or at least partially solve, those problems and improve the coverage of text classification.
A text classification method provided by an embodiment of this application comprises:
obtaining a text to be classified and a database constructed in advance, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories;
evaluating the similarity between the text to be classified and each reference text in the database, to obtain a similarity metric value between the text to be classified and each reference text;
obtaining, according to the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category;
determining the category corresponding to the maximum evaluation value as the category of the text to be classified.
Optionally, evaluating the similarity between the text to be classified and each reference text in the database, to obtain the similarity metric value between the text to be classified and each reference text, specifically comprises:
performing first word segmentation processing on the text to be classified, to obtain a word segmentation result of the text to be classified;
obtaining, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database;
obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text.
Optionally, obtaining the evaluation value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts specifically comprises:
screening out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
obtaining the evaluation value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened-out reference texts in that category.
Optionally, before obtaining the text to be classified and the database constructed in advance, the method further comprises:
crawling, by a crawler, the reference texts published by at least one pre-selected website together with the categories of those reference texts, to obtain an initial database;
performing second word segmentation processing on the reference texts in the initial database, to obtain a word segmentation result for each reference text, where the word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing;
performing inverted index processing on the reference texts in the initial database according to their word segmentation results;
obtaining the database according to the result of the inverted index processing and the crawled categories of the reference texts.
Optionally, obtaining the similarity metric value between the text to be classified and the reference text according to the contribution degrees of the words in the word segmentation result to the same reference text specifically comprises:
averaging the contribution degrees of the words in the word segmentation result of the text to be classified to a target reference text, to obtain a contribution degree mean, where the database includes the target reference text;
determining the contribution degree mean as the similarity metric value between the text to be classified and the target reference text.
A text classification device provided by an embodiment of this application comprises: a first obtaining module, a second obtaining module, an evaluation module and a determining module;
the first obtaining module is configured to obtain a text to be classified and a database constructed in advance, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories;
the evaluation module is configured to evaluate the similarity between the text to be classified and each reference text in the database, to obtain a similarity metric value between the text to be classified and each reference text;
the second obtaining module is configured to obtain, according to the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category;
the determining module is configured to determine the category corresponding to the maximum evaluation value as the category of the text to be classified.
Optionally, the device further comprises a word segmentation module;
the word segmentation module is configured to perform first word segmentation processing on the text to be classified, to obtain a word segmentation result of the text to be classified;
the evaluation module specifically comprises a first obtaining submodule and a second obtaining submodule;
the first obtaining submodule is configured to obtain, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database;
the second obtaining submodule is configured to obtain the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text.
Optionally, the second obtaining module specifically comprises a screening submodule and a third obtaining submodule;
the screening submodule is configured to screen out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
the third obtaining submodule is configured to obtain the evaluation value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened-out reference texts in that category.
An embodiment of this application also provides a storage medium on which a program is stored. When executed by a processor, the program implements the text classification method described in the above embodiments.
An embodiment of this application also provides a processor for running a program, where the program, when run, executes the text classification method described in the above embodiments.
With the above technical solution, the text classification method provided by this application uses the reference texts of known category in the pre-built database as the basis for classification. The text to be classified is matched against the reference texts by similarity retrieval, and its similarity to each reference text is evaluated. Then, from the known categories of the reference texts and the similarity of the text to be classified to the reference texts in each category, an evaluation value of the text to be classified for each category is determined, that is, the overall similarity between the text to be classified and the reference texts belonging to that category. The category with the maximum evaluation value is determined as the category of the text to be classified. Because the categories of the reference texts can be enumerated exhaustively, and because no classification rules need to be set manually and rule completeness need not be considered, classification can be carried out from the similarity between the text to be classified and reference texts of known category. Any text to be classified can therefore be classified, which improves the coverage of text classification.
The above is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of this application are more apparent, specific embodiments of this application are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of this application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow diagram of a text classification method provided by an embodiment of this application;
Fig. 2 shows a flow diagram of evaluating the similarity between the text to be classified and the reference texts in the database in an embodiment of this application;
Fig. 3 shows a structural diagram of a text classification device provided by an embodiment of this application.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Existing text classification methods typically rely on manually predefined condition rules. For example, when regular expressions or decision tree algorithms are used, the classification rules of each category are determined first. The text to be classified is then matched against the rules of each category one by one, in order, until a rule matches, and the category of that rule is taken as the category of the text. However, manually compiled rules are limited, and people do not express the same content in exactly the same way, so existing methods cannot enumerate every rule needed in practice. The coverage of the classification algorithm is therefore incomplete, and some texts cannot be classified.
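The rule-matching approach described above, and its coverage gap, can be illustrated with a minimal sketch. The rule list, patterns and category names below are hypothetical and chosen only for illustration; they do not come from the source.

```python
import re

# Hypothetical hand-written rules, checked in order. Real systems would have
# many more, but the set is always finite.
RULES = [
    ("contract", re.compile(r"\bbreach of contract\b", re.IGNORECASE)),
    ("criminal", re.compile(r"\b(theft|robbery)\b", re.IGNORECASE)),
]

def classify_by_rules(text):
    """Return the category of the first rule that matches, or None if none does."""
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return None  # the text falls outside every rule: the coverage gap
```

A text that paraphrases the same idea without using the expected wording (for example "The buyer failed to honour the agreement.") matches no rule and is left unclassified, which is the problem the similarity-based method sets out to avoid.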
To this end, an embodiment of this application provides a text classification method that takes multiple reference texts of known category, obtained in advance, as its basis. The text to be classified is matched against these reference texts by similarity retrieval, and the known category containing the most reference texts that are highly similar to the text to be classified is taken as its category. Because the categories of texts can be enumerated exhaustively, and classification is based on the result of similarity retrieval against the reference texts, no text to be classified is left unclassifiable, which improves the coverage of the text classification algorithm.
Based on this idea, and to make the above objects, features and advantages of this application more apparent, specific embodiments of this application are described in detail below with reference to the drawings.
Referring to Fig. 1, which is a flow diagram of a text classification method provided by an embodiment of this application.
The text classification method provided in this embodiment includes the following steps S101-S104.
S101: obtain a text to be classified and a database constructed in advance.
The database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories.
It can be understood that the reference texts in the database and their categories provide the basis for classifying the text to be classified. In a specific implementation, to make the database easier to manage and maintain and to improve classification efficiency, texts to be classified from different fields (such as the judicial field or the medical field) can correspond to different databases. For example, when classifying judicial texts, the reference texts in the database can be judicial documents (such as judgment documents and legal treatises) issued by authoritative institutions, and the category of a reference text is the category (such as civil, criminal or administrative) that the issuing institution assigned to it.
In actual operation, when the database corresponding to the text to be classified is built in advance, a crawler can collect documents published online by authoritative institutions in the relevant field. Reference texts for the judicial field, for example, can be crawled from judgment document websites, authoritative intellectual property websites and professional judicial websites. These authoritative websites generally classify the documents they publish. In this embodiment, the database can therefore be built directly from the documents published on authoritative websites and the categories those websites assign to them. This not only allows the text categories in the field to be enumerated exhaustively, but also helps ensure the accuracy of the classification.
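As a minimal sketch of the data that step S101 relies on, the snippet below stores each crawled document with its publisher-assigned category and checks the "at least two categories" requirement. The `ReferenceText` record and the `build_initial_database` helper are hypothetical names introduced here for illustration; the crawling itself is assumed to have happened elsewhere.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceText:
    doc_id: int    # position of the document in the database (assumed identifier)
    text: str      # document body as crawled
    category: str  # category assigned by the publishing website

def build_initial_database(crawled):
    """Build the initial database from an iterable of (text, category) pairs."""
    database = [ReferenceText(i, text, category)
                for i, (text, category) in enumerate(crawled)]
    # The method requires reference texts from at least two categories.
    if len({ref.category for ref in database}) < 2:
        raise ValueError("the database must cover at least two categories")
    return database
```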
S102: evaluate the similarity between the text to be classified and each reference text in the database, to obtain a similarity metric value between the text to be classified and each reference text.
In this embodiment, the similarity between the two can be evaluated from the contribution degree of specific words of the text to be classified to the reference text, that is, from whether a word in the text to be classified occurs in the reference text and how much that word contributes to the reference text. It can be understood that the higher a word's contribution degree to a reference text, the more important the word is in that reference text and the more likely it is to be one of its keywords. Therefore, the more keywords of a reference text the text to be classified contains, the higher the similarity between the two.
In a possible implementation of this embodiment, as shown in Fig. 2, step S102 can specifically include the following steps S1021-S1023.
S1021: perform first word segmentation processing on the text to be classified, to obtain a word segmentation result of the text to be classified.
It can be understood that those skilled in the art can use any word segmentation algorithm for the first word segmentation processing; the embodiments of this application place no restriction on this.
S1022: based on the term frequency-inverse document frequency (TF-IDF) algorithm, obtain the contribution degree of each word in the word segmentation result to each reference text in the database.
TF-IDF is a statistical method for assessing how important a word (in this embodiment, each word in the segmentation result of the text to be classified) is to one document in a document set or corpus (in this embodiment, the database). The importance of a word increases in proportion to the number of times it occurs in the document, but decreases with the frequency at which it occurs across the corpus.
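The TF-IDF idea described above can be sketched as follows. This is a textbook variant (term frequency as a relative count, inverse document frequency as the logarithm of corpus size over document frequency) and is an assumption for illustration; the source does not state which exact weighting it uses at this point.

```python
import math
from collections import Counter

def tf_idf(word, doc_tokens, corpus):
    """Importance of `word` to one document within a corpus.

    doc_tokens: token list of the document; corpus: list of token lists.
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)    # grows with occurrences in the document
    df = sum(1 for tokens in corpus if word in tokens)  # number of documents containing the word
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)                    # shrinks as the word gets more common
    return tf * idf
```

With this weighting, a word that appears in every document of the corpus gets an idf of zero, so it contributes nothing to similarity no matter how often it occurs in a single document.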
Therefore, to make the subsequent TF-IDF computation easier, the database in this embodiment can be built through the following steps.
First, a crawler collects the reference texts published by at least one pre-selected website, together with the categories of those reference texts, to obtain an initial database.
Crawling the reference texts and their categories is similar to what is described above and is not repeated here.
Second, second word segmentation processing is performed on the reference texts in the initial database, to obtain a word segmentation result for each reference text.
It should be noted that, to keep the TF-IDF computation accurate, the word segmentation rule of the first word segmentation processing is the same as that of the second. That is, the same segmentation algorithm and segmentation rules are applied to both the reference texts in the initial database and the text to be classified.
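One simple way to guarantee that both segmentation passes follow the same rule is to route them through a single function. The sketch below assumes a placeholder `segment` function based on a regular expression; a real system for Chinese text would substitute a proper word segmenter, which the source leaves open.

```python
import re

def segment(text):
    """Placeholder segmenter (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def segment_reference_texts(texts):
    """Second word segmentation: applied to every reference text at build time."""
    return [segment(text) for text in texts]

def segment_query(text):
    """First word segmentation: applied to the text to be classified."""
    return segment(text)
```

Because both wrappers call the same `segment`, a word is split identically at indexing time and at query time, so lookups against the index cannot miss because of a tokenization mismatch.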
Third, inverted index processing is performed on the reference texts in the initial database according to their word segmentation results.
An inverted index arises from the practical need to look up records by the value of an attribute. Each entry in an inverted index table contains an attribute value and the addresses of the records that have that value. Because the attribute value determines the records, rather than the records determining the attribute value, the structure is called an inverted index.
In this embodiment, performing inverted index processing on the reference texts in the initial database means recording, for each word (that is, each query term) in the segmentation results of the reference texts, the reference texts in which that word occurs. For example, inverted index processing for the word "contract" collects the reference texts in the initial database that contain "contract" and associates them with that word.
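A minimal sketch of the inverted index processing described above: each word maps to the reference texts that contain it. Storing the occurrence count alongside each document identifier is an assumption made here because later scoring needs term frequencies; the function and parameter names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(segmented_refs):
    """Map each word to {doc_id: number of occurrences in that reference text}.

    segmented_refs: dict of doc_id -> list of tokens.
    """
    index = defaultdict(dict)
    for doc_id, tokens in segmented_refs.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)
```

Looking up `index.get("contract", {})` then yields every reference text containing "contract" without scanning the whole database.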
Finally, the database is obtained from the result of the inverted index processing and the crawled categories of the reference texts. That is, in this embodiment the final database includes the inverted index of the reference texts' word segmentation results and the category of each reference text.
In practical applications, the contribution degree of each word in the segmentation result of the text to be classified to each reference text in the database can be obtained with the following formula (written here from the term definitions that follow it):
$$\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot \sum_{t\ \mathrm{in}\ q}\Big(\mathrm{tf}(t\ \mathrm{in}\ d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{Boost}()\cdot \mathrm{norm}(t,d)\Big)$$
In the formula, t is a query term that carries field information. That is, the same word appearing in the title and in the body of an article counts as different query terms. In a possible implementation of this embodiment, only query terms appearing in the text body are counted toward the contribution degree;
q is the query statement, which includes at least one query term t. In this embodiment, q is the word in the segmentation result of the text to be classified whose contribution degree is currently being calculated;
d is a reference text in the database;
t in q ranges over the query terms (with field information) contained in q, and the summation adds up tf(t in d) × idf(t)^2 × Boost() × norm(t, d) over each query term t in the query statement q;
tf(t in d) is the term frequency factor: the more often query term t occurs in reference text d, the higher that text scores;
idf(t) reflects how often query term t occurs in the inverted index: a word that occurs frequently in the inverted index has a lower idf, and a word that occurs rarely has a higher idf;
Boost() is a weight value;
norm(t, d) and queryNorm(q) are normalization factors;
coord(q, d) measures how many of the query conditions of query statement q are satisfied in reference text d. The more words satisfying the query conditions of q a text contains, the higher its coord. In this embodiment, the more words of reference text d that take part in the inverted index lookup, the higher the coord.
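The factors defined above can be combined in code as in the following sketch. The source names the factors but not their concrete definitions, so the definitions used here (square-root term frequency, `1 + ln(N / (df + 1))` for idf, inverse square root of document length for norm, matched-term ratio for coord) are assumptions modeled on the classic TF-IDF similarity of Apache Lucene. The `score` signature, the index layout (word to `{doc_id: count}`) and `doc_lengths` are hypothetical.

```python
import math

def idf(term, index, n_docs):
    """Assumed idf: rarer terms in the inverted index get a higher value."""
    doc_freq = len(index.get(term, {}))
    return 1.0 + math.log(n_docs / (doc_freq + 1))

def score(query_terms, doc_id, index, doc_lengths, boost=1.0):
    """Contribution degree of the query q (a list of terms) to reference text d."""
    n_docs = len(doc_lengths)
    matched = [t for t in query_terms if doc_id in index.get(t, {})]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)             # share of query terms found in d
    sum_sq = sum((idf(t, index, n_docs) * boost) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)                # makes scores comparable across queries
    norm = 1.0 / math.sqrt(doc_lengths[doc_id])         # penalizes long reference texts
    total = 0.0
    for t in matched:
        tf = math.sqrt(index[t][doc_id])                # dampened occurrence count
        total += tf * idf(t, index, n_docs) ** 2 * boost * norm
    return coord * query_norm * total
```

In the embodiment, q is a single word of the text to be classified, so `score([word], doc_id, ...)` would be called once per word and per reference text; the list form is kept only so that coord and queryNorm remain meaningful.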
S1023: according to the contribution degrees of the words in the segmentation result of the text to be classified to the same reference text, obtain the similarity metric value between the text to be classified and that reference text.
In this embodiment, after the contribution degree of each word in the segmentation result of the text to be classified to a given reference text has been obtained, these contribution degrees can be averaged. The resulting value is the similarity metric value between the text to be classified and that reference text.
That is, the contribution degrees of the words in the segmentation result of the text to be classified to a target reference text in the database are averaged to obtain a contribution degree mean, and the contribution degree mean is determined as the similarity metric value between the text to be classified and the target reference text.
It can be understood that those skilled in the art can also combine the contribution degrees of the words to the same reference text in ways other than averaging (such as summing). The embodiments of this application place no restriction on this, and the alternatives are not listed here.
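The averaging in step S1023 is small enough to state directly. In the sketch below, the function names and the input layout (a list of per-word contribution degrees for each reference text) are assumptions for illustration.

```python
def similarity_metric(contributions):
    """Mean of the per-word contribution degrees to one reference text."""
    if not contributions:
        return 0.0  # assumed convention: no words means no measurable similarity
    return sum(contributions) / len(contributions)

def similarity_to_all(contributions_by_doc):
    """Apply the mean to every reference text: doc_id -> similarity metric value."""
    return {doc_id: similarity_metric(values)
            for doc_id, values in contributions_by_doc.items()}
```

Swapping `similarity_metric` for a plain `sum` would give the summation variant mentioned above; averaging has the side effect of not favoring longer texts to be classified.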
S103: according to the similarity metric values and the categories of the reference texts, obtain the evaluation value of the text to be classified for each category.
In this embodiment, for each known category of reference texts in the database, the overall similarity of the text to be classified to the reference texts in that category is computed. This is the evaluation value of the text to be classified for the category. The larger the evaluation value, the more similar the text to be classified is to the reference texts in that category.
In a possible implementation of this embodiment, to remove noise from the data, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold can first be screened out of the database. The evaluation value of the text to be classified for each category is then obtained from the similarity metric values between the text to be classified and the screened-out reference texts in that category.
In actual operation, those skilled in the art can set the preset threshold according to the actual situation; specific values are not listed here.
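Step S103 with threshold screening can be sketched as below. The source does not fix how the screened similarities are combined per category, so summing them is an assumption (counting the surviving reference texts would be another reading); the function name, the input dictionaries and the threshold value in the usage note are hypothetical.

```python
def evaluation_values(similarities, categories, threshold):
    """Evaluation value of the text to be classified for each category.

    similarities: doc_id -> similarity metric value
    categories:   doc_id -> category of that reference text
    """
    totals = {category: 0.0 for category in set(categories.values())}
    for doc_id, similarity in similarities.items():
        if similarity > threshold:                       # screening step: drop noisy matches
            totals[categories[doc_id]] += similarity     # assumed aggregation: sum
    return totals
```

For example, with similarities `{0: 0.75, 1: 0.25, 2: 0.5, 3: 0.125}`, categories civil, civil, criminal and administrative, and a threshold of 0.2, the civil category accumulates 1.0, criminal 0.5, and administrative stays at 0.0 because its only reference text is screened out.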
S104: determine the category corresponding to the maximum evaluation value as the category of the text to be classified.
The larger the evaluation value of the text to be classified for a category, the more similar the text is to the reference texts in that category. Therefore, among the evaluation values of the text to be classified for all categories in the database, the category corresponding to the maximum value is the category of the text to be classified.
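Step S104 is a maximum over the per-category evaluation values. The sketch below adds a deterministic tie-break (alphabetical order of category names), which is an assumption; the source does not say what happens when two categories share the maximum.

```python
def pick_category(evaluations):
    """Return the category with the largest evaluation value."""
    if not evaluations:
        raise ValueError("no categories to choose from")
    # Iterating over sorted names makes ties resolve to the alphabetically first category.
    return max(sorted(evaluations), key=evaluations.get)
```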
It in the present embodiment, will be wait divide using the referenced text of known classification in the database constructed in advance as classification foundation Class text and referenced text carry out similarity retrieval matching, evaluate the similarity of text to be sorted Yu each referenced text.Root again According to the known classification and text to be sorted of referenced text to the similarity of referenced text in each classification, text to be sorted is determined This assessed value to each classification, i.e., text to be sorted and the comprehensive similarity for belonging to referenced text under the classification, by assessed value The corresponding classification of maximum value is determined as the classification of the text to be sorted.Since the classification of referenced text is exhaustible, and When treating classifying text and being classified, it is not necessary that the rule of classification is manually set, without the integrality for considering rule, according to point The classification for treating classifying text can be realized in similarity between class text and the referenced text of known classification, can be to any one A text to be sorted is classified, and the coverage rate of text classification is improved.
The file classification method provided based on the above embodiment, the embodiment of the present application also provides a kind of text classification dresses It sets.
Referring to Fig. 3, which is a kind of structural schematic diagram of text processing apparatus provided by the embodiments of the present application.
A kind of document sorting apparatus provided in this embodiment, comprising: the first acquisition acquisition of module 100, second module 200, Evaluation module 300 and determining module 400.
The first acquisition module 100 is configured to obtain a pre-built database and a text to be classified. The database includes multiple reference texts and the category of each reference text, and the reference texts in the database are divided into at least two categories.
The evaluation module 300 is configured to evaluate the similarity between the text to be classified and each reference text in the database, obtaining a similarity metric value between the text to be classified and each reference text.
The second acquisition module 200 is configured to obtain, according to the similarity metric values and the categories of the reference texts, an assessment value of the text to be classified for each category.
The determining module 400 is configured to determine the category corresponding to the maximum assessment value as the category of the text to be classified.
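The way the four modules hand data to one another can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names `classify` and `word_overlap`, the toy database, the word-overlap similarity, and the choice to sum similarity values per category are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed data model: the database is a list of (reference_text, category) pairs.
Database = List[Tuple[str, str]]


def word_overlap(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of whitespace-separated words."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def classify(text: str, database: Database,
             similarity: Callable[[str, str], float] = word_overlap) -> str:
    # Evaluation module: one similarity metric value per reference text.
    scored = [(similarity(text, ref), category) for ref, category in database]
    # Second acquisition module: assessment value per category
    # (summing is an assumed aggregation).
    assessment: Dict[str, float] = defaultdict(float)
    for value, category in scored:
        assessment[category] += value
    # Determining module: category with the maximum assessment value.
    return max(assessment, key=assessment.get)


# First acquisition module: obtain the database and the text to be classified.
database = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
]
print(classify("football team wins", database))
```

The similarity function is passed in as a parameter so that the placeholder can be swapped for a TF-IDF based measure without touching the rest of the flow.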
In a possible implementation of this embodiment, the apparatus further comprises a word segmentation module.
The word segmentation module is configured to perform first word segmentation processing on the text to be classified, obtaining a word segmentation result of the text to be classified.
The evaluation module 300 specifically comprises a first acquisition submodule and a second acquisition submodule.
The first acquisition submodule is configured to obtain, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database.
The second acquisition submodule is configured to obtain the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text.
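To make the per-word contribution concrete, the following sketch computes a TF-IDF weight of each word of the segmented text for each reference text. The patent does not fix a TF-IDF variant, so the smoothed IDF used here, the function name `tfidf_contributions`, and the toy reference texts are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_contributions(query_words: List[str],
                        references: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Return contribution[ref_id][word] = tf(word, ref) * idf(word)."""
    n_refs = len(references)
    # Document frequency: number of reference texts containing each word.
    doc_freq = Counter(w for words in references.values() for w in set(words))
    contributions: Dict[str, Dict[str, float]] = {}
    for ref_id, words in references.items():
        counts = Counter(words)
        contributions[ref_id] = {}
        for w in query_words:
            # Term frequency of the word inside this reference text.
            tf = counts[w] / len(words) if words else 0.0
            # Smoothed IDF (assumed variant): finite even for unseen words.
            idf = math.log((1 + n_refs) / (1 + doc_freq[w])) + 1.0
            contributions[ref_id][w] = tf * idf
    return contributions


refs = {
    "r1": ["bank", "raises", "interest", "rates"],
    "r2": ["team", "wins", "football", "match"],
}
contrib = tfidf_contributions(["football", "match", "tickets"], refs)
print(contrib["r2"]["football"] > contrib["r1"]["football"])
```

A word that does not occur in a reference text has term frequency 0 and therefore contributes nothing to that reference text, whatever its IDF.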
In a possible implementation of this embodiment, the second acquisition module 200 specifically comprises a screening submodule and a third acquisition submodule.
The screening submodule is configured to screen out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold.
The third acquisition submodule is configured to obtain the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened reference texts in that category.
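The screening step and the per-category assessment can be illustrated as follows. The text does not state how the similarity metric values of the screened reference texts are combined, so the sum used below is an assumption, as are the function name `category_assessment`, the threshold of 0.2, and the sample scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def category_assessment(scored_refs: List[Tuple[str, float]],
                        threshold: float) -> Dict[str, float]:
    """scored_refs holds one (category, similarity metric value) pair per reference text."""
    assessment: Dict[str, float] = defaultdict(float)
    for category, value in scored_refs:
        # Screening submodule: keep only values above the preset threshold.
        if value > threshold:
            # Third acquisition submodule: aggregate per category (assumed: sum).
            assessment[category] += value
    return dict(assessment)


scored = [
    ("finance", 0.05), ("finance", 0.10), ("finance", 0.35),
    ("sports", 0.40), ("sports", 0.30),
]
values = category_assessment(scored, threshold=0.2)
best = max(values, key=values.get)
print(best)
```

Screening before aggregation keeps a large number of weakly similar reference texts from outweighing a few strongly similar ones.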
In a possible implementation of this embodiment, the apparatus further comprises a database construction module.
The database construction module specifically comprises a crawling submodule, a processing submodule, and a fourth acquisition submodule.
The crawling submodule is configured to crawl, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts, obtaining an initial database.
The word segmentation module is further configured to perform second word segmentation processing on the reference texts in the initial database, obtaining a word segmentation result for each reference text. The word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing.
The processing submodule is configured to perform inverted index processing on the reference texts in the initial database according to the word segmentation results of those reference texts.
The fourth acquisition submodule is configured to obtain the database according to the result of the inverted index processing and the crawled categories of the reference texts.
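The database construction steps can be sketched as below, starting from already crawled (reference text, category) pairs. The crawler itself is omitted, and the `tokenize` function is a whitespace-based stand-in for a real word segmentation rule; both simplifications, together with the names `build_database`, `index`, and `categories`, are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Stand-in segmentation rule; the same rule must be applied to texts to be classified."""
    return text.lower().split()


def build_database(crawled: List[Tuple[str, str]]) -> Tuple[Dict[str, Set[int]], Dict[int, str]]:
    """crawled holds (reference_text, category) pairs, e.g. collected by a crawler."""
    inverted_index: Dict[str, Set[int]] = defaultdict(set)
    categories: Dict[int, str] = {}
    for ref_id, (text, category) in enumerate(crawled):
        categories[ref_id] = category
        for word in tokenize(text):
            # Inverted index: word -> ids of the reference texts that contain it.
            inverted_index[word].add(ref_id)
    return dict(inverted_index), categories


index, categories = build_database([
    ("Bank raises interest rates", "finance"),
    ("Team wins football match", "sports"),
    ("Football club signs bank sponsor", "sports"),
])
print(sorted(index["football"]))
```

With the inverted index in place, only the reference texts that share at least one word with the text to be classified need to be scored, and using one segmentation rule on both sides ensures the words actually match.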
In a possible implementation of this embodiment, the second acquisition submodule specifically comprises a calculation submodule and a determining submodule.
The calculation submodule is configured to average the contributions of the words in the word segmentation result of the text to be classified to a target reference text, obtaining a contribution mean. The database includes the target reference text.
The determining submodule is configured to determine the contribution mean as the similarity metric value between the text to be classified and the target reference text.
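The averaging step is small enough to show directly. In this sketch the per-word contributions to one target reference text are assumed to be available as a dictionary, and a word missing from the dictionary is assumed to contribute 0; the function name and the sample numbers are hypothetical.

```python
from typing import Dict, List


def similarity_from_contributions(query_words: List[str],
                                  word_contribution: Dict[str, float]) -> float:
    """Average the contributions of the query words to one target reference text."""
    if not query_words:
        return 0.0
    # Words absent from the target reference text are assumed to contribute 0.
    total = sum(word_contribution.get(w, 0.0) for w in query_words)
    return total / len(query_words)


# Hypothetical contributions of three segmented words to one reference text.
contribution_to_target = {"football": 0.5, "match": 0.25}
value = similarity_from_contributions(["football", "match", "tickets"],
                                      contribution_to_target)
print(value)
```

Dividing by the number of words in the segmentation result keeps the similarity metric value comparable between short and long texts to be classified.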
In this embodiment, as in the method embodiment, the apparatus uses reference texts of known categories in the pre-built database as the basis for classification: it evaluates the similarity between the text to be classified and each reference text, derives an assessment value for each category from those similarities, and determines the category with the maximum assessment value as the category of the text to be classified. No classification rules need to be set manually and their completeness need not be considered, so any text to be classified can be classified, which improves the coverage of text classification.
Based on the text processing method and apparatus provided in the above embodiments, an embodiment of the present application further provides another text processing apparatus.
The text processing apparatus provided in this embodiment comprises a processor and a memory. The first acquisition module, second acquisition module, evaluation module, and determining module of the above embodiment are stored in the memory as program modules, and the processor executes these program modules stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the text to be classified is classified by adjusting kernel parameters.
The memory may include forms such as non-permanent (volatile) memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
In this embodiment, the processor-based apparatus likewise uses reference texts of known categories in the pre-built database as the basis for classification: it evaluates the similarity between the text to be classified and each reference text, derives an assessment value for each category from those similarities, and determines the category with the maximum assessment value as the category of the text to be classified. No classification rules need to be set manually and their completeness need not be considered, so any text to be classified can be classified, which improves the coverage of text classification.
Based on the text processing method and apparatus provided in the above embodiments, an embodiment of the present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:
Obtaining a text to be classified and a pre-built database, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database are divided into at least two categories; evaluating the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text; obtaining an assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts; and determining the category corresponding to the maximum assessment value as the category of the text to be classified.
Evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text may specifically include: performing first word segmentation processing on the text to be classified to obtain a word segmentation result of the text to be classified; obtaining, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database; and obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text.
Obtaining the assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts may specifically include: screening out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold; and obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened reference texts in that category.
Before the text to be classified and the corresponding pre-built database are obtained, the steps may further include: crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts to obtain an initial database; performing second word segmentation processing on the reference texts in the initial database to obtain a word segmentation result for each reference text, where the word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing; performing inverted index processing on the reference texts in the initial database according to the word segmentation results of those reference texts; and obtaining the database according to the result of the inverted index processing and the crawled categories of the reference texts.
Obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text may specifically include: averaging the contributions of the words in the word segmentation result of the text to be classified to a target reference text to obtain a contribution mean, where the database includes the target reference text; and determining the contribution mean as the similarity metric value between the text to be classified and the target reference text.
An embodiment of the present invention provides a storage medium on which a program is stored. When executed by a processor, the program implements the text classification method described in the above embodiments.
An embodiment of the present invention provides a processor configured to run a program, where the program, when run, performs the text classification method described in the above embodiments.
An embodiment of the present invention provides a device comprising a processor, a memory, and a program that is stored in the memory and can run on the processor. When executing the program, the processor performs the following steps:
Obtaining a text to be classified and a pre-built database, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database are divided into at least two categories; evaluating the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text; obtaining an assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts; and determining the category corresponding to the maximum assessment value as the category of the text to be classified.
Evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text may specifically include: performing first word segmentation processing on the text to be classified to obtain a word segmentation result of the text to be classified; obtaining, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database; and obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text.
Obtaining the assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts may specifically include: screening out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold; and obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened reference texts in that category.
Before the text to be classified and the corresponding pre-built database are obtained, the steps may further include: crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts to obtain an initial database; performing second word segmentation processing on the reference texts in the initial database to obtain a word segmentation result for each reference text, where the word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing; performing inverted index processing on the reference texts in the initial database according to the word segmentation results of those reference texts; and obtaining the database according to the result of the inverted index processing and the crawled categories of the reference texts.
Obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text may specifically include: averaging the contributions of the words in the word segmentation result of the text to be classified to a target reference text to obtain a contribution mean, where the database includes the target reference text; and determining the contribution mean as the similarity metric value between the text to be classified and the target reference text.
The device herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include forms such as non-permanent (volatile) memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and changes to the present application are possible for those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A text classification method, characterized in that the method comprises:
obtaining a text to be classified and a pre-built database, wherein the database includes multiple reference texts and the category of each reference text, and the reference texts in the database are divided into at least two categories;
evaluating the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text;
obtaining an assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts;
determining the category corresponding to the maximum assessment value as the category of the text to be classified.
2. The text classification method according to claim 1, characterized in that evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text specifically comprises:
performing first word segmentation processing on the text to be classified to obtain a word segmentation result of the text to be classified;
obtaining, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database;
obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text.
3. The text classification method according to claim 1 or 2, characterized in that obtaining the assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts specifically comprises:
screening out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened reference texts in that category.
4. The text classification method according to claim 2, characterized in that, before the text to be classified and the corresponding pre-built database are obtained, the method further comprises:
crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts to obtain an initial database;
performing second word segmentation processing on the reference texts in the initial database to obtain a word segmentation result for each reference text, wherein the word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing;
performing inverted index processing on the reference texts in the initial database according to the word segmentation results of the reference texts in the initial database;
obtaining the database according to the result of the inverted index processing and the crawled categories of the reference texts.
5. The text classification method according to claim 2, characterized in that obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text specifically comprises:
averaging the contributions of the words in the word segmentation result of the text to be classified to a target reference text to obtain a contribution mean, wherein the database includes the target reference text;
determining the contribution mean as the similarity metric value between the text to be classified and the target reference text.
6. A text classification apparatus, characterized in that the apparatus comprises: a first acquisition module, a second acquisition module, an evaluation module, and a determining module;
the first acquisition module is configured to obtain a text to be classified and a pre-built database, wherein the database includes multiple reference texts and the category of each reference text, and the reference texts in the database are divided into at least two categories;
the evaluation module is configured to evaluate the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text;
the second acquisition module is configured to obtain an assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts;
the determining module is configured to determine the category corresponding to the maximum assessment value as the category of the text to be classified.
7. The text classification apparatus according to claim 6, characterized in that the apparatus further comprises a word segmentation module;
the word segmentation module is configured to perform first word segmentation processing on the text to be classified to obtain a word segmentation result of the text to be classified;
the evaluation module specifically comprises a first acquisition submodule and a second acquisition submodule;
the first acquisition submodule is configured to obtain, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database;
the second acquisition submodule is configured to obtain the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that same reference text.
8. The text classification apparatus according to claim 6 or 7, characterized in that the second acquisition module specifically comprises a screening submodule and a third acquisition submodule;
the screening submodule is configured to screen out, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
the third acquisition submodule is configured to obtain the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the screened reference texts in that category.
9. A storage medium, characterized in that a program is stored thereon, and when executed by a processor the program implements the text classification method according to any one of claims 1-5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the text classification method according to any one of claims 1-5.
CN201710910888.6A 2017-09-29 2017-09-29 Text classification method and device Active CN110019785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710910888.6A CN110019785B (en) 2017-09-29 2017-09-29 Text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710910888.6A CN110019785B (en) 2017-09-29 2017-09-29 Text classification method and device

Publications (2)

Publication Number Publication Date
CN110019785A true CN110019785A (en) 2019-07-16
CN110019785B CN110019785B (en) 2022-03-01

Family

ID=67186452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710910888.6A Active CN110019785B (en) 2017-09-29 2017-09-29 Text classification method and device

Country Status (1)

Country Link
CN (1) CN110019785B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030172063A1 (en) * 2002-03-07 2003-09-11 Koninklijke Philips Electronics N.V. Method and apparatus for providing search results in response to an information search request
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model
US20140278353A1 (en) * 2013-03-13 2014-09-18 Crimson Hexagon, Inc. Systems and Methods for Language Classification
CN103714118A (en) * 2013-11-22 2014-04-09 浙江大学 Book cross-reading method
CN105718598A (en) * 2016-03-07 2016-06-29 天津大学 AT based time model construction method and network emergency early warning method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948370A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Data classification method and device and computer equipment
CN110990577A (en) * 2019-12-25 2020-04-10 北京亚信数据有限公司 Text classification method and device
CN113111173A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Regular expression-based alarm receiving warning condition category determination method and device
CN113220840A (en) * 2021-05-17 2021-08-06 北京百度网讯科技有限公司 Text processing method, device, equipment and storage medium
CN113220840B (en) * 2021-05-17 2023-08-01 北京百度网讯科技有限公司 Text processing method, device, equipment and storage medium
CN113254655A (en) * 2021-07-05 2021-08-13 北京邮电大学 Text classification method, electronic device and computer storage medium

Also Published As

Publication number Publication date
CN110019785B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN110019785A (en) A kind of file classification method and device
CN110019668A (en) A kind of text searching method and device
US20140108190A1 (en) Recommending product information
US11048732B2 (en) Systems and methods for records tagging based on a specific area or region of a record
Hung et al. Customer segmentation using hierarchical agglomerative clustering
US20110029476A1 (en) Indicating relationships among text documents including a patent based on characteristics of the text documents
CN104281585A (en) Object ordering method and device
CN110059991B (en) Warehouse item selection method, system, electronic device and computer readable medium
CN106934410A (en) Data classification method and system
CN110162778A (en) Method and device for generating text summaries
US8977622B1 (en) Evaluation of nodes
US11650999B2 (en) Database search enhancement and interactive user interface therefor
CN110019670A (en) A kind of text searching method and device
Sivaranjani et al. Hybrid Particle Swarm Optimization-Firefly algorithm (HPSOFF) for combinatorial optimization of non-slicing VLSI floorplanning
CN108228612A (en) A kind of method and device for extracting network event keyword and mood tendency
CN107784027A (en) Method and device for prompting search keywords for judgment documents
CN108595460A (en) Multichannel evaluating method and system, the computer program of keyword Automatic
Sharma et al. Intelligent data analysis using optimized support vector machine based data mining approach for tourism industry
CN109218211A (en) Method, device and equipment for adjusting thresholds in a data flow control strategy
CN114490786A (en) Data sorting method and device
Angelini et al. The complex dynamics of products and its asymptotic properties
CN109359346A (en) A kind of heat load prediction method, apparatus, readable medium and electronic equipment
CN109117434A (en) Judgement document's search method, device, storage medium and processor
CN109426905A (en) Method and device for determining sentencing deviation in criminal judgment documents
CN105786929B (en) A kind of information monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 A, 8th Floor, Cuigong Hotel, No. 76 Zhichun Road, Shuangyushu Area, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant