CN110019785A - Text classification method and device
- Publication number: CN110019785A
- Application number: CN201710910888.6A
- Authority: CN (China)
- Prior art keywords: text, to be classified, classification, reference, database
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a text classification method and device. The method comprises: obtaining a text to be classified and a database containing reference texts of multiple different categories; evaluating the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text; obtaining, from the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category; and determining the category corresponding to the maximum evaluation value as the category of the text to be classified. Because the categories of the reference texts in the database can be exhaustively enumerated, the text to be classified can be classified from its similarity to reference texts of known category, so any text to be classified can be classified and the coverage of text classification is improved.
Description
Technical field
This application relates to the field of big data technology, and in particular to a text classification method and device.
Background art
Text classification automatically labels texts with categories according to a given classification system or standard, so that users can both browse texts conveniently and look up the texts they need by category.
Current text classification generally relies on strict rule matching, implemented with mechanisms such as regular expressions or decision tree algorithms. In a typical implementation, a set of manually defined conditional rules is configured before the algorithm runs; at run time the rules are matched in order, and the category of the text is determined by the category of the rule that the text matches.
Because different authors inevitably express the same content in different ways, and manually defined rules are limited in number, the rules needed in practice cannot be exhaustively enumerated. Existing text classification algorithms therefore fail to classify some texts, and their coverage is incomplete.
Summary of the invention
In view of the above problems, this application is proposed to provide a text classification method and device that overcome, or at least partially solve, the above problems and improve the coverage of text classification.
An embodiment of this application provides a text classification method, comprising:
obtaining a text to be classified and a pre-built database, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories;
evaluating the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text;
obtaining, according to the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category; and
determining the category corresponding to the maximum evaluation value as the category of the text to be classified.
Optionally, evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text specifically comprises:
performing first word segmentation on the text to be classified to obtain a word segmentation result of the text to be classified;
obtaining, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database; and
obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result to that reference text.
Optionally, obtaining the evaluation value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts specifically comprises:
selecting, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold; and
obtaining the evaluation value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
Optionally, before obtaining the text to be classified and the corresponding pre-built database, the method further comprises:
crawling, with a crawler and from at least one pre-selected website, the reference texts published by the website and the categories of those reference texts, to obtain an initial database;
performing second word segmentation on the reference texts in the initial database to obtain the word segmentation result of each reference text, where the word segmentation rules of the first word segmentation are the same as those of the second word segmentation;
performing inverted index processing on the reference texts in the initial database according to their word segmentation results; and
obtaining the database according to the result of the inverted index processing and the crawled categories of the reference texts.
Optionally, obtaining the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result to that reference text specifically comprises:
averaging the contributions of the words in the word segmentation result of the text to be classified to a target reference text to obtain a contribution mean, where the database includes the target reference text; and
determining the contribution mean as the similarity metric value between the text to be classified and the target reference text.
An embodiment of this application provides a text classification device, comprising: a first acquisition module, a second acquisition module, an evaluation module and a determining module;
the first acquisition module is configured to obtain a text to be classified and a pre-built database, where the database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories;
the evaluation module is configured to evaluate the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text;
the second acquisition module is configured to obtain, according to the similarity metric values and the categories of the reference texts, an evaluation value of the text to be classified for each category;
the determining module is configured to determine the category corresponding to the maximum evaluation value as the category of the text to be classified.
Optionally, the device further comprises a word segmentation module;
the word segmentation module is configured to perform first word segmentation on the text to be classified to obtain the word segmentation result of the text to be classified;
the evaluation module specifically comprises a first acquisition submodule and a second acquisition submodule;
the first acquisition submodule is configured to obtain, based on the TF-IDF algorithm, the contribution of each word in the word segmentation result to each reference text in the database;
the second acquisition submodule is configured to obtain the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that reference text.
Optionally, the second acquisition module specifically comprises a screening submodule and a third acquisition submodule;
the screening submodule is configured to select, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
the third acquisition submodule is configured to obtain the evaluation value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
An embodiment of this application also provides a storage medium on which a program is stored; when the program is executed by a processor, the text classification method described in the above embodiments is implemented.
An embodiment of this application also provides a processor configured to run a program, where the program, when run, executes the text classification method described in the above embodiments.
With the above technical solution, the text classification method provided by this application uses the reference texts of known category in the pre-built database as the basis for classification. The text to be classified is matched against the reference texts by similarity retrieval, and its similarity to each reference text is evaluated. Then, from the known categories of the reference texts and the similarity of the text to be classified to the reference texts in each category, the evaluation value of the text to be classified for each category is determined, that is, the combined similarity between the text to be classified and the reference texts belonging to that category; the category with the maximum evaluation value is determined as the category of the text to be classified. Because the categories of the reference texts can be exhaustively enumerated, classifying a text requires neither manually defined classification rules nor any consideration of whether the rules are complete. Classification is achieved from the similarity between the text to be classified and reference texts of known category, so any text to be classified can be classified and the coverage of text classification is improved.
The above is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of this application are more readily apparent, specific embodiments of this application are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of this application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a schematic flowchart of a text classification method provided by an embodiment of this application;
Fig. 2 shows a schematic flowchart of evaluating the similarity between the text to be classified and the reference texts in the database in an embodiment of this application;
Fig. 3 shows a schematic structural diagram of a text classification device provided by an embodiment of this application.
Detailed description of embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Existing text classification methods typically rely on manually predefined conditional rules. For example, when regular expressions or decision tree algorithms are used, the classification rules of each category are determined first; the text to be classified is then matched against the rules of each category one by one in order, and the category of the rule that matches is determined as the category of the text. However, manually arranged or defined conditional rules are limited in number, and people do not express the same content in exactly the same way. Existing methods therefore cannot exhaust all the rules needed in practice, so the coverage of the classification algorithm is incomplete and some texts cannot be classified.
For this reason, an embodiment of this application provides a text classification method that takes multiple reference texts of known category, obtained in advance, as its basis. The text to be classified is matched against these reference texts by similarity retrieval, and the known category that contains the most reference texts highly similar to the text to be classified is taken as the category of that text. Because the categories of texts can be exhaustively enumerated, and the embodiment classifies the text from the result of similarity retrieval against the reference texts, no text to be classified is left unclassifiable, which improves the coverage of the text classification algorithm.
Based on the above idea, and to make the above objects, features and advantages of this application clearer, specific embodiments of this application are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, the figure is a schematic flowchart of a text classification method provided by an embodiment of this application.
The text classification method provided by this embodiment includes the following steps S101-S104.
S101: obtain the text to be classified and the pre-built database.
The database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories.
It can be understood that the reference texts and their categories in the database provide the basis for classifying the text to be classified. In a specific implementation, to make the database easier to manage and maintain and to improve classification efficiency, texts to be classified from different fields (such as the judicial field or the medical field) can correspond to different databases. For example, when classifying judicial texts, the reference texts in the database can be judicial documents issued by authoritative institutions (such as judgment documents and legal treatises), and the category of a reference text is the category (such as civil, criminal or administrative) that the issuing institution assigned to it.
In practice, when the database corresponding to the text to be classified is built in advance, a crawler can collect documents published online by authoritative institutions in the field of the text to be classified. For example, reference texts for the judicial field can be crawled from websites such as the judgment documents website, authoritative intellectual property websites and professional judicial websites. These authoritative websites generally classify the documents they publish. In this embodiment, the database can therefore be built directly from the documents published on authoritative websites and the categories those websites assign to them, which not only covers the text categories of the field exhaustively but also helps ensure the accuracy of the classification.
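As a minimal sketch of what such a database might hold, the following Python snippet models each record as a reference text together with the category its publisher assigned. The class name `ReferenceText`, the field names and the three sample records are hypothetical and only illustrate the "at least two categories" requirement; they are not taken from the application.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceText:
    """One database record: a reference text and its publisher-assigned category."""
    doc_id: int
    text: str
    category: str


# Hypothetical records standing in for documents crawled from authoritative sites.
DATABASE = [
    ReferenceText(0, "the parties dispute the contract and claim damages", "civil"),
    ReferenceText(1, "the defendant was convicted of theft", "criminal"),
    ReferenceText(2, "the agency revoked the licence of the applicant", "administrative"),
]

# The method requires the reference texts to fall into at least two categories.
category_counts = Counter(ref.category for ref in DATABASE)
if len(category_counts) < 2:
    raise ValueError("the database must contain at least two categories")
print(dict(category_counts))
```

Keeping the category on the record itself means that later steps can group similarity values by category without a second lookup table.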
S102: evaluate the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text.
In this embodiment, the similarity between the two can be evaluated through the contribution of specific words of the text to be classified to the reference text (that is, by determining whether certain words in the text to be classified appear in the reference text and how much those words contribute to the reference text), which yields the similarity metric value. It can be understood that the higher a word's contribution to the reference text, the more important the word is in the reference text and the more likely it is to be a keyword of the reference text. Therefore, the more keywords of the reference text the text to be classified contains, the higher the similarity between the reference text and the text to be classified.
In a possible implementation of this embodiment, as shown in Fig. 2, step S102 can specifically include the following steps S1021-S1023.
S1021: perform first word segmentation on the text to be classified to obtain the word segmentation result of the text to be classified.
It can be understood that those skilled in the art can use any word segmentation algorithm for the first word segmentation of the text to be classified; the embodiments of this application place no restriction on this.
S1022: based on the term frequency-inverse document frequency (TF-IDF) algorithm, obtain the contribution of each word in the word segmentation result to each reference text in the database.
TF-IDF is a statistical method for assessing how important a word (here, each word in the word segmentation result of the text to be classified) is to one document in a document set or corpus (here, the database). The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how frequently it appears across the corpus.
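The two opposing effects can be illustrated with a small Python sketch of plain TF-IDF. The function name `tf_idf`, the toy corpus and the smoothed form of the IDF term are assumptions made for illustration; the application does not fix a particular variant at this point.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Plain TF-IDF of `term` for one tokenized document of `corpus`."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for tokens in corpus if term in tokens)
    # Smoothed inverse document frequency (an assumed variant).
    idf = math.log(len(corpus) / (1 + df)) + 1.0
    return tf * idf


# Toy corpus of already segmented reference texts (hypothetical data).
corpus = [
    ["the", "contract", "dispute"],
    ["the", "theft", "conviction"],
    ["the", "licence", "revoked"],
]
for word in ("contract", "the"):
    print(word, round(tf_idf(word, corpus[0], corpus), 3))
```

In the first document both words occur once, but "contract" receives the larger weight because it occurs in only one document of the corpus, whereas "the" occurs in all of them.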
Therefore, to facilitate the subsequent application of the TF-IDF algorithm, the database in this embodiment can be built through the following steps:
First, a crawler crawls, from at least one pre-selected website, the reference texts published by the website and the categories of those reference texts, yielding an initial database.
The crawling of reference texts and their categories is similar to that described above; refer to the explanation above, which is not repeated here.
Second, second word segmentation is performed on the reference texts in the initial database to obtain the word segmentation result of each reference text.
It should be noted that, to keep the TF-IDF algorithm accurate, the word segmentation rules of the first word segmentation and the second word segmentation are the same. That is, the same segmentation algorithm and rules are applied to the reference texts in the initial database and to the text to be classified.
Third, inverted index processing is performed on the reference texts in the initial database according to their word segmentation results.
An inverted index arises from the practical need to look up records by the value of an attribute. Each entry in an inverted index table contains an attribute value and the addresses of the records that have that attribute value. Because the records are located from the attribute value, rather than the attribute value being determined from the record, it is called an inverted index.
In this embodiment, performing inverted index processing on the reference texts in the initial database means recording, for each word (that is, query term) in the word segmentation results, the reference texts in the initial database in which the word occurs. For example, inverted index processing for the word "contract" in the word segmentation results means finding the reference texts in the initial database in which "contract" occurs and associating them with the word "contract".
Finally, the database is obtained from the result of the inverted index processing and the crawled categories of the reference texts. That is, in this embodiment, the final database includes the inverted index of the word segmentation results of the reference texts and the category of each reference text.
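The inverted index step can be sketched in a few lines of Python. The function `build_inverted_index`, the dictionary layout of `database` and the sample data are hypothetical; crawling and word segmentation are assumed to have already produced the `segmented` and `categories` mappings.

```python
from collections import defaultdict


def build_inverted_index(segmented):
    """Map each word to the sorted ids of the reference texts that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in segmented.items():
        for token in tokens:
            index[token].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Word segmentation results and crawled categories (hypothetical data).
segmented = {
    0: ["contract", "dispute"],
    1: ["theft", "conviction"],
    2: ["contract", "licence"],
}
categories = {0: "civil", 1: "criminal", 2: "civil"}

# The final database: inverted index plus the category of each reference text.
database = {"index": build_inverted_index(segmented), "categories": categories}
print(database["index"]["contract"])  # [0, 2]
```

Looking up "contract" returns the ids of the two reference texts that contain the word, which is exactly the word-to-texts association described for the example above.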
In practical applications, the contribution of each word in the word segmentation result of the text to be classified to each reference text in the database can be obtained with the following formula:

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot\sum_{t\,\in\,q}\Big(\mathrm{tf}(t\ \mathrm{in}\ d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{Boost}()\cdot\mathrm{norm}(t,d)\Big)$$

In the formula, the left-hand side, written here as score(q, d), is the contribution of q to d;
t is a query term that carries field information, meaning that the same word in the title and in the body of an article counts as different query terms; in a possible implementation of this embodiment, only query terms occurring in the text body are counted toward the contribution;
q is the query statement, which includes at least one query term t; in this embodiment, q is the word in the word segmentation result of the text to be classified whose contribution is currently being calculated;
d is a reference text in the database;
t in q ranges over the query terms, with field information, contained in q; the summation adds up tf(t in d) × idf(t)² × Boost() × norm(t, d) over each query term t in the query statement q;
tf(t in d) is the term frequency factor: the more often the query term t occurs in the reference text d, the higher that text scores;
idf(t) is the inverse document frequency factor, derived from how often the query term t occurs in the inverted index: a word that occurs frequently in the inverted index has a lower idf, and a word that occurs rarely has a higher idf;
Boost() is a weighting value;
norm(t, d) and queryNorm(q) are normalization factors;
coord(q, d) measures how many of the query conditions of the query statement q the reference text d satisfies: the more words satisfying the query conditions of q a text contains, the higher its coord; in this embodiment, the more words of the reference text d that take part in the inverted index, the higher its coord.
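To make the roles of the individual factors concrete, the following Python sketch evaluates a score of the form coord × queryNorm × Σ(tf × idf² × boost × norm). The concrete definitions used for each factor (square-root term frequency, smoothed logarithmic idf, inverse square-root length normalization, overlap ratio for coord) follow one common convention in TF-IDF search scoring and are assumptions; the application names the factors but does not define them. The function name `score` and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, corpus, boost=1.0):
    """Sketch of coord * queryNorm * sum(tf * idf^2 * boost * norm)."""
    if not query_terms or not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    unique_terms = set(query_terms)

    def idf(term):
        # Assumed form: rarer terms in the corpus get a larger value.
        df = sum(1 for tokens in corpus if term in tokens)
        return 1.0 + math.log(len(corpus) / (df + 1))

    # coord: fraction of the query terms that occur in the document.
    coord = sum(1 for t in unique_terms if counts[t] > 0) / len(unique_terms)
    # queryNorm: keeps scores of different queries on a comparable scale.
    query_norm = 1.0 / math.sqrt(sum((idf(t) * boost) ** 2 for t in unique_terms))
    # norm: length normalization, so long documents are not favoured.
    norm = 1.0 / math.sqrt(len(doc_tokens))
    # tf is taken as the square root of the raw count (assumed form).
    total = sum(math.sqrt(counts[t]) * idf(t) ** 2 * boost * norm for t in unique_terms)
    return coord * query_norm * total


corpus = [
    ["contract", "dispute", "contract"],
    ["theft", "conviction"],
    ["licence", "revoked"],
]
print(score(["contract"], corpus[0], corpus) > 0)  # True
print(score(["contract"], corpus[1], corpus))      # 0.0
```

With a single-word query, as in this embodiment, coord is either 1 or 0, so a reference text that does not contain the word contributes nothing.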
S1023: obtain the similarity metric value between the text to be classified and a reference text according to the contributions of the words in the word segmentation result of the text to be classified to that reference text.
In this embodiment, after the contribution of each word in the word segmentation result of the text to be classified to a given reference text has been obtained, these contributions can be averaged, and the resulting value is the similarity metric value between the text to be classified and that reference text.
That is, the contributions of the words in the word segmentation result of the text to be classified to a target reference text in the database are averaged to obtain a contribution mean, and the contribution mean is determined as the similarity metric value between the text to be classified and the target reference text.
It can be understood that those skilled in the art can also combine the contributions of the words to the same reference text in a way other than averaging (such as summation); the embodiments of this application place no restriction on this, and the alternatives are not enumerated here.
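The averaging step is small enough to show directly. In this Python sketch the function name `similarity_metric` and the per-word contribution values are hypothetical; the contributions are assumed to have been computed already in step S1022.

```python
def similarity_metric(contributions):
    """Average the per-word contributions to one reference text."""
    if not contributions:
        return 0.0
    return sum(contributions) / len(contributions)


# Hypothetical contributions of the words of one text to be classified
# to a single target reference text.
per_word = {"contract": 0.9, "dispute": 0.6, "court": 0.0}
print(similarity_metric(list(per_word.values())))
```

A word that does not occur in the reference text contributes zero and therefore pulls the mean down, which is one reason to prefer the mean over a plain sum when the texts to be classified differ widely in length.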
S103: obtain the evaluation value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts.
In this embodiment, using the known categories of the reference texts in the database, the similarity of the text to be classified to the reference texts under each category is aggregated; the aggregate is the evaluation value of the text to be classified for that category. The larger the evaluation value, the more similar the text to be classified is to the reference texts under that category.
In a possible implementation of this embodiment, to remove data noise, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold can first be selected from the database; the evaluation value of the text to be classified for each category is then obtained from the similarity metric values between the text to be classified and the selected reference texts in that category.
In practice, those skilled in the art can set the preset threshold according to the actual situation; specific values are not enumerated here.
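A minimal Python sketch of the threshold filter followed by per-category aggregation is given here. Summing the retained similarity values is an assumption: the application only says the similarities are aggregated, and counting the retained reference texts per category would be an equally plausible reading. The function name `evaluate_categories`, the threshold of 0.1 and the sample values are hypothetical.

```python
from collections import defaultdict


def evaluate_categories(similarities, categories, threshold):
    """Sum, per category, the similarity values that exceed `threshold`."""
    evaluation = defaultdict(float)
    for doc_id, value in similarities.items():
        if value > threshold:  # drop weakly similar reference texts as noise
            evaluation[categories[doc_id]] += value
    return dict(evaluation)


# Similarity metric values per reference text id (hypothetical data).
similarities = {0: 0.82, 1: 0.05, 2: 0.40, 3: 0.64}
categories = {0: "civil", 1: "criminal", 2: "criminal", 3: "civil"}
print(evaluate_categories(similarities, categories, threshold=0.1))
```

Reference text 1 falls below the threshold and is ignored, so the "criminal" category is evaluated from reference text 2 alone.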
S104: determine the category corresponding to the maximum evaluation value as the category of the text to be classified.
The larger the evaluation value of the text to be classified for a category, the more similar the text is to the reference texts under that category. Therefore, among the evaluation values of the text to be classified for the categories in the database, the category corresponding to the maximum is the category of the text to be classified.
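The final decision is an arg-max over the evaluation values, sketched here in Python. The function name `classify` and the sample values are hypothetical, and returning `None` for an empty input is an assumed convention for the case where no evaluation values are available.

```python
def classify(evaluation):
    """Return the category with the largest evaluation value, or None if empty."""
    if not evaluation:
        return None
    return max(evaluation, key=evaluation.get)


print(classify({"civil": 1.46, "criminal": 0.40}))  # civil
```

If a threshold filter is applied beforehand and removes every reference text, the evaluation mapping is empty; an implementation would then need some fallback, such as lowering the threshold, to still assign a category.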
In this embodiment, the reference texts of known category in the pre-built database serve as the basis for classification. The text to be classified is matched against the reference texts by similarity retrieval, and its similarity to each reference text is evaluated. Then, from the known categories of the reference texts and the similarity of the text to be classified to the reference texts in each category, the evaluation value of the text to be classified for each category is determined, that is, the combined similarity between the text to be classified and the reference texts belonging to that category; the category with the maximum evaluation value is determined as the category of the text to be classified. Because the categories of the reference texts can be exhaustively enumerated, classifying a text requires neither manually defined classification rules nor any consideration of whether the rules are complete. Classification is achieved from the similarity between the text to be classified and reference texts of known category, so any text to be classified can be classified and the coverage of text classification is improved.
Based on the text classification method provided by the above embodiment, an embodiment of this application also provides a text classification device.
Referring to Fig. 3, the figure is a schematic structural diagram of a text classification device provided by an embodiment of this application.
The text classification device provided by this embodiment includes: a first acquisition module 100, a second acquisition module 200, an evaluation module 300 and a determining module 400.
The first acquisition module 100 is configured to obtain the pre-built database and the text to be classified; the database includes multiple reference texts and the category of each reference text, and the reference texts in the database fall into at least two categories.
The evaluation module 300 is configured to evaluate the similarity between the text to be classified and each reference text in the database to obtain a similarity metric value between the text to be classified and each reference text.
The second acquisition module 200 is configured to obtain, according to the similarity metric values and the categories of the reference texts, the evaluation value of the text to be classified for each category.
The determining module 400 is configured to determine the category corresponding to the maximum evaluation value as the category of the text to be classified.
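How the four modules hand data to one another can be sketched as a small Python class. Everything here is hypothetical: the class name `TextClassificationDevice`, the choice of plain callables as modules, and in particular the word-overlap similarity, which stands in for the TF-IDF based contribution only to keep the example short.

```python
class TextClassificationDevice:
    """Wires the four modules together; each module is a plain callable here."""

    def __init__(self, first_acquisition, evaluation, second_acquisition, determination):
        self.first_acquisition = first_acquisition    # module 100
        self.evaluation = evaluation                  # module 300
        self.second_acquisition = second_acquisition  # module 200
        self.determination = determination            # module 400

    def classify(self, text):
        database = self.first_acquisition()
        similarities = self.evaluation(text, database)
        per_category = self.second_acquisition(similarities, database)
        return self.determination(per_category)


# Toy stand-ins for the modules (hypothetical data and logic).
def load_database():
    return [("contract dispute damages", "civil"),
            ("theft conviction sentence", "criminal")]


def overlap(text, database):
    # Share of the input words found in each reference text.
    words = set(text.split())
    return [len(words & set(ref.split())) / len(words) for ref, _ in database]


def sum_by_category(similarities, database):
    totals = {}
    for value, (_, category) in zip(similarities, database):
        totals[category] = totals.get(category, 0.0) + value
    return totals


def arg_max(per_category):
    return max(per_category, key=per_category.get)


device = TextClassificationDevice(load_database, overlap, sum_by_category, arg_max)
print(device.classify("breach of contract damages"))  # civil
```

Passing the modules in as callables keeps each stage replaceable, so the toy similarity can be swapped for a TF-IDF based one without touching the rest of the pipeline.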
In the possible implementation of the present embodiment, the device further include: word segmentation module.
Word segmentation module carries out first participle processing for treating classifying text, obtains the word segmentation result of text to be sorted.
Evaluation module 300, specifically includes: the first acquisition submodule and the second acquisition submodule.
First acquisition submodule obtains each in each word pair database in word segmentation result for being based on TF-IDF algorithm
The contribution degree of referenced text.
Second acquisition submodule, for tribute of each word to same referenced text in the word segmentation result according to text to be sorted
Degree of offering obtains the measuring similarity value of text to be sorted Yu the referenced text.
In a possible implementation of this embodiment, the second acquisition module 200 specifically comprises a screening submodule and a third acquisition submodule.
The screening submodule is configured to select, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold.
The third acquisition submodule is configured to obtain the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
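The screening and aggregation steps could look like the following sketch. The text does not say how the retained similarity values are combined into an assessment value, so summation is used here purely as an assumption. The name `assess_categories` and the sample scores are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assess_categories(scored_refs: List[Tuple[float, str]],
                      threshold: float) -> Dict[str, float]:
    """Sum the similarity values above `threshold`, grouped by category."""
    assessment: Dict[str, float] = defaultdict(float)
    for similarity, category in scored_refs:
        # Screening: keep only values strictly greater than the threshold.
        if similarity > threshold:
            assessment[category] += similarity
    return dict(assessment)


# (similarity metric value, category) for four reference texts.
scored = [(0.9, "law"), (0.4, "law"), (0.7, "sports"), (0.1, "sports")]
result = assess_categories(scored, threshold=0.3)
best = max(result, key=result.get)
```

With a threshold of 0.3, the reference text scoring 0.1 is dropped before aggregation, so weakly related reference texts cannot accumulate into a large assessment value simply because a category has many members.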
In a possible implementation of this embodiment, the apparatus further comprises a database construction module.
The database construction module specifically comprises a crawling submodule, a processing submodule and a fourth acquisition submodule.
The crawling submodule is configured to crawl, by means of a crawler, the reference texts published by at least one pre-selected website together with the categories of those reference texts, obtaining an initial database.
The word segmentation module is further configured to perform second word segmentation processing on the reference texts in the initial database, obtaining the word segmentation result of each reference text; the word segmentation rule of the first word segmentation processing is the same as that of the second word segmentation processing.
The processing submodule is configured to perform inverted index processing on the reference texts in the initial database according to their word segmentation results.
The fourth acquisition submodule is configured to obtain the database according to the result of the inverted index processing and the categories of the crawled reference texts.
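The construction steps after crawling can be illustrated with a small in-memory inverted index. This is a sketch under stated assumptions: `tokenize` is a whitespace-and-lowercase stand-in for the word segmentation rule (Chinese text would need a real word segmenter), and `build_database` and `candidates` are hypothetical names. The point it shows is that the same tokenizer is applied to reference texts and to the query, matching the requirement that both word segmentation passes share one rule.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Stand-in segmentation rule, shared by reference texts and queries."""
    return text.lower().split()


def build_database(
    crawled: List[Tuple[str, str]]
) -> Tuple[Dict[str, Set[int]], Dict[int, str]]:
    """Build an inverted index (word -> reference ids) and a category table."""
    index: Dict[str, Set[int]] = defaultdict(set)
    categories: Dict[int, str] = {}
    for ref_id, (text, category) in enumerate(crawled):
        categories[ref_id] = category
        for word in tokenize(text):
            index[word].add(ref_id)
    return dict(index), categories


def candidates(query: str, index: Dict[str, Set[int]]) -> Set[int]:
    """Ids of reference texts that share at least one word with the query."""
    found: Set[int] = set()
    for word in tokenize(query):
        found |= index.get(word, set())
    return found


crawled = [
    ("Court ruling", "law"),
    ("Football match", "sports"),
    ("court match report", "news"),
]
index, categories = build_database(crawled)
```

Looking up query words in the index restricts similarity evaluation to reference texts that share at least one word with the query, which is the usual motivation for an inverted index.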
In a possible implementation of this embodiment, the second acquisition submodule specifically comprises a calculation submodule and a determination submodule.
The calculation submodule is configured to average the contribution degrees of the words in the word segmentation result of the text to be classified to a target reference text, obtaining a contribution degree mean; the database includes the target reference text.
The determination submodule is configured to determine the contribution degree mean as the similarity metric value between the text to be classified and the target reference text.
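The averaging step can also be written in symbols. The notation below is introduced only for illustration and does not appear in the source: q is the text to be classified, W_q its word segmentation result, d the target reference text, and c(w, d) the TF-IDF contribution degree of word w to d.

```latex
% W_q     : words in the segmentation result of the text to be classified q
% c(w, d) : contribution degree of word w to the target reference text d
\[
  \operatorname{sim}(q, d) \;=\; \frac{1}{|W_q|} \sum_{w \in W_q} c(w, d)
\]
% Worked example: three words with contribution degrees 0.6, 0.0 and 0.3 give
% sim(q, d) = (0.6 + 0.0 + 0.3) / 3 = 0.3
```

Dividing by the number of words keeps the similarity metric values of short and long texts on a comparable scale.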
In this embodiment, the reference texts of known category in the pre-built database serve as the basis for classification. The text to be classified is matched against the reference texts by similarity retrieval, and its similarity to each reference text is evaluated. Then, according to the known categories of the reference texts and the similarity of the text to be classified to the reference texts in each category, the assessment value of the text to be classified for each category is determined, that is, the overall similarity between the text to be classified and the reference texts belonging to that category. The category corresponding to the maximum assessment value is determined as the category of the text to be classified. Because the categories of the reference texts can be enumerated exhaustively, there is no need to set classification rules manually or to consider whether the rules are complete when classifying a text. Classification is achieved from the similarity between the text to be classified and the reference texts of known category, so any text to be classified can be classified, which improves the coverage of text classification.
Based on the text classification method and apparatus provided by the above embodiments, an embodiment of the present application further provides another text classification apparatus.
The text classification apparatus provided in this embodiment includes a processor and a memory. The first acquisition module, the second acquisition module, the evaluation module and the determining module of the above embodiment are stored in the memory as program modules, and the processor executes the program modules stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the text to be classified is classified by adjusting kernel parameters.
The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
In this embodiment, the reference texts of known category in the pre-built database serve as the basis for classification. The text to be classified is matched against the reference texts by similarity retrieval, and its similarity to each reference text is evaluated. Then, according to the known categories of the reference texts and the similarity of the text to be classified to the reference texts in each category, the assessment value of the text to be classified for each category is determined, that is, the overall similarity between the text to be classified and the reference texts belonging to that category. The category corresponding to the maximum assessment value is determined as the category of the text to be classified. Because the categories of the reference texts can be enumerated exhaustively, there is no need to set classification rules manually or to consider whether the rules are complete when classifying a text. Classification is achieved from the similarity between the text to be classified and the reference texts of known category, so any text to be classified can be classified, which improves the coverage of text classification.
Based on the text classification method and apparatus provided by the above embodiments, an embodiment of the present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:
obtaining a text to be classified and a pre-built database, the database including multiple reference texts and the category of each reference text, the reference texts in the database being divided into at least two categories; evaluating the similarity between the text to be classified and each reference text in the database, obtaining a similarity metric value between the text to be classified and each reference text; obtaining, according to the similarity metric values and the categories of the reference texts, an assessment value of the text to be classified for each category; and determining the category corresponding to the maximum assessment value as the category of the text to be classified.
Evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text may specifically include: performing first word segmentation processing on the text to be classified, obtaining its word segmentation result; obtaining, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database; and obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text.
Obtaining the assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts may specifically include: selecting, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold; and obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
Before obtaining the text to be classified and the corresponding pre-built database, the steps may further include: crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts, obtaining an initial database; performing second word segmentation processing on the reference texts in the initial database, obtaining the word segmentation result of each reference text, the word segmentation rule of the first word segmentation processing being the same as that of the second word segmentation processing; performing inverted index processing on the reference texts in the initial database according to their word segmentation results; and obtaining the database according to the result of the inverted index processing and the categories of the crawled reference texts.
Obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text may specifically include: averaging the contribution degrees of the words in the word segmentation result of the text to be classified to a target reference text, obtaining a contribution degree mean, the database including the target reference text; and determining the contribution degree mean as the similarity metric value between the text to be classified and the target reference text.
An embodiment of the present invention provides a storage medium on which a program is stored; when executed by a processor, the program implements the text classification method described in the above embodiments.
An embodiment of the present invention provides a processor configured to run a program, wherein the program, when run, performs the text classification method described in the above embodiments.
An embodiment of the present invention provides a device comprising a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor performs the following steps:
obtaining a text to be classified and a pre-built database, the database including multiple reference texts and the category of each reference text, the reference texts in the database being divided into at least two categories; evaluating the similarity between the text to be classified and each reference text in the database, obtaining a similarity metric value between the text to be classified and each reference text; obtaining, according to the similarity metric values and the categories of the reference texts, an assessment value of the text to be classified for each category; and determining the category corresponding to the maximum assessment value as the category of the text to be classified.
Evaluating the similarity between the text to be classified and each reference text in the database to obtain the similarity metric value between the text to be classified and each reference text may specifically include: performing first word segmentation processing on the text to be classified, obtaining its word segmentation result; obtaining, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database; and obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text.
Obtaining the assessment value of the text to be classified for each category according to the similarity metric values and the categories of the reference texts may specifically include: selecting, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold; and obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
Before obtaining the text to be classified and the corresponding pre-built database, the steps may further include: crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts, obtaining an initial database; performing second word segmentation processing on the reference texts in the initial database, obtaining the word segmentation result of each reference text, the word segmentation rule of the first word segmentation processing being the same as that of the second word segmentation processing; performing inverted index processing on the reference texts in the initial database according to their word segmentation results; and obtaining the database according to the result of the inverted index processing and the categories of the crawled reference texts.
Obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result to that same reference text may specifically include: averaging the contribution degrees of the words in the word segmentation result of the text to be classified to a target reference text, obtaining a contribution degree mean, the database including the target reference text; and determining the contribution degree mean as the similarity metric value between the text to be classified and the target reference text.
The device herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the scope of its claims.
Claims (10)
1. A text classification method, characterized in that the method comprises:
obtaining a text to be classified and a pre-built database, the database including multiple reference texts and the category of each reference text, the reference texts in the database being divided into at least two categories;
evaluating the similarity between the text to be classified and each reference text in the database, obtaining a similarity metric value between the text to be classified and each reference text;
obtaining, according to the similarity metric values and the categories of the reference texts, an assessment value of the text to be classified for each category;
determining the category corresponding to the maximum assessment value as the category of the text to be classified.
2. The text classification method according to claim 1, characterized in that evaluating the similarity between the text to be classified and each reference text in the database, obtaining the similarity metric value between the text to be classified and each reference text, specifically comprises:
performing first word segmentation processing on the text to be classified, obtaining the word segmentation result of the text to be classified;
obtaining, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database;
obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result of the text to be classified to that same reference text.
3. The text classification method according to claim 1 or 2, characterized in that obtaining, according to the similarity metric values and the categories of the reference texts, the assessment value of the text to be classified for each category specifically comprises:
selecting, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
obtaining the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
4. The text classification method according to claim 2, characterized in that, before obtaining the text to be classified and the corresponding pre-built database, the method further comprises:
crawling, by means of a crawler, the reference texts published by at least one pre-selected website and the categories of those reference texts, obtaining an initial database;
performing second word segmentation processing on the reference texts in the initial database, obtaining the word segmentation result of each reference text, the word segmentation rule of the first word segmentation processing being the same as that of the second word segmentation processing;
performing inverted index processing on the reference texts in the initial database according to the word segmentation results of the reference texts in the initial database;
obtaining the database according to the result of the inverted index processing and the categories of the crawled reference texts.
5. The text classification method according to claim 2, characterized in that obtaining the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result of the text to be classified to that same reference text specifically comprises:
averaging the contribution degrees of the words in the word segmentation result of the text to be classified to a target reference text, obtaining a contribution degree mean, the database including the target reference text;
determining the contribution degree mean as the similarity metric value between the text to be classified and the target reference text.
6. A text classification apparatus, characterized in that the apparatus comprises: a first acquisition module, a second acquisition module, an evaluation module and a determining module;
the first acquisition module is configured to obtain a text to be classified and a pre-built database, the database including multiple reference texts and the category of each reference text, the reference texts in the database being divided into at least two categories;
the evaluation module is configured to evaluate the similarity between the text to be classified and each reference text in the database, obtaining a similarity metric value between the text to be classified and each reference text;
the second acquisition module is configured to obtain, according to the similarity metric values and the categories of the reference texts, an assessment value of the text to be classified for each category;
the determining module is configured to determine the category corresponding to the maximum assessment value as the category of the text to be classified.
7. The text classification apparatus according to claim 6, characterized in that the apparatus further comprises a word segmentation module;
the word segmentation module is configured to perform first word segmentation processing on the text to be classified, obtaining the word segmentation result of the text to be classified;
the evaluation module specifically comprises a first acquisition submodule and a second acquisition submodule;
the first acquisition submodule is configured to obtain, based on the TF-IDF algorithm, the contribution degree of each word in the word segmentation result to each reference text in the database;
the second acquisition submodule is configured to obtain the similarity metric value between the text to be classified and a reference text according to the contribution degrees of the words in the word segmentation result of the text to be classified to that same reference text.
8. The text classification apparatus according to claim 6 or 7, characterized in that the second acquisition module specifically comprises a screening submodule and a third acquisition submodule;
the screening submodule is configured to select, from the database, the reference texts whose similarity metric value with the text to be classified is greater than a preset threshold;
the third acquisition submodule is configured to obtain the assessment value of the text to be classified for each category according to the similarity metric values between the text to be classified and the selected reference texts in that category.
9. A storage medium, characterized in that a program is stored thereon, and the program, when executed by a processor, implements the text classification method according to any one of claims 1-5.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the text classification method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710910888.6A CN110019785B (en) | 2017-09-29 | 2017-09-29 | Text classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710910888.6A CN110019785B (en) | 2017-09-29 | 2017-09-29 | Text classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110019785A (en) | 2019-07-16 |
CN110019785B (en) | 2022-03-01 |
Family
ID=67186452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710910888.6A Active CN110019785B (en) | 2017-09-29 | 2017-09-29 | Text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110019785B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990577A (en) * | 2019-12-25 | 2020-04-10 | 北京亚信数据有限公司 | Text classification method and device |
CN112948370A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Data classification method and device and computer equipment |
CN113111173A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm receiving warning condition category determination method and device |
CN113220840A (en) * | 2021-05-17 | 2021-08-06 | 北京百度网讯科技有限公司 | Text processing method, device, equipment and storage medium |
CN113254655A (en) * | 2021-07-05 | 2021-08-13 | 北京邮电大学 | Text classification method, electronic device and computer storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030172063A1 (en) * | 2002-03-07 | 2003-09-11 | Koninklijke Philips Electronics N.V. | Method and apparatus for providing search results in response to an information search request |
CN103049569A (en) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Text similarity matching method on basis of vector space model |
CN103714118A (en) * | 2013-11-22 | 2014-04-09 | 浙江大学 | Book cross-reading method |
US20140278353A1 (en) * | 2013-03-13 | 2014-09-18 | Crimson Hexagon, Inc. | Systems and Methods for Language Classification |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030172063A1 (en) * | 2002-03-07 | 2003-09-11 | Koninklijke Philips Electronics N.V. | Method and apparatus for providing search results in response to an information search request |
CN103049569A (en) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Text similarity matching method on basis of vector space model |
US20140278353A1 (en) * | 2013-03-13 | 2014-09-18 | Crimson Hexagon, Inc. | Systems and Methods for Language Classification |
CN103714118A (en) * | 2013-11-22 | 2014-04-09 | 浙江大学 | Book cross-reading method |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112948370A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Data classification method and device and computer equipment |
CN110990577A (en) * | 2019-12-25 | 2020-04-10 | 北京亚信数据有限公司 | Text classification method and device |
CN113111173A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm receiving warning condition category determination method and device |
CN113220840A (en) * | 2021-05-17 | 2021-08-06 | 北京百度网讯科技有限公司 | Text processing method, device, equipment and storage medium |
CN113220840B (en) * | 2021-05-17 | 2023-08-01 | 北京百度网讯科技有限公司 | Text processing method, device, equipment and storage medium |
CN113254655A (en) * | 2021-07-05 | 2021-08-13 | 北京邮电大学 | Text classification method, electronic device and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110019785B (en) | 2022-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110019785A (en) | A kind of file classification method and device | |
CN110019668A (en) | A kind of text searching method and device | |
US20140108190A1 (en) | Recommending product information | |
US11048732B2 (en) | Systems and methods for records tagging based on a specific area or region of a record | |
Hung et al. | Customer segmentation using hierarchical agglomerative clustering | |
US20110029476A1 (en) | Indicating relationships among text documents including a patent based on characteristics of the text documents | |
CN104281585A (en) | Object ordering method and device | |
CN110059991B (en) | Warehouse item selection method, system, electronic device and computer readable medium | |
CN106934410A (en) | The sorting technique and system of data | |
CN110162778A (en) | The generation method and device of text snippet | |
US8977622B1 (en) | Evaluation of nodes | |
US11650999B2 (en) | Database search enhancement and interactive user interface therefor | |
CN110019670A (en) | A kind of text searching method and device | |
Sivaranjani et al. | Hybrid Particle Swarm Optimization-Firefly algorithm (HPSOFF) for combinatorial optimization of non-slicing VLSI floorplanning | |
CN108228612A (en) | A kind of method and device for extracting network event keyword and mood tendency | |
CN107784027A (en) | A kind of reminding method and device of judgement document's search key | |
CN108595460A (en) | Multichannel evaluating method and system, the computer program of keyword Automatic | |
Sharma et al. | Intelligent data analysis using optimized support vector machine based data mining approach for tourism industry | |
CN109218211A (en) | The method of adjustment of threshold value, device and equipment in the control strategy of data flow | |
CN114490786A (en) | Data sorting method and device | |
Angelini et al. | The complex dynamics of products and its asymptotic properties | |
CN109359346A (en) | A kind of heat load prediction method, apparatus, readable medium and electronic equipment | |
CN109117434A (en) | Judgement document's search method, device, storage medium and processor | |
CN109426905A (en) | A kind of determination method and device that the criminal document measurement of penalty deviates | |
CN105786929B (en) | A kind of information monitoring method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| | GR01 | Patent grant | |