CN109739989A - Text classification method and computer device - Google Patents

Text classification method and computer device Download PDF

Info

Publication number
CN109739989A
CN109739989A (application CN201811653926.5A; granted as CN109739989B)
Authority
CN
China
Prior art keywords
classification
sub-text data
text
preset category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811653926.5A
Other languages
Chinese (zh)
Other versions
CN109739989B (en)
Inventor
李斌 (Li Bin)
禹庆华 (Yu Qinghua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qianxin Technology Co Ltd
Original Assignee
Beijing Qianxin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qianxin Technology Co Ltd
Priority to CN201811653926.5A
Publication of CN109739989A
Application granted
Publication of CN109739989B
Legal status: Active (granted)


Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The present disclosure provides a text classification method, comprising: obtaining a text to be classified; obtaining a first classification result based on the full text data of the text to be classified; obtaining a second classification result based on one or more sub-text data extracted from the full text data; and determining the classification result of the text to be classified according to the first classification result and the second classification result. The present disclosure also provides a computer device.

Description

Text classification method and computer device
Technical field
The present disclosure relates to a text classification method and a computer device.
Background art
Text classification is the process of determining the category of a text from its content under a given classification system. It is an important part of natural language processing and has a wide range of applications, including news categorization, mail classification, spam filtering, and non-compliant webpage identification.
Existing text classification schemes classify a text based on its entire content. Because the entire content of a text contains a large amount of interference information irrelevant to the classification, the discriminative features of the classification may be buried in that interference, and accurate classification results cannot be obtained.
Summary of the invention
One aspect of the present disclosure provides a text classification method, comprising: obtaining a text to be classified; obtaining a first classification result based on the full text data of the text to be classified; obtaining a second classification result based on one or more sub-text data extracted from the full text data; and determining the classification result of the text to be classified according to the first classification result and the second classification result.
Optionally, obtaining the first classification result based on the full text data of the text to be classified includes: inputting the full text data into a full-text classification model corresponding to multiple preset categories, determining, based on the full-text classification model, a first score of the full text data for each of the multiple preset categories, and taking the preset category with the highest first score as the category corresponding to the full text data.
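As an illustration only (the patent does not specify the model), the first classification step amounts to a scorer that returns one first score per preset category, followed by an argmax. The keyword-frequency "model" and category names below are hypothetical stand-ins:

```python
# Minimal sketch of the first classification step. The keyword-frequency
# scorer is a stand-in for whatever trained full-text classifier is used;
# the category names "violation" and "normal" are invented for illustration.

def full_text_scores(full_text: str) -> dict:
    """Return a first score for each preset category (toy model)."""
    cues = {"violation": ["gamble", "bet"], "normal": ["news", "weather"]}
    return {cat: sum(full_text.count(w) for w in words)
            for cat, words in cues.items()}

def classify_full_text(full_text: str) -> str:
    # The preset category with the highest first score wins.
    scores = full_text_scores(full_text)
    return max(scores, key=scores.get)

print(classify_full_text("daily news and weather report"))  # -> normal
```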
Optionally, before the second classification result is obtained based on the one or more sub-text data extracted from the full text data, the method further includes: extracting one or more sub-text data from the full text data. Extracting the one or more sub-text data from the full text data includes: matching the full text data against the keywords in a preset keyword set; for a first keyword that is successfully matched, extracting from the full text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword; and combining the extracted character strings and the first keyword, in their order of position in the full text data, into one sub-text data.
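Under stated assumptions (character-offset matching and illustrative window lengths; the patent fixes neither), the extraction step can be sketched as follows. The sample keywords and `find`-based matching are not from the patent:

```python
# Sketch of sub-text extraction: every occurrence of a preset keyword is
# located in the full text data, a first-preset-length string before it and
# a second-preset-length string after it are cut out, and the context plus
# the keyword are joined in their original position order into one sub-text.

def extract_subtexts(full_text: str, keywords, before: int = 10, after: int = 10):
    subtexts = []
    for kw in keywords:
        start = full_text.find(kw)
        while start != -1:
            lo = max(0, start - before)                        # context before
            hi = min(len(full_text), start + len(kw) + after)  # context after
            subtexts.append(full_text[lo:hi])                  # keyword kept in place
            start = full_text.find(kw, start + 1)
    return subtexts

text = "breaking news: the casino offers free bets to new users tonight"
for sub in extract_subtexts(text, ["casino", "bets"], before=5, after=5):
    print(sub)
```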
Optionally, obtaining the second classification result based on the one or more sub-text data extracted from the full text data includes: for a first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the multiple preset categories, and determining, based on the sub-text classification model, a second score of the first sub-text data for each of the multiple preset categories; and calculating, based on the second score of each of the one or more sub-text data for each preset category, a third score of the one or more sub-text data for each preset category, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data.
Optionally, calculating the third score of the one or more sub-text data for each preset category based on the second score of each sub-text data for each preset category includes: for any of the multiple preset categories, performing a weighted summation of the second scores of each sub-text data for that preset category to obtain the third score of the one or more sub-text data for that preset category.
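A minimal sketch of this weighted summation, with uniform weights chosen as an illustrative default (the claim leaves the weights open):

```python
# Sketch of the third-score computation: per preset category, the second
# scores of all sub-texts are combined by weighted summation. The uniform
# default weights are an illustrative choice, not a value from the patent.

def third_scores(second_scores, weights=None):
    """second_scores: one {category: second score} dict per sub-text."""
    n = len(second_scores)
    weights = weights if weights is not None else [1.0 / n] * n
    categories = second_scores[0].keys()
    return {cat: sum(w * s[cat] for w, s in zip(weights, second_scores))
            for cat in categories}

subs = [{"violation": 0.9, "normal": 0.1},   # second scores, sub-text 1
        {"violation": 0.4, "normal": 0.6}]   # second scores, sub-text 2
agg = third_scores(subs)
print(agg, max(agg, key=agg.get))  # third scores and the winning category
```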
Optionally, determining the classification result of the text to be classified according to the first classification result and the second classification result includes: calculating a comprehensive score of the text to be classified for each preset category according to the first score of the full text data for each preset category and the third score of the one or more sub-text data for each preset category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Optionally, calculating the comprehensive score of the text to be classified for each preset category according to the first score of the full text data for each preset category and the third score of the one or more sub-text data for each preset category includes: setting a first weight corresponding to the full text data and a second weight corresponding to the one or more sub-text data; and, for any of the multiple preset categories, performing a weighted summation of the first score of the full text data for that preset category and the third score of the one or more sub-text data for that preset category according to the first weight and the second weight, to obtain the comprehensive score of the text to be classified for that preset category.
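A sketch of the comprehensive score under assumed weights; the 0.4/0.6 split is illustrative, since the patent leaves the first and second weights as design parameters:

```python
# Sketch of the comprehensive score: per preset category, the first score
# (full text) and the third score (sub-texts) are combined with a first
# weight w1 and a second weight w2. The weight values below are assumptions.

def comprehensive_scores(first, third, w1=0.4, w2=0.6):
    return {cat: w1 * first[cat] + w2 * third[cat] for cat in first}

first = {"violation": 0.3, "normal": 0.7}   # first scores from the full text
third = {"violation": 0.8, "normal": 0.2}   # third scores from the sub-texts
combined = comprehensive_scores(first, third)
print(max(combined, key=combined.get))  # -> violation
```

Here the sub-text evidence (weight 0.6) outvotes the full-text score, which is the intended effect when the discriminative keywords are diluted in the full text.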
Optionally, obtaining the second classification result based on the one or more sub-text data extracted from the full text data includes: for a first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the multiple preset categories, determining, based on the sub-text classification model, a score of the first sub-text data for each preset category, and taking the preset category with the highest score as the category corresponding to the first sub-text data. When the first category exists among the categories corresponding to the one or more sub-text data, the category corresponding to the one or more sub-text data is determined to be the first category; and when the category corresponding to every one of the sub-text data is the second category, the category corresponding to the one or more sub-text data is determined to be the second category.
Optionally, the preset categories include a first category and a second category. Determining the classification result of the text to be classified according to the first classification result and the second classification result includes: when both the category corresponding to the full text data and the category corresponding to the one or more sub-text data are the second category, determining that the category corresponding to the text to be classified is the second category; and when the category corresponding to the full text data and/or the category corresponding to the one or more sub-text data is the first category, determining that the category corresponding to the text to be classified is the first category.
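This two-category rule is effectively a logical OR on the first category: a single first-category verdict from either branch decides the outcome. A sketch, with placeholder names for the two preset categories:

```python
# Sketch of the two-category combination rule: the text is labeled second
# category only when BOTH the full-text result and the sub-text result are
# the second category; any first-category verdict wins. The category names
# are placeholders, not values from the patent.

FIRST, SECOND = "first_category", "second_category"

def combine(full_text_cat: str, subtext_cat: str) -> str:
    if FIRST in (full_text_cat, subtext_cat):
        return FIRST
    return SECOND

assert combine(SECOND, SECOND) == SECOND
assert combine(FIRST, SECOND) == FIRST
assert combine(SECOND, FIRST) == FIRST
```

When the first category is the "positive" class (e.g. a violation), this OR bias trades precision for recall: a hit in either branch is enough to flag the text.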
Another aspect of the present disclosure provides a text classification apparatus, comprising an obtaining module, a first classification module, a second classification module, and a comprehensive classification module. The obtaining module is configured to obtain a text to be classified. The first classification module is configured to obtain a first classification result based on the full text data of the text to be classified. The second classification module is configured to obtain a second classification result based on one or more sub-text data extracted from the full text data. The comprehensive classification module is configured to determine the classification result of the text to be classified according to the first classification result and the second classification result.
Optionally, the first classification module is configured to input the full text data into a full-text classification model corresponding to multiple preset categories, determine, based on the full-text classification model, a first score of the full text data for each of the multiple preset categories, and take the preset category with the highest first score as the category corresponding to the full text data.
Optionally, the apparatus further includes a sub-text extraction module, configured to extract one or more sub-text data from the full text data before the second classification module obtains the second classification result based on the one or more sub-text data. The sub-text extraction module includes a matching submodule, an extraction submodule, and a combination submodule. The matching submodule is configured to match the full text data against the keywords in a preset keyword set. The extraction submodule is configured to, for a first keyword that is successfully matched, extract from the full text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword. The combination submodule is configured to combine the extracted character strings and the first keyword, in their order of position in the full text data, into one sub-text data.
Optionally, the second classification module includes a first prediction submodule and a calculation submodule. The first prediction submodule is configured to, for a first sub-text data, input the first sub-text data into a sub-text classification model corresponding to the multiple preset categories, and determine, based on the sub-text classification model, a second score of the first sub-text data for each of the multiple preset categories. The calculation submodule is configured to calculate, based on the second score of each of the one or more sub-text data for each preset category, a third score of the one or more sub-text data for each preset category, and take the preset category with the highest third score as the category corresponding to the one or more sub-text data.
Optionally, the calculation submodule is specifically configured to, for any of the multiple preset categories, perform a weighted summation of the second scores of each sub-text data for that preset category to obtain the third score of the one or more sub-text data for that preset category.
Optionally, the comprehensive classification module includes a comprehensive calculation submodule, configured to calculate a comprehensive score of the text to be classified for each preset category according to the first score of the full text data for each preset category and the third score of the one or more sub-text data for each preset category, and take the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Optionally, the comprehensive calculation submodule is configured to set a first weight corresponding to the full text data and a second weight corresponding to the one or more sub-text data; and, for any of the multiple preset categories, perform a weighted summation of the first score of the full text data for that preset category and the third score of the one or more sub-text data for that preset category according to the first weight and the second weight, to obtain the comprehensive score of the text to be classified for that preset category.
Optionally, the second classification module includes a second prediction submodule and a first determination submodule. The second prediction submodule is configured to, for a first sub-text data, input the first sub-text data into a sub-text classification model corresponding to the multiple preset categories, determine, based on the sub-text classification model, a score of the first sub-text data for each preset category, and take the preset category with the highest score as the category corresponding to the first sub-text data. The first determination submodule is configured to determine that the category corresponding to the one or more sub-text data is the first category when the first category exists among the categories corresponding to the one or more sub-text data, and to determine that the category corresponding to the one or more sub-text data is the second category when the category corresponding to every one of the sub-text data is the second category.
Optionally, the preset categories include a first category and a second category. The comprehensive classification module includes a second determination submodule, configured to determine that the category corresponding to the text to be classified is the second category when both the category corresponding to the full text data and the category corresponding to the one or more sub-text data are the second category, and to determine that the category corresponding to the text to be classified is the first category when the category corresponding to the full text data and/or the category corresponding to the one or more sub-text data is the first category.
Another aspect of the present disclosure provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the method described above.
Another aspect of the present disclosure provides a computer program comprising computer-executable instructions which, when executed, implement the method described above.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically illustrates an application scenario of the text classification method and computer device according to an embodiment of the present disclosure;
Fig. 2 schematically illustrates a flowchart of a text classification method according to an embodiment of the present disclosure;
Fig. 3A schematically illustrates a text classification process according to an embodiment of the present disclosure;
Fig. 3B schematically illustrates a text classification process according to another embodiment of the present disclosure;
Fig. 4 schematically illustrates a block diagram of a text classification apparatus according to an embodiment of the present disclosure;
Fig. 5 schematically illustrates a block diagram of a text classification apparatus according to another embodiment of the present disclosure; and
Fig. 6 schematically illustrates a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed description of embodiments
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, numerous specific details are set forth for ease of explanation, in order to provide a comprehensive understanding of the embodiments of the present disclosure. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.
The terms used herein are for the purpose of describing specific embodiments only and are not intended to limit the present disclosure. The terms "include", "comprise", and the like used herein indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art. It should be noted that the terms used herein should be interpreted to have meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C, etc." is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C, etc.). The same applies to expressions such as "at least one of A, B, or C, etc." (for example, "a system having at least one of A, B, or C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C, etc.).
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some blocks of the block diagrams and/or flowcharts, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts. The techniques of the present disclosure may be implemented in the form of hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of the present disclosure may take the form of a computer program product on a computer-readable storage medium storing instructions, for use by or in connection with an instruction execution system.
An embodiment of the present disclosure provides a text classification method and a computer device to which the method can be applied. The method includes a text-obtaining stage, a classification stage, and a comprehensive-processing stage. In the text-obtaining stage, the text to be classified is obtained. In the classification stage, a first classification result corresponding to the full text data of the text to be classified and a second classification result corresponding to one or more sub-text data extracted from the full text data are obtained, respectively. In the comprehensive-processing stage, the classification result of the text to be classified is determined according to the first classification result and the second classification result.
Fig. 1 schematically illustrates an application scenario of the text classification method and computer device according to an embodiment of the present disclosure. It should be noted that Fig. 1 shows only an example of a scenario to which the embodiments of the present disclosure can be applied, to help those skilled in the art understand the technical content of the present disclosure; it does not imply that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
As shown in Fig. 1, the application scenario may include terminal devices 101, 102, 103, a network 104, and a server/server cluster 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server/server cluster 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server/server cluster 105 through the network 104, for example to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with a display screen and web-browsing support, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
The server/server cluster 105 may be a server or server cluster providing various services, for example a back-office management server or server cluster that analyzes and otherwise processes received data such as user requests, and feeds the processing results back to the terminal devices.
It should be noted that the text classification method provided by the embodiments of the present disclosure may generally be executed by the server/server cluster 105. Correspondingly, the text classification apparatus provided by the embodiments of the present disclosure may generally be arranged in the server/server cluster 105. The text classification method may also be executed by a server or server cluster that is different from the server/server cluster 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server/server cluster 105. Correspondingly, the text classification apparatus may also be arranged in such a server or server cluster.
It should be understood that the numbers of terminal devices, networks, and servers/server clusters in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers/server clusters, as required by the implementation.
Fig. 2 schematically illustrates a flowchart of a text classification method according to an embodiment of the present disclosure.
As shown in Fig. 2, the method includes operations S201 to S204.
In operation S201, a text to be classified is obtained.
In operation S202, a first classification result is obtained based on the full text data of the text to be classified.
In operation S203, a second classification result is obtained based on one or more sub-text data extracted from the full text data.
In operation S204, the classification result of the text to be classified is determined according to the first classification result and the second classification result.
As can be seen, in the method shown in Fig. 2, for a text to be classified, a first classification result is obtained based on its full text data on the one hand, and a second classification result is obtained based on the sub-text data within that full text data on the other hand; the two results are then combined to obtain the classification result of the text. The category of the text is thus considered not only from the perspective of the text as a whole, but also from the perspective of the key information within it; combining the two can effectively improve both the recall and the accuracy of the text classification result.
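Wiring operations S201 to S204 together, a hypothetical end-to-end pipeline might look as follows. All helper models here are toy stand-ins for the unspecified classifiers and extraction rule, and the uniform weights are assumptions:

```python
# End-to-end sketch of S201-S204 with toy stand-ins for the two models and
# the extraction rule. Everything except the overall flow is invented.

def classify(text, keywords, categories, score_full, extract, score_sub,
             w1=0.5, w2=0.5):
    first = score_full(text)                       # S202: first scores
    subs = extract(text, keywords)                 # S203: sub-text extraction
    third = {c: 0.0 for c in categories}
    for s in subs:                                 # second scores -> third scores
        second = score_sub(s)
        for c in categories:
            third[c] += second[c] / len(subs)      # uniform weighted summation
    combined = {c: w1 * first[c] + w2 * third[c] for c in categories}  # S204
    return max(combined, key=combined.get)

cats = ["hit", "safe"]
score_full = lambda t: {"hit": t.count("bet"), "safe": 1}
extract = lambda t, kws: [t[max(0, t.find(k) - 5):t.find(k) + len(k) + 5]
                          for k in kws if k in t]
score_sub = lambda s: {"hit": 2 * s.count("bet"), "safe": 0}

print(classify("place a bet now", ["bet"], cats, score_full, extract, score_sub))  # -> hit
```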
In an embodiment of the present disclosure, obtaining the first classification result based on the full text data of the text to be classified in operation S202 includes: inputting the full text data into a full-text classification model corresponding to multiple preset categories, determining, based on the full-text classification model, a first score of the full text data for each of the multiple preset categories, and taking the preset category with the highest first score as the category corresponding to the full text data.
Specifically, after the text to be classified is obtained, it is pre-processed to obtain the full text data; the full-text classification model corresponds to multiple preset categories. The full text data is input into the full-text classification model to obtain a first score of the full text data for each preset category. The first score of the full text data for a preset category X indicates the predicted degree to which the full text data belongs to that preset category X. When the first score of the full text data for a preset category X is greater than or equal to its first score for every other preset category, the preset category X is taken as the category corresponding to the full text data. The advantage of classifying the full text data with the full-text classification model is that the classification process considers the features in the full text data as a whole, together with the associations between those features. However, because the full text data also contains a large amount of interference information irrelevant to the classification, the discriminative features may be buried in that interference, and an accurate first classification result cannot be guaranteed. The first classification result is therefore subsequently combined with the second classification result in a comprehensive-processing step.
In an embodiment of the present disclosure, before the second classification result is obtained in operation S203 based on the one or more sub-text data extracted from the full text data, the method shown in Fig. 2 further includes: extracting one or more sub-text data from the full text data.
Specifically, extracting the one or more sub-text data from the full text data includes: matching the full text data against the keywords in a preset keyword set; for a first keyword that is successfully matched, extracting from the full text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword; and combining the extracted character strings and the first keyword, in their order of position in the full text data, into one sub-text data.
The preset keyword set includes one or more keywords that play a key role in text classification. The full text data is matched against the preset keyword set; for each matching position, the matched keyword and the character strings before and after it are combined into the context text data of that matching position, and this context text data serves as one sub-text data. Because the sub-text data filters out the keyword information that plays a key role, the interference of redundant information in the full text data is eliminated, so that classification based on the sub-text data can focus more on the discriminative features relevant to the classification than classification based on the full text data. Furthermore, a sub-text data contains not only the keyword itself but also its context; only when a keyword is placed in a specific context can its key role in classification be reflected, which makes the second classification result more accurate.
In an embodiment of the present disclosure, obtaining the second classification result in operation S203 based on the one or more sub-text data extracted from the full text data includes: for a first sub-text data, inputting the first sub-text data into a sub-text classification model corresponding to the multiple preset categories, and determining, based on the sub-text classification model, a second score of the first sub-text data for each of the multiple preset categories; and calculating, based on the second score of each of the one or more sub-text data for each preset category, a third score of the one or more sub-text data for each preset category, and taking the preset category with the highest third score as the category corresponding to the one or more sub-text data.
The sub-text classification model corresponds to the same preset categories as the full-text classification model. Any sub-text data is input into the sub-text classification model to obtain a second score of that sub-text data for each preset category. The second score of a sub-text data for a preset category X indicates the predicted degree to which the sub-text data belongs to that preset category X.
When there is only one sub-text data item, its second score for each preset category is equal to the overall third score of the sub-text data for that category. When the second score of the item for a preset category X is greater than or equal to its second score for any other preset category, category X is taken as the category corresponding to the sub-text data.
When there are multiple sub-text data items, the overall third score of the sub-text data for each preset category is calculated from the second scores of the individual items for that category. When the overall third score for a preset category X is greater than or equal to the overall third score for any other preset category, category X is taken as the category corresponding to the sub-text data as a whole.
In the above process, each sub-text data item is classified with the sub-text classification model, and the classification results of the individual items are combined into the second classification result. Because different sub-text data items contain identical or different keywords in different context scenes, they express different features of the full-text data that are relevant to text classification; in classifying each sub-text data item, the sub-text classification model extracts and classifies the features corresponding to that item, so that the second classification result comprehensively considers the various features relevant to text classification.
Specifically, calculating the third score of the one or more sub-text data items for each preset category based on the second score of each sub-text data item for each preset category includes: for any preset category among the multiple preset categories, performing a weighted summation of the second scores of the individual sub-text data items for that category to obtain the third score of the one or more sub-text data items for that category.
The keywords in the predetermined keyword set may have different grades, a grade indicating the degree to which the corresponding keyword contributes to text classification. For example, if a text is necessarily a violating text whenever it contains a certain keyword, that keyword has the highest grade. A weight may be set for each sub-text data item according to the grade of the keyword it contains; based on these weights, the second scores of the individual sub-text data items for the same preset category X are summed with weighting to obtain the overall third score of the sub-text data for category X.
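The grade-weighted summation can be sketched as follows, assuming each sub-text data item already has a per-category second score and a weight derived from its keyword's grade; the function name and the plain-dictionary representation are illustrative, not from the disclosure:

```python
def third_scores(second_scores, weights):
    """Illustrative third-score computation: `second_scores` is a list of
    per-sub-text dicts mapping category -> second score, and `weights[i]`
    reflects the grade of the keyword in the i-th sub-text (higher grade,
    larger weight). The third score of each category is the weighted sum."""
    third = {}
    for w, scores in zip(weights, second_scores):
        for cat, s in scores.items():
            third[cat] = third.get(cat, 0) + w * s
    return third
```

The preset category with the highest value in the returned mapping would then be taken as the category corresponding to the sub-text data.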
On this basis, in one embodiment of the present disclosure, operation S204, determining the classification result of the text to be classified according to the first classification result and the second classification result, includes: calculating a comprehensive score of the text to be classified for each preset category from the first score of the full-text data for that category and the third score of the one or more sub-text data items for that category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Specifically, calculating the comprehensive score of the text to be classified for each preset category from the first score of the full-text data for each preset category and the third score of the one or more sub-text data items for each preset category includes: setting a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data items; and, for any preset category among the multiple preset categories, performing a weighted summation of the first score of the full-text data for that category and the third score of the one or more sub-text data items for that category according to the first weight and the second weight, to obtain the comprehensive score of the text to be classified for that category.
The scheme of this embodiment derives the classification result of the text to be classified from the first classification result and the second classification result. The first classification result takes the overall features of the full-text data and the associations among those features as its basis for classification, while the second classification result takes the individual identifying features relevant to text classification as its basis; combining the two both captures the overall classification tendency correctly and attends to the identifying features, making the classification result more reliable. Specifically, the first weight and the second weight can be set for the first and second classification results according to the relative importance of the full-text classification process and the sub-text classification process; based on the first and second weights, the first score and the third score for the same preset category X are summed with weighting to obtain the comprehensive score for category X, and the final classification result is determined from the comprehensive scores for all preset categories.
In another embodiment of the present disclosure, operation S203, obtaining the second classification result based on the one or more sub-text data items extracted from the full-text data, includes: inputting each sub-text data item into the sub-text classification model corresponding to the multiple preset categories, determining, based on the sub-text classification model, the score of the sub-text data item for each preset category, and taking the preset category with the highest score as the category corresponding to that sub-text data item; when a first category exists among the categories corresponding to the individual sub-text data items, determining that the category corresponding to the one or more sub-text data items is the first category; and when the category corresponding to each sub-text data item is a second category, determining that the category corresponding to the one or more sub-text data items is the second category.
The sub-text classification model corresponds to the same preset categories as the full-text classification model, and the preset categories include a first category and a second category, the first category having the highest risk level and the second category the lowest. Each sub-text data item is input into the sub-text classification model, which outputs the second score of the item for each preset category; the second score for a preset category X indicates how strongly the model predicts that the item belongs to category X. When the second score of a sub-text data item for a preset category X is greater than or equal to its second score for any other preset category, category X is taken as the category corresponding to that item.
When there is only one sub-text data item, the category corresponding to that item is the category corresponding to the sub-text data as a whole. When there are multiple items, the first category is taken as the category corresponding to the sub-text data as a whole if it appears among the categories of the individual items; when the category of every item is the second category, the second category is taken as the category corresponding to the sub-text data as a whole.
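Under the stated assumptions that the first category is high-risk and the second category is low-risk, the override rule above reduces to a membership check; the function name and category labels below are placeholders for illustration:

```python
def overall_category(item_categories, high_risk="first", low_risk="second"):
    """Illustrative override rule: if any sub-text item was classified into
    the high-risk first category, the sub-text data as a whole is first
    category; only when every item is second category is the whole second."""
    return high_risk if high_risk in item_categories else low_risk
```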
In the above process, each sub-text data item is classified with the sub-text classification model to obtain its corresponding category. If the first category, which has the highest risk level, appears among the categories corresponding to the individual sub-text data items, it is directly taken as the category corresponding to the sub-text data as a whole, i.e., the second classification result. This process suits scenes in which the preset categories include a category requiring special attention (such as the first category): that category is judged with the highest weight, so as to eliminate the security threat that texts of that category may pose.
On this basis, in one embodiment of the present disclosure, operation S204, determining the classification result of the text to be classified according to the first classification result and the second classification result, includes: when both the category corresponding to the full-text data and the category corresponding to the one or more sub-text data items are the second category, determining that the category corresponding to the text to be classified is the second category; and when the category corresponding to the full-text data and/or the category corresponding to the one or more sub-text data items is the first category, determining that the category corresponding to the text to be classified is the first category.
The scheme of this embodiment derives the classification result of the text to be classified from the first classification result and the second classification result. The first classification result takes the overall features of the full-text data and the associations among those features as its basis for classification, while the second classification result takes the individual identifying features relevant to text classification as its basis; combining the two both captures the overall classification tendency correctly and attends to the identifying features. Specifically, when the first category, with the highest risk level, appears among the category corresponding to the full-text data and the categories corresponding to the sub-text data, the classification result of the text to be classified is the first category; when the category corresponding to the full-text data and the category corresponding to the sub-text data are all the second category, with the lowest risk level, the classification result is the second category. This process suits scenes in which the preset categories include a category requiring special attention (such as the first category): that category is judged with the highest weight, so as to eliminate the security threat that texts of that category may pose.
The method shown in Fig. 2 is illustrated below with specific embodiments, with reference to Fig. 3A and Fig. 3B:
Fig. 3A schematically shows a text classification process according to an embodiment of the present disclosure.
As shown in Fig. 3A, in the text classification process, the text to be classified is first obtained and preprocessed to obtain the full-text data. Preprocessing covers conventional operations such as removing HTML tags from the text to be classified, word segmentation, and removal of redundant words. The preprocessed full text is then matched against the predetermined keyword set; the matched keywords are found at N (N > 0) matching positions in the full text, and for each matching position a context text of a certain length before and after the position is extracted as the context of the corresponding keyword, called a sub-text data item; N sub-text data items are extracted in this example. The predetermined keyword set contains important words collected for the specific classification task; keyword matching makes it possible to extract the key content relevant to the text classification task from the full text and to reject the interference of irrelevant text.
On the one hand, the full-text data is input into the full-text classification model, which predicts the classification result corresponding to the full-text data. In the training stage, the full-text classification model uses full texts from various corpora as the training set, so that it learns to classify the entire content of a text. Various full-text classification models can be chosen, such as a convolutional-neural-network text classification (Text-CNN) model or a recurrent neural network (RNN) classification model. The full-text classification model outputs the classification result corresponding to the full-text data, corresponding to the first classification result in Fig. 2. Specifically, the full-text classification model corresponds to three preset categories: category 1, category 2 and category 3, and the classification result corresponding to the full-text data consists of the first score a(1) of the full-text data for category 1, the first score a(2) for category 2, and the first score a(3) for category 3.
On the other hand, each of the N sub-text data items is input into the sub-text classification model, which predicts the classification result corresponding to each item. In the training stage, the sub-text classification model uses the context texts of the keywords of the predetermined keyword set in various corpora as the training set, so that it learns to classify sub-text data. Various sub-text classification models can be chosen, such as a Text-CNN or RNN classification model. The sub-text classification model outputs a classification result for each sub-text data item, yielding N classification results corresponding to the N items, corresponding to the second classification result in Fig. 2. Specifically, the sub-text classification model corresponds to three preset categories: category 1, category 2 and category 3. For the i-th sub-text data item (1 ≤ i ≤ N, i an integer), the classification result consists of the second score bi(1) of the item for category 1, the second score bi(2) for category 2, and the second score bi(3) for category 3.
Weights are assigned to the models used according to their importance in the specific classification task, and the classification results of the full-text classification model and the sub-text classification model are combined under this weight assignment scheme to obtain the final classification result. In this example, the full-text classification model, which outputs the classification result corresponding to the full-text data, is assigned weight α, and the sub-text classification model outputting the classification result corresponding to the i-th sub-text data item is assigned weight βi. The comprehensive score of the text to be classified for category 1 is then S(1) = α·a(1) + Σi βi·bi(1); for category 2, S(2) = α·a(2) + Σi βi·bi(2); and for category 3, S(3) = α·a(3) + Σi βi·bi(3), the sums running over i = 1, ..., N. The category with the highest comprehensive score is chosen as the category of the text to be classified, giving the final classification result, corresponding to the classification result of the text to be classified in Fig. 2.
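The weighted combination of full-text and sub-text scores can be sketched as follows (a hedged illustration: the dictionary representation and integer scores are assumptions, and the weights α and βi are simply passed in rather than derived from the task):

```python
def comprehensive_scores(a, b, alpha, betas):
    """Illustrative comprehensive scoring: `a[c]` is the full-text first score
    for category c, `b[i][c]` the second score of the i-th sub-text item,
    `alpha` and `betas[i]` the model weights. Computes
    S(c) = alpha*a(c) + sum_i betas[i]*b_i(c) and the winning category."""
    s = {}
    for c in a:
        s[c] = alpha * a[c] + sum(beta_i * b_i[c] for beta_i, b_i in zip(betas, b))
    best = max(s, key=s.get)  # category with the highest comprehensive score
    return s, best
```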
Compared with the prior art, the scheme of this embodiment precisely locates the key text content and avoids the interference of irrelevant text content, effectively improving the recall rate and accuracy rate of text classification.
Fig. 3B schematically shows a text classification process according to another embodiment of the present disclosure.
As shown in Fig. 3B, the text classification process here judges whether a webpage contains violating content (content related to pornography, gambling, terrorism and the like). The webpage source code, which carries HTML tags, is first crawled by a crawler system as the text to be classified. Preprocessing operations such as removing the HTML tags and word cutting are applied to the source code, yielding the webpage plain text after preprocessing as the full-text data. The predetermined keyword set contains keywords that play a key role in the classification, such as "Hong Kong horse racing", "Macao gambling", "one-code special" and "one-night stand". The webpage text is matched against the predetermined keyword set, specifically with a multi-pattern matching algorithm, and the matched keywords are found at N (N > 0) matching positions. For each matching position, the character strings of 60 characters each before and after the keyword are extracted and combined with the matched keyword itself, in their order in the webpage text, into a context text of about 120 characters around the keyword, which serves as one sub-text data item; N sub-text data items are extracted in this example.
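A single regular-expression alternation can stand in for the multi-pattern matching algorithm mentioned above (the disclosure does not specify which algorithm is used; the regex approach, the names, and the window size here are illustrative assumptions):

```python
import re

def extract_contexts(page_text, keywords, window=60):
    """Illustrative multi-pattern match via one regex alternation; each hit
    yields up to `window` characters of context on each side of the keyword,
    i.e., roughly a 2*window-character context text around it."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    contexts = []
    for m in pattern.finditer(page_text):
        left = max(0, m.start() - window)
        right = min(len(page_text), m.end() + window)
        contexts.append(page_text[left:right])
    return contexts
```

For a production system, a dedicated multi-pattern algorithm such as Aho-Corasick would scale better with large keyword sets, but the extraction logic is the same.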
On the one hand, the full-text data is input into the full-text classification model, which predicts the classification result corresponding to the full-text data. In this example the full-text classification model is a convolutional-neural-network text classification (Text-CNN) model corresponding to two preset categories: "violating" and "normal". The full-text classification model outputs the classification result corresponding to the full-text data, corresponding to the first classification result in Fig. 2; this classification result is "violating" or "normal".
On the other hand, each of the N sub-text data items is input into the sub-text classification model, which predicts the classification result corresponding to each item. In this example the sub-text classification model is also a Text-CNN model. The sub-text classification model outputs the classification result corresponding to each sub-text data item, yielding N classification results corresponding to the N items, corresponding to the second classification result in Fig. 2; the classification result corresponding to each sub-text data item is "violating" or "normal".
To combine the classification results output by the two models, the weight assignment scheme used is: when "violating" appears among all the classification results, each model outputting the classification result "violating" is assigned weight 1 and each model outputting "normal" is assigned weight 0; when "violating" does not appear among the classification results, the models outputting "normal" are assigned weight 1. That is, the final classification result of the webpage is "violating" if the category "violating" appears in any classification result, and "normal" only when every classification result is "normal", giving the "violating" result the highest priority and sensitivity so as to avoid the security threats posed by violating webpages.
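The 0/1 weight assignment scheme described above effectively reduces to checking whether any model reported "violating"; a sketch, with the result strings as assumed labels:

```python
def final_webpage_category(results):
    """Illustrative weight scheme: models whose result is 'violating' get
    weight 1 whenever any 'violating' appears (forcing the final label);
    otherwise the 'normal' models get weight 1 and the page is 'normal'."""
    weights = [1 if r == "violating" else 0 for r in results]
    if sum(weights) == 0:  # no model reported 'violating'
        return "normal"
    return "violating"
```

Here `results` would hold the full-text model's result followed by the N sub-text results.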
Tests show that, in this violating-content classification task, the scheme of this embodiment improves the accuracy rate of the classification results by at least 10% compared with the prior art.
Fig. 4 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present disclosure.
As shown in Fig. 4, the text classification apparatus 400 includes an obtaining module 410, a first classification module 420, a second classification module 430 and a comprehensive classification module 440.
The obtaining module 410 is configured to obtain the text to be classified.
The first classification module 420 is configured to obtain the first classification result based on the full-text data of the text to be classified.
The second classification module 430 is configured to obtain the second classification result based on one or more sub-text data items extracted from the full-text data.
The comprehensive classification module 440 is configured to determine the classification result of the text to be classified according to the first classification result and the second classification result.
Fig. 5 schematically shows a block diagram of a text classification apparatus according to another embodiment of the present disclosure.
As shown in Fig. 5, the text classification apparatus 500 includes an obtaining module 410, a first classification module 420, a second classification module 430 and a comprehensive classification module 440.
The first classification module 420 is configured to input the full-text data into the full-text classification model corresponding to multiple preset categories, determine, based on the full-text classification model, the first score of the full-text data for each of the multiple preset categories, and take the preset category with the highest first score as the category corresponding to the full-text data.
In one embodiment of the present disclosure, the text classification apparatus 500 further includes a sub-text extraction module 450, configured to extract one or more sub-text data items from the full-text data before the second classification module obtains the second classification result based on the one or more sub-text data items extracted from the full-text data.
The sub-text extraction module 450 may include a matching submodule 451, an extraction submodule 452 and a combination submodule 453.
The matching submodule 451 is configured to match the full-text data against the keywords in the predetermined keyword set. The extraction submodule 452 is configured to, for a matched first keyword, extract from the full-text data a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword. The combination submodule 453 is configured to combine the extracted character strings and the first keyword, in their order of positions in the full-text data, into one sub-text data item.
In one embodiment of the present disclosure, the second classification module 430 includes a first prediction submodule 431 and a calculation submodule 432.
The first prediction submodule 431 is configured to input each sub-text data item into the sub-text classification model corresponding to the multiple preset categories, and determine, based on the sub-text classification model, the second score of the sub-text data item for each of the multiple preset categories. The calculation submodule 432 is configured to calculate, based on the second score of each of the one or more sub-text data items for each preset category, the third score of the one or more sub-text data items for each preset category, and take the preset category with the highest third score as the category corresponding to the one or more sub-text data items.
Specifically, the calculation submodule 432 is configured to, for any preset category among the multiple preset categories, perform a weighted summation of the second scores of the individual sub-text data items for that category to obtain the third score of the one or more sub-text data items for that category.
In one embodiment of the present disclosure, the comprehensive classification module 440 includes a comprehensive calculation submodule 441, configured to calculate the comprehensive score of the text to be classified for each preset category from the first score of the full-text data for that category and the third score of the one or more sub-text data items for that category, and take the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
Specifically, as an optional embodiment, the comprehensive calculation submodule 441 is configured to set a first weight corresponding to the full-text data and a second weight corresponding to the one or more sub-text data items, and, for any preset category among the multiple preset categories, perform a weighted summation of the first score of the full-text data for that category and the third score of the one or more sub-text data items for that category according to the first weight and the second weight, to obtain the comprehensive score of the text to be classified for that category.
In one embodiment of the present disclosure, the second classification module 430 includes a second prediction submodule 433 and a first determination submodule 434. The second prediction submodule 433 is configured to input each sub-text data item into the sub-text classification model corresponding to the multiple preset categories, determine, based on the sub-text classification model, the score of the sub-text data item for each preset category, and take the preset category with the highest score as the category corresponding to that sub-text data item. The first determination submodule 434 is configured to determine that the category corresponding to the one or more sub-text data items is the first category when the first category exists among the categories corresponding to the individual sub-text data items, and to determine that the category corresponding to the one or more sub-text data items is the second category when the category corresponding to each sub-text data item is the second category.
The preset categories include a first category and a second category. In one embodiment of the present disclosure, the comprehensive classification module 440 includes a second determination submodule 442, configured to determine that the category corresponding to the text to be classified is the second category when both the category corresponding to the full-text data and the category corresponding to the one or more sub-text data items are the second category, and to determine that the category corresponding to the text to be classified is the first category when the category corresponding to the full-text data and/or the category corresponding to the one or more sub-text data items is the first category.
It should be noted that, for each module/unit/subunit in the apparatus embodiments, the implementation, the technical problems solved, the functions realized and the technical effects achieved are the same as, or similar to, those of the corresponding steps in the method embodiments, and are not repeated here.
Any number of the modules, submodules, units and subunits according to the embodiments of the present disclosure, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules, submodules, units and subunits according to the embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of them may be at least partly implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package or an application-specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, the three implementation manners of software, hardware and firmware. Alternatively, one or more of the modules, submodules, units and subunits according to the embodiments of the present disclosure may be at least partly implemented as a computer program module which, when run, performs the corresponding function.
For example, any number of the obtaining module 410, the first classification module 420, the second classification module 430, the comprehensive classification module 440 and the sub-text extraction module 450 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module. According to the embodiments of the present disclosure, at least one of the obtaining module 410, the first classification module 420, the second classification module 430, the comprehensive classification module 440 and the sub-text extraction module 450 may be at least partly implemented as a hardware circuit, such as an FPGA, a PLA, a system on chip, a system on a substrate, a system in a package or an ASIC, or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, the three implementation manners of software, hardware and firmware. Alternatively, at least one of these modules may be at least partly implemented as a computer program module which, when run, performs the corresponding function.
Fig. 6 diagrammatically illustrates the computer equipment according to an embodiment of the present disclosure for being adapted for carrying out method as described above Block diagram.Computer equipment shown in Fig. 6 is only an example, should not function and use scope band to the embodiment of the present disclosure Carry out any restrictions.
As shown in fig. 6, computer equipment 600 includes processor 610 and computer readable storage medium 620.The computer Equipment 600 can execute the method according to the embodiment of the present disclosure.
Specifically, the processor 610 may include, for example, a general-purpose microprocessor, an instruction set processor and/or a related chipset and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)), and so on. The processor 610 may further include an onboard memory for caching purposes. The processor 610 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiments of the present disclosure.
The computer-readable storage medium 620 may be, for example, a non-volatile computer-readable storage medium. Specific examples include, but are not limited to: a magnetic storage device such as a magnetic tape or a hard disk drive (HDD); an optical storage device such as a compact disc (CD-ROM); a memory such as a random access memory (RAM) or a flash memory; and so on.
The computer-readable storage medium 620 may include a computer program 621, and the computer program 621 may include code/computer-executable instructions which, when executed by the processor 610, cause the processor 610 to perform the method according to the embodiments of the present disclosure or any variation thereof.
The computer program 621 may be configured with computer program code including, for example, computer program modules. For example, in an exemplary embodiment, the code in the computer program 621 may include one or more program modules, for example module 621A, module 621B, and so on. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation. When these combinations of program modules are executed by the processor 610, the processor 610 may perform the method according to the embodiments of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the obtaining module 410, the first classification module 420, the second classification module 430, the comprehensive classification module 440 and the sub-text extraction module 450 may be implemented as a computer program module described with reference to Fig. 6, which, when executed by the processor 610, may implement the text classification method described above.
The present disclosure further provides a computer-readable storage medium. The computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments, or may exist alone without being assembled into the device/apparatus/system. The above computer-readable storage medium carries one or more programs which, when executed, implement the method according to the embodiments of the present disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by, or in connection with, an instruction execution system, apparatus or device.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in various ways, even if such combinations are not explicitly recited in the present disclosure. In particular, without departing from the spirit or teaching of the present disclosure, the features recited in the various embodiments and/or claims of the present disclosure may be combined in various ways. All such combinations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims, but also by the equivalents of the appended claims.

Claims (10)

1. A text classification method, comprising:
obtaining a text to be classified;
obtaining a first classification result based on full-text data of the text to be classified;
obtaining a second classification result based on one or more pieces of sub-text data extracted from the full-text data; and
determining a classification result of the text to be classified according to the first classification result and the second classification result.
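The steps of claim 1 amount to a simple two-channel pipeline: one classifier over the whole text, one over extracted sub-texts, and a rule combining the two. A minimal Python sketch follows; the helper callables and the "sensitive"/"normal" labels are illustrative assumptions, not part of the claims:

```python
def classify(text, classify_full, extract_subs, classify_subs, combine):
    """Two-channel classification per claim 1 (hypothetical helpers)."""
    first = classify_full(text)    # first classification result (full text)
    subs = extract_subs(text)      # sub-text data extracted from the full text
    second = classify_subs(subs)   # second classification result (sub-texts)
    return combine(first, second)  # final result from both channels

# Toy instantiation: a keyword-triggered "sensitive" channel.
result = classify(
    "public intro ... password list ... public outro",
    classify_full=lambda t: "normal",
    extract_subs=lambda t: [w for w in t.split() if w == "password"],
    classify_subs=lambda subs: "sensitive" if subs else "normal",
    combine=lambda a, b: "sensitive" if "sensitive" in (a, b) else "normal",
)
print(result)  # sensitive
```

Here the full-text channel alone would miss the document, while the sub-text channel catches the keyword context, which is the motivation for combining both results.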
2. The method according to claim 1, wherein obtaining the first classification result based on the full-text data of the text to be classified comprises:
inputting the full-text data into a full-text classification model corresponding to a plurality of preset categories, determining, based on the full-text classification model, a first score of the full-text data for each preset category of the plurality of preset categories, and taking the preset category with the highest first score as the category corresponding to the full-text data.
3. The method according to claim 1, wherein before obtaining the second classification result based on the one or more pieces of sub-text data extracted from the full-text data, the method further comprises: extracting the one or more pieces of sub-text data from the full-text data;
wherein extracting the one or more pieces of sub-text data from the full-text data comprises:
matching keywords in a predetermined keyword set against the full-text data;
for a first keyword that is successfully matched, extracting, from the full-text data, a character string of a first preset length before the first keyword and/or a character string of a second preset length after the first keyword; and
combining the extracted character string(s) and the first keyword into one piece of sub-text data according to their positional order in the full-text data.
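The extraction described in claim 3 is essentially a keyword match with a fixed-size character window on each side. A minimal sketch, assuming exact substring matching and illustrative window lengths (the claim leaves both the matching method and the preset lengths open):

```python
def extract_sub_texts(full_text, keywords, before=10, after=10):
    """Claim-3 style extraction: for each keyword occurrence, take a
    character string of preset length before it and after it, and join
    the pieces with the keyword in positional order."""
    sub_texts = []
    for kw in keywords:
        start = full_text.find(kw)
        while start != -1:
            left = full_text[max(0, start - before):start]
            right = full_text[start + len(kw):start + len(kw) + after]
            # extracted strings + keyword, in their order in the full text
            sub_texts.append(left + kw + right)
            start = full_text.find(kw, start + 1)
    return sub_texts

subs = extract_sub_texts("top secret: launch codes inside", ["secret"],
                         before=4, after=3)
print(subs)  # ['top secret: l']
```

Windowing around matched keywords keeps the sub-text classifier focused on local context, which is what lets it pick up signals the full-text model dilutes.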
4. The method according to claim 2, wherein obtaining the second classification result based on the one or more pieces of sub-text data extracted from the full-text data comprises:
for a first piece of sub-text data, inputting the first piece of sub-text data into a sub-text classification model corresponding to the plurality of preset categories, and determining, based on the sub-text classification model, a second score of the first piece of sub-text data for each preset category of the plurality of preset categories; and
calculating, based on the second scores of the respective pieces of sub-text data of the one or more pieces of sub-text data for each preset category, a third score of the one or more pieces of sub-text data for each preset category, and taking the preset category with the highest third score as the category corresponding to the one or more pieces of sub-text data.
5. The method according to claim 4, wherein calculating, based on the second scores of the respective pieces of sub-text data for each preset category, the third score of the one or more pieces of sub-text data for each preset category comprises:
for any preset category of the plurality of preset categories, performing a weighted summation of the second scores of the respective pieces of sub-text data for the preset category, to obtain the third score of the one or more pieces of sub-text data for the preset category.
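The weighted summation of claim 5 can be sketched in a few lines. The per-sub-text weights are an assumption: the claim does not fix how they are chosen (uniform weights, as used here, reduce the third score to a scaled average):

```python
def third_scores(second_scores, weights):
    """Claim-5 aggregation: for each preset category, the third score is
    a weighted sum of the second scores of all sub-texts for it."""
    categories = second_scores[0].keys()
    return {c: sum(w * s[c] for w, s in zip(weights, second_scores))
            for c in categories}

scores = third_scores(
    [{"sensitive": 0.9, "normal": 0.1},   # second scores of sub-text 1
     {"sensitive": 0.2, "normal": 0.8}],  # second scores of sub-text 2
    weights=[0.5, 0.5],
)
print(max(scores, key=scores.get))  # sensitive
```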
6. The method according to claim 4, wherein determining the classification result of the text to be classified according to the first classification result and the second classification result comprises:
calculating, according to the first score of the full-text data for each preset category and the third score of the one or more pieces of sub-text data for each preset category, a comprehensive score of the text to be classified for each preset category, and taking the preset category with the highest comprehensive score as the category corresponding to the text to be classified.
7. The method according to claim 6, wherein calculating, according to the first score of the full-text data for each preset category and the third score of the one or more pieces of sub-text data for each preset category, the comprehensive score of the text to be classified for each preset category comprises:
setting a first weight corresponding to the full-text data and a second weight corresponding to the one or more pieces of sub-text data; and
for any preset category of the plurality of preset categories, performing, according to the first weight and the second weight, a weighted summation of the first score of the full-text data for the preset category and the third score of the one or more pieces of sub-text data for the preset category, to obtain the comprehensive score of the text to be classified for the preset category.
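Claim 7 is the same weighted-sum idea applied across the two channels. A sketch with illustrative weight values (the first/second weights 0.6/0.4 are an assumption; the claim only requires that they are set):

```python
def comprehensive_scores(first_scores, third_scores, w_full=0.6, w_sub=0.4):
    """Claim-7 combination: per-category weighted sum of the full-text
    first score and the sub-text third score."""
    return {c: w_full * first_scores[c] + w_sub * third_scores[c]
            for c in first_scores}

comp = comprehensive_scores(
    {"sensitive": 0.3, "normal": 0.7},  # first scores (full-text model)
    {"sensitive": 0.9, "normal": 0.1},  # third scores (sub-text channel)
)
print(max(comp, key=comp.get))  # sensitive
```

Note how a strong sub-text signal can overturn a full-text verdict here, which is exactly the behavior the two-weight design enables.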
8. The method according to claim 2, wherein obtaining the second classification result based on the one or more pieces of sub-text data extracted from the full-text data comprises:
for a first piece of sub-text data, inputting the first piece of sub-text data into a sub-text classification model corresponding to the plurality of preset categories, determining, based on the sub-text classification model, a score of the first piece of sub-text data for each preset category, and taking the preset category with the highest score as the category corresponding to the first piece of sub-text data;
when a first category exists among the categories corresponding to the respective pieces of sub-text data of the one or more pieces of sub-text data, determining that the category corresponding to the one or more pieces of sub-text data is the first category; and
when the category corresponding to each piece of sub-text data of the one or more pieces of sub-text data is a second category, determining that the category corresponding to the one or more pieces of sub-text data is the second category.
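Unlike the score-averaging of claims 4-5, claim 8 aggregates per-sub-text verdicts with an "any hit wins" rule. A one-line sketch, with the "sensitive"/"normal" labels as illustrative assumptions:

```python
def sub_texts_category(per_sub_categories, first="sensitive", second="normal"):
    """Claim-8 aggregation: the sub-text set is first category as soon as
    any sub-text is first category, and second category only when every
    sub-text is second category."""
    return first if first in per_sub_categories else second

print(sub_texts_category(["normal", "sensitive", "normal"]))  # sensitive
print(sub_texts_category(["normal", "normal"]))               # normal
```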
9. The method according to claim 4 or 8, wherein:
the preset categories include a first category and a second category; and
determining the classification result of the text to be classified according to the first classification result and the second classification result comprises:
when both the category corresponding to the full-text data and the category corresponding to the one or more pieces of sub-text data are the second category, determining that the category corresponding to the text to be classified is the second category; and
when the category corresponding to the full-text data and/or the category corresponding to the one or more pieces of sub-text data is the first category, determining that the category corresponding to the text to be classified is the first category.
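The final combination rule of claim 9 can be sketched as follows; the "sensitive"/"normal" labels are illustrative assumptions (for a sensitive-content detector this rule errs toward flagging, since either channel alone suffices for the first category):

```python
def final_category(full_cat, subs_cat, first="sensitive", second="normal"):
    """Claim-9 rule: the text is second category only when BOTH the
    full-text channel and the sub-text channel say second category;
    if either says first category, the text is first category."""
    return second if full_cat == second and subs_cat == second else first

print(final_category("normal", "normal"))     # normal
print(final_category("sensitive", "normal"))  # sensitive
print(final_category("normal", "sensitive"))  # sensitive
```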
10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text classification method according to any one of claims 1 to 9.
CN201811653926.5A 2018-12-29 2018-12-29 Text classification method and computer equipment Active CN109739989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811653926.5A CN109739989B (en) 2018-12-29 2018-12-29 Text classification method and computer equipment

Publications (2)

Publication Number Publication Date
CN109739989A true CN109739989A (en) 2019-05-10
CN109739989B CN109739989B (en) 2021-05-18

Family

ID=66363028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811653926.5A Active CN109739989B (en) 2018-12-29 2018-12-29 Text classification method and computer equipment

Country Status (1)

Country Link
CN (1) CN109739989B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856123B1 (en) * 2007-07-20 2014-10-07 Hewlett-Packard Development Company, L.P. Document classification
CN105630827A (en) * 2014-11-05 2016-06-01 阿里巴巴集团控股有限公司 Information processing method and system, and auxiliary system
CN106874253A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Recognize the method and device of sensitive information
CN108446388A (en) * 2018-03-22 2018-08-24 平安科技(深圳)有限公司 Text data quality detecting method, device, equipment and computer readable storage medium
CN108399164A (en) * 2018-03-27 2018-08-14 国网黑龙江省电力有限公司电力科学研究院 Electronic government documents classification hierarchy system based on template

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334209A (en) * 2019-05-23 2019-10-15 平安科技(深圳)有限公司 File classification method, device, medium and electronic equipment
WO2020232898A1 (en) * 2019-05-23 2020-11-26 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN111881287A (en) * 2019-09-10 2020-11-03 马上消费金融股份有限公司 Classification ambiguity analysis method and device
CN111881287B (en) * 2019-09-10 2021-08-17 马上消费金融股份有限公司 Classification ambiguity analysis method and device
CN113157901A (en) * 2020-01-22 2021-07-23 腾讯科技(深圳)有限公司 User generated content filtering method and related device
CN113157901B (en) * 2020-01-22 2024-02-23 腾讯科技(深圳)有限公司 User generated content filtering method and related device
CN112149403A (en) * 2020-10-16 2020-12-29 军工保密资格审查认证中心 Method and device for determining confidential text
CN113449109A (en) * 2021-07-06 2021-09-28 广州华多网络科技有限公司 Security class label detection method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN109739989B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN109739989A (en) File classification method and computer equipment
CN110532451A (en) Search method and device for policy text, storage medium, electronic device
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN107533698A (en) The detection and checking of social media event
CN105023165A (en) Method, device and system for controlling release tasks in social networking platform
CN107578353A (en) The registrable property determination methods of work mark based on big data and device
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN108829656B (en) Data processing method and data processing device for network information
CN104361059B (en) A kind of harmful information identification and Web page classification method based on multi-instance learning
CN107391545A (en) A kind of method classified to user, input method and device
CN109299277A (en) The analysis of public opinion method, server and computer readable storage medium
CN110532352B (en) Text duplication checking method and device, computer readable storage medium and electronic equipment
CN109766441A (en) File classification method, apparatus and system
CN106294473B (en) Entity word mining method, information recommendation method and device
CN107391737A (en) The method and device that the registrable property of figurative mark based on artificial intelligence judges
CN106991323A (en) The model and method of a kind of detection Android application program ad plug-ins
CN106919588A (en) A kind of application program search system and method
CN107977678A (en) Method and apparatus for output information
CN110083759A (en) Public opinion information crawler method, apparatus, computer equipment and storage medium
CN108073708A (en) Information output method and device
CN110309293A (en) Text recommended method and device
CN109388551A (en) There are the method for loophole probability, leak detection method, relevant apparatus for prediction code
CN108694183A (en) A kind of search method and device
CN108256078B (en) Information acquisition method and device
CN112328469B (en) Function level defect positioning method based on embedding technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant after: Qianxin Technology Group Co., Ltd.

Address before: 100088 Building 3 332, 102, 28 Xinjiekouwai Street, Xicheng District, Beijing

Applicant before: BEIJING QI'ANXIN SCIENCE & TECHNOLOGY CO., LTD.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant