CN108170818A - Text classification method, server and computer-readable medium - Google Patents

Text classification method, server and computer-readable medium

Info

Publication number
CN108170818A
CN108170818A
Authority
CN
China
Prior art keywords
word
vector
theme
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711498600.5A
Other languages
Chinese (zh)
Inventor
黄佳恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Jinli Communication Equipment Co Ltd
Original Assignee
Shenzhen Jinli Communication Equipment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Jinli Communication Equipment Co Ltd
Priority to CN201711498600.5A
Publication of CN108170818A
Legal status: Withdrawn


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a text classification method, a server, and a computer-readable medium. The method includes: obtaining a text to be classified, the text containing M words, where M is a positive integer; obtaining, via a topic model, the topic corresponding to each of N words of the text, where N is a positive integer not greater than M; obtaining, via a topical word embedding model, the topical word vector corresponding to each of the N words according to each word and its corresponding topic, where a topical word vector is a vector that jointly represents a word and its topic; and obtaining, via a classification model, the category of the text to be classified according to the topical word vector of each of the N words. The topic model, the topical word embedding model, and the classification model are all trained models. By taking the polysemy of words into account, the method increases classification accuracy while classifying text efficiently.

Description

Text classification method, server and computer-readable medium
Technical field
The present invention relates to the technical field of information processing, and in particular to a text classification method, a server, and a computer-readable medium.
Background technology
With the development of computer technology, the amount of document information grows sharply every day. Because of this rapid expansion, how to use the information quickly and effectively has become a new problem. Faced with enormous volumes of text, classifying it by traditional manual means meets more and more difficulty because of low efficiency, and automated information processing has become an indispensable tool for obtaining useful information. Automatic text classification based on machine learning and artificial intelligence has therefore become an important field of research.
However, in existing text classification techniques, a word may carry different meanings, and these techniques do not distinguish between the senses of a word, which reduces the accuracy of text classification.
Summary of the invention
Embodiments of the present invention provide a text classification method, a server, and a computer-readable medium that can increase classification accuracy while classifying text efficiently by taking the polysemy of words into account.
In a first aspect, an embodiment of the present invention provides a text classification method, which includes:
obtaining a text to be classified, the text containing M words, where M is a positive integer;
obtaining, via a topic model, the topic corresponding to each of N words of the text to be classified, where N is a positive integer not greater than M;
obtaining, via a topical word embedding model, the topical word vector corresponding to each of the N words according to each word and its corresponding topic, where a topical word vector is a vector that jointly represents a word and its topic;
obtaining, via a classification model, the category of the text to be classified according to the topical word vector of each of the N words;
wherein the topic model, the topical word embedding model, and the classification model are all trained models.
In a second aspect, an embodiment of the present invention provides a server that includes units for performing the method of the first aspect.
In a third aspect, an embodiment of the present invention provides another server comprising a processor, an input device, an output device, and a memory, which are interconnected. The memory stores a computer program that supports the server in performing the above method; the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiments of the present invention, the topic corresponding to a word is obtained by a topic model, the topical word vector corresponding to the word is obtained by a topical word embedding model, and the category of the text to be classified is obtained by feeding the topical word vectors of some or all of its words into a classification model. During classification each word is paired with its corresponding topic, so the senses of a polysemous word are distinguished and the accuracy of the overall text classification is improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text classification method provided by one embodiment of the present invention;
Fig. 2 is a flowchart of a text classification method provided by another embodiment of the present invention;
Fig. 2A is a structural diagram of a TWE model involved in the embodiments of the present invention;
Fig. 2B is a structural diagram of a FastText model involved in the embodiments of the present invention;
Fig. 3 is a schematic block diagram of a server provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a server provided by another embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the term "and/or" used in the description of the invention and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be construed, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In particular implementations, the terminal described in the embodiments of the present invention includes, but is not limited to, portable devices such as a mobile phone, laptop computer, or tablet computer with a touch-sensitive surface (for example, a touch-screen display and/or a touchpad). It should further be understood that in certain embodiments the device is not a portable communication device but a desktop computer with a touch-sensitive surface (for example, a touch-screen display and/or a touchpad).
In the discussion below, a terminal including a display and a touch-sensitive surface is described. It should be understood, however, that the terminal may include one or more other physical user-interface devices, such as a physical keyboard, a mouse, and/or a joystick.
The terminal supports various applications, for example one or more of the following: a drawing application, a presentation application, a word-processing application, a website-creation application, a disc-burning application, a spreadsheet application, a game application, a telephony application, a video-conferencing application, an e-mail application, an instant-messaging application, an exercise-support application, a photo-management application, a digital-camera application, a digital-video-camera application, a web-browsing application, a digital-music-player application, and/or a video-player application.
The various applications executable on the terminal may use at least one common physical user-interface device, such as the touch-sensitive surface. One or more functions of the touch-sensitive surface, as well as the corresponding information displayed on the terminal, may be adjusted and/or varied between applications and/or within a given application. In this way, a common physical architecture of the terminal (for example, the touch-sensitive surface) can support the various applications with user interfaces that are intuitive and transparent to the user.
When classifying a text, a human first understands it and then classifies it using his or her prior knowledge. When a terminal classifies a text or performs a similar operation, it obtains the words in the text, processes them, and then obtains the category of the text through a trained model. In the prior art, words are typically represented as vectors so that the terminal can easily identify and process them, and vectorized words behave much like the words people use. For example, consider the question of which pair is more alike: "car and airplane" or "car and banana". In human learning, knowledge such as "cars and airplanes are both vehicles, while a banana is a fruit" is already stored, so one can answer without thinking that the former pair is more alike. Likewise, when a terminal is asked this question, it first converts "car", "airplane", and "banana" into word vectors, then computes the vector distance between "car" and "airplane" and the vector distance between "car" and "banana", and finally compares the two distances to obtain the answer. Of course, the terminal stores the artificially constructed word-to-word-vector correspondence used to handle this problem. However, the terminal may also encounter sentences such as "he was revived by the doctor" and "he was promoted to doctor".
Because of polysemy, classification accuracy often suffers. But if a topic is assigned to the first "doctor", namely occupation, and another topic to the second "doctor", namely title, then when the terminal encounters the vectors for "doctor-occupation" and "doctor-title" it can judge that they are different words and process them more accurately, so the classification result is better. The present invention is described below with reference to the embodiments.
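The distance comparison described above can be sketched with cosine similarity over toy vectors. The three-dimensional vectors below are hypothetical values invented only for illustration; real word vectors come from a trained embedding model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: closer to 1.0 means more alike."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors, chosen so the two "vehicle" words
# point in nearly the same direction while "banana" points elsewhere.
word_vectors = {
    "car":      [0.9, 0.8, 0.1],
    "airplane": [0.8, 0.9, 0.2],
    "banana":   [0.1, 0.2, 0.9],
}

sim_car_airplane = cosine_similarity(word_vectors["car"], word_vectors["airplane"])
sim_car_banana = cosine_similarity(word_vectors["car"], word_vectors["banana"])
print(sim_car_airplane > sim_car_banana)  # True: "car and airplane" are more alike
```

With topic-tagged vectors, "doctor-occupation" and "doctor-title" would simply be two distinct keys in such a table, so the same comparison machinery keeps the two senses apart.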
Referring to Fig. 1, it shows the flowchart of a text classification method provided by one embodiment of the present invention. This embodiment mainly takes as an example the text classification method applied in a terminal with storage and processing capability; the terminal may include a smartwatch, a portable digital player, a smartphone, a tablet computer, a laptop, a desktop computer, a server, a big-data platform server, a distributed-system server, and the like. The text classification method includes:
Step 101: obtain the text to be classified.
In the embodiment of the present invention, the text to be classified may be a book, an article, a passage, a sentence, and so on. It may contain only one written language, words of multiple languages, or various symbols; for example, it may consist of Chinese characters, English words, Arabic numerals, Greek letters, and other kinds of writing. It should be understood that the above examples merely describe what a text to be classified may be in the embodiments of the present invention and are not specifically limiting.
In the embodiment of the present invention, the text to be classified contains M words, where M is a positive integer. For ease of understanding, take an English text as an example: if the text to be classified is a passage containing words and punctuation marks, then the number of words M in the text is the word count of the passage. Of course, unlike alphabetic scripts such as English and French, in which the words of a text are clearly delimited, Chinese text cannot be divided into words so easily. For example, "武汉市长江大桥" ("Wuhan Yangtze River Bridge") can be segmented as "武汉市/长江/大桥" ("Wuhan City / Yangtze / Bridge") or as "武汉/市长/江大桥" ("Wuhan / mayor / Jiang Bridge"); different contexts call for different segmentations. Therefore, for languages such as Chinese, the text to be classified must first undergo word segmentation to obtain its M words.
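One of the segmentation options this document names later, forward maximum matching, can be sketched as follows. The mini-dictionary is hypothetical; a real segmenter ships with a far larger vocabulary:

```python
def forward_max_match(text, vocab, max_word_len=4):
    """Forward maximum matching: at each position, greedily take the longest
    dictionary word; fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical mini-dictionary covering the example's words.
vocab = {"武汉市", "武汉", "市长", "长江", "大桥", "江大桥"}
print(forward_max_match("武汉市长江大桥", vocab))
# ['武汉市', '长江', '大桥']
```

Note the greedy left-to-right scan commits to "武汉市" first, which is exactly why forward and reverse maximum matching can disagree on ambiguous strings.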
Step 102: according to N words of the text to be classified, obtain the topic corresponding to each of the N words via a topic model.
In the embodiment of the present invention, the terminal selects N of the M words of the text to be classified for subsequent processing, where N is a positive integer not greater than M. Understandably, not every word in the text serves to characterize it; in other words, some words matter very little to the text to be classified. Such words can be removed to avoid redundant operations. Of course, the terminal may also pass all the words to subsequent processing; this is not specifically limited here.
In the embodiment of the present invention, the topic model outputs, for an input word, the topic corresponding to that word. For example, in a farm context the topic of "apple" may be fruit, while in a mobile-phone context the topic of "apple" may be company. When the topic model is used, each word can be assigned a corresponding topic. These topics may simply be numbers: the "fruit" above may be number 28 and the "company" above number 355; the numbers carry no particular meaning and merely index the classes assigned to the words. The topic model may be a Probabilistic Latent Semantic Analysis (PLSA) model, a Latent Dirichlet Allocation (LDA) model, or a Hierarchical Dirichlet Process (HDP) model.
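As a rough sketch of how a trained topic model can hand a word its topic number: an LDA-style model scores each topic k by p(word | k) · p(k | document) and picks the highest-scoring one. The probability tables and the topic numbers 28 and 355 below are hypothetical values echoing the "apple" example, not parameters of any real trained model:

```python
def most_likely_topic(word, doc_topic_probs, topic_word_probs):
    """Return the topic k maximizing p(word | k) * p(k | document), the
    quantity an LDA-style model scores when assigning a topic to a word."""
    scores = {
        k: topic_word_probs[k].get(word, 1e-9) * p_k
        for k, p_k in doc_topic_probs.items()
    }
    return max(scores, key=scores.get)

# Hypothetical trained parameters: topic 28 ("fruit") and topic 355 ("company").
topic_word_probs = {
    28:  {"apple": 0.20, "banana": 0.15, "farm": 0.10},
    355: {"apple": 0.10, "phone": 0.25, "company": 0.20},
}

farm_doc  = {28: 0.9, 355: 0.1}   # document mostly about topic 28
phone_doc = {28: 0.1, 355: 0.9}   # document mostly about topic 355
print(most_likely_topic("apple", farm_doc, topic_word_probs))   # 28
print(most_likely_topic("apple", phone_doc, topic_word_probs))  # 355
```

The same surface word thus receives different topic numbers depending on the document's topic mixture, which is what lets the later steps treat the two senses as different words.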
Step 103: according to each of the N words and its corresponding topic, obtain the topical word vector of each of the N words via a topical word embedding model.
In the embodiment of the present invention, the topical word embedding model outputs, for an input word and its corresponding topic, the topical word vector of that word. A topical word vector is a vector that jointly represents a word and a topic; in other words, one topical word vector represents one word together with one topic. For example, the same word under different topics has different topical word vectors, and different words under the same topic also have different topical word vectors. The meaning of each dimension is not specifically limited by the embodiments of the present invention: the first several dimensions may characterize the word and the remaining dimensions the topic, or the odd-numbered dimensions may characterize the word and the even-numbered dimensions the topic. The topical word embedding model may be a single model such as a Topical Word Embeddings (TWE) model, or a combination of models, such as a TWE model combined with a Skip-Gram model, or a TWE model combined with a Continuous Bag-of-Words (CBOW) model.
Step 104: according to the topical word vector of each of the N words, obtain the category of the text to be classified via a classification model.
In the embodiment of the present invention, the classification model outputs the category of the text to be classified according to all the input topical word vectors of that text. The classification model may be a FastText model, a Convolutional Neural Network (CNN) model, or a Long Short-Term Memory (LSTM) model. It should be noted that the above topic model, topical word embedding model, and classification model are all trained models; that is, they have been trained on many texts so as to reach their accurate processing capability.
In the embodiments of the present invention, the topic of a word is obtained by a topic model, the topical word vector of the word is obtained by a topical word embedding model, and the category of the text to be classified is obtained by feeding the topical word vectors of some or all of its words into a classification model. During classification each word is paired with its corresponding topic, so the senses of a polysemous word are distinguished and the accuracy of the overall text classification is improved.
Referring to Fig. 2, it shows the flowchart of a text classification method provided by another embodiment of the present invention. Compared with the previous embodiment, and in order to give a more detailed description, this embodiment is mainly explained with specific algorithms and specific models. The text classification method includes:
Step 201: obtain the text to be classified.
The terminal may obtain a text to be classified entered by a user; the text may be a book, an article, a passage, a sentence, and so on. In this embodiment, the text may consist of Chinese characters and punctuation marks. The content of the text to be classified may then be the following sentence: "his friend lives in the mansion by the Fifth Ring Road."
Step 202: preprocess the text to be classified to obtain N words.
The terminal may apply preprocessing such as word segmentation and stop-word removal to the text to be classified. The segmentation algorithm may be a forward maximum matching algorithm, a reverse maximum matching algorithm, a minimum segmentation algorithm, a jieba segmentation algorithm, or the like; for Chinese text the jieba algorithm is frequently used. Suppose word segmentation divides the reference text above into "his / friend / lives in / Fifth Ring Road / 's / mansion / inside / .". Stop-word removal is then needed, which essentially amounts to matching against a stop-word list: if a word of the text also appears in the stop-word list, that word is removed from the text. Suppose that after stop-word removal the reference text above becomes the three words "friend", "Fifth Ring Road", and "mansion".
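The stop-word filtering step above can be sketched as a simple set lookup; the token list and stop-word set below are hypothetical stand-ins for the segmented example sentence:

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that do not appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical segmentation of the example sentence and a tiny stop-word set.
tokens = ["his", "friend", "lives in", "Fifth Ring Road", "'s", "mansion", "inside", "."]
stop_words = {"his", "lives in", "'s", "inside", "."}
print(remove_stop_words(tokens, stop_words))
# ['friend', 'Fifth Ring Road', 'mansion']
```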
Step 203: according to the N words, obtain the topic of each word via an LDA model.
The LDA model is a topic model with many uses; in the embodiments of the present invention it is mainly used to assign a topic to every word in a text. When training the LDA model, two Dirichlet distribution parameters and the number of topics need to be set, and the model is then trained on a large text corpus.
Step 204: according to the N words and the topic corresponding to each word, obtain the topical word vector of each word via a TWE model.
The advantage of the TWE model is that it combines word vectors and topic vectors: it can produce both the word vector and the topic vector, and the topical word vector can be obtained by concatenating the word vector with the topic vector. Understandably, the dimensions of the word vector and the topic vector are preset, and the dimension of the word vector is typically no smaller than that of the topic vector; for example, the word vector may have 200 dimensions and the topic vector 50. Concatenation means joining the dimensions of the word vector and the topic vector into a single higher-dimensional vector. For example, if the word vector is (a1, a2, a3, ..., a199, a200) and the topic vector is (b1, b2, b3, ..., b49, b50), concatenating them yields the topical word vector (a1, a2, a3, ..., a199, a200, b1, b2, b3, ..., b49, b50). The topic vector may of course also come first and the word vector second, i.e. (b1, b2, b3, ..., b49, b50, a1, a2, a3, ..., a199, a200). It should be understood that the above examples are for illustration only and should not be construed as limiting. By associating a word with its corresponding topic, this method can effectively distinguish the senses of a word. The structure of the TWE model is shown in Fig. 2A. During training of the TWE model, the word vectors and topic vectors are first initialized; the words of a large text corpus are converted into word vectors and topic vectors, and these vectors are used to train the TWE model and to update the word vectors and topic vectors.
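The concatenation (splicing) just described is a plain end-to-end join of the two vectors. A minimal sketch with small, made-up dimensions (the text suggests e.g. i = 200 and j = 50 in practice):

```python
def topical_word_vector(word_vec, topic_vec):
    """Concatenate an i-dimensional word vector with a j-dimensional topic
    vector into a single (i+j)-dimensional topical word vector."""
    return word_vec + topic_vec

word_vec  = [0.1, 0.2, 0.3, 0.4]   # hypothetical 4-dimensional word vector
topic_vec = [0.7, 0.8]             # hypothetical 2-dimensional topic vector

twv = topical_word_vector(word_vec, topic_vec)
print(twv)        # [0.1, 0.2, 0.3, 0.4, 0.7, 0.8]
print(len(twv))   # 6, i.e. 4 + 2
```

Swapping the argument order gives the topic-vector-first layout the text also allows; either convention works as long as it is used consistently for every word.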
Optionally, the word vectors and topic vectors may be obtained by two separate models, after which the obtained word vector and topic vector are concatenated into the topical word vector. For example, the word vector of a word in the text may be obtained by a Skip-Gram model or a CBOW model, while the topic vector of the word is obtained by a TWE model. It should further be understood that the correspondence in the model between words and their word vectors and topic vectors is essentially one-to-one, and this correspondence takes the form of a word-vector matrix and a topic-vector matrix. The vectors in the word-vector matrix and the topic-vector matrix are continuously updated while training the model: a Skip-Gram model, for example, trains word vectors, whereas a TWE model trains both word vectors and topic vectors. After the Skip-Gram model has been trained, its word-vector matrix is brought into the TWE model for training; at this point not only can the topic-vector matrix be trained, but the word-vector matrix can also be slightly updated.
Step 205: according to the topical word vectors of the N words, obtain the category of the text to be classified via a FastText model.
Compared with other classification models, the FastText model greatly reduces training time while maintaining classification quality. The structure of the FastText model is shown in Fig. 2B: the input vectors are averaged, and the result is passed through a classifier to obtain the category (label) of the text to be classified. This model is, of course, a trained model. Since FastText is a supervised model, labeled text corpora are needed for training, and the topical word vectors can also be slightly updated during model training. The classifier may be a logistic classifier, a Softmax classifier, or further a hierarchical Softmax classifier.
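The average-then-classify pipeline can be sketched as below. This is a FastText-style simplification, not the real FastText implementation: the weight matrix, class labels, and input vectors are hypothetical, and a real model learns its weights from labeled corpora:

```python
import math

def classify(topical_word_vectors, weight_matrix, labels):
    """Average the input vectors, score each class with a linear layer, and
    return the label with the highest softmax probability (the argmax of the
    logits, since softmax is monotonic)."""
    dim = len(topical_word_vectors[0])
    avg = [sum(v[d] for v in topical_word_vectors) / len(topical_word_vectors)
           for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, avg)) for row in weight_matrix]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))]

# Hypothetical trained weights for two classes over 3-dimensional vectors.
weights = [[1.0, 0.0, -1.0],    # scores class "sports"
           [-1.0, 0.0, 1.0]]    # scores class "finance"
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]  # toy topical word vectors
print(classify(vectors, weights, ["sports", "finance"]))  # sports
```

Averaging makes the classifier's cost independent of text length, which is one reason this architecture trains and predicts quickly.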
It should be noted that the specific implementation of some of the steps of the method shown in Fig. 2 can be found in the corresponding description of the previous method embodiment and is not repeated here.
In the embodiments of the present invention, the topic of a word is obtained by an LDA model, the topical word vector of the word is obtained by a TWE model, and the category of the text to be classified is obtained by feeding the topical word vectors of some or all of its words into a FastText model. During classification, topical word vectors are obtained with the LDA and TWE models, pairing each word with its corresponding topic, so that the senses of a polysemous word are distinguished and the accuracy of the overall text classification is improved; using the FastText model also improves the speed of text classification. Furthermore, text classification is one of the most important tasks in natural language processing, and by distinguishing the senses of words the embodiments of the present invention can effectively improve its accuracy and can be applied to fields such as web-page classification, spam filtering, information retrieval, and question-answering systems.
An embodiment of the present invention further provides a server comprising units for performing any of the foregoing methods. Specifically, referring to Fig. 3, it is a schematic block diagram of a server provided by an embodiment of the present invention. The server of this embodiment includes: a text acquiring unit 301, a topic acquiring unit 302, a vector acquiring unit 303, and a classification unit 304.
The text acquiring unit 301 is configured to obtain a text to be classified, the text containing M words.
The topic acquiring unit 302 is configured to obtain, via a topic model, the topic corresponding to each of N words of the text to be classified.
The vector acquiring unit 303 is configured to obtain, via a topical word embedding model, the topical word vector corresponding to each of the N words according to each word and its corresponding topic.
The classification unit 304 is configured to obtain, via a classification model, the category of the text to be classified according to the topical word vector of each of the N words.
Here, M is a positive integer; N is a positive integer not greater than M; a topical word vector is a vector that jointly represents a word and its corresponding topic; and the topic model, the topical word embedding model, and the classification model are all trained models.
The server of this embodiment may further include a preprocessing unit configured to perform word segmentation on the text to be classified to obtain its M words, and to perform stop-word removal on the M words to obtain the N words of the text to be classified. The algorithm used for word segmentation may be a forward maximum matching algorithm, a reverse maximum matching algorithm, a minimum segmentation algorithm, a jieba segmentation algorithm, or the like.
The topic model may be a Latent Dirichlet Allocation (LDA) topic model, and the topical word embedding model may be a Topical Word Embeddings (TWE) model.
Vectorial acquiring unit 303 can be specifically used for through TWE models, obtain each corresponding word of word in N number of word Vector and the corresponding theme vector of each word;The corresponding term vector of each word theme vector corresponding with each word is spelled It connects, obtains the corresponding theme term vector of each word;Wherein, term vector be i dimensional vectors, theme vector be j dimensional vectors, descriptor to It measures as (i+j) dimensional vector.
Alternatively, the vector acquiring unit 303 may be configured to obtain, through a word vector model, the word vector corresponding to each word among the N words and, through a topic vector model, the topic vector corresponding to each word, and to concatenate each word's word vector with its corresponding topic vector to obtain the word's topic-word vector. As above, the word vector is an i-dimensional vector, the topic vector is a j-dimensional vector, and the topic-word vector is an (i+j)-dimensional vector.
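In either variant, the concatenation that forms the (i+j)-dimensional topic-word vector amounts to the following, shown with toy values for an i=3-dimensional word vector and a j=2-dimensional topic vector:

```python
# Sketch of forming a topic-word vector by concatenation: an i-dimensional
# word vector joined with a j-dimensional topic vector gives an
# (i + j)-dimensional topic-word vector. The numbers are toy values.
import numpy as np

word_vec = np.array([0.2, -0.1, 0.5])   # i = 3 (word vector)
topic_vec = np.array([0.7, 0.3])        # j = 2 (topic vector)

topic_word_vec = np.concatenate([word_vec, topic_vec])
print(topic_word_vec.shape)  # -> (5,)  i.e. (i + j)
```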
Here, the classification model may be a fast text classification (FastText) model, a convolutional neural network (CNN) model, or a long short-term memory (LSTM) network model.
The classification unit 304 may specifically be configured to average the topic-word vectors of the N words to obtain an operation result, and to obtain, from the operation result, the category of the text to be classified through a classifier. The classifier may be a Logistic classifier or a Softmax classifier.
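The averaging-plus-classifier step can be sketched as follows; the weight matrix W stands in for a trained Softmax classifier, and its values (like the input vectors) are assumptions for illustration:

```python
# Sketch of the classification step: average the N topic-word vectors,
# then apply a Softmax classifier. W is a hypothetical stand-in for a
# trained classifier, not the disclosure's model.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

topic_word_vectors = np.array([[1.0, 0.0, 1.0],   # toy vectors for N = 2 words
                               [0.0, 2.0, 1.0]])
doc_vec = topic_word_vectors.mean(axis=0)          # the averaged "operation result"

W = np.array([[1.0, 0.0, 0.0],    # hypothetical weights: 2 categories x 3 dims
              [0.0, 1.0, 0.0]])
probs = softmax(W @ doc_vec)       # category probabilities
category = int(np.argmax(probs))
print(category)  # -> 1
```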
Referring to Fig. 4, a schematic block diagram of a server provided by another embodiment of the present invention. The server in this embodiment may include one or more processors 401, one or more input devices 402, one or more output devices 403, and a memory 404, which are connected to one another through a bus 405. The memory 404 stores a computer program comprising program instructions, and the processor 401 is configured to call the program instructions to perform:
obtaining a text to be classified, the text comprising M words, where M is a positive integer;
obtaining, through a topic model, the topic corresponding to each word among N words of the text to be classified, where N is a positive integer not greater than M;
obtaining, through a topic-word vector model, the topic-word vector corresponding to each word among the N words, according to each word and its corresponding topic, where the topic-word vector is a joint vector representation of a word and its corresponding topic;
obtaining, through a classification model, the category of the text to be classified, according to the topic-word vector corresponding to each word among the N words;
where the topic model, the topic-word vector model, and the classification model are trained models.
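Taken together, the steps above can be sketched end to end; every component below is a hypothetical toy stand-in for the corresponding trained model (segmenter, topic model, embedding, classifier), chosen only so the flow is runnable:

```python
# End-to-end sketch of the claimed flow. Each stage is a toy stand-in
# (assumptions for illustration), not the trained models of the disclosure.
import numpy as np

def segment(text):                       # steps 1: M words -> N words
    stopwords = {"the", "a"}             # toy stop-word list
    return [w for w in text.split() if w not in stopwords]

def assign_topic(word):                  # stand-in for the topic model
    return 0 if word.endswith("s") else 1

def topic_word_vector(word, topic):      # stand-in TWE: word part + topic part
    return np.array([float(len(word)), float(topic)])

def classify(text):                      # average the vectors, then a toy rule
    words = segment(text)
    doc = np.mean([topic_word_vector(w, assign_topic(w)) for w in words], axis=0)
    return "long-word text" if doc[0] > 4.0 else "short-word text"

print(classify("the topic models classify a text"))  # -> long-word text
```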
In one embodiment, the processor 401 may call an application program stored in the memory 404 to perform the following operations: performing word segmentation on the text to be classified to obtain the M words of the text, and performing stop-word removal on the M words to obtain the N words of the text. The word segmentation may use, for example, the forward maximum matching algorithm, the reverse maximum matching algorithm, the minimum segmentation algorithm, or the jieba segmentation algorithm.
As an embodiment, the topic model may be a latent Dirichlet allocation (LDA) topic model, and the topic-word vector model may be a topical word embedding (TWE) model.
In one embodiment, the processor 401 may call an application program stored in the memory 404 to perform the following operations: obtaining, through the TWE model, the word vector and the topic vector corresponding to each word among the N words; and concatenating each word's word vector with its corresponding topic vector to obtain the word's topic-word vector. Here, the word vector is an i-dimensional vector, the topic vector is a j-dimensional vector, and the topic-word vector is an (i+j)-dimensional vector.
In one embodiment, the processor 401 may call an application program stored in the memory 404 to perform the following operations: obtaining, through a word vector model, the word vector corresponding to each word among the N words, and obtaining, through a topic vector model, the topic vector corresponding to each word among the N words; and concatenating each word's word vector with its corresponding topic vector to obtain the word's topic-word vector. Here, the word vector is an i-dimensional vector, the topic vector is a j-dimensional vector, and the topic-word vector is an (i+j)-dimensional vector.
As an embodiment, the classification model may be a fast text classification (FastText) model, a convolutional neural network (CNN) model, a long short-term memory (LSTM) network model, or the like.
In one embodiment, the processor 401 may call an application program stored in the memory 404 to perform the following operations: averaging the topic-word vectors of the N words to obtain an operation result; and obtaining, from the operation result, the category of the text to be classified through a classifier, where the classifier may be one of a Logistic classifier and a Softmax classifier.
It should be understood that, in the embodiments of the present invention, the processor 401 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.
The input device 402 may include a trackpad, a fingerprint sensor (for acquiring fingerprint information of a user and orientation information of the fingerprint), a microphone, and the like; the output device 403 may include a display (such as an LCD), a loudspeaker, and the like.
The memory 404 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A portion of the memory 404 may also include a non-volatile random access memory. For example, the memory 404 may also store information on the device type.
In a specific implementation, the processor 401, the input device 402, and the output device 403 described in the embodiments of the present invention may perform the implementations described in the first and second embodiments of the text classification method provided by the embodiments of the present invention, and may also perform the implementation of the server described in the embodiments of the present invention, which is not repeated here.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, implement:
obtaining a text to be classified, the text comprising M words, where M is a positive integer;
obtaining, through a topic model, the topic corresponding to each word among N words of the text to be classified, where N is a positive integer not greater than M;
obtaining, through a topic-word vector model, the topic-word vector corresponding to each word among the N words, according to each word and its corresponding topic, where the topic-word vector is a joint vector representation of a word and its corresponding topic;
obtaining, through a classification model, the category of the text to be classified, according to the topic-word vector corresponding to each word among the N words;
where the topic model, the topic-word vector model, and the classification model are trained models.
The computer-readable storage medium may be an internal storage unit of the server described in any of the foregoing embodiments, such as a hard disk or a memory of the server. The computer-readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the server. Further, the computer-readable storage medium may include both the internal storage unit and the external storage device of the server. The computer-readable storage medium is used to store the computer program and other programs and data needed by the server, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the server and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed server and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division of the units is only a division by logical function; other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, or may be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed by the present invention, and these modifications or substitutions shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A text classification method, characterized by comprising:
obtaining a text to be classified, the text comprising M words, where M is a positive integer;
obtaining, through a topic model, the topic corresponding to each word among N words of the text to be classified, where N is a positive integer not greater than M;
obtaining, through a topic-word vector model, the topic-word vector corresponding to each word among the N words, according to each word and its corresponding topic, where the topic-word vector is a joint vector representation of a word and its corresponding topic;
obtaining, through a classification model, the category of the text to be classified, according to the topic-word vector corresponding to each word among the N words;
wherein the topic model, the topic-word vector model, and the classification model are trained models.
2. The method according to claim 1, characterized by further comprising, after the obtaining of the text to be classified:
performing word segmentation on the text to be classified to obtain the M words of the text to be classified;
performing stop-word removal on the M words to obtain the N words of the text to be classified;
wherein the word segmentation uses one of the forward maximum matching algorithm, the reverse maximum matching algorithm, the minimum segmentation algorithm, and the jieba segmentation algorithm.
3. The method according to claim 1, characterized in that the topic model is a latent Dirichlet allocation topic model, and the topic-word vector model is a topical word embedding model.
4. The method according to claim 3, characterized in that the obtaining, through the topic-word vector model, of the topic-word vector corresponding to each word among the N words specifically comprises:
obtaining, through the topical word embedding model, the word vector and the topic vector corresponding to each word among the N words;
concatenating each word's word vector with its corresponding topic vector to obtain each word's topic-word vector;
wherein the word vector is an i-dimensional vector, the topic vector is a j-dimensional vector, and the topic-word vector is an (i+j)-dimensional vector.
5. The method according to claim 1 or 2, characterized in that the obtaining, through the topic-word vector model, of the topic-word vector corresponding to each word among the N words specifically comprises:
obtaining, through a word vector model, the word vector corresponding to each word among the N words, and obtaining, through a topic vector model, the topic vector corresponding to each word among the N words;
concatenating each word's word vector with its corresponding topic vector to obtain each word's topic-word vector;
wherein the word vector is an i-dimensional vector, the topic vector is a j-dimensional vector, and the topic-word vector is an (i+j)-dimensional vector.
6. The method according to claim 1, characterized in that the classification model comprises one of a fast text classification model, a convolutional neural network model, and a long short-term memory network model.
7. The method according to claim 1, characterized in that the classification model is a fast text classification model, and the obtaining, through the classification model, of the category of the text to be classified according to the topic-word vector corresponding to each word among the N words specifically comprises:
averaging the topic-word vectors of the N words to obtain an operation result;
obtaining, from the operation result, the category of the text to be classified through a classifier;
wherein the classifier comprises one of a Logistic classifier and a Softmax classifier.
8. A server, characterized by comprising units for performing the method according to any one of claims 1-7.
9. A server, characterized by comprising a processor, an input device, an output device, and a memory connected to one another, wherein the memory is used to store a computer program comprising program instructions, and the processor is configured to call the program instructions to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
CN201711498600.5A 2017-12-29 2017-12-29 A kind of file classification method, server and computer-readable medium Withdrawn CN108170818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711498600.5A CN108170818A (en) 2017-12-29 2017-12-29 A kind of file classification method, server and computer-readable medium


Publications (1)

Publication Number Publication Date
CN108170818A true CN108170818A (en) 2018-06-15

Family

ID=62516945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711498600.5A Withdrawn CN108170818A (en) 2017-12-29 2017-12-29 A kind of file classification method, server and computer-readable medium

Country Status (1)

Country Link
CN (1) CN108170818A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897815A (en) * 2018-06-20 2018-11-27 淮阴工学院 A kind of multi-tag file classification method based on similarity model and FastText
CN108897815B (en) * 2018-06-20 2021-07-16 淮阴工学院 Multi-label text classification method based on similarity model and FastText
CN108875067A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 text data classification method, device, equipment and storage medium
CN109299251A (en) * 2018-08-13 2019-02-01 同济大学 A kind of abnormal refuse messages recognition methods and system based on deep learning algorithm
CN111143548A (en) * 2018-11-02 2020-05-12 北大方正集团有限公司 Book classification method, device, equipment and computer readable storage medium
CN109299276A (en) * 2018-11-15 2019-02-01 阿里巴巴集团控股有限公司 One kind converting the text to word insertion, file classification method and device
CN109299276B (en) * 2018-11-15 2021-11-19 创新先进技术有限公司 Method and device for converting text into word embedding and text classification
CN109684475A (en) * 2018-11-21 2019-04-26 斑马网络技术有限公司 Processing method, device, equipment and the storage medium of complaint
CN111324831A (en) * 2018-12-17 2020-06-23 中国移动通信集团北京有限公司 Method and device for detecting fraudulent website
CN111368534A (en) * 2018-12-25 2020-07-03 中国移动通信集团浙江有限公司 Application log noise reduction method and device
CN109614494A (en) * 2018-12-29 2019-04-12 东软集团股份有限公司 A kind of file classification method and relevant apparatus
CN111861596A (en) * 2019-04-04 2020-10-30 北京京东尚科信息技术有限公司 Text classification method and device
CN111861596B (en) * 2019-04-04 2024-04-12 北京京东振世信息技术有限公司 Text classification method and device
CN110059161A (en) * 2019-04-23 2019-07-26 深圳市大众通信技术有限公司 A kind of call voice robot system based on Text Classification
CN110717038A (en) * 2019-09-17 2020-01-21 腾讯科技(深圳)有限公司 Object classification method and device
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110766073A (en) * 2019-10-22 2020-02-07 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110674263A (en) * 2019-12-04 2020-01-10 广联达科技股份有限公司 Method and device for automatically classifying model component files
CN110674263B (en) * 2019-12-04 2022-02-08 广联达科技股份有限公司 Method and device for automatically classifying model component files
CN111008281A (en) * 2019-12-06 2020-04-14 浙江大搜车软件技术有限公司 Text classification method and device, computer equipment and storage medium
CN111178687A (en) * 2019-12-11 2020-05-19 北京淇瑀信息科技有限公司 Financial risk classification method and device and electronic equipment
CN111178687B (en) * 2019-12-11 2024-04-26 北京淇瑀信息科技有限公司 Financial risk classification method and device and electronic equipment
CN111708868A (en) * 2020-01-15 2020-09-25 国网浙江省电力有限公司杭州供电公司 Text classification method, device and equipment for electric power operation and inspection events
CN111858848A (en) * 2020-05-22 2020-10-30 深圳创新奇智科技有限公司 Semantic classification method and device, electronic equipment and storage medium
CN111858848B (en) * 2020-05-22 2024-03-15 青岛创新奇智科技集团股份有限公司 Semantic classification method and device, electronic equipment and storage medium
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium
CN114022086A (en) * 2022-01-06 2022-02-08 深圳前海硬之城信息技术有限公司 Purchasing method, device, equipment and storage medium based on BOM identification

Similar Documents

Publication Publication Date Title
CN108170818A (en) A kind of file classification method, server and computer-readable medium
US20230016365A1 (en) Method and apparatus for training text classification model
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
EP3518142B1 (en) Cross-lingual text classification using character embedded data structures
CN108241741A (en) A kind of file classification method, server and computer readable storage medium
CN111125354A (en) Text classification method and device
CN109902271A (en) Text data mask method, device, terminal and medium based on transfer learning
CN113722483A (en) Topic classification method, device, equipment and storage medium
CN111898374A (en) Text recognition method and device, storage medium and electronic equipment
CN112989800A (en) Multi-intention identification method and device based on Bert sections and readable storage medium
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN109800292A (en) The determination method, device and equipment of question and answer matching degree
CN113901836B (en) Word sense disambiguation method and device based on context semantics and related equipment
CN112988963A (en) User intention prediction method, device, equipment and medium based on multi-process node
CN115392237B (en) Emotion analysis model training method, device, equipment and storage medium
EP4336379A1 (en) Tracking concepts within content in content management systems and adaptive learning systems
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
Bharathi et al. Machine Learning Based Approach for Sentiment Analysis on Multilingual Code Mixing Text.
CN111382243A (en) Text category matching method, text category matching device and terminal
CN111078874B (en) Foreign Chinese difficulty assessment method based on decision tree classification of random subspace
CN108021609B (en) Text emotion classification method and device, computer equipment and storage medium
Yousif Neural computing based part of speech tagger for Arabic language: a review study
KR20190093439A (en) A method and computer program for inferring genre of a text contents
CN114138928A (en) Method, system, device, electronic equipment and medium for extracting text content

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20180615
