CN108170818A - Text classification method, server and computer-readable medium - Google Patents
Text classification method, server and computer-readable medium
- Publication number: CN108170818A
- Application number: CN201711498600.5A
- Authority: CN (China)
- Prior art keywords: word, vector, topic, text, model
- Prior art date: 2017-12-29
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
Embodiments of the invention disclose a text classification method, a server and a computer-readable medium. The method includes: obtaining a text to be classified, the text containing M words, where M is a positive integer; obtaining, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model, where N is a positive integer not greater than M; obtaining, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model, where a topic word vector is a single vector that jointly represents a word and its topic; and obtaining, from the topic word vectors of the N words, the category of the text to be classified through a classification model. The topic model, the topic word vector model and the classification model are all trained models. The method takes the ambiguity of words into account, thereby increasing classification accuracy while classifying text efficiently.
Description
Technical field
The present invention relates to the technical field of information processing, and more particularly to a text classification method, a server and a computer-readable medium.
Background art
With the development of computer technology, the volume of document information grows sharply every day. Because of this rapid expansion of information, how to use it quickly and effectively has become a new problem. Faced with massive amounts of text, classifying it by traditional manual means runs into more and more difficulty because the efficiency is too low; automated information processing has become an indispensable tool for obtaining useful information, so automatic text classification based on machine learning and artificial intelligence has become an important field of research.
In existing text classification technology, however, a word may have several different meanings, and the prior art does not distinguish between these senses, which reduces the accuracy of text classification.
Summary of the invention
Embodiments of the present invention provide a text classification method, a server and a computer-readable medium that can increase classification accuracy while classifying text efficiently, by taking the ambiguity of words into account.
In a first aspect, an embodiment of the present invention provides a text classification method. The method includes:
obtaining a text to be classified, the text containing M words, where M is a positive integer;
obtaining, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model, where N is a positive integer not greater than M;
obtaining, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model, where a topic word vector is a single vector that jointly represents a word and its topic;
obtaining, from the topic word vectors of the N words, the category of the text to be classified through a classification model;
wherein the topic model, the topic word vector model and the classification model are all trained models.
In a second aspect, an embodiment of the present invention provides a server that includes units for performing the method of the first aspect.
In a third aspect, an embodiment of the present invention provides another server, including a processor, an input device, an output device and a memory that are connected to one another, where the memory stores a computer program supporting the server in performing the above method, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiments of the present invention, the topic of each word is obtained through a topic model, the topic word vector of each word is obtained through a topic word vector model, and the category of the text to be classified is then obtained by feeding the topic word vectors of some or all of the words of the text into a classification model. During classification each word is tied to its corresponding topic, so that the different senses of a word are distinguished, which improves the accuracy of the overall text classification.
Description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a text classification method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a text classification method provided by another embodiment of the present invention;
Fig. 2A is a structural diagram of a TWE model according to an embodiment of the present invention;
Fig. 2B is a structural diagram of a FastText model according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a server provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a server provided by another embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in the description of the invention and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In specific implementations, the terminal described in the embodiments of the present invention includes, but is not limited to, portable devices such as a mobile phone, a laptop computer or a tablet computer with a touch-sensitive surface (for example, a touch-screen display and/or a touchpad). It should further be understood that in certain embodiments the device is not a portable communication device but a desktop computer with a touch-sensitive surface (for example, a touch-screen display and/or a touchpad).
In the discussion that follows, a terminal including a display and a touch-sensitive surface is described. It should be understood, however, that the terminal may include one or more other physical user-interface devices such as a physical keyboard, a mouse and/or a joystick.
The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word-processing application, a website creation application, a disc burning application, a spreadsheet application, a game application, a telephone application, a video-conferencing application, an e-mail application, an instant-messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a web-browsing application, a digital music player application and/or a video player application.
The various applications executable on the terminal may use at least one common physical user-interface device, such as the touch-sensitive surface. One or more functions of the touch-sensitive surface, and the corresponding information displayed on the terminal, may be adjusted and/or changed from one application to the next and/or within a given application. In this way a common physical architecture of the terminal (for example, the touch-sensitive surface) can support the various applications with user interfaces that are intuitive and transparent to the user.
When classifying a text, a human first reads the text and then classifies it using his or her prior knowledge. When a terminal classifies a text or performs a similar operation, it obtains the words in the text, processes those words, and then obtains the category of the text through a trained model. In the prior art, words are typically represented as vectors so that the terminal can easily identify and process them, and vectorized words behave in ways similar to the words humans use. Consider, for example, the question of which pair is more alike: "automobile and aircraft" or "automobile and banana". A human, having learned that automobiles and aircraft are both vehicles while a banana is a fruit, can answer without hesitation that the former pair is more alike. When a terminal is asked the same question, it first converts "automobile", "aircraft" and "banana" into word vectors, then computes the vector distance between "automobile" and "aircraft" and the vector distance between "automobile" and "banana", and finally compares the two distances to obtain the answer. For this the terminal stores an artificially constructed word-to-word-vector correspondence. Polysemy, however, is a problem: given the sentences "he was brought back to life by the doctor" and "he was promoted to doctor", a single vector for "doctor" often lowers the accuracy of text classification. But if the first "doctor" is assigned the topic "occupation" and the second "doctor" is assigned the topic "official title", then on recognizing the vectors "doctor-occupation" and "doctor-official title" the terminal can judge that they are different words and process them more accurately, so the classification result is better. The present invention is described below with reference to the embodiments.
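As an illustration of the vector-distance comparison just described (the four-dimensional vectors are invented for the example; a real terminal uses trained embeddings), a minimal sketch:

```python
import numpy as np

def cosine_distance(a, b):
    """0 for identical directions; larger means less alike."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy word vectors, not from a trained model.
vec = {
    "automobile": np.array([0.9, 0.8, 0.1, 0.0]),
    "aircraft":   np.array([0.8, 0.9, 0.2, 0.1]),
    "banana":     np.array([0.1, 0.0, 0.9, 0.8]),
}

d_vehicle = cosine_distance(vec["automobile"], vec["aircraft"])
d_fruit = cosine_distance(vec["automobile"], vec["banana"])
print(d_vehicle < d_fruit)  # True: "automobile and aircraft" is the more alike pair
```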
Referring to Fig. 1, which shows the flowchart of a text classification method provided by an embodiment of the present invention: this embodiment is mainly described with the text classification method applied in a terminal having storage and processing capabilities. Such a terminal may include a smartwatch, a portable digital player, a smartphone, a tablet computer, a laptop, a desktop computer, a server, a big-data platform server, a distributed-system server and the like. The text classification method includes:
Step 101: obtain the text to be classified.
In the embodiment of the present invention, the text to be classified may be a book, an article, a passage, a sentence and so on. It may contain only one written language or words from several languages, and it may also contain various symbols; for example, the text to be classified may consist of Chinese characters, English words, Arabic numerals, Greek letters and other kinds of characters. It should be understood that the above examples merely describe what "text to be classified" means in the embodiments of the present invention and are not intended as specific limitations.
In the embodiment of the present invention, the text to be classified contains M words, where M is a positive integer. For ease of understanding, take an English text as an example: if the text to be classified is a passage containing words and punctuation marks, the number M of words of the text to be classified is the number of words in that passage. Unlike alphabetic scripts such as English and French, in which word boundaries are explicit, a language like Chinese cannot be divided into words so easily. For example, "武汉市长江大桥" may be segmented as "武汉市 / 长江 / 大桥" ("Wuhan City / Yangtze / Bridge") or as "武汉 / 市长 / 江大桥" ("Wuhan / mayor / Jiang Daqiao"); different contexts call for different segmentations. For such a language, therefore, the text to be classified must first undergo word segmentation to obtain its M words.
Step 102: obtain, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model.
In the embodiment of the present invention, the terminal selects N of the M words of the text to be classified for subsequent processing, where N is a positive integer not greater than M. Understandably, not every word in the text is useful for characterizing it; in other words, some words matter very little to the text to be classified. Such words can be removed to avoid redundant computation. Of course, the terminal may also process all of the words; no specific limitation is made here.
In the embodiment of the present invention, the topic model outputs, for an input word, the topic corresponding to that word. For example, in a farming context the topic of "apple" may be "fruit", while in a mobile-phone context the topic of "apple" may be "company". When the topic model is used, each word can be assigned a corresponding topic. These topics may simply be numbers: the "fruit" topic above may be number 28 and the "company" topic may be number 355, the numbers having no specific meaning other than serving to group words. The topic model may be a Probabilistic Latent Semantic Analysis (PLSA) model, a Latent Dirichlet Allocation (LDA) model or a Hierarchical Dirichlet Process (HDP) model.
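Purely as an illustration of the word-to-topic-id mapping just described (the table values are invented; a real mapping comes from a trained PLSA, LDA or HDP model), a minimal sketch:

```python
# Toy stand-in for a trained topic model: maps a (word, context) pair to a topic id.
TOPIC_TABLE = {
    ("apple", "farm"): 28,    # in a farming context, "apple" gets the "fruit"-like topic 28
    ("apple", "phone"): 355,  # in a mobile-phone context, the "company"-like topic 355
}

def assign_topic(word, context):
    """Return the topic id for a word in a given context, or None if unknown."""
    return TOPIC_TABLE.get((word, context))

print(assign_topic("apple", "farm"))   # 28
print(assign_topic("apple", "phone"))  # 355
```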
Step 103: obtain, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model.
In the embodiment of the present invention, the topic word vector model outputs, for an input word and its corresponding topic, the topic word vector of that word. A topic word vector is a single vector that jointly represents a word and its topic; in other words, one topic word vector stands for one word together with one topic. For example, the same word with different topics has different topic word vectors, and different words with the same topic also have different topic word vectors. The meaning of the individual dimensions of the topic word vector is not specifically limited in the embodiments of the present invention: the first several dimensions may characterize the word and the remaining dimensions the topic, or the set of odd dimensions may characterize the word and the set of even dimensions the topic. The topic word vector model may be a single model, such as a Topical Word Embeddings (TWE) model, or a combination of several models, such as a TWE model combined with a Skip-Gram model, or a TWE model combined with a Continuous Bag-of-Words (CBOW) model.
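A sketch of the joint representation described above (illustrative only: the embedding values are random placeholders standing in for trained TWE vectors, and the dimensions i=200, j=50 follow the example given later in the description):

```python
import numpy as np

rng = np.random.default_rng(0)
i, j = 200, 50  # word-vector and topic-vector dimensions

# Placeholder embeddings; in a trained system these come from the topic word vector model.
word_vec = {"doctor": rng.normal(size=i)}
topic_vec = {28: rng.normal(size=j), 355: rng.normal(size=j)}

def topic_word_vector(word, topic_id):
    """One (i+j)-dimensional vector jointly representing a word and its topic."""
    return np.concatenate([word_vec[word], topic_vec[topic_id]])

v_occupation = topic_word_vector("doctor", 28)
v_title = topic_word_vector("doctor", 355)
print(v_occupation.shape)                  # (250,)
print(np.allclose(v_occupation, v_title))  # False: same word, different topics
```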
Step 104: obtain, from the topic word vectors of the N words, the category of the text to be classified through a classification model.
In the embodiment of the present invention, the classification model outputs the category of the text to be classified from all of the input topic word vectors of that text. The classification model may be a FastText model, a Convolutional Neural Network (CNN) model or a Long Short-Term Memory (LSTM) model. It should be noted that the topic model, the topic word vector model and the classification model above are all trained models, that is, models trained on many texts so as to reach their full processing capability.
In the embodiment of the present invention, the topic of each word is obtained through the topic model, the topic word vector of each word is obtained through the topic word vector model, and the category of the text to be classified is then obtained by feeding the topic word vectors of some or all of its words into the classification model. During classification each word is tied to its corresponding topic, so the different senses of a word are distinguished and the accuracy of the overall text classification improves.
Referring to Fig. 2, which shows the flowchart of a text classification method provided by another embodiment of the present invention: compared with the previous embodiment, this embodiment is described in more detail in terms of specific algorithms and specific models. The text classification method includes:
Step 201: obtain the text to be classified.
The terminal may obtain a text to be classified that is input by a user; the text may be a book, an article, a passage, a sentence and so on. In this embodiment the text handled by the terminal consists of Chinese characters and punctuation marks. For illustration, let the content of the text to be classified be the sentence "His friend lives in a mansion at the Fifth Ring." (an English gloss of the Chinese example sentence).
Step 202: pre-process the text to be classified to obtain N words.
The terminal may apply pre-processing such as word segmentation and stop-word removal to the text to be classified. Segmentation algorithms include the forward maximum matching algorithm, the backward maximum matching algorithm, the minimum segmentation algorithm, the jieba segmentation algorithm and so on; Chinese text is frequently handled with the jieba algorithm. Suppose segmentation divides the reference sentence above into "his / friend / lives in / Fifth Ring / of / mansion / in / .". Stop-word removal is then needed: it relies on a stop-word list, and any word of the text that also appears in the stop-word list is removed from the text. Suppose that after stop-word removal the reference sentence is reduced to the three words "friend", "Fifth Ring" and "mansion".
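A small sketch of this pre-processing using the real jieba library (illustrative only: the Chinese sentence is a back-translation assumed for the example, the stop-word list is a toy list rather than a standard one, and the exact segmentation depends on jieba's dictionary):

```python
import jieba  # pip install jieba

# Toy stop-word list (assumed); real systems use a published stop-word list.
STOP_WORDS = {"他", "的", "住", "住在", "在", "里", "。"}

def preprocess(text):
    """Segment the text with jieba, then drop any token found in the stop-word list."""
    return [w for w in jieba.cut(text) if w not in STOP_WORDS]

print(preprocess("他的朋友住在五环的大厦里。"))
# Roughly: ['朋友', '五环', '大厦'] — "friend", "Fifth Ring", "mansion"
```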
Step 203: obtain the topic of each of the N words through an LDA model.
The LDA model is a topic model with many uses; in the embodiment of the present invention it is mainly used to assign a topic to every word in the text. When the LDA model is trained, two Dirichlet distribution parameters and the number of topics need to be set, and the model is then trained on a large corpus of text.
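A hedged training-and-assignment sketch using gensim's LDA implementation (an assumed library choice; the corpus, the two Dirichlet parameters alpha and eta, and the topic count are example values, not taken from the patent):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Assumed toy corpus; the patent trains on a large corpus of text.
docs = [["friend", "mansion", "ring"],
        ["apple", "orchard", "fruit"],
        ["apple", "phone", "company"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# The two Dirichlet priors (alpha: document-topic, eta: topic-word) and the
# number of topics are set before training, as the description notes.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha=0.1, eta=0.01, passes=100, random_state=0)

# Per-word topic assignment for one document, in context.
bow = dictionary.doc2bow(["friend", "ring", "mansion"])
_, word_topics, _ = lda.get_document_topics(bow, per_word_topics=True)
for word_id, topic_ids in word_topics:
    print(dictionary[word_id], "->", topic_ids[0] if topic_ids else None)
```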
Step 204: obtain the topic word vector of each word from the N words and their corresponding topics through a TWE model.
The advantage of the TWE model is that it couples word vectors and topic vectors: it yields both a word vector and a topic vector, and the topic word vector is obtained by concatenating the word vector with the topic vector. Understandably, the dimensions of the word vectors and topic vectors are preset, and the word-vector dimension is usually not smaller than the topic-vector dimension; for example, the word vectors may have 200 dimensions and the topic vectors 50. Concatenation means joining the dimensions of the word vector and the dimensions of the topic vector into one higher-dimensional vector. For example, if the word vector is (a1, a2, a3, ..., a199, a200) and the topic vector is (b1, b2, b3, ..., b49, b50), concatenating them gives the topic word vector (a1, a2, a3, ..., a199, a200, b1, b2, b3, ..., b49, b50). The topic vector may of course come first and the word vector second, i.e., (b1, b2, b3, ..., b49, b50, a1, a2, a3, ..., a199, a200). It should be understood that the above examples are for illustration only and should not be construed as specific limitations. This method binds each word to its corresponding topic and can effectively distinguish the senses of a word. The structure of the TWE model is shown in Fig. 2A. During training of the TWE model, the word vectors and topic vectors are first initialized; the words of a large text corpus are converted into word vectors and topic vectors, the TWE model is trained on these vectors, and the word vectors and topic vectors are updated.
Optionally, the word vectors and topic vectors may be obtained by two separate models and then concatenated into topic word vectors. For example, the word vectors of the words in the text may be obtained by a Skip-Gram model or a CBOW model, while their topic vectors are obtained by a TWE model. It should further be understood that the correspondence between words and their word vectors and topic vectors is essentially one-to-one, and this correspondence takes the form of a word-vector matrix and a topic-vector matrix. The vectors in the word-vector matrix and the topic-vector matrix are continually updated while the models are trained: a Skip-Gram model, for instance, trains word vectors only, whereas a TWE model trains both word vectors and topic vectors. After the Skip-Gram model has been trained, its word-vector matrix is carried into the TWE model for training; at that point not only is the topic-vector matrix trained, but the word-vector matrix can also receive a slight further update.
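A sketch of the two-model alternative under stated assumptions: gensim's Word2Vec in Skip-Gram mode supplies the word-vector matrix, while the topic-vector matrix below is randomly initialized as a stand-in for one trained by a TWE model:

```python
import numpy as np
from gensim.models import Word2Vec

docs = [["friend", "mansion", "ring"], ["apple", "orchard", "fruit"]]  # toy corpus

# Skip-Gram (sg=1) word vectors; 200 dimensions as in the example above.
w2v = Word2Vec(sentences=docs, vector_size=200, sg=1, min_count=1, seed=0)

num_topics, topic_dim = 2, 50
topic_matrix = np.random.default_rng(0).normal(size=(num_topics, topic_dim))

def topic_word_vector(word, topic_id):
    """Concatenate the Skip-Gram word vector with the topic vector: 200 + 50 = 250 dims."""
    return np.concatenate([w2v.wv[word], topic_matrix[topic_id]])

print(topic_word_vector("friend", 0).shape)  # (250,)
```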
Step 205: obtain the category of the text to be classified from the topic word vectors of the N words through a FastText model.
Compared with other classification models, the FastText model greatly reduces training time while maintaining classification quality. The structure of the FastText model is shown in Fig. 2B: the input vectors are averaged, and the result is passed through a classifier to obtain the category (label) of the text to be classified. This model is, of course, a trained model. Because FastText is a supervised model, labelled text corpora are needed to train it, and the topic word vectors can also receive a slight update during model training. The classifier may be a Logistic classifier or a Softmax classifier, or further a Hierarchical Softmax classifier.
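A minimal sketch of the FastText-style classification head described above, built with scikit-learn under assumptions (averaged topic word vectors fed to a logistic classifier; the training vectors and labels are fabricated for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 250  # dimension of the topic word vectors, as in the concatenation example

def doc_vector(topic_word_vectors):
    """FastText-style averaging of a document's topic word vectors."""
    return topic_word_vectors.mean(axis=0)

# Fabricated training set: 40 documents of 5 topic word vectors each, two labels.
labels = np.array([0, 1] * 20)
X = np.stack([doc_vector(rng.normal(loc=y, size=(5, dim))) for y in labels])

clf = LogisticRegression(max_iter=1000).fit(X, labels)  # softmax when >2 classes
test_doc = rng.normal(loc=1, size=(5, dim))
print(clf.predict(doc_vector(test_doc)[None, :]))  # [1], almost surely
```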
It should be noted that for the specific implementation of some steps of the method shown in Fig. 2, reference may be made to the specific implementation described in the previous method embodiment; details are not repeated here.
In the embodiment of the present invention, the topic of each word is obtained through the LDA model, the topic word vector of each word is obtained through the TWE model, and the category of the text to be classified is then obtained by feeding the topic word vectors of some or all of its words into the FastText model. During classification, the topic word vectors obtained with the LDA and TWE models tie each word to its corresponding topic, so the senses of a word are distinguished and the accuracy of the overall text classification improves, while the use of the FastText model raises the speed of text classification. Furthermore, text classification is one of the most important tasks in natural language processing; by distinguishing the senses of words, the embodiment of the present invention can effectively improve the accuracy of text classification, and it can be applied to fields such as web page classification, spam filtering, information retrieval and question-answering systems.
An embodiment of the present invention also provides a server that includes units for performing any of the methods above. Specifically, referring to Fig. 3, a schematic block diagram of a server provided by an embodiment of the present invention, the server of this embodiment includes: a text obtaining unit 301, a topic obtaining unit 302, a vector obtaining unit 303 and a classification unit 304.
The text obtaining unit 301 is configured to obtain the text to be classified, the text containing M words.
The topic obtaining unit 302 is configured to obtain, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model.
The vector obtaining unit 303 is configured to obtain, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model.
The classification unit 304 is configured to obtain, from the topic word vectors of the N words, the category of the text to be classified through a classification model.
Here M is a positive integer; N is a positive integer not greater than M; a topic word vector is a single vector that jointly represents a word and its topic; and the topic model, the topic word vector model and the classification model are all trained models.
The server of this embodiment may further include a pre-processing unit configured to apply word segmentation to the text to be classified to obtain its M words, and to apply stop-word removal to the M words to obtain the N words of the text to be classified. The segmentation algorithm may be the forward maximum matching algorithm, the backward maximum matching algorithm, the minimum segmentation algorithm, the jieba segmentation algorithm and so on.
The topic model may be a Latent Dirichlet Allocation (LDA) topic model, and the topic word vector model may be a Topical Word Embeddings (TWE) model.
The vector obtaining unit 303 may be specifically configured to obtain, through the TWE model, the word vector and the topic vector corresponding to each of the N words, and to concatenate the word vector of each word with its topic vector to obtain the topic word vector of each word, where the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
The vector obtaining unit 303 may alternatively be configured to obtain, through a word vector model, the word vector corresponding to each of the N words, to obtain, through a topic vector model, the topic vector corresponding to each of the N words, and to concatenate the word vector of each word with its topic vector to obtain the topic word vector of each word, where the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
The classification model may be a FastText model, a Convolutional Neural Network (CNN) model or a Long Short-Term Memory (LSTM) model.
The classification unit 304 may be specifically configured to average the topic word vectors of the N words to obtain an operation result, and to obtain, from the operation result, the category of the text to be classified through a classifier, where the classifier may be a Logistic classifier or a Softmax classifier.
Referring to Fig. 4, a schematic block diagram of a server provided by another embodiment of the present invention: the server of this embodiment may include one or more processors 401, one or more input devices 402, one or more output devices 403 and a memory 404, connected to one another by a bus 405. The memory 404 stores a computer program including program instructions, and the processor 401 executes the program instructions stored in the memory 404. The processor 401 is configured to invoke the program instructions to perform:
obtaining a text to be classified, the text containing M words, where M is a positive integer;
obtaining, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model, where N is a positive integer not greater than M;
obtaining, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model, where a topic word vector is a single vector that jointly represents a word and its topic;
obtaining, from the topic word vectors of the N words, the category of the text to be classified through a classification model;
wherein the topic model, the topic word vector model and the classification model are all trained models.
In an embodiment, the processor 401 may invoke the application program stored in the memory 404 to perform the following operations: applying word segmentation to the text to be classified to obtain its M words, and applying stop-word removal to the M words to obtain the N words of the text to be classified. The segmentation algorithm may be the forward maximum matching algorithm, the backward maximum matching algorithm, the minimum segmentation algorithm, the jieba segmentation algorithm and so on.
In one implementation, the topic model may be a Latent Dirichlet Allocation (LDA) topic model and the topic word vector model may be a Topical Word Embeddings (TWE) model.
In an embodiment, the processor 401 may invoke the application program stored in the memory 404 to perform the following operations: obtaining, through the TWE model, the word vector and the topic vector corresponding to each of the N words; and concatenating the word vector of each word with its topic vector to obtain the topic word vector of each word, where the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
In an embodiment, the processor 401 may invoke the application program stored in the memory 404 to perform the following operations: obtaining, through a word vector model, the word vector corresponding to each of the N words; obtaining, through a topic vector model, the topic vector corresponding to each of the N words; and concatenating the word vector of each word with its topic vector to obtain the topic word vector of each word, where the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
In one implementation, the classification model may be a FastText model, a Convolutional Neural Network (CNN) model, a Long Short-Term Memory (LSTM) model or the like.
In an embodiment, the processor 401 may invoke the application program stored in the memory 404 to perform the following operations: averaging the topic word vectors of the N words to obtain an operation result; and obtaining, from the operation result, the category of the text to be classified through a classifier, where the classifier may be one of a Logistic classifier and a Softmax classifier.
It should be understood that in the embodiments of the present invention the processor 401 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The input device 402 may include a touchpad, a fingerprint sensor (for collecting a user's fingerprint information and fingerprint orientation information), a microphone and the like; the output device 403 may include a display (such as an LCD), a speaker and the like.
The memory 404 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A part of the memory 404 may also include a non-volatile random access memory; for example, the memory 404 may also store information on the device type.
In specific implementations, the processor 401, the input device 402 and the output device 403 described in the embodiments of the present invention may carry out the implementations described in the first and second embodiments of the text classification method provided by the embodiments of the present invention, and may also carry out the implementation of the server described in the embodiments of the present invention; details are not repeated here.
Another embodiment of the present invention provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement:
obtaining a text to be classified, the text containing M words, where M is a positive integer;
obtaining, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model, where N is a positive integer not greater than M;
obtaining, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model, where a topic word vector is a single vector that jointly represents a word and its topic;
obtaining, from the topic word vectors of the N words, the category of the text to be classified through a classification model;
wherein the topic model, the topic word vector model and the classification model are all trained models.
The computer-readable storage medium may be an internal storage unit of the server of any of the preceding embodiments, such as the server's hard disk or memory. The computer-readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the server. Further, the computer-readable storage medium may include both the internal storage unit and an external storage device of the server. The computer-readable storage medium is used to store the computer program and other programs and data needed by the server, and may also be used to temporarily store data that has been output or is to be output.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
It may be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the server and units described above may refer to the corresponding process in the preceding method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed server and method may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary: the division of the units is only a division of logical functions, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the shown or discussed mutual couplings, direct couplings or communication connections may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A text classification method, characterized by comprising:
obtaining a text to be classified, the text containing M words, where M is a positive integer;
obtaining, for N words of the text to be classified, the topic corresponding to each of the N words through a topic model, where N is a positive integer not greater than M;
obtaining, from each of the N words and its corresponding topic, the topic word vector of each of the N words through a topic word vector model, where a topic word vector is a single vector that jointly represents a word and its topic;
obtaining, from the topic word vectors of the N words, the category of the text to be classified through a classification model;
wherein the topic model, the topic word vector model and the classification model are all trained models.
2. The method according to claim 1, characterized in that, after obtaining the text to be classified, the method further comprises:
applying word segmentation to the text to be classified to obtain the M words of the text to be classified;
applying stop-word removal to the M words to obtain the N words of the text to be classified;
wherein the segmentation algorithm is one of: the forward maximum matching algorithm, the backward maximum matching algorithm, the minimum segmentation algorithm and the jieba segmentation algorithm.
3. The method according to claim 1, characterized in that the topic model is a Latent Dirichlet Allocation topic model and the topic word vector model is a Topical Word Embeddings model.
4. The method according to claim 3, characterized in that obtaining the topic word vector of each of the N words through the topic word vector model specifically comprises:
obtaining, through the Topical Word Embeddings model, the word vector and the topic vector corresponding to each of the N words;
concatenating the word vector of each word with the topic vector of that word to obtain the topic word vector of each word;
wherein the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
5. The method according to claim 1 or 2, characterized in that obtaining the topic word vector of each of the N words through the topic word vector model specifically comprises:
obtaining, through a word vector model, the word vector corresponding to each of the N words, and obtaining, through a topic vector model, the topic vector corresponding to each of the N words;
concatenating the word vector of each word with the topic vector of that word to obtain the topic word vector of each word;
wherein the word vector has i dimensions, the topic vector has j dimensions and the topic word vector has (i+j) dimensions.
6. The method according to claim 1, characterized in that the classification model is one of: a FastText model, a convolutional neural network model and a long short-term memory network model.
7. The method according to claim 1, characterized in that the classification model is a FastText model, and obtaining, from the topic word vectors of the N words, the category of the text to be classified through the classification model specifically comprises:
averaging the topic word vectors of the N words to obtain an operation result;
obtaining, from the operation result, the category of the text to be classified through a classifier;
wherein the classifier is one of: a Logistic classifier and a Softmax classifier.
8. A server, characterized by comprising units for performing the method according to any one of claims 1-7.
9. A server, characterized by comprising a processor, an input device, an output device and a memory that are connected to one another, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711498600.5A | 2017-12-29 | 2017-12-29 | Text classification method, server and computer-readable medium (CN108170818A, en) |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN108170818A | 2018-06-15 |
Family
ID=62516945
Family Applications (1)
| Application Number | Title | Status |
|---|---|---|
| CN201711498600.5A | CN108170818A (en) Text classification method, server and computer-readable medium | Withdrawn |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN108170818A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20180615 |