CN108804512A - Apparatus, method, and computer-readable storage medium for generating a text classification model - Google Patents
- Publication number
- CN108804512A (CN201810361702.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- word segmentation
- dictionary
- candidate
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an apparatus for generating a text classification model. The apparatus comprises a memory and a processor; the memory stores a model generation program that can run on the processor, and the program implements the following steps when executed by the processor: obtaining a word segmentation dictionary for the financial domain and a text corpus for the financial domain; selecting candidate new words from the text corpus and adding them to the segmentation dictionary; obtaining a sample set and labeling the training samples in it with classes; segmenting the training samples with a preset segmentation algorithm, based on the segmentation dictionary to which the candidate new words have been added, and extracting word vectors; and, based on the AdaBoost algorithm, feeding the word vectors and the labeled class information into multiple weak classifiers for training to obtain the text classification model. The invention also proposes a method for generating a text classification model and a computer-readable storage medium. The invention addresses the lack, in the prior art, of a way to classify the sentiment orientation of financial-domain text.
Description
Technical field
The present invention relates to the field of text classification technology, and in particular to an apparatus, a method, and a computer-readable storage medium for generating a text classification model.
Background
With the development of the Internet and information technology, more and more institutions and individuals use the Internet to express their views, attitudes, and positions on all kinds of matters, for example through news commentary, forums, and social networking sites. This mass of information has commercial value for e-commerce, market forecasting, and other areas. The financial industry in particular is the industry in which Internet information is growing fastest and which is most affected by it, so deeper study of sentiment orientation analysis of financial text is becoming an important topic.
Text sentiment orientation analysis is a part of text sentiment analysis; it makes it possible to determine whether a text is positive or negative. In the financial domain, news and public opinion are important indicators of the prosperity of markets and industries and of investors' enthusiasm for trading, so analyzing the sentiment orientation of financial-domain text plays a pivotal role in financial research. However, the prior art lacks a scheme for classifying the sentiment orientation of financial-domain text, so such classification cannot be achieved.
Summary of the invention
The present invention provides an apparatus, a method, and a computer-readable storage medium for generating a text classification model. Its main purpose is to propose an apparatus for generating a text classification model that can be used to classify the sentiment orientation of financial-domain text, thereby solving the prior-art problem that such classification cannot be achieved.
To achieve the above object, the present invention provides an apparatus for generating a text classification model. The apparatus comprises a memory and a processor; the memory stores a model generation program that can run on the processor, and the model generation program implements the following steps when executed by the processor:
Obtaining a financial-domain word segmentation dictionary built from collected financial-domain vocabulary, and a preset financial-domain text corpus;
Selecting candidate new words from the text corpus according to a preset algorithm and adding them to the segmentation dictionary;
Obtaining a sample set and labeling the training samples in the sample set with classes according to a preset sentiment orientation classification scheme;
Segmenting the training samples in the sample set with a preset segmentation algorithm, based on the segmentation dictionary to which the candidate new words have been added;
Extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, feeding the word vectors of the training samples and the labeled class information into multiple preset weak classifiers for training, then combining the trained weak classifiers into a text classification model for the financial domain.
Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the segmentation dictionary includes:
Segmenting the text corpus with the segmentation algorithm, based on the segmentation dictionary, and obtaining a candidate word set from the segmentation result;
Computing the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain exceeds a first preset threshold as first candidate new words, and adding the first candidate new words to the segmentation dictionary;
Segmenting the text corpus with the segmentation algorithm, based on the segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus;
Using the trained word vector model to compute the semantic similarity between the words in the segmentation result and the first candidate new words;
Taking the words whose semantic similarity exceeds a second preset threshold as second candidate new words, and adding the second candidate new words to the segmentation dictionary.
Optionally, the processor may further execute the model generation program so that, after the step of taking the words whose semantic similarity exceeds the second preset threshold as second candidate new words and adding them to the segmentation dictionary, the following step is also implemented:
Computing the word frequency of each second candidate new word in the text corpus, and using the computed word frequency as the weight of that second candidate new word in the segmentation dictionary.
Optionally, the step of obtaining a sample set and labeling the training samples in the sample set with classes according to a preset sentiment orientation classification scheme includes:
Obtaining a sample set; obtaining the multiple labels that multiple annotators have assigned to each training sample in the sample set according to the preset sentiment orientation classification scheme; and selecting, from those labels, the one that occurs most often as the labeling result for the corresponding training sample.
Optionally, the weak classifiers include a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm.
In addition, to achieve the above object, the present invention also provides a method for generating a text classification model. The method includes:
Obtaining a financial-domain word segmentation dictionary built from collected financial-domain vocabulary, and a preset financial-domain text corpus;
Selecting candidate new words from the text corpus according to a preset algorithm and adding them to the segmentation dictionary;
Obtaining a sample set and labeling the training samples in the sample set with classes according to a preset sentiment orientation classification scheme;
Segmenting the training samples in the sample set with a preset segmentation algorithm, based on the segmentation dictionary to which the candidate new words have been added;
Extracting word vectors from the segmentation results and, based on the AdaBoost algorithm, feeding the word vectors of the training samples and the labeled class information into multiple preset weak classifiers for training, then combining the trained weak classifiers into a text classification model for the financial domain.
Optionally, the step of selecting candidate new words from the text corpus according to a preset algorithm and adding them to the segmentation dictionary includes:
Segmenting the text corpus with the segmentation algorithm, based on the segmentation dictionary, and obtaining a candidate word set from the segmentation result;
Computing the information gain of each candidate word in the candidate word set, selecting the candidate words whose information gain exceeds a first preset threshold as first candidate new words, and adding the first candidate new words to the segmentation dictionary;
Segmenting the text corpus with the segmentation algorithm, based on the segmentation dictionary to which the first candidate new words have been added, and training a word vector model on the segmented text corpus;
Using the trained word vector model to compute the semantic similarity between the words in the segmentation result and the first candidate new words;
Taking the words whose semantic similarity exceeds a second preset threshold as second candidate new words, and adding the second candidate new words to the segmentation dictionary.
Optionally, after the step of taking the words whose semantic similarity exceeds the second preset threshold as second candidate new words and adding them to the segmentation dictionary, the method further includes the step of:
Computing the word frequency of each second candidate new word in the text corpus, and using the computed word frequency as the weight of that second candidate new word in the segmentation dictionary.
Optionally, the step of obtaining a sample set and labeling the training samples in the sample set with classes according to a preset sentiment orientation classification scheme includes:
Obtaining a sample set; obtaining the multiple labels that multiple annotators have assigned to each training sample in the sample set according to the preset sentiment orientation classification scheme; and selecting, from those labels, the one that occurs most often as the labeling result for the corresponding training sample.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium. A model generation program is stored on the computer-readable storage medium, and the model generation program can be executed by one or more processors to implement the steps of the method for generating a text classification model described above.
The apparatus, method, and computer-readable storage medium for generating a text classification model proposed by the present invention build a financial-domain segmentation dictionary from collected financial-domain vocabulary, obtain a preset financial-domain text corpus, derive a candidate word set from the text corpus, and select candidate new words from the candidate word set to add to the segmentation dictionary. A sample set is obtained, and the training samples in it are labeled with classes according to a preset sentiment orientation classification scheme. Based on the segmentation dictionary to which the candidate new words have been added, the training samples are segmented with a preset segmentation algorithm and word vectors are extracted from the segmentation results. Based on the AdaBoost algorithm, the word vectors of the training samples and the labeled class information are fed into multiple preset weak classifiers for training, and the trained weak classifiers are combined into a text classification model for the financial domain. In the scheme of the present invention, candidate new words are mined from the financial-domain text corpus and added to the segmentation dictionary, the samples in the sample set are segmented with the updated dictionary, and the sample data is labeled with classes according to the preset sentiment orientation classification scheme. The text classification model obtained by the final training can be applied to sentiment orientation classification problems in the financial domain.
Brief description of the drawings
Fig. 1 is a schematic diagram of a preferred embodiment of the apparatus for generating a text classification model of the present invention;
Fig. 2 is a schematic diagram of the program modules of the model generation program in an embodiment of the apparatus for generating a text classification model of the present invention;
Fig. 3 is a flowchart of a preferred embodiment of the method for generating a text classification model of the present invention.
The realization of the object, the functional features, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The present invention provides an apparatus for generating a text classification model. Fig. 1 is a schematic diagram of a preferred embodiment of this apparatus.
In this embodiment, the apparatus for generating a text classification model may be a PC (personal computer) or a terminal device such as a smartphone, tablet computer, or portable computer. The apparatus 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the apparatus 1, for example its hard disk. In other embodiments the memory 11 may be an external storage device of the apparatus 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the apparatus 1. The memory 11 may also include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 can be used not only to store application software installed on the apparatus 1 and various kinds of data, such as the code of the model generation program 01, but also to temporarily store data that has been output or is about to be output.
In some embodiments the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the model generation program 01.
The communication bus 13 provides the connections and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface) and is typically used to establish a communication connection between the apparatus 1 and other electronic devices.
Fig. 1 shows only the apparatus 1 with components 11-14 and the model generation program 01. It should be understood that not all of the components shown are required; more or fewer components may be implemented instead.
Optionally, the apparatus 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. In some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display, which may also be called a display screen or display unit, shows the information processed in the apparatus 1 and presents a visual user interface.
In the embodiment of the apparatus 1 shown in Fig. 1, the model generation program 01 is stored in the memory 11. The processor 12 implements the following steps when executing the model generation program 01 stored in the memory 11:
A1. Obtain a financial-domain word segmentation dictionary built from collected financial-domain vocabulary, and a preset financial-domain text corpus.
First, a general-domain segmentation dictionary is obtained, and the collected financial-domain vocabulary is added to it to form the financial-domain segmentation dictionary. The financial-domain vocabulary comes mainly from three sources: professional financial terms, such as "Williams index", "moving average", and "convertible bond"; financial forum terms, such as the terms users of stock-trading forums use when commenting on stocks; and Internet slang and special symbols used in the financial domain, such as "junk stock".
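The dictionary construction in step A1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the dictionary is a mapping from word to weight, and the function name `build_financial_dictionary` and the default weight of 1 are hypothetical choices made for the example.

```python
def build_financial_dictionary(general_dict, *finance_sources):
    """Merge collected financial vocabulary into a copy of a general dictionary.

    general_dict: mapping word -> weight (assumed representation).
    finance_sources: iterables of words, e.g. professional terms,
    forum terms, and Internet slang used in finance.
    """
    merged = dict(general_dict)          # leave the general dictionary untouched
    for source in finance_sources:
        for word in source:
            # Keep an existing weight; give genuinely new words a default weight.
            merged.setdefault(word, 1)
    return merged


general = {"bank": 5, "market": 7}
professional = ["convertible bond", "moving average"]
slang = ["junk stock", "bank"]
financial_dict = build_financial_dictionary(general, professional, slang)
print(sorted(financial_dict))
```

Copying the general dictionary first keeps it reusable for other domains, which fits the text's description of the financial dictionary as an extension of a general one.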
A2. Select candidate new words from the text corpus according to a preset algorithm and add them to the segmentation dictionary.
Starting from the segmentation dictionary above, candidate new words are selected from the text corpus and added to the current segmentation dictionary. Specifically, step A2 includes:
A21. Segment the text corpus with the segmentation algorithm, based on the segmentation dictionary, and obtain a candidate word set from the segmentation result;
A22. Compute the information gain of each candidate word in the candidate word set, select the candidate words whose information gain exceeds a first preset threshold as first candidate new words, and add the first candidate new words to the segmentation dictionary;
A23. Segment the text corpus with the segmentation algorithm, based on the segmentation dictionary to which the first candidate new words have been added, and train a word vector model on the segmented text corpus;
A24. Use the trained word vector model to compute the semantic similarity between the words in the segmentation result and the first candidate new words;
A25. Take the words whose semantic similarity exceeds a second preset threshold as second candidate new words, and add the second candidate new words to the segmentation dictionary.
A text corpus for expanding the segmentation dictionary is obtained. Specifically, a web crawler fetches from financial websites a large amount of financial article text related to the financial topic to be analyzed, forming the text corpus. The crawled data is preprocessed to remove useless content such as garbled characters and web escape sequences, and the remaining text data is kept as the text corpus. Next, the text data in the corpus is manually labeled with sentiment orientation classes, that is, class label information is added to the text data.
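The preprocessing of crawled text described above might look like the sketch below. The patent does not specify the cleaning rules, so the choices here (decoding HTML entities, dropping leftover tags, removing the Unicode replacement character and non-printable characters) are assumptions, and `clean_text` is a hypothetical helper name.

```python
import html
import re


def clean_text(raw):
    """Remove web escape sequences and garbled characters from crawled text."""
    text = html.unescape(raw)                 # decode entities such as &amp; and &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
    text = text.replace("\ufffd", "")         # drop the Unicode replacement character
    # Keep printable characters and whitespace; discard control characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


sample = "<p>A&amp;B&nbsp;rose\ufffd 5%</p>"
cleaned = clean_text(sample)
print(cleaned)
```

Only standard-library calls are used, so the sketch runs without a crawler or HTML parser; a production pipeline would likely need site-specific rules.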
With the current segmentation dictionary as the dictionary for the preset segmentation algorithm, the text corpus is segmented. Stop words in the segmentation result are then filtered out using a preset stop-word list to remove irrelevant words, and the remaining segmentation results form the candidate word set. The class label of each segmentation result is the same as the class label of the text data it came from.
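The segmentation and stop-word filtering just described can be illustrated with a small sketch. The patent only refers to a "preset segmentation algorithm"; forward maximum matching is used here purely as an assumed stand-in, and the names `segment`, `candidate_words`, and `max_len` are hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Dictionary-based forward maximum matching (assumed algorithm)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary word first; fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_words(text, dictionary, stop_words):
    """Segment the text, then drop stop words and whitespace tokens."""
    return [t for t in segment(text, dictionary)
            if t not in stop_words and not t.isspace()]


dictionary = {"今天", "可转债", "上涨"}
stop_words = {"了"}
candidates = candidate_words("今天可转债上涨了", dictionary, stop_words)
print(candidates)
```

Because matching is driven entirely by the dictionary, adding a domain term such as 可转债 (convertible bond) changes the segmentation result, which is why the text expands the dictionary before segmenting the training samples.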
Next, candidate new words are selected from the candidate word set according to information gain. Information gain is an entropy-based evaluation measure; when used for feature selection, it measures how much information the presence or absence of a word contributes to deciding whether a text belongs to a given class. It is defined as the difference in information before and after a feature is observed in a document. In its standard form, consistent with the symbols defined below, the formula is:
$$IG(t_i) = -\sum_{j=1}^{|C|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|C|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|C|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$
In this formula, $P(C_j)$ is the probability that class $C_j$ occurs in the data set; $P(t_i)$ is the probability that feature $t_i$ appears in the data set; $P(C_j \mid t_i)$ is the probability that a document belongs to class $C_j$ given that $t_i$ appears in it; $P(\bar{t}_i)$ is the probability that feature $t_i$ does not appear; $P(C_j \mid \bar{t}_i)$ is the probability that a document belongs to class $C_j$ given that $t_i$ does not appear in it; and $|C|$ is the total number of classes. Here a class is a sentiment orientation class and a feature is a candidate word. These probabilities can be computed from statistics of the candidate words in the text corpus.
The usefulness of a candidate word is judged from its computed information gain: the larger the information gain, the more useful the word is for classification. Candidate words in the candidate word set whose information gain exceeds the first preset threshold are added to the current segmentation dictionary as first candidate new words, thereby expanding the dictionary.
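The information gain selection can be sketched as below, under the assumption that probabilities are estimated as document-level relative frequencies and that logarithms are taken base 2 (the text fixes neither). The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def information_gain(word, docs):
    """docs: list of (set_of_words, class_label) pairs."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    with_word = Counter(label for words, label in docs if word in words)
    n_with = sum(with_word.values())
    n_without = n - n_with

    def p_log_p(count, total):
        # Convention: 0 * log(0) = 0, and an empty partition contributes 0.
        if count == 0 or total == 0:
            return 0.0
        p = count / total
        return p * math.log2(p)

    gain = -sum(p_log_p(c, n) for c in class_counts.values())
    gain += (n_with / n) * sum(p_log_p(with_word[c], n_with) for c in class_counts)
    gain += (n_without / n) * sum(
        p_log_p(class_counts[c] - with_word[c], n_without) for c in class_counts)
    return gain


def select_first_candidates(candidates, docs, threshold):
    """Keep candidate words whose information gain exceeds the threshold."""
    return {w for w in candidates if information_gain(w, docs) > threshold}


docs = [({"the", "surge"}, "positive"), ({"the", "surge"}, "positive"),
        ({"the", "plunge"}, "negative"), ({"the", "plunge"}, "negative")]
first_candidates = select_first_candidates({"the", "surge"}, docs, threshold=0.5)
print(first_candidates)
```

In this toy data "surge" perfectly separates the two classes while "the" appears everywhere, so only "surge" clears the threshold; the threshold value itself is an arbitrary illustration of the "first preset threshold".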
Based on the expanded segmentation dictionary, the same text corpus is segmented again with the same segmentation algorithm. A word vector model is trained on the segmented corpus, and the trained model is used to compute a word vector for each word produced by segmentation. The semantic similarity between each segmented word and the first candidate new words is computed from the word vectors; if the similarity exceeds the second preset threshold, the word is taken as a second candidate new word. The second candidate new words selected from the segmentation result are added to the segmentation dictionary, expanding it again.
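The similarity-based selection of second candidate new words can be sketched as follows. The patent does not name the similarity measure; cosine similarity is assumed here because it is the usual choice for word vectors. The hand-written two-dimensional vectors stand in for the output of a trained word vector model, and `second_candidates` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def second_candidates(vectors, first_candidates, known_words, threshold):
    """Words similar enough to any first candidate and not yet in the dictionary."""
    picked = set()
    for word, vec in vectors.items():
        if word in known_words or word in first_candidates:
            continue
        if any(cosine(vec, vectors[f]) > threshold
               for f in first_candidates if f in vectors):
            picked.add(word)
    return picked


# Toy vectors standing in for a trained word vector model (assumption).
vectors = {"surge": [1.0, 0.0], "soar": [0.9, 0.1], "bank": [0.0, 1.0]}
new_words = second_candidates(vectors, {"surge"}, known_words=set(), threshold=0.8)
print(new_words)
```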
It will be understood that after the text corpus is segmented with the segmentation algorithm, stop words can be removed from the segmentation result using the stop-word list. Stop words are noisy and carry no meaning for text classification, so removing them improves classification accuracy and reduces the amount of computation needed to select candidate new words.
Through the above steps, the segmentation dictionary is in fact expanded three times: first, a preliminary expansion with manually collected financial-domain vocabulary; second, selection of new words by information gain; and third, selection of further new words by semantic similarity computed from word vectors. The second and third expansions are each carried out on the dictionary as it stood after the previous expansion. This dynamic expansion screens as many new financial-domain words out of the corpus as possible. The segmentation algorithm with the expanded dictionary is then used to segment the training samples for the classification model: the richer the financial vocabulary in the dictionary, the more accurate the segmentation of financial-domain text, and the higher the classification accuracy of the trained model.
Optionally, in one implementation, after the second candidate new words have been selected and added to the segmentation dictionary, the word frequency of each second candidate new word in the text corpus is computed and used as the weight of that word in the segmentation dictionary. The word frequency of each first candidate new word can be computed in the same way and used as its weight in the segmentation dictionary.
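Assigning word-frequency weights can be sketched in a few lines. The sketch assumes the dictionary maps words to weights and that "word frequency" means the raw count of the word in the segmented corpus; the text does not say whether counts are normalized, and the function name is hypothetical.

```python
from collections import Counter


def add_with_frequency_weights(dictionary, new_words, segmented_corpus):
    """Add new words to the dictionary, weighted by their corpus frequency.

    segmented_corpus: list of token lists, one per document.
    """
    frequency = Counter(token for doc in segmented_corpus for token in doc)
    for word in new_words:
        dictionary[word] = frequency[word]   # raw count as weight (assumption)
    return dictionary


dictionary = {"bank": 3}
corpus = [["junk stock", "fell"], ["junk stock", "bank"]]
add_with_frequency_weights(dictionary, ["junk stock"], corpus)
print(dictionary)
```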
A3. Obtain a sample set and label the training samples in the sample set with classes according to a preset sentiment orientation classification scheme.
A sample set for training the text classification model is obtained. For each training sample in the sample set, the labels assigned by multiple annotators according to the preset sentiment orientation classification scheme are obtained, and the label that occurs most often among them is selected as the labeling result for that sample. The user can set the sentiment orientation classification scheme to suit the financial matter being analyzed. For example, text in a stock forum can be divided into hold, sell, and buy; stock discussion text on microblogs or forums can be divided into positive, negative, and neutral; and financial news text can be divided into positive, negative, and neutral.
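The majority-vote labeling described above can be sketched with the standard library. How ties are resolved is not stated in the text; in this sketch a tie goes to the label that appears first in the list, which is simply how `Counter.most_common` orders equal counts and should be treated as an assumption.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by the most annotators."""
    return Counter(labels).most_common(1)[0][0]


annotations = {
    "post-1": ["buy", "hold", "buy"],
    "post-2": ["sell", "sell", "hold"],
}
resolved = {sample: majority_label(votes) for sample, votes in annotations.items()}
print(resolved)
```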
A4. Segment the training samples in the sample set with a preset segmentation algorithm, based on the segmentation dictionary to which the candidate new words have been added.
A5. Extract word vectors from the segmentation results and, based on the AdaBoost algorithm, feed the word vectors of the training samples and the labeled class information into multiple preset weak classifiers for training, then combine the trained weak classifiers into a text classification model for the financial domain.
After the training samples have been labeled, they are segmented with the preset segmentation algorithm based on the repeatedly expanded segmentation dictionary, and the trained word vector model is used to extract the word vectors of the segmentation results. Note that the same segmentation algorithm is used throughout the scheme of this embodiment.
In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted with both a word2vec model and a GloVe (Global Vectors for Word Representation) model, so each segmentation result yields two kinds of word vectors. In addition, a classifier based on a convolutional neural network algorithm, a classifier based on a recurrent neural network algorithm, and a classifier based on a long short-term memory network algorithm serve as the weak classifiers. Each weak classifier takes each of the two kinds of word vectors as input in turn, so six weak classification models can in effect be built. Each weak classifier is trained on the samples in the sample set based on the AdaBoost algorithm. During training, if a sample is classified correctly, its weight is reduced when the next sample set is constructed; if it is misclassified, its weight is increased. The reweighted sample set is used to train the next classifier, and training proceeds iteratively in this way. In addition, after each weak classifier is trained, the weight of a weak classifier with a small classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role. Each weak classifier is trained iteratively by this procedure, and the trained weak classifiers are fused into the final text classification model. This model can be used to classify the sentiment orientation of financial-domain text, for example to judge whether stock discussion text in a forum is negative, positive, or neutral.
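The sample reweighting and classifier weighting described above follow the usual AdaBoost pattern, which the sketch below illustrates for two classes labeled +1 and -1. To keep the example self-contained, one-dimensional threshold rules ("decision stumps") are swapped in for the CNN, RNN, and LSTM weak classifiers of the embodiment, and the input is a single number per sample rather than word vectors; all names and the toy data are hypothetical.

```python
import math


def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # low error -> large vote
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x >= threshold else -polarity for x in xs]
        # Correct samples (y * p = 1) shrink; misclassified ones (y * p = -1) grow.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * (polarity if x >= threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single threshold rule fits this data, but the weighted vote of three does, which is the effect the text attributes to combining weak classifiers. Extending the sketch to the three sentiment classes mentioned in the text would need a multi-class variant of AdaBoost, which is not shown here.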
The device for generating a text classification model proposed in this embodiment mines a text corpus
of the financial field, screens out as many new financial-field words from the corpus as possible and
adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation
dictionary. It then performs word segmentation on the training samples in the sample set using the
dictionary expanded with financial vocabulary, labels the sample data in the sample set according to a
preset sentiment orientation classification scheme, and finally trains a text classification model that
can be applied to sentiment orientation classification problems in the financial field.
Optionally, in other embodiments, the model generation program may also be divided into one or more
modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in
this embodiment) to carry out the present invention. A module in the sense of the present invention is a
series of computer program instruction segments capable of performing a specific function, and is used
to describe the execution of the model generation program in the device for generating a text
classification model.
Fig. 2 is a schematic diagram of the program modules of the model generation program in an embodiment
of the device for generating a text classification model of the present invention. In this embodiment,
the model generation program may be divided into a data acquisition module 10, a new word selection
module 20, a sample labeling module 30, a sample word segmentation module 40 and a model training
module 50. Illustratively:
The data acquisition module 10 is configured to: obtain a financial-field word segmentation dictionary
built from collected financial-field vocabulary, and a preset financial-field text corpus;
The new word selection module 20 is configured to: select candidate new words from the text corpus
according to a preset algorithm and add them to the word segmentation dictionary;
The sample labeling module 30 is configured to: obtain a sample set and label the category of each
training sample in the sample set according to a preset sentiment orientation classification scheme;
The sample word segmentation module 40 is configured to: perform word segmentation on the training
samples in the sample set using a preset word segmentation algorithm, based on the word segmentation
dictionary to which the candidate new words have been added;
The model training module 50 is configured to: extract word vectors from the word segmentation results
and, based on the adaboost algorithm, input the word vectors of the training samples and the labeled
category information into a plurality of preset weak classifiers for training, and combine the trained
weak classifiers into a text classification model for the financial field.
The functions or operation steps implemented when the data acquisition module 10, the new word selection
module 20, the sample labeling module 30, the sample word segmentation module 40 and the model training
module 50 are executed are substantially the same as in the above embodiment and are not repeated here.
In addition, the present invention also provides a method for generating a text classification model.
Fig. 3 is a flowchart of a preferred embodiment of the method for generating a text classification
model of the present invention. The method may be executed by a device, and the device may be
implemented in software and/or hardware.
In this embodiment, the method for generating a text classification model includes:
Step S10: obtain a financial-field word segmentation dictionary built from collected financial-field
vocabulary, and a preset financial-field text corpus.
First, a general-domain word segmentation dictionary is obtained, and the collected financial-field
vocabulary is added to it to form the financial-field word segmentation dictionary. The financial-field
vocabulary mainly comes from three sources: financial-field technical terms, such as "Williams
indicator", "moving average" and "convertible bond"; financial forum terms, such as the expressions
that users of stock trading forums use when commenting on stocks; and internet slang and special symbols
used in the financial field, such as "junk stock".
Step S20: select candidate new words from the text corpus according to a preset algorithm and add them
to the word segmentation dictionary.
On the basis of the above word segmentation dictionary, candidate new words are selected from the new
text corpus and added to the current word segmentation dictionary. Specifically, step S20 includes:
performing word segmentation on the text corpus using the word segmentation algorithm, based on the word
segmentation dictionary, and obtaining a candidate word set from the segmentation result; calculating
the information gain of each candidate word in the candidate word set, selecting the candidate words
whose information gain exceeds a first preset threshold as first candidate new words, and adding the
first candidate new words to the word segmentation dictionary; segmenting the text corpus again using
the word segmentation algorithm, based on the word segmentation dictionary to which the first candidate
new words have been added, and training a word vector model on the segmented text corpus; using the
trained word vector model to calculate the semantic similarity between the words in the segmentation
result and the first candidate new words; and taking the words whose semantic similarity exceeds a
second preset threshold as second candidate new words and adding the second candidate new words to the
word segmentation dictionary.
The text corpus used to expand the word segmentation dictionary is obtained as follows. A web crawler
is used to crawl, from financial websites, a large amount of financial article text related to the
financial topic to be analyzed, forming the text corpus. The crawled data is preprocessed to remove the
useless information it contains, such as garbled characters and web escape symbols, and the remaining
text data is kept as the text corpus. Next, a large amount of the text data in the corpus is classified
by sentiment orientation through manual annotation, that is, category annotation information is added
to the text data.
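The preprocessing step above can be sketched as follows, purely as an illustration. The patent does not list the exact cleaning rules, so the rules below (decoding HTML entities, dropping leftover tags and Unicode replacement characters, collapsing whitespace) and the function name clean_text are assumptions.

```python
import html
import re

def clean_text(raw):
    """Strip common crawl residue from a page fragment (illustrative rules only)."""
    text = html.unescape(raw)              # decode entities such as &nbsp; and &amp;
    text = re.sub(r"<[^>]+>", " ", text)   # drop leftover HTML tags
    text = text.replace("\ufffd", "")      # drop Unicode replacement characters
    text = re.sub(r"\s+", " ", text)       # collapse whitespace, including non-breaking spaces
    return text.strip()

cleaned = clean_text("<p>Stock&nbsp;up &amp; volume\ufffd</p>")
```

A real crawler pipeline would likely need site-specific rules in addition to these generic ones.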
With the current word segmentation dictionary as the dictionary of the preset word segmentation
algorithm, word segmentation is performed on the text corpus. The stop words in the segmentation result
are then filtered out according to a preset stop word list to remove irrelevant vocabulary, and the
remaining segmentation results form the candidate word set. The category annotation information of a
segmentation result is the same as that of the text data it comes from.
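A minimal sketch of dictionary-based segmentation followed by stop word filtering is given below. This passage does not name the preset word segmentation algorithm, so forward maximum matching is used here only as a stand-in; the function names, the sample dictionary and the stop word list are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def candidate_words(text, dictionary, stop_words):
    """Segment the text, then drop stop words to form the candidate word set."""
    return {w for w in segment(text, dictionary) if w not in stop_words}

# Illustrative entries: "convertible bond", "junk stock", "rise".
dictionary = {"可转债", "垃圾股", "上涨"}
stop_words = {"的", "了"}
tokens = segment("垃圾股的可转债上涨了", dictionary)
candidates = candidate_words("垃圾股的可转债上涨了", dictionary, stop_words)
```

Because the segmenter only recognises multi-character words that are in the dictionary, adding financial vocabulary to the dictionary directly changes the segmentation result, which is why the embodiment expands the dictionary before segmenting the training samples.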
Next, candidate new words are selected from the candidate word set according to information gain.
Information gain is an entropy-based evaluation measure. When used for feature selection, it measures
how much information the presence or absence of a word provides for judging whether a text belongs to
a given category. It is defined as the difference in information content before and after a feature
value appears in a document, and is calculated as:

$$IG(t_i) = -\sum_{j=1}^{|C|} P(C_j)\log P(C_j) + P(t_i)\sum_{j=1}^{|C|} P(C_j \mid t_i)\log P(C_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{|C|} P(C_j \mid \bar{t}_i)\log P(C_j \mid \bar{t}_i)$$

In the above formula, $P(C_j)$ is the probability that category $C_j$ occurs in the data set, $P(t_i)$
is the probability that feature item $t_i$ appears in the data set, $P(C_j \mid t_i)$ is the probability
that a document belongs to category $C_j$ given that feature item $t_i$ appears in it, $P(\bar{t}_i)$ is
the probability that feature item $t_i$ does not appear, $P(C_j \mid \bar{t}_i)$ is the probability that
a document belongs to category $C_j$ given that feature item $t_i$ does not appear in it, and $|C|$ is
the total number of categories. Here, a category is a sentiment orientation category and a feature item
is a candidate word. The above probabilities can be calculated from statistics of the candidate words in
the text corpus.
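As a hedged illustration, the information gain of a candidate word can be computed from labeled documents as below, using the usual entropy-based definition: the class entropy minus the entropy that remains once it is known whether the word appears. The document-level presence counts, the base-2 logarithm, the function names and the toy data are all assumptions made for this sketch.

```python
import math
from collections import Counter

def _plogp_sum(labels):
    """Sum of P(C_j) * log2 P(C_j) over the given labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs):
    """docs is a list of (set_of_words, label) pairs."""
    labels = [label for _, label in docs]
    with_t = [label for words, label in docs if term in words]
    without_t = [label for words, label in docs if term not in words]
    p_t = len(with_t) / len(docs)
    return (-_plogp_sum(labels)
            + p_t * _plogp_sum(with_t)
            + (1 - p_t) * _plogp_sum(without_t))

docs = [
    ({"rally", "buy"}, "positive"),
    ({"rally", "strong"}, "positive"),
    ({"junk", "sell"}, "negative"),
    ({"junk", "buy"}, "negative"),
]
ig_rally = information_gain("rally", docs)
ig_buy = information_gain("buy", docs)
# Keep candidates whose gain exceeds a (hypothetical) first preset threshold.
selected = [w for w in ("rally", "buy") if information_gain(w, docs) > 0.5]
```

In this toy corpus "rally" separates the two categories perfectly while "buy" appears equally often in both, so only "rally" passes the threshold.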
The usefulness of a candidate word is judged from the calculated information gain: the larger the
information gain, the more useful the word is for classification. The candidate words in the candidate
word set whose information gain exceeds the first preset threshold are taken as first candidate new
words and added to the current word segmentation dictionary, which expands the dictionary.
Based on the word segmentation dictionary expanded in this way, the same word segmentation algorithm is
applied to the same text corpus to obtain a new segmentation result. A word vector model is trained on
the segmented text corpus, and the trained model is used to calculate the word vector of each word
obtained by segmentation. The semantic similarity between each word obtained by segmentation and the
first candidate new words is then calculated from the word vectors. If the semantic similarity exceeds
the second preset threshold, the word is taken as a second candidate new word. The second candidate new
words selected from the segmentation result are added to the word segmentation dictionary, which
expands the dictionary once more.
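The similarity-based selection can be sketched as follows. The text does not say which similarity measure is used, so cosine similarity is an assumption; the toy three-dimensional vectors stand in for a trained word vector model, and skipping words that are already in the dictionary is likewise an illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def second_candidates(vectors, first_candidates, dictionary, threshold):
    """Words not yet known whose vector is close to any first candidate new word."""
    picked = set()
    for word, vec in vectors.items():
        if word in dictionary or word in first_candidates:
            continue
        if any(cosine(vec, vectors[f]) > threshold
               for f in first_candidates if f in vectors):
            picked.add(word)
    return picked

vectors = {  # toy vectors standing in for a trained word vector model
    "junk_stock": [0.9, 0.1, 0.0],
    "penny_stock": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
new_words = second_candidates(vectors, {"junk_stock"}, set(), threshold=0.8)
```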
It can be understood that, after word segmentation has been performed on the text corpus, the stop
words in the segmentation result can be deleted using the stop word list. These stop words are noisy
and carry no meaning for text classification, so deleting them improves classification accuracy and
also reduces the amount of computation when selecting candidate new words.
Through the above steps, the word segmentation dictionary is in effect expanded three times. The first
expansion is a preliminary one that adds manually collected financial-field vocabulary; the second
selects new words by calculating information gain; the third selects new words by calculating semantic
similarity from word vectors. The second and third expansions are each carried out on the dictionary
produced by the preceding expansion. Through this dynamic expansion, as many new financial-field words
as possible are screened out of the corpus. The word segmentation algorithm with the expanded
dictionary is used to segment the training samples of the classification model: the richer the
financial-field vocabulary in the dictionary, the more accurate the segmentation of financial-field
text, and the higher the classification accuracy of the trained classification model.
Optionally, in one implementation, after the second candidate new words have been selected and added to
the word segmentation dictionary, the word frequency of each second candidate new word in the text
corpus is calculated and used as the weight of that word in the word segmentation dictionary. The word
frequency of the first candidate new words may be calculated in the same way and used as their weight
in the word segmentation dictionary.
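A minimal sketch of using corpus word frequency as the dictionary weight is shown below. Raw occurrence counts are assumed as the frequency measure, and the function name and sample corpus are hypothetical.

```python
from collections import Counter

def frequency_weights(segmented_corpus, new_words):
    """Map each new word to its number of occurrences in the segmented corpus."""
    counts = Counter(w for doc in segmented_corpus for w in doc)
    return {w: counts[w] for w in new_words}

corpus = [
    ["junk_stock", "falls"],
    ["junk_stock", "rally"],
    ["penny_stock", "falls"],
]
weights = frequency_weights(corpus, {"junk_stock", "penny_stock"})
```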
Step S30: obtain a sample set and label the category of each training sample in the sample set
according to a preset sentiment orientation classification scheme.
The sample set used to train the text classification model is obtained. For each training sample in
the sample set, the annotations made by multiple annotators according to the preset sentiment
orientation classification scheme are obtained, and the annotation that occurs most often among them is
selected as the annotation result for that training sample. The user can set the sentiment orientation
classification scheme according to the financial matter to be analyzed. For example, stock forum text
may be divided into hold, sell and buy; microblog stock discussion text may be divided into positive,
negative and neutral; and financial news text may be divided into positive, negative and neutral.
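Selecting the most frequent annotation can be sketched as follows. The text does not say how ties between annotators are resolved; in this sketch the label seen first wins, which is only an assumption, and the function name is hypothetical.

```python
from collections import Counter

def majority_label(annotations):
    """Return the label chosen by the most annotators (ties: first label seen wins)."""
    return Counter(annotations).most_common(1)[0][0]

label = majority_label(["buy", "hold", "buy"])
```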
Step S40: perform word segmentation on the training samples in the sample set using a preset word
segmentation algorithm, based on the word segmentation dictionary to which the candidate new words have
been added.
Step S50: extract word vectors from the word segmentation results and, based on the adaboost algorithm,
input the word vectors of the training samples and the labeled category information into a plurality of
preset weak classifiers for training, and combine the trained weak classifiers into a text
classification model for the financial field.
After the training samples have been labeled, they are segmented with the preset word segmentation
algorithm, based on the repeatedly expanded word segmentation dictionary, and the word vectors of the
segmentation results are extracted with the trained word vector model. It should be noted that the same
word segmentation algorithm is used throughout the scheme of this embodiment.
In this embodiment, to improve the accuracy of the text classification model, word vectors are extracted
with both a word2vec model and a GloVe model, so each word segmentation result yields two kinds of word
vectors. In addition, a classifier based on a convolutional neural network algorithm, a classifier
based on a recurrent neural network algorithm and a classifier based on a long short-term memory
network algorithm are used as weak classifiers. Each weak classifier takes each of the two kinds of word
vectors as input, so six weak classification models are in effect built. Based on the adaboost
algorithm, each weak classifier is trained with the samples in the sample set. During training, if a
sample has been classified correctly, its weight is reduced when the next sample set is constructed; if
a sample has not been classified correctly, its weight is increased when the next sample set is
constructed. The sample set with updated weights is used to train the next classifier, and the whole
training process proceeds iteratively in this way. In addition, after each weak classifier has been
trained, the weight of a weak classifier with a small classification error rate is increased so that it
plays a larger role in the final classification function, and the weight of a weak classifier with a
large classification error rate is reduced so that it plays a smaller role. Each weak classifier is
trained iteratively according to this process. The trained weak classifiers are then fused to form the
final text classification model. This text classification model can be used to classify the sentiment
orientation of financial-field text, for example to judge whether a stock discussion text in a forum is
negative, positive or neutral.
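The fusion of the six weak models can be illustrated with a weighted vote, sketched below. The patent does not give the form of the final classification function, so the weighted vote, the model identifiers, and the example predictions and classifier weights are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

EMBEDDINGS = ["word2vec", "glove"]
ARCHITECTURES = ["cnn", "rnn", "lstm"]

# Two kinds of word vectors x three network types = six weak models.
weak_models = list(product(EMBEDDINGS, ARCHITECTURES))

def fuse(predictions, alphas):
    """Weighted vote: each weak model adds its weight to the label it predicts."""
    scores = defaultdict(float)
    for model, label in predictions.items():
        scores[label] += alphas[model]
    return max(scores, key=scores.get)

# Hypothetical predictions and classifier weights for one forum post.
predictions = {m: "positive" for m in weak_models}
predictions[("glove", "rnn")] = "negative"
predictions[("word2vec", "rnn")] = "negative"
alphas = {m: 1.0 for m in weak_models}
alphas[("glove", "rnn")] = 0.4
alphas[("word2vec", "rnn")] = 0.3
label = fuse(predictions, alphas)
```

Because votes are weighted, a single weak model with a low error rate (large weight) can outvote several weak models with high error rates.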
The method for generating a text classification model proposed in this embodiment mines a text corpus
of the financial field, screens out as many new financial-field words from the corpus as possible and
adds them to the word segmentation dictionary, thereby expanding the financial-field word segmentation
dictionary. It then performs word segmentation on the training samples in the sample set using the
dictionary expanded with financial vocabulary, labels the sample data in the sample set according to a
preset sentiment orientation classification scheme, and finally trains a text classification model that
can be applied to sentiment orientation classification problems in the financial field.
In addition, an embodiment of the present invention also proposes a computer readable storage medium
on which a model generation program is stored. The model generation program can be executed by one or
more processors to implement the following operations:
Obtaining a financial-field word segmentation dictionary built from collected financial-field
vocabulary, and a preset financial-field text corpus;
Selecting candidate new words from the text corpus according to a preset algorithm and adding them to
the word segmentation dictionary;
Obtaining a sample set and labeling the category of each training sample in the sample set according to
a preset sentiment orientation classification scheme;
Performing word segmentation on the training samples in the sample set using a preset word segmentation
algorithm, based on the word segmentation dictionary to which the candidate new words have been added;
Extracting word vectors from the word segmentation results and, based on the adaboost algorithm,
inputting the word vectors of the training samples and the labeled category information into a
plurality of preset weak classifiers for training, and combining the trained weak classifiers into a
text classification model for the financial field.
The specific implementations of the computer readable storage medium of the present invention are
substantially the same as the embodiments of the above device and method for generating a text
classification model, and are not repeated here.
It should be noted that the above embodiments of the present invention are for description only and do
not indicate the relative merits of the embodiments. The terms "include" and "comprise", and any other
variants thereof, are intended to cover non-exclusive inclusion, so that a process, device, article or
method that includes a series of elements includes not only those elements but also other elements not
explicitly listed, or elements inherent to that process, device, article or method. Without further
limitation, an element defined by the phrase "including a ..." does not exclude the existence of other
identical elements in the process, device, article or method that includes the element.
From the above description of the embodiments, those skilled in the art can clearly understand that the
methods of the above embodiments can be implemented by software together with the necessary
general-purpose hardware platform, or by hardware alone, although in many cases the former is the
better implementation. Based on this understanding, the technical solution of the present invention, in
essence or in the part that contributes over the prior art, can be embodied in the form of a software
product. The computer software product is stored in a storage medium as described above (such as
ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device
(which may be a mobile phone, a computer, a server, a network device or the like) to execute the
methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the
invention. Any equivalent structure or equivalent process transformation made using the contents of
the specification and drawings of the present invention, or any direct or indirect application in other
related technical fields, is likewise included within the scope of patent protection of the present
invention.
Claims (10)
1. A device for generating a text classification model, characterized in that the device comprises a
memory and a processor, the memory stores a model generation program executable on the processor, and
the model generation program, when executed by the processor, implements the following steps:
Obtaining a financial-field word segmentation dictionary built from collected financial-field
vocabulary, and a preset financial-field text corpus;
Selecting candidate new words from the text corpus according to a preset algorithm and adding them to
the word segmentation dictionary;
Obtaining a sample set and labeling the category of each training sample in the sample set according to
a preset sentiment orientation classification scheme;
Performing word segmentation on the training samples in the sample set using a preset word segmentation
algorithm, based on the word segmentation dictionary to which the candidate new words have been added;
Extracting word vectors from the word segmentation results and, based on the adaboost algorithm,
inputting the word vectors of the training samples and the labeled category information into a
plurality of preset weak classifiers for training, and combining the trained weak classifiers into a
text classification model for the financial field.
2. The device for generating a text classification model according to claim 1, characterized in that
the step of selecting candidate new words from the text corpus according to a preset algorithm and
adding them to the word segmentation dictionary comprises:
Performing word segmentation on the text corpus using the word segmentation algorithm, based on the
word segmentation dictionary, and obtaining a candidate word set from the segmentation result;
Calculating the information gain of each candidate word in the candidate word set, selecting the
candidate words whose information gain exceeds a first preset threshold as first candidate new words,
and adding the first candidate new words to the word segmentation dictionary;
Segmenting the text corpus using the word segmentation algorithm, based on the word segmentation
dictionary to which the first candidate new words have been added, and training a word vector model on
the segmented text corpus;
Using the trained word vector model to calculate the semantic similarity between the words in the
segmentation result and the first candidate new words;
Taking the words whose semantic similarity exceeds a second preset threshold as second candidate new
words, and adding the second candidate new words to the word segmentation dictionary.
3. The device for generating a text classification model according to claim 2, characterized in that
the processor is further configured to execute the model generation program so as to implement, after
the step of taking the words whose semantic similarity exceeds the second preset threshold as second
candidate new words and adding the second candidate new words to the word segmentation dictionary, the
following step:
Calculating the word frequency of the second candidate new words in the text corpus, and using the
calculated word frequency as the weight of the second candidate new words in the word segmentation
dictionary.
4. The device for generating a text classification model according to any one of claims 1 to 3,
characterized in that the step of obtaining a sample set and labeling the category of each training
sample in the sample set according to a preset sentiment orientation classification scheme comprises:
Obtaining the sample set, obtaining the annotations made by multiple annotators on the training samples
in the sample set according to the preset sentiment orientation classification scheme, and selecting,
from the multiple annotations, the annotation that occurs most often as the annotation result of the
corresponding training sample.
5. The device for generating a text classification model according to any one of claims 1 to 3,
characterized in that the weak classifiers comprise a classifier based on a convolutional neural
network algorithm, a classifier based on a recurrent neural network algorithm and a classifier based on
a long short-term memory network algorithm.
6. A method for generating a text classification model, characterized in that the method comprises:
Obtaining a financial-field word segmentation dictionary built from collected financial-field
vocabulary, and a preset financial-field text corpus;
Selecting candidate new words from the text corpus according to a preset algorithm and adding them to
the word segmentation dictionary;
Obtaining a sample set and labeling the category of each training sample in the sample set according to
a preset sentiment orientation classification scheme;
Performing word segmentation on the training samples in the sample set using a preset word segmentation
algorithm, based on the word segmentation dictionary to which the candidate new words have been added;
Extracting word vectors from the word segmentation results and, based on the adaboost algorithm,
inputting the word vectors of the training samples and the labeled category information into a
plurality of preset weak classifiers for training, and combining the trained weak classifiers into a
text classification model for the financial field.
7. The method for generating a text classification model according to claim 6, characterized in that
the step of selecting candidate new words from the text corpus according to a preset algorithm and
adding them to the word segmentation dictionary comprises:
Performing word segmentation on the text corpus using the word segmentation algorithm, based on the
word segmentation dictionary, and obtaining a candidate word set from the segmentation result;
Calculating the information gain of each candidate word in the candidate word set, selecting the
candidate words whose information gain exceeds a first preset threshold as first candidate new words,
and adding the first candidate new words to the word segmentation dictionary;
Segmenting the text corpus using the word segmentation algorithm, based on the word segmentation
dictionary to which the first candidate new words have been added, and training a word vector model on
the segmented text corpus;
Using the trained word vector model to calculate the semantic similarity between the words in the
segmentation result and the first candidate new words;
Taking the words whose semantic similarity exceeds a second preset threshold as second candidate new
words, and adding the second candidate new words to the word segmentation dictionary.
8. The method for generating a text classification model according to claim 7, characterized in that,
after the step of taking the words whose semantic similarity exceeds the second preset threshold as
second candidate new words and adding the second candidate new words to the word segmentation
dictionary, the method further comprises the step of:
Calculating the word frequency of the second candidate new words in the text corpus, and using the
calculated word frequency as the weight of the second candidate new words in the word segmentation
dictionary.
9. The method for generating a text classification model according to any one of claims 6 to 8,
characterized in that the step of obtaining a sample set and labeling the category of each training
sample in the sample set according to a preset sentiment orientation classification scheme comprises:
Obtaining the sample set, obtaining the annotations made by multiple annotators on the training samples
in the sample set according to the preset sentiment orientation classification scheme, and selecting,
from the multiple annotations, the annotation that occurs most often as the annotation result of the
corresponding training sample.
10. A computer readable storage medium, characterized in that a model generation program is stored on
the computer readable storage medium, and the model generation program can be executed by one or more
processors to implement the steps of the method for generating a text classification model according to
any one of claims 6 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810361702.0A CN108804512B (en) | 2018-04-20 | 2018-04-20 | Text classification model generation device and method and computer readable storage medium |
PCT/CN2018/102400 WO2019200806A1 (en) | 2018-04-20 | 2018-08-27 | Device for generating text classification model, method, and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810361702.0A CN108804512B (en) | 2018-04-20 | 2018-04-20 | Text classification model generation device and method and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108804512A true CN108804512A (en) | 2018-11-13 |
CN108804512B CN108804512B (en) | 2020-11-24 |
Family
ID=64093733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810361702.0A Active CN108804512B (en) | 2018-04-20 | 2018-04-20 | Text classification model generation device and method and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108804512B (en) |
WO (1) | WO2019200806A1 (en) |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299276A (en) * | 2018-11-15 | 2019-02-01 | 阿里巴巴集团控股有限公司 | Method and device for converting text into word embedding and text classification |
CN109614499A (en) * | 2018-11-22 | 2019-04-12 | 阿里巴巴集团控股有限公司 | Dictionary generation method, new word discovery method, device and electronic equipment |
CN109685156A (en) * | 2018-12-30 | 2019-04-26 | 浙江新铭智能科技有限公司 | Method for obtaining a classifier for emotion recognition |
CN109684634A (en) * | 2018-12-17 | 2019-04-26 | 北京百度网讯科技有限公司 | Sentiment analysis method, apparatus, equipment and storage medium |
CN109741190A (en) * | 2018-12-27 | 2019-05-10 | 清华大学 | Method, system and equipment for classifying individual stock announcements |
CN109783632A (en) * | 2019-02-15 | 2019-05-21 | 腾讯科技(深圳)有限公司 | Customer service information-pushing method, device, computer equipment and storage medium |
CN109783800A (en) * | 2018-12-13 | 2019-05-21 | 北京百度网讯科技有限公司 | Emotion keyword acquisition method, device, equipment and storage medium |
CN109871889A (en) * | 2019-01-31 | 2019-06-11 | 内蒙古工业大学 | Mass psychology appraisal procedure under emergency event |
CN109871444A (en) * | 2019-01-16 | 2019-06-11 | 北京邮电大学 | Text classification method and system |
CN110008464A (en) * | 2019-01-02 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Construction method, device, server and readable storage medium for a business dictionary |
CN110059187A (en) * | 2019-04-10 | 2019-07-26 | 华侨大学 | Deep learning text classification method integrating shallow semantic pre-judging mode |
CN110188204A (en) * | 2019-06-11 | 2019-08-30 | 腾讯科技(深圳)有限公司 | Extended corpus mining method and device, server and storage medium |
CN110210028A (en) * | 2019-05-30 | 2019-09-06 | 杭州远传新业科技有限公司 | Method, device, equipment and medium for extracting domain feature words aiming at voice translation text |
CN110232914A (en) * | 2019-05-20 | 2019-09-13 | 平安普惠企业管理有限公司 | Semantic recognition method, device and related equipment |
CN110347821A (en) * | 2019-05-29 | 2019-10-18 | 华东理工大学 | Text category labeling method, electronic equipment and readable storage medium |
CN110457475A (en) * | 2019-07-25 | 2019-11-15 | 阿里巴巴集团控股有限公司 | Method and system for text classification system construction and annotation corpus expansion |
CN110489556A (en) * | 2019-08-22 | 2019-11-22 | 重庆锐云科技有限公司 | Quality evaluating method, device, server and storage medium about follow-up record |
CN110597958A (en) * | 2019-09-12 | 2019-12-20 | 苏州思必驰信息科技有限公司 | Text classification model training and using method and device |
CN110674289A (en) * | 2019-07-04 | 2020-01-10 | 南瑞集团有限公司 | Method, device and storage medium for judging article belonged classification based on word segmentation weight |
CN110704581A (en) * | 2019-09-11 | 2020-01-17 | 阿里巴巴集团控股有限公司 | Computer-executed text emotion analysis method and device |
CN110990567A (en) * | 2019-11-25 | 2020-04-10 | 国家电网有限公司 | Electric power audit text classification method for enhancing domain features |
CN111104510A (en) * | 2019-11-15 | 2020-05-05 | 南京中新赛克科技有限责任公司 | Word embedding-based text classification training sample expansion method |
CN111144097A (en) * | 2019-12-25 | 2020-05-12 | 华中科技大学鄂州工业技术研究院 | Modeling method and device for emotion tendency classification model of dialog text |
CN111143569A (en) * | 2019-12-31 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN111159589A (en) * | 2019-12-30 | 2020-05-15 | 中国银联股份有限公司 | Classification dictionary establishing method, merchant data classification method, device and equipment |
CN111177378A (en) * | 2019-12-20 | 2020-05-19 | 北京淇瑀信息科技有限公司 | Text mining method and device and electronic equipment |
CN111177403A (en) * | 2019-12-16 | 2020-05-19 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN111325033A (en) * | 2020-03-20 | 2020-06-23 | 中国建设银行股份有限公司 | Entity identification method, entity identification device, electronic equipment and computer readable storage medium |
CN111339268A (en) * | 2020-02-19 | 2020-06-26 | 北京百度网讯科技有限公司 | Entity word recognition method and device |
CN111368555A (en) * | 2020-05-27 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Data identification method and device, storage medium and electronic equipment |
CN111401030A (en) * | 2018-12-28 | 2020-07-10 | 北京嘀嘀无限科技发展有限公司 | Service abnormity identification method, device, server and readable storage medium |
CN111444326A (en) * | 2020-03-30 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text data processing method, device, equipment and storage medium |
CN111523308A (en) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Chinese word segmentation method and device and computer equipment |
CN111782803A (en) * | 2020-06-05 | 2020-10-16 | 京东数字科技控股有限公司 | Work order processing method and device, electronic equipment and storage medium |
CN112417860A (en) * | 2020-12-08 | 2021-02-26 | 携程计算机技术(上海)有限公司 | Training sample enhancement method, system, device and storage medium |
CN112445907A (en) * | 2019-09-02 | 2021-03-05 | 顺丰科技有限公司 | Text emotion classification method, device and equipment and storage medium |
CN112579768A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Emotion classification model training method, text emotion classification method and text emotion classification device |
CN112632971A (en) * | 2020-12-18 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching |
CN112926631A (en) * | 2021-02-01 | 2021-06-08 | 大箴(杭州)科技有限公司 | Financial text classification method and device and computer equipment |
CN113051401A (en) * | 2021-04-06 | 2021-06-29 | 明品云(北京)数据科技有限公司 | Text structured labeling method, system, device and medium |
WO2021134524A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Data processing method, apparatus, electronic device, and storage medium |
CN113111175A (en) * | 2020-04-28 | 2021-07-13 | 北京明亿科技有限公司 | Extreme behavior identification method, device, equipment and medium based on deep learning model |
CN113177109A (en) * | 2021-05-27 | 2021-07-27 | 中国平安人寿保险股份有限公司 | Text weak labeling method, device, equipment and storage medium |
CN113240485A (en) * | 2021-05-10 | 2021-08-10 | 北京沃东天骏信息技术有限公司 | Training method of text generation model, and text generation method and device |
CN113642678A (en) * | 2021-10-12 | 2021-11-12 | 南京山猫齐动信息技术有限公司 | Method, device and storage medium for generating confrontation message sample |
CN113723114A (en) * | 2021-08-31 | 2021-11-30 | 平安普惠企业管理有限公司 | Semantic analysis method, device and equipment based on multi-intent recognition and storage medium |
CN114091469A (en) * | 2021-11-23 | 2022-02-25 | 杭州萝卜智能技术有限公司 | Sample expansion based network public opinion analysis method |
CN115861606A (en) * | 2022-05-09 | 2023-03-28 | 北京中关村科金技术有限公司 | Method and device for classifying long-tail distribution documents and storage medium |
Families Citing this family (62)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879934B (en) * | 2019-10-31 | 2023-05-23 | 杭州电子科技大学 | Text prediction method based on Wide & Deep learning model |
CN110837732B (en) * | 2019-10-31 | 2024-01-26 | 北京奇艺世纪科技有限公司 | Method and device for identifying intimacy between target persons, electronic equipment and storage medium |
CN111125323B (en) * | 2019-11-21 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Chat corpus labeling method and device, electronic equipment and storage medium |
CN112861533A (en) * | 2019-11-26 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Entity word recognition method and device |
CN111046177A (en) * | 2019-11-26 | 2020-04-21 | 方正璞华软件(武汉)股份有限公司 | Automatic arbitration case prejudging method and device |
CN110991612A (en) * | 2019-11-29 | 2020-04-10 | 交通银行股份有限公司 | Message analysis method of international routine real-time reasoning model based on word vector |
CN110968702B (en) * | 2019-11-29 | 2023-05-09 | 北京明略软件系统有限公司 | Method and device for extracting rational relation |
CN111078546B (en) * | 2019-12-05 | 2023-06-16 | 北京云聚智慧科技有限公司 | Page feature expression method and electronic equipment |
CN111078883A (en) * | 2019-12-13 | 2020-04-28 | 北京明略软件系统有限公司 | Risk index analysis method and device, electronic equipment and storage medium |
CN111191119B (en) * | 2019-12-16 | 2023-12-12 | 绍兴市上虞区理工高等研究院 | Neural network-based scientific and technological achievement self-learning method and device |
CN112989032A (en) * | 2019-12-17 | 2021-06-18 | 医渡云(北京)技术有限公司 | Entity relationship classification method, apparatus, medium and electronic device |
CN111309855A (en) * | 2019-12-24 | 2020-06-19 | 中国银行股份有限公司 | Text information processing method and system |
CN113302683B (en) * | 2019-12-24 | 2023-08-04 | 深圳市优必选科技股份有限公司 | Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium |
CN113052191A (en) * | 2019-12-26 | 2021-06-29 | 航天信息股份有限公司 | Training method, device, equipment and medium of neural language network model |
CN111125317A (en) * | 2019-12-27 | 2020-05-08 | 携程计算机技术(上海)有限公司 | Model training, classification, system, device and medium for conversational text classification |
CN111221950A (en) * | 2019-12-30 | 2020-06-02 | 航天信息股份有限公司 | Method and device for analyzing weak emotion of user |
CN111259148B (en) * | 2020-01-19 | 2024-03-26 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111309859B (en) * | 2020-01-21 | 2023-07-07 | 上饶市中科院云计算中心大数据研究院 | Scenic spot network public praise emotion analysis method and device |
CN111310464B (en) * | 2020-02-17 | 2024-02-02 | 北京明略软件系统有限公司 | Word vector acquisition model generation method and device and word vector acquisition method and device |
CN111325562B (en) * | 2020-02-17 | 2023-08-01 | 武汉轻工大学 | Grain safety tracing system and method |
CN111367962B (en) * | 2020-02-28 | 2024-01-30 | 北京金堤科技有限公司 | Database updating method and device, computer readable storage medium and electronic equipment |
CN113449097A (en) * | 2020-03-24 | 2021-09-28 | 百度在线网络技术(北京)有限公司 | Method and device for generating countermeasure sample, electronic equipment and storage medium |
CN111309920B (en) * | 2020-03-26 | 2023-03-24 | 清华大学深圳国际研究生院 | Text classification method, terminal equipment and computer readable storage medium |
CN111680225B (en) * | 2020-04-26 | 2023-08-18 | 国家计算机网络与信息安全管理中心 | WeChat financial message analysis method and system based on machine learning |
CN111652281B (en) * | 2020-04-30 | 2023-08-18 | 中国平安财产保险股份有限公司 | Information data classification method, device and readable storage medium |
CN111680155A (en) * | 2020-05-13 | 2020-09-18 | 新华网股份有限公司 | Text classification method and device, electronic equipment and computer storage medium |
CN111737993B (en) * | 2020-05-26 | 2024-04-02 | 浙江华云电力工程设计咨询有限公司 | Method for extracting equipment health state from fault defect text of power distribution network equipment |
CN111709233B (en) * | 2020-05-27 | 2023-04-18 | 西安交通大学 | Intelligent diagnosis guiding method and system based on multi-attention convolutional neural network |
CN111601314B (en) * | 2020-05-27 | 2023-04-28 | 北京亚鸿世纪科技发展有限公司 | Method and device for double judging bad short message by pre-training model and short message address |
CN111680804B (en) * | 2020-06-02 | 2023-09-01 | 中国电力科学研究院有限公司 | Method, equipment and computer readable medium for generating operation checking work ticket |
CN111680803B (en) * | 2020-06-02 | 2023-09-01 | 中国电力科学研究院有限公司 | Operation checking work ticket generation system |
CN111832292B (en) * | 2020-06-03 | 2024-02-02 | 北京百度网讯科技有限公司 | Text recognition processing method, device, electronic equipment and storage medium |
CN113761882A (en) * | 2020-06-08 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Dictionary construction method and device |
CN111737999A (en) * | 2020-06-24 | 2020-10-02 | 深圳前海微众银行股份有限公司 | Sequence labeling method, device and equipment and readable storage medium |
CN111783451A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Method and apparatus for enhancing text samples |
CN114004234A (en) * | 2020-07-28 | 2022-02-01 | 深圳Tcl数字技术有限公司 | Semantic recognition method, storage medium and terminal equipment |
CN111930942B (en) * | 2020-08-07 | 2023-08-15 | 腾讯云计算(长沙)有限责任公司 | Text classification method, language model training method, device and equipment |
CN111966944B (en) * | 2020-08-17 | 2024-04-09 | 中电科大数据研究院有限公司 | Model construction method for multi-level user comment security audit |
CN112015895A (en) * | 2020-08-26 | 2020-12-01 | 广东电网有限责任公司 | Patent text classification method and device |
CN112016319B (en) * | 2020-09-08 | 2023-12-15 | 平安科技(深圳)有限公司 | Pre-training model acquisition and disease entity labeling method, device and storage medium |
CN112101015B (en) * | 2020-09-08 | 2024-01-26 | 腾讯科技(深圳)有限公司 | Method and device for identifying multi-label object |
CN114281928A (en) * | 2020-09-28 | 2022-04-05 | 中国移动通信集团广西有限公司 | Model generation method, device and equipment based on text data |
CN113392209B (en) * | 2020-10-26 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Text clustering method based on artificial intelligence, related equipment and storage medium |
CN112287639A (en) * | 2020-10-30 | 2021-01-29 | 上海中通吉网络技术有限公司 | Intelligent customer service work order classification method |
CN112529743B (en) * | 2020-12-18 | 2023-08-08 | 平安银行股份有限公司 | Contract element extraction method, device, electronic equipment and medium |
CN112650837B (en) * | 2020-12-28 | 2023-12-12 | 上海秒针网络科技有限公司 | Text quality control method and system combining classification algorithm and unsupervised algorithm |
CN112765936B (en) * | 2020-12-31 | 2024-02-23 | 出门问问(武汉)信息科技有限公司 | Training method and device for operation based on language model |
CN112784061A (en) * | 2021-01-27 | 2021-05-11 | 数贸科技(北京)有限公司 | Knowledge graph construction method and device, computing equipment and storage medium |
CN112948573B (en) * | 2021-02-05 | 2024-04-02 | 北京百度网讯科技有限公司 | Text label extraction method, device, equipment and computer storage medium |
CN112948583A (en) * | 2021-02-26 | 2021-06-11 | 中国光大银行股份有限公司 | Data classification method and device, storage medium and electronic device |
CN113011183B (en) * | 2021-03-23 | 2023-09-05 | 北京科东电力控制系统有限责任公司 | Unstructured text data processing method and system in electric power regulation and control field |
CN113033198B (en) * | 2021-03-25 | 2022-08-26 | 平安国际智慧城市科技股份有限公司 | Similar text pushing method and device, electronic equipment and computer storage medium |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
CN113377965B (en) * | 2021-06-30 | 2024-02-23 | 中国农业银行股份有限公司 | Method and related device for sensing text keywords |
CN113627530B (en) * | 2021-08-11 | 2023-09-15 | 中国平安人寿保险股份有限公司 | Similar problem text generation method, device, equipment and medium |
CN114090601B (en) * | 2021-11-23 | 2023-11-03 | 北京百度网讯科技有限公司 | Data screening method, device, equipment and storage medium |
CN114638195B (en) * | 2022-01-21 | 2022-11-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-task learning-based ground detection method |
CN114443849B (en) * | 2022-02-09 | 2023-10-27 | 北京百度网讯科技有限公司 | Labeling sample selection method and device, electronic equipment and storage medium |
CN116307792B (en) * | 2022-10-12 | 2024-03-12 | 广州市阿尔法软件信息技术有限公司 | Urban physical examination subject scene-oriented evaluation method and device |
CN115952290B (en) * | 2023-03-09 | 2023-06-02 | 太极计算机股份有限公司 | Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning |
CN116361463B (en) * | 2023-03-27 | 2023-12-08 | 应急管理部国家减灾中心(应急管理部卫星减灾应用中心) | Earthquake disaster information extraction method, device, equipment and medium |
CN117093715B (en) * | 2023-10-18 | 2023-12-29 | 湖南财信数字科技有限公司 | Word stock expansion method, system, computer equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104142913A (en) * | 2013-05-07 | 2014-11-12 | 株式会社日立制作所 | Distinguishing method and distinguishing system for polarities of words and expressions |
CN104331506A (en) * | 2014-11-20 | 2015-02-04 | 北京理工大学 | Multiclass emotion analyzing method and system facing bilingual microblog text |
CN105022725A (en) * | 2015-07-10 | 2015-11-04 | 河海大学 | Text emotional tendency analysis method applied to field of financial Web |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
CN106598940A (en) * | 2016-11-01 | 2017-04-26 | 四川用联信息技术有限公司 | Text similarity solution algorithm based on global optimization of keyword quality |
CN107122382A (en) * | 2017-02-16 | 2017-09-01 | 江苏大学 | A kind of patent classification method based on specification |
WO2017202125A1 (en) * | 2016-05-25 | 2017-11-30 | 华为技术有限公司 | Text classification method and apparatus |
CN107491531A (en) * | 2017-08-18 | 2017-12-19 | 华南师范大学 | Chinese network comment sensibility classification method based on integrated study framework |
US20180032508A1 (en) * | 2016-07-28 | 2018-02-01 | Abbyy Infopoisk Llc | Aspect-based sentiment analysis using machine learning methods |
CN107729374A (en) * | 2017-09-13 | 2018-02-23 | 厦门快商通科技股份有限公司 | A kind of extending method of sentiment dictionary and text emotion recognition methods |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023967A (en) * | 2010-11-11 | 2011-04-20 | 清华大学 | Text emotion classifying method in stock field |
US8676730B2 (en) * | 2011-07-11 | 2014-03-18 | Accenture Global Services Limited | Sentiment classifiers based on feature extraction |
CN103559174B (en) * | 2013-09-30 | 2016-03-09 | 东软集团股份有限公司 | Semantic emotion classification characteristic value extraction and system |
SG11201704150WA (en) * | 2014-11-24 | 2017-06-29 | Agency Science Tech & Res | A method and system for sentiment classification and emotion classification |
CN106547738B (en) * | 2016-11-02 | 2019-05-07 | 北京亿美软通科技有限公司 | A kind of overdue short message intelligent method of discrimination of financial class based on text mining |
2018
- 2018-04-20 CN CN201810361702.0A: CN108804512B, Active
- 2018-08-27 WO PCT/CN2018/102400: WO2019200806A1, Application Filing
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104142913A (en) * | 2013-05-07 | 2014-11-12 | 株式会社日立制作所 | Distinguishing method and distinguishing system for polarities of words and expressions |
CN104331506A (en) * | 2014-11-20 | 2015-02-04 | 北京理工大学 | Multiclass emotion analyzing method and system facing bilingual microblog text |
CN105022725A (en) * | 2015-07-10 | 2015-11-04 | 河海大学 | Text emotional tendency analysis method applied to field of financial Web |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
WO2017202125A1 (en) * | 2016-05-25 | 2017-11-30 | 华为技术有限公司 | Text classification method and apparatus |
US20180032508A1 (en) * | 2016-07-28 | 2018-02-01 | Abbyy Infopoisk Llc | Aspect-based sentiment analysis using machine learning methods |
CN106598940A (en) * | 2016-11-01 | 2017-04-26 | 四川用联信息技术有限公司 | Text similarity solution algorithm based on global optimization of keyword quality |
CN107122382A (en) * | 2017-02-16 | 2017-09-01 | 江苏大学 | A kind of patent classification method based on specification |
CN107491531A (en) * | 2017-08-18 | 2017-12-19 | 华南师范大学 | Chinese network comment sensibility classification method based on integrated study framework |
CN107729374A (en) * | 2017-09-13 | 2018-02-23 | 厦门快商通科技股份有限公司 | A kind of extending method of sentiment dictionary and text emotion recognition methods |
Non-Patent Citations (1)
Title |
---|
爱暖手的苦咖啡: "adaboost", 《HTTPS://BAIKE.BAIDU.COM/HISTORY/ADABOOST/4531273/122627277》 * |
Cited By (72)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299276B (en) * | 2018-11-15 | 2021-11-19 | 创新先进技术有限公司 | Method and device for converting text into word embedding and text classification |
CN109299276A (en) * | 2018-11-15 | 2019-02-01 | 阿里巴巴集团控股有限公司 | Method and device for converting text into word embedding and text classification |
CN109614499A (en) * | 2018-11-22 | 2019-04-12 | 阿里巴巴集团控股有限公司 | Dictionary generation method, new word discovery method, device and electronic equipment |
CN109614499B (en) * | 2018-11-22 | 2023-02-17 | 创新先进技术有限公司 | Dictionary generation method, new word discovery method, device and electronic equipment |
CN109783800A (en) * | 2018-12-13 | 2019-05-21 | 北京百度网讯科技有限公司 | Emotion keyword acquisition method, device, equipment and storage medium |
CN109783800B (en) * | 2018-12-13 | 2024-04-12 | 北京百度网讯科技有限公司 | Emotion keyword acquisition method, device, equipment and storage medium |
CN109684634A (en) * | 2018-12-17 | 2019-04-26 | 北京百度网讯科技有限公司 | Sentiment analysis method, apparatus, equipment and storage medium |
CN109684634B (en) * | 2018-12-17 | 2023-07-25 | 北京百度网讯科技有限公司 | Emotion analysis method, device, equipment and storage medium |
CN109741190A (en) * | 2018-12-27 | 2019-05-10 | 清华大学 | Method, system and equipment for classifying individual stock announcements |
CN111401030A (en) * | 2018-12-28 | 2020-07-10 | 北京嘀嘀无限科技发展有限公司 | Service abnormity identification method, device, server and readable storage medium |
CN111401030B (en) * | 2018-12-28 | 2024-01-09 | 北京嘀嘀无限科技发展有限公司 | Method and device for identifying service abnormality, server and readable storage medium |
CN109685156A (en) * | 2018-12-30 | 2019-04-26 | 浙江新铭智能科技有限公司 | Method for obtaining a classifier for emotion recognition |
CN110008464A (en) * | 2019-01-02 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Construction method, device, server and readable storage medium for a business dictionary |
CN109871444A (en) * | 2019-01-16 | 2019-06-11 | 北京邮电大学 | Text classification method and system |
CN109871889A (en) * | 2019-01-31 | 2019-06-11 | 内蒙古工业大学 | Mass psychology appraisal procedure under emergency event |
CN109783632A (en) * | 2019-02-15 | 2019-05-21 | 腾讯科技(深圳)有限公司 | Customer service information-pushing method, device, computer equipment and storage medium |
CN110059187B (en) * | 2019-04-10 | 2022-06-07 | 华侨大学 | Deep learning text classification method integrating shallow semantic pre-judging mode |
CN110059187A (en) * | 2019-04-10 | 2019-07-26 | 华侨大学 | Deep learning text classification method integrating shallow semantic pre-judging mode |
CN110232914A (en) * | 2019-05-20 | 2019-09-13 | 平安普惠企业管理有限公司 | Semantic recognition method, device and related equipment |
CN110347821B (en) * | 2019-05-29 | 2023-08-25 | 华东理工大学 | Text category labeling method, electronic equipment and readable storage medium |
CN110347821A (en) * | 2019-05-29 | 2019-10-18 | 华东理工大学 | Text category labeling method, electronic equipment and readable storage medium |
CN110210028B (en) * | 2019-05-30 | 2023-04-28 | 杭州远传新业科技股份有限公司 | Method, device, equipment and medium for extracting domain feature words aiming at voice translation text |
CN110210028A (en) * | 2019-05-30 | 2019-09-06 | 杭州远传新业科技有限公司 | Method, device, equipment and medium for extracting domain feature words aiming at voice translation text |
CN110188204A (en) * | 2019-06-11 | 2019-08-30 | 腾讯科技(深圳)有限公司 | Extended corpus mining method and device, server and storage medium |
CN110188204B (en) * | 2019-06-11 | 2022-10-04 | 腾讯科技(深圳)有限公司 | Extended corpus mining method and device, server and storage medium |
CN110674289A (en) * | 2019-07-04 | 2020-01-10 | 南瑞集团有限公司 | Method, device and storage medium for judging article belonged classification based on word segmentation weight |
CN110457475B (en) * | 2019-07-25 | 2023-06-30 | 创新先进技术有限公司 | Method and system for text classification system construction and annotation corpus expansion |
CN110457475A (en) * | 2019-07-25 | 2019-11-15 | 阿里巴巴集团控股有限公司 | Method and system for text classification system construction and annotation corpus expansion |
CN110489556A (en) * | 2019-08-22 | 2019-11-22 | 重庆锐云科技有限公司 | Quality evaluating method, device, server and storage medium about follow-up record |
CN112445907A (en) * | 2019-09-02 | 2021-03-05 | 顺丰科技有限公司 | Text emotion classification method, device and equipment and storage medium |
CN110704581B (en) * | 2019-09-11 | 2024-03-08 | 创新先进技术有限公司 | Text emotion analysis method and device executed by computer |
CN110704581A (en) * | 2019-09-11 | 2020-01-17 | 阿里巴巴集团控股有限公司 | Computer-executed text emotion analysis method and device |
CN110597958A (en) * | 2019-09-12 | 2019-12-20 | 苏州思必驰信息科技有限公司 | Text classification model training and using method and device |
CN110597958B (en) * | 2019-09-12 | 2022-03-25 | 思必驰科技股份有限公司 | Text classification model training and using method and device |
CN112579768A (en) * | 2019-09-30 | 2021-03-30 | 北京国双科技有限公司 | Emotion classification model training method, text emotion classification method and text emotion classification device |
CN111104510A (en) * | 2019-11-15 | 2020-05-05 | 南京中新赛克科技有限责任公司 | Word embedding-based text classification training sample expansion method |
CN111104510B (en) * | 2019-11-15 | 2023-05-09 | 南京中新赛克科技有限责任公司 | Text classification training sample expansion method based on word embedding |
CN110990567A (en) * | 2019-11-25 | 2020-04-10 | 国家电网有限公司 | Electric power audit text classification method for enhancing domain features |
CN111177403A (en) * | 2019-12-16 | 2020-05-19 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN111177403B (en) * | 2019-12-16 | 2023-06-23 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN111177378A (en) * | 2019-12-20 | 2020-05-19 | 北京淇瑀信息科技有限公司 | Text mining method and device and electronic equipment |
CN111177378B (en) * | 2019-12-20 | 2023-09-26 | 北京淇瑀信息科技有限公司 | Text mining method and device and electronic equipment |
CN111144097A (en) * | 2019-12-25 | 2020-05-12 | 华中科技大学鄂州工业技术研究院 | Modeling method and device for emotion tendency classification model of dialog text |
CN111144097B (en) * | 2019-12-25 | 2023-08-18 | 华中科技大学鄂州工业技术研究院 | Modeling method and device for emotion tendency classification model of dialogue text |
CN111159589A (en) * | 2019-12-30 | 2020-05-15 | 中国银联股份有限公司 | Classification dictionary establishing method, merchant data classification method, device and equipment |
CN111159589B (en) * | 2019-12-30 | 2023-10-20 | 中国银联股份有限公司 | Classification dictionary establishment method, merchant data classification method, device and equipment |
WO2021134524A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Data processing method, apparatus, electronic device, and storage medium |
CN111143569B (en) * | 2019-12-31 | 2023-05-02 | 腾讯科技(深圳)有限公司 | Data processing method, device and computer readable storage medium |
CN111143569A (en) * | 2019-12-31 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN111339268A (en) * | 2020-02-19 | 2020-06-26 | 北京百度网讯科技有限公司 | Entity word recognition method and device |
CN111339268B (en) * | 2020-02-19 | 2023-08-15 | 北京百度网讯科技有限公司 | Entity word recognition method and device |
CN111523308A (en) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Chinese word segmentation method and device and computer equipment |
CN111523308B (en) * | 2020-03-18 | 2024-01-26 | 大箴(杭州)科技有限公司 | Chinese word segmentation method and device and computer equipment |
CN111325033A (en) * | 2020-03-20 | 2020-06-23 | 中国建设银行股份有限公司 | Entity identification method, entity identification device, electronic equipment and computer readable storage medium |
CN111444326A (en) * | 2020-03-30 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Text data processing method, device, equipment and storage medium |
CN111444326B (en) * | 2020-03-30 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Text data processing method, device, equipment and storage medium |
CN113111175A (en) * | 2020-04-28 | 2021-07-13 | 北京明亿科技有限公司 | Extreme behavior identification method, device, equipment and medium based on deep learning model |
CN111368555A (en) * | 2020-05-27 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Data identification method and device, storage medium and electronic equipment |
CN111782803A (en) * | 2020-06-05 | 2020-10-16 | 京东数字科技控股有限公司 | Work order processing method and device, electronic equipment and storage medium |
CN112417860A (en) * | 2020-12-08 | 2021-02-26 | 携程计算机技术(上海)有限公司 | Training sample enhancement method, system, device and storage medium |
CN112632971B (en) * | 2020-12-18 | 2023-08-25 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching |
CN112632971A (en) * | 2020-12-18 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching |
CN112926631A (en) * | 2021-02-01 | 2021-06-08 | 大箴(杭州)科技有限公司 | Financial text classification method and device and computer equipment |
CN113051401A (en) * | 2021-04-06 | 2021-06-29 | 明品云(北京)数据科技有限公司 | Text structured labeling method, system, device and medium |
CN113240485A (en) * | 2021-05-10 | 2021-08-10 | 北京沃东天骏信息技术有限公司 | Training method of text generation model, and text generation method and device |
CN113177109A (en) * | 2021-05-27 | 2021-07-27 | 中国平安人寿保险股份有限公司 | Text weak labeling method, device, equipment and storage medium |
CN113723114A (en) * | 2021-08-31 | 2021-11-30 | 平安普惠企业管理有限公司 | Semantic analysis method, device and equipment based on multi-intent recognition and storage medium |
CN113642678A (en) * | 2021-10-12 | 2021-11-12 | 南京山猫齐动信息技术有限公司 | Method, device and storage medium for generating confrontation message sample |
CN114091469B (en) * | 2021-11-23 | 2022-08-19 | 杭州萝卜智能技术有限公司 | Network public opinion analysis method based on sample expansion |
CN114091469A (en) * | 2021-11-23 | 2022-02-25 | 杭州萝卜智能技术有限公司 | Sample expansion based network public opinion analysis method |
CN115861606B (en) * | 2022-05-09 | 2023-09-08 | 北京中关村科金技术有限公司 | Classification method, device and storage medium for long-tail distributed documents |
CN115861606A (en) * | 2022-05-09 | 2023-03-28 | 北京中关村科金技术有限公司 | Method and device for classifying long-tail distribution documents and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108804512B (en) | 2020-11-24 |
WO2019200806A1 (en) | 2019-10-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804512A (en) | | Generating means, method and the computer readable storage medium of textual classification model |
CN110287479B (en) | | Named entity recognition method, electronic device and storage medium |
CN108717406A (en) | | Text mood analysis method, device and storage medium |
CN104268197B (en) | | A kind of industry comment data fine granularity sentiment analysis method |
CN108629043A (en) | | Extracting method, device and the storage medium of webpage target information |
CN108984530A (en) | | A kind of detection method and detection system of network sensitive content |
CN108416384A (en) | | A kind of image tag mask method, system, equipment and readable storage medium storing program for executing |
CN108664473A (en) | | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
CN107330011A (en) | | The recognition methods of the name entity of many strategy fusions and device |
CN107818105A (en) | | The recommendation method and server of application program |
CN107729309A (en) | | A kind of method and device of the Chinese semantic analysis based on deep learning |
CN107943847A (en) | | Business connection extracting method, device and storage medium |
CN107169001A (en) | | A kind of textual classification model optimization method based on mass-rent feedback and Active Learning |
CN107251060A (en) | | For the pre-training and/or transfer learning of sequence label device |
CN110110335A (en) | | A kind of name entity recognition method based on Overlay model |
CN109376240A (en) | | A kind of text analyzing method and terminal |
CN107633036A (en) | | A kind of microblog users portrait method, electronic equipment, storage medium, system |
CN111309910A (en) | | Text information mining method and device |
CN106095845A (en) | | File classification method and device |
CN107797989A (en) | | Enterprise name recognition methods, electronic equipment and computer-readable recording medium |
CN111783468A (en) | | Text processing method, device, equipment and medium |
CN108304373A (en) | | Construction method, device, storage medium and the electronic device of semantic dictionary |
CN111522908A (en) | | Multi-label text classification method based on BiGRU and attention mechanism |
CN111475615A (en) | | Fine-grained emotion prediction method, device and system for emotion enhancement and storage medium |
CN107169061A (en) | | A kind of text multi-tag sorting technique for merging double information sources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||