CN106294718A - Information processing method and device - Google Patents

Information processing method and device

Info

Publication number
CN106294718A
CN106294718A (application CN201610645598.9A)
Authority
CN
China
Prior art keywords
information
text
word vector
sample text
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610645598.9A
Other languages
Chinese (zh)
Inventor
双锴
叶青
徐鹏
苏森
程祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201610645598.9A
Publication of CN106294718A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides an information processing method and device. The method includes: obtaining a plurality of sample texts; performing word segmentation on each sample text to obtain a plurality of word segments of each sample text; determining, according to the word segments of each sample text, the word vector corresponding to each sample text; dividing the word vectors corresponding to the sample texts to obtain a training set and a test set; obtaining a classifier to be tested according to the training set; testing the classifier to be tested according to the test set to obtain a target classifier; and classifying a target text according to the target classifier. By classifying the target text in this way, the method overcomes the limited application scenarios of the prior art.

Description

Information processing method and device
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to an information processing method and device.
Background art
In recent years, the Internet has developed rapidly and has become the main channel through which people express their opinions. A large amount of user behavior data is produced every day, and these data are of research value in many fields. Predicting the sentiment orientation of user behavior data has therefore become a popular research topic.
To predict the sentiment orientation of user behavior data, the prior art mainly relies on statistics of options submitted by users. For example, after browsing a news item, a user may submit a sentiment orientation option for that item (such as happy, angry, or indifferent); the website then analyzes the sentiment orientation options submitted by users to predict the users' sentiment orientation toward the news item.
However, this statistical approach only applies to scenarios in which users submit sentiment orientation options; it is difficult to apply where no such options are submitted, so its application is severely limited. For example, microblogs and comments posted by users are usually plain text without any attached sentiment orientation option, and the above prediction approach cannot be used for them.
Summary of the invention
Embodiments of the present invention provide an information processing method and device, with the aim of classifying a target text.
In one aspect, an embodiment of the present invention provides an information processing method, including:
obtaining a plurality of sample texts;
performing word segmentation on each sample text to obtain a plurality of word segments of each sample text;
determining, according to the word segments of each sample text, the word vector corresponding to each sample text;
dividing the word vectors corresponding to the sample texts to obtain a training set and a test set;
obtaining a classifier to be tested according to the training set;
testing the classifier to be tested according to the test set to obtain a target classifier;
classifying a target text according to the target classifier.
In another aspect, the present invention provides an information processing device. The device includes:
a sample text acquisition module, configured to obtain a plurality of sample texts;
a word segmentation module, configured to perform word segmentation on each sample text to obtain a plurality of word segments of each sample text;
a word vector determination module, configured to determine, according to the word segments of each sample text, the word vector corresponding to each sample text;
a word vector classification module, configured to divide the word vectors corresponding to the sample texts to obtain a training set and a test set;
a training module, configured to obtain a classifier to be tested according to the training set;
a test module, configured to test the classifier to be tested according to the test set to obtain a target classifier;
an application module, configured to classify a target text according to the target classifier.
In the information processing method and device provided by the embodiments of the present invention, the sample texts are segmented into words, word vectors are generated, the word vectors are divided into a training set and a test set, the training set is used to train a classifier to be tested, the classifier to be tested is tested against the test set to produce a target classifier, and the target classifier is then used to classify the target text. This overcomes the application limitations of the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of the information processing method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the information processing method provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the information processing method provided by Embodiment 3 of the present invention;
Fig. 4 is a structural diagram of the information processing device provided by Embodiment 4 of the present invention;
Fig. 5 is a structural diagram of the information processing device provided by Embodiment 5 of the present invention;
Fig. 6 is a structural diagram of the information processing device provided by Embodiment 6 of the present invention.
Detailed description of the invention
The information processing method and device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the information processing method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the information processing method provided by the embodiment of the present invention includes the following steps.
Step S101: obtain a plurality of sample texts.
For example, multiple microblogs and/or related comments about a certain topic are obtained from web pages, and these microblogs and/or related comments serve as the plurality of sample texts in this embodiment. The topic may specifically be "how to view the passenger-car license-plate lottery", the microblogs may be posts published by several bloggers on that topic, and the related comments may be comments posted by netizens on those microblogs.
Step S102: perform word segmentation on each sample text to obtain a plurality of word segments of each sample text.
For example, a general-purpose segmentation tool such as jieba may be used to segment the sample texts. Jieba segmentation is widely used in the word segmentation field and offers three modes:
1. precise mode, which cuts a sentence as accurately as possible and is suitable for text analysis;
2. full mode, which scans out every fragment of the sentence that can form a word; it is very fast but cannot resolve ambiguity;
3. search-engine mode, which further splits long words on top of precise mode to improve recall and is suitable for search-engine segmentation.
Taking the precise mode of jieba segmentation as an example, segmenting the text "我爱北京天安门" ("I love Beijing Tian'anmen") yields the 4 segments "我" ("I"), "爱" ("love"), "北京" ("Beijing") and "天安门" ("Tian'anmen"). Processing each sample text with jieba in the same way yields the word segments corresponding to each sample text, i.e. each sample text corresponds to a plurality of word segments.
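As an illustration of the three jieba modes described above, the following minimal Python sketch can be used (jieba is a third-party package that must be installed separately; the exact output depends on jieba's dictionary version):

```python
# pip install jieba  (assumed tool choice; the embodiment does not mandate a package)
import jieba

text = "我爱北京天安门"   # "I love Beijing Tian'anmen"

# Precise mode: the most accurate cut, suitable for text analysis
print(list(jieba.cut(text, cut_all=False)))    # expected: ['我', '爱', '北京', '天安门']

# Full mode: every fragment that can form a word; fast, but ambiguity is not resolved
print(list(jieba.cut(text, cut_all=True)))

# Search-engine mode: long words are further split on top of precise mode to improve recall
print(list(jieba.cut_for_search(text)))
```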
Step S103: determine, according to the word segments of each sample text, the word vector corresponding to each sample text.
In this embodiment, the word vector corresponding to each sample text may be determined according to the term frequency-inverse document frequency (TF-IDF) method. A word vector represents the words of a language mathematically and is the vector used to characterize words. TF-IDF is a weighting technique that uses statistical methods to calculate how important a keyword is within a text. It consists of two parts: the term frequency (TF), which is the number of times a word occurs in a text, and the inverse document frequency (IDF), which reflects how many texts a word occurs in. The following illustrates how the TF-IDF method is used to build a word vector.
After a text has been segmented, it can be represented by a multi-dimensional vector made up of its keywords, e.g. d_i(t_1, t_2, ..., t_n), where d_i denotes text i, n is the number of keywords in d, and t_n denotes the n-th keyword in d. The following are two texts, d1 and d2, containing keywords such as A, B, C and S:
d1(A, B, C, C, S, D, A, B, T, S, S, S, T, W, W);
d2(C, S, S, T, W, W, A, B, S, B).
IDF_t can be calculated from the formula IDF_t = ln((1 + |D|) / |D_t|), where |D| is the total number of documents and |D_t| is the number of documents that contain keyword t. TF-IDF is determined by the formula TF-IDF_{t,d} = TF_{t,d} × IDF_t; concretely, the TF-IDF value equals the product of the term frequency of keyword t in text d and the natural-logarithm inverse document frequency of keyword t.
Taking d1 and d2 above as an example, the frequency matrix of each keyword is shown in Table 1 below:
Table 1
Keyword    d1    d2
A          2     1
B          2     2
C          2     1
D          1     0
S          4     3
T          2     1
W          2     2
As shown in Table 1, after the term frequencies TF are obtained, the frequency data also need to be normalized, so that the statistics are not biased toward texts that contain more keywords. Normalization here means dividing each term frequency by the total number of keywords of all the texts, giving a regularized term frequency. Suppose a word occurs 100 times in text d1 and 100 times in d2; judging by term frequency alone, the word is equally important in the two texts. But if a further factor is considered, namely that d1 contains 1000 keywords in total while d2 contains 100000, then the word's importance in d1 and d2 is clearly different. The term frequencies therefore need to be normalized. After the frequency matrix above is normalized, the result is shown in Table 2 below:
Table 2
Keyword    d1      d2
A          0.08    0.04
B          0.08    0.08
C          0.08    0.04
D          0.04    0
S          0.16    0.12
T          0.08    0.04
W          0.08    0.08
Next, the inverse document frequency IDF corresponding to each keyword is calculated, as shown in Table 3 below:
Table 3
Keyword    IDF_t
A 0.4
B 0.4
C 0.4
D 1.1
S 0.4
T 0.4
W 0.4
Finally, the normalized term frequencies are multiplied by the IDF values; the result is shown in Table 4 below:
Table 4
Keyword    d1       d2
A          0.032    0.016
B          0.032    0.032
C          0.032    0.016
D          0.044    0
S          0.064    0.048
T          0.032    0.016
W          0.032    0.032
The word vector of d1 is thus:
d1(0.032, 0.032, 0.032, 0.032, 0.064, 0.044, 0.032, 0.032, 0.032, 0.064, 0.064, 0.064, 0.032, 0.032, 0.032);
the word vector of d2 is:
d2(0.016, 0.048, 0.048, 0.016, 0.032, 0.032, 0.016, 0.032, 0.048, 0.032).
In the same way, applying the above TF-IDF method to the word segments of each sample text generates the word vector corresponding to that sample text, i.e. each sample text corresponds to one word vector.
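The worked example above can be reproduced with a short Python sketch that follows exactly the scheme described here, i.e. term frequencies normalized by the total keyword count of all texts and IDF_t = ln((1 + |D|) / |D_t|) (this is not the scikit-learn TfidfVectorizer scheme, which normalizes differently):

```python
import math
from collections import Counter

docs = {
    "d1": ["A", "B", "C", "C", "S", "D", "A", "B", "T", "S", "S", "S", "T", "W", "W"],
    "d2": ["C", "S", "S", "T", "W", "W", "A", "B", "S", "B"],
}

total_keywords = sum(len(words) for words in docs.values())            # 25 in this example
doc_freq = Counter(t for words in docs.values() for t in set(words))   # |D_t| for each keyword

def tfidf_vector(words):
    counts = Counter(words)
    vec = []
    for t in words:                                    # one component per keyword occurrence
        tf = counts[t] / total_keywords                # normalized term frequency (Table 2)
        idf = math.log((1 + len(docs)) / doc_freq[t])  # IDF_t = ln((1 + |D|) / |D_t|) (Table 3)
        vec.append(round(tf * idf, 3))
    return vec

print(tfidf_vector(docs["d1"]))   # [0.032, 0.032, 0.032, 0.032, 0.065, 0.044, ...]
print(tfidf_vector(docs["d2"]))
# The values match the tables above up to rounding: the tables round IDF to 0.4/1.1
# before multiplying, so 0.064 there corresponds to 0.065 here.
```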
Step S104: divide the word vectors corresponding to the sample texts to obtain a training set and a test set.
For example, the word vectors corresponding to the sample texts may be divided into two parts according to a preset threshold, one part used as the training set and the other as the test set. If there are 5000 sample texts, 5000 word vectors are generated; with a preset threshold of 4:1, 4000 of the word vectors can be used as the training set and the remaining 1000 as the test set.
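A minimal sketch of such a split, assuming the word vectors are stacked into a matrix with one row per sample text (scikit-learn is used purely for convenience; the embodiment does not name a library):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 5000 word vectors of dimension 100 with labels in {-1, 0, 1}
X = np.random.rand(5000, 100)
y = np.random.choice([-1, 0, 1], size=5000)

# A preset threshold of 4:1 corresponds to an 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))   # 4000 1000
```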
Step S105: obtain a classifier to be tested according to the training set.
For example, the naive Bayes method may be used to train a model on the word vectors in the training set, yielding the classifier to be tested, i.e. a naive Bayes classifier. The naive Bayes model is one of the most commonly used classification models; it is a classification method based on Bayes' theorem and the assumption of conditional independence between features. Bayesian probability is the interpretation of probability given by Bayesian theory, and classifiers based on Bayesian probability are collectively called Bayes classifiers.
Step S106: test the classifier to be tested according to the test set to obtain the target classifier.
For example, the word vectors in the test set may be used to test the accuracy of the above classifier to be tested (the naive Bayes classifier). Concretely, the naive Bayes classifier computes a classification result for each word vector in the test set; these results are compared with the previously generated classification results of the word vectors to obtain a classification accuracy, e.g. 80%, and the naive Bayes classifier with 80% accuracy is determined to be the target classifier. If the accuracy does not reach the target, the naive Bayes classifier can be trained repeatedly until the expected training target is reached. Extensive experimental statistics show that the target classifier of the embodiment of the present invention generally classifies target texts with an accuracy above 80%.
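A sketch of steps S105 and S106 with a multinomial naive Bayes classifier, continuing from the split sketch above (scikit-learn is an illustrative choice, not one prescribed by the embodiment):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test come from the split sketch under step S104
clf = MultinomialNB()            # the classifier to be tested
clf.fit(X_train, y_train)        # step S105: naive Bayes model training on the training set

accuracy = accuracy_score(y_test, clf.predict(X_test))   # step S106: accuracy on the test set
print(f"test-set accuracy: {accuracy:.2%}")
# If the accuracy falls short of the target (e.g. 80%), training is repeated until it is reached.
```

Note that MultinomialNB expects non-negative features, which TF-IDF word vectors satisfy.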
Step S107: classify a target text according to the target classifier.
For example, the target classifier obtained through the above test is used to classify a target text, such as the latest web page content, thereby obtaining the classification accuracy for the target text. Specifically, the target text can be segmented and converted into a word vector according to the preceding steps; the target classifier then performs its computation on the word vector generated from the target text and produces the classification accuracy of the target text, which can be regarded as the accuracy with which the classification of the target text is predicted.
In the embodiment of the present invention, the sample texts are segmented, word vectors are generated, the word vectors are divided into a training set and a test set, the training set is trained to generate a classifier to be tested, and the classifier to be tested is tested against the test set to generate a target classifier, which is then used to classify the target text. The information processing method provided by the embodiment of the present invention overcomes the application limitations of the prior art.
Fig. 2 is a flowchart of the information processing method provided by Embodiment 2 of the present invention. As shown in Fig. 2, on the basis of Embodiment 1, the information processing method provided by Embodiment 2 of the present invention includes the following steps.
Step S201: simulate user operations on a browser by means of the Selenium technology.
In a typical application scenario, step S201 can be implemented by the following steps:
1) Start the browser through the Selenium WebDriver of the client, bind a port for the browser, and set the browser as a Remote Server. The browser may be Chrome, Firefox, IE, etc.; in this embodiment Chrome is preferred.
2) The client sends HttpRequest instructions to the Remote Server through the CommandExecutor; the instructions describe the operations the client needs the browser to perform, such as clicking, scrolling, closing or selecting.
3) The Remote Server, through the native browser components, transforms the instructions sent by the client into actual browser operations, thereby simulating the user's operations on the browser, e.g. clicking on a hot topic, viewing microblog comments and viewing user profiles; the subsequent steps can then obtain multiple sample texts from the browser web pages.
The client and the target browser are generally deployed on one user terminal, such as a personal computer.
Selenium is an automated testing technology that can communicate with the browser's driver and obtain control of that driver, so the browser's interface can be invoked directly to capture data from web pages. The technology can capture data from web pages continuously and stably, without being affected by the various restrictions that websites impose on crawler-style data capture.
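A minimal Selenium sketch of the workflow described above (the URL, the link text and the CSS selector are illustrative assumptions, not details given by the embodiment; a real microblog site typically requires login and uses different selectors):

```python
# pip install selenium  (a matching ChromeDriver must also be available)
from selenium import webdriver
from selenium.webdriver.common.by import By

# Local equivalent of the client / Remote Server setup described above; a genuine remote
# session would instead use webdriver.Remote(command_executor="http://host:4444/wd/hub", ...)
driver = webdriver.Chrome()

driver.get("https://example.com/hot-topics")           # placeholder URL
driver.find_element(By.LINK_TEXT, "热门话题").click()    # simulate clicking a hot topic
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")   # simulate scrolling

# Capture the visible post/comment texts; ".comment-text" is an assumed CSS class
posts = [e.text for e in driver.find_elements(By.CSS_SELECTOR, ".comment-text")]
driver.quit()
print(len(posts))
```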
Step S202: obtain, according to the operations, a plurality of sample data from the browser web pages.
For example, by simulating the user's mouse operations, text and image content is selected from different web pages, and this text and image content serves as the plurality of sample data.
Step S203: perform data cleaning on each sample data to obtain the plurality of sample texts.
For example, data cleaning is performed on a piece of sample data that has been obtained, such as a microblog, and includes:
1) converting the platform's built-in emoticon data into text data;
2) converting traditional Chinese characters into simplified Chinese characters;
3) removing spaces between Chinese characters;
4) discarding invalid data, such as empty entries, if present in the sample data;
5) de-duplicating data posted under duplicate user IDs; and so on.
After data cleaning, the microblog has been entirely converted into text content, and this text content is a sample text. By analogy, the sample text corresponding to each piece of sample data can be obtained.
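A cleaning sketch under stated assumptions: the record fields and the emoticon-to-text mapping are hypothetical, and traditional-to-simplified conversion uses the third-party opencc package, which the embodiment does not name:

```python
import re
from opencc import OpenCC   # pip install opencc-python-reimplemented (assumed tool choice)

t2s = OpenCC("t2s")                                   # traditional -> simplified Chinese
EMOTICONS = {"[微笑]": "微笑", "[怒]": "愤怒"}          # hypothetical emoticon-to-text mapping

def clean(record, seen_users):
    """record: dict with hypothetical fields 'user_id' and 'text'; returns cleaned text or None."""
    text = record.get("text", "")
    if not text.strip() or record["user_id"] in seen_users:
        return None                                   # 4) drop empty data, 5) drop duplicate users
    seen_users.add(record["user_id"])
    for emoticon, word in EMOTICONS.items():          # 1) emoticons -> text
        text = text.replace(emoticon, word)
    text = t2s.convert(text)                          # 2) traditional -> simplified characters
    return re.sub(r"\s+", "", text)                   # 3) remove spaces between Chinese characters

raw_records = [
    {"user_id": "u1", "text": "今天 天氣 不錯 [微笑]"},
    {"user_id": "u1", "text": "同一用户的另一条"},       # duplicate user ID -> dropped
    {"user_id": "u2", "text": "   "},                  # empty -> dropped
]
seen = set()
samples = [s for s in (clean(r, seen) for r in raw_records) if s]
print(samples)   # expected: ['今天天气不错微笑']
```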
The above steps S201-S203 are a specific implementation of step S101.
Step S204: perform word segmentation on each sample text to obtain a plurality of word segments of each sample text.
Step S205: determine, according to the word segments of each sample text, the word vector corresponding to each sample text.
Step S206: divide the word vectors corresponding to the sample texts to obtain a training set and a test set.
Step S207: obtain a classifier to be tested according to the training set.
Step S208: test the classifier to be tested according to the test set to obtain a target classifier.
Step S209: classify a target text according to the target classifier.
Steps S204-S209 are the same as steps S102-S107, respectively; the specific methods are not repeated here.
In the embodiment of the present invention, user operations on the browser are simulated by the Selenium technology, so that multiple sample texts can be obtained from web pages continuously and stably, bypassing the various restrictions that websites impose on obtaining sample texts with crawler technology.
Fig. 3 is a flowchart of the information processing method provided by Embodiment 3 of the present invention. As shown in Fig. 3, on the basis of Embodiment 2, the information processing method of Embodiment 3 of the present invention includes the following steps.
Step S301: simulate user operations on a browser by means of the Selenium technology.
Step S302: obtain, according to the operations, a plurality of sample data from the browser web pages.
Step S303: perform data cleaning on each sample data to obtain the plurality of sample texts.
The above steps S301-S303 are the same as steps S201-S203, respectively; the specific methods are not repeated here.
Step S304: determine the classification information of each sample text, where the classification information includes positive information, negative information and neutral information.
For example, the content of a sample text is analyzed and judged, and the text can be determined to be, say, positive information and labeled as such. By analogy, a concrete piece of classification information can be determined for each sample text. In practice, the positive, negative and neutral information can be represented by the values 1, -1 and 0, respectively, or by concrete sentiment orientation values such as positive, negative and neutral.
Step S305: perform word segmentation on each sample text to obtain a plurality of word segments of each sample text.
Step S305 is the same as step S204; the specific method is not repeated here.
Step S306: determine, for the word segments of each sample text and according to the TF-IDF method, the word vector corresponding to each sample text.
Step S306 is similar to the implementation, given as an example under step S103, of determining the word vector corresponding to each sample text from its word segments by the TF-IDF method; the specific method is not repeated here.
The above classification information is used in particular to label the word vector corresponding to each sample text. For example, if the classification information of a sample text is determined to be positive information, that positive information can be used to label the word vector corresponding to that sample text. By analogy, the classification information determined for each sample text is used to label the word vector corresponding to that sample text.
Step S307: divide the word vectors corresponding to the sample texts to obtain a training set and a test set.
Steps S306-S307 are the same as steps S205-S206, respectively; the specific methods are not repeated here.
Preferably, step S307 is specifically:
dividing, according to the classification information of each sample text, the word vectors corresponding to the sample texts to obtain the training set and the test set.
Here, the ratio of the number of word vectors in the training set to the number of word vectors in the test set is a preset threshold, and the numbers of word vectors in the training set corresponding to positive information, negative information and neutral information should be equal. For example, if the training set contains 12 word vectors, it should be ensured that the word vectors labeled as positive information, those labeled as negative information and those labeled as neutral information number 4 each, i.e. the ratio of the three is 1:1:1; doing so helps improve the accuracy of the subsequent computation. Meanwhile, extensive experimental statistics show that a preset threshold of 4:1 gives better subsequent results.
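A sketch of the class-balanced split described above: equal numbers of positive, negative and neutral word vectors in the training set, with roughly a 4:1 overall ratio (how the specific vectors are chosen is an illustrative assumption, since the embodiment does not fix a selection algorithm):

```python
import random
from collections import defaultdict

def balanced_split(labels, train_ratio=0.8, seed=0):
    """Return (train_indices, test_indices) with equal per-class counts in the training set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    for indices in by_label.values():
        rng.shuffle(indices)
    # Take the same number of vectors from each class, limited by the rarest class
    per_class = min(int(len(labels) * train_ratio / len(by_label)),
                    min(len(indices) for indices in by_label.values()))
    train = [i for indices in by_label.values() for i in indices[:per_class]]
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return train, test

labels = [1] * 40 + [-1] * 40 + [0] * 40        # stand-in labels for 120 sample texts
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx))            # 96 24 -> a 4:1 ratio, 32 vectors per class
```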
Step S308: perform model training on the training set by the naive Bayes method to obtain the classifier to be tested.
For example, the naive Bayes method may be used to train a naive Bayes model on the word vectors in the training set, obtaining a naive Bayes classifier, i.e. the classifier to be tested. The naive Bayes model is one of the most commonly used classification models; it is a classification method based on Bayes' theorem and the assumption of conditional independence between features.
Step S309: test the classifier to be tested according to the test set to obtain the target classifier.
Preferably, step S309 is specifically:
A. calculating the accuracy of the classifier to be tested according to the test set.
For example, each word vector in the test set is input into the classifier to be tested for computation, generating a classification result to be tested for each word vector; the generated results are compared with the classification information of each word vector in the test set, and the accuracy of the classifier to be tested is calculated. Suppose the test set contains 10 word vectors labeled as positive information, 10 labeled as negative information and 10 labeled as neutral information. These 30 word vectors are input into the classifier to be tested, which generates 30 corresponding classification results: among the first 10 word vectors, 9 are classified as positive information and 1 as non-positive; among the middle 10, 8 are classified as negative information and 2 as non-negative; among the last 10, 9 are classified as neutral information and 1 as non-neutral. The accuracy of the classifier to be tested is then (9 + 8 + 9) / 30 ≈ 86.67%.
B. determining the classifier to be tested having the above accuracy as the target classifier.
For example, the classifier to be tested with an accuracy of 86.67% is determined to be the target classifier, i.e. the classification accuracy of the determined target classifier is 86.67%.
Preferably, step A is specifically:
1) determining, according to the classifier to be tested, the classification result of each word vector in the test set, where the classification result is positive information, negative information or neutral information.
For example, each word vector in the test set is input into the classifier to be tested for computation, and the classification result of each word vector is generated directly.
Alternatively, each word vector in the test set is input into the classifier to be tested, which outputs the probability of every possible classification result for that word vector; the classification result with the highest probability is then determined to be the classification result of that word vector. For instance, if word vector A in the test set is input into the classifier to be tested and the generated probabilities are 80% for positive information, 30% for negative information and 20% for neutral information, then the positive information corresponding to 80% is determined to be the classification result of word vector A. By analogy, the classification result corresponding to each word vector in the test set can be obtained.
2) determining the accuracy of the classifier to be tested according to the classification result of each word vector in the test set and the classification information of the sample text corresponding to each word vector in the test set.
For example, the test set contains 10 word vectors labeled as positive information, 10 labeled as negative information and 10 labeled as neutral information, each corresponding to its respective sample text. These 30 word vectors are input into the classifier to be tested, which correspondingly generates 30 classification results: among the first 10, 9 are classified as positive information and 1 as non-positive; among the middle 10, 8 as negative information and 2 as non-negative; among the last 10, 9 as neutral information and 1 as non-neutral. The accuracy of the classifier to be tested is therefore (9 + 8 + 9) / 30 ≈ 86.67%.
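A sketch of this probability-maximum rule and the resulting accuracy, continuing the naive Bayes sketch from Embodiment 1 (the counts 9/8/9 in the example above depend on the actual data and are not reproduced by stand-in data):

```python
import numpy as np

# clf, X_test, y_test come from the naive Bayes sketch under steps S105-S106
proba = clf.predict_proba(X_test)              # one probability per possible classification result
pred = clf.classes_[np.argmax(proba, axis=1)]  # classification result = class with the highest probability

accuracy = np.mean(pred == y_test)             # e.g. (9 + 8 + 9) / 30 ≈ 86.67% in the example above
print(f"accuracy of the classifier to be tested: {accuracy:.2%}")
```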
Step S310: classify a target text according to the target classifier, so as to determine the sentiment orientation of the target text.
In this embodiment, the target text may specifically be a topic, for instance a recent hot topic on the network, around which a number of microblogs or comments have been posted. For each microblog or comment, a word vector is generated according to the method of the above embodiments, and each word vector is classified and labeled to obtain its classification result, which is one of positive, negative and neutral; in this embodiment, 1 marks positive, -1 marks negative and 0 marks neutral. This yields the classification result of every microblog or comment associated with the topic, i.e. the target text, from which the numbers of positive, negative and neutral microblogs or comments around the topic can be counted, as well as the total number of microblogs or comments around the topic. From the classification result of each microblog or comment and the total number of microblogs or comments, the average classification result of all microblogs or comments under the topic can be computed; this average is a decimal between -1 and 1, and in this embodiment it is taken as the overall sentiment orientation of the topic.
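A sketch of this topic-level aggregation, with the per-microblog classification results marked 1 (positive), -1 (negative) and 0 (neutral):

```python
def topic_sentiment(results):
    """results: classification results of all microblogs/comments under one topic."""
    n_pos = results.count(1)
    n_neg = results.count(-1)
    n_neu = results.count(0)
    average = sum(results) / len(results)    # overall sentiment orientation of the topic, in [-1, 1]
    return n_pos, n_neg, n_neu, average

# Stand-in classification results for the microblogs/comments under one topic
print(topic_sentiment([1, 1, 0, -1, 1, 0, 1, -1, 0, 1]))   # (5, 2, 3, 0.3)
```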
In this embodiment, the naive Bayes method is used to train a model on the training set, generating the classifier to be tested, and the target classifier is then determined. When used to classify a target text, this target classifier has a higher classification accuracy.
In practical applications, the classification information in the above embodiments may specifically be sentiment orientation classification information, the classification result may specifically be a sentiment orientation classification result, and the positive information, negative information and neutral information may specifically be a "positive" sentiment orientation value, a "negative" sentiment orientation value and a "neutral" sentiment orientation value, respectively.
Fig. 4 is a structural diagram of the information processing device provided by Embodiment 4 of the present invention. The information processing device provided by the embodiment of the present invention can perform the processing flow provided by the information processing method embodiments. As shown in Fig. 4, the information processing device provided by Embodiment 4 of the present invention includes the following modules:
a sample text acquisition module 1, configured to obtain a plurality of sample texts;
a word segmentation module 2, configured to perform word segmentation on each sample text to obtain a plurality of word segments of each sample text;
a word vector determination module 3, configured to determine, according to the word segments of each sample text, the word vector corresponding to each sample text;
a word vector classification module 4, configured to divide the word vectors corresponding to the sample texts to obtain a training set and a test set;
a training module 5, configured to obtain a classifier to be tested according to the training set;
a test module 6, configured to test the classifier to be tested according to the test set to obtain a target classifier;
an application module 7, configured to classify a target text according to the target classifier.
The information processing device provided by the embodiment of the present invention can specifically be used to perform the information processing method provided by Embodiment 1; the specific functions are not repeated here.
In the embodiment of the present invention, the sample texts are segmented, word vectors are generated, the word vectors are divided into a training set and a test set, the training set is trained to generate a classifier to be tested, and the classifier to be tested is tested against the test set to generate a target classifier, which is then used to classify the target text. The information processing device provided by the embodiment of the present invention overcomes the application limitations of the prior art.
Fig. 5 is a structural diagram of the information processing device provided by Embodiment 5 of the present invention. As shown in Fig. 5, on the basis of Embodiment 4, in this embodiment the sample text acquisition module 1 specifically includes the following units:
a simulation unit 11, configured to simulate user operations on a browser by means of the Selenium technology;
an acquisition unit 12, configured to obtain, according to the operations, a plurality of sample data from the browser web pages;
a data cleaning unit 13, configured to perform data cleaning on each sample data to obtain the plurality of sample texts.
The information processing device provided by the embodiment of the present invention can specifically be used to perform the information processing method provided by Embodiment 2; the specific functions are not repeated here.
In the embodiment of the present invention, user operations on the browser are simulated by the Selenium technology, so that multiple sample texts can be obtained from web pages continuously and stably, bypassing the various restrictions that websites impose on obtaining sample texts with crawler technology.
Fig. 6 is a structural diagram of the information processing device provided by Embodiment 6 of the present invention. As shown in Fig. 6, on the basis of Embodiment 5, the information processing device provided by this embodiment further includes:
a classification information determination module 8, configured to determine, after the plurality of sample texts are obtained, the classification information of each sample text, where the classification information includes positive information, negative information and neutral information and is used to label the word vector corresponding to each sample text.
The word vector classification module 4 is specifically configured to divide, according to the classification information of each sample text, the word vectors corresponding to the sample texts to obtain the training set and the test set; here, the ratio of the number of word vectors in the training set to the number of word vectors in the test set is a preset threshold, and the numbers of word vectors in the training set corresponding to the positive information, the negative information and the neutral information are equal.
Preferably, the word vector determination module 3 is specifically configured to determine, for the word segments of each sample text and according to the TF-IDF method, the word vector corresponding to each sample text.
Preferably, the training module 5 is specifically configured to perform model training on the training set by the naive Bayes method to obtain the classifier to be tested.
Preferably, the test module 6 specifically includes:
an accuracy computation unit 61, configured to calculate the accuracy of the classifier to be tested according to the test set;
a target classifier determination unit 62, configured to determine the classifier to be tested, having the accuracy, as the target classifier.
The information processing device provided by the embodiment of the present invention can specifically be used to perform the information processing method provided by Embodiment 3; the specific functions are not repeated here.
In this embodiment, the naive Bayes method is used to train a model on the training set, generating the classifier to be tested, and the target classifier is then determined. When used to classify a target text, this target classifier has a higher classification accuracy.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features therein; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. An information processing method, characterized by comprising:
obtaining a plurality of sample texts;
performing word segmentation on each sample text to obtain a plurality of word segments of each sample text;
determining, according to the word segments of each sample text, the word vector corresponding to each sample text;
dividing the word vectors corresponding to the sample texts to obtain a training set and a test set;
obtaining a classifier to be tested according to the training set;
testing the classifier to be tested according to the test set to obtain a target classifier;
classifying a target text according to the target classifier.
2. The information processing method according to claim 1, characterized in that obtaining the plurality of sample texts comprises:
simulating user operations on a browser by means of the Selenium technology;
obtaining, according to the operations, a plurality of sample data from the browser web pages;
performing data cleaning on each sample data to obtain the plurality of sample texts.
3. The information processing method according to claim 2, characterized in that determining, according to the word segments of each sample text, the word vector corresponding to each sample text comprises:
determining, for the word segments of each sample text and according to the TF-IDF method, the word vector corresponding to each sample text.
4. The information processing method according to claim 3, characterized by further comprising, after obtaining the plurality of sample texts:
determining the classification information of each sample text, the classification information comprising positive information, negative information and neutral information and being used to label the word vector corresponding to each sample text;
wherein dividing the word vectors corresponding to the sample texts to obtain the training set and the test set comprises:
dividing, according to the classification information of each sample text, the word vectors corresponding to the sample texts to obtain the training set and the test set;
wherein the ratio of the number of word vectors in the training set to the number of word vectors in the test set is a preset threshold, and the numbers of word vectors in the training set corresponding to the positive information, the negative information and the neutral information are equal.
5. The information processing method according to claim 4, characterized in that obtaining the classifier to be tested according to the training set comprises:
performing model training on the training set by the naive Bayes method to obtain the classifier to be tested.
6. The information processing method according to claim 5, characterized in that testing the classifier to be tested according to the test set to obtain the target classifier comprises:
calculating the accuracy of the classifier to be tested according to the test set;
determining the classifier to be tested, having the accuracy, as the target classifier.
7. The information processing method according to claim 6, characterized in that calculating the accuracy of the classifier to be tested according to the test set comprises:
determining, according to the classifier to be tested, the classification result of each word vector in the test set, the classification result comprising positive information, negative information and neutral information;
determining the accuracy of the classifier to be tested according to the classification result of each word vector in the test set and the classification information of the sample text corresponding to each word vector in the test set.
8. An information processing device, characterized by comprising:
a sample text acquisition module, configured to obtain a plurality of sample texts;
a word segmentation module, configured to perform word segmentation on each sample text to obtain a plurality of word segments of each sample text;
a word vector determination module, configured to determine, according to the word segments of each sample text, the word vector corresponding to each sample text;
a word vector classification module, configured to divide the word vectors corresponding to the sample texts to obtain a training set and a test set;
a training module, configured to obtain a classifier to be tested according to the training set;
a test module, configured to test the classifier to be tested according to the test set to obtain a target classifier;
an application module, configured to classify a target text according to the target classifier.
9. The information processing device according to claim 8, characterized in that the sample text acquisition module comprises:
a simulation unit, configured to simulate user operations on a browser by means of the Selenium technology;
an acquisition unit, configured to obtain, according to the operations, a plurality of sample data from the browser web pages;
a data cleaning unit, configured to perform data cleaning on each sample data to obtain the plurality of sample texts.
10. The information processing device according to claim 9, characterized in that the word vector determination module is specifically configured to determine, for the word segments of each sample text and according to the TF-IDF method, the word vector corresponding to each sample text.
11. The information processing device according to claim 10, characterized by further comprising:
a classification information determination module, configured to determine, after the plurality of sample texts are obtained, the classification information of each sample text, the classification information comprising positive information, negative information and neutral information and being used to label the word vector corresponding to each sample text;
wherein the word vector classification module is specifically configured to divide, according to the classification information of each sample text, the word vectors corresponding to the sample texts to obtain the training set and the test set;
wherein the ratio of the number of word vectors in the training set to the number of word vectors in the test set is a preset threshold, and the numbers of word vectors in the training set corresponding to the positive information, the negative information and the neutral information are equal.
12. The information processing device according to claim 11, characterized in that the training module is specifically configured to perform model training on the training set by the naive Bayes method to obtain the classifier to be tested.
13. The information processing device according to claim 12, characterized in that the test module comprises:
an accuracy computation unit, configured to calculate the accuracy of the classifier to be tested according to the test set;
a target classifier determination unit, configured to determine the classifier to be tested, having the accuracy, as the target classifier.
14. The information processing device according to claim 13, characterized in that the accuracy computation unit comprises:
a classification result determination subunit, configured to determine, according to the classifier to be tested, the classification result of each word vector in the test set, the classification result comprising positive information, negative information and neutral information;
an accuracy determination subunit, configured to determine the accuracy of the classifier to be tested according to the classification result of each word vector in the test set and the classification information of the sample text corresponding to each word vector in the test set.
CN201610645598.9A 2016-08-08 2016-08-08 Information processing method and device Pending CN106294718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610645598.9A CN106294718A (en) 2016-08-08 2016-08-08 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610645598.9A CN106294718A (en) 2016-08-08 2016-08-08 Information processing method and device

Publications (1)

Publication Number Publication Date
CN106294718A true CN106294718A (en) 2017-01-04

Family

ID=57666865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610645598.9A Pending CN106294718A (en) 2016-08-08 2016-08-08 Information processing method and device

Country Status (1)

Country Link
CN (1) CN106294718A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519976A (en) * 2018-04-04 2018-09-11 郑州大学 The method for generating extensive sentiment dictionary based on neural network
CN109299276A (en) * 2018-11-15 2019-02-01 阿里巴巴集团控股有限公司 One kind converting the text to word insertion, file classification method and device
CN109741190A (en) * 2018-12-27 2019-05-10 清华大学 A kind of method, system and the equipment of the classification of personal share bulletin
CN110895562A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Feedback information processing method and device
CN111291168A (en) * 2018-12-07 2020-06-16 北大方正集团有限公司 Book retrieval method and device and readable storage medium
WO2022036998A1 (en) * 2020-08-20 2022-02-24 广东电网有限责任公司清远供电局 Power system violation management method and apparatus, and power device
CN115292487A (en) * 2022-07-22 2022-11-04 杭州易有料科技有限公司 Text classification method, device, equipment and medium based on naive Bayes

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012151786A1 (en) * 2011-05-11 2012-11-15 北京航空航天大学 Chinese voice emotion extraction and modeling method combining emotion points
CN103034626A (en) * 2012-12-26 2013-04-10 上海交通大学 Emotion analyzing system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012151786A1 (en) * 2011-05-11 2012-11-15 北京航空航天大学 Chinese voice emotion extraction and modeling method combining emotion points
CN103034626A (en) * 2012-12-26 2013-04-10 上海交通大学 Emotion analyzing system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱亚琼 et al.: "A parallel data mining algorithm based on dynamic scheduling", Modern Electronics Technique *
李杏杏: "Research on mining technology for product reviews on B2C websites", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519976A (en) * 2018-04-04 2018-09-11 郑州大学 The method for generating extensive sentiment dictionary based on neural network
CN110895562A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Feedback information processing method and device
CN109299276A (en) * 2018-11-15 2019-02-01 阿里巴巴集团控股有限公司 One kind converting the text to word insertion, file classification method and device
CN109299276B (en) * 2018-11-15 2021-11-19 创新先进技术有限公司 Method and device for converting text into word embedding and text classification
CN111291168A (en) * 2018-12-07 2020-06-16 北大方正集团有限公司 Book retrieval method and device and readable storage medium
CN109741190A (en) * 2018-12-27 2019-05-10 清华大学 A kind of method, system and the equipment of the classification of personal share bulletin
WO2022036998A1 (en) * 2020-08-20 2022-02-24 广东电网有限责任公司清远供电局 Power system violation management method and apparatus, and power device
CN115292487A (en) * 2022-07-22 2022-11-04 杭州易有料科技有限公司 Text classification method, device, equipment and medium based on naive Bayes

Similar Documents

Publication Publication Date Title
CN106294718A (en) Information processing method and device
Sobhani et al. A dataset for multi-target stance detection
Shrestha et al. Convolutional neural networks for authorship attribution of short texts
CN108399158B (en) Attribute emotion classification method based on dependency tree and attention mechanism
Nobata et al. Abusive language detection in online user content
Zhou et al. Learning continuous word embedding with metadata for question retrieval in community question answering
Yu et al. Learning composition models for phrase embeddings
Bergsma et al. Language identification for creating language-specific twitter collections
CN106202032B (en) A kind of sentiment analysis method and its system towards microblogging short text
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN109471942B (en) Chinese comment emotion classification method and device based on evidence reasoning rule
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
CN107145560B (en) Text classification method and device
Wang et al. Multi-label Chinese microblog emotion classification via convolutional neural network
JP6306400B2 (en) Skill evaluation apparatus, program and method for evaluating worker skill in crowdsourcing
CN104361037B (en) Microblogging sorting technique and device
CN105912716A (en) Short text classification method and apparatus
CN108733675B (en) Emotion evaluation method and device based on large amount of sample data
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
Hercig et al. Detecting Stance in Czech News Commentaries.
Sham et al. Ethical AI in facial expression analysis: racial bias
Rhodes Author attribution with cnns
CN106776566A (en) The recognition methods of emotion vocabulary and device
CN104462229A (en) Event classification method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104

RJ01 Rejection of invention patent application after publication