CN103336764A - Orientation analysis-based classification model building and content identification method and device - Google Patents

Publication number: CN103336764A
Application number: CN2013102414098A
Priority: CN2013102414098A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 陈洪亮
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd (as listed; the listing may be inaccurate)
Legal status: Pending (an assumption, not a legal conclusion)
Prior art keywords: content, feature, classification, information, classification model

Abstract

The invention provides an orientation analysis-based classification model building method, a content identification method, and corresponding devices. In one aspect, according to the embodiments of the invention, first content features are obtained from a filtered second training corpus, and the first content features are then filtered according to the mutual information between their occurrence in a first content category and their occurrence in a second content category, so as to obtain second content features with which a classification model can be trained. Because this filtering retains the features with higher distinguishing ability, the classification model trained on the second content features identifies content with a negative orientation more accurately, and the reliability of content identification is therefore improved.

Description

Classification model building and content identification method and device based on orientation analysis
[Technical Field]
The present invention relates to content identification technology, and in particular to a classification model building method, a content identification method, and corresponding devices based on orientation analysis.
[Background Art]
Information on the Internet today is rich and varied. Most of it is useful, but some negative information is mixed in, for example reactionary political information, pornographic information, and other information with a negative orientation. Such information usually has a harmful influence on readers. For example, reactionary political information can mislead readers and easily provoke reactionary sentiment, which is detrimental to social harmony and stability; pornographic information can harm the mental health of teenagers and affect the healthy growth of minors. Accurately identifying information with a negative orientation has therefore become a problem that Internet companies must solve in order to provide readers with a safe and healthy Internet environment.
In the prior art, a large number of orientation-bearing words are collected manually in advance to form an orientation content list. The content to be identified, for example a World Wide Web (Web) page, is then matched against this list. If the matched content features satisfy a specified threshold condition, the content is identified as content with a negative orientation. The reliability of this identification method is not high.
[Summary of the Invention]
Several aspects of the present invention provide a classification model building method, a content identification method, and corresponding devices based on orientation analysis, in order to improve the reliability of content identification.
One aspect of the present invention provides an orientation analysis-based classification model building method, comprising:
filtering a first training corpus according to a preset orientation content list, to obtain a second training corpus;
obtaining first content features according to the second training corpus;
filtering the first content features according to the mutual information between the occurrence of the first content features in a first content category and their occurrence in a second content category, to obtain second content features; and
training a classification model by using the second content features, where the classification model is used to map content to be identified to the first content category or the second content category.
In the foregoing aspect and any possible implementation, an implementation is further provided in which obtaining the first content features according to the second training corpus comprises:
selecting the first content features from the second training corpus by using an N-Gram model.
In the foregoing aspect and any possible implementation, an implementation is further provided in which filtering the first content features according to the mutual information between the occurrence of the first content features in the first content category and their occurrence in the second content category, to obtain the second content features, comprises:
if the mutual information is greater than or equal to a preset threshold, retaining the first content feature as a second content feature;
if the mutual information is less than the preset threshold, discarding the first content feature.
In the foregoing aspect and any possible implementation, an implementation is further provided in which, after training the classification model by using the second content features, the method further comprises:
updating the classification model according to content to be tested.
In the foregoing aspect and any possible implementation, an implementation is further provided in which:
the first content category comprises information with a negative orientation, and the second content category comprises information with a positive orientation; or
the first content category comprises information with a positive orientation, and the second content category comprises information with a negative orientation.
Another aspect of the present invention provides a content identification method based on a classification model, where the classification model is built by the foregoing orientation analysis-based classification model building method. The method comprises:
obtaining content to be identified;
matching the content against a preset orientation content list, to obtain content features to be identified; and
classifying the content according to the content features by using the classification model, so as to map the content to a first content category or a second content category.
In the foregoing aspect and any possible implementation, an implementation is further provided in which:
the first content category comprises information with a negative orientation, and the second content category comprises information with a positive orientation; or
the first content category comprises information with a positive orientation, and the second content category comprises information with a negative orientation.
Another aspect of the present invention provides an orientation analysis-based classification model building device, comprising:
a corpus filtering unit, configured to filter a first training corpus according to a preset orientation content list, to obtain a second training corpus;
a feature obtaining unit, configured to obtain first content features according to the second training corpus;
a feature filtering unit, configured to filter the first content features according to the mutual information between the occurrence of the first content features in a first content category and their occurrence in a second content category, to obtain second content features; and
a model training unit, configured to train a classification model by using the second content features, where the classification model is used to map content to be identified to the first content category or the second content category.
In the foregoing aspect and any possible implementation, an implementation is further provided in which the feature obtaining unit is specifically configured to:
select the first content features from the second training corpus by using an N-Gram model.
In the foregoing aspect and any possible implementation, an implementation is further provided in which the feature filtering unit is specifically configured to:
if the mutual information is greater than or equal to a preset threshold, retain the first content feature as a second content feature;
if the mutual information is less than the preset threshold, discard the first content feature.
In the foregoing aspect and any possible implementation, an implementation is further provided in which the model training unit is further configured to:
update the classification model according to content to be tested.
In the foregoing aspect and any possible implementation, an implementation is further provided in which:
the first content category comprises information with a negative orientation, and the second content category comprises information with a positive orientation; or
the first content category comprises information with a positive orientation, and the second content category comprises information with a negative orientation.
Another aspect of the present invention provides a content identification device based on a classification model, where the classification model is built by the foregoing orientation analysis-based classification model building method. The device comprises:
an obtaining unit, configured to obtain content to be identified;
a matching unit, configured to match the content against a preset orientation content list, to obtain content features to be identified; and
a classification unit, configured to classify the content according to the content features by using the classification model, so as to map the content to a first content category or a second content category.
In the foregoing aspect and any possible implementation, an implementation is further provided in which:
the first content category comprises information with a negative orientation, and the second content category comprises information with a positive orientation; or
the first content category comprises information with a positive orientation, and the second content category comprises information with a negative orientation.
As can be seen from the foregoing technical solutions, in one aspect, the embodiments of the present invention obtain first content features from the filtered second training corpus, and then filter the first content features according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, to obtain second content features with which a classification model can be trained. Because this filtering retains the features with stronger distinguishing ability, the classification model trained on the second content features identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
As can be seen from the foregoing technical solutions, in another aspect, the embodiments of the present invention classify the content to be identified according to its content features, by using the classification model trained on the second content features with stronger distinguishing ability, so that the content is mapped to the first content category or the second content category. Because the first content features are filtered according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, the resulting second content features have stronger distinguishing ability, and the classification model trained on them identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
[Brief Description of the Drawings]
In order to describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an orientation analysis-based classification model building method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a classification model-based content identification method provided by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an orientation analysis-based classification model building device provided by another embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a classification model-based content identification device provided by another embodiment of the present invention.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may represent three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic flowchart of an orientation analysis-based classification model building method provided by an embodiment of the present invention. As shown in Fig. 1:
101. Filter a first training corpus according to a preset orientation content list, to obtain a second training corpus.
102. Obtain first content features according to the second training corpus.
103. Filter the first content features according to the mutual information between the occurrence of the first content features in a first content category and their occurrence in a second content category, to obtain second content features.
104. Train a classification model by using the second content features, where the classification model is used to map content to be identified to the first content category or the second content category.
The first content category may comprise information with a negative orientation; in that case, the second content category may comprise information with a non-negative, i.e. positive, orientation.
Alternatively, the first content category may comprise information with a non-negative, i.e. positive, orientation; in that case, the second content category may comprise information with a negative orientation.
It should be noted that 101~104 may be executed by a modeling device.
In this way, first content features are obtained from the filtered second training corpus, and the first content features are then filtered according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, to obtain second content features with which a classification model can be trained. Because this filtering retains the features with stronger distinguishing ability, the classification model trained on the second content features identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
In addition, with the technical solution provided by the present invention, content with a negative orientation is identified more accurately, which can further improve the recall rate of content identification.
In addition, with the technical solution provided by the present invention, the first training corpus is filtered according to the preset orientation content list to obtain the second training corpus. Using the second training corpus as the training basis of the classification model makes the distribution of the two content categories, i.e. the first content category and the second content category, more even, which improves the predictive ability of the classification model.
It can be understood that, in this embodiment, the first training corpus may be random Web pages obtained from the Internet, including various kinds of Web pages, for example news pages, comment pages, and advertisement pages.
In this way, the first training corpus is filtered according to the preset orientation content list to obtain the second training corpus, which then serves as the training basis of the classification model and is manually labelled with a content category, i.e. the first content category or the second content category. This not only narrows the scope of manual labelling, but also makes the distribution of the two content categories more even, which improves the predictive ability of the classification model.
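The corpus filtering of step 101 can be sketched as follows. This is a minimal illustration, assuming that "filtering" means keeping the pages that contain at least one entry of the preset list; the function name `filter_corpus` and the placeholder terms `term_a` and `term_b` are hypothetical and do not come from the patent.

```python
from typing import Iterable, List


def filter_corpus(pages: Iterable[str], orientation_terms: Iterable[str]) -> List[str]:
    """Keep only the pages that contain at least one term from the preset list."""
    terms = [t for t in orientation_terms if t]  # ignore empty entries
    return [page for page in pages if any(term in page for term in terms)]


# Hypothetical placeholder data, not taken from the patent.
first_corpus = [
    "weather report: sunny with light wind",
    "forum post mentioning term_a in passing",
    "advertisement for term_b products",
    "sports news roundup",
]
orientation_list = ["term_a", "term_b"]

second_corpus = filter_corpus(first_corpus, orientation_list)
print(second_corpus)
```

Only the pages that survive this step would need manual labelling, which is how the filtering narrows the labelling scope.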
Optionally, in a possible implementation of this embodiment, in 102, an N-Gram model may be used to select the first content features from the second training corpus. Examples are unigram features such as Tiananmen, Falun Gong, Li Peng, and movement, or bigram features such as Tiananmen and self-immolation, Falun Gong and addict, and suppression and movement. For a detailed description of the N-Gram model, refer to the related content in the prior art; it is not repeated here.
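A minimal sketch of N-Gram feature extraction is given below. It assumes the text has already been split into tokens (Chinese text would first need word segmentation, which is outside this sketch) and that features are contiguous token sequences; the patent's bigram examples may instead denote co-occurring word pairs. The function name `ngram_features` is hypothetical.

```python
from typing import List, Tuple


def ngram_features(tokens: List[str], max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of length 1..max_n as candidate features."""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(tuple(tokens[i:i + n]))
    return feats


tokens = ["w1", "w2", "w3"]  # placeholder tokens
features = ngram_features(tokens)
print(features)
```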
Optionally, in a possible implementation of this embodiment, in 103, if the mutual information is greater than or equal to a preset threshold, the first content feature may be retained as a second content feature; if the mutual information is less than the preset threshold, the first content feature may be discarded.
Specifically, the larger the mutual information, the greater the difference between the probability that the first content feature occurs in the first content category and the probability that it occurs in the second content category; conversely, the smaller the mutual information, the smaller that difference. For a detailed description of mutual information, refer to the related descriptions in the prior art; it is not repeated here.
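The patent does not give a formula for the mutual information. The sketch below assumes the standard mutual information between a binary "feature present in document" variable and the two-valued category variable, estimated from document counts, followed by the threshold filter of step 103. The names `mutual_information` and `filter_features`, the toy documents, and the threshold value 0.5 are illustrative assumptions.

```python
import math
from typing import List, Set


def mutual_information(docs_c1: List[Set[str]], docs_c2: List[Set[str]], feature: str) -> float:
    """Mutual information (in bits) between feature presence and content category."""
    sizes = (len(docs_c1), len(docs_c2))
    total = sizes[0] + sizes[1]
    present = (sum(feature in d for d in docs_c1), sum(feature in d for d in docs_c2))
    mi = 0.0
    for has_feature in (True, False):
        # Joint document counts of (feature state, category) for both categories.
        joint = [present[c] if has_feature else sizes[c] - present[c] for c in range(2)]
        p_t = sum(joint) / total
        for c in range(2):
            p_tc = joint[c] / total
            if p_tc > 0:  # a zero joint probability contributes nothing
                mi += p_tc * math.log2(p_tc / (p_t * sizes[c] / total))
    return mi


def filter_features(candidates, docs_c1, docs_c2, threshold):
    """Keep a feature if its mutual information reaches the preset threshold."""
    return [f for f in candidates if mutual_information(docs_c1, docs_c2, f) >= threshold]


# Toy labelled documents, each represented as a set of features (placeholders).
docs_first = [{"a", "x"}, {"a"}, {"a", "y"}]
docs_second = [{"x"}, {"y"}, {"x", "y"}]
selected = filter_features(["a", "x", "y"], docs_first, docs_second, threshold=0.5)
print(selected)
```

In this toy data the feature `a` occurs in every document of one category and in none of the other, so it carries the most information, while `x` and `y` occur at similar rates in both categories and fall below the threshold.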
Specifically, in 104, the classification model may include, but is not limited to, a maximum entropy model (Maximum Entropy Model). For the specific method of training the classification model by using the second content features, refer to the related descriptions in the prior art; it is not repeated here.
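For two categories, a maximum entropy model is equivalent to logistic regression. The sketch below trains such a model on binary feature-presence inputs with plain stochastic gradient ascent. It is only an illustration under assumed choices: the label 1 stands for the first content category and 0 for the second, and the function names, learning rate, epoch count, and toy samples are hypothetical, not taken from the patent.

```python
import math
from typing import Dict, List, Set, Tuple

BIAS = "__bias__"  # hypothetical name for the intercept weight


def train_maxent(samples: List[Tuple[Set[str], int]], features: List[str],
                 epochs: int = 200, lr: float = 0.5) -> Dict[str, float]:
    """Binary maximum entropy (logistic regression) trained by gradient ascent."""
    weights = {f: 0.0 for f in features}
    weights[BIAS] = 0.0
    for _ in range(epochs):
        for doc, label in samples:
            active = [f for f in features if f in doc] + [BIAS]
            z = sum(weights[f] for f in active)
            p = 1.0 / (1.0 + math.exp(-z))  # P(first category | doc)
            for f in active:
                weights[f] += lr * (label - p)  # log-likelihood gradient step
    return weights


def prob_first_category(weights: Dict[str, float], doc: Set[str]) -> float:
    """Probability that a document belongs to the first content category."""
    z = weights[BIAS] + sum(w for f, w in weights.items() if f != BIAS and f in doc)
    return 1.0 / (1.0 + math.exp(-z))


# Toy samples: (second content features present in the document, label).
train_set = [({"a"}, 1), ({"a", "b"}, 1), ({"b"}, 0), (set(), 0)]
model = train_maxent(train_set, ["a", "b"])
print(prob_first_category(model, {"a"}), prob_first_category(model, {"b"}))
```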
Optionally, in a possible implementation of this embodiment, after 104, the classification model may be further tested. For example, content to be tested is obtained; the content is matched against the preset orientation content list to obtain content features to be tested; and the content is classified according to these content features by using the classification model, so as to map it to the first content category or the second content category. Further, if the classification result is incorrect, the content is used as second training corpus and 102~104 are executed to update the classification model.
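The test-and-update loop can be sketched as below. The sketch assumes that a misclassified test item is appended to the training corpus and the model is retrained from scratch; `toy_train` is a deliberately simple, hypothetical stand-in for steps 102~104, and categories are encoded as 1 and 2.

```python
from typing import Callable, List, Set, Tuple

Sample = Tuple[Set[str], int]  # (content features, labelled category: 1 or 2)
Model = Callable[[Set[str]], int]


def evaluate_and_update(train: Callable[[List[Sample]], Model],
                        corpus: List[Sample], test_items: List[Sample]):
    """Retrain whenever a test item is misclassified, after adding it to the corpus."""
    model = train(corpus)
    for item in test_items:
        doc, label = item
        if model(doc) != label:
            corpus = corpus + [item]  # treat the item as additional training corpus
            model = train(corpus)     # rerun feature selection and training
    return model, corpus


def toy_train(samples: List[Sample]) -> Model:
    """Hypothetical stand-in trainer: features seen only in category 1 act as markers."""
    cat1 = set().union(*(d for d, y in samples if y == 1))
    cat2 = set().union(*(d for d, y in samples if y == 2))
    markers = cat1 - cat2
    return lambda doc: 1 if doc & markers else 2


corpus = [({"a"}, 1), ({"b"}, 2)]
tests = [({"c"}, 1), ({"b"}, 2)]
model, corpus = evaluate_and_update(toy_train, corpus, tests)
print(model({"c"}), len(corpus))
```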
In this embodiment, first content features are obtained from the filtered second training corpus, and the first content features are then filtered according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, to obtain second content features with which a classification model can be trained. Because this filtering retains the features with stronger distinguishing ability, the classification model trained on the second content features identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
In addition, with the technical solution provided by the present invention, content with a negative orientation is identified more accurately, which can further improve the recall rate of content identification.
Fig. 2 is a schematic flowchart of a classification model-based content identification method provided by another embodiment of the present invention. As shown in Fig. 2:
201. Obtain content to be identified.
202. Match the content against a preset orientation content list, to obtain content features to be identified.
203. Classify the content according to the content features by using the classification model, so as to map the content to a first content category or a second content category.
The classification model is built by the orientation analysis-based classification model building method provided by the embodiment corresponding to Fig. 1. For details, refer to the related content in that embodiment; they are not repeated here.
The first content category may comprise information with a negative orientation; in that case, the second content category may comprise information with a non-negative, i.e. positive, orientation.
Alternatively, the first content category may comprise information with a non-negative, i.e. positive, orientation; in that case, the second content category may comprise information with a negative orientation.
It should be noted that 201~203 may be executed by an identification device, which may be located in a local client to perform offline identification, or in a network-side server to perform online identification; this embodiment imposes no limitation on this.
It can be understood that the client may be an application installed on a terminal, or a Web page in a browser. Any objectively existing form that can implement content identification and provide the identification service is acceptable; this embodiment imposes no limitation on this.
It can be understood that, in 202, the content is matched against the preset orientation content list. If the match is unsuccessful, i.e. matching fails, the content may be mapped to the content category comprising information with a non-negative, i.e. positive, orientation. If the match succeeds, the content features to be identified are obtained.
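Steps 202 and 203, including the shortcut for content that matches nothing in the list, can be sketched as follows. The sketch assumes the first content category is the negative one; `identify`, `stub_classifier`, and the placeholder terms are hypothetical, and the stub merely stands in for a trained classification model.

```python
from typing import Callable, Iterable, Set


def identify(content: str, orientation_terms: Iterable[str],
             classify: Callable[[Set[str]], str]) -> str:
    """Match content against the preset list, then classify only if something matched."""
    matched = {t for t in orientation_terms if t and t in content}
    if not matched:
        # Matching failed: map directly to the non-negative (here: second) category.
        return "second"
    return classify(matched)


def stub_classifier(features: Set[str]) -> str:
    """Hypothetical stand-in for the trained classification model."""
    return "first" if "term_a" in features else "second"


terms = ["term_a", "term_b"]
print(identify("plain text with no listed term", terms, stub_classifier))
print(identify("text containing term_a", terms, stub_classifier))
print(identify("text containing term_b only", terms, stub_classifier))
```

The third call shows why the model is still needed after a successful match: a listed term can appear in content that the model maps to the non-negative category.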
Optionally, in a possible implementation of this embodiment, in 203, the classification model may be used to map the content to the first content category or the second content category.
In this way, after the content category to which the content belongs has been identified as the first content category or the second content category, further operations may be performed according to the identification result. For example, content belonging to the content category comprising information with a negative orientation may be blocked or replaced.
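A possible post-processing step is sketched below. The patent only names blocking and replacement as examples; the concrete behaviour here (returning an empty string for blocking, masking matched terms with asterisks for replacement) and the name `handle_content` are assumptions made for illustration.

```python
from typing import Iterable


def handle_content(content: str, category: str, terms: Iterable[str],
                   mode: str = "replace") -> str:
    """Block or mask content that was identified as having a negative orientation."""
    if category != "negative":
        return content  # non-negative content passes through unchanged
    if mode == "block":
        return ""       # blocking: nothing is shown
    for term in terms:
        content = content.replace(term, "*" * len(term))  # mask each listed term
    return content


print(handle_content("x term_a y", "negative", ["term_a"]))
print(handle_content("x term_a y", "positive", ["term_a"]))
```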
In this embodiment, the content to be identified is classified according to its content features, by using the classification model trained on the second content features with stronger distinguishing ability, so that the content is mapped to the first content category or the second content category. Because the first content features are filtered according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, the resulting second content features have stronger distinguishing ability, and the classification model trained on them identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
In addition, with the technical solution provided by the present invention, content with a negative orientation is identified more accurately, which can further improve the recall rate of content identification.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of action combinations. However, a person skilled in the art should appreciate that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. In addition, a person skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
Fig. 3 is a schematic structural diagram of an orientation analysis-based classification model building device provided by another embodiment of the present invention. As shown in Fig. 3, the device of this embodiment may comprise a corpus filtering unit 31, a feature obtaining unit 32, a feature filtering unit 33, and a model training unit 34. The corpus filtering unit 31 is configured to filter a first training corpus according to a preset orientation content list, to obtain a second training corpus. The feature obtaining unit 32 is configured to obtain first content features according to the second training corpus. The feature filtering unit 33 is configured to filter the first content features according to the mutual information between the occurrence of the first content features in a first content category and their occurrence in a second content category, to obtain second content features. The model training unit 34 is configured to train a classification model by using the second content features, where the classification model is used to map content to be identified to the first content category or the second content category.
The first content category may comprise information with a negative orientation; in that case, the second content category may comprise information with a non-negative, i.e. positive, orientation.
Alternatively, the first content category may comprise information with a non-negative, i.e. positive, orientation; in that case, the second content category may comprise information with a negative orientation.
It should be noted that the device provided by this embodiment may be a modeling device.
In this way, the feature obtaining unit obtains first content features from the filtered second training corpus, and the feature filtering unit then filters the first content features according to the mutual information between their occurrence in the first content category and their occurrence in the second content category, to obtain second content features with which the model training unit can train a classification model. Because this filtering retains the features with stronger distinguishing ability, the classification model trained on the second content features identifies content with a negative orientation more accurately, thereby improving the reliability of content identification.
In addition, with the technical solution provided by the present invention, content with a negative orientation is identified more accurately, which can further improve the recall rate of content identification.
It can be understood that, in this embodiment, the first corpus may be random Web pages obtained from the Internet, covering various kinds of Web pages, for example news pages, review pages, advertisement pages, and so on.
In this way, the corpus filtering unit filters the first corpus according to the preset orientation content list to obtain the second corpus, and the apparatus can then use the second corpus as the training basis of the classification model. Each item is manually labeled with a content category, that is, the first content category or the second content category. This not only reduces the scope of manual labeling, but also makes the distribution of the two content categories more even, which improves the predictive ability of the classification model.
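The corpus filtering step can be illustrated with a minimal sketch. The function name `filter_corpus`, the substring-matching rule, and the sample documents and list entries are assumptions made for illustration; the source does not specify how an item is matched against the list.

```python
def filter_corpus(first_corpus, orientation_list):
    """Keep only the documents that contain at least one listed term.

    Assumption: a document matches if any list entry occurs in it as a
    case-insensitive substring.
    """
    terms = [term.lower() for term in orientation_list]
    return [doc for doc in first_corpus
            if any(term in doc.lower() for term in terms)]


# Hypothetical first corpus and preset list, for illustration only.
first_corpus = [
    "The service was terrible and the staff were rude",
    "Weather forecast for the weekend",
    "A terrible delay, but the refund was quick",
]
second_corpus = filter_corpus(first_corpus, ["terrible", "rude"])
```

Only the documents in `second_corpus` would then go to manual labeling, which is where the reduction in labeling effort comes from.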
Optionally, in one possible implementation of this embodiment, the feature obtaining unit 32 may specifically be configured to select the first content feature from the second corpus using an N-Gram model. Examples of unigram features are Tiananmen, Falun Gong, Li Peng, movement, and so on; examples of bigram features are Tiananmen and self-immolation, Falun Gong and addict, suppression and movement, and so on. For a detailed description of the N-gram (N-Gram) model, see the related content in the prior art, which is not repeated here.
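As a rough illustration of N-Gram feature selection, the sketch below enumerates unigram and bigram candidates from a token sequence. It assumes the text has already been tokenized and that bigrams are contiguous token pairs; the source leaves both points to the prior art, and `extract_ngrams` is a hypothetical helper name.

```python
def extract_ngrams(tokens, max_n=2):
    """Return all contiguous n-grams of length 1..max_n as tuples."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


# Whitespace tokenization is a simplification; Chinese text would need a
# word segmenter first.
tokens = "the refund was slow".split()
features = extract_ngrams(tokens)
```

For the four tokens above this yields four unigrams and three bigrams, which serve as candidate first content features before any filtering.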
Optionally, in one possible implementation of this embodiment, the feature filtering unit 33 may specifically be configured to retain the first content feature as a second content feature if the mutual information is greater than or equal to a preset threshold, and to discard the first content feature if the mutual information is less than the preset threshold.
Specifically, the larger the mutual information, the greater the difference between the probability that the first content feature occurs in the first content category and the probability that it occurs in the second content category; conversely, the smaller the mutual information, the smaller that difference. For a detailed description of mutual information, see the related description in the prior art, which is not repeated here.
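The threshold rule can be sketched as follows. The source does not give its mutual information formula, so this example assumes the standard definition between a binary "feature present" variable and the category label, I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ); the function names, sample documents, and the threshold of 0.5 are illustrative only.

```python
import math


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between feature presence and category."""
    n = len(docs)
    counts = {}
    for doc, label in zip(docs, labels):
        key = (feature in doc, label)
        counts[key] = counts.get(key, 0) + 1
    mi = 0.0
    for (present, label), c in counts.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in counts.items() if p == present) / n
        p_y = sum(v for (_, l), v in counts.items() if l == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def filter_features(docs, labels, features, threshold):
    """Keep features whose mutual information is >= threshold."""
    return [f for f in features
            if mutual_information(docs, labels, f) >= threshold]


# Each document is represented as a set of features (hypothetical data).
docs = [{"slow", "refund"}, {"slow", "rude"}, {"fast", "refund"}, {"fast", "kind"}]
labels = ["negative", "negative", "positive", "positive"]
kept = filter_features(docs, labels, ["slow", "refund", "fast"], threshold=0.5)
```

Here "slow" and "fast" each occur in only one category and are retained, while "refund" occurs equally in both categories, has zero mutual information, and is discarded.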
Specifically, the classification model may include, but is not limited to, a maximum entropy model (Maximum Entropy Model). For the specific method of training the classification model using the second content feature, see the related description in the prior art, which is not repeated here.
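For two categories, a maximum entropy model has the same form as logistic regression, so a small training loop can serve as an illustration. This is a sketch under stated assumptions, not the training procedure of the source: binary presence features, plain stochastic gradient ascent, and the hypothetical names `train_maxent` and `predict`.

```python
import math


def train_maxent(samples, labels, features, epochs=200, lr=0.5):
    """Train a binary maximum-entropy (logistic regression) model.

    samples: list of feature sets; labels: 1 for the first category,
    0 for the second; features: the retained second content features.
    """
    weights = {f: 0.0 for f in features}
    bias = 0.0
    for _ in range(epochs):
        for doc, y in zip(samples, labels):
            z = bias + sum(weights[f] for f in features if f in doc)
            p = 1.0 / (1.0 + math.exp(-z))   # P(first category | doc)
            err = y - p                       # gradient of the log-likelihood
            bias += lr * err
            for f in features:
                if f in doc:
                    weights[f] += lr * err
    return weights, bias


def predict(doc, weights, bias):
    """Map a feature set to category 1 or 0."""
    z = bias + sum(w for f, w in weights.items() if f in doc)
    return 1 if z >= 0 else 0


samples = [{"slow", "rude"}, {"slow"}, {"fast", "kind"}, {"fast"}]
labels = [1, 1, 0, 0]
weights, bias = train_maxent(samples, labels, ["slow", "fast"])
```

Features outside the retained set, such as "rude" or "kind" above, are ignored at both training and prediction time, which mirrors the role of the feature filtering step.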
Optionally, in one possible implementation of this embodiment, the apparatus may further test the classification model. For example, the apparatus obtains content to be tested; matches the content against the preset orientation content list to obtain a content feature to be tested; and, according to the content feature, classifies the content using the classification model so as to map the content to the first content category or the second content category. Further, if the classification result is incorrect, the model training unit 34 sends the content to the feature obtaining unit as second corpus, and the feature obtaining unit, the feature filtering unit, and the model training unit perform the corresponding operations to update the classification model.
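The test-and-update cycle can be sketched as a loop that retrains when test items are misclassified. Everything here is hypothetical: `update_on_errors` is an illustrative name, and `train_keyword_model` is a deliberately simple stand-in trainer rather than the maximum entropy model of the source.

```python
def train_keyword_model(corpus):
    """Stand-in trainer: score tokens by (negative count - positive count)."""
    scores = {}
    for doc, label in corpus:
        for tok in doc.split():
            scores[tok] = scores.get(tok, 0) + (1 if label == "negative" else -1)

    def model(doc):
        total = sum(scores.get(tok, 0) for tok in doc.split())
        return "negative" if total > 0 else "positive"

    return model


def update_on_errors(train, corpus, test_items):
    """Test the model; feed misclassified items back as corpus and retrain."""
    model = train(corpus)
    errors = [(doc, label) for doc, label in test_items if model(doc) != label]
    if errors:
        corpus = corpus + errors
        model = train(corpus)
    return model, corpus, errors


corpus = [("slow rude", "negative"), ("fast kind", "positive")]
test_items = [("late rude", "negative"), ("late delivery", "negative")]
model, corpus, errors = update_on_errors(train_keyword_model, corpus, test_items)
```

In this run only "late delivery" is misclassified at first; after it is added to the corpus and the model is retrained, it is classified correctly.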
In this embodiment, the feature obtaining unit obtains the first content feature from the filtered second corpus, and the feature filtering unit then filters the first content feature according to the mutual information between its occurrence in the first content category and its occurrence in the second content category, to obtain the second content feature, so that the model training unit can train the classification model using the second content feature. Because the first content feature is filtered according to this mutual information, the second content features obtained have stronger discriminating ability. A classification model trained with these more discriminative second content features can identify content with negative orientation more accurately, which improves the reliability of content recognition.
In addition, because the technical solution provided by the present invention identifies content with negative orientation more accurately, it can further improve the recall of content recognition.
Fig. 4 is a schematic structural diagram of a content recognition apparatus based on a classification model according to another embodiment of the present invention. As shown in Fig. 4, the apparatus of this embodiment may include an obtaining unit 41, a matching unit 42, and a classification unit 43. The obtaining unit 41 is configured to obtain content to be recognized. The matching unit 42 is configured to match the content against a preset orientation content list to obtain a content feature to be recognized. The classification unit 43 is configured to classify the content according to the content feature, using the classification model, so as to map the content to a first content category or a second content category.
The classification model is built using the orientation analysis-based classification model building method provided by the embodiment corresponding to Fig. 1. For details, see the related content in the embodiment corresponding to Fig. 1, which is not repeated here.
The first content category may include information with negative orientation; in that case, the second content category may include information with non-negative, that is, positive, orientation.
Alternatively, the first content category may include information with non-negative, that is, positive, orientation; in that case, the second content category may include information with negative orientation.
It should be noted that the apparatus provided by this embodiment may be a recognition apparatus. It may be located in a local client to perform offline recognition, or in a server on the network side to perform online recognition; this embodiment does not limit this.
It can be understood that the client may be an application installed on a terminal, or may be a Web page in a browser. Any objectively existing form that can perform content recognition and provide a recognition service is acceptable; this embodiment does not limit this.
It can be understood that the matching unit 42 matches the content against the preset orientation content list. If the match does not succeed, that is, matching fails, the content may be mapped to the content category that includes information with non-negative, that is, positive, orientation. If the match succeeds, the content feature to be recognized is obtained.
Optionally, in one possible implementation of this embodiment, the classification unit 43 may specifically use the classification model to map the content to the first content category or the second content category.
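The matching and classification steps of the recognition apparatus can be combined in a short sketch. The function `recognize`, the substring matching, the category labels, and the stand-in `classify` callable are assumptions for illustration; in the source the classifier would be the trained classification model.

```python
def recognize(content, orientation_list, classify):
    """Match content against the list, then classify the matched features.

    If nothing matches, the content is mapped directly to the positive
    category and the classifier is never called.
    """
    matched = [term for term in orientation_list if term in content]
    if not matched:
        return "positive"
    return classify(matched)


def classify(features):
    """Hypothetical stand-in for the trained classification model."""
    return "negative" if "rude" in features else "positive"


orientation_list = ["rude", "slow"]
result_unmatched = recognize("the weather is fine", orientation_list, classify)
result_matched = recognize("the staff were rude", orientation_list, classify)
```

Skipping the classifier when no list entry matches keeps the common case cheap, since most content would not contain any listed term.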
In this way, after the apparatus identifies the content category to which the content belongs, that is, the first content category or the second content category, it may send the recognition result to other apparatuses so that they can perform further operations according to the result, for example shielding or replacing content that belongs to the content category including information with negative orientation.
In this embodiment, the classification unit classifies the content to be recognized according to the content feature obtained by the matching unit, using the classification model trained with the more discriminative second content features, so as to map the content to the first content category or the second content category. Because the first content feature is filtered according to the mutual information between its occurrence in the first content category and its occurrence in the second content category, the second content features obtained have stronger discriminating ability. A classification model trained with these second content features can identify content with negative orientation more accurately, which improves the reliability of content recognition.
In addition, because the technical solution provided by the present invention identifies content with negative orientation more accurately, it can further improve the recall of content recognition.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a portable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method for building a classification model based on orientation analysis, characterized by comprising:
filtering a first corpus according to a preset orientation content list to obtain a second corpus;
obtaining a first content feature according to the second corpus;
filtering the first content feature according to the mutual information between the occurrence of the first content feature in a first content category and its occurrence in a second content category, to obtain a second content feature;
training a classification model using the second content feature, the classification model being used to map content to be recognized to the first content category or the second content category.
2. The method according to claim 1, characterized in that obtaining the first content feature according to the second corpus comprises:
selecting the first content feature from the second corpus using an N-Gram model.
3. method according to claim 1 and 2, it is characterized in that, the described mutual information that described first content feature is occurred in the second content classification of in the first content classification, occurring according to described first content feature, described first content feature is filtered, to obtain the second content feature, comprising:
If described mutual information is more than or equal to the threshold value that sets in advance, keep described first content feature, with as described second content feature;
If described mutual information less than the threshold value that sets in advance, abandons described first content feature.
4. The method according to any one of claims 1 to 3, characterized in that, after training the classification model using the second content feature, the method further comprises:
updating the classification model according to content to be tested.
5. The method according to any one of claims 1 to 4, characterized in that:
the first content category includes information with negative orientation, and the second content category includes information with positive orientation; or
the first content category includes information with positive orientation, and the second content category includes information with negative orientation.
6. A content recognition method based on a classification model, characterized in that the classification model is built using the method for building a classification model based on orientation analysis according to any one of claims 1 to 5, the content recognition method comprising:
obtaining content to be recognized;
matching the content against a preset orientation content list to obtain a content feature to be recognized;
classifying the content according to the content feature, using the classification model, so as to map the content to a first content category or a second content category.
7. The method according to claim 6, characterized in that:
the first content category includes information with negative orientation, and the second content category includes information with positive orientation; or
the first content category includes information with positive orientation, and the second content category includes information with negative orientation.
8. An apparatus for building a classification model based on orientation analysis, characterized by comprising:
a corpus filtering unit, configured to filter a first corpus according to a preset orientation content list to obtain a second corpus;
a feature obtaining unit, configured to obtain a first content feature according to the second corpus;
a feature filtering unit, configured to filter the first content feature according to the mutual information between the occurrence of the first content feature in a first content category and its occurrence in a second content category, to obtain a second content feature;
a model training unit, configured to train a classification model using the second content feature, the classification model being used to map content to be recognized to the first content category or the second content category.
9. The apparatus according to claim 8, characterized in that the feature obtaining unit is specifically configured to:
select the first content feature from the second corpus using an N-Gram model.
10. The apparatus according to claim 8 or 9, characterized in that the feature filtering unit is specifically configured to:
retain the first content feature as the second content feature if the mutual information is greater than or equal to a preset threshold;
discard the first content feature if the mutual information is less than the preset threshold.
11. The apparatus according to any one of claims 8 to 10, characterized in that the model training unit is further configured to:
update the classification model according to content to be tested.
12. The apparatus according to any one of claims 8 to 11, characterized in that:
the first content category includes information with negative orientation, and the second content category includes information with positive orientation; or
the first content category includes information with positive orientation, and the second content category includes information with negative orientation.
13. A content recognition apparatus based on a classification model, characterized in that the classification model is built using the method for building a classification model based on orientation analysis according to any one of claims 1 to 5, the apparatus comprising:
an obtaining unit, configured to obtain content to be recognized;
a matching unit, configured to match the content against a preset orientation content list to obtain a content feature to be recognized;
a classification unit, configured to classify the content according to the content feature, using the classification model, so as to map the content to a first content category or a second content category.
14. The apparatus according to claim 13, characterized in that:
the first content category includes information with negative orientation, and the second content category includes information with positive orientation; or
the first content category includes information with positive orientation, and the second content category includes information with negative orientation.
CN2013102414098A 2013-06-18 2013-06-18 Orientation analysis-based classification model building and content identification method and device Pending CN103336764A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102414098A CN103336764A (en) 2013-06-18 2013-06-18 Orientation analysis-based classification model building and content identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102414098A CN103336764A (en) 2013-06-18 2013-06-18 Orientation analysis-based classification model building and content identification method and device

Publications (1)

Publication Number Publication Date
CN103336764A true CN103336764A (en) 2013-10-02

Family

ID=49244933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102414098A Pending CN103336764A (en) 2013-06-18 2013-06-18 Orientation analysis-based classification model building and content identification method and device

Country Status (1)

Country Link
CN (1) CN103336764A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414300A (en) * 2008-11-28 2009-04-22 电子科技大学 Method for sorting and processing internet public feelings information
CN102200969A (en) * 2010-03-25 2011-09-28 日电(中国)有限公司 Text sentiment polarity classification system and method based on sentence sequence
US20120253792A1 (en) * 2011-03-30 2012-10-04 Nec Laboratories America, Inc. Sentiment Classification Based on Supervised Latent N-Gram Analysis

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
何坤等: "《基于语义特征的文本情感倾向识别研究》", 《计算机应用研究》 *
厉小军等: "《文本倾向性分析综述》", 《浙江大学学报》 *
夏火松等: "《中文情感分类挖掘预处理关键技术比较研究》", 《情报杂志》 *
连凯: "《基于SVM的汉语评论情感分类方法研究》", 《现代计算机》, no. 12, 25 April 2012 (2012-04-25) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224603A (en) * 2015-09-01 2016-01-06 北京京东尚科信息技术有限公司 Corpus acquisition methods and device
CN105224603B (en) * 2015-09-01 2018-04-10 北京京东尚科信息技术有限公司 Training corpus acquisition methods and device
CN105930359A (en) * 2016-04-11 2016-09-07 百度在线网络技术(北京)有限公司 Tendency monitoring method and device
CN107220355A (en) * 2017-06-02 2017-09-29 北京百度网讯科技有限公司 News Quality estimation method, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131002
