CN109189880A - A kind of user interest classification method based on short text - Google Patents
A user interest classification method based on short text
- Publication number
- CN109189880A (application CN201711452259.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- user
- short text
- interest
- short
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
To address the problem of user interest classification modeling, a method is proposed for building a user interest classification model on a short-text data set from the HerPink platform. To alleviate the data sparsity caused by short texts, the concept of short-text reconstruction is introduced on the basis of an analysis of short-text structure and content, and the text content is extended accordingly to expand the original feature information. The feature word set of each reconstructed text is mapped to a concept set using a word segmentation tool. The text vectors, abstracted to the concept level, are then clustered to partition the user's interest set, and a representation mechanism for the user interest classification model is given. The results show that short-text reconstruction and concept mapping improve the clustering effect, indicating that the constructed user interest classification model performs well.
Description
Technical field
This application relates to the technical field of information processing, and in particular to a user interest classification method for short text.
Background art
Text is obtained from a back-end database. In the text preprocessing stage, the text is segmented with Jieba, a word segmentation library for Python; stop words are then removed, TF-IDF feature weights are calculated, and feature items are extracted to form the feature vector space of the text. Finally, user interests are classified with an SVM and the classification results are evaluated.
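The preprocessing stage described above can be sketched in plain Python. This is a minimal illustration, not the applicant's implementation: a whitespace tokenizer stands in for Jieba, the stop-word list and sample texts are hypothetical, and the SVM step is omitted (the resulting vectors could be passed to any SVM implementation).

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}

def tokenize(text):
    """Whitespace tokenizer standing in for a Chinese segmenter such as Jieba."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([
            (counts[w] / total) * math.log(n_docs / df[w]) for w in vocab
        ])
    return vocab, vectors

# Hypothetical sample texts.
docs = ["the lipstick shade is lovely", "a lovely running shoe", "running and yoga"]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one document in the feature vector space; terms that occur in every document receive a weight of zero under this particular TF-IDF variant.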
Summary of the invention
The vector space model is a model of text representation. It defines the basic unit of text representation as a feature item consisting of a character, word, or phrase; all feature items together constitute the feature item set. Each document is represented by a vector whose dimension equals the number of items in the feature item set, and each component of the vector reflects how often the corresponding feature item occurs in the document. The definitions are as follows. Let the document set be $A = \{a_i\}$, with $S$ elements, and let the feature item set be $T = \{t_j\}$, with $M$ elements. The weight $W_{ij}$ of feature item $t_j$ in document $a_i$ is defined as:
$$W_{ij} = \frac{tf_{ij}}{af_j}, \qquad 1 \le i \le S,\ 1 \le j \le M$$
where $tf_{ij}$ is the frequency with which feature item $t_j$ occurs in document $a_i$, called the term frequency, and $af_j$ is the number of documents in the document set $A$ in which feature item $t_j$ occurs, called the document frequency. The vector space model of the documents is constructed on this basis: taking $t_1, t_2, \ldots, t_M$ as coordinate axes, document $a_i$ is represented as the $M$-dimensional vector $(W_{i1}, W_{i2}, \ldots, W_{iM})$. The similarity $sim(a_i, a_j)$ between documents $a_i$ and $a_j$ is then:
[Formula not reproduced in the source text.]
where $1 \le i \le S$, $1 \le j \le M$.
The similarity between a retrieved document $X$ and the user's target document $Y$ is then $sim(X, Y)$. Selecting the documents that meet a predetermined threshold and arranging them in descending order of similarity yields the retrieved documents that meet the user's needs.
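The weighting and retrieval steps can be illustrated with a short Python sketch. The weight follows the definition W_ij = tf_ij / af_j given above. The similarity formula itself is not available in this text, so cosine similarity is used here purely as an assumption, a common choice for vector space models; the function names, sample documents, and threshold are likewise hypothetical.

```python
import math

def weight_matrix(docs, features):
    """W[i][j] = tf_ij / af_j for tokenized documents and a feature list."""
    # af_j: number of documents in which feature t_j occurs.
    af = [sum(1 for d in docs if t in d) for t in features]
    return [
        [d.count(t) / af[j] if af[j] else 0.0 for j, t in enumerate(features)]
        for d in docs
    ]

def cosine(u, v):
    """Cosine similarity; an assumed choice, since the source formula is missing."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(target, candidates, threshold):
    """Indices of candidates with sim >= threshold, in descending similarity."""
    scored = [(cosine(target, c), k) for k, c in enumerate(candidates)]
    kept = [(s, k) for s, k in scored if s >= threshold]
    return [k for s, k in sorted(kept, key=lambda p: -p[0])]

# Hypothetical tokenized documents and feature item set.
docs = [["shoe", "run", "run"], ["run", "yoga"], ["lipstick", "shade"]]
features = ["lipstick", "run", "shade", "shoe", "yoga"]
W = weight_matrix(docs, features)
ranked = retrieve(W[0], W[1:], threshold=0.1)
```

Here the first document serves as the user's target document; only the second document shares a feature item with it, so it is the only candidate that passes the threshold.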
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of a short-text user interest classification method provided by an exemplary embodiment of the present application.
Specific embodiments
1. The quality of text classification is evaluated mainly by the following indicators:
(1) Classification accuracy
$$\mathrm{Accuracy}(M) = \sum_{x} p(x)\,\mathrm{Accuracy}(M, x) = p(\hat{C}(x) = C(x))$$
$\mathrm{Accuracy}(M, x) = 1$ when $\hat{C}(x) = C(x)$, and $\mathrm{Accuracy}(M, x) = 0$ otherwise. Here $C(x)$ is the actual class of sample $x$, $\hat{C}(x)$ is the class predicted by the model, and $p(x)$ is the probability of sample $x$.
(2) Precision
Precision is the ratio of the number of correctly retrieved documents to the number of all documents retrieved for the query. It is estimated as:
$$\mathrm{Precision}(M, C) = P(C \mid \hat{C})$$
(3) Recall
Recall is the ratio of the number of correctly retrieved texts to the number of texts that actually exist and satisfy the query. It is estimated as:
$$\mathrm{Recall}(M, C) = P(\hat{C} \mid C)$$
where $C$ denotes that the actual value is the target class value and $\hat{C}$ denotes that the predicted value is the target class value.
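The three indicators can be estimated from a labelled sample as in the sketch below. It assumes every sample is equally likely, so p(x) = 1/N; the function names and the example labels are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted, target):
    """Estimate of P(actual == target | predicted == target)."""
    hits = [a for a, p in zip(actual, predicted) if p == target]
    return sum(a == target for a in hits) / len(hits) if hits else 0.0

def recall(actual, predicted, target):
    """Estimate of P(predicted == target | actual == target)."""
    members = [p for a, p in zip(actual, predicted) if a == target]
    return sum(p == target for p in members) / len(members) if members else 0.0

# Hypothetical interest labels for five users.
actual    = ["beauty", "sport", "sport", "beauty", "sport"]
predicted = ["beauty", "sport", "beauty", "beauty", "sport"]

acc = accuracy(actual, predicted)
prec_beauty = precision(actual, predicted, "beauty")
rec_beauty = recall(actual, predicted, "beauty")
```

In this example one "sport" user is misclassified as "beauty", which lowers the precision for "beauty" while its recall stays at 1.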
Claims (1)
1. A user interest classification method based on short text, characterized by comprising the following steps:
(1) The number of characters in a text posted by a HerPink platform user is limited, so the texts fall within the scope of short text. Because a single text has few characters, it carries little feature information and can hardly bear the task of characterizing a user's interest category, so some strategy must be adopted to enrich its content. For an objective corpus, the structural characteristic it has is the association between texts. Here, a text is the information content that a user posts, forwards, or comments on. A text that a user posts or forwards may have corresponding comments, so the posted or forwarded text and its corresponding set of comment texts are interrelated;
(2) Because the short texts sent by a user contain little content, their features are not distinct enough, so a solution is needed that increases the feature information of each short text. Precisely because short texts are correlated, within a set of associated short texts the keywords of the original short text are mentioned repeatedly and the number of other topic-related words also increases. Exploiting this, a short text that a user posts or forwards can be extended by its associated set of comment short texts; likewise, a comment short text posted by a user is extended by the short text it belongs to and by the other corresponding comment texts.
The present invention converts the user interest identification problem into a traditional classification problem: the interest category Y = {y1, y2, y3, ..., yi} of user U is determined from the user's interest feature vector Uv = {x1, x2, x3, ..., xn} and a function f, denoted f(Uv) -> Y, where yi represents an interest category of the user.
The present invention proposes a new way of expressing user interest features. Given a user U, suppose the set of pictures the user publishes in a specific time period is I = {i1, i2, i3, ..., in}, where n is the number of pictures. Each picture i contains a variety of concepts and objects (which can serve as a characterization of the image semantics). Existing image semantic recognition technology can identify the feature set F = {f1, f2, f3, ..., fj, ..., fm} of these concepts and objects, where m is the number of features and fj is the probability that the image contains semantic concept j. Likewise, suppose the set of texts the user publishes in a given time period is D = {d1, d2, d3, ..., dp}, where p is the number of texts. Suppose the length of text D is s, that is, everything the user publishes contains s words in total (after a filtering algorithm has filtered the text and retained the valuable feature text), so that D = {W1, W2, W3, ..., Ws}. Each word in the text can be represented by a word vector, which makes better use of syntactic and semantic features, and the sentence vector of the text is finally expressed as V(D) = V(W1) + V(W2) + ... + V(Ws), which is used in the next step for micro-text classification based on word vector features. For the tag data T = {t1, t2, t3, ..., tq}, where q is the number of tags (each user has more than one tag), the feature expression of the tags is constructed by decomposing each distinct tag into a vector space model. Finally, users with different interest categories share pictures with different concept feature distributions, publish different texts, and have different user tags, so their different interests can be predicted accordingly.
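The sentence vector V(D) = V(W1) + V(W2) + ... + V(Ws) is an element-wise sum of word vectors, as the following sketch shows. The three-dimensional word vectors are hypothetical placeholders; in practice they would come from a trained word embedding model, and the handling of unknown words is an assumption of this sketch.

```python
# Hypothetical 3-dimensional word vectors; real ones would come from a trained model.
WORD_VECTORS = {
    "running": [0.9, 0.1, 0.0],
    "shoes": [0.8, 0.2, 0.1],
    "lipstick": [0.0, 0.1, 0.9],
}

def sentence_vector(words, vectors=WORD_VECTORS, dim=3):
    """V(D) = V(W1) + V(W2) + ... + V(Ws); unknown words contribute a zero vector."""
    total = [0.0] * dim
    for w in words:
        for k, x in enumerate(vectors.get(w, [0.0] * dim)):
            total[k] += x
    return total

v = sentence_vector(["running", "shoes", "today"])
```

Because the sum is not normalized, longer texts produce larger vectors; whether to average or normalize before classification is a design choice that the text above does not specify.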
(3) Let $S = \{s_1, s_2, s_3, \ldots, s_n\}$ denote the set of all texts that the user posts, forwards, and comments on, where $s_i = \langle t_i, R_i \rangle$, $t_i$ is the $i$-th short text, $R_i$ is its set of associated short texts, and $r_j \in R_i$. Let $L = \{l_1, l_2, l_3, \ldots, l_n\}$ be the set after reconstruction, where $l_i = \langle D_i, E_i \rangle$. $D_i$ denotes the text formed by reconstructing $t_i$ with $R_i$; $E_i$ denotes the set of feature items extracted from $t_i$ that represent its topic and the user's characteristics, together with their corresponding weights, with $e_j \in E_i$ and $e_j = \langle T_j, W_j \rangle$. The calculation formula for $W_j$ is:
[Formula not reproduced in the source text.]
where $\rho$ is a weighting coefficient and $freq(t_{ij})$ is the frequency with which feature item $T_j$ occurs in the attribute $T$ of the elements of set $E_i$.
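The reconstruction from S to L can be sketched as follows. The weight formula is not available in this text, so the sketch substitutes a raw occurrence count as a placeholder weight; that substitution, the whitespace tokenization, and the sample post are all assumptions made for illustration.

```python
from collections import Counter

def reconstruct(posts):
    """Map each s_i = (t_i, R_i) to l_i = (D_i, E_i).

    D_i is t_i extended with its associated short texts R_i.
    E_i pairs each word of t_i with its raw count in D_i; this count is a
    placeholder weight, not the W_j formula of the source.
    """
    result = []
    for text, related in posts:
        extended = text.split()
        for r in related:
            extended.extend(r.split())
        counts = Counter(extended)
        features = {w: counts[w] for w in set(text.split())}
        result.append((" ".join(extended), features))
    return result

# Hypothetical post with two associated comments.
S = [("new running shoes", ["love these shoes", "great shoes for running"])]
L = reconstruct(S)
```

The example shows the effect described in step (2): keywords of the original short text ("shoes", "running") are repeated in the associated comments, so their counts grow after reconstruction.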
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711452259.XA CN109189880A (en) | 2017-12-26 | 2017-12-26 | A kind of user interest classification method based on short text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109189880A (en) | 2019-01-11 |
Family
ID=64948437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711452259.XA Pending CN109189880A (en) | 2017-12-26 | 2017-12-26 | A kind of user interest classification method based on short text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109189880A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955856B (en) * | 2012-11-09 | 2015-07-08 | 北京航空航天大学 | Chinese short text classification method based on characteristic extension |
EP3021264A1 (en) * | 2013-07-11 | 2016-05-18 | Huawei Technologies Co., Ltd. | Information recommendation method and apparatus in social media |
CN105740366A (en) * | 2016-01-26 | 2016-07-06 | 哈尔滨工业大学深圳研究生院 | Inference method and device of MicroBlog user interests |
CN106095966A (en) * | 2016-06-15 | 2016-11-09 | 成都品果科技有限公司 | A kind of user's extendible label for labelling method and system |
CN107239203A (en) * | 2016-03-29 | 2017-10-10 | 北京三星通信技术研究有限公司 | A kind of image management method and device |
Non-Patent Citations (2)
Title |
---|
曾金: "基于图像语义的用户兴趣建模", 《数据分析与知识发现》 * |
邱云飞: "基于微博短文本的用户兴趣建模方法", 《计算机工程》 * |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190111 |