CN109189880A - A kind of user interest classification method based on short text - Google Patents

A kind of user interest classification method based on short text

Info

Publication number
CN109189880A
Authority
CN
China
Prior art keywords
text
user
short text
interest
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711452259.XA
Other languages
Chinese (zh)
Inventor
万迅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ai Pink Technology (wuhan) Ltd By Share Ltd
Original Assignee
Ai Pink Technology (wuhan) Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ai Pink Technology (wuhan) Ltd By Share Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

To address the problem of building a user interest classification model, a method is proposed for establishing such a model on the short-text dataset of the HerPink platform. To alleviate the data sparsity caused by short texts, the concept of short-text reconstruction is introduced on the basis of an analysis of short-text structure and content, and the text content is extended accordingly so as to enlarge the original feature information. The feature word set of the reconstructed text is mapped to a concept set using a word segmentation tool. Clustering is performed on the text vectors abstracted to the concept level, the user's interest sets are partitioned, and a representation mechanism for the user interest classification model is given. The results show that short-text reconstruction and concept mapping improve the clustering effect, indicating that the constructed user interest classification model performs well.

Description

A kind of user interest classification method based on short text
Technical field
This application relates to the technical field of information processing, and in particular to a short-text user interest classification method.
Background
Text is obtained from the back-end database. In the text preprocessing stage, the text is segmented with Python's Jieba library; stop words are then removed, TF-IDF feature weights are calculated, and feature items are extracted to form the text feature vector space. Finally, user interest classification is performed with an SVM and the classification result is evaluated.
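The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the applicant's implementation: whitespace splitting stands in for Jieba segmentation, the stop-word list and sample posts are invented, the TF-IDF variant (relative term frequency times log(N/df)) is one common choice that the text does not specify, and the SVM step is left out.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a full list from a file.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def tokenize(text):
    # Stand-in for Jieba segmentation (e.g. jieba.lcut(text));
    # whitespace splitting keeps the sketch dependency-free.
    return text.lower().split()


def preprocess(text):
    # Segment the text, then drop stop words.
    return [w for w in tokenize(text) if w not in STOP_WORDS]


def tfidf_vectors(docs):
    # docs: list of token lists. Returns (vocabulary, list of weight vectors).
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            (tf[w] / len(doc)) * math.log(n_docs / df[w]) if doc else 0.0
            for w in vocab
        ])
    return vocab, vectors


# Invented sample posts, for illustration only.
posts = [
    "the lipstick shade is lovely",
    "a new lipstick and a new mascara",
    "the marathon training plan",
]
tokens = [preprocess(p) for p in posts]
vocab, vectors = tfidf_vectors(tokens)
```

The resulting `vectors` form the text feature vector space that a classifier such as an SVM would take as input.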
Summary of the invention
The vector space model is a model of text representation. It defines the basic unit of text representation as a feature item, which may be a character, a word, or a phrase; all feature items together make up the feature item set. Each document is represented by a vector whose dimension equals the size of the feature item set, and each component of the vector corresponds to the occurrences of one feature item in the document. It is defined as follows. Let the document set be $A = \{a_i\}$, with $S$ elements, and let the feature item set be $T = \{t_j\}$, with $M$ elements. The weight $W_{ij}$ of feature item $t_j$ in document $a_i$ is defined as:
$$W_{ij} = \frac{tf_{ij}}{df_j}, \quad 1 \le i \le S,\ 1 \le j \le M$$
where $tf_{ij}$ is the frequency with which feature item $t_j$ occurs in document $a_i$, called the term frequency, and $df_j$ is the number of documents in the document set $A$ that contain feature item $t_j$, called the document frequency. The vector space model of the documents is built on this basis: taking $t_1, t_2, \ldots, t_M$ as the coordinate axes, document $a_i$ is represented as the $M$-dimensional vector $(W_{i1}, W_{i2}, \ldots, W_{iM})$, and the similarity $sim(a_i, a_j)$ between $a_i$ and $a_j$ is:
where $1 \le i \le S$, $1 \le j \le M$.
The similarity between a retrieved document $X$ and the user's target document $Y$ is then $sim(X, Y)$. Selecting the documents that meet a predetermined threshold and arranging them in descending order of similarity yields the retrieved documents that satisfy the user's needs.
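The weighting and ranking just described can be illustrated with a short Python sketch. It computes the weight of each feature item as term frequency divided by document frequency, as defined above. The similarity measure is an assumption: cosine similarity is used here because it is the usual choice for vector space models, but the text does not confirm which formula is intended. The function names, sample documents, and threshold are hypothetical.

```python
import math
from collections import Counter


def vsm_weights(docs):
    """Return (features, weights) with weights[i][j] = tf_ij / df_j."""
    features = sorted({t for doc in docs for t in doc})
    # Document frequency: number of documents containing each feature item.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append([tf[t] / df[t] for t in features])
    return features, weights


def cosine(u, v):
    # Assumed similarity measure (cosine of the angle between two vectors).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(target_vec, doc_vecs, threshold):
    # Keep documents whose similarity meets the threshold,
    # then sort them in descending order of similarity.
    scored = [(cosine(target_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted([(s, i) for s, i in scored if s >= threshold], reverse=True)


# Invented tokenized documents, for illustration only.
docs = [
    ["skin", "care", "skin"],
    ["skin", "makeup"],
    ["running", "shoes"],
]
features, weights = vsm_weights(docs)
ranked = rank(weights[0], weights, threshold=0.1)
```

Here `ranked` holds `(similarity, document index)` pairs for the documents that pass the threshold, most similar first; the third document shares no feature item with the target and is filtered out.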
Brief description of the drawings
Fig. 1 is a schematic diagram of the framework of a short-text user interest classification method provided by an exemplary embodiment of the present application.
Detailed description of the embodiments
1. The quality of text classification is evaluated mainly by the following metrics:
(1) Classification accuracy
$$Accuracy(M) = \sum_x p(x)\,Accuracy(M, x) = p(\hat{C}(x) = C(x))$$
$Accuracy(M, x) = 1$ when $\hat{C}(x) = C(x)$, and $Accuracy(M, x) = 0$ otherwise. Here $C(x)$ is the actual class of sample $x$, $\hat{C}(x)$ is the class predicted by the model, and $p(x)$ is the probability of sample $x$.
(2) Precision
The ratio of the number of correctly retrieved documents to the total number of documents that the search engine returned for the query. Precision is estimated as:
$$Precision(M, C) = P(C \mid \hat{C})$$
(3) Recall: the ratio of the correctly retrieved texts in the search results to the texts that actually exist and satisfy the query. Recall is estimated as:
$$Recall(M, C) = P(\hat{C} \mid C)$$
where $C$ denotes that the actual value is the target class, and $\hat{C}$ denotes that the predicted value is the target class.
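The three metrics can be estimated from a list of actual and predicted classes, as in the following sketch. It assumes every sample is equally likely, so that the sum over samples weighted by their probability reduces to a plain average; the function names and the sample labels are hypothetical.

```python
def accuracy(actual, predicted):
    # Fraction of samples whose predicted class equals the actual class
    # (assumes a uniform sample probability p(x) = 1/n).
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision(actual, predicted, target):
    # Estimate of P(actual is target | predicted is target).
    hits = [a for a, p in zip(actual, predicted) if p == target]
    return sum(a == target for a in hits) / len(hits) if hits else 0.0


def recall(actual, predicted, target):
    # Estimate of P(predicted is target | actual is target).
    hits = [p for a, p in zip(actual, predicted) if a == target]
    return sum(p == target for p in hits) / len(hits) if hits else 0.0


# Invented interest labels, for illustration only.
actual = ["beauty", "beauty", "sports", "food", "beauty", "sports"]
predicted = ["beauty", "sports", "sports", "food", "beauty", "beauty"]

acc = accuracy(actual, predicted)
prec = precision(actual, predicted, "beauty")
rec = recall(actual, predicted, "beauty")
```

In this toy data four of six predictions are correct, and for the class "beauty" two of the three predicted samples and two of the three actual samples are hits.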

Claims (1)

1. A user interest classification method based on short text, characterized by comprising the following steps:
(1) The number of characters in a text posted by a HerPink platform user is limited, so these texts fall within the scope of short text. Because a single text has few characters, it carries little feature information and can hardly bear the task of characterizing a user's interest categories, so some strategy must be adopted to enrich its content. For an objective corpus, its inherent structural characteristic is the correlation between texts. Here, text refers to the content that users publish, forward, and comment on. A text that a user publishes or forwards may have corresponding comments, so the published or forwarded text and its corresponding set of comment texts are interrelated;
(2) Because the short texts a user posts contain little content, their features are not distinctive enough, so a solution is needed that increases the feature information of each short text. Precisely because short texts are correlated, within a set of associated short texts the keywords of the original short text are mentioned repeatedly, and the number of other topic-related words also increases. Given this characteristic, a short text that a user publishes or forwards can be extended with its associated set of comment short texts; likewise, a comment short text that a user posts can be extended with the short text it belongs to and the other corresponding comment texts.
The present invention converts the user interest identification problem into a traditional classification problem: the interest categories $Y = \{y_1, y_2, y_3, \ldots, y_i\}$ of a user are determined from the user's interest feature vector $U_v = \{x_1, x_2, x_3, \ldots, x_n\}$ and a function $f$, written $f(U_v) \to Y$, where $y_i$ represents an interest category of the user.
The present invention proposes a new way of expressing user interest features. Given a user $U$, suppose the set of pictures the user publishes in a specific time period is $I = \{i_1, i_2, i_3, \ldots, i_n\}$, where $n$ is the number of pictures. Each picture $i$ contains a variety of concepts and objects, which can serve as a representation of the image semantics. Existing image semantic recognition techniques can identify the feature set $F = \{f_1, f_2, f_3, \ldots, f_j, \ldots, f_m\}$ of these concepts and objects, where $m$ is the number of features and $f_j$ is the probability that the image contains semantic concept $j$. Likewise, suppose the set of texts the user publishes in a given time period is $D = \{d_1, d_2, d_3, \ldots, d_p\}$, where $p$ is the number of texts. Suppose the length of the text $D$ is $s$, that is, everything the user has published contains $s$ words (the valuable feature text retained after a filtering algorithm is applied to the text), so that $D = \{W_1, W_2, W_3, \ldots, W_s\}$. Each word in the text can be represented by a word vector, which makes better use of syntactic and semantic features, and the text sentence vector is finally expressed as $V(D) = V(W_1) + V(W_2) + \ldots + V(W_s)$, which is used in the next step for classification of the text based on word-vector features. For the label data $T = \{t_1, t_2, t_3, \ldots, t_q\}$, where $q$ is the number of labels (each user has more than one label), the feature representation of the labels is built by decomposing each distinct label into a vector space model. Finally, users with different interest categories share pictures with different concept feature distributions, publish different texts, and carry different user labels, so their different interests can be predicted accordingly.
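The sentence-vector expression above, in which the text vector is the sum of the word vectors of its words, can be illustrated as follows. This is only a sketch: the embedding table, its three-dimensional vectors, and the choice to skip words without an embedding are assumptions made for the example, since the text does not say how word vectors are obtained or how unknown words are handled.

```python
def sentence_vector(words, embeddings, dim):
    # Sum the word vectors of all words in the text.
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            # Assumption: words without an embedding are skipped.
            continue
        total = [t + x for t, x in zip(total, vec)]
    return total


# Hypothetical 3-dimensional word vectors, for illustration only.
EMBEDDINGS = {
    "lipstick": [1.0, 0.0, 0.5],
    "red": [0.0, 1.0, 0.5],
    "love": [0.5, 0.5, 0.0],
}

vector = sentence_vector(["love", "red", "lipstick", "unknown"], EMBEDDINGS, dim=3)
```

A classifier would then take `vector` as the feature representation of the user's filtered text.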
(3) Let $S = \{s_1, s_2, s_3, \ldots, s_n\}$ denote the set of everything the user has published, forwarded, and commented on, where $s_i = \langle t_i, R_i \rangle$, $t_i$ is the $i$-th short text, and $R_i$ is its associated set of short texts, with $r_j \in R_i$. Let $L = \{l_1, l_2, l_3, \ldots, l_n\}$ be the set after reconstruction, with $l_i = \langle D_i, E_i \rangle$, where $D_i$ is the text formed by reconstructing $t_i$ with $R_i$, and $E_i$ is the set of feature items extracted from $t_i$ that represent the topic and the particular user, together with their corresponding weights, with $e_j \in E_i$ and $e_j = \langle T_j, W_j \rangle$. $W_j$ is calculated as:
where $\rho$ is a weighting coefficient and $freq(t_{ij})$ is the frequency with which feature item $T_j$ occurs in attribute $T$ of the elements of the set $E_i$.
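The reconstruction in steps (1) to (3) can be sketched in Python as follows. Each short text is extended with its associated short texts to form the reconstructed text, and the occurrences of the original text's terms are counted in the result. This is a simplification under stated assumptions: the weight formula with the weighting coefficient is not reproduced here, so only raw frequencies are computed, and the function name, the data layout, and the sample texts are hypothetical.

```python
from collections import Counter


def reconstruct(short_texts):
    """short_texts: list of (t, related) pairs, where t is a token list
    and related is a list of token lists associated with t.
    Returns a list of (reconstructed_tokens, term_frequencies) pairs."""
    result = []
    for t, related in short_texts:
        # Extend the original short text with all associated short texts.
        reconstructed = list(t)
        for r in related:
            reconstructed.extend(r)
        counts = Counter(reconstructed)
        # Raw frequency of each term of the original text in the
        # reconstructed text (a stand-in for the weighted feature items).
        freq = {term: counts[term] for term in set(t)}
        result.append((reconstructed, freq))
    return result


# Invented posts and comments, for illustration only.
S = [
    (["new", "lipstick"],
     [["love", "this", "lipstick"], ["lipstick", "shade", "please"]]),
    (["morning", "run"], []),
]
L = reconstruct(S)
```

In the first entry the keyword "lipstick" appears once in the original post but three times after reconstruction, which is the repetition effect that step (2) relies on.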
CN201711452259.XA 2017-12-26 2017-12-26 A kind of user interest classification method based on short text Pending CN109189880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711452259.XA CN109189880A (en) 2017-12-26 2017-12-26 A kind of user interest classification method based on short text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711452259.XA CN109189880A (en) 2017-12-26 2017-12-26 A kind of user interest classification method based on short text

Publications (1)

Publication Number Publication Date
CN109189880A true CN109189880A (en) 2019-01-11

Family

ID=64948437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711452259.XA Pending CN109189880A (en) 2017-12-26 2017-12-26 A kind of user interest classification method based on short text

Country Status (1)

Country Link
CN (1) CN109189880A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955856B (en) * 2012-11-09 2015-07-08 北京航空航天大学 Chinese short text classification method based on characteristic extension
EP3021264A1 (en) * 2013-07-11 2016-05-18 Huawei Technologies Co., Ltd. Information recommendation method and apparatus in social media
CN105740366A (en) * 2016-01-26 2016-07-06 哈尔滨工业大学深圳研究生院 Inference method and device of MicroBlog user interests
CN106095966A (en) * 2016-06-15 2016-11-09 成都品果科技有限公司 A kind of user's extendible label for labelling method and system
CN107239203A (en) * 2016-03-29 2017-10-10 北京三星通信技术研究有限公司 A kind of image management method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾金: ""基于图像语义的用户兴趣建模"", 《数据分析与知识发现》 *
邱云飞: ""基于微博短文本的用户兴趣建模方法"", 《计算机工程》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20190111