CN109189880A

CN109189880A - A user interest classification method based on short text

Info

Publication number: CN109189880A
Application number: CN201711452259.XA
Authority: CN
Inventors: 万迅
Original assignee: Ai Pink Technology (wuhan) Ltd By Share Ltd
Current assignee: Ai Pink Technology (wuhan) Ltd By Share Ltd
Priority date: 2017-12-26
Filing date: 2017-12-26
Publication date: 2019-01-11

Abstract

To address the problem of constructing a user interest classification model, a method is proposed for building a user interest classification model on the short-text data set of the HerPink platform. To alleviate the data sparsity caused by short texts, the concept of short-text reconstruction is introduced on the basis of an analysis of short-text structure and content, and the text content is extended accordingly to enlarge the original feature information. The feature word set of the reconstructed text is mapped to a concept set using a word segmentation tool. Clustering is performed on the text vectors abstracted to the concept level, the user's interest sets are partitioned, and a representation mechanism for the user interest classification model is given. The results show that short-text reconstruction and concept mapping improve the clustering effect and that the constructed user interest classification model performs well.

Description

A user interest classification method based on short text

Technical field

This application relates to the technical field of information processing, and in particular to a short-text user interest classification method.

Background art

Text is obtained from the back-end database. In the text preprocessing stage, the Python Jieba library is used to segment the text; stop words are then removed, TF-IDF feature weights are computed, and feature items are extracted to form the text feature vector space. Finally, user interest classification is performed with an SVM, and the classification results are evaluated.
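The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: whitespace splitting stands in for Jieba segmentation, the stop-word list is a hypothetical placeholder, the TF-IDF variant (relative term frequency times the logarithm of inverse document frequency) is one common choice that the text does not specify, and the SVM step is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def tokenize(text):
    # Stand-in for Jieba segmentation: lowercase whitespace split,
    # followed by stop-word removal.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            (tf[w] / len(doc)) * math.log(n_docs / df[w]) if doc else 0.0
            for w in vocab
        ])
    return vocab, vectors

texts = [
    "the lipstick is a matte red",
    "a review of the red dress",
    "running and yoga",
]
docs = [tokenize(t) for t in texts]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one document in the text feature vector space and could then be passed to a classifier such as an SVM.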

Summary of the invention

The vector space model is a model for text representation. It defines the basic unit of text representation as a feature item composed of characters, words, or phrases; all feature items together constitute the feature item set. Each document is represented by a vector whose dimension equals the size of the feature item set, and each component of the vector is the number of times the corresponding feature item occurs in the document. It is defined as follows: let the document set be $A = \{a_i\}$, with $S$ elements, and let the feature item set be $T = \{t_j\}$, with $M$ elements. The weight $W_{ij}$ of feature item $t_j$ in document $a_i$ is defined as:

$$W_{ij} = \frac{tf_{ij}}{af_j}, \qquad 1 \le i \le S,\ 1 \le j \le M$$

where $tf_{ij}$ is the frequency with which feature item $t_j$ occurs in document $a_i$, called the term frequency, and $af_j$ is the number of documents in the document set $A$ in which feature item $t_j$ occurs, called the document frequency. The vector space model of the documents is constructed on this basis: with $t_1, t_2, \ldots, t_M$ as coordinate axes, document $a_i$ is expressed as the $M$-dimensional vector $(W_{i1}, W_{i2}, \ldots, W_{iM})$, and the similarity $sim(a_i, a_j)$ between $a_i$ and $a_j$ is:
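As an illustration of the weight definition above, the following sketch builds the weight matrix from tokenized documents, dividing each term frequency by the corresponding document frequency. The function name `weight_matrix` and the sample documents are hypothetical and are not part of the application.

```python
from collections import Counter

def weight_matrix(docs, features):
    """Return W where W[i][j] = tf_ij / af_j for tokenized documents."""
    # af_j: number of documents in which feature item t_j occurs.
    af = [sum(1 for doc in docs if t in doc) for t in features]
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of each feature item in this document
        matrix.append([
            tf[t] / af[j] if af[j] else 0.0
            for j, t in enumerate(features)
        ])
    return matrix

docs = [["red", "dress", "red"], ["red", "shoes"], ["yoga"]]
features = ["dress", "red", "shoes", "yoga"]
W = weight_matrix(docs, features)
```

Row `W[i]` is the M-dimensional vector for document i, with one coordinate per feature item.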

Wherein: 1≤i≤S, 1≤j≤M

The similarity between a retrieved document X and the user's target document Y is then sim(X, Y). Selecting the documents that meet a predetermined threshold and sorting them in descending order of similarity yields the retrieved documents that satisfy the user's needs.
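The thresholding and ranking step can be sketched as follows. The similarity formula itself is not reproduced in this text, so the sketch assumes cosine similarity, a common choice for vector space models; that choice, the function names, and the sample vectors are all assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    # Assumed similarity measure; the application's own formula is not shown here.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def rank_documents(target, candidates, threshold):
    """Keep candidates whose similarity to target meets the threshold,
    sorted in descending order of similarity."""
    scored = [(name, cosine_similarity(vec, target))
              for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

target = [1.0, 1.0, 0.0]
candidates = {
    "x1": [1.0, 1.0, 0.0],
    "x2": [1.0, 0.0, 0.0],
    "x3": [0.0, 0.0, 1.0],
}
results = rank_documents(target, candidates, threshold=0.5)
```

With these sample vectors, `x3` shares no feature with the target and falls below the threshold, so only `x1` and `x2` are returned, in that order.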

Brief description of the drawings

Fig. 1 is a schematic architecture diagram of a short-text user interest classification method provided by an exemplary embodiment of the application.

Detailed description of the embodiments

1. The quality of text classification is evaluated mainly by the following metrics:

(1) classification accuracy rate (classification accuracy)

$$\mathrm{Accuracy}(M) = \sum_x p(x)\,\mathrm{Accuracy}(M, x) = P\big(\hat{C}(x) = C(x)\big)$$

$\mathrm{Accuracy}(M, x) = 1$ when $\hat{C}(x) = C(x)$, and $\mathrm{Accuracy}(M, x) = 0$ otherwise. Here $C(x)$ is the actual class of sample $x$, $\hat{C}(x)$ is the class predicted by the model, and $p(x)$ is the probability of sample $x$.

(2) precision ratio (precision)

Precision is the ratio of the number of correctly matched documents retrieved by the search engine to the total number of documents in the retrieved set. The estimation formula for precision is:

$$\mathrm{Precision}(M, C) = P(C \mid \hat{C})$$

(3) recall ratio (recall)

Recall is the ratio of the number of correctly retrieved texts in the search result to the number of texts that actually exist and satisfy the query. The estimation formula for recall is:

$$\mathrm{Recall}(M, C) = P(\hat{C} \mid C)$$

where $C$ denotes that the actual value is the target class, and $\hat{C}$ denotes that the predicted value is the target class.
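The three metrics can be estimated from paired lists of actual and predicted classes, as in the sketch below. It assumes every sample is equally likely, so accuracy reduces to the fraction of correct predictions; the function names and sample labels are hypothetical.

```python
def accuracy(actual, predicted):
    # Fraction of samples whose predicted class equals the actual class
    # (assumes a uniform probability over samples).
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted, target):
    # Among samples predicted as the target class, the fraction that truly are.
    hits = [a for a, p in zip(actual, predicted) if p == target]
    if not hits:
        return 0.0
    return sum(a == target for a in hits) / len(hits)

def recall(actual, predicted, target):
    # Among samples that truly are the target class, the fraction predicted as such.
    hits = [p for a, p in zip(actual, predicted) if a == target]
    if not hits:
        return 0.0
    return sum(p == target for p in hits) / len(hits)

actual = ["beauty", "beauty", "sport", "sport", "beauty"]
predicted = ["beauty", "sport", "sport", "beauty", "beauty"]

acc = accuracy(actual, predicted)
prec = precision(actual, predicted, "beauty")
rec = recall(actual, predicted, "beauty")
```

In this example three of five predictions are correct, and for the target class "beauty" two of the three predicted and two of the three actual samples match.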

Claims

1. A user interest classification method based on short text, characterized by comprising the following steps:

(1) The number of characters in a HerPink platform user's text is limited, so the text falls within the scope of short text. Because a single text has few characters, it contains little feature information and can hardly bear the task of characterizing the user's interest categories, so some strategy must be adopted to enrich the content. For the target corpus, the structural characteristic it inherently possesses is the association between texts. Here, text includes the information content that users publish, forward, and comment on. A text that a user publishes or forwards may have corresponding comments, so the published or forwarded text and its corresponding set of comment texts are mutually associated;

(2) Because the short texts a user posts carry little content and their features are not distinct enough, a method is needed that increases the feature information of each short text. Precisely because short texts are associated with one another, within the full set of short texts associated with a given text the keywords of the original short text are mentioned repeatedly, and the number of other topic-related words also increases. Based on this characteristic, a short text that a user publishes or forwards can be extended with its associated set of comment short texts; likewise, a comment short text that a user posts is extended with the short text it belongs to and the other corresponding comment texts.

The present invention converts the user interest identification problem into a traditional classification problem: according to the interest feature vector Uv = {x1, x2, x3, ..., xn} of user U and a function f, the user's interest categories Y = {y1, y2, y3, ..., yi} are determined, denoted f (UX) -> Y, where yi represents an interest category of the user.

The present invention proposes a new way of expressing user interest features. Given a user U, suppose the set of pictures the user publishes in a specific time period is I = {i1, i2, i3, ..., in}, where n is the number of pictures. Each picture i contains a variety of concepts and objects (which can serve as a representation of the image semantics); existing image semantic recognition techniques can identify the feature set F = {f1, f2, f3, ..., fj, ..., fm} of these concepts and objects, where m is the number of features and fj is the probability that the image contains semantic concept j. Likewise, suppose the set of texts the user publishes in a certain time period is D = {d1, d2, d3, ..., dp}, where p is the number of texts. Assume the length of text D is s, that is, everything the user has published contains s words in total (after a filtering algorithm has filtered the text and retained the valuable feature text), so D = {W1, W2, W3, ..., Ws}. Each word in the text can be represented by a word vector, which makes better use of syntactic and semantic features, and the text sentence vector is finally expressed as V(D) = V(W1) + V(W2) + ... + V(Ws), which is used in the next step for microblog text classification based on word-vector features. For the tag data T = {t1, t2, t3, ..., tq}, q is the number of tags (each user has more than one tag), and the feature representation of the tags is constructed by decomposing each distinct tag into a vector space model. Finally, users with different interest categories share pictures with different concept feature distributions, publish different texts, and carry different user tags, so their different interests can be predicted accordingly.
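The sentence vector V(D) = V(W1) + V(W2) + ... + V(Ws) can be illustrated with the sketch below. The three-dimensional word vectors are hypothetical toy values, and skipping words that have no vector is an assumption; the text does not say how such words are handled or how the word vectors are trained.

```python
# Hypothetical toy word vectors; a real system would use trained embeddings.
WORD_VECTORS = {
    "red": [1.0, 0.0, 0.5],
    "dress": [0.0, 1.0, 0.5],
    "sale": [0.5, 0.5, 0.0],
}

def sentence_vector(words, word_vectors, dim):
    """Sum the word vectors of all words in a filtered text."""
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words without a vector are skipped
        total = [t + v for t, v in zip(total, vec)]
    return total

v = sentence_vector(["red", "dress", "sale", "unknown"], WORD_VECTORS, dim=3)
```

The resulting vector has the same dimension as the word vectors regardless of text length, which is what allows it to be fed to a fixed-size classifier.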

(3) Let $S = \{s_1, s_2, s_3, \ldots, s_n\}$ denote the set of all texts the user has published, forwarded, and commented on, where $s_i = \langle t_i, R_i \rangle$, $t_i$ is the $i$-th short text, $R_i$ is its associated short-text set, and $r_j \in R_i$. Let $L = \{l_1, l_2, l_3, \ldots, l_n\}$ be the reconstructed set, with $l_i = \langle D_i, E_i \rangle$, where $D_i$ denotes the text formed by reconstructing $t_i$ with $R_i$, and $E_i$ denotes the set of topic feature items and user-specific representative feature items extracted from $t_i$ together with their corresponding weights, with $e_j \in E_i$ and $e_j = \langle T_j, W_j \rangle$. $W_j$ is calculated as:

where $\rho$ is a weighting coefficient and $freq(t_{ij})$ is the frequency with which feature item $T_j$ occurs in attribute $T$ of the elements of set $E_i$.
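The reconstruction of each short text with its associated comments can be sketched as follows. The weight formula is not reproduced in this text, so the sketch records only raw feature frequencies in the reconstructed text and does not compute the weights; whitespace splitting stands in for word segmentation, and the function name `reconstruct` and the sample posts are hypothetical.

```python
from collections import Counter

def reconstruct(posts):
    """posts: list of (text, related_texts) pairs, one per short text.
    Returns a list of (reconstructed_text, feature_frequencies) pairs."""
    result = []
    for text, related in posts:
        # Reconstructed text: the original short text extended with its
        # associated short texts (comments, or the parent text of a comment).
        merged = " ".join([text] + list(related))
        # Raw frequencies only; the application's weighting with the
        # coefficient rho is not applied here.
        freq = Counter(merged.lower().split())
        result.append((merged, freq))
    return result

S = [
    ("new red lipstick", ["love this lipstick", "red suits you"]),
    ("morning yoga", []),
]
L = reconstruct(S)
```

In the first entry the keywords "lipstick" and "red" each occur twice after reconstruction, which illustrates how the associated comments reinforce the features of the original short text.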