CN110569351A - Network media news classification method based on restrictive user preference - Google Patents

Network media news classification method based on restrictive user preference

Info

Publication number
CN110569351A
CN110569351A
Authority
CN
China
Prior art keywords
news
vector
user
word
preference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910821597.9A
Other languages
Chinese (zh)
Inventor
靳继磊
王森奥
刘玲
朱迪
祁菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Liyun Wanluo Technology Co Ltd
Original Assignee
Beijing Liyun Wanluo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Liyun Wanluo Technology Co Ltd filed Critical Beijing Liyun Wanluo Technology Co Ltd
Priority to CN201910821597.9A priority Critical patent/CN110569351A/en
Publication of CN110569351A publication Critical patent/CN110569351A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network media news classification method based on restrictive user preference, which comprises: acquiring a user set and a news set; preprocessing the news data in the news set; generating a user news preference matrix according to the users' preference behavior toward news; extracting features of the news data according to the user news preference matrix; calculating a feature vector for each news item to be classified; and classifying the news according to the feature vectors. When extracting data features, the method fully considers the preference behavior of users and screens the relevant data accordingly, so that the resulting features better fit specific users. When constructing the feature vectors, features of multiple dimensions are considered, and a comprehensive similarity measure is introduced for the similarity calculation. Compared with the prior art, the method is more targeted, has lower time complexity, and has better universality.

Description

Network media news classification method based on restrictive user preference
Technical Field
The invention belongs to the field of intelligent text classification processing, and particularly relates to a network media news classification method based on restrictive user preference.
Background
Text classification refers to determining a category for each document in a document set according to predefined topic categories, and is a supervised learning process. With the development of the Internet, text-centered unstructured data has grown rapidly, so text classification has become an important research topic and is widely studied and applied in fields such as machine learning and information retrieval.
There is a large body of work on text classification in both academia and industry. The traditional Vector Space Model (VSM) does not consider word semantics and suffers from high dimensionality and high sparsity when used for document representation. The Latent Semantic Analysis (LSA) model maps the high-dimensional feature space to a low dimension using matrix singular value decomposition, but the matrix decomposition is computationally expensive. The Probabilistic Latent Semantic Analysis (PLSA) model links latent topics and co-occurrence data through a probabilistic model, but its number of parameters grows linearly with the number of documents. The Latent Dirichlet Allocation (LDA) model describes a three-layer document-topic-word structure and is an unsupervised model; however, semantic information such as word vectors is not incorporated during its training. Research along these lines is ongoing.
However, the technical requirements of specific applications differ from academic research, especially for network media operators. When intelligently classifying the news on their own platforms, they need to focus on two aspects: first, the object of classification, i.e., the textual features of the news currently on the platform and of the news that will arrive in the future; and second, the purpose of classification, which in practice is mainly to satisfy user preferences. In other words, news needs to be classified under the constraint of different users' preferences so as to realize precise marketing.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a network media news classification method based on restrictive user preference, which enables a network media operator to classify the news on its platform more accurately and appropriately according to user preferences.
Based on the above purpose, the invention provides a network media news classification method based on restrictive user preference, which comprises the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
The preprocessing comprises Chinese word segmentation according to a preset rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
The preference behavior of the user toward news in step 3 refers to the behavior of the user browsing or clicking the corresponding news. The user news preference matrix R is M × N, where M is the number of users and N is the size of the news space; each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
Step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1, and acquiring word frequency vector features;
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
In step 5, the word vector features, the word frequency vector features and the topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
The word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
the fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight;
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
The classification process described in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished. Further, the similarity is calculated by the following formula:
where Max(d(X, Y)) represents the maximum Manhattan distance between two vectors, α is a preset adjustable parameter for adjusting the weight between the distance metric and the angle metric, the Manhattan distance is d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
The network media news classification method based on restrictive user preference of the present invention fully considers the preference behavior of users when extracting data features and screens the relevant data with emphasis, so that the obtained results better fit specific users and the data processing time is greatly shortened. Secondly, when constructing the feature vectors, features of multiple dimensions are fully considered, and a comprehensive similarity measure is introduced in the similarity calculation, which makes the method more general. Compared with the prior art, the method is more targeted, has lower time complexity, and has better universality.
Drawings
FIG. 1 is a flowchart of the network media news classification method based on restrictive user preference according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings, but the invention is not limited thereto in any way; any alteration or substitution based on the teaching of the invention falls within the scope of the invention.
Referring to fig. 1, an embodiment of the present invention provides a network media news classification method based on restrictive user preference, comprising the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
The preprocessing comprises Chinese word segmentation according to a preset rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
Generally, there are various methods for Chinese word segmentation, including methods based on string matching against a dictionary, methods based on text probability (statistics), and methods based on semantic analysis. The words deleted according to part of speech generally include conjunctions, interjections, adverbs, and the like. Deleting irrelevant words according to word frequency can be understood as deleting words with very high frequency, such as "我" ("I"), "的" ("of/the"), and the like.
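By way of illustration only, the following minimal Python sketch shows one way such preprocessing could be implemented. It assumes the jieba library for Chinese word segmentation; the part-of-speech tags to drop (conjunctions, interjections, adverbs) and the document-frequency cutoff are illustrative choices, not values fixed by the invention.

import jieba.posseg as pseg
from collections import Counter

DROP_POS = {"c", "e", "d"}   # jieba tags: conjunctions, interjections, adverbs

def preprocess(docs, max_df=0.9):
    """Segment Chinese text, drop words by part of speech, then drop overly frequent words."""
    segmented = []
    for text in docs:
        words = [w for w, flag in pseg.cut(text) if flag not in DROP_POS and w.strip()]
        segmented.append(words)
    # document frequency of each remaining word
    df = Counter(w for words in segmented for w in set(words))
    n_docs = len(segmented)
    frequent = {w for w, c in df.items() if c / n_docs > max_df}   # e.g. "我", "的"
    return [[w for w in words if w not in frequent] for words in segmented]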
The preference behavior of the user toward news in step 3 refers to the behavior of the user browsing or clicking the corresponding news. The user news preference matrix R is M × N, where M is the number of users and N is the size of the news space; each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
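A minimal sketch of how the M × N preference matrix R could be assembled from logged browse/click events is given below; the event format (user id, news id) and the use of a dense NumPy array are assumptions for illustration. The Boolean thresholding of step 401 described below reduces to a single comparison and is included for completeness.

import numpy as np

def build_preference_matrix(events, users, news_items):
    """events: iterable of (user_id, news_id) browse/click records."""
    u_index = {u: i for i, u in enumerate(users)}       # M users
    n_index = {n: j for j, n in enumerate(news_items)}  # N news items
    R = np.zeros((len(users), len(news_items)), dtype=int)
    for user_id, news_id in events:
        R[u_index[user_id], n_index[news_id]] += 1      # r_ij = number of preference behaviors
    return R

def binarize(R, threshold=1):
    """Step 401: r_ij = 1 if r_ij >= threshold else 0."""
    return (R >= threshold).astype(int)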
Step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1 to obtain word frequency vector features. Word frequency and inverse document frequency (TF-IDF) is a statistical method used to evaluate how important a word is to a document or a category within a corpus. The main idea is that if a word or phrase appears frequently in one category and rarely in the other categories, it is considered to have good category discrimination ability and to be suitable for classification. The value is the product of the term frequency (TF) and the inverse document frequency (IDF). The term frequency is the frequency with which a term t occurs in a document d, while the inverse document frequency reflects the category discrimination ability of the term t: the fewer documents contain the term t, the larger the inverse document frequency. TF and IDF are calculated as follows.
tf(t, d) = f(t, d) / Σ_k f(k, d), where f(t, d) represents the number of times the term t appears in the document d and the sum runs over all terms of d; idf_t = log(N / n_t), where n_t indicates the number of documents in the corpus that contain the term t and N indicates the total number of documents in the corpus. The TF-IDF weight of the term t is tfidf_t = tf(t, d) × idf_t. The weight of a term t therefore increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
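The sketch below computes TF-IDF weights directly from these definitions; it is an illustrative implementation of the standard formulas (natural logarithm, no smoothing), which the publication does not fix explicitly.

import math
from collections import Counter

def tfidf(preferred_docs):
    """preferred_docs: list of token lists for the news items with r_ij = 1."""
    N = len(preferred_docs)
    n_t = Counter(t for doc in preferred_docs for t in set(doc))   # document frequency
    idf = {t: math.log(N / n_t[t]) for t in n_t}
    vectors = []
    for doc in preferred_docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})  # tf(t,d) * idf_t
    return vectors, idf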
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
In step 5, the word vector features, the word frequency vector features and the topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
The word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
Word2vec word vectors are relatively simple to train and use, and the good similarity measure between word vectors makes them well suited to the present invention. Because the news data have already been screened according to user preference in the earlier stage, the word2vec word vectors in this step occupy little space, and the subsequent computation is faster.
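For concreteness, the following sketch trains word2vec vectors with the gensim library (4.x API) on the preference-filtered, segmented news; the hyperparameters (vector_size, window, min_count) are illustrative defaults, not values specified by the publication. A simple document-level vector is obtained here by averaging word vectors; the TF-IDF-weighted fusion described below refines this.

import numpy as np
from gensim.models import Word2Vec

def train_word_vectors(segmented_docs, dim=100):
    model = Word2Vec(sentences=segmented_docs, vector_size=dim,
                     window=5, min_count=2, workers=4)
    def doc_vector(tokens):
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return model, doc_vector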
The fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight.
In the embodiment of the invention, the topic vector generation model adopts the LF-LDA model, which replaces the Dirichlet multinomial topic-word distribution of the original LDA model with a mixture of two distributions: one is the original Dirichlet multinomial distribution, and the other generates word distributions from latent word features.
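LF-LDA itself is not available in common Python libraries, so the sketch below uses gensim's standard LDA model as a stand-in merely to show how per-document topic vectors of the kind used in step 404 can be obtained; substituting an LF-LDA implementation would change only the model-fitting call.

import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def topic_vectors(segmented_docs, num_topics=20):
    dictionary = corpora.Dictionary(segmented_docs)
    corpus = [dictionary.doc2bow(doc) for doc in segmented_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
    vectors = []
    for bow in corpus:
        dist = lda.get_document_topics(bow, minimum_probability=0.0)  # full topic distribution
        vectors.append(np.array([p for _, p in sorted(dist)]))
    return lda, vectors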
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
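As a sketch of this fusion, under the assumption that "weighting by the word frequency vector" means scaling each word's vector by its TF-IDF weight before averaging, and that the topic vector is scaled by the document's aggregate TF-IDF weight, the comprehensive feature vector can be formed as follows; the exact weighting scheme is a modeling choice the publication leaves open.

import numpy as np

def fuse_features(tokens, tfidf_weights, word_model, topic_vec):
    """tokens: segmented news text; tfidf_weights: dict term -> TF-IDF weight;
    word_model: trained word2vec model; topic_vec: topic distribution for this news item."""
    # TF-IDF-weighted average of word vectors (word vector as reference, TF-IDF as weight)
    pairs = [(word_model.wv[t], tfidf_weights.get(t, 0.0))
             for t in tokens if t in word_model.wv]
    if pairs:
        weighted = sum(w * v for v, w in pairs) / (sum(w for _, w in pairs) or 1.0)
    else:
        weighted = np.zeros(word_model.vector_size)
    # Topic vector scaled by the document's total TF-IDF mass (one possible reading of the weighting)
    topic_part = topic_vec * sum(tfidf_weights.get(t, 0.0) for t in set(tokens))
    # Concatenate the two fused parts into the comprehensive feature vector
    return np.concatenate([weighted, topic_part])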
The classification process described in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished.
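The procedure of steps 601 to 605 is essentially a k-means-style loop over a custom similarity; the sketch below follows that structure directly. The stopping rule is reduced to "few assignment changes" and the centres are updated as plain means, both of which are simplifying assumptions (the publication reselects centres via pairwise intra-category similarity and also allows a similarity-threshold stop).

import numpy as np

def classify(features, T, similarity, max_iter=100, change_threshold=0):
    """features: (num_news, dim) array; similarity(x, y) -> larger means more alike."""
    rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), size=T, replace=False)]   # step 601
    labels = np.zeros(len(features), dtype=int)
    for _ in range(max_iter):
        new_labels = np.array([int(np.argmax([similarity(x, c) for c in centers]))
                               for x in features])                          # steps 602 / 604
        changes = int(np.sum(new_labels != labels))
        labels = new_labels
        if changes <= change_threshold:                                     # step 605 (simplified)
            break
        centers = np.array([features[labels == t].mean(axis=0) if np.any(labels == t)
                            else centers[t] for t in range(T)])             # step 603 (simplified)
    return labels, centers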
The similarity is calculated by the following formula:
where Max(d(X, Y)) represents the maximum Manhattan distance between two vectors, α is a preset adjustable parameter for adjusting the weight between the distance metric and the angle metric, the Manhattan distance is d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
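The formula itself is not reproduced in this text, but the surrounding description (a maximum-normalized Manhattan-distance term and a cosine "angle" term balanced by α) suggests a combined measure of the following form. This is a reconstruction under that assumption, not the literal formula of the publication.

import numpy as np

def combined_similarity(x, y, max_manhattan, alpha=0.5):
    """Hypothetical combined similarity: alpha weighs a Manhattan-distance term
    against a cosine (angle) term, as described around the omitted formula."""
    d = float(np.sum(np.abs(x - y)))                      # Manhattan distance d(X, Y)
    distance_term = 1.0 - d / max_manhattan               # normalized by Max(d(X, Y))
    norm = np.linalg.norm(x) * np.linalg.norm(y)          # Euclidean norms ||X|| * ||Y||
    angle_term = float(np.dot(x, y) / norm) if norm else 0.0
    return alpha * distance_term + (1.0 - alpha) * angle_term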
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the invention, features of the above embodiments or of different embodiments may be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention exist which are not detailed here for brevity.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A network media news classification method based on restrictive user preference, characterized by comprising the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
2. The method of claim 1, wherein the preprocessing comprises Chinese word segmentation according to a predetermined rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
3. The method of claim 2, wherein the preference behavior of the user toward news in step 3 is the behavior of the user browsing or clicking the corresponding news, and the user news preference matrix R is M × N, where M is the number of users, N is the size of the news space, and each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
4. The method for classifying network media news according to claim 3, wherein step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1, and acquiring word frequency vector features;
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
5. The method for classifying news as claimed in claim 4, wherein in step 5, the word vector features, word frequency vector features and topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
6. The method of claim 5, wherein the word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
The fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight;
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
7. The method for news classification of network media according to claim 6, wherein the classification process in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished.
8. The method of claim 7, wherein the similarity is calculated by the following formula:
Where Max(d(X, Y)) represents the maximum Manhattan distance of two feature vectors, the Manhattan distance is expressed as d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two feature vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), α is an adjustable parameter, and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
CN201910821597.9A 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference Pending CN110569351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910821597.9A CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910821597.9A CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Publications (1)

Publication Number Publication Date
CN110569351A true CN110569351A (en) 2019-12-13

Family

ID=68777273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910821597.9A Pending CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Country Status (1)

Country Link
CN (1) CN110569351A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611379A (en) * 2020-05-18 2020-09-01 深圳证券信息有限公司 Text information classification method, device, equipment and readable storage medium
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770520A (en) * 2010-03-05 2010-07-07 南京邮电大学 User interest modeling method based on user browsing behavior
CN101853250A (en) * 2009-04-03 2010-10-06 华为技术有限公司 Method and device for classifying documents
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853250A (en) * 2009-04-03 2010-10-06 华为技术有限公司 Method and device for classifying documents
CN101770520A (en) * 2010-03-05 2010-07-07 南京邮电大学 User interest modeling method based on user browsing behavior
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611379A (en) * 2020-05-18 2020-09-01 深圳证券信息有限公司 Text information classification method, device, equipment and readable storage medium
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device
CN112836010B (en) * 2020-10-22 2024-04-05 新长城科技有限公司 Retrieval method, storage medium and device for patent
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium

Similar Documents

Publication Publication Date Title
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108763362B (en) Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection
CN106156204B (en) Text label extraction method and device
CN110232149B (en) Hot event detection method and system
WO2023029420A1 (en) Power user appeal screening method and system, electronic device, and storage medium
JP2012027845A (en) Information processor, relevant sentence providing method, and program
Zhou et al. Joint image and text representation for aesthetics analysis
CN112749341B (en) Important public opinion recommendation method, readable storage medium and data processing device
CN110569351A (en) Network media news classification method based on restrictive user preference
Lavanya et al. Twitter sentiment analysis using multi-class SVM
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN109062895B (en) Intelligent semantic processing method
CN112949713B (en) Text emotion classification method based on complex network integrated learning
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
CN110866102A (en) Search processing method
CN114742071B (en) Cross-language ideas object recognition analysis method based on graph neural network
CN116501875A (en) Document processing method and system based on natural language and knowledge graph
Osanyin et al. A review on web page classification
Villegas et al. Vector-based word representations for sentiment analysis: a comparative study
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN116881451A (en) Text classification method based on machine learning
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
CN113780832B (en) Public opinion text scoring method, public opinion text scoring device, computer equipment and storage medium
Gong A personalized recommendation method for short drama videos based on external index features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191213