CN110569351A - Network media news classification method based on restrictive user preference - Google Patents

Network media news classification method based on restrictive user preference

Info

Publication number
CN110569351A
CN110569351A
Authority
CN
China
Prior art keywords
news
vector
user
word
preference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910821597.9A
Other languages
Chinese (zh)
Inventor
靳继磊
王森奥
刘玲
朱迪
祁菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Liyun Wanluo Technology Co Ltd
Original Assignee
Beijing Liyun Wanluo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Liyun Wanluo Technology Co Ltd filed Critical Beijing Liyun Wanluo Technology Co Ltd
Priority to CN201910821597.9A priority Critical patent/CN110569351A/en
Publication of CN110569351A publication Critical patent/CN110569351A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a network media news classification method based on restrictive user preference, which comprises: acquiring a user set and a news set; preprocessing the news data in the news set; generating a user news preference matrix according to the users' preference behavior toward news; extracting features of the news data according to the user news preference matrix; calculating a feature vector for each news item to be classified; and classifying the news according to the feature vectors. When extracting data features, the method fully considers the preference behavior of users and screens the relevant data accordingly, so that the resulting features better fit specific users. When constructing the feature vectors, features of multiple dimensions are considered, and a comprehensive similarity measure is introduced for the similarity calculation. Compared with the prior art, the method is more targeted, has lower time complexity, and has better universality.

Description

Network media news classification method based on restrictive user preference
Technical Field
The invention belongs to the field of intelligent text classification processing, and particularly relates to a network media news classification method based on restrictive user preference.
Background
Text classification refers to determining a category for each document in a document set according to predefined topic categories, and is a supervised learning process. With the development of the Internet, text-centered unstructured data has grown rapidly, so text classification has become an important research topic and is widely studied and applied in fields such as machine learning and information retrieval.
There is a large body of work on text classification in both academia and industry. The traditional Vector Space Model (VSM) does not consider word semantics and suffers from high dimensionality and high sparsity when used for document representation. The Latent Semantic Analysis (LSA) model maps the high-dimensional feature space to a low dimension using matrix singular value decomposition, but the matrix decomposition is computationally expensive. The Probabilistic Latent Semantic Analysis (PLSA) model links latent topics and co-occurrence data through a probabilistic model, but its number of parameters grows linearly with the number of documents. The Latent Dirichlet Allocation (LDA) model describes a three-layer document-topic-word structure and is an unsupervised model; however, semantic information such as word vectors is not incorporated during its training. Research along these lines is ongoing.
However, the technical requirements of specific applications differ from academic research, especially for network media operators. When intelligently classifying the news on their own platforms, they need to focus on two aspects: first, the object of classification, i.e., the textual features of the news currently on the platform and of the news that will arrive in the future; and second, the purpose of classification, which in practice is mainly to satisfy user preferences. In other words, news needs to be classified under the constraint of different users' preferences so as to realize precise marketing.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a network media news classification method based on restrictive user preference, which enables a network media operator to classify the news on its platform more accurately and appropriately according to user preferences.
Based on the above purpose, the invention provides a network media news classification method based on restrictive user preference, which comprises the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
The preprocessing comprises Chinese word segmentation according to a preset rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
The preference behavior of the user toward news in step 3 refers to the behavior of the user browsing or clicking the corresponding news. The user news preference matrix R is M × N, where M is the number of users and N is the size of the news space; each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
Step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1, and acquiring word frequency vector features;
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
In step 5, the word vector features, the word frequency vector features and the topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
The word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
the fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight;
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
The classification process described in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished. Further, the similarity is calculated by the following formula:
where Max(d(X, Y)) represents the maximum Manhattan distance between two vectors, α is a preset adjustable parameter for adjusting the weight between the distance metric and the angle metric, the Manhattan distance is d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
The network media news classification method based on restrictive user preference of the present invention fully considers the preference behavior of users when extracting data features and screens the relevant data with emphasis, so that the obtained results better fit specific users and the data processing time is greatly shortened. Secondly, when constructing the feature vectors, features of multiple dimensions are fully considered, and a comprehensive similarity measure is introduced in the similarity calculation, which makes the method more general. Compared with the prior art, the method is more targeted, has lower time complexity, and has better universality.
Drawings
FIG. 1 is a flowchart of the network media news classification method based on restrictive user preference according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings, but the invention is not limited thereto in any way; any alteration or substitution based on the teaching of the invention falls within the scope of the invention.
Referring to fig. 1, an embodiment of the present invention provides a network media news classification method based on restrictive user preference, comprising the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
The preprocessing comprises Chinese word segmentation according to a preset rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
Generally, there are various methods for Chinese word segmentation, including methods based on string matching against a dictionary, methods based on text probability (statistics), and methods based on semantic analysis. The words deleted according to part of speech generally include conjunctions, interjections, adverbs, and the like. Deleting irrelevant words according to word frequency can be understood as deleting words with very high frequency, such as "我" ("I"), "的" ("of/the"), and the like.
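By way of illustration only, the following minimal Python sketch shows one way such preprocessing could be implemented. It assumes the jieba library for Chinese word segmentation; the part-of-speech tags to drop (conjunctions, interjections, adverbs) and the document-frequency cutoff are illustrative choices, not values fixed by the invention.

import jieba.posseg as pseg
from collections import Counter

DROP_POS = {"c", "e", "d"}   # jieba tags: conjunctions, interjections, adverbs

def preprocess(docs, max_df=0.9):
    """Segment Chinese text, drop words by part of speech, then drop overly frequent words."""
    segmented = []
    for text in docs:
        words = [w for w, flag in pseg.cut(text) if flag not in DROP_POS and w.strip()]
        segmented.append(words)
    # document frequency of each remaining word
    df = Counter(w for words in segmented for w in set(words))
    n_docs = len(segmented)
    frequent = {w for w, c in df.items() if c / n_docs > max_df}   # e.g. "我", "的"
    return [[w for w in words if w not in frequent] for words in segmented]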
The preference behavior of the user toward news in step 3 refers to the behavior of the user browsing or clicking the corresponding news. The user news preference matrix R is M × N, where M is the number of users and N is the size of the news space; each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
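A minimal sketch of how the M × N preference matrix R could be assembled from logged browse/click events is given below; the event format (user id, news id) and the use of a dense NumPy array are assumptions for illustration. The Boolean thresholding of step 401 described below reduces to a single comparison and is included for completeness.

import numpy as np

def build_preference_matrix(events, users, news_items):
    """events: iterable of (user_id, news_id) browse/click records."""
    u_index = {u: i for i, u in enumerate(users)}       # M users
    n_index = {n: j for j, n in enumerate(news_items)}  # N news items
    R = np.zeros((len(users), len(news_items)), dtype=int)
    for user_id, news_id in events:
        R[u_index[user_id], n_index[news_id]] += 1      # r_ij = number of preference behaviors
    return R

def binarize(R, threshold=1):
    """Step 401: r_ij = 1 if r_ij >= threshold else 0."""
    return (R >= threshold).astype(int)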
Step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1 to obtain word frequency vector features. Word frequency and inverse document frequency (TF-IDF) is a statistical method used to evaluate how important a word is to a document or a category within a corpus. The main idea is that if a word or phrase appears frequently in one category and rarely in the other categories, it is considered to have good category discrimination ability and to be suitable for classification. The value is the product of the term frequency (TF) and the inverse document frequency (IDF). The term frequency is the frequency with which a term t occurs in a document d, while the inverse document frequency reflects the category discrimination ability of the term t: the fewer documents contain the term t, the larger the inverse document frequency. TF and IDF are calculated as follows.
tf(t, d) = f(t, d) / Σ_k f(k, d), where f(t, d) represents the number of times the term t appears in the document d and the sum runs over all terms of d; idf_t = log(N / n_t), where n_t indicates the number of documents in the corpus that contain the term t and N indicates the total number of documents in the corpus. The TF-IDF weight of the term t is tfidf_t = tf(t, d) × idf_t. The weight of a term t therefore increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
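The sketch below computes TF-IDF weights directly from these definitions; it is an illustrative implementation of the standard formulas (natural logarithm, no smoothing), which the publication does not fix explicitly.

import math
from collections import Counter

def tfidf(preferred_docs):
    """preferred_docs: list of token lists for the news items with r_ij = 1."""
    N = len(preferred_docs)
    n_t = Counter(t for doc in preferred_docs for t in set(doc))   # document frequency
    idf = {t: math.log(N / n_t[t]) for t in n_t}
    vectors = []
    for doc in preferred_docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})  # tf(t,d) * idf_t
    return vectors, idf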
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
In step 5, the word vector features, the word frequency vector features and the topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
The word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
Word2vec word vectors are relatively simple to train and use, and the good similarity measure between word vectors makes them well suited to the present invention. Because the news data have already been screened according to user preference in the earlier stage, the word2vec word vectors in this step occupy little space, and the subsequent computation is faster.
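For concreteness, the following sketch trains word2vec vectors with the gensim library (4.x API) on the preference-filtered, segmented news; the hyperparameters (vector_size, window, min_count) are illustrative defaults, not values specified by the publication. A simple document-level vector is obtained here by averaging word vectors; the TF-IDF-weighted fusion described below refines this.

import numpy as np
from gensim.models import Word2Vec

def train_word_vectors(segmented_docs, dim=100):
    model = Word2Vec(sentences=segmented_docs, vector_size=dim,
                     window=5, min_count=2, workers=4)
    def doc_vector(tokens):
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return model, doc_vector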
The fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight.
In the embodiment of the invention, the topic vector generation model adopts the LF-LDA model, which replaces the Dirichlet multinomial topic-word distribution of the original LDA model with a mixture of two distributions: one is the original Dirichlet multinomial distribution, and the other generates word distributions from latent word features.
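LF-LDA itself is not available in common Python libraries, so the sketch below uses gensim's standard LDA model as a stand-in merely to show how per-document topic vectors of the kind used in step 404 can be obtained; substituting an LF-LDA implementation would change only the model-fitting call.

import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def topic_vectors(segmented_docs, num_topics=20):
    dictionary = corpora.Dictionary(segmented_docs)
    corpus = [dictionary.doc2bow(doc) for doc in segmented_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
    vectors = []
    for bow in corpus:
        dist = lda.get_document_topics(bow, minimum_probability=0.0)  # full topic distribution
        vectors.append(np.array([p for _, p in sorted(dist)]))
    return lda, vectors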
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
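As a sketch of this fusion, under the assumption that "weighting by the word frequency vector" means scaling each word's vector by its TF-IDF weight before averaging, and that the topic vector is scaled by the document's aggregate TF-IDF weight, the comprehensive feature vector can be formed as follows; the exact weighting scheme is a modeling choice the publication leaves open.

import numpy as np

def fuse_features(tokens, tfidf_weights, word_model, topic_vec):
    """tokens: segmented news text; tfidf_weights: dict term -> TF-IDF weight;
    word_model: trained word2vec model; topic_vec: topic distribution for this news item."""
    # TF-IDF-weighted average of word vectors (word vector as reference, TF-IDF as weight)
    pairs = [(word_model.wv[t], tfidf_weights.get(t, 0.0))
             for t in tokens if t in word_model.wv]
    if pairs:
        weighted = sum(w * v for v, w in pairs) / (sum(w for _, w in pairs) or 1.0)
    else:
        weighted = np.zeros(word_model.vector_size)
    # Topic vector scaled by the document's total TF-IDF mass (one possible reading of the weighting)
    topic_part = topic_vec * sum(tfidf_weights.get(t, 0.0) for t in set(tokens))
    # Concatenate the two fused parts into the comprehensive feature vector
    return np.concatenate([weighted, topic_part])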
The classification process described in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished.
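The procedure of steps 601 to 605 is essentially a k-means-style loop over a custom similarity; the sketch below follows that structure directly. The stopping rule is reduced to "few assignment changes" and the centres are updated as plain means, both of which are simplifying assumptions (the publication reselects centres via pairwise intra-category similarity and also allows a similarity-threshold stop).

import numpy as np

def classify(features, T, similarity, max_iter=100, change_threshold=0):
    """features: (num_news, dim) array; similarity(x, y) -> larger means more alike."""
    rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), size=T, replace=False)]   # step 601
    labels = np.zeros(len(features), dtype=int)
    for _ in range(max_iter):
        new_labels = np.array([int(np.argmax([similarity(x, c) for c in centers]))
                               for x in features])                          # steps 602 / 604
        changes = int(np.sum(new_labels != labels))
        labels = new_labels
        if changes <= change_threshold:                                     # step 605 (simplified)
            break
        centers = np.array([features[labels == t].mean(axis=0) if np.any(labels == t)
                            else centers[t] for t in range(T)])             # step 603 (simplified)
    return labels, centers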
The similarity is calculated by the following formula:
where Max(d(X, Y)) represents the maximum Manhattan distance between two vectors, α is a preset adjustable parameter for adjusting the weight between the distance metric and the angle metric, the Manhattan distance is d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
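The formula itself is not reproduced in this text, but the surrounding description (a maximum-normalized Manhattan-distance term and a cosine "angle" term balanced by α) suggests a combined measure of the following form. This is a reconstruction under that assumption, not the literal formula of the publication.

import numpy as np

def combined_similarity(x, y, max_manhattan, alpha=0.5):
    """Hypothetical combined similarity: alpha weighs a Manhattan-distance term
    against a cosine (angle) term, as described around the omitted formula."""
    d = float(np.sum(np.abs(x - y)))                      # Manhattan distance d(X, Y)
    distance_term = 1.0 - d / max_manhattan               # normalized by Max(d(X, Y))
    norm = np.linalg.norm(x) * np.linalg.norm(y)          # Euclidean norms ||X|| * ||Y||
    angle_term = float(np.dot(x, y) / norm) if norm else 0.0
    return alpha * distance_term + (1.0 - alpha) * angle_term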
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the invention, features of the above embodiments or of different embodiments may be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention exist which are not detailed here for brevity.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A network media news classification method based on restrictive user preference, characterized by comprising the following steps:
Step 1, acquiring a user set U and a news set I;
Step 2, preprocessing the news data in the news set;
Step 3, generating a user news preference matrix R according to the preference behavior of the user to news;
Step 4, extracting the characteristics of news data according to the news preference matrix of the user;
Step 5, calculating a feature vector of the news to be classified;
Step 6, classifying the news according to the feature vectors.
2. The method of claim 1, wherein the preprocessing comprises Chinese word segmentation according to a predetermined rule; deleting the connecting words according to the part of speech; and deleting the irrelevant words according to the word frequency.
3. The method of claim 2, wherein the preference behavior of the user toward news in step 3 is the behavior of the user browsing or clicking the corresponding news, and the user news preference matrix R is M × N, where M is the number of users, N is the size of the news space, and each element r_ij of R is a positive integer representing the number of preference behaviors of the i-th user toward the j-th news item.
4. The method for classifying network media news according to claim 3, wherein step 4 comprises the following steps:
Step 401, binarizing (taking Boolean values of) the user news preference matrix according to a preset threshold: if r_ij is greater than or equal to the threshold, then r_ij = 1; otherwise r_ij = 0;
Step 402, performing feature extraction on the news data for which r_ij = 1, and acquiring word frequency vector features;
step 403, establishing a word vector generation model, training the model, and obtaining word vector characteristics of the news data;
And step 404, establishing a theme generation model, training the model, and acquiring theme vector characteristics of the news data.
5. The method for classifying news as claimed in claim 4, wherein in step 5, the word vector features, word frequency vector features and topic vector features of the news to be classified are fused to obtain a comprehensive feature vector for each news item.
6. The method of claim 5, wherein the word feature vector is a word2vec word vector, the word frequency vector features are word frequency and inverse document frequency feature vectors, and the topic vector is an improved latent Dirichlet allocation model topic vector;
The fusion of the word vector and the word frequency vector is weighted fusion, the word vector is taken as a reference, and the word frequency vector is taken as a weight; the fusion of the topic vector and the word frequency vector is weighted fusion, the topic vector is a reference, and the word frequency vector is a weight;
And concatenating the word vector after weighted fusion and the topic vector after weighted fusion to obtain a comprehensive feature vector of the news to be classified.
7. The method for news classification of network media according to claim 6, wherein the classification process in step 6 comprises the following steps:
Step 601, randomly determining the centers of T categories according to a preset category number T;
Step 602, calculating the similarity of each news item to the T centers, and assigning every news item to a category according to the similarity;
Step 603, comparing the pairwise similarity of the news items within each category, and reselecting the center of each category;
Step 604, re-assigning all news items to categories according to the newly selected centers;
Step 605, repeating steps 603 and 604 until the number of category changes across all news items is smaller than a preset change threshold, or the intra-category similarity is smaller than a preset similarity threshold; the iteration then stops and the classification process is finished.
8. The method of claim 7, wherein the similarity is calculated by the following formula:
Where Max(d(X, Y)) represents the maximum Manhattan distance of two feature vectors, the Manhattan distance is expressed as d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|, X and Y are two feature vectors, x1, x2, ..., xp are the attribute values of the vector X and y1, y2, ..., yp are the attribute values of the vector Y (each vector has p attributes), α is an adjustable parameter, and ||X|| and ||Y|| are the Euclidean norms of X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yp).
CN201910821597.9A 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference Pending CN110569351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910821597.9A CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910821597.9A CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Publications (1)

Publication Number Publication Date
CN110569351A true CN110569351A (en) 2019-12-13

Family

ID=68777273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910821597.9A Pending CN110569351A (en) 2019-09-02 2019-09-02 Network media news classification method based on restrictive user preference

Country Status (1)

Country Link
CN (1) CN110569351A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611379A (en) * 2020-05-18 2020-09-01 深圳证券信息有限公司 Text information classification method, device, equipment and readable storage medium
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770520A (en) * 2010-03-05 2010-07-07 南京邮电大学 User interest modeling method based on user browsing behavior
CN101853250A (en) * 2009-04-03 2010-10-06 华为技术有限公司 Method and device for classifying documents
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853250A (en) * 2009-04-03 2010-10-06 华为技术有限公司 Method and device for classifying documents
CN101770520A (en) * 2010-03-05 2010-07-07 南京邮电大学 User interest modeling method based on user browsing behavior
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908669A (en) * 2017-10-17 2018-04-13 广东广业开元科技有限公司 A kind of big data news based on parallel LDA recommends method, system and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611379A (en) * 2020-05-18 2020-09-01 深圳证券信息有限公司 Text information classification method, device, equipment and readable storage medium
CN112836010A (en) * 2020-10-22 2021-05-25 长城计算机软件与系统有限公司 Patent retrieval method, storage medium and device
CN112836010B (en) * 2020-10-22 2024-04-05 新长城科技有限公司 Retrieval method, storage medium and device for patent
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium

Similar Documents

Publication Publication Date Title
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108763362B (en) Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection
CN106156204B (en) Text label extraction method and device
CN110232149B (en) Hot event detection method and system
WO2023029420A1 (en) Power user appeal screening method and system, electronic device, and storage medium
JP2012027845A (en) Information processor, relevant sentence providing method, and program
Zhou et al. Joint image and text representation for aesthetics analysis
CN112749341B (en) Important public opinion recommendation method, readable storage medium and data processing device
CN110569351A (en) Network media news classification method based on restrictive user preference
Lavanya et al. Twitter sentiment analysis using multi-class SVM
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN109062895B (en) Intelligent semantic processing method
CN112949713B (en) Text emotion classification method based on complex network integrated learning
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
CN110866102A (en) Search processing method
CN114742071B (en) Cross-language ideas object recognition analysis method based on graph neural network
CN116501875A (en) Document processing method and system based on natural language and knowledge graph
Osanyin et al. A review on web page classification
Villegas et al. Vector-based word representations for sentiment analysis: a comparative study
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN116881451A (en) Text classification method based on machine learning
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
CN113780832B (en) Public opinion text scoring method, public opinion text scoring device, computer equipment and storage medium
Gong A personalized recommendation method for short drama videos based on external index features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20191213