CN111695020A

CN111695020A - Hadoop platform-based information recommendation method and system

Info

Publication number: CN111695020A
Application number: CN202010542277.2A
Authority: CN
Inventors: 张梓光; 肖明; 张小芳; 许宋硕; 周敏; 鲁虎
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-09-22

Abstract

The application relates to an information recommendation method and system based on a Hadoop platform, which mainly comprises the following steps: (1) acquiring text information and publisher information, denoising the text information, and storing it in an HDFS system; (2) using MapReduce to generate key-value pair lists for the text information and publisher information stored in the HDFS system; (3) performing topic modeling on the key-value pair lists with an LDA topic model; (4) clustering the text information and completing information recommendation according to the clustering result. The text information to be recommended is preliminarily filtered by exploiting Hadoop's distributed storage, and a mapping relation between information publishers and text information is established. The text information is then filtered a second time by combining the Hadoop platform with the LDA topic model, so that the topics of the text information can be finely extracted, the recommendation system's access rate to the text information and the accuracy of text information queries before recommendation are improved, and the effectiveness and accuracy of information recommendation are further ensured.

Description

Hadoop platform-based information recommendation method and system

Technical Field

The invention belongs to the field of data mining, and particularly relates to an information recommendation method and system based on a Hadoop platform.

Background

With the development of internet technology, more and more users browse news online or on mobile devices, and news applications have become one of the most popular internet applications, ranking just below online music. However, the huge volume of online news causes information overload, so helping users filter news or recommending useful news to them is an important research topic. A large user base involves tens of millions of follow relationships and published articles, and the interaction and reading behaviors among users can reach the billions. As the number of users, the number of published articles, and similar data grow sharply, conventional recommendation models and processing methods show the following defects: the accuracy of text data processing decreases; the performance of topic mining and information recommendation is insufficient; and the problem of sparse user data is not well solved. Because of these defects, existing recommendation models and processing methods cannot meet users' recommendation needs, which hinders the popularization of news application platforms and in turn lowers user satisfaction.

Disclosure of Invention

In view of this, the invention provides an information recommendation method and system based on a Hadoop platform, in which information is preliminarily filtered before classified recommendation by exploiting the distributed data processing of the Hadoop platform, thereby improving recommendation accuracy and overcoming the defects of the prior art.

The invention relates to an information recommendation method based on a Hadoop platform, which comprises the following steps:

acquiring text information and corresponding publisher information, performing denoising processing on the text information, and storing the denoised text information and the publisher information in an HDFS (Hadoop distributed file system) of a Hadoop platform;

splitting and sorting the text information and publisher information stored in the HDFS system by using the MapReduce computing framework to generate a plurality of key-value pairs that map text information to the corresponding publisher information, and merging the key-value pairs of the same publisher to generate a plurality of key-value pair lists;

performing topic modeling on the key-value pair lists by using an LDA topic model to obtain the topic features of each piece of text information, and clustering the text information according to the modeling result of the LDA topic model;

and recommending information to the user according to the clustering result of the text information.

Preferably, the denoising processing of the text information includes:

and converting the text information into a uniform language.

Preferably, the denoising processing of the text information further includes:

and converting special symbols carried in the text information into characters so as to reserve the emotional characteristics of the text information.

Preferably, the denoising processing of the text information further includes:

and performing word segmentation on the text information by using the ICTCLAS word segmentation system.

Preferably, the denoising processing of the text information further includes:

stop words in the text information are removed to reduce the storage space of the text information in the HDFS system.

Preferably, clustering the text information comprises:

and calculating the similarity of the text information by utilizing the cosine similarity, and clustering the text information according to the calculation result of the similarity.

Preferably, calculating the text information similarity includes:

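the text information is reduced to space vectors by using a vector space model (VSM), and the cosine similarity of the text information is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i and B_i respectively denote the i-th components of the VSM-based space vectors of the two pieces of text information participating in the similarity calculation, and n is the dimension of the vectors.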

Preferably, the recommending information to the user according to the clustering result of the text information includes:

and calculating the similarity of the candidate text information and the reading history of the user and/or the score of the text information according to the clustering result of the text information, generating a list to be recommended, and indexing the list to be recommended to complete information recommendation of the user.

Preferably, the acquiring the text information and the corresponding publisher information comprises:

and simulating user login, downloading any page URL, performing page analysis to obtain publisher information, and obtaining the published text information according to the publisher information.

In another aspect, the present invention provides an information recommendation system based on a Hadoop platform, including:

the information acquisition module is used for acquiring the text information and the corresponding publisher information;

the information storage module runs an HDFS (Hadoop distributed file system) system with a Hadoop computing framework to store the text information and the publisher information which are subjected to denoising processing;

the key-value pair generation module runs the MapReduce computing framework to split and sort the text information and publisher information stored in the HDFS system, generates a plurality of key-value pairs that map text information to the corresponding publisher information, and merges the key-value pairs of the same publisher to generate a plurality of key-value pair lists;

the text information topic modeling module is used for carrying out topic modeling on the key value pair list obtained in the key value pair generating module by utilizing an LDA topic model to obtain the topic characteristics of each piece of text information;

the text information clustering module is used for clustering the text information according to the modeling result of the LDA topic model and recommending information to the user;

and the recommending module is used for recommending information to the user according to the clustering result of the text information.

According to the technical scheme, the invention has the following beneficial effects:

according to the information recommendation method and system based on the Hadoop platform, the text information to be recommended is preliminarily filtered by utilizing the characteristics of Hadoop distributed storage information, and the mapping relation between an information publisher and the text information is established, so that compared with the prior art that text mining is directly carried out on the text information by utilizing an LDA topic model, the method and system have higher accuracy; the secondary filtering of the text information by combining the Hadoop platform and the LDA topic model can realize refined extraction of the topic of the text information, improve the access rate of the recommendation system to the text information and the accuracy of text information query before recommendation, and further guarantee the effectiveness and accuracy of information recommendation.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a block diagram of an information recommendation system based on a Hadoop platform according to an embodiment of the present invention.

FIG. 2 is a flow chart of an information recommendation method based on a Hadoop platform according to an embodiment of the present invention.

FIG. 3 is a flow chart of an implementation of a microblog news recommendation method based on a Hadoop platform according to another embodiment of the present invention.

FIG. 4 is a schematic diagram of a MapReduce engine according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

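Referring to FIG. 1 and FIG. 2, the present embodiment provides an information recommendation system based on a Hadoop platform, including an information acquisition module, an information storage module, a key-value pair generation module, a text information topic modeling module, a text information clustering module, and a recommendation module.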

When the recommendation system of this embodiment recommends information to a user, text information and the corresponding publisher information are acquired, the text information is denoised, and the denoised text information and the publisher information are stored in the HDFS system of the Hadoop platform. The text information and publisher information stored in the HDFS system are split and sorted by using the MapReduce computing framework to generate a plurality of key-value pairs that map text information to the corresponding publisher information, and the key-value pairs of the same publisher are merged to generate a plurality of key-value pair lists. Topic modeling is performed on the key-value pair lists by using an LDA topic model to obtain the topic features of each piece of text information, and the text information is clustered according to the modeling result of the LDA topic model. Finally, information is recommended to the user according to the clustering result of the text information.

In a further embodiment, the recommendation system may further include a text information denoising module (not shown in the figure) for performing denoising processing on the text information; the denoising processing module can also be integrated in the information acquisition module, and the information acquisition module can process and store the text information in the information storage module of the HDFS system operating with the Hadoop calculation framework.

In a further embodiment, the information storage module and the key-value pair generation module may be integrated in the same processing unit, and a complete Hadoop platform has been constructed in the processing unit, which includes an HDFS system and a MapReduce engine, and is configured to perform distributed storage and mapping relationship establishment on text information, and finally obtain a multiple key-value pair list merged according to publisher information.

The modules may be implemented by software codes, and in this case, the modules may be stored in a memory provided at a control end such as a control computer. The above modules may also be implemented by hardware, such as an integrated circuit chip.

Another embodiment of the present invention is described below with reference to FIG. 3. This embodiment is a personalized microblog news recommendation method based on a Hadoop platform; by using the Hadoop platform to perform more precise topic mining on massive microblog text data, it achieves more personalized and more accurate news recommendation for users.

Acquiring microblog text information differs from acquiring ordinary web pages. Because of the microblog platform's anti-crawler mechanism, this embodiment adopts a Python-based crawler scheme that gains permission to acquire information by simulating a user login to obtain authorization from the microblog platform, comprising the following steps:

pre-logging in with the user name encoded in base64 to obtain the parameters servertime, nonce, pubkey, and rsakv, where servertime is the server time, nonce is a random string generated by the server, pubkey is the public key used by the client for RSA encryption, and rsakv is a value in the headers used for login;

encrypting the password with RSA and constructing the form data to imitate a login request;

and acquiring a login jump link and acquiring cookie information.
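The pre-login step can be illustrated with a minimal Python sketch. It only covers encoding the user name in base64 and picking the four named parameters out of a response. The JSON response format, the exact key spellings, and the helper names `encode_username` and `parse_prelogin` are assumptions made for illustration; they are not taken from the patent or from any platform documentation, and the RSA step is deliberately left out.

```python
import base64
import json


def encode_username(username: str) -> str:
    """Base64-encode the user name for the pre-login request."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def parse_prelogin(response_text: str) -> dict:
    """Extract the four pre-login parameters named in the text.

    Assumption: the server answers with a JSON object that uses
    these key names. A real response may be wrapped differently.
    """
    data = json.loads(response_text)
    wanted = ("servertime", "nonce", "pubkey", "rsakv")
    missing = [key for key in wanted if key not in data]
    if missing:
        raise KeyError(f"pre-login response lacks: {missing}")
    return {key: data[key] for key in wanted}


# Hypothetical sample response; real values come from the server.
sample = (
    '{"servertime": 1592200000, "nonce": "ABC123", '
    '"pubkey": "00ff", "rsakv": "1", "extra": 0}'
)
params = parse_prelogin(sample)
encoded_user = encode_username("alice")
```

The returned dictionary would then feed the later steps (RSA encryption of the password and construction of the login form), which depend on details the text does not specify.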

After a successful login, a URL is selected from the page URLs that have not yet been acquired, the page is entered and downloaded, and the list of acquired URLs is updated once the download completes. After the downloaded page has been stored, page parsing is performed on it, that is, the microblog information in the downloaded page is extracted. First, a microblog page is used as the seed URL; the comment relations and news comment content (including user IDs) of a news microblog can be obtained from the seed URL, and then the personal information, microblog text information, and so on from each user's personal homepage are crawled according to the user ID.
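The crawl loop described above (pick an unacquired URL, download, update the acquired list, store, parse, follow links) can be sketched as a breadth-first traversal. This is a simplified illustration, not the embodiment's crawler: the `fetch` and `extract_links` callables are hypothetical stand-ins for real HTTP download and page parsing, and the in-memory `fake_web` dictionary replaces the microblog site.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(seed_url: str,
          fetch: Callable[[str], str],
          extract_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl that downloads each URL at most once."""
    pending = deque([seed_url])   # URLs that have not been acquired yet
    acquired = set()              # the list of already acquired URLs
    pages = {}                    # url -> downloaded page content
    while pending and len(pages) < max_pages:
        url = pending.popleft()
        if url in acquired:
            continue
        page = fetch(url)         # download the page
        acquired.add(url)         # update the acquired list after download
        pages[url] = page         # store the page before parsing it
        for link in extract_links(page):   # page parsing step
            if link not in acquired:
                pending.append(link)
    return pages


# Toy in-memory "web" standing in for real HTTP requests.
fake_web = {
    "seed": "user/1 user/2",
    "user/1": "user/2",
    "user/2": "seed",
}
pages = crawl("seed", fetch=fake_web.__getitem__, extract_links=str.split)
```

In this toy run the seed page plays the role of the news microblog, and the `user/...` entries play the role of personal homepages reached through the user IDs found in the comments.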

The microblog text information needs to be cleaned so that only effective text content is retained, which comprises the following steps:

unlike ordinary text information, microblog text information contains many special text elements. To improve the data access rate and the accuracy of text mining, special text elements that are irrelevant to the meaning of the text information, such as '@' characters, topic labels '###', and URL links displayed in the text information, need to be removed, which also reduces the workload of subsequent computation;

to facilitate text mining, the language of the text information is uniformly converted into simplified Chinese in this embodiment;

in addition, microblog text information carries many emoticons, which reflect the emotional characteristics of the text information to a certain extent, so the emoticons are converted into words and retained as part of the text information to restore its content more accurately;

this embodiment also performs word segmentation and stop-word removal on the text information. The word segmentation system ICTCLAS is used to perform Chinese word segmentation, part-of-speech tagging, and unknown-word recognition on the text information. Stop words are words that occur frequently in the text but carry no actual meaning, such as modal particles; removing them effectively frees storage space and improves the mining capability and clustering efficiency of the subsequent LDA topic model, further improving recommendation accuracy.
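The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions: whitespace splitting stands in for ICTCLAS word segmentation, the `EMOTICON_TO_TEXT` and `STOP_WORDS` tables are hypothetical toy data, an English sample replaces real microblog text, and a whole `#...#` topic label is dropped, which is one possible reading of "removing topic labels".

```python
import re

# Hypothetical lookup tables for illustration only.
EMOTICON_TO_TEXT = {"[smile]": "happy", "[cry]": "sad"}
STOP_WORDS = {"the", "a", "is", "oh", "at"}

URL_RE = re.compile(r"https?://\S+")   # URL links shown in the text
MENTION_RE = re.compile(r"@\w+")       # '@' mentions
TOPIC_RE = re.compile(r"#[^#]*#")      # topic labels of the form #...#


def clean_post(text: str) -> list:
    """Return the cleaned token list of one microblog post."""
    # Remove URLs first so that '#' or '@' inside a link is not misread.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = TOPIC_RE.sub(" ", text)
    # Keep the emotional signal by turning emoticons into words.
    for symbol, word in EMOTICON_TO_TEXT.items():
        text = text.replace(symbol, f" {word} ")
    # Stand-in for word segmentation: split on whitespace.
    tokens = text.lower().split()
    # Drop stop words to shrink what has to be stored and modeled.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = clean_post("@bob the match is great [smile] #sports# http://t.cn/abc")
```

For Chinese text the whitespace split would be replaced by a real segmenter, and the conversion to simplified Chinese would run before segmentation.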

The text information that has passed through this series of data cleaning steps is stored, together with its publisher information, in an HDFS system. The HDFS system has one NameNode and a plurality of DataNodes. The NameNode is responsible for locating where text information is stored, naming incoming text information, and allocating it to the DataNodes, where it is finally stored. The main responsibility of a DataNode is to respond in real time to data access commands from the NameNode and to store or retrieve text information in real time; the NameNode and DataNodes maintain real-time information exchange through a heartbeat mechanism. While text information is being written, HDFS also keeps multiple backup copies of it and stores the microblog information in blocks with a default block size of 64 MB, which improves the safety, accuracy, and access efficiency of the data and facilitates subsequent MapReduce data processing.
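As a rough worked example of block storage, the following sketch computes how many 64 MB blocks a file occupies and how much raw storage its copies consume. The replication factor of 3 is an assumption (the text only says that multiple backups are kept), and the function names are illustrative.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size, as stated in the text


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed; the last block may be only partly filled."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return -(-file_size_bytes // block_size)


def stored_bytes(file_size_bytes: int, replication: int = 3) -> int:
    """Raw bytes consumed across the cluster (replication is an assumption)."""
    return file_size_bytes * replication


size = 200 * 1024 * 1024          # a 200 MB batch of cleaned microblog text
blocks = block_count(size)        # 3 full blocks plus 1 partial block
raw = stored_bytes(size)          # total bytes including all copies
```

Because blocks are the unit that Map tasks are scheduled on, the block count also gives a first estimate of how many input splits a MapReduce job over this file would see.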

As shown in fig. 4, MapReduce mainly accomplishes the following work:

A text information processing request is received and a processing instruction is sent to the JobClient node. Under this instruction, the application configuration parameters are packaged into jar files, the jar files are stored in HDFS (the Hadoop distributed file system), and the storage path of the text information is submitted to the JobTracker node. The JobTracker node creates the tasks, namely MapTasks and ReduceTasks, distributes them to the TaskTracker services for execution, monitors each task, and re-runs any task found to have failed. The TaskTracker subdivides the text information preprocessing task and invokes a plurality of Map tasks; at this point the unordered text information in the HDFS system is split and sorted, and a plurality of key-value pairs <user u, information v> are generated, each representing the one-to-one mapping between a microblog user and a piece of text information that user published. When the Map component has finished splitting and serializing the data, the Shuffle component merges the resulting key-value pairs <user u, information v>. Merging is keyed on the microblog publisher's user name, so the key-value pairs of microblog information from the same user are merged into one large key-value pair list. The output of the Map process then becomes the input of Reduce, which performs further aggregation and optimization on the microblog information key-value pair lists and finally outputs the processed text information key-value pairs.
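The Map, Shuffle, and Reduce stages can be imitated in a single Python process to show how the data changes shape. This is a sketch, not Hadoop code: the tab-separated input format is an assumption, and because the text does not say what the Reduce-side "aggregation optimization" consists of, de-duplication is used here purely as a placeholder.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit one <user, information> pair per raw record 'user<TAB>text'."""
    for line in records:
        user, _, text = line.partition("\t")
        if user and text:
            yield user.strip(), text.strip()


def shuffle_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Sort the pairs and merge those that share the same publisher."""
    grouped = defaultdict(list)
    for user, text in sorted(pairs):
        grouped[user].append(text)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Placeholder aggregation: drop duplicate posts per publisher."""
    return {user: sorted(set(texts)) for user, texts in grouped.items()}


records = ["u1\tpost a", "u2\tpost b", "u1\tpost c", "u1\tpost a"]
result = reduce_phase(shuffle_phase(map_phase(records)))
```

On a real cluster each phase runs in parallel over HDFS blocks, but the per-publisher lists produced at the end have the same form as `result`.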

Throughout this stage, MapReduce's processing of the microblog text information relies on dynamic real-time interaction between the NameNode and the DataNodes in the HDFS system.

The LDA topic model is applied to the key-value pair lists to compute the topic distribution of the text information they contain and thereby obtain the topic features of the text. The posterior distributions of the topic distribution and the word distribution in the LDA topic model are estimated with a Gibbs sampling algorithm, yielding estimates of two parameters: the topic distribution θ and the word distribution φ.
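To make the estimation step concrete, here is a compact collapsed Gibbs sampler for LDA in plain Python. It is a textbook-style sketch rather than the embodiment's implementation: the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions, and a production system would use a distributed or library implementation.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Estimate topic distributions (theta) and word distributions (phi)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    V, K = len(vocab), num_topics
    ndk = [[0] * K for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(K)]    # topic-word counts
    nk = [0] * K                         # total tokens assigned to each topic
    z = []                               # topic assignment of every token
    for d, doc in enumerate(docs):       # random initial assignment
        assignments = []
        for word in doc:
            k = rng.randrange(K)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[word]] += 1
            nk[k] += 1
        z.append(assignments)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v, k = word_id[word], z[d][i]
                # Remove the current token from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab


docs = [["match", "goal", "team"], ["rain", "storm", "wind"], ["goal", "team", "rain"]]
theta, phi, vocab = lda_gibbs(docs, num_topics=2, iters=50)
```

Each row of `theta` is the topic distribution of one document and each row of `phi` is the word distribution of one topic; the rows of `theta` are the topic features that the later clustering step works with.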

After the LDA topic model has traversed all the text information, the similarity of the texts is calculated with cosine similarity in order to cluster the text information. A vector space model (VSM) reduces the semantic similarity of text information to operations on space vectors: each keyword in a piece of text information is compared against a bag of words and assigned a positive real value on that basis, so that each piece of text information forms a multidimensional space vector. The cosine similarity of the text information is calculated as follows:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i and B_i respectively denote the i-th components of the VSM-based space vectors of the two pieces of text information participating in the similarity calculation, and n is the dimension of the vectors.
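The VSM and cosine similarity computation described above can be sketched as follows. Term counts are used as the positive real values, which is one common choice and an assumption here; the vocabulary and sample tokens are illustrative.

```python
import math
from collections import Counter


def to_vector(tokens, vocabulary):
    """Map a token list onto the bag-of-words axes (term counts)."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


vocabulary = ["match", "goal", "rain"]
a = to_vector(["match", "goal"], vocabulary)   # [1.0, 1.0, 0.0]
b = to_vector(["match", "rain"], vocabulary)   # [1.0, 0.0, 1.0]
similarity = cosine_similarity(a, b)           # 1 / (sqrt(2) * sqrt(2))
```

The two sample texts share one of their two words, so the similarity comes out at one half; texts with no words in common score zero and identical texts score one.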

Finally, according to the text clustering result, the final similarity between the candidate microblog posts and the user's preferences and/or the scores of the text information is calculated, a Top-K recommendation list of microblog posts is generated in descending order of similarity, and microblog information is recommended to the microblog user in a personalized manner to achieve accurate microblog news recommendation.
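The Top-K step can be sketched as a ranking over precomputed scores. The weighted blend of similarity and rating below is an assumption introduced only to illustrate the "and/or scores" wording; the text does not specify how the two are combined, and the names `top_k_recommendations` and `weight` are hypothetical.

```python
def top_k_recommendations(similarity, ratings=None, k=3, weight=0.5):
    """Rank candidate posts and return the k best as (post_id, score).

    similarity: post_id -> similarity to the user's preferences, in [0, 1]
    ratings:    optional post_id -> normalized score, in [0, 1]
    weight:     share of the final score taken from similarity (assumed)
    """
    final = {}
    for post_id, sim in similarity.items():
        if ratings is not None and post_id in ratings:
            final[post_id] = weight * sim + (1 - weight) * ratings[post_id]
        else:
            final[post_id] = sim
    # Descending by score; ties are broken by post id for determinism.
    ranked = sorted(final.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


similarity = {"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.1}
by_similarity = top_k_recommendations(similarity, k=2)
blended = top_k_recommendations(similarity, {"p1": 0.0, "p2": 0.8}, k=2)
```

With similarity alone the list starts with the posts closest to the user's reading history; once ratings are blended in, a highly rated post can move ahead of a more similar but poorly rated one.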

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An information recommendation method based on a Hadoop platform is characterized by comprising the following steps:

2. The Hadoop platform-based information recommendation method according to claim 1, wherein the denoising processing of the text information comprises:

and converting the text information into a uniform language.

3. The Hadoop platform based information recommendation method as claimed in claim 2, wherein the denoising processing of the text information further comprises:

4. The Hadoop platform based information recommendation method as claimed in claim 3, wherein the denoising processing of the text information further comprises:

5. The Hadoop platform based information recommendation method according to claim 4, wherein the denoising processing of the text information further comprises:

6. The Hadoop platform based information recommendation method according to claim 1, wherein the clustering the text information comprises:

7. The Hadoop platform-based information recommendation method according to claim 6, wherein the calculating the similarity of the text information by using the cosine similarity comprises:

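text information is reduced to space vectors by using a vector space model (VSM), and the cosine similarity of the text information is calculated by the following formula:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$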

8. The Hadoop platform-based information recommendation method according to claim 1, wherein the recommending information to the user according to the clustering result of the text information comprises:

9. The Hadoop platform-based information recommendation method according to claim 1, wherein the acquiring text information and corresponding publisher information comprises:

10. An information recommendation system based on a Hadoop platform is characterized by comprising:

the information storage module is used for operating an HDFS (Hadoop distributed file system) system with a Hadoop computing framework so as to store the text information and the publisher information which are subjected to denoising processing;