CN106202574A - Evaluation method and device for microblog topic recommendation - Google Patents

Evaluation method and device for microblog topic recommendation

Info

Publication number
CN106202574A
Authority
CN
China
Prior art keywords
topic
microblog
microblogging
text
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610698208.4A
Other languages
Chinese (zh)
Inventor
徐华
李佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610698208.4A
Publication of CN106202574A
Legal status: Pending

Abstract

The invention discloses an evaluation method and device for microblog topic recommendation. The method comprises the following steps: obtaining a plurality of microblog texts from a microblog platform on the Internet; segmenting the microblog texts into words; obtaining the word frequencies of the segmented microblog texts to obtain the microblog content; obtaining unsupervised topic recommendation information from the microblog content; and obtaining an evaluation result from the Euclidean distance between the word vector of the unsupervised topic recommendation information and the word vector of the topic in a preset test set. The evaluation method of the embodiments automatically evaluates the topics recommended by an unsupervised method, so that unsupervised topic recommendation methods for microblog platforms can be assessed fully automatically and their effectiveness determined. This saves manpower and makes the evaluation result more convincing.

Description

Evaluation method and device for microblog topic recommendation
Technical field
The present invention relates to the fields of computer application technology and social network technology, and in particular to an evaluation method and device for microblog topic recommendation.
Background
A personalized recommendation system is a system that, on the basis of a binary relation established between users and information products, uses a user's past choices or the user's similarity to other users to discover items the user may be interested in, and then makes personalized recommendations. A complete recommendation system consists of three parts: a recording module (which collects user behavior information), an analysis module (which analyzes and models the user), and a recommendation algorithm module. Of these, the recommendation algorithm module is the core. According to research by scholars abroad, personalized recommendation algorithms can be broadly divided into collaborative filtering algorithms, content-based algorithms, and hybrids of the two.
Collaborative filtering systems were the first generation of recommendation systems to be proposed and widely used. The core idea of a traditional collaborative filtering system has two parts: first, the users' historical information is used to calculate the similarity between users; then, the ratings that neighbors highly similar to the target user have given to other products are used to predict how much the target user will like a specific product. The system makes recommendations to the target user according to this predicted preference. The method does not depend on descriptive information about the product itself; it relies only on the similarity between users in their ratings of products. Collaborative filtering can therefore discover a target user's latent preferences efficiently, can recommend fresh information and new products, and can recommend products whose content is hard to analyze. Recommendation strategies include, for example: (a) recommending by word co-occurrence with existing tags; (b) recommending according to text features such as title and description; (c) recommending by means of a tag correlation measure.
A content-based recommendation system builds a profile for each user and each product, and creates or updates a user's profile by analyzing the content the user has purchased or browsed. The system compares the similarity between user and product profiles and directly recommends to the user the products most similar to the user's profile. In movie recommendation, for example, a content-based system first analyzes what the movies a user has rated highly have in common (actors, director, genre, and so on) and then recommends other movies whose content is highly similar to those movies. In general, a content-based recommendation system is not constrained by the rating sparsity problem, can recommend newly released products, can uncover hidden information, and can explain why products are recommended by listing features of the recommended content, which gives users a better experience. However, the existing manual evaluation approach has certain shortcomings and needs improvement.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose an evaluation method for microblog topic recommendation that not only saves manpower but also yields more convincing evaluation results.
Another object of the present invention is to propose an evaluation device for microblog topic recommendation.
To achieve the above objects, an embodiment of one aspect of the present invention proposes an evaluation method for microblog topic recommendation, comprising the following steps: obtaining a plurality of microblog texts from a microblog platform on the Internet; segmenting the microblog texts into words; obtaining the word frequencies of the segmented microblog texts to obtain the microblog content; obtaining unsupervised topic recommendation information from the microblog content; and obtaining an evaluation result from the Euclidean distance between the word vector of the unsupervised topic recommendation information and the word vector of the topic in a preset test set.
The evaluation method for microblog topic recommendation of the embodiment of the present invention automatically evaluates the results of topic recommendation methods. It introduces word vectors to replace the existing manual evaluation, which saves manpower and removes the subjectivity of manual evaluation. By automatically evaluating the topics recommended by an unsupervised method, it allows unsupervised topic recommendation methods for microblog platforms to be assessed fully automatically and their effectiveness determined, which saves manpower and makes the evaluation result more convincing.
In addition, the evaluation method for microblog topic recommendation according to the above embodiment of the present invention may have the following additional technical features:
Further, in one embodiment of the present invention, before the microblog texts are segmented, the method also includes: preprocessing the microblog texts to remove useless information, where the useless information includes html tags, URLs and pictures.
Further, in one embodiment of the present invention, obtaining the unsupervised topic recommendation information from the text data further includes: modeling the microblog text content with a preset topic model, so that the top K keywords of the most probable topic are used as the topic recommendation; and obtaining the unsupervised topic recommendation information.
Optionally, in one embodiment of the present invention, the topic model is an LDA (Latent Dirichlet Allocation) topic model, and the topic information of the microblog text content is modeled with the LDA topic model.
Further, in one embodiment of the present invention, the method also includes: training an RNN (recurrent neural network) model to obtain the word vectors of the topics in the test set.
To achieve the above objects, an embodiment of another aspect of the present invention proposes an evaluation device for microblog topic recommendation, including: a first acquisition module for obtaining a plurality of microblog texts from a microblog platform on the Internet; a word segmentation module for segmenting the microblog texts into words; a second acquisition module for obtaining the word frequencies of the segmented microblog texts to obtain the microblog content; a recommendation module for obtaining unsupervised topic recommendation information from the microblog content; and an evaluation module for obtaining an evaluation result from the Euclidean distance between the word vector of the unsupervised topic recommendation information and the word vector of the topic in a preset test set.
The evaluation device for microblog topic recommendation of the embodiment of the present invention automatically evaluates the results of topic recommendation methods. It introduces word vectors to replace the existing manual evaluation, which saves manpower and removes the subjectivity of manual evaluation. By automatically evaluating the topics recommended by an unsupervised method, it allows unsupervised topic recommendation methods for microblog platforms to be assessed fully automatically and their effectiveness determined, which saves manpower and makes the evaluation result more convincing.
In addition, the evaluation device for microblog topic recommendation according to the above embodiment of the present invention may have the following additional technical features:
Further, in one embodiment of the present invention, the device also includes: a preprocessing module for preprocessing the microblog texts to remove useless information, where the useless information includes html tags, URLs and pictures.
Further, in one embodiment of the present invention, the recommendation module includes: a recommendation unit for modeling the microblog text content with a preset topic model, so that the top K keywords of the most probable topic are used as the topic recommendation; and an acquisition unit for obtaining the unsupervised topic recommendation information.
Optionally, in one embodiment of the present invention, the topic model is an LDA model, and the topic information of the microblog text content is modeled with the LDA model.
Further, in one embodiment of the present invention, the device also includes: a training module for training an RNN model to obtain the word vectors of the topics in the test set.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the drawings, in which:
Fig. 1 is a schematic diagram of a neural network in the related art;
Fig. 2 is a flow chart of the evaluation method for microblog topic recommendation according to an embodiment of the present invention;
Fig. 3 is a flow chart of the evaluation method for microblog topic recommendation according to one embodiment of the present invention;
Fig. 4 is a flow chart of the evaluation method for microblog topic recommendation according to one specific embodiment of the present invention;
Fig. 5 is a structural diagram of the evaluation device for microblog topic recommendation according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the recommendation module according to one embodiment of the present invention;
Fig. 7 is a structural diagram of the evaluation device for microblog topic recommendation according to one specific embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements with the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
Before the evaluation method and device for microblog topic recommendation proposed according to the embodiments of the present invention are described, the importance of evaluation is briefly explained.
Microblogging is a new medium that has emerged only in recent years. In 2006, Obvious, the company of Blogger founder Evan Williams, created Twitter, the first microblogging website in the world. At the same time, a number of Twitter-like websites appeared in China, such as Zuosa (做啥), Fanfou (饭否) and Jiwai (叽歪), but most of them were shut down in July 2009 because of technical problems and other reasons. At present, the main domestic microblogging websites are Sina Weibo, Tencent Weibo and Sohu Weibo.
Although microblogging has existed for only a short time, it has developed extremely quickly. A microblog is a platform on which information can be published, shared, spread and obtained conveniently and quickly. Users can publish and receive information through web pages, WAP pages, SMS, instant messaging software and so on, and can communicate at any time with friends, the people they follow and their followers. The information that can be published includes short text, pictures, video and audio clips, hyperlinks and the like. The text is usually limited to 140 characters.
Microblog posts have a length limit and are therefore short. Topic-type microblog posts, moreover, have a clear topic: people write posts to express opinions around it, and as long as a mood or attitude is expressed, a few short words suffice. Compared with ordinary writing, sentences in topic-type microblog posts are therefore short, and simple sentences predominate.
Microblog topics carry on the quotation-style narrative tradition of microblogging: they are brief and concise and get to the point in a single phrase. The label is the core content of a topic. Although many topics have the same content, they are worded differently. Microblog topics are also colloquial, popular and entertaining, which suits the way of thinking and the reading habits of people in the age of shallow reading and makes topics more attractive. By browsing labeled topics, users can quickly grasp what a topic is about, so that more microblog users who are interested in the topic pay attention to it and join the discussion.
Topics enjoy a high degree of participation from internet users and generally resonate with them, because the criterion for setting up a topic is that it arouses the attention and discussion of microblog users.
One feature of topics is their pronounced emotional appeal, and the number of genuine original posts on a topic per hour has become the standard for evaluating how a topic spreads. Because the communication space is relatively private and anonymous comments are relatively free, users may adopt rather extreme and heated forms when expressing their views; and because of the length limit, they cannot express themselves rationally through logically clear argument. In topic-type microblog posts, opinion sentences are therefore often strongly emotional while rational evaluation fades, and highly expressive vulgar vocabulary such as swear words and coarse language appears in large quantities. This has become a common way for opinion sentences in topic-type posts to express emotion and attitude.
In topic-type microblog posts, besides using clearly evaluative words such as "给力" ("awesome") to state a view explicitly, people also use obscure, indirect ways of expressing a view between the lines. For example:
(1) #Cooking oil price rise# May I swear?
This is an interrogative sentence that, taken literally, expresses no attitude at all. In the context of "cooking oil price rise", however, it can be understood as the speaker expressing an urge to vent by swearing, and thus indirectly expressing dissatisfaction with, and a negative attitude toward, the topic.
(2) #Spring Festival rip-offs in Sanya#
What is the local government department trying to achieve by doing this? Deceiving itself? Making things blacker the more it tries to whitewash them? Or making sure tourists never go to Sanya again? They should get their IQ tested! Here "They should get their IQ tested" expresses the speaker's dissatisfaction with the "local departments".
Containing a large amount of non-standard language is another feature of topic-type microblog language. There may be several reasons for this. On the one hand, there are unintentional input errors and errors of common knowledge in writing Chinese characters. On the other hand, because expression on microblogs is free and sensitive subjects are legally restricted, users deliberately add noise, non-standard words, non-standard symbols and non-standard language forms.
Distributed representation was first proposed by Hinton in the 1986 paper "Learning distributed representations of concepts". Although the paper did not say that words should be given distributed representations, this advanced idea planted a seed in people's minds and gradually gained attention after 2000. A distributed representation used to represent a word is usually called a "word representation" or "word embedding", commonly known in Chinese as a "word vector". Every "word vector" mentioned below refers to a word vector in distributed representation. If words are represented with the traditional sparse representation, some tasks (such as building a language model) run into the curse of dimensionality; low-dimensional word vectors avoid this problem. From a practical standpoint, too, applying deep learning to high-dimensional features has an almost unacceptable complexity, so low-dimensional word vectors are much sought after here as well.
Mikolov first used a recurrent neural network as a language model in "Recurrent neural network based language model", published at INTERSPEECH 2010. A recurrent neural network is abbreviated RNN. In the following years Mikolov kept improving RNNLM, in some cases for speed and in others for accuracy.
A recurrent neural network differs considerably in structure from the feedforward networks used in the methods above, but the principle is the same.
As shown in Fig. 1, the left side of the figure is the abstract structure of the network. Because recurrent neural networks are mostly used on time series, the input layer, hidden layer and output layer all carry the index (t). w(t) is the one-hot representation of the t-th word in the sentence; that is, w is a very long vector in which only one element is 1. The vector s(t-1) is the previous hidden layer. The hidden layer is then computed as:
$$s(t) = \mathrm{sigmoid}\bigl(U\,w(t) + W\,s(t-1)\bigr)$$
Further, the right side of the figure shows how the recurrent neural network is unrolled. Each time a new word arrives, it is combined with the previous hidden layer to compute the next hidden layer; the hidden layer is reused in this way and always holds the latest state. Each hidden layer yields an output value through one traditional feedforward layer.
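The recurrence above can be illustrated with a minimal pure-Python sketch. It only shows the hidden-state update s(t) = sigmoid(U w(t) + W s(t-1)); the matrix values, the vocabulary size and the helper names (`rnn_step`, `run_rnn`) are assumptions chosen for illustration and are not taken from the patent, and no training or output layer is shown.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single number."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    """One-hot vector w(t): all zeros except a single 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def rnn_step(U, W, w_t, s_prev):
    """One recurrence step: s(t) = sigmoid(U w(t) + W s(t-1))."""
    from_input = matvec(U, w_t)
    from_state = matvec(W, s_prev)
    return [sigmoid(a + b) for a, b in zip(from_input, from_state)]


def run_rnn(U, W, word_indices, vocab_size):
    """Feed a sentence word by word, reusing the hidden layer each time."""
    s = [0.0] * len(U)  # initial hidden state (assumed to be all zeros)
    for idx in word_indices:
        s = rnn_step(U, W, one_hot(idx, vocab_size), s)
    return s


# Hypothetical toy parameters: vocabulary of 3 words, hidden layer of 2 units.
U = [[0.5, -0.5, 0.0],
     [0.0, 1.0, -1.0]]
W = [[0.1, 0.0],
     [0.0, 0.1]]
final_state = run_rnn(U, W, [0, 2, 1], vocab_size=3)
```

Because the same `s` is passed back in at every step, the final state depends on every word seen so far rather than on a fixed window.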
It will be appreciated that the greatest advantage of a recurrent neural network is that it can make full use of all the preceding context to predict the next word, unlike the earlier work, which can only open a window of n words and predict the next word from the previous n words.
However, existing unsupervised topic recommendation methods are mostly evaluated manually, which wastes manpower and yields unconvincing results.
In view of the above problems, the present invention proposes an evaluation method and device for microblog topic recommendation.
The evaluation method and device for microblog topic recommendation proposed according to the embodiments of the present invention are described below with reference to the drawings, beginning with the evaluation method.
Fig. 2 is a flow chart of the evaluation method for microblog topic recommendation of the embodiment of the present invention.
As shown in Fig. 2, the evaluation method for microblog topic recommendation comprises the following steps:
In step S201, a plurality of microblog texts are obtained from a microblog platform on the Internet.
In one embodiment of the present invention, before the microblog texts are segmented, the method also includes: preprocessing the microblog texts to remove useless information, where the useless information includes html tags, URLs and pictures.
In step S202, the microblog texts are segmented into words.
Specifically, as shown in Fig. 3, crawler technology may first be used to obtain the content of the microblog platform. For example, a crawler program written in Python crawls the news of the portal website and stores it in a back-end MongoDB database. The microblog content may come from Sina Weibo.
Next, the obtained microblog text content is preprocessed. Specifically, the text content is extracted: because the crawler obtains raw data that contains a large amount of junk unrelated to the text, such as html tags, URLs and pictures, this irrelevant content is removed. Chinese word segmentation is then applied to the cleaned text.
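The cleaning step might look like the following sketch, which uses only Python's standard `re` module. The regular expressions and the function name `clean_microblog` are illustrative assumptions; the patent does not specify how tags, URLs and pictures are detected, and a production system would more likely use a proper HTML parser. Chinese word segmentation is not shown.

```python
import re

# Any HTML tag, including <img ...> picture tags (assumed pattern).
TAG_RE = re.compile(r"<[^>]+>")
# http/https URLs made of ASCII URL characters, so that Chinese text
# written directly after a link is not swallowed (assumed pattern).
URL_RE = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_microblog(raw):
    """Remove html tags, pictures and URLs, then normalize whitespace."""
    text = TAG_RE.sub(" ", raw)      # drops tags and embedded pictures
    text = URL_RE.sub(" ", text)     # drops bare links left in the text
    return WHITESPACE_RE.sub(" ", text).strip()


sample = '<p>今天天气好 <img src="http://x.cn/a.png"> http://t.cn/abc 出去玩</p>'
cleaned = clean_microblog(sample)
```

Tags are removed before URLs so that links inside tag attributes disappear together with the tag.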
In step S203, the word frequencies of the segmented microblog texts are obtained to obtain the microblog content.
That is, word frequencies are calculated for the segmented words so that the large number of low-frequency words can be removed.
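A minimal sketch of the word-frequency filter is given below, assuming the posts have already been segmented into token lists. The threshold `min_count`, the optional stop-word set and the function name are assumptions for illustration; the patent does not state a specific cut-off.

```python
from collections import Counter


def filter_by_frequency(tokenized_posts, min_count=2, stopwords=frozenset()):
    """Count tokens over all posts and drop rare tokens and stop words.

    Returns the filtered posts together with the raw frequency table.
    """
    counts = Counter(tok for post in tokenized_posts for tok in post)
    keep = {tok for tok, c in counts.items()
            if c >= min_count and tok not in stopwords}
    filtered = [[tok for tok in post if tok in keep]
                for post in tokenized_posts]
    return filtered, counts


# Hypothetical segmented posts.
posts = [["食用油", "涨价", "了"], ["食用油", "涨价"], ["三亚", "宰客"]]
filtered, counts = filter_by_frequency(posts, min_count=2, stopwords={"了"})
```

The filtered token lists are what a topic model would then consume as the microblog content.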
In step S204, unsupervised topic recommendation information is obtained from the microblog content.
Further, in one embodiment of the present invention, obtaining the unsupervised topic recommendation information from the text data further includes: modeling the microblog text content with a preset topic model, so that the top K keywords of the most probable topic are used as the topic recommendation; and obtaining the unsupervised topic recommendation information.
Optionally, in one embodiment of the present invention, as shown in Fig. 4, the topic model is an LDA topic model, and the topic information of the microblog text content is modeled with the LDA topic model.
Specifically, an unsupervised topic recommendation method based on a topic model is used; it should be noted that the recommendation method is not limited to any particular unsupervised method. The LDA topic model is a probabilistic topic model for modeling discrete data sets and a method for modeling the topic information of text data. By giving each document a brief description that retains the essential statistical information, it helps to process large document collections efficiently. It has a three-layer generative Bayesian network structure and rests on the following assumption: a document consists of several latent topics, and these topics consist of specific words in the text, ignoring the syntactic structure of the document and the order in which words appear.
The topic model can model the microblog content automatically, and the top K keywords of the most probable topic are used as the topic recommendation.
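The selection step can be sketched as follows, assuming an already-trained topic model has produced a probability for each topic and a word distribution for each topic. The input structures, the numbers and the function name `recommend_topic_keywords` are hypothetical; fitting the LDA model itself is not shown.

```python
def recommend_topic_keywords(topic_probs, topic_word_probs, k=3):
    """Pick the most probable topic and return its top-k keywords.

    topic_probs: list of probabilities, one per topic.
    topic_word_probs: list of dicts mapping word -> probability, one per topic.
    """
    best_topic = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    ranked = sorted(topic_word_probs[best_topic].items(),
                    key=lambda item: item[1], reverse=True)
    return best_topic, [word for word, _ in ranked[:k]]


# Hypothetical output of a trained topic model with three topics.
topic_probs = [0.2, 0.7, 0.1]
topic_word_probs = [
    {"天气": 0.6, "下雨": 0.4},
    {"食用油": 0.5, "涨价": 0.3, "超市": 0.15, "价格": 0.05},
    {"三亚": 0.9, "春节": 0.1},
]
best, keywords = recommend_topic_keywords(topic_probs, topic_word_probs, k=3)
```

The returned keywords play the role of the unsupervised topic recommendation that is later compared with the test-set topics.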
In step S205, an evaluation result is obtained from the Euclidean distance between the word vector of the unsupervised topic recommendation information and the word vector of the topic in the preset test set.
In addition, in one embodiment of the present invention, the method also includes: training an RNN model to obtain the word vectors of the topics in the test set.
It will be appreciated that word vectors are obtained by training an RNN model: the word vector of the topic in the test set is computed, then the word vector of the recommended topic is obtained, and the Euclidean distance between the two serves as the evaluation criterion. The smaller the Euclidean distance, the more accurate the recommended topic.
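The evaluation criterion can be sketched as below. The sketch assumes each topic is already represented by a single fixed-length word vector looked up from a pre-trained table; the table contents, the topic strings and the function names are made up for illustration, and how multi-word topics are mapped to one vector is not specified in the patent.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def evaluate_recommendation(recommended, reference, vectors):
    """Score a recommended topic against the test-set topic.

    A smaller value means the recommendation is closer to the true topic.
    """
    return euclidean_distance(vectors[recommended], vectors[reference])


# Hypothetical 2-dimensional word vectors.
vectors = {
    "食用油涨价": [1.0, 2.0],
    "油价": [1.0, 2.5],
    "三亚": [4.0, 6.0],
}
close = evaluate_recommendation("油价", "食用油涨价", vectors)
far = evaluate_recommendation("三亚", "食用油涨价", vectors)
```

Here the recommendation "油价" scores better than "三亚" because its vector lies nearer to the reference topic.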
The evaluation method for microblog topic recommendation according to the embodiment of the present invention automatically evaluates the results of topic recommendation methods. It introduces word vectors to replace the existing manual evaluation, which saves manpower and removes the subjectivity of manual evaluation. By automatically evaluating the topics recommended by an unsupervised method, it allows unsupervised topic recommendation methods for microblog platforms to be assessed fully automatically and their effectiveness determined, which saves manpower and makes the evaluation result more convincing.
The evaluation device for microblog topic recommendation proposed according to the embodiments of the present invention is described next with reference to the drawings.
Fig. 5 is a structural diagram of the evaluation device for microblog topic recommendation of the embodiment of the present invention.
As shown in Fig. 5, the evaluation device 10 for microblog topic recommendation includes: a first acquisition module 100, a word segmentation module 200, a second acquisition module 300, a recommendation module 400 and an evaluation module 500.
The first acquisition module 100 obtains a plurality of microblog texts from a microblog platform on the Internet. The word segmentation module 200 segments the microblog texts into words. The second acquisition module 300 obtains the word frequencies of the segmented microblog texts to obtain the microblog content. The recommendation module 400 obtains unsupervised topic recommendation information from the microblog content. The evaluation module 500 obtains an evaluation result from the Euclidean distance between the word vector of the unsupervised topic recommendation information and the word vector of the topic in the preset test set. The evaluation device 10 of the embodiment of the present invention can evaluate the topics recommended by various unsupervised topic recommendation methods and thereby rank those methods, achieving automatic evaluation.
Further, in one embodiment of the present invention, the evaluation device 10 of the embodiment of the present invention also includes a preprocessing module (not specifically identified in the figure). The preprocessing module preprocesses the microblog texts to remove useless information, where the useless information includes html tags, URLs and pictures.
Further, in one embodiment of the present invention, as shown in Fig. 6, the recommendation module 400 includes a recommendation unit 401 and an acquisition unit 402.
The recommendation unit 401 models the microblog text content with a preset topic model, so that the top K keywords of the most probable topic are used as the topic recommendation. The acquisition unit 402 obtains the unsupervised topic recommendation information.
Optionally, in one embodiment of the present invention, the topic model is an LDA model, and the topic information of the microblog text content is modeled with the LDA model.
Further, in one embodiment of the present invention, the evaluation device 10 of the embodiment of the present invention also includes a training module. The training module trains an RNN model to obtain the word vectors of the topics in the test set.
Specifically, the automatic evaluation of an unsupervised topic recommendation method comprises two stages. The first is the system's automatic preprocessing stage: the obtained microblog text content is preprocessed, which includes removing irrelevant information such as html tags, URLs and pictures; Chinese word segmentation is then performed and word frequencies are calculated; words in the stop-word list and low-frequency words are discarded; and the unsupervised method is used to recommend topics for the test set.
The evaluation device 10 of the embodiment of the present invention then calls the pre-trained word vector model to obtain word vector representations of the topics in the test set and of the recommended topics, and calculates the Euclidean distance between the word vector of each recommended topic and the word vector of the true topic. The smaller the Euclidean distance, the better the method performs.
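Ranking several unsupervised methods by this criterion could be sketched as follows. Averaging the per-topic distances into one score per method is an assumption made here for illustration; the patent only states that a smaller Euclidean distance indicates a better method. The method names and vectors are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def mean_distance(recommended_vecs, true_vecs):
    """Average Euclidean distance over paired recommended/true topic vectors."""
    distances = [math.dist(r, t) for r, t in zip(recommended_vecs, true_vecs)]
    return sum(distances) / len(distances)


def rank_methods(results, true_vecs):
    """Return (method, score) pairs sorted from best (smallest) to worst."""
    scores = {name: mean_distance(vecs, true_vecs)
              for name, vecs in results.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical word vectors of the true topics for two test posts.
true_vecs = [[0.0, 0.0], [1.0, 1.0]]
# Hypothetical word vectors of the topics each method recommended.
results = {
    "method_a": [[3.0, 4.0], [1.0, 1.0]],
    "method_b": [[0.0, 0.0], [1.0, 1.0]],
}
ranking = rank_methods(results, true_vecs)
```

In this toy run `method_b` ranks first because its recommendations coincide with the true topic vectors.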
It is understood that as it is shown in fig. 7, the top-down three big primary layers that are segmented into of whole apparatus for evaluating 10, top Layer is evaluation result and the display module recommending method without supervision topic;Centre is method assessment models;Bottom is data acquisition Module.Wherein, method evaluation result display module is recommended mainly to provide the user with a patterned close friend without supervision topic User interface, browse automatic assessment result to facilitate.Method evaluation module calculates true words mainly by term vector Euclidean distance between topic and recommendation topic.Bottom functional module is mainly data acquisition and storage.
The implementation of the evaluation apparatus 10 for microblog topic recommendation employs core technologies such as web crawling, text data cleaning, recurrent neural network modeling, and topic-model-based topic recommendation. These algorithms and functional modules such as the graphical user interface can be developed in Python, C++ and Java respectively, and support deployment on operating systems based on the Linux kernel. In addition, on this development platform, deploying and running the whole evaluation apparatus 10 requires runtime support at the following levels. First, at the operating-system level, the system can run on platforms based on the Linux kernel. The program runtime support environments are also required, namely Python 2.7, GCC 4.7 or later, and JRE 1.6, and the database is MongoDB. Only with this supporting environment in place can the evaluation system run normally.
The evaluation apparatus for microblog topic recommendation according to the embodiment of the invention can automatically evaluate the results of a topic recommendation method. It introduces word vectors to replace the existing manual evaluation, which saves manpower and eliminates the subjectivity of manual evaluation. By automatically evaluating the recommended topics obtained by an unsupervised method, it achieves fully automatic evaluation of unsupervised topic recommendation methods for microblog platforms and thereby determines the effectiveness of the unsupervised method, which not only saves manpower but also makes the evaluation results more convincing.
In the description of the invention, it is to be understood that the orientations or positional relationships indicated by terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the invention, "a plurality of" means at least two, for example two or three, unless otherwise expressly and specifically limited.
In the invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. Those of ordinary skill in the art can understand the specific meanings of these terms in the invention according to the specific circumstances.
In the invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediary. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second feature.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.
Although embodiments of the invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention.

Claims (10)

1. An evaluation method for microblog topic recommendation, characterized by comprising the following steps:
obtaining a plurality of microblog texts on a microblog platform from the Internet;
performing word segmentation on the plurality of microblog texts;
obtaining the word frequencies of the plurality of microblog texts after word segmentation, so as to obtain microblog content;
obtaining unsupervised topic recommendation information according to the microblog content; and
obtaining an evaluation result according to the Euclidean distance between the word vectors of the unsupervised topic recommendation information and the word vectors of the topics in a preset test set.
2. The evaluation method for microblog topic recommendation according to claim 1, characterized in that, before the word segmentation is performed on the plurality of microblog texts, the method further comprises:
preprocessing the plurality of microblog texts to remove useless information, wherein the useless information includes html tags, URLs and pictures.
3. The evaluation method for microblog topic recommendation according to claim 1, characterized in that obtaining the unsupervised topic recommendation information according to the text data further comprises:
modeling the microblog text content with a preset topic model, so as to recommend the top K keywords of the most probable theme as topics; and
obtaining the unsupervised topic recommendation information.
4. The evaluation method for microblog topic recommendation according to claim 3, characterized in that the topic model is a latent Dirichlet allocation (LDA) topic model, so that the theme information of the microblog text content is modeled by the LDA topic model.
5. The evaluation method for microblog topic recommendation according to claim 1, characterized by further comprising:
training a recurrent neural network (RNN) model to obtain the word vectors of the topics in the test set.
6. An evaluation apparatus for microblog topic recommendation, characterized by comprising:
a first acquisition module, configured to obtain a plurality of microblog texts on a microblog platform from the Internet;
a word segmentation module, configured to perform word segmentation on the plurality of microblog texts;
a second acquisition module, configured to obtain the word frequencies of the plurality of microblog texts after word segmentation, so as to obtain microblog content;
a recommendation module, configured to obtain unsupervised topic recommendation information according to the microblog content; and
an evaluation module, configured to obtain an evaluation result according to the Euclidean distance between the word vectors of the unsupervised topic recommendation information and the word vectors of the topics in a preset test set.
7. The evaluation apparatus for microblog topic recommendation according to claim 6, characterized by further comprising:
a preprocessing module, configured to preprocess the plurality of microblog texts to remove useless information, wherein the useless information includes html tags, URLs and pictures.
8. The evaluation apparatus for microblog topic recommendation according to claim 6, characterized in that the recommendation module comprises:
a recommendation unit, configured to model the microblog text content with a preset topic model, so as to recommend the top K keywords of the most probable theme as topics; and
an acquisition unit, configured to obtain the unsupervised topic recommendation information.
9. The evaluation apparatus for microblog topic recommendation according to claim 8, characterized in that the topic model is an LDA model, so that the theme information of the microblog text content is modeled by the LDA model.
10. The evaluation apparatus for microblog topic recommendation according to claim 6, characterized by further comprising:
a training module, configured to train an RNN model to obtain the word vectors of the topics in the test set.
CN201610698208.4A 2016-08-19 2016-08-19 The appraisal procedure recommended towards microblog topic and device Pending CN106202574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610698208.4A CN106202574A (en) 2016-08-19 2016-08-19 The appraisal procedure recommended towards microblog topic and device

Publications (1)

Publication Number Publication Date
CN106202574A true CN106202574A (en) 2016-12-07

Family

ID=57523303

Country Status (1)

Country Link
CN (1) CN106202574A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942340A (en) * 2014-05-09 2014-07-23 电子科技大学 Microblog user interest recognizing method based on text mining
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104615767A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Searching-ranking model training method and device and search processing method
CN104899188A (en) * 2015-03-11 2015-09-09 浙江大学 Problem similarity calculation method based on subjects and focuses of problems
CN105551485A (en) * 2015-11-30 2016-05-04 讯飞智元信息科技有限公司 Audio file retrieval method and system
CN105868184A (en) * 2016-05-10 2016-08-17 大连理工大学 Chinese name recognition method based on recurrent neural network

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506377A (en) * 2017-07-20 2017-12-22 南开大学 This generation system is painted in interaction based on commending system
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
CN107704512B (en) * 2017-08-31 2021-08-24 平安科技(深圳)有限公司 Financial product recommendation method based on social data, electronic device and medium
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system
CN108536868B (en) * 2018-04-24 2022-04-15 北京慧闻科技(集团)有限公司 Data processing method and device for short text data on social network
CN108536868A (en) * 2018-04-24 2018-09-14 北京慧闻科技发展有限公司 The data processing method of short text data and application on social networks
CN109271491A (en) * 2018-11-02 2019-01-25 合肥工业大学 Cloud service recommendation method based on non-structured text information
CN109271491B (en) * 2018-11-02 2021-09-28 合肥工业大学 Cloud service recommendation method based on unstructured text information
CN110287491A (en) * 2019-06-25 2019-09-27 北京百度网讯科技有限公司 Event name generation method and device
CN110287491B (en) * 2019-06-25 2024-01-12 北京百度网讯科技有限公司 Event name generation method and device
CN111008274A (en) * 2019-12-10 2020-04-14 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN111008274B (en) * 2019-12-10 2021-04-06 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN112784032A (en) * 2021-01-28 2021-05-11 上海明略人工智能(集团)有限公司 Conversation corpus recommendation evaluation method and device, storage medium and electronic equipment
CN113076425B (en) * 2021-04-25 2022-12-20 昆明理工大学 Event related viewpoint sentence classification method for microblog comments
CN113076425A (en) * 2021-04-25 2021-07-06 昆明理工大学 Event related viewpoint sentence classification method for microblog comments
CN113705247A (en) * 2021-10-27 2021-11-26 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161207