CN106202294B - Related news computing method and device based on keyword and topic model fusion - Google Patents

Related news computing method and device based on keyword and topic model fusion

Publication number
CN106202294B
Authority
CN
China
Prior art keywords
news
keywords
keyword
candidate set
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610509723.3A
Other languages
Chinese (zh)
Other versions
CN106202294A (en)
Inventor
陈鑫
彭仁刚
肖峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qizhi Business Consulting Co ltd
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201610509723.3A
Publication of CN106202294A
Application granted
Publication of CN106202294B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a method and a device for computing related news based on the fusion of keywords and a topic model. The method comprises: selecting the URL of a first news item for segmentation; performing keyword extraction and topic model calculation based on the segmented URL, and establishing an inverted index from the resulting keyword-related information, the topic numbers and the known news category; and selecting and sorting a candidate set based on the inverted index, and selecting one or more second news items related to the first news item. The device comprises a segmentation unit, a computing unit, an index establishing unit, a candidate set selecting unit and a sorting unit. The invention fuses news keywords with a topic model to compute the candidate set of related news, and applies different strategies for each news category to perform ranking, CTR feedback and weight adjustment, thereby forming a complete technical scheme for computing related news.

Description

Related news computing method and device based on keyword and topic model fusion
Technical Field
The invention relates to the technical field of information, in particular to a method and a device for calculating related news based on keyword and topic model fusion.
Background
Currently, mainstream news apps and news portal websites usually recommend several related news items at the bottom of a news detail page for the user to read next. The recommended news should be relevant to the current news, novel and effective, so as to increase the time users spend in the product and the click-through rate.
Currently, there are two main methods for computing related news: the first computes related news based on keywords, and the second computes related news based on a topic model of the article content. The keyword-based method calculates several keywords and their weights for each news item and builds an inverted index from the keywords; for the news to be calculated, it queries the keyword inverted index, generates a candidate set of related news according to the number and weights of the matched keywords, and sorts the candidate set by the same criteria. The topic-model-based method calculates the topic distribution and weights of each news item to obtain its topic distribution weight vector; the topic similarity of two news items is obtained by computing the cosine similarity of their vectors pairwise, and a candidate set is generated according to the similarity and sorted in combination with relevant strategies.
Both methods have shortcomings to different degrees. Computing related news with keywords alone suffers from insufficient accuracy and insufficient recall. Computing related news with the topic model alone keeps the results at the topic level only, whereas news in some categories is often developed around particular keywords.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for calculating related news based on the fusion of keywords and topic models, so as to overcome the defects in the prior art that related news is calculated only by using keywords or only by using topic models.
The invention provides a related news computing method based on keyword and topic model fusion, which comprises the following steps:
selecting a URL of first news for segmentation;
extracting keywords and calculating a topic model based on the segmented URL, and establishing an inverted index according to the obtained related information of the keywords, the topic number and the known news category;
selecting and sorting a candidate set based on the inverted index, and selecting one or more second news related to the first news.
Preferably, the keyword extraction process includes:
extracting keywords separately from the text and the title of the first news according to its URL, and calculating the keyword weights from the text keywords and the title keywords; and performing related expansion on each single keyword to obtain expanded keywords.
Preferably, the process of topic model calculation comprises:
calculating the topics of the first news and their weights according to the URL of the first news.
Preferably, the keyword-related information obtained by keyword extraction includes native keywords or expanded keywords, and the process of establishing the inverted index comprises:
generating a plurality of triple keys from the native keywords or the expanded keywords, the topic numbers and the news category, and establishing the inverted index according to the triple keys.
Preferably, the several top-ranked native keywords by weight are taken, together with all the expanded keywords; and the several top-ranked topic numbers by weight are taken.
Preferably, the candidate set is selected by scoring hits against the inverted index.
Preferably, before sorting the candidate set, the method further comprises:
for news of all categories, segmenting the titles into words and filtering out stop words, and calculating the Jaccard similarity of the title terms between every two news items; two news items whose Jaccard similarity is larger than a threshold are regarded as duplicate news and are removed by deduplication filtering.
Preferably, the news in the candidate set is ranked using one or more of the following strategies:
time weight adjustment, place name weight adjustment, picture count weight adjustment, click feedback weight adjustment and CTR estimation weight adjustment.
The invention also provides a related news computing device based on the fusion of the keywords and the topic model, which comprises:
the segmentation unit is used for selecting the URL of the first news for segmentation;
the computing unit is used for extracting keywords and computing a topic model based on the divided URLs;
the index establishing unit is used for establishing an inverted index according to the related information of the key words, the topic numbers and the known news categories obtained by the calculating unit;
a candidate set selecting unit, configured to select a candidate set based on the inverted index;
and the sorting unit is used for sorting the candidate set and selecting one or more second news related to the first news.
Preferably, the computing unit specifically includes:
the first calculation module is used for extracting keywords separately from the text and the title of the first news according to its URL, and calculating the keyword weights from the text keywords and the title keywords;
and the second calculation module is used for calculating the topics of the first news and their weights according to the URL of the first news.
Preferably, the first calculation module is further configured to perform related expansion on each single keyword to obtain expanded keywords.
Preferably, the keyword-related information obtained by keyword extraction includes native keywords or expanded keywords, and the index establishing unit is specifically used for generating a plurality of triple keys from the native keywords or the expanded keywords, the topic numbers and the news category, and establishing the inverted index according to the triple keys.
Preferably, the index establishing unit is further configured to take the several top-ranked native keywords by weight together with all the expanded keywords, and to take the several top-ranked topic numbers by weight.
Preferably, the candidate set selecting unit selects the candidate set by scoring hits against the inverted index.
Preferably, the apparatus further comprises:
and the deduplication unit is used for segmenting the titles of the news in the candidate set into words and filtering out stop words, calculating the Jaccard similarity of the title terms between every two news items, and treating two news items whose Jaccard similarity is larger than a preset threshold as duplicate news to be removed by deduplication filtering.
Preferably, the sorting unit ranks the news in the candidate set using one or more of the following strategies:
time weight adjustment, place name weight adjustment, picture count weight adjustment, click feedback weight adjustment and CTR estimation weight adjustment.
The invention has the beneficial effects that:
the invention is a technical scheme for computing related news that takes multiple dimensions into account. First, the news candidate set is obtained from the fusion of keywords and a topic model, so that relevance and novelty are both considered. Second, according to the news category, various strategies such as time weighting, deduplication, CTR estimation and online click feedback are added to the ranking flow to adapt to the real online environment.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIGS. 2 and 3 are schematic diagrams of an application example of the method according to the embodiment of the invention;
FIG. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description of the Embodiments
The invention is further described with reference to the following figures and detailed description of embodiments.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method according to the embodiment of the present invention will be described in detail with reference to fig. 1 to 3.
In the embodiment of the present invention, for clarity, the current news to be calculated is referred to as first news, and the related news obtained by calculation later is referred to as second news.
FIG. 1 is a schematic flow chart of the method according to the embodiment of the present invention, which may specifically include the following steps:
Step 101: segmenting the received URL of the current news to be calculated;
specifically, news flowing in from Kafka first undergoes news preprocessing, and news categories unsuitable for related-news calculation are filtered out. The resulting first news is divided into a text part and other information (such as title, crawl time and category), which are stored separately in Redis and indexed under the two keys url@uinfo and url@content, with different expiration times (for example, 3 days for the text and 30 days for the other information) to conserve storage space. Kafka is a distributed messaging system that classifies messages by topic when storing them. Redis is an open-source, log-structured key-value database written in ANSI C that supports networking, can run in memory or with persistence, and provides APIs for multiple languages.
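As a non-limiting illustration, the split storage with two expiration times described in step 101 might be sketched as follows. The `TtlStore` class is a hypothetical in-memory stand-in for a key-value store with per-key expiry, not the Redis client itself; the function name `store_news` and the assignment of the text to the `@content` key and the other information to the `@uinfo` key are assumptions made for the example.

```python
import time

DAY = 24 * 60 * 60
CONTENT_TTL = 3 * DAY   # text part: 3 days, as in the example above
INFO_TTL = 30 * DAY     # other information: 30 days


class TtlStore:
    """Hypothetical in-memory stand-in for a key-value store with expiry."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock

    def setex(self, key, ttl_seconds, value):
        # Remember the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        return value


def store_news(store, url, text, info):
    """Store the two parts of one news item under separate keys.

    Assumption: the text goes under url@content and the other
    information (title, crawl time, category) under url@uinfo.
    """
    store.setex(f"{url}@content", CONTENT_TTL, text)
    store.setex(f"{url}@uinfo", INFO_TTL, info)
```

Keeping the bulky text under its own short-lived key lets it expire early while the lightweight metadata stays available for the longer window.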
Step 102: extracting keywords based on the segmented URL to obtain related information of the keywords;
specifically, based on the segmented URL, keywords are extracted from the text and the title and their weights are calculated; the output format is "word#part of speech#weight".
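For illustration only, the "word#part of speech#weight" record format might be handled as in the following sketch. The names `Keyword`, `format_keyword`, `parse_keyword` and `merge_keywords` are hypothetical, and the rule for combining title and text weights (a boosted sum) is an assumption, since the description does not specify how the two are combined.

```python
from typing import NamedTuple


class Keyword(NamedTuple):
    word: str
    pos: str       # part of speech tag
    weight: float


def format_keyword(kw):
    """Serialize a keyword as 'word#part of speech#weight'."""
    return f"{kw.word}#{kw.pos}#{kw.weight:.4f}"


def parse_keyword(record):
    """Parse a 'word#part of speech#weight' record.

    Splitting from the right keeps a '#' inside the word itself intact.
    """
    word, pos, weight = record.rsplit("#", 2)
    return Keyword(word, pos, float(weight))


def merge_keywords(title_weights, text_weights, title_boost=2.0):
    """Combine title and text keyword weights (assumed rule: boosted sum)."""
    merged = dict(text_weights)
    for word, weight in title_weights.items():
        merged[word] = merged.get(word, 0.0) + title_boost * weight
    return merged
```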
As a preferred scheme of the invention, word2vec expansion can be performed on each single keyword. word2vec is a tool that converts words into vector form, so that processing of text content reduces to vector operations in a vector space. It is essentially a matrix factorization model, in which the matrix describes how each keyword relates to the set of words in its context. Factorizing the matrix and keeping only the vector of each keyword in the latent space completes the mapping from word to vector; words related to a keyword, namely the expanded keywords, are then computed from the pairwise similarity between vectors.
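The expansion step can be illustrated with a minimal sketch that assumes the word vectors have already been trained. The toy three-dimensional vectors, the function names and the `top_n` and `min_sim` parameters are all hypothetical; a real system would load vectors produced by a word2vec implementation.

```python
import math

# Hypothetical toy vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "truck": [0.7, 0.6, 0.0],
    "banana": [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_keyword(word, vectors, top_n=2, min_sim=0.5):
    """Return up to top_n words whose vectors are most similar to `word`."""
    if word not in vectors:
        return []
    base = vectors[word]
    scored = [
        (other, cosine(base, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored = [pair for pair in scored if pair[1] >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]
```

With these toy vectors, `expand_keyword("car", TOY_VECTORS)` yields the two nearest words and leaves out the unrelated one.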
Step 103: performing topic model calculation based on the segmented URL to obtain topic numbers;
specifically, based on the segmented URL, the topics to which the news belongs and their weights are calculated, and the content of each topic is represented by a number, namely the topic number. In the embodiment of the invention, the topic model is calculated with the LDA model package on Spark, the number of topics is set to 200, and an LDA model is trained separately for the news of each category and for all news as a whole.
Step 104: establishing an inverted index according to the obtained related information of the keywords, the topic numbers and the known news categories;
specifically, the top several native keywords calculated in step 102 by weight (for example, Top 3), together with all the word2vec-expanded keywords, form the first part; the top several topics calculated in step 103 by weight (for example, Top 3) form the second part; and the known news category forms the third part. The Cartesian product of the three parts generates n triple keys, whose value (val) is the URL. In this way, every URL in the set to be indexed is added to the inverted index; the index is built about once every 10 minutes.
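A minimal sketch of the triple-key construction in step 104 is given below for illustration. The function names, the sample keywords, topic numbers and category are hypothetical; only the Top 3 selection and the Cartesian product follow the description.

```python
from collections import defaultdict
from itertools import product


def top_k(weights, k):
    """Names of the k highest-weighted entries of a {name: weight} dict."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def build_triple_keys(native_keywords, expanded_keywords, topic_weights,
                      category, k=3):
    """Cartesian product of keywords x topic numbers x news category."""
    first = top_k(native_keywords, k)
    # All expanded keywords are kept, skipping ones already present.
    first += [w for w in expanded_keywords if w not in first]
    second = top_k(topic_weights, k)
    return list(product(first, second, [category]))


def index_news(index, url, keys):
    """Add one URL to the inverted index under each of its triple keys."""
    for key in keys:
        index[key].add(url)


index = defaultdict(set)
keys = build_triple_keys(
    {"car": 0.9, "crash": 0.8, "road": 0.3, "rain": 0.1},  # native keywords
    ["automobile"],                                         # expanded keywords
    {17: 0.5, 42: 0.3, 5: 0.1, 99: 0.05},                   # topic number: weight
    "auto",
)
index_news(index, "http://example.com/news/1", keys)
```

Here four keywords (three native plus one expanded) and three topic numbers produce twelve triple keys, each mapping to the same URL.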
Step 105: selecting a candidate set by adopting a method of hitting the inverted index score;
specifically, since each triple key of the inverted index consists of three parts (a native keyword or a word2vec-expanded keyword, a topic number and a news category), the set of hit triple keys can be written as X = {x1, x2, …, xn}.
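The description does not give the exact scoring formula, so the following is only a sketch under the assumption that a candidate's score is the number of triple keys it shares with the first news. The function name `select_candidates` and the `min_hits` and `limit` parameters are hypothetical.

```python
from collections import Counter


def select_candidates(index, query_keys, query_url, min_hits=1, limit=100):
    """Score candidate URLs by how many of the query's triple keys they hit.

    Assumption: the hit score is a plain count of shared triple keys.
    """
    hits = Counter()
    for key in query_keys:
        for url in index.get(key, ()):
            if url != query_url:      # never recommend the news to itself
                hits[url] += 1
    # Highest hit count first; ties broken by URL for a stable order.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(url, n) for url, n in ranked if n >= min_hits][:limit]
```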
It should be noted that establishing the inverted index with triple keys as described above is suitable for some news categories, such as automobile, science and technology, sports, history, international and military. For news categories that are not suited to a triple-key inverted index, a single-category topic model or a global topic model can be used instead. For categories such as domestic, social, economic, health and entertainment, the single-category topic model is used, and the candidate set is selected according to the similarity of the topic distributions of all news within the category. The similarity here is cosine similarity, with the topic probability distribution of each news item regarded as an n-dimensional vector. News whose similarity exceeds a certain threshold is regarded as related news and enters the candidate set. For categories such as info and news, the global topic model is used, and the candidate set is selected according to the topic distribution similarity of all news globally. The calculation is similar to that of the single-category topic model; the difference is that the topic probability distribution comes from the topic model of the global news and the set to be calculated covers news of all categories, so the vector dimensionality and calculation time are too high. Observation of the global topic model shows that the topic probability distribution is usually skewed towards a few topics, with a severe long tail. The long tail is therefore truncated, and only the Top 10 topics of each news item are used for the cosine similarity calculation.
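The truncated topic similarity described above can be sketched as follows, purely as an illustration. Topic distributions are represented as `{topic number: probability}` dictionaries; the function names and the threshold value of 0.8 are assumptions, since the description only says "a certain threshold".

```python
import math


def truncate_top(distribution, k=10):
    """Keep only the k most probable topics of a {topic: probability} dict."""
    ranked = sorted(distribution.items(), key=lambda item: item[1],
                    reverse=True)
    return dict(ranked[:k])


def sparse_cosine(a, b):
    """Cosine similarity of two sparse {topic: probability} vectors."""
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_candidates(query, others, threshold=0.8, k=10):
    """URLs whose truncated topic distribution is close to the query's."""
    q = truncate_top(query, k)
    result = []
    for url, distribution in others.items():
        sim = sparse_cosine(q, truncate_top(distribution, k))
        if sim > threshold:
            result.append((url, sim))
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result
```

Truncating to the top topics keeps each vector short regardless of the total number of topics, which is the point of cutting the long tail.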
Step 106: carrying out duplicate elimination processing on the candidate set;
specifically, in the embodiment of the present invention, there are two deduplication strategies. First, in news categories whose candidate set is generated from the topic distribution similarity score, if the topic distribution similarity score is larger than a certain threshold (which depends on the news category), the news is regarded as duplicate news and removed by deduplication filtering. Second, for news of all categories, the titles are segmented into words and stop words are filtered out, and the Jaccard similarity of the title terms is calculated between every two news items, where A and B denote the term sets of the two titles:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Two news items whose Jaccard similarity is larger than a certain threshold are regarded as duplicate news and removed by deduplication filtering.
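For illustration, the title-based deduplication might look like the sketch below. Whitespace splitting and the small English stop-word list are stand-ins for the word segmentation and stop-word filtering of the description, and the threshold of 0.6 and the function names are assumptions.

```python
# Hypothetical stop-word list; a real system would use a full list.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "on", "to", "and"})


def title_terms(title):
    """Segment a title into terms and drop stop words.

    Whitespace splitting stands in for a real word segmenter.
    """
    return {term for term in title.lower().split() if term not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two term sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedup(candidates, threshold=0.6):
    """Keep each (url, title) unless its title is too close to a kept one."""
    kept = []
    for url, title in candidates:
        terms = title_terms(title)
        if all(jaccard(terms, kept_terms) <= threshold
               for _, _, kept_terms in kept):
            kept.append((url, title, terms))
    return [(url, title) for url, title, _ in kept]
```

The first occurrence of a near-duplicate title is kept and later ones are dropped, so the order of the candidate list determines which copy survives.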
Step 107: sorting the news in the candidate set to select one or more second news which are most related to the first news;
specifically, after the candidate set of related news is obtained, it needs to be ranked according to the ranking strategies, so that the related news ranked first scores highest across the dimensions, and the one or more top-scoring news items are taken as the second news.
The ranking may adopt one or more of the following strategies:
Time weight adjustment: related news currently indexes only news from within the last 7 days, so news can be weighted according to the time t elapsed since it was generated, over a range of 0 to 7 days with second-level precision; the smaller t is, the higher the weight. Two weighting approaches were considered, linear smoothing and accelerated weighting, and the latter is chosen here.
Place name weight adjustment: this addresses a problem caused by using keywords in the inverted index. For a news item such as "a car accident occurred somewhere in Beijing", the extracted keyword is "car accident", which can retrieve regional news such as "a car accident occurred in xx county, xx city". Apart from first-tier cities and some well-known cities, users pay little attention to such regional news, so special weight reduction is needed for such place names.
Picture count weight adjustment: whether a news item is displayed with pictures is also an important factor in whether a user clicks, so the weight is adjusted according to the three list display styles: no picture, single picture and three pictures.
Click feedback weight adjustment: related news with a high CTR is given an appropriately higher weight according to the online impressions and clicks of related news.
CTR estimation weight adjustment: after a certain amount of user behavior has accumulated, a CTR estimation model can be trained from various features of an article and whether the user finally clicked. The model is trained using the cosine similarity between the Top 3 topics of the topic probability distribution and those of the positive samples, and is used online.
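The strategies above could be combined as in the following sketch, which is illustrative only. The description does not define the "accelerated" time weighting or any numeric factors, so the quadratic decay, the picture factors, the place-name penalty and the multiplicative combination are all assumptions, as are the function and parameter names.

```python
WINDOW_SECONDS = 7 * 24 * 60 * 60  # related news covers the last 7 days


def _age_fraction(age_seconds):
    """Age clamped to the 7-day window, scaled to the range 0..1."""
    return min(max(age_seconds, 0), WINDOW_SECONDS) / WINDOW_SECONDS


def linear_time_weight(age_seconds):
    """Linear smoothing: weight falls evenly from 1 to 0 over 7 days."""
    return 1.0 - _age_fraction(age_seconds)


def accelerated_time_weight(age_seconds):
    """Assumed form of accelerated weighting: quadratic decay."""
    return (1.0 - _age_fraction(age_seconds)) ** 2


# Hypothetical factors for the three list display styles.
PICTURE_FACTOR = {0: 0.8, 1: 1.0, 3: 1.1}


def rank_score(base_score, age_seconds, picture_count, ctr,
               minor_place=False):
    """Multiply a base relevance score by the adjustment factors."""
    score = base_score * accelerated_time_weight(age_seconds)
    score *= PICTURE_FACTOR.get(picture_count, 1.0)
    score *= 1.0 + ctr               # click feedback boost
    if minor_place:
        score *= 0.5                 # demote lesser-known place names
    return score


def top_related(candidates, n=3):
    """Return the n highest-scoring candidates.

    Each candidate is a (url, features) pair, where features holds the
    keyword arguments of rank_score.
    """
    ranked = sorted(candidates, key=lambda c: rank_score(**c[1]),
                    reverse=True)
    return [url for url, _ in ranked[:n]]
```

Under the quadratic form, a news item half-way through the window keeps a quarter of its weight rather than half, so older items fall away faster than with linear smoothing.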
It should be noted that, because computing the topic model for news is time-consuming, asynchronous computation is used to separate the calculation of keywords and the topic model, and the establishment of the inverted index, from the main flow of related-news calculation. In addition, so that related news is updated quickly enough, different calculation frequencies are used for news from different time periods.
FIGS. 2 and 3 are schematic diagrams of an application example of the method according to the embodiment of the present invention. The related news calculation method and flow are currently embedded in the news channels of several product lines, such as 360 mobile phone browser, 360 search app, 360 search web and mobile phone guard. As shown in FIG. 2, after the user opens the news detail page titled "all over the world, the recording process of a runner is actually such a child ……", the back-end system performs the calculation described in the embodiment of the present invention to obtain several news items related to that news as a candidate set, then sorts the news in the candidate set and recommends the Top 3 to the user. As shown in FIG. 3, three related news items are recommended to the user, with the titles "various punting shots whose games for a runner are referred to as "fake and secret" bar brother", "bar brother for a runner" and "foretell: the brother of the running bar 4 invites guests in the second period and the grouped book of the book giardia are fiercely attacked".
The apparatus according to the embodiment of the present invention will be described in detail with reference to fig. 4.
FIG. 4 is a schematic structural diagram of a device according to an embodiment of the present invention, which may specifically include: a segmentation unit, a computing unit, an index establishing unit, a candidate set selecting unit, a deduplication unit and a sorting unit, wherein,
the segmentation unit is used for selecting the URL of the first news for segmentation; specifically, news flowing in from Kafka first undergoes news preprocessing, and news categories unsuitable for related-news calculation are filtered out. The segmentation unit then divides the resulting first news into a text part and other information (such as title, crawl time and category), which are stored separately in Redis and indexed under the two keys url@uinfo and url@content, with different expiration times (for example, 3 days for the text and 30 days for the other information) to conserve storage space. Kafka is a distributed messaging system that classifies messages by topic when storing them. Redis is an open-source, log-structured key-value database written in ANSI C that supports networking, can run in memory or with persistence, and provides APIs for multiple languages.
The computing unit is used for extracting keywords and computing a topic model based on the divided URLs;
the calculating unit specifically includes: a first computing module and a second computing module, wherein,
the first calculation module is used for extracting keywords separately from the text and the title of the first news according to its URL, and calculating the keyword weights from the text keywords and the title keywords; specifically, based on the segmented URL, the first calculation module extracts keywords from the text and the title and calculates their weights, and outputs them in the format "word#part of speech#weight".
As a preferred scheme of the invention, the first calculation module can also perform word2vec expansion on each single keyword. word2vec is a tool that converts words into vector form, so that processing of text content reduces to vector operations in a vector space. It is essentially a matrix factorization model, in which the matrix describes how each keyword relates to the set of words in its context. Factorizing the matrix and keeping only the vector of each keyword in the latent space completes the mapping from word to vector; words related to a keyword, namely the expanded keywords, are then computed from the pairwise similarity between vectors.
The second calculation module is used for calculating the topics of the first news and their weights according to the URL of the first news; specifically, based on the segmented URL, the second calculation module calculates the topics to which the news belongs and their weights, and the content of each topic is represented by a number, namely the topic number. In the embodiment of the invention, the topic model is calculated with the LDA model package on Spark, the number of topics is set to 200, and an LDA model is trained separately for the news of each category and for all news as a whole.
The index establishing unit is used for establishing an inverted index according to the keyword-related information, the topic numbers and the known news category obtained by the computing unit; specifically, the top several native keywords calculated by the first calculation module by weight (for example, Top 3), together with all the word2vec-expanded keywords, form the first part; the top several topics calculated by the second calculation module by weight (for example, Top 3) form the second part; and the known news category forms the third part. The index establishing unit takes the Cartesian product of the three parts to generate n triple keys, whose value (val) is the URL.
A candidate set selecting unit, configured to select a candidate set based on the inverted index; specifically, since each triple key of the inverted index consists of three parts (a native keyword or a word2vec-expanded keyword, a topic number and a news category), the set of hit triple keys can be written as X = {x1, x2, …, xn}.
Before the candidate set is sorted, the deduplication unit performs deduplication filtering on the news in the candidate set: it segments the titles of the news into words and filters out stop words, calculates the Jaccard similarity of the title terms between every two news items, and regards two news items whose Jaccard similarity is larger than a predetermined threshold as duplicate news.
A sorting unit for sorting the candidate set to select one or more second news related to the first news; specifically, after the candidate set of related news is obtained, the sorting unit ranks it according to the ranking strategies, so that the related news ranked first scores highest across the dimensions, and the one or more top-scoring news items are taken as the second news.
The ranking may adopt one or more of the following strategies:
Time weight adjustment: related news currently indexes only news from within the last 7 days, so news can be weighted according to the time t elapsed since it was generated, over a range of 0 to 7 days with second-level precision; the smaller t is, the higher the weight. Two weighting approaches were considered, linear smoothing and accelerated weighting, and the latter is chosen here.
Place name weight adjustment: this addresses a problem caused by using keywords in the inverted index. For a news item such as "a car accident occurred somewhere in Beijing", the extracted keyword is "car accident", which can retrieve regional news such as "a car accident occurred in xx county, xx city". Apart from first-tier cities and some well-known cities, users pay little attention to such regional news, so special weight reduction is needed for such place names.
Picture count weight adjustment: whether a news item is displayed with pictures is also an important factor in whether a user clicks, so the weight is adjusted according to the three list display styles: no picture, single picture and three pictures.
Click feedback weight adjustment: related news with a high CTR is given an appropriately higher weight according to the online impressions and clicks of related news.
CTR estimation weight adjustment: after a certain amount of user behavior has accumulated, a CTR estimation model can be trained from various features of an article and whether the user finally clicked. The model is trained using the cosine similarity between the Top 3 topics of the topic probability distribution and those of the positive samples, and is used online.
In summary, the embodiment of the present invention provides a technical solution that calculates related news by considering multiple dimensions. First, the news candidate set is obtained by fusing keywords with a topic model, which overcomes the drawbacks of calculating related news with keywords alone or with a topic model alone, and balances relevance and novelty. Second, according to the news category, strategies such as time weighting, deduplication, CTR estimation and online click feedback are added to the rank flow to suit the real online environment.
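The fused candidate set can be pictured as an inverted index whose keys are (keyword, topic number, news category) triples, so that a candidate must share both a keyword and a topic with the first news within the same category. The following is a minimal sketch under stated assumptions: the class `TripleIndex`, its methods, and the choice of scoring a candidate by the number of triple keys it hits are hypothetical, and the sample keywords and topic numbers are made up.

```python
from collections import defaultdict
from itertools import product


def triple_keys(keywords, topics, category):
    """Build one (keyword, topic number, category) key per keyword/topic pair."""
    return [(kw, topic, category) for kw, topic in product(keywords, topics)]


class TripleIndex:
    """Inverted index from triple keys to the news ids that contain them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, news_id, keywords, topics, category):
        for key in triple_keys(keywords, topics, category):
            self._postings[key].add(news_id)

    def candidates(self, news_id, keywords, topics, category, limit=10):
        """Score each other news by how many triple keys it hits (assumed scoring)."""
        hits = defaultdict(int)
        for key in triple_keys(keywords, topics, category):
            for other in self._postings.get(key, ()):
                if other != news_id:
                    hits[other] += 1
        # Highest hit count first; ties broken by news id for a stable order.
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


index = TripleIndex()
index.add("n1", ["car accident", "Beijing"], [3, 7], "society")
index.add("n2", ["car accident"], [3], "society")
index.add("n3", ["car accident"], [3], "sports")
related = index.candidates("q", ["car accident", "Beijing"], [3, 7], "society")
```

In this example `n3` shares a keyword and a topic with the query but belongs to a different category, so it never enters the candidate set.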
It should be noted that the embodiments and features of the embodiments of the present invention may be arbitrarily combined with each other without conflict.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a network search system according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Moreover, it is noted that instances of the phrase "in one embodiment" are not necessarily all referring to the same embodiment.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.
Moreover, it should also be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (17)

1. A related news computing method based on keyword and topic model fusion, characterized by comprising the following steps:
selecting a URL of a first news, and segmenting the first news into words;
extracting keywords and computing a topic model based on the content of the segmented first news, and establishing an inverted index according to the obtained keyword-related information, the topic numbers and the known news category; wherein the keyword-related information comprises: native keywords or expanded keywords;
selecting a candidate set based on the inverted index and sorting the candidate set, and selecting one or more second news related to the first news.
2. The method of claim 1, wherein the keyword extraction process comprises:
respectively extracting body-text keywords and title keywords from the segmented content of the first news, and calculating the weights of the keywords according to the body-text keywords and the title keywords.
3. The method of claim 1, further comprising:
performing related expansion on a single keyword to obtain an expanded keyword.
4. The method of claim 1, wherein the process of topic model computation comprises:
calculating the topics and the topic weights of the segmented first news according to the content of the segmented first news.
5. The method according to any one of claims 1 to 4, wherein the process of establishing the inverted index comprises:
generating a plurality of triple keys from the native keywords or the expanded keywords, the topic numbers and the news category, and establishing the inverted index according to the triple keys.
6. The method according to any one of claims 1 to 4, characterized in that a plurality of top-ranked native keywords are taken according to weight, and all of the expanded keywords are taken; and a plurality of top-ranked topic numbers are taken according to weight.
7. The method of any one of claims 1 to 4, wherein the candidate set is selected by scoring hits on the inverted index.
8. The method of any one of claims 1 to 4, wherein sorting the candidate set further comprises:
for news of all categories, segmenting the titles and filtering out stop words, and calculating the Jaccard similarity of the title terms between every two news; two news whose Jaccard similarity is greater than a threshold are regarded as duplicate news and are removed by deduplication filtering.
9. The method of any one of claims 1 to 4, wherein one or more of the following strategies are used to rank the news in the candidate set:
time-based weight adjustment, place-name weight adjustment, picture-count weight adjustment, click-feedback weight adjustment and CTR-estimation weight adjustment.
10. A related news computing device based on keyword and topic model fusion, comprising:
a segmentation unit, configured to select a URL of a first news and segment the first news into words;
a computing unit, configured to extract keywords and compute a topic model based on the content of the segmented first news;
an index establishing unit, configured to establish an inverted index according to the keyword-related information, the topic numbers and the known news category obtained by the computing unit; wherein the keyword-related information comprises: native keywords or expanded keywords;
a candidate set selecting unit, configured to select a candidate set based on the inverted index;
a sorting unit, configured to sort the candidate set and select one or more second news related to the first news.
11. The device according to claim 10, wherein the computing unit specifically includes:
a first calculation module, configured to respectively extract body-text keywords and title keywords from the segmented content of the first news and calculate the weights of the keywords according to the body-text keywords and the title keywords;
a second calculation module, configured to calculate the topics and the topic weights of the segmented first news according to the content of the segmented first news.
12. The device of claim 11, wherein the first calculation module is further configured to perform related expansion on a single keyword to obtain an expanded keyword.
13. The device according to any one of claims 10 to 12, wherein the index establishing unit is specifically configured to generate a plurality of triple keys from the native keywords or the expanded keywords, the topic numbers and the news category, and establish the inverted index according to the triple keys.
14. The device according to any one of claims 10 to 12, wherein the index establishing unit is further configured to take a plurality of top-ranked native keywords according to weight and take all of the expanded keywords; and to take a plurality of top-ranked topic numbers according to weight.
15. The device according to any one of claims 10 to 12, wherein the candidate set selecting unit selects the candidate set by scoring hits on the inverted index.
16. The device of any one of claims 10 to 12, further comprising:
a deduplication unit, configured to segment the titles of the news in the candidate set and filter out stop words, calculate the Jaccard similarity of the title terms between every two news, and regard two news whose Jaccard similarity is greater than a preset threshold as duplicate news and remove them by deduplication filtering.
17. The device according to any one of claims 10 to 12, wherein the sorting unit ranks the news in the candidate set using one or more of the following policies:
time-based weight adjustment, place-name weight adjustment, picture-count weight adjustment, click-feedback weight adjustment and CTR-estimation weight adjustment.
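The title-based deduplication recited in claims 8 and 16 can be illustrated with a short sketch. It is offered under assumptions rather than as the claimed implementation: whitespace splitting stands in for a real word segmenter, and the stop-word list, the 0.8 threshold and the function names are hypothetical.

```python
# Hypothetical stop-word list; a production system would use a curated one.
STOP_WORDS = frozenset({"the", "a", "an", "in", "of", "on", "at"})


def title_terms(title, stop_words=STOP_WORDS):
    """Segment a title into terms and filter out stop words.

    Whitespace splitting is a stand-in for a real word segmenter.
    """
    return {term for term in title.lower().split() if term not in stop_words}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two term sets."""
    if not a and not b:
        return 0.0  # two empty titles are not treated as duplicates
    return len(a & b) / len(a | b)


def deduplicate(titles, threshold=0.8):
    """Keep a title only if it is not a near-duplicate of one already kept."""
    kept = []
    kept_terms = []
    for title in titles:
        terms = title_terms(title)
        if any(jaccard(terms, seen) > threshold for seen in kept_terms):
            continue  # regarded as duplicate news and filtered out
        kept.append(title)
        kept_terms.append(terms)
    return kept


titles = [
    "Car accident in Beijing injures three",
    "Car accident in Beijing injures three people",
    "Stock market closes higher",
]
unique_titles = deduplicate(titles)
```

Here the first two titles share 5 of 6 distinct terms, a similarity of about 0.83, so the second is dropped. The pairwise comparison is quadratic in the number of candidates, which is acceptable for a small candidate set.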

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610509723.3A CN106202294B (en) 2016-07-01 2016-07-01 Related news computing method and device based on keyword and topic model fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610509723.3A CN106202294B (en) 2016-07-01 2016-07-01 Related news computing method and device based on keyword and topic model fusion

Publications (2)

Publication Number Publication Date
CN106202294A CN106202294A (en) 2016-12-07
CN106202294B true CN106202294B (en) 2020-09-11

Family

ID=57464512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610509723.3A Active CN106202294B (en) 2016-07-01 2016-07-01 Related news computing method and device based on keyword and topic model fusion

Country Status (1)

Country Link
CN (1) CN106202294B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee after: Beijing Qizhi Business Consulting Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240119

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Beijing Qizhi Business Consulting Co.,Ltd.
