Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for calculating related news based on the fusion of keywords and topic models, so as to overcome the defects in the prior art that related news is calculated only by using keywords or only by using topic models.
The invention provides a related news computing method based on keyword and topic model fusion, which comprises the following steps:
selecting a URL of first news for segmentation;
extracting keywords and calculating a topic model based on the segmented URL, and establishing an inverted index according to the obtained related information of the keywords, the topic number and the known news category;
selecting and sorting a candidate set based on the inverted index, and selecting one or more second news related to the first news.
Preferably, the keyword extraction process includes:
extracting text keywords and title keywords respectively from the URL of the first news, and calculating the keyword weights from the text and title keywords; and performing related expansion on a single keyword to obtain expanded keywords.
Preferably, the process of topic model calculation comprises:
calculating the topics of the first news and their weights according to the URL of the first news.
Preferably, the keyword-related information obtained by keyword extraction includes native keywords or expanded keywords, and the process of establishing the inverted index comprises:
generating a plurality of triplet keys from the native keywords or the expanded keywords, the topic numbers and the news categories, and establishing the inverted index according to the triplet keys.
Preferably, the several top-ranked native keywords by weight are taken together with all of the expanded keywords; and the several top-ranked topic numbers by weight are taken.
Preferably, the candidate set is selected by scoring hits against the inverted index.
Preferably, before sorting the candidate set, the method further comprises:
for news of all categories, segmenting the titles into words and filtering out stop words, and calculating the Jaccard similarity of the title terms between every two news items; two news items whose Jaccard similarity is larger than the threshold are regarded as duplicate news and are removed by deduplication filtering.
Preferably, the news in the candidate set is ranked using one or more of the following strategies:
time weight adjustment, place-name weight adjustment, picture-count weight adjustment, click-feedback weight adjustment and CTR-estimation weight adjustment.
The invention also provides a related news computing device based on the fusion of the keywords and the topic model, which comprises:
the segmentation unit is used for selecting the URL of the first news for segmentation;
the computing unit is used for extracting keywords and computing a topic model based on the divided URLs;
the index establishing unit is used for establishing an inverted index according to the related information of the key words, the topic numbers and the known news categories obtained by the calculating unit;
a candidate set selecting unit, configured to select a candidate set based on the inverted index;
and the sorting unit is used for sorting the candidate set and selecting one or more second news related to the first news.
Preferably, the computing unit specifically includes:
the first calculation module is used for respectively extracting text and title keywords from the URL of the first news and calculating the weight of the keywords according to the text and the title keywords;
and the second calculation module is used for calculating the topics of the first news and their weights according to the URL of the first news.
Preferably, the first calculation module is further configured to perform related expansion on the single keyword to obtain an expanded keyword.
Preferably, the keyword-related information obtained by keyword extraction includes native keywords or expanded keywords, and the index establishing unit is specifically used for generating a plurality of triplet keys from the native keywords or the expanded keywords, the topic numbers and the news categories, and establishing the inverted index according to the triplet keys.
Preferably, the index establishing unit is further configured to take a plurality of top-ranked native keywords according to weights and take all the expanded keywords; and taking a plurality of the topic numbers which are ranked at the top according to the weight.
Preferably, the candidate set selecting unit selects the candidate set by scoring hits against the inverted index.
Preferably, the apparatus further comprises:
a deduplication unit, used for segmenting the titles of the news in the candidate set into words and filtering out stop words, calculating the Jaccard similarity of the title terms between every two news items, regarding two news items whose Jaccard similarity is larger than a preset threshold as duplicate news, and performing deduplication filtering.
Preferably, the sorting unit ranks the news in the candidate set using one or more of the following strategies:
time weight adjustment, place-name weight adjustment, picture-count weight adjustment, click-feedback weight adjustment and CTR-estimation weight adjustment.
The invention has the beneficial effects that:
the invention is a technical solution that calculates related news from multiple dimensions. First, the news candidate set is obtained by fusing keywords with a topic model, so that relevance and novelty can both be taken into account; second, various strategies such as time weighting, deduplication, CTR estimation and online click feedback are added to the ranking flow according to the news category, to adapt to the real online environment.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
DETAILED DESCRIPTION OF THE EMBODIMENT(S) OF THE INVENTION
The invention is further described with reference to the following figures and detailed description of embodiments.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method according to the embodiment of the present invention will be described in detail with reference to fig. 1 to 3.
In the embodiment of the present invention, for clarity, the current news to be calculated is referred to as first news, and the related news obtained by calculation later is referred to as second news.
Fig. 1 is a schematic flow chart of the method according to the embodiment of the present invention, which may specifically include the following steps:
Step 101: segmenting a received URL of the current news to be calculated;
Specifically, news flowing in from Kafka first undergoes news preprocessing, in which news categories unsuitable for calculating related news are filtered out. The resulting first news is divided into a text part and other information (such as title, capture time, category and the like), which are stored separately on Redis and indexed by the two keys url@uinfo and url@content. Different expiration times (for example, 3 days for the text and 30 days for the other information) are set to conserve storage space. Kafka is a distributed messaging system that categorizes messages by topic when storing them. Redis is an open-source, log-type, key-value database written in ANSI C that supports network access, can run in memory or persist data, and provides APIs for multiple languages.
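The two-key storage layout with different expiration times can be sketched as follows. This is a minimal illustration only: `TTLStore` is a hypothetical in-memory stand-in for Redis, and the mapping of the body text to `url@content` and of the other information to `url@uinfo` is an assumption based on the key names, as is the JSON string used for the other information.

```python
import time
from typing import Dict, Optional, Tuple

# Expiration times from the example in the text: 3 days for body text,
# 30 days for the other information.
CONTENT_TTL = 3 * 24 * 3600
UINFO_TTL = 30 * 24 * 3600


class TTLStore:
    """Hypothetical in-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        item = self._data.get(key)
        if item is None or item[1] <= now:
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return item[0]


def store_news(store: TTLStore, url: str, body: str, info: str,
               now: Optional[float] = None) -> None:
    """Store the two parts of one news item under separate keys (assumed mapping)."""
    store.set(f"{url}@content", body, CONTENT_TTL, now)
    store.set(f"{url}@uinfo", info, UINFO_TTL, now)
```

Keeping the bulky body text under its own key lets it expire early while the lightweight metadata stays available for the full indexing window.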
Step 102: extracting keywords based on the segmented URL to obtain related information of the keywords;
Specifically, based on the segmented URL, the text and title keywords are extracted and their weights are calculated; the output format is "word#part of speech#weight".
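As an illustration of how such records might be consumed downstream, the sketch below parses the "word#part of speech#weight" format and keeps the highest-weighted keywords. The `Keyword` type, the function names and the assumption that `#` is the delimiter with no surrounding spaces are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Keyword(NamedTuple):
    word: str
    pos: str       # part-of-speech tag
    weight: float


def parse_keyword(record: str) -> Keyword:
    """Parse one 'word#pos#weight' record; rsplit keeps any '#' inside the word."""
    word, pos, weight = record.rsplit("#", 2)
    return Keyword(word.strip(), pos.strip(), float(weight))


def top_keywords(records: Iterable[str], k: int = 3) -> List[Keyword]:
    """Return the k keywords with the largest weights, highest first."""
    parsed = [parse_keyword(r) for r in records]
    return sorted(parsed, key=lambda kw: kw.weight, reverse=True)[:k]
```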
As a preferred scheme of the invention, word2vec expansion can be performed on a single keyword. word2vec is a tool that converts a word into vector form, so that the processing of text content can be reduced to vector operations in a vector space. It is essentially a matrix factorization model, in which the matrix describes the relationship between each keyword and the set of words in its context. The matrix is factorized and only the vector corresponding to each keyword in the latent space is kept, which completes the mapping from word to vector. The words related to the keyword, namely the expanded keywords, are then calculated from the pairwise similarity between vectors.
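The expansion step, finding the words whose vectors are most similar to a keyword's vector, can be sketched as below. The embeddings are assumed to have been trained already and are passed in as a plain dictionary; the function names, the `top_n` limit and the `min_sim` cutoff are illustrative assumptions rather than values given in the text.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def expand_keyword(word: str, vectors: Dict[str, Sequence[float]],
                   top_n: int = 2, min_sim: float = 0.5) -> List[str]:
    """Return up to top_n other words whose similarity to `word` is >= min_sim."""
    if word not in vectors:
        return []
    base = vectors[word]
    scored = [(other, cosine(base, vec))
              for other, vec in vectors.items() if other != word]
    scored = [(w, s) for w, s in scored if s >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:top_n]]
```

A production system would typically query a trained embedding model or an approximate nearest-neighbor index instead of scanning every vector.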
Step 103: calculating a topic model based on the segmented URL to obtain topic numbers;
Specifically, based on the segmented URL, the topics to which the news belongs and their weights are calculated, and the topic content is replaced by a number, namely the topic number. In the embodiment of the invention, the topic model is calculated using the LDA model package on Spark, the number of topics is set to 200, and an LDA model is trained separately for the news of each category and for all news as a whole.
Step 104: establishing an inverted index according to the obtained related information of the keywords, the topic numbers and the known news categories;
Specifically, the top few native keywords calculated in step 102 are taken by weight (for example, Top3), together with all of the word2vec-expanded keywords, as the first part; the top few topics calculated in step 103 are taken by weight (for example, Top3) as the second part; and the known news category is taken as the third part. The Cartesian product of the three parts is taken to generate n triplet keys, where val is the URL. In this way an index is built for every URL in the set to be indexed, and the index is rebuilt about once every 10 minutes.
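The Cartesian-product construction of triplet keys can be sketched as follows. Representing each key as a `(keyword, topic number, category)` tuple and the index as a dictionary of URL sets is an assumption made for illustration; the text does not specify how keys are serialized.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Key = Tuple[str, int, str]  # (keyword, topic number, news category)


def build_triplet_keys(native_top: Sequence[str], expanded: Sequence[str],
                       topic_top: Sequence[int], category: str) -> List[Key]:
    """Cartesian product of keywords x topic numbers x the single news category."""
    # Merge native and expanded keywords, dropping duplicates but keeping order.
    words = list(dict.fromkeys(list(native_top) + list(expanded)))
    return list(product(words, topic_top, [category]))


class InvertedIndex:
    """Maps each triplet key to the set of URLs that produced it."""

    def __init__(self) -> None:
        self.postings: Dict[Key, Set[str]] = defaultdict(set)

    def add(self, url: str, keys: Iterable[Key]) -> None:
        for key in keys:
            self.postings[key].add(url)

    def lookup(self, key: Key) -> Set[str]:
        return set(self.postings.get(key, ()))
```

With three native keywords, two expanded keywords and three topic numbers, one news item yields up to 5 × 3 × 1 = 15 keys.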
Step 105: selecting a candidate set by scoring hits against the inverted index;
Specifically, the inverted index triplet key is composed of three parts: a native keyword or a word2vec-expanded keyword, a topic number, and a news category. The set of triplet keys that are hit can therefore be denoted X = {x1, x2, …, xn}.
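One simple way to turn hits into a candidate set is to count, for each indexed URL, how many of the first news item's triplet keys it shares. The text does not give the scoring formula, so the plain hit count used below, together with the `min_hits` and `limit` parameters, is an assumption for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, int, str]  # (keyword, topic number, news category)


def select_candidates(query_url: str, query_keys: Iterable[Key],
                      postings: Dict[Key, Set[str]],
                      min_hits: int = 1, limit: int = 50) -> List[Tuple[str, int]]:
    """Score each indexed URL by the number of query keys it shares."""
    hits: Counter = Counter()
    for key in query_keys:
        for url in postings.get(key, ()):
            if url != query_url:  # never recommend the news item itself
                hits[url] += 1
    # Highest hit count first; ties broken by URL for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(url, count) for url, count in ranked if count >= min_hits][:limit]
```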
It should be noted that building the inverted index from triplet keys as described above is suitable for some categories of news, such as automobile, science and technology, sports, history, international and military. For news categories that are not suited to a triplet-key inverted index, a single-category topic model or a global topic model can be used instead. For example, for categories such as domestic, social, economic, health and entertainment, a single-category topic model is used, and the candidate set is selected according to the similarity of the topic distributions of all news within the category. The similarity here is cosine similarity, with the topic probability distribution of each news item treated as an n-dimensional vector. News whose similarity is larger than a certain threshold is regarded as related news and enters the candidate set. For categories such as info, news and the like, a global topic model is used, and the candidate set is selected according to the topic distribution similarity over all news globally. The calculation is similar to that of the single-category topic model; the difference is that the topic probability distribution comes from the global news topic model and the set to be compared covers news of all categories, so the vector dimensionality and calculation time are too high. Observation of the global topic model shows that the topic probability distribution is usually concentrated on a few topics and has a pronounced long tail. The long tail is therefore truncated, and only the Top 10 topics of each news item are used for the cosine similarity calculation.
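The long-tail truncation followed by cosine similarity can be sketched as below. Topic distributions are assumed to be dense lists of probabilities indexed by topic number; the function names are hypothetical, and only the Top 10 cutoff comes from the text.

```python
import math
from typing import Dict, Sequence


def truncate_top_k(dist: Sequence[float], k: int = 10) -> Dict[int, float]:
    """Keep only the k most probable topics as a sparse {topic number: prob} map."""
    top = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)[:k]
    return {i: dist[i] for i in top}


def sparse_cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def topic_similarity(p: Sequence[float], q: Sequence[float], k: int = 10) -> float:
    """Cosine similarity of two topic distributions after top-k truncation."""
    return sparse_cosine(truncate_top_k(p, k), truncate_top_k(q, k))
```

With 200 topics, truncation reduces each comparison from 200 multiplications to at most 10 dictionary lookups, at the cost of ignoring low-probability topics.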
Step 106: carrying out duplicate elimination processing on the candidate set;
Specifically, in the embodiment of the present invention, there are two deduplication strategies. First, in the news categories whose candidate set is generated from the topic distribution similarity score, news whose topic distribution similarity score is greater than a certain threshold (which depends on the news category) is regarded as duplicate news and is removed by deduplication filtering. Second, for news of all categories, the titles are segmented into words and stop words are filtered out, and the Jaccard similarity of the title terms is calculated between every two news items:
J(A, B) = |A ∩ B| / |A ∪ B|,
where A and B are the sets of title terms of the two news items. Two news items whose Jaccard similarity is larger than a certain threshold are regarded as duplicate news and are removed by deduplication filtering.
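The title-based deduplication can be sketched as follows. Whitespace splitting stands in for real word segmentation, and the stop-word list and the 0.8 threshold are illustrative assumptions, not values given in the text.

```python
from typing import FrozenSet, List, Set

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS: FrozenSet[str] = frozenset({"a", "an", "the", "of", "in", "on", "to"})


def title_terms(title: str) -> Set[str]:
    """Segment a title (here: whitespace split) and drop stop words."""
    return {t for t in title.lower().split() if t not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe_titles(titles: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a title only if it is not too similar to any title already kept."""
    kept: List[str] = []
    kept_terms: List[Set[str]] = []
    for title in titles:
        terms = title_terms(title)
        if all(jaccard(terms, other) <= threshold for other in kept_terms):
            kept.append(title)
            kept_terms.append(terms)
    return kept
```

The pairwise comparison is quadratic in the candidate set size, which is acceptable when the candidate set is small.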
Step 107: sorting the news in the candidate set to select one or more second news which are most related to the first news;
Specifically, after the candidate set of related news is obtained, it needs to be ranked according to the rank policies, so that the related news ranked first scores highest across the dimensions, and the one or more top-scoring news items are taken as the second news.
The rank policy may adopt one or more of the following policies:
Time weight adjustment: related news currently indexes only news from the last 7 days, so each news item can be weighted according to its generation time t, over a range of 0 to 7 days and with second-level precision; the smaller t is, the higher the weight. Two approaches were considered, linear smoothing and accelerated weighting, and the latter is chosen here.
Place-name weight adjustment: this addresses a problem caused by using keywords in the inverted index. For a news item such as "a car accident happened somewhere in Beijing", the extracted keyword is "car accident", which can retrieve regional news such as "a car accident happened in xx county, xx city". Apart from first-tier cities and some well-known cities, users pay little attention to such regional news, so place names require special down-weighting.
Picture-count weight adjustment: whether a news item is displayed with pictures is also an important factor in attracting clicks, so the weight is adjusted according to the three list display styles: no picture, single picture and three pictures.
Click-feedback weight adjustment: related news with a high CTR is weighted up appropriately according to its online impressions and clicks.
CTR-estimation weight adjustment: after a certain amount of user behavior has accumulated, a CTR estimation model can be trained on various features of an article and on whether the user finally clicked. Here the model is trained using the cosine similarity between the Top3 topics of the topic probability distribution and the positive samples, and is used online.
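How several of these adjustments might combine into one ranking score is sketched below. The text names linear smoothing and accelerated weighting without giving formulas, so the linear decay, the power-law "accelerated" decay, the picture multipliers and the multiplicative combination are all assumptions made for illustration.

```python
WINDOW_SECONDS = 7 * 24 * 3600  # only news from the last 7 days is indexed

# Hypothetical multipliers for the three list display styles.
PICTURE_WEIGHT = {0: 0.9, 1: 1.0, 3: 1.1}


def linear_time_weight(age_seconds: float) -> float:
    """Linear decay from 1.0 (just published) to 0.0 (7 days old)."""
    clamped = min(max(age_seconds, 0.0), WINDOW_SECONDS)
    return 1.0 - clamped / WINDOW_SECONDS


def accelerated_time_weight(age_seconds: float, power: float = 2.0) -> float:
    """Assumed 'accelerated' variant: older news loses weight faster than linearly."""
    return linear_time_weight(age_seconds) ** power


def rank_score(base_score: float, age_seconds: float, picture_count: int,
               click_boost: float = 1.0, place_penalty: float = 1.0) -> float:
    """Multiply the candidate's base relevance by each adjustment factor."""
    return (base_score
            * accelerated_time_weight(age_seconds)
            * PICTURE_WEIGHT.get(picture_count, 1.0)
            * click_boost
            * place_penalty)
```

Here `click_boost` would be above 1.0 for news with a high online CTR and `place_penalty` below 1.0 for news tied to a little-known place name; both values are placeholders to be tuned against real traffic.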
It should be noted that, because calculating the topic model for news is time-consuming, asynchronous computation is used to separate the keyword and topic model calculation, and the establishment of the inverted index, from the main flow of related news calculation. Meanwhile, so that related news is updated quickly enough, different calculation frequencies are used for news from different time periods.
Fig. 2 and fig. 3 are schematic diagrams of an application example of the method according to the embodiment of the present invention. The related news calculation method and flow are currently embedded in the news channels of various product lines, such as 360 mobile phone browser, 360 search app, 360 search web and mobile phone guard. As shown in fig. 2, after the user opens the news detail page titled "all over the world, the recording process of a runner is actually such a child … …", the back-end system performs the calculation according to the method described in the embodiment of the present invention to obtain a plurality of related news items as a candidate set, then ranks the news in the candidate set and selects the TOP3 to recommend to the user. As shown in fig. 3, 3 related news items are recommended to the user, with the titles "various punting shots whose games for a runner are referred to as" fake and secret "bar brother", "bar brother for a runner" and "foretell: the brother of the running bar 4 invites guests in the second period and the grouped book of the book giardia are fiercely attacked".
The apparatus according to the embodiment of the present invention will be described in detail with reference to fig. 4.
Fig. 4 is a schematic structural diagram of a device according to an embodiment of the present invention, which may specifically include: a segmentation unit, a calculating unit, an index establishing unit, a candidate set selecting unit, a deduplication unit and a sorting unit, wherein,
the segmentation unit is used for selecting the URL of the first news for segmentation. Specifically, news flowing in from Kafka first undergoes news preprocessing, in which news categories unsuitable for calculating related news are filtered out. The segmentation unit then divides the resulting first news into a text part and other information (such as title, capture time, category and the like), which are stored separately on Redis and indexed by the two keys url@uinfo and url@content. Different expiration times (for example, 3 days for the text and 30 days for the other information) are set to conserve storage space. Kafka is a distributed messaging system that categorizes messages by topic when storing them. Redis is an open-source, log-type, key-value database written in ANSI C that supports network access, can run in memory or persist data, and provides APIs for multiple languages.
The computing unit is used for extracting keywords and computing a topic model based on the divided URLs;
the calculating unit specifically includes: a first computing module and a second computing module, wherein,
the first calculation module is used for respectively extracting text and title keywords from the URL of the first news and calculating the weight of the keywords according to the text and the title keywords; specifically, based on the segmented URL, the first calculation module extracts the text and title keywords and calculates their weights, and outputs them in the format "word#part of speech#weight".
As a preferred scheme of the invention, the first calculation module can also perform word2vec expansion on a single keyword. word2vec is a tool that converts a word into vector form, so that the processing of text content can be reduced to vector operations in a vector space. It is essentially a matrix factorization model, in which the matrix describes the relationship between each keyword and the set of words in its context. The matrix is factorized and only the vector corresponding to each keyword in the latent space is kept, which completes the mapping from word to vector. The words related to the keyword, namely the expanded keywords, are then calculated from the similarity between vectors.
The second calculation module is used for calculating the topics of the first news and their weights according to the URL of the first news; specifically, based on the segmented URL, the second calculation module calculates the topics to which the news belongs and their weights, and the topic content is replaced by a number, namely the topic number. In the embodiment of the invention, the topic model is calculated using the LDA model package on Spark, the number of topics is set to 200, and an LDA model is trained separately for the news of each category and for all news as a whole.
The index establishing unit is used for establishing an inverted index according to the keyword-related information, the topic numbers and the known news categories obtained by the calculating unit; specifically, the top few native keywords calculated by the first calculation module are taken by weight (for example, Top3), together with all of the word2vec-expanded keywords, as the first part; the top few topics calculated by the second calculation module are taken by weight (for example, Top3) as the second part; the known news category is taken as the third part; and the index establishing unit takes the Cartesian product of the three parts to generate n triplet keys, where val is the URL.
A candidate set selecting unit, configured to select a candidate set based on the inverted index; specifically, the inverted index triplet key is composed of three parts: a native keyword or a word2vec-expanded keyword, a topic number, and a news category. The set of triplet keys that are hit can therefore be denoted X = {x1, x2, …, xn}.
Before the candidate set is ranked, the deduplication unit performs deduplication filtering on the news in the candidate set: it segments the titles of the news into words and filters out stop words, calculates the Jaccard similarity of the title terms between every two news items, and regards two news items whose Jaccard similarity is larger than a predetermined threshold as duplicate news.
A sorting unit for sorting the candidate set to select one or more second news related to the first news; specifically, after the candidate set of related news is obtained, the sorting unit ranks it according to the rank policies, so that the related news ranked first scores highest across the dimensions, and the one or more top-scoring news items are taken as the second news.
The rank policy may adopt one or more of the following policies:
Time weight adjustment: related news currently indexes only news from the last 7 days, so each news item can be weighted according to its generation time t, over a range of 0 to 7 days and with second-level precision; the smaller t is, the higher the weight. Two approaches were considered, linear smoothing and accelerated weighting, and the latter is chosen here.
Place-name weight adjustment: this addresses a problem caused by using keywords in the inverted index. For a news item such as "a car accident happened somewhere in Beijing", the extracted keyword is "car accident", which can retrieve regional news such as "a car accident happened in xx county, xx city". Apart from first-tier cities and some well-known cities, users pay little attention to such regional news, so place names require special down-weighting.
Picture-count weight adjustment: whether a news item is displayed with pictures is also an important factor in attracting clicks, so the weight is adjusted according to the three list display styles: no picture, single picture and three pictures.
Click-feedback weight adjustment: related news with a high CTR is weighted up appropriately according to its online impressions and clicks.
CTR-estimation weight adjustment: after a certain amount of user behavior has accumulated, a CTR estimation model can be trained on various features of an article and on whether the user finally clicked. Here the model is trained using the cosine similarity between the Top3 topics of the topic probability distribution and the positive samples, and is used online.
In summary, the embodiment of the present invention is a technical solution that calculates related news from multiple dimensions. First, the news candidate set obtained by fusing keywords with a topic model overcomes the defect of calculating related news using keywords alone or a topic model alone, and takes both relevance and novelty into account; second, various strategies such as time weighting, deduplication, CTR estimation and online click feedback are added to the ranking flow according to the news category, to adapt to the real online environment.
It should be noted that the embodiments and features of the embodiments of the present invention may be arbitrarily combined with each other without conflict.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a network search system according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Moreover, it is noted that instances of the phrase "in one embodiment" are not necessarily all referring to the same embodiment.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.
Moreover, it should also be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.