WO2017020451A1 - Information push method and apparatus - Google Patents

Information push method and apparatus

Info

Publication number
WO2017020451A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
keyword
accessed
keyword set
information
Prior art date
Application number
PCT/CN2015/095754
Other languages
English (en)
French (fr)
Inventor
裘皓萍
陈炜于
Original Assignee
百度在线网络技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司
Publication of WO2017020451A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Definitions

  • The present application relates to the field of computer technologies, specifically to the field of Internet technologies, and in particular to an information push method and apparatus.
  • Information push, also known as “webcasting”, is a technology that reduces information overload by pushing the information a user needs over the Internet according to certain technical standards or protocols. By actively pushing information to users, it reduces the time users spend searching the network.
  • However, the information pushed to the user is often one or more independent items with no correlation between them. If a pushed item is one stage in the progress of an event, it is difficult for the user to learn the background or development of that event from the pushed content alone. This kind of information push technology therefore makes insufficient use of related data on the network, and the pushed content is not rich enough.
  • the purpose of the present application is to propose an improved information push method and apparatus to solve the technical problems mentioned in the background section above.
  • The present application provides an information push method, the method comprising: acquiring page access information of at least one site, wherein the page access information includes a web address of each accessed page and a page visit amount; performing content analysis on the page corresponding to each web address to generate a keyword set for each accessed page; based on mutual comparison of the keyword sets, merging the keyword sets whose similarity is greater than a first preset threshold to generate at least one associated page keyword set, wherein the accessed pages corresponding to the keyword sets used to generate an associated page keyword set are mutually associated pages; based on a ranking of the sums of page visit amounts of the accessed pages corresponding to each of the at least one associated page keyword set, generating first push information by using one or more of the at least one associated page keyword set; and, based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information, generating second push information associated with the first push information and pushing it to the user.
  • Generating, based on the at least one accessed page corresponding to the associated page keyword set used to generate the first push information, the second push information associated with the first push information and pushing it to the user comprises: clustering the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information according to a preset time interval, thereby dividing them into at least one time period, wherein, when the at least one time period includes two or more time periods, the time difference between publication times taken from any two different time periods is greater than the time interval; for one or more of the at least one time period, extracting one page from the accessed pages corresponding to each time period; and generating the second push information based on the extracted pages and pushing it to the user.
  • Before the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information are clustered according to the preset time interval and divided into at least one time period, the method further includes: among the accessed pages corresponding to the associated page keyword set, reducing the accessed pages corresponding to keyword sets whose similarity is greater than a second preset threshold to a single page, and taking the accessed pages that remain after this screening as the accessed pages corresponding to the associated page keyword set, wherein the second preset threshold is greater than the first preset threshold.
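  • The time-period clustering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a simple gap-based split (start a new period whenever the gap to the previous publication time exceeds the interval), which is one way to satisfy the stated condition that times taken from two different periods differ by more than the interval. The function name `cluster_by_gap` and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def cluster_by_gap(publish_times, interval):
    """Split publication times into periods separated by gaps larger than `interval`.

    Assumption: a new period starts whenever the gap to the previous
    (sorted) publication time exceeds the preset interval.
    """
    periods = []
    for t in sorted(publish_times):
        if periods and t - periods[-1][-1] <= interval:
            periods[-1].append(t)  # close enough: same period
        else:
            periods.append([t])  # gap too large: start a new period
    return periods


# Hypothetical publication times of the accessed pages in one associated group.
times = [
    datetime(2015, 8, 1, 9),
    datetime(2015, 8, 1, 15),
    datetime(2015, 8, 5, 10),
    datetime(2015, 8, 1, 12),
]
periods = cluster_by_gap(times, timedelta(days=1))
# One page could then be extracted from each period, e.g. the earliest one.
representatives = [period[0] for period in periods]
```

  • With a one-day interval, the three pages published on 1 August fall into one period and the page published on 5 August into another, so two representative pages would be extracted.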
  • Performing content analysis on the page corresponding to each web address to generate the keyword set of each accessed page comprises: performing statistical analysis and/or semantic analysis on the content of the accessed page to extract at least one keyword; and generating a keyword set based on the at least one keyword.
  • Generating the keyword set based on the at least one keyword comprises: expanding each single keyword of the at least one keyword to generate extended keywords, wherein the extended keywords include at least one of: a synonym of the single keyword, a near-synonym of the single keyword, and a related word of the single keyword; and generating the keyword set based on the at least one keyword and the extended keywords.
  • Keyword sets that satisfy one of the following conditions are treated as keyword sets whose similarity is greater than the first preset threshold: the number of identical keywords is greater than a number threshold; or the ratio of the number of identical keywords to the total number of keywords in the compared keyword sets is greater than a ratio threshold.
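  • The two conditions above can be illustrated with a short sketch. The function name `is_similar` and the default thresholds are hypothetical, and the "total number of keywords in the compared keyword sets" is assumed here to mean the size of the union of the two sets; the text does not specify this.

```python
def is_similar(set_a, set_b, count_threshold=3, ratio_threshold=0.5):
    """Return True if either the count condition or the ratio condition holds.

    Assumption: the ratio is taken against the union of both keyword sets.
    """
    shared = set_a & set_b
    total = len(set_a | set_b)
    ratio = len(shared) / total if total else 0.0
    return len(shared) > count_threshold or ratio > ratio_threshold


page_a = {"Japan", "island", "reclamation"}
page_b = {"Japan", "island", "sovereignty"}

# Two shared keywords out of four distinct keywords gives a ratio of 0.5.
strict = is_similar(page_a, page_b)                       # 0.5 is not > 0.5
loose = is_similar(page_a, page_b, ratio_threshold=0.4)   # 0.5 > 0.4
```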
  • Each keyword in the keyword set further has an importance coefficient, and merging, based on mutual comparison of the keyword sets, the keyword sets whose similarity is greater than the first preset threshold to generate the at least one associated page keyword set includes: performing similarity calculation on different keyword sets based on the importance coefficients; and merging the keyword sets whose similarity is greater than a similarity threshold to generate an associated page keyword set.
  • The present application also provides an information push apparatus, the apparatus comprising: an information acquiring module, configured to acquire page access information of at least one site, wherein the page access information includes a URL of each accessed page and a page visit amount; a keyword set generating module, configured to perform content analysis on the page corresponding to each URL to generate a keyword set for each accessed page; a keyword set merging module, configured to merge, based on mutual comparison of the keyword sets, the keyword sets whose similarity is greater than a first preset threshold to generate at least one associated page keyword set, wherein the accessed pages corresponding to the keyword sets used to generate an associated page keyword set are mutually associated pages; a first push information generating module, configured to generate first push information by using one or more of the at least one associated page keyword set, based on a ranking of the sums of page visit amounts of the accessed pages corresponding to each of the at least one associated page keyword set; and a second push information generating and pushing module, configured to generate second push information associated with the first push information and push it to the user, based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information.
  • The second push information generating and pushing module includes: a clustering unit, configured to cluster the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period, wherein, when the at least one time period includes two or more time periods, the time difference between publication times taken from any two different time periods is greater than the time interval; an extracting unit, configured to extract, for one or more of the at least one time period, one page from the accessed pages corresponding to each time period; and a generating unit, configured to generate the second push information based on the extracted pages and push it to the user.
  • The second push information generating and pushing module further includes: a screening unit, configured to reduce, among the accessed pages corresponding to the associated page keyword set, the accessed pages corresponding to keyword sets whose similarity is greater than a second preset threshold to a single page, and to take the remaining accessed pages as the accessed pages corresponding to the associated page keyword set, wherein the second preset threshold is greater than the first preset threshold.
  • The keyword set generating module includes: a keyword extracting unit, configured to perform statistical analysis and/or semantic analysis on the content of the accessed page to extract at least one keyword; and a keyword set generating unit, configured to generate a keyword set based on the at least one keyword.
  • The keyword set generating unit includes: an expansion subunit, configured to expand each single keyword of the at least one keyword to generate extended keywords, wherein the extended keywords include at least one of: a synonym of the single keyword, a near-synonym of the single keyword, and a related word of the single keyword; and a keyword set generating subunit, configured to generate the keyword set based on the at least one keyword and the extended keywords.
  • The keyword set merging module is further configured to treat keyword sets that satisfy one of the following conditions as keyword sets whose similarity is greater than the first preset threshold: the number of identical keywords is greater than a number threshold; or the ratio of the number of identical keywords to the total number of keywords in the compared keyword sets is greater than a ratio threshold.
  • Each keyword in the keyword set further has an importance coefficient, and the keyword set merging module includes: a calculating unit, configured to perform similarity calculation on different keyword sets based on the importance coefficients; and a merging and generating unit, configured to merge the keyword sets whose similarity is greater than a similarity threshold to generate an associated page keyword set.
  • The information push method and apparatus provided by the present application acquire page access information of at least one site, perform content analysis on the page corresponding to each URL to generate a keyword set for each accessed page, merge the keyword sets whose similarity is greater than the first preset threshold, based on mutual comparison of the keyword sets, to generate at least one associated page keyword set, generate first push information by using one or more of the at least one associated page keyword set, based on a ranking of the sums of page visit amounts of the accessed pages corresponding to each associated page keyword set, and finally generate second push information associated with the first push information and push it to the user, based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information.
  • the information push method and apparatus may further push the second push information associated with the first push information to the user, thereby enriching the content of the push information.
  • FIG. 1 is a flow chart of one embodiment of an information push method according to the present application.
  • FIG. 2 is a schematic diagram of an application example of an information push method according to the present application.
  • FIG. 3 is a flow chart of still another embodiment of an information push method according to the present application.
  • FIG. 4 is an effect diagram of an application scenario of an embodiment of the information pushing method shown in FIG. 3;
  • FIG. 5 is a schematic structural diagram of an embodiment of an information pushing apparatus according to the present application.
  • FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present application.
  • FIG. 1 illustrates a flow 100 of one embodiment of a method of information push.
  • This embodiment is mainly illustrated by applying the method to an electronic device with certain computing capabilities, which may include, but is not limited to, a smartphone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like.
  • the information pushing method includes the following steps:
  • Step 101: Acquire page access information of at least one site, where the page access information includes a URL of each accessed page and a page visit amount.
  • The electronic device may obtain the page access information of the at least one site locally or remotely.
  • When the electronic device is a web server that supports the at least one site, the page access information may be obtained directly from the local device; when the electronic device is not a web server that supports the site, the page access information may be obtained from the website server through a wired or wireless connection.
  • the above wireless connection methods include, but are not limited to, 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods now known or developed in the future.
  • the page access information may include the URL of the page being accessed and the page visit amount.
  • the page being accessed may be a page that has been visited by the user.
  • Each page accessed by the user corresponds to a web address, which can be represented by a Uniform Resource Locator (URL).
  • the electronic device can obtain the URL of the page accessed by the user from one or more sites (eg, a forum website).
  • the electronic device may also obtain the page content of the accessed page.
  • the electronic device can also obtain the page access amount while acquiring the URL of the page.
  • the page visit amount may be the total number of times the page is accessed, or the number of times the page is accessed within a certain period of time (for example, 24 hours).
  • The accessed pages obtained by the electronic device may be all pages accessed by users, pages whose visit amount is greater than a certain threshold (for example, 50 visits), or a preset number of pages (for example, 100,000) taken in order of visit amount from high to low.
  • Step 102: Perform content analysis on the page corresponding to each URL, and generate a keyword set for each accessed page.
  • the electronic device may parse the content of the page corresponding to each of the foregoing URLs by using various methods, extract one or more keywords, and generate a keyword set.
  • the method for analyzing the content of the foregoing page by the electronic device may be a statistical analysis method.
  • an electronic device may extract keywords of the above page by using a Latent Dirichlet Allocation (LDA) model.
  • The electronic device can treat each page as a word frequency vector (for example, a vector of each word and its frequency of occurrence), thereby converting the text information into numerical information that is easy to model, and can establish a three-layer Bayesian probability model over the three-layer structure of words, topics, and documents (the page content of each page may be treated as a document).
  • In this model, the distribution from document to topic is multinomial, and the distribution from topic to word is also multinomial.
  • Each page is thus represented as a probability distribution over a number of topics, and each topic as a probability distribution over many words.
  • Based on the probability distribution of words, the electronic device may take the words whose distribution probability is greater than a certain threshold (for example, greater than 1%) as the keywords of the page, or may select a certain number of words (for example, 20) from each page in order of distribution probability from high to low as the keywords of the page.
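  • The two selection rules above (probability threshold or top-k) can be sketched as follows. The word probabilities in this example are made-up stand-ins for the output of a topic model such as LDA, and the function name `pick_keywords` is hypothetical.

```python
def pick_keywords(word_probs, min_prob=None, top_k=None):
    """Select keywords from a {word: probability} mapping.

    `min_prob` keeps words whose probability exceeds the threshold;
    `top_k` keeps the k most probable words. Either rule may be used alone.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    if min_prob is not None:
        ranked = [(w, p) for w, p in ranked if p > min_prob]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [w for w, _ in ranked]


# Illustrative probabilities only; a real system would obtain them from the model.
word_probs = {"Japan": 0.04, "island": 0.03, "the": 0.005, "reclamation": 0.02}

by_threshold = pick_keywords(word_probs, min_prob=0.01)
by_rank = pick_keywords(word_probs, top_k=2)
```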
  • the method for analyzing the content of the foregoing page by the electronic device may also be a semantic analysis method.
  • The electronic device may apply a full segmentation method to the content of the accessed page to divide the content into words, calculate the importance of the resulting words (for example, using the term frequency-inverse document frequency (TF-IDF) method), and, based on the result of the importance calculation, filter out commonly used function words (for Chinese, words such as "to") and other words that carry no actual meaning, thereby obtaining the keywords.
  • The electronic device may first use the full segmentation method to produce all possible segmentations whose words match the language lexicon, and then use a statistical language model (for example, an N-Gram model) to determine the optimal segmentation result.
  • The N-Gram model is a commonly used language model; for Chinese, it may be called the Chinese Language Model (CLM).
  • The N-Gram model is based on the assumption that the occurrence of the Nth word depends only on the previous N-1 words and on no other words.
  • The probability of an entire sentence is the product of the probabilities of occurrence of its words, and these probabilities can be obtained by counting, in a corpus, the number of times N words occur together.
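  • As an illustration of scoring candidate segmentations with an N-Gram model, the sketch below uses the simplest non-trivial case, N = 2 (a bigram model), with unsmoothed counts. The function name, the `<s>` start marker, and the toy corpus are assumptions made for this example; a practical system would use a large corpus and smoothing.

```python
from collections import Counter


def bigram_sentence_prob(sentence, corpus):
    """Probability of a segmented sentence under an unsmoothed bigram model.

    P(sentence) is the product of P(word | previous word), where each
    conditional probability is count(previous, word) / count(previous).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in corpus:
        padded = ["<s>"] + words  # "<s>" marks the sentence start
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))

    prob = 1.0
    padded = ["<s>"] + sentence
    for prev, word in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


# Toy corpus of already segmented sentences (placeholder tokens).
corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
# Two candidate segmentations produced by full segmentation.
candidates = [["a", "b"], ["a", "c"]]
best = max(candidates, key=lambda s: bigram_sentence_prob(s, corpus))
```

  • Here the candidate `["a", "b"]` wins because "b" follows "a" in two of the three corpus sentences.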
  • The electronic device can then calculate the importance of these words by using the term frequency-inverse document frequency (TF-IDF) method.
  • The main idea of the TF-IDF method is that if a word or phrase appears frequently in one document or page and rarely in others, it is considered to have good class-distinguishing ability and to be suitable for classification.
  • The term frequency (TF) measures the importance of a word or phrase to a document or page: the more often the word or phrase appears in the document or page, the larger the TF, and otherwise the smaller the TF.
  • The inverse document frequency (IDF) measures how rare a word or phrase is across documents or pages: the fewer the documents or pages that contain it, the larger the IDF.
  • the electronic device can measure the importance of a word or phrase in a certain page according to the product of the TF and the IDF, thereby extracting one or more keywords of the page.
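  • A minimal sketch of the TF × IDF product is shown below. It assumes the plain variant TF = count / page length and IDF = log(N / document frequency), with no smoothing; the text does not say which variant is used. The function name `tf_idf` and the sample pages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(pages):
    """Compute TF-IDF scores for each page.

    `pages` is a list of token lists; the result is one {word: score}
    dict per page. Assumes TF = count / page length and
    IDF = log(number of pages / number of pages containing the word).
    """
    n = len(pages)
    doc_freq = Counter()
    for tokens in pages:
        doc_freq.update(set(tokens))  # count each word once per page

    scores = []
    for tokens in pages:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {w: (c / total) * math.log(n / doc_freq[w]) for w, c in counts.items()}
        )
    return scores


pages = [
    ["Japan", "island", "island"],
    ["Japan", "market"],
    ["Japan", "weather"],
]
scores = tf_idf(pages)
top_word = max(scores[0], key=scores[0].get)
```

  • In this toy data "Japan" appears on every page, so its IDF (and score) is zero, while "island" is both frequent on the first page and absent elsewhere, so it becomes that page's top keyword.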
  • The electronic device may further expand each single keyword of the one or more keywords to generate extended keywords, and generate the keyword set from the extended keywords together with the extracted keywords.
  • Each word may have synonyms; for example, “Dad” may have the synonym “Father”. Each word may also have near-synonyms; for example, “Attendance” may have the near-synonym “Participation”. Each word may also have related words; for example, "drawing" may have the related word "draw", and so on.
  • The electronic device may use the synonyms, near-synonyms, and related words of a single keyword among the one or more keywords as the extended keywords of that keyword, and add the extended keywords to the keyword set.
  • the related words of a single keyword may be acquired by a machine learning pre-trained related word model according to a large amount of pre-fetched documents or page data.
  • The related word model may be a model that divides the content of a large number of pre-fetched documents or pages into words by the full segmentation method and then counts the probability that at least two words appear together.
  • words with a probability that is greater than a certain threshold may be related words.
  • each keyword in the keyword set may also have an importance coefficient.
  • the importance coefficient is a value that measures the importance of a keyword relative to the page it is on.
  • For example, the importance coefficient of a keyword extracted from the page may be set to 1, the importance coefficient of a synonym of the keyword to 0.8, the importance coefficient of a near-synonym or related word of the keyword to 0.5, and so on. Note that the purpose of the importance coefficient is to distinguish the importance of keywords.
  • the above specific numerical value is an exemplary description of the importance coefficient, and does not constitute a limitation on the importance coefficient.
  • The importance coefficient of a keyword extracted from the page may also be related to the number of times the keyword appears in the page: the more occurrences, the greater the importance coefficient. The importance coefficient of an extended keyword may also be related to the degree of association between the extended keyword and the keyword extracted from the page; for example, a synonym of a keyword extracted from the page may have the same importance coefficient as that keyword.
  • The preset related word model may also include the degree of relevance of related words, which may be proportional to the probability that the words appear together, and the importance coefficient of a related word of a keyword extracted from the page may be set according to the importance coefficient of that keyword and this degree of relevance.
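  • The expansion step and the example coefficients (1 for extracted keywords, 0.8 for synonyms, 0.5 for the weaker tier of near-synonyms and related words) can be sketched as below. The function name, the lookup dictionaries, and the rule of keeping the highest coefficient when a word is reached by more than one route are assumptions for illustration, not part of the source.

```python
def build_keyword_set(keywords, synonyms, near_synonyms, related):
    """Build a {word: importance coefficient} keyword set.

    Uses the example values from the text: 1 for extracted keywords,
    0.8 for synonyms, 0.5 for near-synonyms and related words.
    Assumption: if a word is reached several ways, keep the highest value.
    """
    result = {}

    def add(word, coeff):
        result[word] = max(coeff, result.get(word, 0.0))

    for kw in keywords:
        add(kw, 1.0)
    for kw in keywords:
        for w in synonyms.get(kw, []):
            add(w, 0.8)
        for w in near_synonyms.get(kw, []) + related.get(kw, []):
            add(w, 0.5)
    return result


# Lookup tables built from the examples in the text.
keyword_set = build_keyword_set(
    ["Dad", "Attendance", "drawing"],
    synonyms={"Dad": ["Father"]},
    near_synonyms={"Attendance": ["Participation"]},
    related={"drawing": ["draw"]},
)
```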
  • Step 103: Based on mutual comparison of the keyword sets, merge the keyword sets whose similarity is greater than the first preset threshold to generate at least one associated page keyword set.
  • The electronic device may further compare different keyword sets, calculate the similarity between them, and merge the keyword sets whose similarity is greater than the first preset threshold to generate an associated page keyword set. The accessed pages corresponding to the keyword sets used to generate an associated page keyword set may be mutually associated pages.
  • the similarity between the sets of keywords can characterize the degree of similarity between different sets of keywords.
  • the electronic device can use the number of identical keywords between the two sets to characterize the degree of similarity between the sets of keywords.
  • the words in the set of keywords may also have importance coefficients.
  • In that case, the electronic device can calculate the similarity between a keyword set A and a keyword set B as the sum of the products of the importance coefficients of the words shared by keyword set A and keyword set B, divided by the product of the square root of the sum of squares of the importance coefficients of the words in keyword set A and the square root of the sum of squares of the importance coefficients of the words in keyword set B.
  • For example, suppose keyword set A includes (Japan 1, island 0.8, reclamation 0.5), where 1, 0.8, and 0.5 are the importance coefficients of the keywords "Japan", "island", and "reclamation" in keyword set A, and keyword set B includes (Japan 0.7, island 1, sovereignty 0.6), where 0.7, 1, and 0.6 are the importance coefficients of the keywords "Japan", "island", and "sovereignty" in keyword set B. The similarity between keyword set A and keyword set B can then be:
    $$\mathrm{sim}(A, B) = \frac{1 \times 0.7 + 0.8 \times 1}{\sqrt{1^2 + 0.8^2 + 0.5^2} \times \sqrt{0.7^2 + 1^2 + 0.6^2}} \approx 0.80$$
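  • The similarity calculation for the example sets can be checked with a short sketch. It reads the description as a weighted cosine similarity (shared-word products over the product of the two square-rooted sums of squares), which is an interpretation of the wording rather than something the text states outright, and it assumes the second keyword of set B is "island", as the sentence listing B's keywords indicates. The function name `similarity` is hypothetical.

```python
import math


def similarity(set_a, set_b):
    """Weighted cosine similarity between two {word: coefficient} keyword sets."""
    shared = set_a.keys() & set_b.keys()
    dot = sum(set_a[w] * set_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in set_a.values()))
    norm_b = math.sqrt(sum(c * c for c in set_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty set is similar to nothing
    return dot / (norm_a * norm_b)


keyword_set_a = {"Japan": 1.0, "island": 0.8, "reclamation": 0.5}
keyword_set_b = {"Japan": 0.7, "island": 1.0, "sovereignty": 0.6}

sim = similarity(keyword_set_a, keyword_set_b)
print(round(sim, 3))  # 0.802
```

  • With a first preset threshold of 0.5, as in the example value given in the text, these two sets would be merged.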
  • The first preset threshold may be a threshold set according to experience (for example, 0.5), or may be the threshold at which a classification model, trained on pre-acquired page samples and verified with validation samples, reaches a certain classification accuracy (such as 99%).
  • To merge keyword sets, the electronic device may simply add the words of the different keyword sets into one set, or may de-duplicate the words of the different keyword sets into one set and add together the importance coefficients of identical keywords.
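  • The second merging option (de-duplicate the words and add the importance coefficients of identical keywords) can be sketched as follows; the function name `merge_keyword_sets` is hypothetical.

```python
def merge_keyword_sets(keyword_sets):
    """Merge {word: coefficient} keyword sets into one.

    Each word appears once in the result, and the importance
    coefficients of identical keywords are added together.
    """
    merged = {}
    for keyword_set in keyword_sets:
        for word, coeff in keyword_set.items():
            merged[word] = merged.get(word, 0.0) + coeff
    return merged


merged = merge_keyword_sets([
    {"Japan": 1.0, "island": 0.8, "reclamation": 0.5},
    {"Japan": 0.7, "island": 1.0, "sovereignty": 0.6},
])
```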
  • In this way, the electronic device can divide the accessed pages acquired in step 101 into a plurality of categories. Each category consists of at least one accessed page, and the accessed pages in a category are similar or related in content and are mutually associated pages. At the same time, the keyword sets corresponding to the mutually associated pages are merged to generate an associated page keyword set.
  • the electronic device may also acquire the associated page by a method of text clustering (such as K-means), and generate an associated page keyword set.
  • With the K-means clustering method, the electronic device can first select the K pages with the highest page visit amounts as the cluster centroids, then measure the distance from each other page to every centroid and assign the page to the class of the nearest centroid, and then recalculate the centroids of the resulting classes. The steps of measuring the distance from the other pages to each centroid and assigning each page to the class of the nearest centroid are repeated until the new centroids are equal to the original centroids or differ from them by less than a specified threshold. At this point, the pages are divided into K categories, and the accessed pages in each category may be mutually associated pages.
  • the keyword set of the accessed page of the mutually associated page is merged according to the above method, and the associated page keyword set can be obtained.
  • Step 104: Based on a ranking of the sums of page visit amounts of the accessed pages corresponding to each set in the at least one associated page keyword set, generate first push information by using one or more sets of the at least one associated page keyword set.
  • The electronic device may first obtain, for each associated page keyword set, the sum of the page visit amounts of the corresponding accessed pages, sort the sets by these sums (for example, from high to low), and then, based on the sorting result, generate the first push information by using one or more of the at least one associated page keyword set.
  • For example, the electronic device may take a preset number (for example, 10) of the top-ranked associated page keyword sets and then generate the first push information from those sets or from the accessed pages corresponding to them.
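  • The ranking step can be sketched as below. The group identifiers, URLs, and visit counts are invented for illustration, and `top_associated_sets` is a hypothetical helper name.

```python
def top_associated_sets(groups, visits, n=10):
    """Rank associated-page groups by the sum of their pages' visit amounts.

    `groups` maps a group id to the URLs of its mutually associated pages;
    `visits` maps a URL to its page visit amount. Returns the top `n`
    (group id, total visits) pairs, highest total first.
    """
    totals = {
        gid: sum(visits.get(url, 0) for url in urls)
        for gid, urls in groups.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical data: two groups of associated pages.
groups = {
    "island-dispute": ["http://example.com/p1", "http://example.com/p2"],
    "stock-market": ["http://example.com/p3"],
}
visits = {
    "http://example.com/p1": 10,
    "http://example.com/p2": 5,
    "http://example.com/p3": 12,
}

ranking = top_associated_sets(groups, visits)
top_one = top_associated_sets(groups, visits, n=1)
```

  • Although the single page in the second group has the most visits individually, the first group ranks higher because the ranking uses the sum over all pages in the group.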
  • the electronic device may select a page with the latest release time in the accessed page corresponding to the associated page keyword set, and use the theme or keyword of the page as the first push information.
  • The electronic device may also sort the words in the associated page keyword set from largest to smallest by the number of corresponding accessed pages or by page visit amount, and select a preset number of top-ranked keywords as the first push information.
  • The electronic device may also use the theme of the page with the highest page visit amount among the associated pages corresponding to the associated page keyword set as the first push information.
  • The electronic device may also generate the first push information in other ways, for example by using the keywords of the page with the highest page visit amount among the accessed pages corresponding to the associated page keyword set. The present application does not limit this.
  • the first push information may further include a sum of page visit amounts of the associated pages corresponding to the associated page keyword set, or a page visit amount of the accessed page for generating the first push information.
  • the electronic device can push the first push information to the user.
  • The electronic device may present the first push information to the user directly, or may push it to the user in the form of a hyperlink. The hyperlink may be text including a keyword or a topic name and may link to the accessed page corresponding to the first push information, or to the page with the highest page visit amount among the accessed pages corresponding to the associated page keyword set of the first push information.
  • The electronic device can take the N categories (N is a positive integer) with the highest visit amounts among the categories into which the above pages were divided, and generate N pieces of first push information from these N categories.
  • Step 105: Based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information, generate second push information associated with the first push information and push it to the user.
  • The electronic device may acquire the accessed pages corresponding to the associated page keyword set used to generate the first push information, select at least one accessed page from them, and generate from the selected page or pages the second push information associated with the first push information.
  • The second push information may be generated based on pages associated with the first push information. For example, if the first push information is a preset number of top-ranked keywords, selected after sorting the words in the associated page keyword set from largest to smallest by the number of corresponding accessed pages or by page visit amount, the second push information may be the themes of the M pages (M is a positive integer) that contain the largest number of those keywords. If the first push information is the theme of the accessed page with the highest page visit amount among the associated pages corresponding to the associated page keyword set, the second push information may be the top M pages (M is a positive integer) by page visit amount among those associated pages, which may or may not include the page used to generate the first push information.
  • the electronic device may present the second push information to the user together with the first push information, or it may, after presenting the first push information, detect a predetermined user operation and present the second push information in response to detecting that operation.
  • for example, the second push information may be presented when the user clicks the first push information or the button corresponding to it, or in response to a mouse hover, and so on.
  • the second push information may be pushed to the user in the form of a hyperlink, and the hyperlink may be associated with the page corresponding to the second push information.
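  • in the example of FIG. 2, the electronic device first obtains the URLs and page visit counts of accessed pages from at least one site, then performs content analysis on each accessed page to generate its keyword set. Based on mutual comparison of the keyword sets, it merges the keyword sets whose similarity is greater than the first preset threshold to generate associated page keyword sets. It then selects the 3 associated page keyword sets whose associated pages have the highest summed page visits, and generates the first push information 201 (such as trending news on the network) from the topic of the most-visited accessed page of each of these 3 sets. Finally, it obtains at least one (for example, 3) accessed page from the accessed pages corresponding to the associated page keyword set used to generate the first push information 201, generates the second push information 202 associated with the first push information (such as background news for the trending news), and pushes it to the user.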
  • in FIG. 2, the first push information 201 may include a topic 2011, the sum 2012 of the page visits of the associated pages corresponding to the associated page keyword set, and a button 2013. When the user clicks the button 2013, the electronic device displays the topic 2021 contained in the second push information 202.
  • the topic 2011 and the topic 2021 may both be text in the form of hyperlinks, linking to the accessed pages corresponding to the topic 2011 and the topic 2021, respectively.
  • the application scenario of this example may be, for instance, that the electronic device pushes to a website's editors the news events attracting attention on the network, together with background material on those events, so that the editors can edit the news events and update the website content.
  • by pushing to the user second push information associated with the first push information, the above embodiment of the present application can present richer push content to the user.
  • the information pushing method 300 includes the following steps:
  • Step 301: Obtain page access information of at least one site, where the page access information includes the web address and page visit count of each accessed page.
  • the electronic device may obtain the page access information of the at least one site locally or remotely.
  • the page access information may include the web address (e.g., the URL) of each accessed page and its page visit count.
  • Step 302: Perform content analysis on the page corresponding to each web address, and generate a keyword set for each accessed page.
  • the electronic device may parse the content of the page corresponding to each web address using various methods (such as statistical analysis or semantic analysis), extract one or more keywords, and generate a keyword set.
  • the electronic device may further expand individual keywords among the one or more keywords to generate extended keywords, and generate the keyword set from the extracted keywords together with the extended keywords.
  • the extended keywords may include synonyms, near-synonyms, and related words of each extracted keyword.
  • each keyword in the keyword set may also have an importance coefficient.
  • Step 303: Based on mutual comparison of the keyword sets, merge the keyword sets whose similarity is greater than the first preset threshold to generate at least one associated page keyword set.
  • the electronic device may compare different keyword sets with one another, calculate the similarity between them, and merge the keyword sets whose similarity is greater than the first preset threshold to generate an associated page keyword set.
  • the accessed pages corresponding to the keyword sets used to generate an associated page keyword set may be associated pages of one another.
  • the similarity between the sets of keywords can characterize the degree of similarity between different sets of keywords.
  • the electronic device can use the number of identical keywords between the two sets to characterize the degree of similarity between the sets of keywords.
  • the electronic device can perform the similarity calculation using a well-known text similarity calculation method such as a cosine similarity algorithm or a Jaccard coefficient.
  • the words in a keyword set may also have importance coefficients. In that case, the electronic device can calculate the similarity between keyword sets based on the importance coefficients.
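Step 303 can be sketched as follows. This is a minimal illustration that assumes the Jaccard coefficient as the similarity measure and a single greedy pass that compares each page's keyword set against the already merged groups; the function names `jaccard` and `merge_similar` and the result layout are hypothetical and are not taken from the application.

```python
def jaccard(a: set, b: set) -> float:
    """Shared keywords divided by the keywords in the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(keyword_sets: list, threshold: float) -> list:
    """Greedily merge keyword sets whose similarity exceeds the threshold.

    Each result records the merged keywords and the indices of the pages
    (positions in keyword_sets) whose keyword sets were merged into it.
    """
    groups = []
    for idx, kws in enumerate(keyword_sets):
        for group in groups:
            if jaccard(kws, group["keywords"]) > threshold:
                group["keywords"] |= kws      # de-duplicated union of the words
                group["pages"].append(idx)    # these pages are associated pages
                break
        else:
            groups.append({"keywords": set(kws), "pages": [idx]})
    return groups
```

Because each set is compared against the merged group rather than against every individual page, the outcome can depend on input order; a pairwise comparison or a clustering method would avoid that at higher cost.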
  • Step 304: Based on the result of sorting the sums of the page visits of the accessed pages corresponding to each set in the at least one associated page keyword set, generate first push information using one or more of the at least one associated page keyword set.
  • the electronic device may first obtain, for each associated page keyword set, the sum of the page visits of its corresponding accessed pages, sort these sums (for example, from high to low), and then, based on the sorting result, generate the first push information using one or more of the at least one associated page keyword set.
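The ranking in step 304 can be illustrated with a short sketch. The data layout (a mapping from a group name to its page URLs, and a mapping from URL to visit count) and the function name `top_groups` are assumptions made for this example only.

```python
def top_groups(groups: dict, visits: dict, n: int) -> list:
    """Return the n groups with the largest summed page visits.

    groups maps a group name to the URLs of its associated pages;
    visits maps each URL to its page visit count.
    """
    totals = {name: sum(visits[url] for url in urls) for name, urls in groups.items()}
    # Sort by the summed visits, highest first, and keep the first n entries.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

Each returned pair holds a group and its visit total, which is the information a first push item would need (for example, a topic plus the summed visits).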
  • Step 305: Cluster the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period.
  • the electronic device may cluster the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period.
  • when there are two or more time periods, the result of the clustering may be that the time difference between publication times taken from any two different time periods is greater than the preset time interval.
  • Clustering is the process of dividing a collection of physical or abstract objects into multiple classes of similar objects.
  • the purpose of clustering the publication times of the accessed pages according to the preset time interval is to divide those publication times into at least one time period, thereby dividing the accessed pages into multiple classes with similar publication times.
  • various well-known clustering algorithms can be used for clustering according to the release time.
  • the electronic device may use a hierarchical clustering algorithm, each time merging the two publication times separated by the smallest interval, until the smallest remaining interval between two publication times is greater than or equal to the preset time interval. In this way, the accessed pages corresponding to the associated page keyword set are divided by publication time into pages published in different time periods, and the publication times of any two accessed pages from different time periods differ by more than the preset time interval.
  • the electronic device may further determine the preset time interval for clustering according to the time of day. For example, the electronic device can collect the numbers of pages published over multiple days in advance and set the interval according to the distribution of page publication counts. If relatively few pages are published between 0:00 and 6:00 each day, the preset time interval for 0:00 to 6:00 can be set to a longer duration, such as 2 hours; if more pages are published between 9:00 and 11:00 each day, the preset time interval for 9:00 to 11:00 can be set to a shorter duration, such as 20 minutes.
  • the electronic device may divide the accessed pages corresponding to a set of associated page keywords by time, and the accessed pages of different time periods may record event content of different development stages.
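The time clustering of step 305 can be sketched as below. For timestamps on a single axis, repeatedly merging the two closest publication times (assuming single-linkage merging) gives the same periods as sorting the times and starting a new period at every gap larger than the interval, which is what this sketch does. The function name `split_into_periods` is hypothetical, and the boundary case is an assumption: a gap exactly equal to the interval is merged here, so that times in different periods always differ by more than the interval.

```python
from datetime import datetime, timedelta


def split_into_periods(times: list, interval: timedelta) -> list:
    """Group publication times into periods separated by gaps > interval."""
    periods = []
    for t in sorted(times):
        # Extend the current period while the gap to its latest time is small.
        if periods and t - periods[-1][-1] <= interval:
            periods[-1].append(t)
        else:
            periods.append([t])
    return periods
```

A time-of-day dependent interval, as described above, could be supported by looking up the interval from the hour of the earlier timestamp instead of passing a single value.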
  • Step 306: For one or more of the at least one time period, extract one page from the accessed pages corresponding to each of those time periods.
  • the electronic device may, for one or more of the at least one time period, extract one page from the accessed pages corresponding to each of those time periods.
  • the extracted page may be any page published in the corresponding time period, or a page obtained according to a certain rule.
  • when the electronic device obtains the page according to a rule, it may take the page with the highest page visit count in the corresponding time period, the page with the earliest publication time in that period, or a page chosen according to preset priority levels of the publishing sites, and so on; this application does not limit this.
  • Step 307: Generate second push information based on the extracted pages and push it to the user.
  • the electronic device may generate second push information according to a certain rule from the pages extracted in step 306, and push it to the user.
  • the electronic device may use the topics or keywords of the extracted pages as the second push information. It may also select a preset number of the extracted pages in order of publication time from most recent to earliest and use the topics or keywords of those pages as the second push information, and so on. This application does not limit this.
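Steps 306 and 307 can be sketched together. The `Page` record, the rule names `"most_visited"` and `"earliest"`, and the use of page titles as the pushed topics are assumptions made for illustration; the text above leaves the selection rule open.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    url: str
    title: str
    published: datetime
    visits: int


def pick_representatives(periods: list, rule: str = "most_visited") -> list:
    """Extract one page from each time period according to the given rule."""
    chosen = []
    for pages in periods:
        if rule == "most_visited":
            chosen.append(max(pages, key=lambda p: p.visits))
        elif rule == "earliest":
            chosen.append(min(pages, key=lambda p: p.published))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return chosen


def second_push(periods: list, rule: str = "most_visited", limit=None) -> list:
    """Topics of the extracted pages, most recent first, optionally truncated."""
    reps = sorted(pick_representatives(periods, rule),
                  key=lambda p: p.published, reverse=True)
    if limit is not None:
        reps = reps[:limit]
    return [p.title for p in reps]
```

Ordering the result by publication time lets the pushed list read as a timeline of the event, one entry per stage.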
  • a page deduplication step may also be included.
  • the electronic device may process the accessed pages corresponding to the associated page keyword set as follows: among those accessed pages, screen the accessed pages whose keyword sets have a similarity greater than a second preset threshold down to a single page, and use the accessed pages remaining after screening as the accessed pages corresponding to the associated page keyword set.
  • the second preset threshold may be greater than the first preset threshold.
  • when the similarity between two keyword sets is greater than the second preset threshold, the electronic device may consider the accessed pages corresponding to the two keyword sets to be pages with the same content, that is, duplicate pages.
  • the electronic device can retain any one of the duplicate pages, or select one to retain according to a certain rule (such as retaining the page with the earliest publication time), and screen out the other duplicates. The pages remaining after screening are used as the accessed pages corresponding to the associated page keyword set.
  • for example, if the accessed pages corresponding to the associated page keyword set include 30 groups of duplicate pages, each group containing 2 pages, the electronic device screens out one page from each of the 30 groups and retains the other; the 970 pages that remain are the accessed pages corresponding to the associated page keyword set.
  • the electronic device may delete the page information of the screened-out pages.
  • the electronic device may add the page visits of the screened-out pages to the page visits of the retained pages.
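The deduplication step can be sketched as follows, assuming Jaccard similarity between keyword sets, the "keep the earliest page" rule, and the optional accumulation of visit counts described above. The dictionary keys and the function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Shared keywords divided by the keywords in the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(pages: list, threshold: float) -> list:
    """Keep the earliest page of each duplicate group.

    pages are dicts with the keys 'url', 'keywords' (a set), 'published'
    (any sortable value) and 'visits'. The visits of a screened-out page
    are added to the page that is retained in its place.
    """
    kept = []
    for page in sorted(pages, key=lambda p: p["published"]):
        for survivor in kept:
            if jaccard(page["keywords"], survivor["keywords"]) > threshold:
                survivor["visits"] += page["visits"]
                break
        else:
            kept.append(dict(page))  # copy so the input list is left unchanged
    return kept
```

The threshold passed here plays the role of the second preset threshold, which is stricter than the one used for merging associated pages.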
  • the step 301, the step 302, the step 303, and the step 304 in the foregoing implementation process are substantially the same as the steps 101, 102, 103, and 104 in the foregoing embodiment, and details are not described herein again.
  • the flow 300 of the information push method in the present embodiment replaces step 105 with steps 305, 306, and 307.
  • the present embodiment may extract the accessed page corresponding to the associated page keyword set corresponding to the first push information according to the time period, thereby generating second push information associated with the first push information.
  • the content of the pages in each time period can reflect one stage of an event's development, so extracting one page from each time period to generate the second push information enables the user to understand the development of the entire event through the second push information.
  • FIG. 4 is an effect diagram of an application scenario of the information pushing method of the embodiment.
  • the application scenario shown in FIG. 4 is a push scenario of hot news information, wherein 401 indicates first push information, and 402 indicates second push information.
  • This embodiment facilitates pushing the development information in the respective time periods of the first push information to the user.
  • the pages may be de-duplicated to avoid obtaining pages with the same content in different time periods, which would reduce the effectiveness of information pushing.
  • the present application provides an embodiment of an apparatus for information push. The apparatus embodiment corresponds to the method embodiment described above, and the apparatus can be applied to electronic devices.
  • the apparatus 500 for information push includes: an information acquisition module 501, a keyword set generation module 502, a keyword set merge module 503, a first push information generation module 504, and a second push information generating and pushing module 505.
  • the information obtaining module 501 is configured to obtain the page access information of at least one site, where the page access information includes the web address and page visit count of each accessed page;
  • the keyword set generating module 502 is configured to perform content analysis on the page corresponding to each web address and generate a keyword set for each accessed page;
  • the keyword set merge module 503 is configured to merge, based on mutual comparison of the keyword sets, the keyword sets whose similarity is greater than the first preset threshold to generate at least one associated page keyword set, where the accessed pages corresponding to the keyword sets used to generate an associated page keyword set are associated pages of one another;
  • the first push information generating module 504 is configured to generate the first push information using one or more of the at least one associated page keyword set, based on the result of sorting the sums of the page visits of the accessed pages corresponding to each set;
  • the second push information generating and pushing module 505 is configured to generate, based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information, second push information associated with the first push information and push it to the user.
  • the keyword set generation module 502 can then parse the content of the page corresponding to each of the above URLs using various methods (such as statistical analysis or semantic analysis) and extract one or more keywords from it to generate a keyword set.
  • the keyword set generation module 502 can also expand a single keyword of the one or more keywords to generate an extended keyword, and generate the keyword set together with the expanded keyword.
  • the extended keywords may include synonyms, near-synonyms, and related words of each extracted keyword.
  • each keyword in the keyword set may also have an importance coefficient.
  • the keyword set merge module 503 may then compare the keyword sets generated by the keyword set generation module 502 with each other, and merge the keyword sets whose similarity is greater than the first preset threshold to generate at least one associated page keyword set.
  • the accessed pages corresponding to the keyword set used to generate the associated page keyword set are associated pages.
  • the similarity between the sets of keywords can be calculated by a variety of methods.
  • the first push information generating module 504 may then obtain, for each associated page keyword set, the sum of the page visits of its corresponding accessed pages, sort these sums (for example, from high to low), and then, based on the sorting result, generate the first push information using one or more of the at least one associated page keyword set.
  • for each piece of first push information, the second push information generating and pushing module 505 may obtain the accessed pages corresponding to the associated page keyword set used to generate that first push information, select at least one accessed page from them, generate second push information associated with the first push information from the selected page or pages, and push the second push information to the user.
  • the second push information generating and pushing module 505 may include: a clustering unit (not shown) configured to cluster, according to a preset time interval, the publication times of the accessed pages corresponding to the associated page keyword set used to generate the first push information, dividing them into at least one time period; an extracting unit (not shown) configured to extract, for one or more of the at least one time period, one page from the accessed pages corresponding to each of those time periods; and a generating unit (not shown) configured to generate second push information based on the extracted pages and push it to the user.
  • the result of the clustering may be that the time difference between publication times taken from any two different time periods is greater than the preset time interval.
  • the second push information generating and pushing module 505 may further include: a screening unit (not shown) configured to, for the accessed pages corresponding to the associated page keyword set, screen the accessed pages whose keyword sets have a similarity greater than the second preset threshold down to a single page, and use the accessed pages remaining after screening as the accessed pages corresponding to the associated page keyword set.
  • the second preset threshold is greater than the first preset threshold.
  • the function of the screening unit is to de-duplicate the accessed pages corresponding to the associated page keyword set.
  • the modules or units described in the information push device 500 correspond to the respective steps of the method described above.
  • the operations and features described above for the method are equally applicable to the information push device 500 and the modules or units included therein, and are not described herein again.
  • the information push device 500 also includes other well-known structures, such as processors and memories, which are not shown in FIG. 5 so as not to unnecessarily obscure the embodiments of the present disclosure.
  • referring to FIG. 6, a block diagram of a computer system 600 suitable for implementing the electronic device of the embodiments of the present application is shown.
  • the computer system 600 includes a central processing unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603.
  • the RAM 603 also stores various programs and data required for the operation of the system 600.
  • the CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also coupled to bus 604.
  • the following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 608 including a hard disk or the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet.
  • a drive 610 is also connected to the I/O interface 605 as needed.
  • a removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.
  • an embodiment of the present application includes a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program comprising program code for executing the method illustrated in the flowchart.
  • the computer program can be downloaded and installed from the network via communication portion 609, and/or installed from removable media 611.
  • the units involved in the embodiments of the present application may be implemented by software or by hardware.
  • the described modules may also be provided in a processor. For example, a processor may be described as including an information acquisition module, a keyword set generation module, a keyword set merge module, a first push information generation module, and a second push information generating and pushing module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves.
  • the information acquisition module may also be described as "a module configured to acquire page access information of at least one site.”
  • the present application further provides a computer readable storage medium, which may be the computer readable storage medium included in the apparatus described in the foregoing embodiment, or a computer readable storage medium that exists separately and is not assembled into a terminal.
  • the computer readable storage medium stores one or more programs that are used by one or more processors to perform the method of information push described in the present application.

Abstract

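An information push method and apparatus. One specific implementation of the method includes: obtaining the web addresses and page visit counts of accessed pages of at least one site (101); performing content analysis on the page corresponding to each web address to generate a keyword set for each accessed page (102); based on mutual comparison of the keyword sets, merging the keyword sets whose similarity is greater than a first preset threshold to generate at least one associated page keyword set (103); based on the result of sorting the sums of the page visits of the accessed pages corresponding to each of the at least one associated page keyword set, generating first push information using one or more of the at least one associated page keyword set (104); and, based on at least one accessed page corresponding to the associated page keyword set used to generate the first push information, generating second push information associated with the first push information and pushing it to the user (105). This implementation can enrich the content of the push information.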

Description

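Information push method and apparatus
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. "201510483126.3", filed on August 3, 2015, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, specifically to the field of Internet technology, and in particular to an information push method and apparatus.
Background
Information push, also known as "webcasting", is a technology that reduces information overload by pushing the information users need over the Internet in accordance with certain technical standards or protocols. By actively pushing information to users, information push technology can reduce the time users spend searching the network.
However, in existing information push technology, the information pushed to a user is often one or more mutually independent items that lack any association with one another. If the pushed information is one fragment of a developing event, it is difficult for the user to learn the event's background or development from the pushed content. Such technology therefore makes insufficient use of related network data, and the pushed content is not rich enough.
Summary
The purpose of the present application is to propose an improved information push method and apparatus to solve the technical problems mentioned in the Background section above.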
一方面,本申请提供了一种信息推送方法,所述方法包括:获取至少一个站点的页面访问信息,其中,所述页面访问信息包括被访问页面的网址及页面访问量;对各个网址对应的页面进行内容解析,生成各个被访问页面的关键词集合;基于关键词集合的相互比较,将相 似度大于第一预设阈值的关键词集合合并,生成至少一个关联页面关键词集合,其中,用于生成关联页面关键词集合的关键词集合对应的被访问页面互为关联页面;基于所述至少一个关联页面关键词集合中的各个集合对应的被访问页面的页面访问量之和的排序结果,利用所述至少一个关联页面关键词集合中的一个或多个集合生成第一推送信息;基于用于生成所述第一推送信息的关联页面关键词集合所对应的至少一个被访问页面,生成与所述第一推送信息相关联的第二推送信息并推送给用户。
在一些实施例中,所述基于用于生成所述第一推送信息的关联页面关键词集合所对应的至少一个被访问页面,生成与所述第一推送信息相关联的第二推送信息并推送给用户,包括:对用于生成所述第一推送信息的关联页面关键词集合所对应的被访问页面的发布时间按照预设的时间间隔进行聚类,划分成至少一个时间段,其中,当所述至少一个时间段包括两个以上的时间段时,分别取自任意两个时间段的发布时间之间的时间差大于所述时间间隔;对于所述至少一个时间段中的一个或多个时间段,分别从每个时间段所对应的被访问页面中提取一个页面;基于所提取的页面,生成第二推送信息并推送给用户。
在一些实施例中,所述对用于生成所述第一推送信息的关联页面关键词集合所对应的被访问页面的发布时间按照预设的时间间隔进行聚类,划分成至少一个时间段之前,还包括:对于关联页面关键词集合所对应的被访问页面,将相似度大于第二预设阈值的关键词集合所对应的被访问页面筛除至一个页面,将筛除页面后剩余的被访问页面作为关联页面关键词集合所对应的被访问页面,其中,所述第二预设阈值大于第一预设阈值。
在一些实施例中,所述对各个网址对应的页面进行内容解析,生成各个被访问页面的关键词集合包括:对所述被访问页面的内容进行统计分析和/或语义分析,提取至少一个关键词;基于所述至少一个关键词,生成关键词集合。
在一些实施例中,所述基于所述至少一个关键词,生成关键词集合包括:对于每个所述至少一个关键词中的单个关键词,进行扩展以 生成扩展关键词,其中,所述扩展关键词包括以下至少一项:所述单个关键词的同义词、所述单个关键词的近义词、所述单个关键词的关联词;基于所述至少一个关键词和所述扩展关键词,生成关键词集合。
在一些实施例中,将满足以下条件之一的关键词集合作为相似度大于第一预设阈值的关键词集合:相同关键词的个数大于个数阈值;相同关键词的个数与进行比较的关键词集合中关键词的总个数的比值大于比值阈值。
在一些实施例中,所述关键词集合中的各关键词还具有重要度系数,以及,所述基于关键词集合的相互比较,将相似度大于第一预设阈值的关键词集合合并,生成至少一个关联页面关键词集合包括:基于所述重要度系数对不同的关键词集合进行相似度计算;将相似度大于相似度阈值的关键词集合合并,生成关联页面关键词集合。
第二方面,本申请提供了一种信息推送装置,所述装置包括:信息获取模块,配置用于获取至少一个站点的页面访问信息,其中,所述页面访问信息包括被访问页面的网址及页面访问量;关键词集合生成模块,配置用于对各个网址对应的页面进行内容解析,生成各个被访问页面的关键词集合;关键词集合合并模块,配置用于基于关键词集合的相互比较,将相似度大于第一预设阈值的关键词集合合并,生成至少一个关联页面关键词集合,其中,用于生成关联页面关键词集合的关键词集合对应的被访问页面互为关联页面;第一推送信息生成模块,配置用于基于所述至少一个关联页面关键词集合中的各个集合对应的被访问页面的页面访问量之和的排序结果,利用所述至少一个关联页面关键词集合中的一个或多个集合生成第一推送信息;第二推送信息生成及推送模块,配置用于基于用于生成所述第一推送信息的关联页面关键词集合所对应的至少一个被访问页面,生成与所述第一推送信息相关联的第二推送信息并推送给用户。
在一些实施例中,所述第二推送信息生成及推送模块包括:聚类单元,配置用于对用于生成所述第一推送信息的关联页面关键词集合所对应的被访问页面的发布时间按照预设的时间间隔进行聚类,划分成至少一个时间段,其中,当所述至少一个时间段包括两个以上的时 间段时,分别取自任意两个时间段的发布时间之间的时间差大于所述时间间隔;提取单元,配置用于对于所述至少一个时间段中的一个或多个时间段,分别从每个时间段所对应的被访问页面中提取一个页面;生成单元,配置用于基于所提取的页面,生成第二推送信息并推送给用户。
在一些实施例中,所述第二推送信息生成及推送模块还包括:筛除单元,配置用于对于关联页面关键词集合所对应的被访问页面,将相似度大于第二预设阈值的关键词集合所对应的被访问页面筛除至一个页面,将筛除页面后剩余的被访问页面作为关联页面关键词集合所对应的被访问页面,其中,所述第二预设阈值大于第一预设阈值。
在一些实施例中,所述关键词集合生成模块包括:关键词提取单元,配置用于对所述被访问页面的内容进行统计分析和/或语义分析,提取至少一个关键词;关键词集合生成单元,配置用于基于所述至少一个关键词,生成关键词集合。
在一些实施例中,所述关键词集合生成单元包括:扩展子单元,配置用于对于每个所述至少一个关键词中的单个关键词,进行扩展以生成扩展关键词,其中,所述扩展关键词包括以下至少一项:所述单个关键词的同义词、所述单个关键词的近义词、所述单个关键词的关联词;关键词集合生成子单元,配置用于基于所述至少一个关键词和所述扩展关键词,生成关键词集合。
在一些实施例中,所述关键词集合合并模块进一步配置用于:将满足以下条件之一的关键词集合作为相似度大于第一预设阈值的关键词集合:相同关键词的个数大于个数阈值;相同关键词的个数与进行比较的关键词集合中关键词的总个数的比值大于比值阈值。
在一些实施例中,所述关键词集合中的各关键词还具有重要度系数,以及,所述关键词集合合并模块包括:计算单元,配置用于基于所述重要度系数对不同的关键词集合进行相似度计算;合并及生成单元,配置用于将相似度大于相似度阈值的关键词集合合并,生成关联页面关键词集合。
本申请提供的信息推送方法和装置,通过获取至少一个站点的页 面访问信息,接着对各个网址对应的页面进行内容解析,生成各个被访问页面的关键词集合,然后基于关键词集合的相互比较,将相似度大于第一预设阈值的关键词集合合并,生成至少一个关联页面关键词集合,接着基于至少一个关联页面关键词集合各自对应的被访问页面的页面访问量之和的排序结果,利用至少一个关联页面关键词集合中的一个或多个集合生成第一推送信息,并且,基于用于生成第一推送信息的关联页面关键词集合所对应的至少一个被访问页面,生成与第一推送信息相关联的第二推送信息并推送给用户。这种信息推送方法和装置在向用户推送第一推送信息之后,还可以进一步向用户推送与第一推送信息相关联的第二推送信息,从而丰富了推送信息的内容。
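Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
FIG. 1 is a flowchart of one embodiment of the information push method according to the present application;
FIG. 2 is a schematic diagram of an application example of the information push method according to the present application;
FIG. 3 is a flowchart of another embodiment of the information push method according to the present application;
FIG. 4 is an effect diagram of an application scenario of the embodiment of the information push method shown in FIG. 3;
FIG. 5 is a schematic structural diagram of one embodiment of the information push apparatus according to the present application;
FIG. 6 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.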
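Detailed description
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.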
请参考图1,其示出了信息推送的方法的一个实施例的流程100。本实施例主要以该方法应用于有一定运算能力的电子设备中来举例说明,该电子设备可以包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。该信息推送方法,包括以下步骤:
步骤101,获取至少一个站点的页面访问信息,其中,页面访问信息包括被访问页面的网址及页面访问量。
在本实施例中,电子设备(例如可以是包含信息推送的应用运行于其上的电子终端或为包含信息推送的应用提供支持的后台服务器)可以从本地或远程地获取至少一个站点的页面访问信息。其中,当上述电子设备是为至少一个站点提供支持的网站服务器时,其可以直接从本地获取上述页面访问信息;而当上述电子设备不是为站点提供支持的网站服务器时,其可以通过有线连接方式或者无线连接方式从网站服务器获取上述页面访问信息。上述无线连接方式包括但不限于3G/4G连接、WiFi连接、蓝牙连接、WiMAX连接、Zigbee连接、UWB(ultra wideband)连接、以及其他现在已知或将来开发的无线连接方式。
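Here, the page access information may include the web address and page visit count of each accessed page. An accessed page may be a page that users have visited. Typically, every page a user visits corresponds to a web address, which can be expressed as a Uniform Resource Locator (URL). The electronic device may obtain the URLs of pages visited by users from one or more sites (for example, forum websites). Optionally, the electronic device may also obtain the page content of the accessed pages.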
对于电子设备获取的每个页面,电子设备获取页面的URL的同时,还可以获取页面访问量。其中页面访问量可以是页面的总被访问次数,也可以是页面在一定时间段(例如24个小时)内的被访问次数。电子设备获取的被访问页面可以是被用户访问过的所有页面,也可以 是访问量大于一定阈值(如50次)的页面,还可以是访问量由高到低排列靠前的预设个数(如10万个)的页面。
步骤102,对各个网址对应的页面进行内容解析,生成各个被访问页面的关键词集合。
在本实施例中,电子设备可以对上述的各个网址对应的页面的内容通过各种方法进行解析,从中提取出一个或多个关键词,生成关键词集合。
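In an optional implementation of this embodiment, the electronic device may analyze the content of the above pages using a statistical analysis method. For example, the electronic device may extract the keywords of the pages with a Latent Dirichlet Allocation (LDA) model. Specifically, the electronic device may treat each page as a word frequency vector (for example, a vector of the words and their frequencies of occurrence), thereby converting the text into numeric information that is easy to model, and build a three-layer Bayesian probability model on the three-layer structure of words, topics, and documents (the content of each page can be treated as one document). The distribution from documents to topics is multinomial, as is the distribution from topics to words. Each page thus represents a probability distribution over topics, and each topic represents a probability distribution over many words. Based on the word probability distribution, the electronic device may take the words whose probability exceeds a certain threshold (for example, 1%) as the page's keywords, or select a certain number of words (for example, 20) from each page in descending order of probability.
In an optional implementation of this embodiment, the electronic device may also analyze the content of the above pages using a semantic analysis method. For example, the electronic device may process the content of an accessed page with a method such as full segmentation to split the content into words, then compute the importance of the resulting words (for example, using Term Frequency-Inverse Document Frequency, TF-IDF), and, based on the importance results, filter out common function words (in Chinese, words such as "了" and "的") and other words that carry no real meaning, thereby obtaining the keywords.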
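Specifically, the electronic device may first use the full segmentation method to find every possible word that matches the language lexicon, and then apply a statistical language model to determine the optimal segmentation. Taking page content whose topic is "本季度居民收入" ("resident income this quarter") as an example, lexicon matching first finds all matching words: 本, 季度, 居民, 收入, 本季, 本季度, 度, 居民收入, 民. These words are represented as a word lattice; a path search is then performed over the lattice, and the optimal path is found with a statistical language model (for example, an N-gram model). If the result shows that the segmentation "本季度居民收入" has the highest language model score, it is the optimal segmentation of "本季度居民收入". The N-gram model mentioned here is a commonly used language model; for Chinese, it may be called a Chinese Language Model (CLM). The model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, so that the probability of a whole sentence is the product of the occurrence probabilities of its words; these probabilities can be obtained by counting, directly from a corpus, how often N words occur together.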
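The full segmentation and path search can be illustrated with a small sketch. It builds the word lattice from a lexicon and then picks the best path with a unigram model (the N = 1 case of an N-gram model), which is a simplification chosen to keep the example short. The probabilities in `EXAMPLE_LEXICON` are invented for illustration and the function names are hypothetical.

```python
import math

# Illustrative unigram probabilities; these numbers are assumptions, not corpus data.
EXAMPLE_LEXICON = {
    "本": 0.01, "季度": 0.02, "居民": 0.02, "收入": 0.03, "本季": 0.005,
    "本季度": 0.015, "度": 0.01, "居民收入": 0.01, "民": 0.005,
}


def lattice_edges(text: str, lexicon: dict) -> list:
    """All (start, end, word) spans of text that match a lexicon entry."""
    edges = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in lexicon:
                edges.append((start, end, text[start:end]))
    return edges


def best_segmentation(text: str, lexicon: dict) -> list:
    """Highest-probability path through the lattice under a unigram model."""
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    # Sorting by start position guarantees best[start] is final when it is used.
    for start, end, word in sorted(lattice_edges(text, lexicon)):
        if best[start] is None:
            continue  # no segmentation reaches this position
        score = best[start][0] + math.log(lexicon[word])
        if best[end] is None or score > best[end][0]:
            best[end] = (score, best[start][1] + [word])
    if best[-1] is None:
        raise ValueError("text cannot be segmented with this lexicon")
    return best[-1][1]
```

With these made-up probabilities the search prefers the split into "本季度" and "居民收入"; a real system would estimate the probabilities from a corpus and would typically condition each word on its predecessors.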
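After splitting the content into words with full segmentation, the electronic device may compute the importance of these words using the term frequency-inverse document frequency (TF-IDF) method. The main idea of TF-IDF is that if a word or phrase appears often in one document or page but rarely in others, it discriminates well between categories and is suitable for classification. Term frequency (TF) measures the importance of a word or phrase to a document or page: the more often it appears in that document or page, the larger its TF, and vice versa. Inverse document frequency (IDF) measures how common a word or phrase is overall: the more frequently it appears across the document set or corpus, the smaller its IDF, and vice versa. The electronic device may measure the importance of a word or phrase within a page by the product of TF and IDF, and thereby extract one or more keywords of the page.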
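A minimal TF-IDF sketch follows. It assumes each page has already been split into words, uses the plain definitions TF = count / page length and IDF = log(number of pages / number of pages containing the word), and takes the highest-scoring words as keywords. The function names are hypothetical, and production systems often use smoothed variants of IDF.

```python
import math
from collections import Counter


def tf_idf(docs: list) -> list:
    """Return one {word: TF-IDF score} dictionary per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(doc_scores: dict, k: int) -> list:
    """The k words with the highest TF-IDF score in one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
```

A word that occurs in every page gets an IDF of zero, which is how common function words drop out of the keyword list.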
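It should be noted that the semantic analysis methods mentioned above are well-known techniques that are widely studied and applied, and they are not described further here.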
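In some optional implementations of this embodiment, the electronic device may also expand individual keywords among the one or more keywords to generate extended keywords, and generate the keyword set from the extended keywords together with the extracted keywords. In practice, a word may have synonyms (for example, "爸爸" (dad) has the synonym "父亲" (father)), near-synonyms (for example, "出席" (attend) has the near-synonym "参加" (take part in)), and related words (for example, "工程图" (engineering drawing) has the related word "绘制" (draw)). The electronic device may gather the synonyms, near-synonyms, and related words of each individual keyword as that keyword's extended keywords and add them to the keyword set. Words or phrases that often appear together can serve as related words. Optionally, the related words of a keyword may be obtained from a related-word model trained in advance by machine learning on a large amount of previously crawled document or page data. For example, the model may be built by splitting that content into words through processing such as full segmentation and then counting the probability that at least two words occur together; words whose co-occurrence probability exceeds a certain threshold may be related words of one another.
In some optional implementations of this embodiment, each keyword in a keyword set may also have an importance coefficient, a numeric value measuring how important the keyword is to the page it belongs to. For example, the importance coefficient of a keyword extracted from the page may be set to 1, that of its synonyms to 0.8, that of its near-synonyms or related words to 0.5, and so on. These coefficients serve to distinguish how important the keywords are; the specific values above are illustrative and do not limit the importance coefficient. Optionally, the importance coefficient of a keyword extracted from a page may also depend on how many times the keyword appears in the page: the more occurrences, the larger the coefficient. The importance coefficient of an extended keyword may also depend on its degree of association with the extracted keyword; for example, a synonym of an extracted keyword may have the same importance coefficient as that keyword. In practice, the preset related-word model may also include a degree of association for each related word, which may be proportional to the probability that the words occur together; the importance coefficient of a related word of an extracted keyword may then be the product of the keyword's importance coefficient and the degree of association.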
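Keyword expansion with importance coefficients can be sketched as follows, using the illustrative values 1, 0.8, and 0.5 from the text. The lookup tables for synonyms, near-synonyms, and related words are assumed to be supplied by the caller, and the function name `expand_keywords` is hypothetical.

```python
def expand_keywords(extracted: list, synonyms: dict,
                    near_synonyms: dict, related: dict) -> dict:
    """Build a weighted keyword set from extracted keywords and lookup tables.

    Extracted keywords get coefficient 1.0, synonyms 0.8, and near-synonyms
    and related words 0.5. A word reached in several ways keeps its highest
    coefficient.
    """
    weights = {kw: 1.0 for kw in extracted}
    for kw in extracted:
        for table, coeff in ((synonyms, 0.8), (near_synonyms, 0.5), (related, 0.5)):
            for word in table.get(kw, ()):
                weights[word] = max(weights.get(word, 0.0), coeff)
    return weights
```

Keeping the maximum means a word that is both extracted from the page and listed as a related word of another keyword is not downgraded.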
步骤103,基于关键词集合的相互比较,将相似度大于第一预设阈值的关键词集合合并,生成至少一个关联页面关键词集合。
在本实施例中,电子设备可以进一步对不同的关键词集合的相互比较,计算各个关键词集合之间的相似度,并将相似度大于第一预设阈值的关键词集合合并,生成关联页面关键词集合。其中,用于生成关联页面关键词集合的关键词集合对应的被访问页面可以互为关联页面。
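Here, the similarity between keyword sets characterizes how similar different keyword sets are. In this embodiment, the electronic device may use the number of identical keywords shared by two sets to characterize their similarity. The electronic device may also compute the similarity with a well-known text similarity method such as the cosine similarity algorithm or the Jaccard coefficient. Taking the Jaccard coefficient as an example, the similarity between two keyword sets A and B can be computed as: similarity between A and B = number of words common to A and B / number of words contained in A and B together.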
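In some implementations, the words in a keyword set may also have importance coefficients. In that case, taking the cosine similarity algorithm as an example, the electronic device may compute the similarity between keyword set A and keyword set B as follows: the sum of the products of the importance coefficients of the words common to A and B, divided by the product of the square root of the sum of squared importance coefficients in A and the square root of the sum of squared importance coefficients in B. For example, suppose keyword set A is (日本 1, 造岛 0.8, 填海 0.5), where 1, 0.8, and 0.5 are the importance coefficients of the keywords "日本" (Japan), "造岛" (island building), and "填海" (land reclamation), and keyword set B is (日本 0.7, 造岛 1, 主权 0.6), where 0.7, 1, and 0.6 are the importance coefficients of "日本", "造岛", and "主权" (sovereignty). The similarity between A and B is then:
$$\frac{1 \times 0.7 + 0.8 \times 1}{\sqrt{1^2 + 0.8^2 + 0.5^2} \times \sqrt{0.7^2 + 1^2 + 0.6^2}} \approx 0.80$$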
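The weighted cosine similarity can be checked with a few lines of code. This sketch represents each keyword set as a mapping from keyword to importance coefficient; the function name `weighted_cosine` and the English keyword labels are chosen for this example only.

```python
import math


def weighted_cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two keyword sets weighted by importance coefficients."""
    # Only keywords present in both sets contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


set_a = {"Japan": 1.0, "island building": 0.8, "land reclamation": 0.5}
set_b = {"Japan": 0.7, "island building": 1.0, "sovereignty": 0.6}
similarity = weighted_cosine(set_a, set_b)  # approximately 0.80
```

With a first preset threshold of, say, 0.5, these two keyword sets would be merged into one associated page keyword set.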
值得说明的是,第一预设阈值可以是根据经验人为设定的阈值(例如0.5),也可以是根据预先获取的页面样本进行训练获得分类模型,并通过验证样本对该分类模型进行验证,在该分类模型具有一定的分类准确率(如99%)时的阈值。
其中,电子设备可以仅将不同的关键词集合中的各词去重后放入一个集合进行合并,电子设备也可以将不同的关键词集合中的各词去重后放入一个集合,同时将相同关键词的重要度系数相加以进行合并。
通过该步骤,电子设备可以将步骤101中获取的被访问页面划分为多个分类。其中,每个分类由至少一个被访问页面组成,这些被访 问页面的页面内容相似或相关联,互为关联页面。同时,这些关联页面对应的关键词集合被合并生成关联页面关键词集合。
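In some implementations, the electronic device may also obtain the associated pages in this step by text clustering (such as K-means) and generate the associated page keyword sets. Taking K-means as an example, the electronic device may first select the K pages with the highest page visit counts as the cluster centroids, then measure the distance from every other page to each centroid and assign it to the class of the nearest centroid, and then recompute the centroid of each resulting class. The step "measure the distance from every other page to each centroid and assign it to the class of the nearest centroid" is repeated until the distance between the new centroids and the previous centroids is equal to or less than a specified threshold, at which point the pages have been divided into K categories. Within each of these K categories, the accessed pages may be associated pages of one another. Merging the keyword sets of mutually associated accessed pages as described above yields the associated page keyword sets.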
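The K-means variant described here can be sketched as follows. The sketch assumes that each page has already been converted to a numeric vector (for example, keyword weights over a shared vocabulary) and uses Euclidean distance; both choices, and the function name `kmeans_by_visits`, are assumptions made for illustration.

```python
import math


def kmeans_by_visits(vectors: list, visits: list, k: int,
                     tol: float = 1e-6, max_iter: int = 100) -> list:
    """Cluster page vectors; the k most-visited pages seed the centroids.

    Returns one cluster label (0 .. k-1) per page.
    """
    seeds = sorted(range(len(vectors)), key=lambda i: visits[i], reverse=True)[:k]
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Assign every page to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(vec, centroids[c]))
            for vec in vectors
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        shift = max(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # centroids have stopped moving
            break
    return labels
```

Pages that share a label are treated as associated pages, and their keyword sets can then be merged into one associated page keyword set.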
步骤104,基于至少一个关联页面关键词集合中的各个集合对应的被访问页面的页面访问量之和的排序结果,利用至少一个关联页面关键词集合中的一个或多个集合生成第一推送信息。
在本实施例中,电子设备可以首先获取上述至少一个关联页面关键词集合所对应的被访问页面的页面访问量的总和,并将这些页面访问量的总和进行排序(例如排序顺序为页面访问量的总和从高到低),然后基于排序结果,利用至少一个关联页面关键词集合中的一个或多个集合生成第一推送信息。
例如,当按照上述页面访问量的总和从高到低的顺序排序时,电子设备可以获取排列靠前的预设个数(例如10个)的关联页面关键词集合,然后根据这些关联页面关键词集合或者这些关联页面关键词集合所对应的被访问页面,生成第一推送信息。在这里,电子设备可以选取关联页面关键词集合对应的被访问页面中发布时间最近的页面,将该页面的主题或关键词作为第一推送信息。电子设备也可以将关联页面关键词集合中的各词按照所对应的被访问页面的页面数量或页面访问量由大到小进行排序,选取排在最前的预设个数的关键词作为第一推送信息。电子设备还可以将关联页面关键词集合对应的关联页面中页面访问量最高的页面的主题作为第一推送信息。电子设备还可以 以其他方式,如将关联页面关键词集合对应的被访问页面中页面访问量最高的页面的关键词作为第一推送信息。本申请对此不做限定。可选地,第一推送信息还可以包括关联页面关键词集合对应的关联页面的页面访问量的总和,或者用于生成第一推送信息的被访问页面的页面访问量。
在一些实现中,电子设备可以将该第一推送信息推送给用户。电子设备还可以将第一推送信息直接呈现给用户,还可以将第一推送信息以超链接形式推送给用户,该超链接可以是包括关键词或主题名称的文本,用于链接到该第一推送信息对应的被访问页面或生成该第一推送信息的关联页面关键词集合所对应的关联页面中页面访问量最高的被访问页面。
通过本步骤,电子设备可以获取上述页面对应的分类中访问量最高的前N(N为正整数)个分类,并将这N个分类生成N条第一推送信息。
步骤105,基于用于生成第一推送信息的关联页面关键词集合所对应的至少一个被访问页面,生成与第一推送信息相关联的第二推送信息并推送给用户。
在本实施例中,对于每条第一推送信息,电子设备可以获取用于生成第一推送信息的关联页面关键词集合所对应的被访问页面,并从中选取至少一个被访问页面,根据该至少一个被访问页面生成与前述第一推送信息相关联的第二推送信息。
在这里,第二推送信息可以根据与第一推送信息相关联的页面生成。例如,如果第一推送信息是关联页面关键词集合中各词按照所对应的被访问页面的页面数量或页面访问量由大到小进行排序而选取的排在最前的预设个数的关键词,第二推送信息可以是包含这预设个数的关键词中词的个数最多的M(M为正整数)个页面的主题;如果第一推送信息是关联页面关键词集合对应的关联页面中页面访问量最高的被访问页面的主题,第二推送信息可以是关联页面关键词集合对应的关联页面中页面访问量最高的前M(M为正整数)个页面(可以包括用于生成第一推送信息的页面,也可以不包括用于生成第一推送信 息的页面)的主题。
其中,电子设备可以将第二推送信息和第一推送信息一起呈现给用户,也可以在向用户呈现第一推送信息后,检测用户的预定操作,响应于检测到预定操作,将第二推送信息展示给用户。例如,第二推送信息可以在用户点击第一推送信息时呈现,也可以在用户点击第一推送信息对应的按钮时呈现,还可以响应于鼠标悬停而呈现,等等。可选地,第二推送信息可以以超链接的形式推送给用户,该超链接可以关联到第二推送信息对应的页面。
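FIG. 2 shows an example of this embodiment in a specific application. In the example of FIG. 2, the electronic device first obtains the URLs and page visit counts of accessed pages from at least one site, then parses the content of each accessed page to generate its keyword set, and then, based on mutual comparison of the keyword sets, merges keyword sets whose similarity exceeds the first preset threshold to generate associated-page keyword sets. Next, it selects from the associated-page keyword sets the 3 sets whose corresponding associated pages have the highest sums of page visit counts, and generates first push information 201 (for example, trending news on the network) from the topic of the accessed page with the highest page visit count for each of these 3 sets. Then, from the accessed pages corresponding to the associated-page keyword set used to generate first push information 201, it obtains at least one (for example, 3) accessed pages, generates second push information 202 associated with the first push information (for example, background news for the trending news), and pushes it to the user.
In FIG. 2, first push information 201 may include a topic 2011, the sum 2012 of the page visit counts of the associated pages corresponding to the associated-page keyword set, and a button 2013. When the user clicks button 2013, the electronic device displays the topics 2021 contained in second push information 202. Topic 2011 and topic 2021 may both be hyperlinked text that links to the accessed pages corresponding to topic 2011 and topic 2021, respectively. An application scenario for this example could be the electronic device pushing to a website's editors the news events that are attracting the most attention on the network, together with background material on those events, so that the editors can edit the news events and update the website's content.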
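By pushing to the user second push information associated with the first push information, the above embodiment of the present application can show the user richer push content.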
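With further reference to FIG. 3, a flow 300 of another embodiment of the information push method of the present application is shown. The information push method 300 includes the following steps:
Step 301: obtain page access information of at least one site, where the page access information includes the URLs and page visit counts of accessed pages.
In this embodiment, the electronic device (for example, an electronic terminal on which an application that includes information push runs, or a backend server that supports such an application) may obtain the page access information of at least one site locally or remotely. Here, the page access information may include the web address (for example, the URL) and the page visit count of each accessed page.
Step 302: parse the content of the page corresponding to each URL to generate a keyword set for each accessed page.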
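In this embodiment, the electronic device may parse the content of the page corresponding to each of the above URLs by various methods (for example, statistical analysis or semantic analysis), extract one or more keywords from it, and generate a keyword set. In some implementations, the electronic device may also expand individual keywords among the one or more keywords to generate expanded keywords, and generate the keyword set from the extracted keywords together with the expanded keywords. The expanded keywords may include synonyms, near-synonyms, and related words of the individual extracted keyword. Optionally, each keyword in the keyword set may also have an importance coefficient.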
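As a rough illustration of the statistical route, the sketch below extracts the most frequent words of a page, uses relative frequency as the importance coefficient, and expands keywords from a lookup table. Everything here is an assumption made for the example: the whitespace-style English tokenization (Chinese text would need a word segmenter), the stopword list, the synonym table, and the choice to give an expanded keyword the same coefficient as the keyword it came from.

```python
import re
from collections import Counter

def extract_keywords(text, top_k, stopwords=frozenset(), synonyms=None):
    """Return a dict keyword -> importance coefficient for one page."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    counts = Counter(words)
    total = sum(counts.values())
    # Relative term frequency stands in for the importance coefficient.
    keywords = {w: c / total for w, c in counts.most_common(top_k)}
    # Expand each extracted keyword with its listed synonyms or related words.
    for word in list(keywords):
        for extra in (synonyms or {}).get(word, []):
            keywords.setdefault(extra, keywords[word])
    return keywords

# Hypothetical page text and expansion table.
text = "quake quake quake rescue rescue aid the the the the"
keyword_set = extract_keywords(
    text,
    top_k=2,
    stopwords=frozenset({"the"}),
    synonyms={"quake": ["earthquake"]},
)
```

A production system would more likely weight terms with TF-IDF or a semantic model; raw frequency is used here only to keep the sketch self-contained.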
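Step 303: based on mutual comparison of the keyword sets, merge keyword sets whose similarity is greater than a first preset threshold to generate at least one associated-page keyword set.
In this embodiment, the electronic device may further compare the different keyword sets with one another, compute the similarity between each pair of keyword sets, and merge keyword sets whose similarity is greater than the first preset threshold to generate associated-page keyword sets. The accessed pages corresponding to the keyword sets used to generate an associated-page keyword set may be associated pages of one another.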
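Here, the similarity between keyword sets characterizes how similar different keyword sets are. In this embodiment, the electronic device may use the number of keywords shared by two sets to characterize their similarity. The electronic device may also compute similarity with well-known text similarity measures such as cosine similarity or the Jaccard coefficient. In some implementations, the words in a keyword set may also have importance coefficients, in which case the electronic device may compute the similarity between keyword sets based on the importance coefficients.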
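The two named measures can be written down compactly. The sketch below shows the Jaccard coefficient on plain keyword sets and a cosine similarity that uses the importance coefficients as vector components; the function names and the dictionary representation of weighted sets are assumptions.

```python
import math

def jaccard(a, b):
    """Shared keywords divided by all distinct keywords of both sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_cosine(a, b):
    """Cosine similarity of two dicts mapping keyword -> importance coefficient."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical keyword sets.
j = jaccard({"quake", "rescue", "aid"}, {"rescue", "aid", "donation"})
c_same = weighted_cosine({"quake": 3.0, "aid": 4.0}, {"quake": 3.0, "aid": 4.0})
c_none = weighted_cosine({"quake": 1.0}, {"football": 1.0})
```

Either value can then be compared against the first preset threshold to decide whether two keyword sets should be merged.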
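Step 304: based on a ranking of the sums of the page visit counts of the accessed pages corresponding to each set among the at least one associated-page keyword set, generate first push information using one or more of the at least one associated-page keyword set.
In this embodiment, the electronic device may first obtain, for each of the at least one associated-page keyword set, the sum of the page visit counts of its corresponding accessed pages, sort these sums (for example, from highest to lowest), and then, based on the ranking, generate the first push information using one or more of the at least one associated-page keyword set.
Step 305: cluster the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period.
In this embodiment, the electronic device may cluster the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period. Here, when the at least one time period includes two or more time periods, the clustering result may be such that the time difference between any two publication times taken from two different time periods is greater than the preset time interval.
Clustering is the process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects. Here, the purpose of clustering the publication times of the accessed pages by a preset time interval is to divide the publication times into at least one time period, thereby dividing the accessed pages into multiple classes with similar publication times.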
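In this embodiment, clustering by publication time may use any of various well-known clustering algorithms. For example, based on a hierarchical clustering algorithm, the electronic device may repeatedly merge the two publication times separated by the smallest time gap until the smallest remaining gap is greater than or equal to the preset time interval, thereby dividing the accessed pages corresponding to the associated-page keyword set, by publication time, into pages published in different time periods. For any two accessed pages published in different time periods, the difference between their publication times is greater than the preset time interval.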
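For one-dimensional timestamps, repeatedly merging the closest pair (single-linkage hierarchical clustering) gives the same periods as sorting the times and cutting wherever the gap between neighbours exceeds the interval. The sketch below uses that shortcut. It is an illustration, not the applicant's code: the function name is hypothetical, and a gap exactly equal to the interval is merged here so that times from different periods always differ by more than the interval.

```python
from datetime import datetime, timedelta

def cluster_by_publish_time(times, interval):
    """Split publication times into periods separated by gaps > interval."""
    periods = []
    for t in sorted(times):
        # Extend the current period while the gap to its last time is small.
        if periods and t - periods[-1][-1] <= interval:
            periods[-1].append(t)
        else:
            periods.append([t])  # gap too large: start a new period
    return periods

# Hypothetical publication times of five associated pages.
times = [
    datetime(2015, 8, 3, 12, 5),
    datetime(2015, 8, 3, 8, 0),
    datetime(2015, 8, 3, 8, 25),
    datetime(2015, 8, 3, 12, 0),
    datetime(2015, 8, 3, 8, 10),
]
periods = cluster_by_publish_time(times, timedelta(minutes=30))
```

With a 30-minute interval, the three morning pages form one period and the two midday pages form another.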
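In an optional implementation of this embodiment, the electronic device may also determine the preset clustering interval according to the time of day. For example, the electronic device may collect page publication volumes over multiple days in advance and set the intervals according to the distribution of publication volume. For example, if relatively few pages are published between 0:00 and 6:00 each day, the preset time interval for publication times between 0:00 and 6:00 may be set to a longer duration, such as 2 hours; likewise, if relatively many pages are published between 9:00 and 11:00 each day, the preset time interval for publication times between 9:00 and 11:00 may be set to a shorter duration, such as 20 minutes.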
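A time-of-day-dependent interval can be expressed as a small lookup. The two schedule entries below come from the example in the text; the one-hour default for all other hours, and the names `SCHEDULE`, `DEFAULT_INTERVAL`, and `interval_for`, are assumptions added for illustration.

```python
from datetime import datetime, timedelta

# (start hour, end hour, interval); the two entries follow the example above.
SCHEDULE = [
    (0, 6, timedelta(hours=2)),      # few pages published: long interval
    (9, 11, timedelta(minutes=20)),  # many pages published: short interval
]
DEFAULT_INTERVAL = timedelta(hours=1)  # assumed value for all other hours

def interval_for(published):
    """Return the clustering interval that applies at a publication time."""
    for start_hour, end_hour, interval in SCHEDULE:
        if start_hour <= published.hour < end_hour:
            return interval
    return DEFAULT_INTERVAL

night = interval_for(datetime(2015, 8, 3, 3, 0))
morning = interval_for(datetime(2015, 8, 3, 9, 30))
afternoon = interval_for(datetime(2015, 8, 3, 14, 0))
```

In practice the schedule would be derived from the measured distribution of page publication volume rather than hard-coded.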
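Through this step, the electronic device can separate by time the accessed pages corresponding to one associated-page keyword set; accessed pages in different time periods may record the content of an event at different stages of its development.
Step 306: for one or more of the at least one time period, extract one page from the accessed pages corresponding to each time period.
In this embodiment, the electronic device may, for one or more of the at least one time period, extract one page from the accessed pages corresponding to each time period.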
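Here, the page extracted by the electronic device may be any page published in the corresponding time period, or a page obtained according to some rule. When obtaining a page by rule, the electronic device may take the page with the highest page visit count in the time period, or the earliest-published page in the time period, or may obtain a page according to preset priority levels of the sites that published the pages, and so on. The present application places no limitation on this.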
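The three selection rules named above can be sketched as one function. The record fields (`visits`, `published`, `site`), the rule names, and the priority table (lower number means higher priority) are hypothetical choices made for this example.

```python
def pick_page(pages, rule="most_visited", site_priority=None):
    """Pick one representative page from the pages of a single time period."""
    if rule == "most_visited":
        return max(pages, key=lambda p: p["visits"])
    if rule == "earliest":
        return min(pages, key=lambda p: p["published"])
    if rule == "site_priority":
        priority = site_priority or {}
        # Sites missing from the table rank after all listed sites.
        return min(pages, key=lambda p: priority.get(p["site"], len(priority)))
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical pages of one time period (ISO timestamps sort chronologically).
period_pages = [
    {"url": "http://blog.example.com/1", "site": "blog.example.com",
     "visits": 900, "published": "2015-08-03T08:10"},
    {"url": "http://news.example.com/2", "site": "news.example.com",
     "visits": 400, "published": "2015-08-03T08:00"},
]
by_visits = pick_page(period_pages, "most_visited")
by_time = pick_page(period_pages, "earliest")
by_site = pick_page(period_pages, "site_priority", {"news.example.com": 0})
```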
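Step 307: based on the extracted pages, generate second push information and push it to the user.
In this embodiment, the electronic device may generate the second push information from the pages extracted in step 306 according to some rule, and may push the second push information to the user. There are many ways to generate the second push information from the extracted pages. For example, the electronic device may use the topics or keywords of the extracted pages as the second push information, or may select a preset number of the extracted pages in order of publication time from most recent to earliest and use the topics or keywords of those pages as the second push information, and so on. The present application places no limitation on this.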
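The second option (most recent pages first, limited to a preset number) might look like the sketch below. The field names, the function name `second_push_info`, and the sample pages are assumptions.

```python
def second_push_info(extracted_pages, limit):
    """Use the topics of the most recently published extracted pages."""
    recent_first = sorted(
        extracted_pages, key=lambda p: p["published"], reverse=True
    )
    return [{"topic": p["topic"], "url": p["url"]} for p in recent_first[:limit]]

# Hypothetical pages, one extracted from each time period of an event.
extracted = [
    {"topic": "Quake hits region", "url": "http://example.com/a",
     "published": "2015-08-01T09:00"},
    {"topic": "Rescue under way", "url": "http://example.com/b",
     "published": "2015-08-02T09:00"},
    {"topic": "Rebuilding begins", "url": "http://example.com/c",
     "published": "2015-08-03T09:00"},
]
timeline = second_push_info(extracted, limit=2)
```

Each item carries the URL so that it can be pushed as a hyperlink to the corresponding page.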
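In an optional implementation of this embodiment, a page deduplication step may also be included between step 304 and step 305. The electronic device may process the accessed pages corresponding to an associated-page keyword set as follows: among those accessed pages, reduce the accessed pages whose keyword sets have a similarity greater than a second preset threshold to a single page, and take the accessed pages remaining after this filtering as the accessed pages corresponding to the associated-page keyword set.
Here, the similarity is computed in the same way as in step 103 of the foregoing embodiment and is not described again. The second preset threshold may be greater than the first preset threshold. The principle by which the electronic device deduplicates the accessed pages corresponding to an associated-page keyword set in this step is as follows: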
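For example, if the second preset threshold is 98%, then when the similarity of two keyword sets is greater than 98%, the electronic device may regard the accessed pages corresponding to the two keyword sets as pages with the same content, that is, duplicate pages. The electronic device may keep any one of the duplicate pages, or may select one page to keep according to some rule, such as keeping the earliest-published page, while filtering out the other duplicates, and take the accessed pages remaining after filtering as the accessed pages corresponding to the associated-page keyword set. Suppose an associated-page keyword set corresponds to 1000 accessed pages, among which there are 30 groups of duplicate pages, each group containing 2 pages. The electronic device then filters out 1 page and keeps 1 page from each of the 30 groups, leaving 970 pages as the accessed pages corresponding to the associated-page keyword set. For duplicate pages that are not kept, the electronic device may delete their page information. Optionally, for duplicate pages, the electronic device may add the page visit counts of the pages that are not kept to the page visit count of the page that is kept.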
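The deduplication rule in this example (keep the earliest-published page of each duplicate group and add the visit counts of the discarded duplicates to it) can be sketched as follows. Similarity is assumed to be the Jaccard coefficient; the record fields and function names are hypothetical.

```python
def jaccard(a, b):
    """Shared keywords divided by all distinct keywords of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(pages, threshold=0.98):
    """Keep the earliest page of each group of near-identical pages."""
    kept = []
    # Visit pages in publication order so the earliest duplicate survives.
    for page in sorted(pages, key=lambda p: p["published"]):
        for survivor in kept:
            if jaccard(page["keywords"], survivor["keywords"]) > threshold:
                survivor["visits"] += page["visits"]  # accumulate visit counts
                break
        else:
            kept.append(dict(page))  # copy so the input records stay unchanged
    return kept

# Hypothetical pages; the first two have identical keyword sets.
pages = [
    {"url": "http://example.com/copy", "keywords": {"quake", "rescue"},
     "visits": 50, "published": "2015-08-03T10:00"},
    {"url": "http://example.com/original", "keywords": {"quake", "rescue"},
     "visits": 200, "published": "2015-08-03T08:00"},
    {"url": "http://example.com/other", "keywords": {"quake", "aid"},
     "visits": 70, "published": "2015-08-03T09:00"},
]
unique_pages = deduplicate(pages)
```

Running the deduplication before the time clustering of step 305 keeps copies of the same article from being picked in two different time periods.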
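In this embodiment, steps 301, 302, 303, and 304 of the above flow are substantially the same as steps 101, 102, 103, and 104 of the foregoing embodiment, respectively, and are not described again here.
As can be seen from FIG. 3, unlike the embodiment corresponding to FIG. 1, flow 300 of the information push method in this embodiment replaces step 105 with steps 305, 306, and 307. Through steps 305, 306, and 307, this embodiment can sample, by time period, the accessed pages corresponding to the associated-page keyword set of the first push information, and thereby generate second push information associated with the first push information. When these pages belong to the same event, the content of the pages in each time period can give one stage of the event's development, so extracting one page from each time period to generate the second push information lets the user follow the development of the whole event through the second push information. FIG. 4 shows the effect of one application scenario of the information push method of this embodiment. The scenario shown in FIG. 4 is the pushing of trending news, where 401 indicates the first push information and 402 indicates the second push information. This embodiment helps push to the user how the subject of the first push information developed in each time period. Optionally, before the publication times of the accessed pages are clustered, the pages may first be deduplicated, to avoid retrieving pages with identical content in different time periods and thereby reducing the effectiveness of the push.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an information push apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus may be applied in an electronic device.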
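As shown in FIG. 5, the information push apparatus 500 of this embodiment includes: an information acquisition module 501, a keyword set generation module 502, a keyword set merging module 503, a first push information generation module 504, and a second push information generation and pushing module 505. The information acquisition module 501 is configured to obtain page access information of at least one site, where the page access information includes the URLs and page visit counts of accessed pages; the keyword set generation module 502 is configured to parse the content of the page corresponding to each URL to generate a keyword set for each accessed page; the keyword set merging module 503 is configured to, based on mutual comparison of the keyword sets, merge keyword sets whose similarity is greater than a first preset threshold to generate at least one associated-page keyword set, where the accessed pages corresponding to the keyword sets used to generate an associated-page keyword set are associated pages of one another; the first push information generation module 504 is configured to, based on a ranking of the sums of the page visit counts of the accessed pages corresponding to each set among the at least one associated-page keyword set, generate first push information using one or more of the at least one associated-page keyword set; and the second push information generation and pushing module 505 is configured to, based on at least one accessed page corresponding to the associated-page keyword set used to generate the first push information, generate second push information associated with the first push information and push it to the user.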
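In this embodiment, the information push apparatus 500 may first obtain, through the information acquisition module 501, the page access information of at least one site locally or remotely. Here, the page access information may include the web address (for example, the URL) and the page visit count of each accessed page.
In this embodiment, the keyword set generation module 502 may then parse the content of the page corresponding to each of the above URLs by various methods (for example, statistical analysis or semantic analysis), extract one or more keywords from it, and generate a keyword set. In some implementations, the keyword set generation module 502 may also expand individual keywords among the one or more keywords to generate expanded keywords, and generate the keyword set from the extracted keywords together with the expanded keywords. The expanded keywords may include synonyms, near-synonyms, and related words of the individual extracted keyword. Optionally, each keyword in the keyword set may also have an importance coefficient.
In this embodiment, the keyword set merging module 503 may then compare the keyword sets generated by the keyword set generation module 502 with one another and merge keyword sets whose similarity is greater than the first preset threshold to generate at least one associated-page keyword set. The accessed pages corresponding to the keyword sets used to generate an associated-page keyword set are associated pages of one another. Here, the similarity between keyword sets may be computed by a variety of methods.
In this embodiment, the first push information generation module 504 may then obtain, for each of the at least one associated-page keyword set, the sum of the page visit counts of its corresponding accessed pages, sort these sums (for example, from highest to lowest), and then, based on the ranking, generate the first push information using one or more of the at least one associated-page keyword set.
In this embodiment, the second push information generation and pushing module 505 may then, for each piece of first push information, obtain the accessed pages corresponding to the associated-page keyword set used to generate that first push information, select at least one accessed page from them, and generate from the at least one accessed page second push information associated with the first push information and push it to the user.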
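In some optional implementations of this embodiment, the second push information generation and pushing module 505 may include: a clustering unit (not shown), configured to cluster the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period; an extraction unit (not shown), configured to, for one or more of the at least one time period, extract one page from the accessed pages corresponding to each time period; and a generation unit (not shown), configured to generate second push information based on the extracted pages and push it to the user. Here, when the at least one time period includes two or more time periods, the clustering result may be such that the time difference between any two publication times taken from two different time periods is greater than the preset time interval.
In some optional implementations of this embodiment, the second push information generation and pushing module 505 may further include: a filtering unit (not shown), configured to, among the accessed pages corresponding to an associated-page keyword set, reduce the accessed pages whose keyword sets have a similarity greater than a second preset threshold to a single page, and take the accessed pages remaining after this filtering as the accessed pages corresponding to the associated-page keyword set. The second preset threshold is greater than the first preset threshold. The role of the filtering unit is to deduplicate the accessed pages corresponding to the associated-page keyword set.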
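It is worth noting that the modules or units described in the information push apparatus 500 correspond to the steps of the method described with reference to FIG. 1. The operations and features described above for the method therefore apply equally to the information push apparatus 500 and the modules or units it contains, and are not described again here.
Those skilled in the art will understand that the information push apparatus 500 also includes other well-known structures, such as a processor and memory; so as not to unnecessarily obscure the embodiments of the present disclosure, these well-known structures are not shown in FIG. 5.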
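Referring now to FIG. 6, a schematic structural diagram of a computer system 600 suitable for implementing the electronic device of the embodiments of the present application is shown.
As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker; a storage portion 608 including a hard disk or the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.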
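In particular, according to the embodiments of the present application, the process described above with reference to the flowcharts may be implemented as a computer software program. For example, the embodiments of the present application include a computer program product that includes a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, this may be described as: a processor comprising an information acquisition module, a keyword set generation module, a keyword set merging module, a first push information generation module, and a second push information generation and pushing module. In some cases the names of these modules do not limit the modules themselves; for example, the information acquisition module may also be described as "a module configured to obtain page access information of at least one site".
In another aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the apparatus described in the above embodiments, or it may be a computer-readable storage medium that exists separately and is not assembled into a terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the information push method described in the present application.
The above description is merely of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.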

Claims (16)

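  1. An information push method, characterized in that the method comprises:
    obtaining page access information of at least one site, wherein the page access information includes URLs and page visit counts of accessed pages;
    parsing the content of the page corresponding to each URL to generate a keyword set for each accessed page;
    based on mutual comparison of the keyword sets, merging keyword sets whose similarity is greater than a first preset threshold to generate at least one associated-page keyword set, wherein the accessed pages corresponding to the keyword sets used to generate an associated-page keyword set are associated pages of one another;
    based on a ranking of the sums of the page visit counts of the accessed pages corresponding to each set among the at least one associated-page keyword set, generating first push information using one or more of the at least one associated-page keyword set;
    based on at least one accessed page corresponding to the associated-page keyword set used to generate the first push information, generating second push information associated with the first push information and pushing it to a user.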
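  2. The method according to claim 1, characterized in that the generating, based on at least one accessed page corresponding to the associated-page keyword set used to generate the first push information, second push information associated with the first push information and pushing it to a user comprises:
    clustering the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period, wherein, when the at least one time period includes two or more time periods, the time difference between any two publication times taken from two different time periods is greater than the time interval;
    for one or more of the at least one time period, extracting one page from the accessed pages corresponding to each time period;
    based on the extracted pages, generating second push information and pushing it to the user.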
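  3. The method according to claim 2, characterized in that, before the clustering of the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval and dividing them into at least one time period, the method further comprises:
    among the accessed pages corresponding to the associated-page keyword set, reducing the accessed pages whose keyword sets have a similarity greater than a second preset threshold to a single page, and taking the accessed pages remaining after this filtering as the accessed pages corresponding to the associated-page keyword set, wherein the second preset threshold is greater than the first preset threshold.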
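  4. The method according to claim 1, characterized in that the parsing of the content of the page corresponding to each URL to generate a keyword set for each accessed page comprises:
    performing statistical analysis and/or semantic analysis on the content of the accessed page to extract at least one keyword;
    based on the at least one keyword, generating a keyword set.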
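  5. The method according to claim 4, characterized in that the generating of a keyword set based on the at least one keyword comprises:
    for each individual keyword among the at least one keyword, performing expansion to generate expanded keywords, wherein the expanded keywords include at least one of the following: synonyms of the individual keyword, near-synonyms of the individual keyword, and related words of the individual keyword;
    based on the at least one keyword and the expanded keywords, generating a keyword set.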
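  6. The method according to any one of claims 1-5, characterized in that keyword sets satisfying one of the following conditions are treated as keyword sets whose similarity is greater than the first preset threshold:
    the number of identical keywords is greater than a count threshold;
    the ratio of the number of identical keywords to the total number of keywords in the keyword sets being compared is greater than a ratio threshold.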
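  7. The method according to any one of claims 1-5, characterized in that each keyword in the keyword set further has an importance coefficient, and
    the merging, based on mutual comparison of the keyword sets, of keyword sets whose similarity is greater than a first preset threshold to generate at least one associated-page keyword set comprises:
    computing the similarity of different keyword sets based on the importance coefficients;
    merging keyword sets whose similarity is greater than a similarity threshold to generate an associated-page keyword set.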
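  8. An information push apparatus, characterized in that the apparatus comprises:
    an information acquisition module, configured to obtain page access information of at least one site, wherein the page access information includes URLs and page visit counts of accessed pages;
    a keyword set generation module, configured to parse the content of the page corresponding to each URL to generate a keyword set for each accessed page;
    a keyword set merging module, configured to, based on mutual comparison of the keyword sets, merge keyword sets whose similarity is greater than a first preset threshold to generate at least one associated-page keyword set, wherein the accessed pages corresponding to the keyword sets used to generate an associated-page keyword set are associated pages of one another;
    a first push information generation module, configured to, based on a ranking of the sums of the page visit counts of the accessed pages corresponding to each set among the at least one associated-page keyword set, generate first push information using one or more of the at least one associated-page keyword set;
    a second push information generation and pushing module, configured to, based on at least one accessed page corresponding to the associated-page keyword set used to generate the first push information, generate second push information associated with the first push information and push it to a user.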
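  9. The apparatus according to claim 8, characterized in that the second push information generation and pushing module comprises:
    a clustering unit, configured to cluster the publication times of the accessed pages corresponding to the associated-page keyword set used to generate the first push information according to a preset time interval, dividing them into at least one time period, wherein, when the at least one time period includes two or more time periods, the time difference between any two publication times taken from two different time periods is greater than the time interval;
    an extraction unit, configured to, for one or more of the at least one time period, extract one page from the accessed pages corresponding to each time period;
    a generation unit, configured to generate second push information based on the extracted pages and push it to the user.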
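  10. The apparatus according to claim 9, characterized in that the second push information generation and pushing module further comprises:
    a filtering unit, configured to, among the accessed pages corresponding to the associated-page keyword set, reduce the accessed pages whose keyword sets have a similarity greater than a second preset threshold to a single page, and take the accessed pages remaining after this filtering as the accessed pages corresponding to the associated-page keyword set, wherein the second preset threshold is greater than the first preset threshold.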
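  11. The apparatus according to claim 8, characterized in that the keyword set generation module comprises:
    a keyword extraction unit, configured to perform statistical analysis and/or semantic analysis on the content of the accessed page to extract at least one keyword;
    a keyword set generation unit, configured to generate a keyword set based on the at least one keyword.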
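  12. The apparatus according to claim 11, characterized in that the keyword set generation unit comprises:
    an expansion subunit, configured to, for each individual keyword among the at least one keyword, perform expansion to generate expanded keywords, wherein the expanded keywords include at least one of the following: synonyms of the individual keyword, near-synonyms of the individual keyword, and related words of the individual keyword;
    a keyword set generation subunit, configured to generate a keyword set based on the at least one keyword and the expanded keywords.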
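  13. The apparatus according to any one of claims 8-12, characterized in that the keyword set merging module is further configured to:
    treat keyword sets satisfying one of the following conditions as keyword sets whose similarity is greater than the first preset threshold:
    the number of identical keywords is greater than a count threshold;
    the ratio of the number of identical keywords to the total number of keywords in the keyword sets being compared is greater than a ratio threshold.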
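  14. The apparatus according to any one of claims 8-12, characterized in that each keyword in the keyword set further has an importance coefficient, and
    the keyword set merging module comprises:
    a computation unit, configured to compute the similarity of different keyword sets based on the importance coefficients;
    a merging and generation unit, configured to merge keyword sets whose similarity is greater than a similarity threshold to generate an associated-page keyword set.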
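  15. A device, comprising:
    a processor; and
    a memory,
    the memory storing computer-readable instructions executable by the processor, wherein, when the computer-readable instructions are executed, the processor performs the method according to any one of claims 1 to 7.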
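  16. A non-volatile computer storage medium, the computer storage medium storing computer-readable instructions executable by a processor, wherein, when the computer-readable instructions are executed by a processor, the processor performs the method according to any one of claims 1 to 7.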
PCT/CN2015/095754 2015-08-03 2015-11-27 信息推送方法和装置 WO2017020451A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510483126.3 2015-08-03
CN201510483126.3A CN105069102B (zh) 2015-08-03 2015-08-03 信息推送方法和装置

Publications (1)

Publication Number Publication Date
WO2017020451A1 true WO2017020451A1 (zh) 2017-02-09

Family

ID=54498472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/095754 WO2017020451A1 (zh) 2015-08-03 2015-11-27 信息推送方法和装置

Country Status (2)

Country Link
CN (1) CN105069102B (zh)
WO (1) WO2017020451A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921918A (zh) * 2018-07-24 2018-11-30 Oppo广东移动通信有限公司 视频创建方法及相关装置
CN109785919A (zh) * 2018-11-30 2019-05-21 平安科技(深圳)有限公司 名词匹配方法、装置、设备及计算机可读存储介质
CN110163701A (zh) * 2018-02-11 2019-08-23 北京京东尚科信息技术有限公司 推送信息的方法和装置
CN111460289A (zh) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 新闻资讯的推送方法和装置
CN112733006A (zh) * 2019-10-14 2021-04-30 中国移动通信集团上海有限公司 用户画像的生成方法、装置、设备及存储介质
CN113420550A (zh) * 2021-06-30 2021-09-21 中国农业银行股份有限公司 提取关键词的方法及装置
CN113781113A (zh) * 2021-09-09 2021-12-10 杭州爆米花鹰眼科技有限责任公司 一种连锁式信息推送系统及方法
CN114357278A (zh) * 2020-09-28 2022-04-15 腾讯科技(深圳)有限公司 一种话题推荐方法、装置及设备
CN114817730A (zh) * 2022-05-06 2022-07-29 李春良 一种大数据情境下的资讯活动信息推荐系统及方法

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069102B (zh) * 2015-08-03 2017-05-24 百度在线网络技术(北京)有限公司 信息推送方法和装置
CN105491056A (zh) * 2015-12-25 2016-04-13 深圳市金立通信设备有限公司 一种信息推送方法及终端
CN106933912B (zh) * 2015-12-31 2020-07-03 北京国双科技有限公司 关键词的获取方法和装置
CN105808641A (zh) 2016-02-24 2016-07-27 百度在线网络技术(北京)有限公司 线下资源的挖掘方法和装置
CN107451161A (zh) * 2016-06-01 2017-12-08 阿里巴巴集团控股有限公司 展示对象的推送方法、装置及平台
CN106294815B (zh) * 2016-08-16 2019-08-16 晶赞广告(上海)有限公司 一种url的聚类方法及装置
CN106372204A (zh) * 2016-08-31 2017-02-01 北京小米移动软件有限公司 推送消息处理方法及装置
CN108241699B (zh) * 2016-12-26 2022-03-11 百度在线网络技术(北京)有限公司 用于推送信息的方法和装置
CN106777283B (zh) * 2016-12-29 2021-02-26 北京奇虎科技有限公司 一种同义词的挖掘方法及装置
CN108363707B (zh) * 2017-01-26 2020-01-24 百度在线网络技术(北京)有限公司 用于生成网页的方法和装置
CN106777403B (zh) * 2017-03-28 2020-07-28 百度在线网络技术(北京)有限公司 信息推送方法和装置
CN107196999B (zh) * 2017-05-03 2020-01-24 网易传媒科技(北京)有限公司 用于下发信息流推送数据的方法及设备
CN107172151B (zh) * 2017-05-18 2020-08-07 百度在线网络技术(北京)有限公司 用于推送信息的方法和装置
CN107463552A (zh) * 2017-07-20 2017-12-12 北京奇艺世纪科技有限公司 一种生成视频主题名称的方法和装置
CN108304377B (zh) * 2017-12-28 2021-08-06 东软集团股份有限公司 一种长尾词的提取方法及相关装置
CN108416019A (zh) * 2018-03-06 2018-08-17 王海泉 关联词调整方法及调整系统
CN108846028A (zh) * 2018-05-24 2018-11-20 网易传媒科技(北京)有限公司 文章投放方法、介质、装置和计算设备
CN109189908B (zh) * 2018-08-22 2019-08-20 乔杨 海量数据提取推送工作方法
CN109345307A (zh) * 2018-09-28 2019-02-15 西安Tcl软件开发有限公司 广告推送方法、系统、终端及计算机可读存储介质
CN109582863B (zh) * 2018-11-19 2020-08-04 珠海格力电器股份有限公司 一种推荐方法及服务器
CN110309395A (zh) * 2019-07-05 2019-10-08 云南电网有限责任公司电力科学研究院 一种基于数据获取技术的专业字典构建方法
CN110888986B (zh) * 2019-12-06 2023-05-30 北京明略软件系统有限公司 信息推送方法、装置、电子设备和计算机可读存储介质
CN111008340B (zh) * 2019-12-19 2022-11-29 中国联合网络通信集团有限公司 课程推荐方法、设备和存储介质
CN111523027B (zh) * 2020-04-16 2023-08-01 武汉有牛科技有限公司 基于区块链技术的数据新闻自动撰写机器人
CN116340639B (zh) * 2023-03-31 2023-12-12 北京百度网讯科技有限公司 新闻召回方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260597A1 (en) * 2006-05-02 2007-11-08 Mark Cramer Dynamic search engine results employing user behavior
CN101984423A (zh) * 2010-10-21 2011-03-09 百度在线网络技术(北京)有限公司 一种热搜词生成方法及系统
CN103164521A (zh) * 2013-03-11 2013-06-19 亿赞普(北京)科技有限公司 一种基于用户浏览和搜索行为的关键词计算方法及装置
CN105069102A (zh) * 2015-08-03 2015-11-18 百度在线网络技术(北京)有限公司 信息推送方法和装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102723B (zh) * 2014-07-21 2017-07-25 百度在线网络技术(北京)有限公司 搜索内容提供方法和搜索引擎

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163701A (zh) * 2018-02-11 2019-08-23 北京京东尚科信息技术有限公司 推送信息的方法和装置
CN110163701B (zh) * 2018-02-11 2023-11-03 北京京东尚科信息技术有限公司 推送信息的方法和装置
CN108921918B (zh) * 2018-07-24 2023-05-30 Oppo广东移动通信有限公司 视频创建方法及相关装置
CN108921918A (zh) * 2018-07-24 2018-11-30 Oppo广东移动通信有限公司 视频创建方法及相关装置
CN109785919A (zh) * 2018-11-30 2019-05-21 平安科技(深圳)有限公司 名词匹配方法、装置、设备及计算机可读存储介质
CN109785919B (zh) * 2018-11-30 2023-06-23 平安科技(深圳)有限公司 名词匹配方法、装置、设备及计算机可读存储介质
CN112733006A (zh) * 2019-10-14 2021-04-30 中国移动通信集团上海有限公司 用户画像的生成方法、装置、设备及存储介质
CN111460289A (zh) * 2020-03-27 2020-07-28 北京百度网讯科技有限公司 新闻资讯的推送方法和装置
CN111460289B (zh) * 2020-03-27 2024-03-29 北京百度网讯科技有限公司 新闻资讯的推送方法和装置
CN114357278A (zh) * 2020-09-28 2022-04-15 腾讯科技(深圳)有限公司 一种话题推荐方法、装置及设备
CN114357278B (zh) * 2020-09-28 2024-03-19 腾讯科技(深圳)有限公司 一种话题推荐方法、装置及设备
CN113420550A (zh) * 2021-06-30 2021-09-21 中国农业银行股份有限公司 提取关键词的方法及装置
CN113420550B (zh) * 2021-06-30 2024-03-01 中国农业银行股份有限公司 提取关键词的方法及装置
CN113781113B (zh) * 2021-09-09 2022-06-21 杭州爆米花鹰眼科技有限责任公司 一种连锁式信息推送系统及方法
CN113781113A (zh) * 2021-09-09 2021-12-10 杭州爆米花鹰眼科技有限责任公司 一种连锁式信息推送系统及方法
CN114817730A (zh) * 2022-05-06 2022-07-29 李春良 一种大数据情境下的资讯活动信息推荐系统及方法
CN114817730B (zh) * 2022-05-06 2023-06-20 成都坐联智城科技有限公司 一种大数据情境下的资讯活动信息推荐系统及方法

Also Published As

Publication number Publication date
CN105069102B (zh) 2017-05-24
CN105069102A (zh) 2015-11-18

Similar Documents

Publication Publication Date Title
WO2017020451A1 (zh) 信息推送方法和装置
US10140384B2 (en) Dynamically modifying elements of user interface based on knowledge graph
Wang et al. Product aspect extraction supervised with online domain knowledge
CN104899322B (zh) 搜索引擎及其实现方法
CN104573054B (zh) 一种信息推送方法和设备
WO2017118427A1 (zh) 网页训练的方法和装置、搜索意图识别的方法和装置
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
WO2017000402A1 (zh) 网页生成方法和装置
WO2018040343A1 (zh) 用于识别文本类型的方法、装置和设备
WO2016135905A1 (ja) 情報処理システム及び情報処理方法
Ho et al. Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system
WO2015188719A1 (zh) 结构化数据与图片的关联方法与关联装置
Lee et al. Leveraging microblogging big data with a modified density-based clustering approach for event awareness and topic ranking
CN108090178B (zh) 一种文本数据分析方法、装置、服务器和存储介质
US11640420B2 (en) System and method for automatic summarization of content with event based analysis
CN113688310A (zh) 一种内容推荐方法、装置、设备及存储介质
CN109815401A (zh) 一种应用于Web人物搜索的人名消歧方法
Xu et al. Extracting keywords from texts based on word frequency and association features
JP5952756B2 (ja) 予測対象コンテンツにおける将来的なコメント数を予測する予測サーバ、プログラム及び方法
KR20160002199A (ko) 연관 키워드를 이용한 이슈 데이터 추출방법 및 시스템
WO2019231635A1 (en) Method and apparatus for generating digest for broadcasting
WO2016027364A1 (ja) 話題クラスタ選択装置、及び検索方法
Abinaya et al. Event identification in social media through latent dirichlet allocation and named entity recognition
JP6373767B2 (ja) 話題語ランキング装置、話題語ランキング方法、およびプログラム
CN110795943B (zh) 一种针对事件的话题表示生成方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15900221

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15900221

Country of ref document: EP

Kind code of ref document: A1