CN113536106A

CN113536106A - Method for determining information content to be recommended

Info

Publication number: CN113536106A
Application number: CN202011321743.0A
Authority: CN
Inventors: 薛天竹; 沈春旭
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2021-10-22

Abstract

The application relates to a method and a device for determining information content to be recommended, computer equipment and a storage medium. The method comprises the following steps: obtaining a first predetermined number of first information content objects that are most recent in time; for each first information content object, acquiring the click rate score of the first information content object within the latest first predetermined time; performing missing value smoothing on the click rate scores, so that information content objects without click rate scores also obtain corresponding click rate scores, thereby obtaining click rate scores after the missing value smoothing; and determining the information content object to be recommended based on the click rate score of each first information content object after the missing value smoothing. The method can improve the quality of information content recommendation.

Description

Method for determining information content to be recommended

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for determining information content to be recommended, a computer device, and a storage medium.

Background

With the ongoing development of information technology, the recommendation of information content objects, such as User Generated Content (UGC), has become an important part of network technology applications. A common approach is to make recommendations based on user operation information such as clicking, watching or reading, but current approaches to recommending information content suffer from poor recommendation quality.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a computer device, and a storage medium for determining information content to be recommended, which can improve the quality of recommendation.

A method for determining information content to be recommended, the method comprising:

obtaining a first predetermined number of first information content objects that are most recent in time;

for each first information content object, acquiring the click rate score of each first information content object in the latest first preset time;

carrying out missing value smoothing treatment on each click rate score to ensure that information content objects without click rate scores also obtain corresponding click rate scores and obtain click rate scores after the missing value smoothing treatment;

and determining the information content object to be recommended based on the click rate score of each first information content object after the missing value smoothing processing.

Based on the scheme of the embodiment, the click rate scores within the latest first preset time are obtained for the first preset number of first information content objects with the latest time, and the missing value processing is performed on the basis, so that the click rate conditions of a certain number of newly released information content objects can be combined, the first information content objects without click rate scores can also have a certain click rate score and also have a probability of being recommended, the problem of cold start of newly released information content due to the fact that no historical data exists is solved, effective recommendation can be performed on the new content, the new content can enter a recommendation process, and the quality of information content recommendation is improved.

A method for determining information content to be recommended, the method comprising:

acquiring second information content objects within the latest second predetermined time;

carrying out frequency statistics on the second information content object to obtain first frequency of each first word in the second information content object;

acquiring a second frequency of each first word in the information content objects within the latest third predetermined time, wherein the third predetermined time is longer than the second predetermined time;

determining content hotspot scores of the second information content objects based on the first frequency and the second frequency of each first word segmentation contained in the second information content objects;

and determining an information content object to be recommended in the second information content objects based on the content hotspot scores of the second information content objects.

According to the scheme of the embodiment, the content hotspot score of each second information content object is obtained by counting the first frequency, within the latest second predetermined time, of each first word segment contained in the second information content object, and the second frequency of each first word segment within the latest third predetermined time. Information content objects carrying hotspots can thus be found among the second information content objects published within the latest second predetermined time, so that hotspot content in newly published information content is mined and the quality of information content recommendation is improved.

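A method for determining information content to be recommended, the method comprising:

acquiring a first predetermined number of first information content objects that are most recent in time, and second information content objects within the latest second predetermined time;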

for a first preset number of first information content objects, acquiring the click rate score of each first information content object in the latest first preset time; carrying out missing value smoothing treatment on each click rate score to obtain the click rate score of each first information content object after missing value smoothing treatment;

performing frequency statistics on the second information content objects to obtain a first frequency of each first word in the second information content objects within the second predetermined time and a second frequency of each first word within the latest third predetermined time; determining a content hotspot score of each second information content object based on the first frequency and the second frequency of each first word contained in the second information content object, wherein the third predetermined time is longer than the second predetermined time;

and determining the information content objects to be recommended based on the click rate scores of the first information content objects and the content hotspot scores of the second information content objects after the missing value smoothing processing.

Based on the scheme of the embodiment, the click rate scores within the latest first predetermined time are obtained for the first predetermined number of most recent first information content objects, and missing value processing is performed on this basis. The click rate conditions of a certain number of newly published information content objects can thus be combined, so that first information content objects that have not been clicked also have a click rate score and a probability of being recommended, which solves the cold start problem of newly published information content that has no historical data. Meanwhile, for each second information content object within the latest second predetermined time, the content hotspot score is calculated by counting the first frequency of the first word segments it contains within the second predetermined time and the second frequency of those first word segments within the third predetermined time, so that information content objects carrying hotspots can be found among the second information content objects within the latest second predetermined time and hot content in newly published information content can be mined. In this way, the cold start problem is solved, new content can be effectively recommended and enter the recommendation process, real-time hotspots can be mined, and the overall recommendation quality is improved.

In one embodiment, after obtaining the first predetermined number of first information content objects that are most recent in time, and before obtaining the click rate score of each of the first information content objects within the latest first predetermined time, the method further includes: removing, from the first predetermined number of first information content objects, the first information content objects that contain sensitive words.

In one embodiment, after obtaining the first predetermined number of first information content objects that are most recent in time, and before obtaining the click rate score of each of the first information content objects within the latest first predetermined time, the method further includes: removing, from the first predetermined number of first information content objects, the first information content objects whose content length is smaller than a predetermined length threshold.

In one embodiment, for each of the first information content objects, before obtaining the click rate score of each of the first information content objects within the latest first predetermined time, the method further includes:

writing each of the first information content objects into a time reversed sequence.

In one embodiment, obtaining the click-through rate score of each of the first information content objects within the latest first predetermined time comprises:

acquiring the number of clicks and the number of exposures of each first information content object in the latest first preset time;

and calculating and obtaining the click rate score of each first information content object in the latest first preset time based on the click number and the exposure number.

In one embodiment, performing confidence smoothing on the click rate score to obtain a click rate score after the confidence smoothing includes: performing confidence smoothing on the click rate score according to the exposure amount, wherein the larger the exposure amount, the smaller the penalty applied to the corresponding click rate score.

In one embodiment, performing frequency statistics on the second information content object to obtain the first frequency of each first word segmentation in the second information content object includes:

performing frequency statistics on each second information content object by using a unigram language model and a bigram language model to obtain the first frequency of each first word segmentation in the second information content object.
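As a minimal sketch of the frequency statistics described above, the following example counts unigrams and bigrams over documents that are assumed to be already segmented into words. The function name `ngram_counts` and the representation of a bigram as two tokens joined by a space are illustrative assumptions, not details given in the application.

```python
from collections import Counter
from typing import Iterable, List


def ngram_counts(tokenized_docs: Iterable[List[str]]) -> Counter:
    """Count unigrams and bigrams over pre-segmented documents.

    ASSUMPTION: each document is already a list of word segments; the
    application does not specify the segmentation method.
    """
    counts: Counter = Counter()
    for tokens in tokenized_docs:
        # Unigrams: every word segment on its own.
        counts.update(tokens)
        # Bigrams: every pair of adjacent word segments, joined by a space.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


# Hypothetical documents published within the second predetermined time.
docs = [["new", "phone", "launch"], ["phone", "launch", "event"]]
first_frequency = ngram_counts(docs)
```

The same function could be applied to the documents of the longer third predetermined time to obtain the second frequency.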

In one embodiment, the method further comprises the steps of: acquiring a third information content object within the latest third preset time;

and carrying out frequency statistics on the third information content object to obtain a second frequency of each second participle in the third information content object, wherein each second participle comprises the first participle.

In one embodiment, obtaining a second frequency of each of the first terms in the information content object within a third predetermined time comprises:

and performing frequency statistics on the content of each third information object in the latest third preset time by using the unigram language model and the bigram language model to obtain the second frequency of each second participle in the third information content object, wherein each second participle comprises the first participle.

In one embodiment, determining the content hotspot score of each of the second informational content objects based on the word segmentation hotspot scores of each of the first words contained in the second informational content object includes:

and summing the word segmentation hotspot scores of the first word segmentations contained in the second information content object to obtain the content hotspot score of the information content object.
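The summation above may be sketched as follows. The passage does not state how the word segmentation hotspot score is derived from the first frequency and the second frequency, so the ratio used in `word_hotspot_score` (recent count divided by the add-one smoothed long-window count) is purely an illustrative assumption, as are all function names and the choice to count each distinct word segment once.

```python
from typing import Dict, List


def word_hotspot_score(first_freq: int, second_freq: int) -> float:
    # ASSUMPTION: hotness is the recent-window count relative to the
    # long-window count, with add-one smoothing to avoid division by zero.
    return first_freq / (second_freq + 1)


def content_hotspot_score(
    tokens: List[str],
    first_frequency: Dict[str, int],
    second_frequency: Dict[str, int],
) -> float:
    """Sum the hotspot scores of the distinct word segments in one object."""
    return sum(
        word_hotspot_score(first_frequency.get(t, 0), second_frequency.get(t, 0))
        for t in set(tokens)  # ASSUMPTION: each distinct segment counts once
    )


# Hypothetical frequencies: "a" is 3/(5+1) = 0.5 and "b" is 1/(1+1) = 0.5.
score = content_hotspot_score(["a", "b", "a"], {"a": 3, "b": 1}, {"a": 5, "b": 1})
```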

In one embodiment, determining the first information content object to be recommended based on the recommendation probability value of each first information content object after the normalization processing includes:

determining a recommendation probability interval in which each first information content object is located based on the recommendation probability value of each first information content object;

and acquiring a randomly selected first probability value, and determining a first information content object corresponding to a recommendation probability interval corresponding to the first probability value as a first information content object to be recommended.
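One possible reading of the recommendation probability intervals is that the objects partition the range from 0 to the total probability into consecutive intervals whose widths equal their recommendation probability values, and that the object whose interval contains the randomly selected value is chosen. The sketch below follows that reading; the function name and the optional `draw` parameter (used to make the example deterministic) are assumptions.

```python
import bisect
import itertools
import random
from typing import Optional, Sequence


def pick_by_probability_interval(
    object_ids: Sequence[str],
    probabilities: Sequence[float],
    draw: Optional[float] = None,
) -> str:
    """Return the object whose probability interval contains the draw."""
    if not object_ids or len(object_ids) != len(probabilities):
        raise ValueError("object_ids and probabilities must be non-empty and aligned")
    # Upper bound of each object's interval is the running sum of probabilities.
    upper_bounds = list(itertools.accumulate(probabilities))
    if draw is None:
        # Randomly selected probability value, scaled to the total mass.
        draw = random.random() * upper_bounds[-1]
    index = bisect.bisect_right(upper_bounds, draw)
    # Guard against a draw equal to the final upper bound.
    return object_ids[min(index, len(object_ids) - 1)]


ids = ["a", "b", "c"]
probs = [0.2, 0.5, 0.3]  # intervals: [0, 0.2), [0.2, 0.7), [0.7, 1.0)
chosen = pick_by_probability_interval(ids, probs)
```

With this design, objects with a larger recommendation probability value own a wider interval and are selected more often, while objects with a small value still retain some chance of being recommended.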

In one embodiment, determining the second information content object to be recommended based on the recommendation probability values of the second information content objects after the normalization processing includes:

determining a recommendation probability interval in which each second information content object is located based on the recommendation probability value of each second information content object;

and acquiring a randomly selected second probability value, and determining a second information content object corresponding to the recommendation probability interval corresponding to the second probability value as a second information content object to be recommended.

In one embodiment, determining the information content object to be recommended based on the recommendation probability values of the second information content objects after the normalization processing includes:

and acquiring a randomly selected probability value, and determining a second information content object corresponding to the recommendation probability interval corresponding to the probability value as the information content to be recommended.

An apparatus for determining information content to be recommended, comprising:

a first object acquisition module for acquiring a first predetermined number of first information content objects that are most recent in time;

the click rate score determining module is used for acquiring the click rate score of each first information content object in the latest first preset time for each first information content object;

the missing value smoothing processing module is used for carrying out missing value smoothing processing on each click rate score so that information content objects without click rate scores also obtain corresponding click rate scores and click rate scores after the missing value smoothing processing are obtained;

and the first object to be recommended determining module is used for determining the information content objects to be recommended based on the click rate scores of the first information content objects after the missing value smoothing processing.

An apparatus for determining information content to be recommended, comprising:

the second object acquisition module is used for acquiring a second information content object in the latest second preset time;

the frequency determining module is used for carrying out frequency statistics on the second information content object to obtain the first frequency of each first word in the second information content object, and for acquiring a second frequency of each first word in the information content objects within the latest third predetermined time, wherein the third predetermined time is longer than the second predetermined time;

the content hotspot score determining module is used for determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first word segmentation contained in the second information content object;

and the second object to be recommended determining module is used for determining the information content objects to be recommended in the second information content objects based on the content hotspot scores of the second information content objects.

An apparatus for determining information content to be recommended, comprising:

the object acquisition module is used for acquiring a first preset number of first information content objects with the latest time and second information content objects within a second latest preset time;

the click score determining module is used for acquiring the click rate score of each first information content object within the latest first preset time for a first preset number of first information content objects; carrying out missing value smoothing treatment on each click rate score to obtain the click rate score of each first information content object after missing value smoothing treatment;

the hotspot score determining module is used for carrying out frequency statistics on the second information content object to obtain a first frequency of each first word in the second information content object in the second preset time and a second frequency of each first word in the third preset time; determining a content hotspot score of each second information content object based on a first frequency and a second frequency of each first word segmentation contained in the second information content object, wherein the third preset time is greater than the second preset time;

and the recommendation object determining module is used for determining the information content objects to be recommended based on the click rate scores of the first information content objects and the content hotspot scores of the second information content objects after the missing value smoothing processing.

A computer device comprising a memory storing a computer program and a processor implementing the steps comprised in any of the methods as described above when the computer program is executed.

A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps comprised in any of the methods as described above.

Drawings

FIG. 1 is a diagram of an application environment of a method for determining content of information to be recommended in an embodiment;

FIG. 2 is a flowchart illustrating a method for determining content of information to be recommended according to an embodiment;

FIG. 3 is a flowchart illustrating a method for determining content of information to be recommended according to another embodiment;

FIG. 4 is a flowchart illustrating a method for determining content of information to be recommended according to another embodiment;

FIG. 5 is a schematic diagram illustrating a comparison of a recommendation made using the method of the embodiment of the present application and a conventional method in a specific example;

FIG. 6 is a schematic diagram illustrating a comparison of a recommendation made by an embodiment of the method of the present application and a conventional method in a specific example;

FIG. 7 is a block diagram illustrating an exemplary embodiment of an apparatus for determining content of information to be recommended;

FIG. 8 is a block diagram showing the structure of an apparatus for determining information content to be recommended according to another embodiment;

FIG. 9 is a block diagram showing the structure of an apparatus for determining information content to be recommended according to another embodiment;

FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The method for determining the information content to be recommended provided by the application can be applied to the application environment shown in FIG. 1. In some embodiments, the application environment may involve only the terminal 102 and the server 104. The terminal 102 communicates with the server 104 through a network, and a terminal user can access or obtain related information content published on the server 104 through the terminal 102, such as posts, articles, and other user-generated content. The server 104 recommends relevant information content to the terminal 102 when the terminal 102 accesses the server 104, or actively recommends relevant information content to the terminal 102; for example, the recommended information content may be displayed on a home page provided to the terminal 102 when it accesses the server 104, or on another relevant page. The server 104 may determine the information content to be recommended for the terminal 102 in combination with various policies.

In some embodiments, the application environment may involve the terminal 102, the server 104, and the server 106. The terminal 102 communicates with the server 104 through the network, and the terminal user can access or obtain related information content published on the server 104 through the terminal 102. The server 104 communicates with the server 106 through a network; the server 106 may determine the information content to be recommended based on relevant information from the server 104, such as log data, and provide the determined information content to be recommended to the server 104. When the terminal 102 accesses the server 104, the server 104 determines the information content recommended to the terminal 102 in combination with the information content to be recommended provided by the server 106. The information content that the server 104 recommends to the terminal 102 may be the information content to be recommended as provided by the server 106, or may be obtained by applying related recommendation configuration to the information content provided by the server 106, for example, recommending it on a plurality of pages, such as a home page and other pages, or combining it with the results of other recommendation strategies. The server 104 and the server 106 may be separate servers or may be the same server. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.

In one embodiment, as shown in fig. 2, a method for determining information content to be recommended is provided, which is described by taking the method as an example applied to the server 104 or the server 106 in fig. 1, and includes the following steps S201 to S204.

Step S201: obtain a first predetermined number of first information content objects that are most recent in time.

Here, the time may refer to the publication time of an information content object, for example, the publication time of user-generated content such as a post or an article. When the first predetermined number of most recent first information content objects are obtained, the most recently published information content objects, up to the first predetermined number, may be obtained in reverse chronological order.

The first predetermined number may be set according to actual technical needs. In some embodiments, it may be determined in conjunction with the number of information content objects stored per statistics period, and may also take the current traffic situation into account. For example, assuming a period of 1 hour and that N information content objects are stored per hour, a value between 4N and 6N may be used as the first predetermined number. If N is large, the multiple may be reduced appropriately; for example, a value between 3N and 4N may be used as the first predetermined number. If N is small, the multiple may be increased appropriately; for example, a value between 5N and 7N may be used as the first predetermined number. It will be appreciated that in other embodiments, other ways of determining the first predetermined number may be used.

In some embodiments, after obtaining the first predetermined number of most recent first information content objects, the first information content objects that contain sensitive words may be further removed from them. Sensitive words generally refer to terms that violate relevant laws and regulations or harm the health of the network environment and, in some scenarios, also include terms that are unsuitable for the relevant pages of the current server. Removing the first information content objects that contain sensitive words prevents them from being spread through recommendation, which helps to improve the network environment.

In some embodiments, after obtaining the first predetermined number of most recent first information content objects, the first information content objects whose content length is smaller than a predetermined length threshold may be further removed from them. A content length smaller than the predetermined length threshold indicates that the first information content object does not actually contain enough information. Removing such objects reduces the number of first information content objects involved in subsequent processing and improves processing efficiency.

It is understood that, in the actual processing, after the first predetermined number of first information content objects with the most recent time are obtained, the first information content objects containing the sensitive words and the first information content objects with the content length smaller than the predetermined length threshold value in the first predetermined number of first information content objects may be removed at the same time.

After the above processing, the finally obtained first information content objects may be written into a time-reversed list, that is, a list sorted in reverse chronological order, so as to obtain the first information content objects arranged in reverse chronological order.
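The candidate preparation described for step S201 may be sketched as follows, under stated assumptions: the `ContentObject` record, its field names, and the use of simple substring matching for sensitive words are all hypothetical and are not specified by the application.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class ContentObject:
    # Hypothetical record layout for an information content object.
    object_id: str
    text: str
    publish_time: float  # e.g. a Unix timestamp


def prepare_candidates(
    objects: Iterable[ContentObject],
    sensitive_words: Set[str],
    min_length: int,
) -> List[ContentObject]:
    """Drop sensitive or too-short objects, then sort newest first."""
    kept = [
        obj
        for obj in objects
        if len(obj.text) >= min_length
        # ASSUMPTION: a plain substring check stands in for sensitive-word matching.
        and not any(word in obj.text for word in sensitive_words)
    ]
    # Time-reversed list: most recently published object first.
    kept.sort(key=lambda obj: obj.publish_time, reverse=True)
    return kept


candidates = prepare_candidates(
    [
        ContentObject("a", "a long enough harmless post", 3.0),
        ContentObject("b", "short", 4.0),
        ContentObject("c", "this post mentions badword here", 5.0),
        ContentObject("d", "another perfectly fine article", 6.0),
    ],
    sensitive_words={"badword"},
    min_length=10,
)
```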

Step S202: for each first information content object, obtain the click rate score of the first information content object within the latest first predetermined time.

It should be understood that, when the first predetermined number of first information content objects include objects that contain sensitive words or whose content length is smaller than the predetermined length threshold, the number of first information content objects in reverse chronological order involved here is, after the removal operation, smaller than the first predetermined number.

In an embodiment, the step of obtaining the click rate score of each of the first information content objects within the latest first predetermined time may specifically include the step S2021 and the step S2022.

Step S2021: obtain the number of clicks and the number of exposures of each first information content object within the latest first predetermined time.

The duration of the first predetermined time may be set according to actual technical requirements. In general, the first predetermined time should be short, so that newly published information content objects can be recommended, yet long enough to capture the interactions, such as clicks, likes and comments, that newly published information content objects receive within that time. For example, in one embodiment, the first predetermined time may be set to 5 minutes. It is understood that in other embodiments, the first predetermined time may be set to other time periods.

The number of clicks and the number of exposures of each first information content object within the latest first predetermined time may be obtained from related log data, for example, an mta log of the server. The mta log of the server records all log data of the information content objects related to user behavior, for example, the exposure action and exposure time when information about an information content object is displayed on a page shown to the user, the click action and click time when the user clicks on the information content object, the like action and like time when the user likes the information content object, and the comment action, comment content and comment time when the user comments on the information content object. Thus, by accessing the mta log, the number of clicks and the number of exposures of each information content object within the first predetermined time can be filtered and counted according to the identification of the information content object and the time of the user action (e.g., the click time).
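The filtering and counting described above may be sketched as follows. The application does not describe the log record layout, so the `(object_id, action, timestamp)` tuple and the action names `"click"` and `"exposure"` are hypothetical; the default window of 300 seconds corresponds to the 5-minute example given for the first predetermined time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (object identifier, action name, timestamp in seconds).
LogEvent = Tuple[str, str, float]


def count_clicks_and_exposures(
    events: Iterable[LogEvent],
    now: float,
    window_seconds: float = 300.0,
) -> Dict[str, Dict[str, int]]:
    """Count clicks and exposures per object within the latest time window."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"click": 0, "exposure": 0}
    )
    window_start = now - window_seconds
    for object_id, action, timestamp in events:
        # Keep only click/exposure actions that fall inside the window;
        # other actions such as likes and comments are ignored here.
        if action in ("click", "exposure") and window_start <= timestamp <= now:
            counts[object_id][action] += 1
    return dict(counts)


events = [
    ("p1", "exposure", 100.0),
    ("p1", "click", 110.0),
    ("p1", "exposure", -500.0),  # outside the window
    ("p2", "exposure", 120.0),
    ("p2", "like", 130.0),       # not a click or exposure
]
window_counts = count_clicks_and_exposures(events, now=200.0)
```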

Step S2022: calculate the click rate score of each first information content object within the latest first predetermined time based on the number of clicks and the number of exposures.

In an embodiment of the application, the click rate of the first information content object within the first predetermined time may be obtained by calculating the ratio of the number of clicks to the number of exposures, and the click rate score within the latest first predetermined time may then be determined from the click rate. In some embodiments, the click rate may be used directly as the click rate score. In other embodiments, the click rate score may be obtained by further calculation based on the click rate. For example, if the number of clicks of a first information content object D within the latest first predetermined time T1 is D1 and the number of exposures is b1, the corresponding click rate is D1/b1. The click rate score of D in T1 may be D1/b1, or D1/b1 may be multiplied by a fixed value, for example 100, so that 100 × D1/b1 is used as the click rate score of D in T1.
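The calculation in this example may be sketched as follows. Returning `None` for an object without exposures is an assumption made here so that such objects can later be treated as having a missing click rate score; the function and parameter names are illustrative.

```python
from typing import Optional


def click_rate_score(
    clicks: int, exposures: int, scale: float = 1.0
) -> Optional[float]:
    """Return scale * clicks / exposures, or None when there is no exposure.

    ASSUMPTION: an object with no exposures has no click rate score yet,
    which is represented by None.
    """
    if exposures <= 0:
        return None
    return scale * clicks / exposures


# 3 clicks out of 60 exposures, scaled by the fixed value 100.
example_score = click_rate_score(3, 60, scale=100)
```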

In an embodiment, after the click rate score is obtained, confidence smoothing may further be performed on it to obtain a click rate score after the confidence smoothing. The confidence smoothing applies a smoothing penalty to the click rate score, so that the final click rate score better reflects the actual click behavior for the information content object. Specifically, the confidence smoothing may be performed on the click rate score according to the exposure amount: the larger the exposure amount, the more credible the corresponding click rate score and the smaller the penalty. In some specific examples, the confidence smoothing may be based on the Wilson confidence interval algorithm.
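One common way to apply a Wilson confidence interval to a click rate is to use the lower bound of the Wilson score interval as the smoothed score. The sketch below takes that approach as an assumption; the application only names the algorithm, and the default z value of 1.96 (roughly a 95% interval) is likewise an illustrative choice.

```python
import math


def wilson_lower_bound(clicks: int, exposures: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for clicks / exposures."""
    if exposures <= 0:
        raise ValueError("exposures must be positive")
    p = clicks / exposures          # observed click rate
    z2 = z * z
    denominator = 1 + z2 / exposures
    centre = p + z2 / (2 * exposures)
    margin = z * math.sqrt(
        p * (1 - p) / exposures + z2 / (4 * exposures * exposures)
    )
    return (centre - margin) / denominator


# Same observed click rate (0.5), different exposure amounts.
low_exposure = wilson_lower_bound(5, 10)
high_exposure = wilson_lower_bound(500, 1000)
```

Both calls share an observed click rate of 0.5, but the score for 10 exposures is pulled down much further than the score for 1000 exposures, which matches the stated behavior that a larger exposure amount incurs a smaller penalty.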

Step S203: perform missing value smoothing on the click rate scores, so that information content objects without a click rate score also obtain a corresponding click rate score, thereby obtaining the click rate scores after the missing value smoothing.

Through the missing value smoothing, the information content objects among the first information content objects that have no click rate score also obtain a corresponding click rate score and therefore a probability of being recommended, so that they can enter the recommendation library. Newly published information content objects can thus enter the recommendation process without an actual click rate, which solves the cold start problem of newly published content that has no historical data. In addition, by performing the missing value smoothing after the confidence smoothing, every first information content object has a corresponding click rate score and that score is credible, so that newly published information content objects have a recommendation probability and can enter the recommendation process.
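The passage above does not state which value is assigned to objects without a click rate score. As a minimal sketch under that caveat, the example below assumes the missing scores are filled with the mean of the available scores, falling back to a default when no object has a score; both choices and all names are illustrative assumptions.

```python
from typing import Dict, Optional


def smooth_missing_scores(
    scores: Dict[str, Optional[float]], default: float = 0.0
) -> Dict[str, float]:
    """Give every object a score; None marks a missing click rate score.

    ASSUMPTION: missing scores are replaced by the mean of the known
    scores, or by `default` when no score is known at all.
    """
    known = [s for s in scores.values() if s is not None]
    fill = sum(known) / len(known) if known else default
    return {oid: (fill if s is None else s) for oid, s in scores.items()}


smoothed = smooth_missing_scores({"a": 0.25, "b": None, "c": 0.75})
```

Under this assumption a newly published object with no clicks or exposures competes as an average candidate, rather than being excluded or ranked last.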

Step S204: determining the information content object to be recommended based on the click rate score of each first information content object after the missing value smoothing.

When determining the information content object to be recommended based on the click-through rate score after the missing value smoothing processing of each first information content object, the information content object to be recommended (also referred to as the first information content object to be recommended in the embodiment of the present application) may be determined directly based on the click-through rate score.

When the first information content object to be recommended is determined based on the click rate score, the determined click rate score interval may be obtained, and the first information content object with the click rate score in the click rate score interval is determined as the first information content object to be recommended.

In an embodiment, when determining the information content object to be recommended based on the click-through rate score after the missing value smoothing processing of each first information content object, the following manner may be adopted, and specifically, step S2041 and step S2042 may be included.

Step S2041: normalizing the click rate scores of the first information content objects after the missing value smoothing to obtain the normalized recommendation probability value of each first information content object.

Through the normalization, each click rate score is mapped to a decimal in the interval (0, 1). Any existing normalization method may be used.

Step S2042: determining the first information content object to be recommended based on the normalized recommendation probability value of each first information content object.

Based on the normalized recommendation probability values of the first information content objects, the first information content objects to be recommended may be determined in various possible ways. For example, by obtaining the determined recommendation probability interval, the first information content object with the recommendation probability value in the recommendation probability interval is determined as the first information content object to be recommended.

In some embodiments, determining the first information content object to be recommended based on the recommendation probability value of each first information content object after the normalization processing may be performed in the following manner.

First, the recommendation probability interval in which each first information content object is located is determined based on the recommendation probability value of each first information content object. In this way, each first information content object is mapped to a corresponding recommendation probability interval. For example, assuming that there are 5 first information content objects D1-D5 and that their recommendation probability values are all 0.2, the recommendation probability intervals mapped for them may be (0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), [0.8,1), respectively.

Then, a first probability value selected randomly is obtained, and a first information content object corresponding to a recommendation probability interval corresponding to the first probability value is determined as a first information content object to be recommended.

The randomly selected first probability value may be a single value, or a plurality of first probability values may be obtained by selecting multiple times; the first information content object whose recommendation probability interval contains a first probability value is then taken as a first information content object to be recommended. For example, assuming that the randomly selected first probability values are 0.3 and 0.7, they fall in the intervals [0.2,0.4) and [0.6,0.8), so the corresponding objects D2 and D4 are taken as the first information content objects to be recommended.
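The interval mapping and random selection described above can be sketched as follows. The sketch assumes sum normalization (each score divided by the total), so that the intervals tile (0, 1); the function names `normalize` and `pick_by_interval` are hypothetical.

```python
import bisect
import itertools
import random
from typing import Dict, List, Sequence


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalization: divide each score by the sum of all scores."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {obj: s / total for obj, s in scores.items()}


def pick_by_interval(probabilities: Dict[str, float], draws: Sequence[float]) -> List[str]:
    """Map each object to a consecutive interval and return the object hit by each draw."""
    objects = list(probabilities)
    # Upper bound of each object's interval, e.g. 0.2, 0.4, 0.6, 0.8, 1.0.
    upper_bounds = list(itertools.accumulate(probabilities[o] for o in objects))
    picked = []
    for draw in draws:
        index = bisect.bisect_right(upper_bounds, draw)
        # Guard against floating-point rounding at the very top of the range.
        picked.append(objects[min(index, len(objects) - 1)])
    return picked


probabilities = normalize({f"D{i}": 1.0 for i in range(1, 6)})
example = pick_by_interval(probabilities, [0.3, 0.7])          # the worked example
random_pick = pick_by_interval(probabilities, [random.random()])
```

Because an object's interval width equals its recommendation probability value, objects with higher scores are selected more often, while every object with a non-zero score keeps some chance of being recommended.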

Based on the embodiment of the application, the click rate scores within the latest first preset time are obtained for the first preset number of first information content objects with the latest time, and the missing value processing is performed on the basis, so that the click rate conditions of a certain number of newly released information content objects can be combined, the first information content objects without click rate scores can also have a certain click rate score and also have a recommended probability, the problem of cold start of newly released information contents due to the fact that no historical data exists is solved, effective recommendation can be performed on the new contents, the new contents can enter a recommendation process, and the quality of information content recommendation is improved.

Referring to fig. 5, the method for determining information content to be recommended in the above embodiment was compared with two conventional recommendation methods, namely the guaranteed-base recommendation method provided by the service side and the reverse-chronological recommendation method provided by the service side, using data collected over 10 days after each of the three methods was adopted. The method provided in the embodiment of the present application outperforms the two conventional methods on multiple indexes such as PV click rate, UV click rate, user approval rate, user comment rate, and total amount of exposed content. While also solving the new content exposure problem that arises with the two conventional methods, its total amount of exposed content basically covers all new content every day. The PV click rate is the ratio of the number of clicks to the number of exposures, the UV click rate is the ratio of the number of clicked users to the number of exposures, the user approval rate is the ratio of the number of approved users to the total number of exposed users, and the user comment rate is the ratio of the number of commented users to the total number of exposed users.

In one embodiment, as shown in fig. 3, a method for determining information content to be recommended is provided, which is described by taking the method as an example applied to the server 104 or the server 106 in fig. 1, and includes the following steps S301 to S305.

Step S301: a second information content object within a second, most recent, predetermined time is obtained.

Here, the time may refer to a distribution time of the information content object, for example, a distribution time of the information content object of the user-generated content such as a post or an article, that is, a distribution time of each second information content object is within the second predetermined time.

The duration of the second predetermined time may be set according to actual technical requirements. In general, the second predetermined time should be short, so that hot content can be mined from newly published information content objects and suddenly occurring hot events can be detected. However, it should still have a certain length, so as to balance the reading-related user behavior on the newly published information content objects in that time, such as clicks, likes, and comments. For example, in one embodiment, the second predetermined time may be set to 2 hours. It is understood that in other embodiments, the second predetermined time may be set to other durations.

Step S302: performing frequency statistics on the second information content objects to obtain the first frequency of each first word segment in the second information content objects.

When frequency statistics are performed on the second information content objects, the word segment frequencies may be counted with the aid of a word segmentation model. In some embodiments, a unigram language model and a bigram language model may be used to perform frequency statistics on each second information content object, so as to obtain the first frequency of each first word segment in the second information content objects. The unigram language model is a unary segmentation model: it treats each single word in the second information content object as one word segment and counts frequencies on that basis. The bigram language model is a binary segmentation model: it treats every two adjacent words in the second information content object as one word segment and counts frequencies on that basis. Combining the two models avoids both the drop in overall accuracy caused by segmentation errors and the attenuation of effect caused by missing word segments, so that the final frequency statistics cover more word segments and the accuracy of the final result is improved.
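A minimal sketch of combined unigram and bigram frequency statistics is given below. It assumes each information content object has already been split into a list of words by some tokenizer, which is outside the scope of the sketch; the function name `count_unigrams_and_bigrams` and the tuple representation of word segments are assumptions of this example.

```python
from collections import Counter
from typing import Iterable, List


def count_unigrams_and_bigrams(documents: Iterable[List[str]]) -> Counter:
    """Count single words and adjacent word pairs over all documents.

    Each word segment is stored as a tuple: ("flood",) for a unigram and
    ("flood", "warning") for a bigram, so the two kinds never collide.
    """
    counts: Counter = Counter()
    for tokens in documents:
        counts.update((token,) for token in tokens)   # unigram segments
        counts.update(zip(tokens, tokens[1:]))        # bigram segments
    return counts


first_frequency = count_unigrams_and_bigrams(
    [["flood", "warning", "issued"], ["flood", "warning"]]
)
```

The same function can be applied to the objects of a longer time window to obtain the second frequency of each word segment.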

Step S303: acquiring the second frequency of each first word segment in the information content objects within the latest third predetermined time, wherein the third predetermined time is greater than the second predetermined time.

The duration of the third predetermined time may be set according to actual technical requirements. In general, it is set much longer than the second predetermined time, but not too long, so that mining hot content in newly issued information content objects can still reveal suddenly occurring hot events. For example, in one embodiment, the third predetermined time may be set to 48 hours. It is understood that in other embodiments, the third predetermined time may be set to other durations.

When the second frequency of each first word segment in the information content objects within the latest third predetermined time is obtained, frequency statistics may be performed on each third information content object whose release time is within the latest third predetermined time, so as to obtain the second frequency of each second word segment in the third information content objects. Since the third predetermined time is greater than the second predetermined time, the third information content objects include the second information content objects and the second word segments include the first word segments, so the second frequency of each first word segment can be obtained by matching it with the identical second word segment.

In a specific example, the unigram language model and the bigram language model may be used to perform frequency statistics on each third information content object within the latest third predetermined time, so as to obtain the second frequency of each second word segment in the third information content objects.

Step S304: determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first word segment contained in the second information content object.

When determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first participle included in the second information content object, various possible ways may be adopted as long as the hotspot change condition of the second information content object in the second predetermined time relative to the third predetermined time can be embodied.

In an embodiment, determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first participle included in the second information content object may include the following steps S3041 and S3042.

Step S3041: determining the word segment hotspot score of each first word segment based on the first frequency and the second frequency of that first word segment.

In one embodiment, the word segment hotspot score may be the ratio of a first sum to a second sum, where the first sum is the sum of the first frequency and an inverse frequency, the second sum is the sum of a frequency difference and the inverse frequency, the inverse frequency is the reciprocal of the sum of the first frequency and the second frequency, and the frequency difference is the second frequency minus the first frequency.

Denote the first frequency of a first word segment as $F_{rencent}$, the second frequency as $F_{old}$, and the word segment hotspot score as $score_{gram}$. The determination of the word segment hotspot score can then be formulated as:

$$score_{gram} = \frac{F_{rencent} + likelyscore}{F_{old} - F_{rencent} + likelyscore}$$

where the inverse frequency $likelyscore$ is $likelyscore = \frac{1}{F_{rencent} + F_{old}}$.
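The formula for the hotspot score of a single word segment can be sketched in Python as follows. The function name `word_hotspot_score` and the input check are assumptions of this example; the check relies on the fact that the third predetermined time contains the second predetermined time, so the second frequency is never smaller than the first frequency.

```python
def word_hotspot_score(f_recent: int, f_old: int) -> float:
    """Hotspot score of one word segment.

    f_recent: first frequency (count within the second predetermined time).
    f_old:    second frequency (count within the longer third predetermined time).
    """
    if f_recent <= 0 or f_old < f_recent:
        raise ValueError("expected 0 < f_recent <= f_old")
    likelyscore = 1.0 / (f_recent + f_old)   # inverse frequency term
    return (f_recent + likelyscore) / (f_old - f_recent + likelyscore)


# A word segment seen only in the recent window versus one that was already common.
burst = word_hotspot_score(4, 4)
steady = word_hotspot_score(4, 40)
```

A word segment whose occurrences all fall inside the recent window receives a large score, while a word segment that was already frequent in the longer window receives a small one; the inverse frequency term keeps the denominator positive when the two frequencies are equal.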

It is understood that in other embodiments, the word segmentation hotspot score may be determined in other manners.

Step S3042: determining the content hotspot score of each second information content object based on the word segment hotspot score of each first word segment contained in the second information content object.

In one embodiment, the word segment hotspot scores of the first word segments contained in the second information content object may be directly summed, and the resulting sum is used as the content hotspot score of the second information content object.
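The direct summation can be sketched as below. The sketch assumes that each distinct word segment of an object contributes once to the sum and that both frequency tables are plain dictionaries keyed by word segment; the function name `content_hotspot_score` is hypothetical.

```python
from typing import Dict, Iterable


def content_hotspot_score(
    segments: Iterable[str],
    first_frequency: Dict[str, int],
    second_frequency: Dict[str, int],
) -> float:
    """Sum the word segment hotspot scores of the distinct segments of one object."""
    total = 0.0
    for segment in set(segments):  # assumption: each distinct segment counts once
        f_recent = first_frequency[segment]
        f_old = second_frequency[segment]
        likelyscore = 1.0 / (f_recent + f_old)
        total += (f_recent + likelyscore) / (f_old - f_recent + likelyscore)
    return total


recent_counts = {"flood": 4, "the": 4}
older_counts = {"flood": 4, "the": 40}
score = content_hotspot_score(["flood", "the", "flood"], recent_counts, older_counts)
```

An object containing several suddenly frequent word segments therefore accumulates a high content hotspot score, while common words add little.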

Step S305: determining the information content object to be recommended among the second information content objects based on the content hotspot scores of the second information content objects.

When determining the information content object to be recommended based on the content hotspot scores of the second information content objects, the information content object to be recommended (also referred to as a second information content object to be recommended in this embodiment of the present application) may be determined directly based on the content hotspot scores.

When the second information content object to be recommended is determined based on the content hotspot score, the determined content hotspot score interval may be obtained, and the second information content object with the content hotspot score in the content hotspot score interval is determined as the second information content object to be recommended.

In an embodiment, when determining the information content object to be recommended based on the content hotspot score of each second information content object, the following method may be adopted, and specifically, step S3051 and step S3052 may be included.

Step S3051: normalizing the content hotspot scores of the second information content objects to obtain the normalized recommendation probability value of each second information content object.

Step S3052: determining the second information content object to be recommended based on the normalized recommendation probability value of each second information content object.

Based on the normalized recommendation probability values of the second information content objects, the second information content objects to be recommended may be determined in various possible ways. For example, by obtaining the determined recommendation probability interval, the second information content object with the recommendation probability value in the recommendation probability interval is determined as the second information content object to be recommended.

In some embodiments, determining the second information content object to be recommended based on the recommendation probability values of the second information content objects after the normalization processing may be performed in the following manner.

Firstly, a recommendation probability interval in which each second information content object is located is determined based on the recommendation probability value of each second information content object. Thus, each second information content object may be mapped to a corresponding recommendation probability interval based on the recommendation probability value of each second information content object. For example, if there are 10 second information content objects d1-d10 and their recommendation probability values are all 0.1, the recommendation probability intervals mapped for them may be (0,0.1), [0.1,0.2), [0.2,0.3), [0.3,0.4), [0.4,0.5), [0.5,0.6), [0.6,0.7), [0.7,0.8), [0.8,0.9), [0.9,1), respectively.

Then, a randomly selected second probability value is obtained, and the second information content object corresponding to the recommendation probability interval in which the second probability value falls is determined as a second information content object to be recommended.

The randomly selected second probability value may be a single value, or a plurality of second probability values may be obtained by selecting multiple times; the second information content object whose recommendation probability interval contains a second probability value is then taken as a second information content object to be recommended. For example, assuming that the randomly selected second probability values are 0.3, 0.6, and 0.7, they fall in the intervals [0.3,0.4), [0.6,0.7), and [0.7,0.8), so the corresponding objects d4, d7, and d8 are taken as the second information content objects to be recommended.

In an embodiment, as shown in fig. 4, a method for determining information content to be recommended is provided, which is described by taking its application to the server 104 or the server 106 in fig. 1 as an example. The method determines the information content to be recommended by combining the click rate score of the information content objects and the content hotspot score of the information content objects described in the above embodiments, and includes the following steps S401 to S405.

Step S401: a first predetermined number of first information content objects most recent in time is obtained, as well as second information content objects within a second most recent predetermined time.

The time here may refer to a distribution time of the information content object, for example, a distribution time of the information content object of the user-generated content such as a post, an article, etc., that is, a distribution time of each second information content object is within the second predetermined time. When the first predetermined number of first information content objects with the latest time are obtained, the first predetermined number of information content objects which are released most recently may be obtained specifically in a time-reversed manner.

The first predetermined number may be set according to actual technical requirements. In some embodiments, it may be determined in combination with the number of information content objects entering the pool in each statistical period, and may also take the current traffic situation into account. In some embodiments, after the first predetermined number of most recent first information content objects are obtained, those containing sensitive words may further be removed. Sensitive words generally refer to terms that violate relevant laws and regulations or harm the health of the network environment; in some scenarios they also include terms unsuitable for appearing on the relevant pages of the current server. Removing the first information content objects containing sensitive words prevents such objects from being recommended and spread, which helps optimize the network environment.

Step S402: for the first predetermined number of first information content objects, acquiring the click rate score of each first information content object within the latest first predetermined time, and performing missing value smoothing on each click rate score to obtain the click rate score of each first information content object after the missing value smoothing.

In an embodiment, the click rate of the first information content object in the first predetermined time may be obtained by obtaining the number of clicks and the number of exposures of each first information content object in the latest first predetermined time, and calculating the ratio of the number of clicks and the number of exposures, and then determining the corresponding click rate score in the latest first predetermined time by combining the click rate.
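The click rate computation can be sketched as follows. Returning `None` for an object without exposures, so that it can later be handled by missing value smoothing, is an assumption of this example, as is the function name `click_rate`.

```python
from typing import Optional


def click_rate(clicks: int, exposures: int) -> Optional[float]:
    """Ratio of clicks to exposures within the first predetermined time.

    Assumption: an object with no exposures has no click rate, which is
    signalled by returning None instead of dividing by zero.
    """
    if exposures <= 0:
        return None
    return clicks / exposures


rate = click_rate(3, 60)
new_object_rate = click_rate(0, 0)
```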

The duration of the first predetermined time may be set according to actual technical requirements, for example, in one embodiment, the first predetermined time may be set to 5 minutes. It is understood that in other embodiments, the first predetermined time may be set to other time periods.

In some embodiments, the click rate may be used directly as the click rate score. In other embodiments, the click rate score may be obtained by performing further calculations based on the click rate.

In an embodiment, after the click rate score of each of the first predetermined number of first information content objects within the latest first predetermined time is obtained, confidence smoothing may further be applied to the obtained click rate score to obtain a confidence-smoothed click rate score, and missing value smoothing is then performed on the confidence-smoothed click rate score. Confidence smoothing acts as a smoothing penalty, so that the final click rate score better reflects the actual click behavior for the information content object. Specifically, the confidence smoothing may be performed on the click rate score according to the exposure amount: the larger the exposure amount, the more credible the corresponding click rate score and the smaller the penalty. In some specific examples, the confidence smoothing may be based on the Wilson confidence interval algorithm.

Step S404: performing frequency statistics on the second information content object to obtain a first frequency of each first word in the second information content object within the second preset time and a second frequency of each first word in the second information content object within a third latest preset time; and determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first word segmentation contained in the second information content object, wherein the third preset time is greater than the second preset time.

When frequency statistics are performed on the second information content objects, the word segment frequencies may be counted with the aid of a word segmentation model. In some embodiments, a unigram language model and a bigram language model may be used to perform frequency statistics on each second information content object, so as to obtain the first frequency of each first word segment in the second information content objects. The unigram language model is a unary segmentation model: it treats each single word in the second information content object as one word segment and counts frequencies on that basis. The bigram language model is a binary segmentation model: it treats every two adjacent words in the second information content object as one word segment and counts frequencies on that basis. Combining the two models allows the final frequency statistics to cover more word segments, which improves the accuracy of the final result.

In one embodiment, determining the content hotspot score of each second information content object based on the first frequency and the second frequency of each first participle included in the second information content object may be performed in the following manner.

First, based on the first frequency and the second frequency of each first participle, determining a participle hotspot score of each first participle.

$$score_{gram} = \frac{F_{rencent} + likelyscore}{F_{old} - F_{rencent} + likelyscore}$$

where the inverse frequency $likelyscore$ is $likelyscore = \frac{1}{F_{rencent} + F_{old}}$.

Secondly, determining the content hotspot score of each second information content object based on the word segmentation hotspot score of each first word segmentation contained in the second information content object.

Step S405: and determining the information content objects to be recommended based on the click rate scores of the first information content objects and the content hotspot scores of the second information content objects after the missing value smoothing processing.

In some embodiments, when determining the information content object to be recommended based on the click-through rate score of each of the first information content objects and the content hotspot score of each of the second information content objects after the missing value smoothing processing, the following manner may be adopted.

Normalizing the click rate scores of the first information content objects after missing value smoothing processing to obtain the recommendation probability values of the first information content objects after normalization processing; determining a first information content object to be recommended based on the recommendation probability value of each first information content object after normalization processing;

normalizing the content hotspot scores of the second information content objects to obtain the recommendation probability values of the second information content objects after normalization; and determining the second information content object to be recommended based on the recommendation probability value of each second information content object after the normalization processing.

The determined information content object to be recommended comprises the first information content object to be recommended and the second information content object to be recommended.

Through the normalization processing operation, each click rate score or content hotspot score can be mapped to a corresponding decimal of (0, 1) interval. The specific normalization processing mode can be performed by adopting the existing normalization processing mode.

In step S401, the acquired first information content objects are the first preset number of objects whose publication time is the latest, the second information content objects are the objects within the latest second predetermined time, and the first information content objects and the second information content objects are derived from the same information content object pool, so that there may be a situation that the first information content objects and the second information content objects are duplicated, that is, finally, a certain information content object may be included in both the first information content object to be recommended and the second information content object to be recommended.

At this time, in some embodiments, the first information content object to be recommended and the second information content object to be recommended may be subjected to deduplication processing, and then be used as the final information content object to be recommended.
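A minimal sketch of the deduplication is shown below. It assumes each information content object is represented by a unique identifier string and that the order of first appearance should be preserved; the function name `merge_without_duplicates` is hypothetical.

```python
from typing import Iterable, List


def merge_without_duplicates(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Merge two candidate lists, keeping each object once in first-seen order."""
    # dict.fromkeys drops repeated keys while preserving insertion order.
    return list(dict.fromkeys([*first, *second]))


final_objects = merge_without_duplicates(["D2", "D4"], ["D4", "D7"])
```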

In some embodiments, since the first information content object to be recommended is mainly used to recommend a new information content object and the second information content object to be recommended is mainly used to mine a hotspot in the new information content object, the first information content object to be recommended and the second information content object to be recommended may be recommended at the same time to realize recommendation of different types of information content objects, for example, recommendation may be performed in different areas or in different manners.

Referring to fig. 6, the method for determining information content to be recommended that combines the click rate score and the content hotspot score in the above embodiment was compared with the method described above that is based on the click rate score only, using data collected over 11 days with each method. Both methods can discover and recommend well-performing content among new information content objects. The method that combines the click rate score and the content hotspot score additionally screens the pool of new information content objects for hotspots; according to indexes collected during an activity period in which the daily number of posts doubled, it greatly improves the user approval rate, the user comment rate, the UV click rate, and the PV click rate.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided an information content to be recommended determining apparatus including:

a first object obtaining module 701, configured to obtain a first predetermined number of first information content objects closest in time;

a click rate score determining module 702, configured to obtain, for each first information content object, a click rate score of each first information content object within a latest first predetermined time;

a missing value smoothing module 703, configured to perform missing value smoothing on each click rate score, so that an information content object without a click rate score also obtains a corresponding click rate score, and obtains a click rate score after the missing value smoothing;

a first to-be-recommended object determining module 704, configured to determine an information content object to be recommended based on the click rate score of each first information content object after the missing value smoothing.

In one embodiment, the apparatus further comprises a filtering module, configured to remove first information content objects containing sensitive words from the first predetermined number of first information content objects.

In an embodiment, the filtering module is further configured to remove first information content objects of the first predetermined number of first information content objects, the content length of which is smaller than a predetermined length threshold.

In one embodiment, the click rate score determining module 702 obtains the number of clicks and the number of exposures of each of the first information content objects in the latest first predetermined time, and calculates and obtains the click rate score of each of the first information content objects in the latest first predetermined time based on the number of clicks and the number of exposures.
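As an illustration of this step, the sketch below computes the score as the plain ratio of clicks to exposures, which is the conventional definition of click rate; the passage above only states that the score is based on the two counts, so the exact formula is an assumption. The function name `click_rate_score` and the use of `None` to mark a missing score are likewise hypothetical.

```python
def click_rate_score(clicks, exposures):
    """Return clicks / exposures, or None when the object was never exposed.

    A None result marks a missing score that a later missing value
    smoothing step would have to fill in (assumed convention).
    """
    if exposures <= 0:
        return None
    return clicks / exposures
```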

In one embodiment, the apparatus further comprises a confidence smoothing module, configured to perform confidence smoothing on the click rate score obtained by the click rate score determining module 702 to obtain a click rate score after the confidence smoothing processing. In this case, the missing value smoothing module 703 performs missing value smoothing on the click rate score output by the confidence smoothing module.
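The ordering described above (confidence smoothing first, missing value smoothing second) might look like the sketch below. The concrete techniques are swapped in for illustration and are not confirmed by this passage: confidence smoothing is shown as the Wilson score lower bound, and missing value smoothing as filling absent scores with the mean of the observed ones. All function names and the `default` parameter are hypothetical.

```python
import math


def wilson_lower_bound(clicks, exposures, z=1.96):
    """Assumed confidence smoothing: lower bound of the Wilson score interval.

    Objects with few exposures are pulled toward a lower score than
    objects with the same raw click rate but many exposures.
    """
    if exposures <= 0:
        return None  # no data: left for missing value smoothing
    p = min(clicks / exposures, 1.0)
    z2 = z * z
    centre = p + z2 / (2 * exposures)
    margin = z * math.sqrt(p * (1 - p) / exposures + z2 / (4 * exposures ** 2))
    return (centre - margin) / (1 + z2 / exposures)


def smooth_missing(scores, default=0.0):
    """Assumed missing value smoothing: fill None with the mean observed score."""
    observed = [s for s in scores.values() if s is not None]
    fill = sum(observed) / len(observed) if observed else default
    return {key: (fill if s is None else s) for key, s in scores.items()}
```

With this choice, an object clicked once in one exposure scores well below an object clicked 100 times in 100 exposures, even though both have a raw click rate of 1.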

In one embodiment, the first to-be-recommended object determining module 704 includes:

a normalization processing module, configured to normalize the click rate scores of the first information content objects after the missing value smoothing processing to obtain a normalized recommendation probability value for each first information content object; and

a first determining module, configured to determine the first information content object to be recommended based on the normalized recommendation probability value of each first information content object.

In an embodiment, the first determining module is configured to determine, based on the recommendation probability value of each first information content object, a recommendation probability interval in which each first information content object is located, acquire a randomly selected first probability value, and determine, as the first information content object to be recommended, the first information content object corresponding to the recommendation probability interval corresponding to the first probability value.
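The normalization and interval-based selection described above can be sketched as follows. The sketch assumes that normalization means dividing each score by the sum of all scores and that the recommendation probability intervals are consecutive half-open sub-intervals of [0, 1); the names `normalize` and `pick_by_interval` are hypothetical.

```python
import bisect
import itertools
import random


def normalize(scores):
    """Turn non-negative scores into probabilities that sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        # Degenerate case (assumption): fall back to a uniform distribution.
        return {key: 1.0 / len(scores) for key in scores}
    return {key: value / total for key, value in scores.items()}


def pick_by_interval(probabilities, draw=None):
    """Select the object whose probability interval contains the drawn value."""
    if not probabilities:
        raise ValueError("no candidate objects")
    ids = list(probabilities)
    # Upper bounds of consecutive intervals [0, p1), [p1, p1 + p2), ...
    upper_bounds = list(itertools.accumulate(probabilities[i] for i in ids))
    if draw is None:
        draw = random.random()  # randomly selected probability value in [0, 1)
    index = bisect.bisect_right(upper_bounds, draw)
    # Guard against floating-point rounding in the last upper bound.
    return ids[min(index, len(ids) - 1)]
```

Because the interval width equals the recommendation probability, higher-scoring objects are chosen more often, while low-scoring objects still have a non-zero chance of being recommended.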

In one embodiment, as shown in fig. 8, an apparatus for determining information content to be recommended is provided, including:

a second object obtaining module 801, configured to obtain second information content objects within the most recent second predetermined time;

a frequency determining module 802, configured to perform frequency statistics on the second information content objects to obtain a first frequency of each first participle in the second information content objects, and to acquire a second frequency of each first participle in information content objects within the most recent third predetermined time, wherein the third predetermined time is greater than the second predetermined time;

a content hotspot score determining module 803, configured to determine a content hotspot score of each second information content object based on a first frequency and a second frequency of each first participle included in the second information content object;

a second to-be-recommended object determining module 804, configured to determine an information content object to be recommended among the second information content objects based on the content hotspot scores of the second information content objects.

In one embodiment, the frequency determining module 802 performs frequency statistics on each second information content object by using a unigram language model and a bigram language model to obtain a first frequency of each first word segmentation in the second information content object.
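A rough sketch of unigram and bigram frequency statistics is given below. It assumes each content object has already been split into a list of word segments and that "frequency" means a raw occurrence count; both are assumptions, as is the function name `ngram_frequencies`.

```python
from collections import Counter


def ngram_frequencies(tokenized_objects):
    """Count unigrams and bigrams over a collection of tokenized objects.

    Keys are tuples: 1-tuples for unigrams, 2-tuples for bigrams.
    """
    counts = Counter()
    for tokens in tokenized_objects:
        counts.update((token,) for token in tokens)  # unigrams
        counts.update(zip(tokens, tokens[1:]))       # adjacent pairs (bigrams)
    return counts
```

Running the same function over the objects of the shorter window and over those of the longer window would yield the first and second frequencies, respectively.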

In one embodiment, the second object obtaining module 801 further obtains third information content objects released within the most recent third predetermined time, and the frequency determining module 802 performs frequency statistics on each third information content object by using a unigram language model and a bigram language model to obtain a second frequency of each second participle in the third information content objects, where the second participles include the first participles.

In one embodiment, the content hotspot score determining module 803 determines the word segmentation hotspot score of each first word segmentation based on the first frequency and the second frequency of each first word segmentation, and determines the content hotspot score of each second information content object based on the word segmentation hotspot score of each first word segmentation included in the second information content object.

In one embodiment, the content hotspot score of the second information content object is a sum of the word segmentation hotspot scores of the first word segmentations included in the second information content object.
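The two-level scoring above can be sketched as follows. The per-word formula follows one reading of the ratio recited in claim 7: (first frequency + inverse frequency) divided by (frequency difference + inverse frequency), where the inverse frequency is 1 / (first frequency + second frequency) and the frequency difference is the second frequency minus the first. The function names, the handling of unseen words, and the assumption that the second frequency is never smaller than the first (because the longer window contains the shorter one) are illustrative choices.

```python
def word_hotspot_score(first_freq, second_freq):
    """Hotspot score of one word segment, per one reading of claim 7."""
    if second_freq < first_freq:
        # Assumption: the longer window includes the shorter one.
        raise ValueError("second frequency must not be below first frequency")
    total = first_freq + second_freq
    if total <= 0:
        return 0.0  # word never seen in either window (assumed convention)
    inverse = 1.0 / total
    return (first_freq + inverse) / ((second_freq - first_freq) + inverse)


def content_hotspot_score(words, first_freqs, second_freqs):
    """Content hotspot score: sum of the scores of the words it contains."""
    return sum(
        word_hotspot_score(first_freqs.get(w, 0), second_freqs.get(w, 0))
        for w in words
    )
```

Under this reading, a word whose occurrences all fall inside the recent window (first frequency equal to second frequency) receives a much higher score than a word that was already common over the longer window.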

In one embodiment, the second to-be-recommended object determining module 804 includes:

a normalization processing module, configured to normalize the content hotspot scores of the second information content objects to obtain a normalized recommendation probability value for each second information content object; and

a second determining module, configured to determine the second information content object to be recommended based on the normalized recommendation probability value of each second information content object.

In one embodiment, the second determining module determines, based on the recommendation probability value of each second information content object, a recommendation probability interval in which each second information content object is located, acquires a randomly selected second probability value, and determines, as the second information content object to be recommended, the second information content object corresponding to the recommendation probability interval corresponding to the second probability value.

In one embodiment, as shown in fig. 9, an apparatus for determining information content to be recommended is provided, including:

an object obtaining module 901, which may specifically include the first object obtaining module 701 and the second object obtaining module 801 described above;

a click score determining module 902, which may specifically include the click rate score determining module 702 and the missing value smoothing module 703 described above;

a hotspot score determining module 903, which may specifically include the frequency determining module 802 and the content hotspot score determining module 803 described above; and

a recommendation object determining module 904, which may specifically include the first to-be-recommended object determining module 704 and the second to-be-recommended object determining module 804 described above.

In one embodiment, the click score determining module 902 further comprises a filtering module, configured to remove first information content objects containing sensitive words from the first predetermined number of first information content objects.

In one embodiment, the click score determining module 902 further includes a confidence smoothing module, configured to perform confidence smoothing on the click rate score obtained by the click rate score determining module 702 to obtain a click rate score after the confidence smoothing processing. In this case, the missing value smoothing module 703 performs missing value smoothing on the click rate score output by the confidence smoothing module.

For specific limitations of the apparatus for determining information content to be recommended, reference may be made to the above limitations of the method for determining information content to be recommended, and details are not repeated here. All or part of the modules in the apparatus may be implemented by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing log data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for determining information content to be recommended.

Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor executes the computer program to implement the method for determining the content of the information to be recommended in any one of the above embodiments.

In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of these technical features should be considered within the scope of the present specification as long as it contains no contradiction.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for determining information content to be recommended is characterized by comprising the following steps:

2. The method of claim 1, wherein after obtaining the click-through rate score of each of the first information content objects within the most recent first predetermined time, and before performing missing value smoothing on each of the click-through rate scores, the method further comprises the step of:

and carrying out confidence smoothing on the click rate score to obtain the click rate score after the confidence smoothing.

3. The method according to claim 1, wherein determining the information content object to be recommended based on the click-through rate score after the missing value smoothing processing of each of the first information content objects comprises:

normalizing the click rate score after the missing value smoothing processing to obtain the recommendation probability value of each first information content object after the normalization processing;

and determining the information content object to be recommended based on the recommendation probability value of each first information content object after the normalization processing.

4. The method according to claim 3, wherein determining the information content object to be recommended based on the normalized recommendation probability value of each first information content object comprises:

determining, based on the recommendation probability value of each first information content object, a recommendation probability interval in which each first information content object is located;

and acquiring a randomly selected probability value, and determining a first information content object corresponding to a recommendation probability interval corresponding to the probability value as the information content to be recommended.

5. A method for determining information content to be recommended is characterized by comprising the following steps:

6. The method of claim 5, wherein determining the content hotspot score of each of the second informational content objects based on the first frequency and the second frequency of each of the first participles included in each of the second informational content objects comprises:

determining a word segmentation hotspot score of each first word segmentation based on the first frequency and the second frequency of each first word segmentation;

and determining the content hotspot score of each second information content object based on the word segmentation hotspot score of each first word segmentation contained in the second information content object.

7. The method of claim 6, wherein the participle hotspot score is a ratio of a first sum to a second sum, the first sum being a sum of the first frequency and an inverse frequency, the second sum being a sum of a frequency difference and the inverse frequency, the inverse frequency being an inverse of the sum of the first frequency and the second frequency, and the frequency difference being a difference between the second frequency and the first frequency.

8. The method of claim 5, wherein determining the object to be recommended from the second information content objects based on the content hotspot score of each of the second information content objects comprises:

normalizing the content hotspot scores of the second information content objects to obtain the recommendation probability values of the second information content objects after normalization;

and determining the information content object to be recommended based on the recommendation probability value of each second information content object after the normalization processing.

9. A method for determining information content to be recommended is characterized by comprising the following steps:

10. The method of claim 9, wherein determining the information content objects to be recommended based on the click-through rate scores of the first information content objects and the content hotspot scores of the second information content objects after the missing value smoothing process comprises:

normalizing the content hotspot scores of the second information content objects to obtain the recommendation probability values of the second information content objects after normalization; determining a second information content object to be recommended based on the recommendation probability value of each second information content object after normalization processing;

the information content object to be recommended comprises the first information content object to be recommended and the second information content object to be recommended.