CN113763082A - Information pushing method and device

Info

Publication number: CN113763082A
Application number: CN202010921654.3A
Authority: CN
Inventors: 王蒙; 李建星; 史晓斌
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2021-12-07

Abstract

The application provides an information pushing method and device. The method comprises: obtaining a source item pool; obtaining a target item pool matching the configured scene restriction information and similar item range restriction information; determining, in the target item pool, items similar to the items in the source item pool, and generating a similar item pool; and pushing the items in the similar item pool. The method can reduce the cost of information pushing and avoid wasting resources.

Description

Information pushing method and device

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to an information pushing method and apparatus.

Background

In the e-commerce industry, the variety of items keeps growing, and mining large numbers of items is a common requirement. Mining similar items is a function used across many application scenarios, such as recommendation recall, item pool expansion, activity item selection, item pool migration, and labeling of new items.

Currently, a separate similarity model is run for each scenario to obtain the items similar to a given item.

In the process of implementing the present application, the inventors found that running a separate similarity model for each scenario wastes resources, because the same functional modules are developed repeatedly.

Disclosure of Invention

In view of this, the present application provides an information pushing method and apparatus, which can reduce the information pushing cost and avoid the waste of resources.

In order to solve the technical problem, the technical scheme of the application is realized as follows:

in one embodiment, an information pushing method is provided, the method comprising:

obtaining a source article pool;

acquiring a target object pool matched with the configured scene limitation information and the similar object range limitation information;

determining items similar to the items in the source item pool in the target item pool, and generating a similar item pool;

pushing the articles in the pool of similar articles.

In another embodiment, an information pushing apparatus is provided, the apparatus comprising: a configuration unit, a source article pool acquiring unit, a target pool acquiring unit, a similar article pool generating unit, and a pushing unit;

the configuration unit is used for configuring scene limitation information and similar article range limitation information;

the source article pool acquiring unit is used for acquiring a source article pool;

the target pool acquiring unit is used for acquiring a target object pool matched with the configured scene limitation information and the similar object range limitation information;

the similar article pool generating unit is configured to determine, in the target article pool acquired by the target pool acquiring unit, articles similar to the articles in the source article pool acquired by the source article pool acquiring unit, and generate a similar article pool;

and the pushing unit is used for pushing the articles in the similar article pool generated by the similar article pool generating unit.

In another embodiment, an electronic device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the information pushing method when executing the program.

In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the information pushing method.

According to the above technical solution, the target item pool is determined by configuring the scene restriction information and the similar item range restriction information of the application scenario, and the items similar to each item in the source item pool are matched in the target item pool to form the similar item pool, which is then pushed. By configuring scene parameters, a single general model can recommend similar items in multiple scenarios, which reduces the cost of information pushing and avoids wasting resources.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic diagram of an information pushing process in an embodiment of the present application;

Fig. 2 is a schematic flow chart of determining, in the target item pool, items similar to the items in the source item pool according to the second embodiment of the present application;

Fig. 3 is a schematic flow chart of determining the attribute similarity of two items according to the second embodiment of the present application;

Fig. 4 is a schematic flow chart of determining, in the target item pool, items similar to the items in the source item pool according to the third embodiment of the present application;

Fig. 5 is a schematic flow chart of selecting a word vector model according to an embodiment of the present application;

Fig. 6 is a schematic diagram of an information pushing process in the fourth embodiment of the present application;

Fig. 7 is a diagram illustrating edit distance and similarity;

Fig. 8 is a diagram illustrating comparison of similar numbers;

Fig. 9 is a schematic flow chart of determining, in the target item pool, items similar to the items in the source item pool according to the fifth embodiment of the present application;

Fig. 10 is a schematic diagram of an information pushing process in the eighth embodiment of the present application;

Fig. 11 is a schematic diagram of an apparatus for implementing the above technique in an embodiment of the present application;

Fig. 12 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.

The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.

The embodiment of the application provides an information pushing method, and by setting scene information, more articles can be obtained for seed articles in various scenes through article similarity, so that the requirements of various scenes can be met.

Example scenarios include: activity item selection, such as item selection for an annual shopping festival; recommendation recall, in which item similarity serves as one of the recall channels; and diffusion from seed items to more items that may attract new purchases. They also include similar migration and diffusion of an item pool. In one case, the seed item pool has the desired characteristics but is too small, because it was obtained under constraints such as business conditions, user-item relationship matching, item attributes, user attributes, and actual resource conditions, so it needs to be diffused into a larger item pool. In another case, an item pool for a specific time and space only needs to be migrated to obtain an item pool for the same scenario in a different space; for example, a Jingdong convenience store migrates last year's online Spring Festival item pool to offline stores for this year's Spring Festival.

The specific application scenarios of the embodiment of the present application are not limited to the above examples; as long as the user provides the scene information, more items can be recommended to the user within the scenario defined by that scene information.

The information pushing method is applied to an information pushing device, and the device can be a PC, a server and the like.

The following describes in detail a process of implementing information pushing in the embodiment of the present application with reference to the accompanying drawings.

Example one

Referring to fig. 1, fig. 1 is a schematic view of an information pushing process in an embodiment of the present application. The method comprises the following specific steps:

Step 101, a source item pool is obtained.

The source item pool serves as the set of seed items, from which more similar items are found.

The obtaining of the source object pool in this step includes:

obtaining a source article pool which is directly uploaded;

or selecting the items matched with the configured parameters in the platform item pool and generating the source item pool.

That is to say, in the embodiment of the present application, a source item pool directly uploaded by a user may be used; alternatively, a configuration page may be provided so that the user configures parameters such as item attributes, the time at which items were put on sale, and item names, and the source item pool is generated according to the configured parameters.

Step 102, acquiring a target object pool matched with the configured scene limitation information and the similar object range limitation information.

The target object pool is an object pool for searching for objects similar to the objects in the source object pool, generally contains fewer objects than the platform object pool, and is the platform object pool filtered by the scene limitation information and the similar object range limitation information.

In this step, obtaining a target object pool matched with the configured scene restriction information and the similar object range restriction information includes:

acquiring a directly uploaded target object pool matched with the configured scene limitation information and the similar object range limitation information;

or selecting an article matched with the configured scene limitation information and the similar article range limitation information from the platform article pool, and generating a target article pool.

In a specific implementation, the target item pool may be collected by the user according to the scene restriction information of the application scenario and the similar item range restriction information and then uploaded directly; alternatively, the user may configure the scene restriction information and the similar item range restriction information, and the information pushing device matches the target item pool from the platform item pool.

The scene limitation information is used for limiting the scenes of the pushed article application, and different scenes have different target article pool limitations.

For example, the target item pool restriction for a recommendation recall scenario for browsing a store is: items that are on sale, in stock, and belong to the specific store;

the target item pool restriction for a Qixi Festival item selection scenario is: on sale, in stock, of a specific category, with traffic in the last 30 days, and so on.

The scene restriction information listed for the two scenarios above is only an example. In a specific implementation, neither the application scenarios nor the scene restriction information is limited to those given above, and both may be set according to actual needs.

Wherein the similar article range restriction information includes one or any combination of the following:

third-level catalog, item brand, item name.

When the similar article range limiting information is a third-level catalog, the corresponding articles are articles in the same third-level catalog;

when the similar article range limiting information is an article brand, the corresponding articles are articles under the same brand;

when the similar article range limiting information is an article name, the corresponding articles are articles under the same article name;

when the similar article range limiting information is a third-level catalog and an article brand, the corresponding articles are articles in the same third-level catalog and under the same article brand;

when the similar article range limiting information is a third-level catalog and an article name, the corresponding articles are articles in the same third-level catalog and under the same article name;

when the similar article range limiting information is an article brand and an article name, the corresponding articles are articles under the same article brand and the same article name;

when the similar article range limiting information is a third-level catalog, an article brand and an article name, the corresponding articles are articles in the same third-level catalog, under the same article brand and under the same article name.

The third-level catalog is the catalog at the third level of the item classification hierarchy, and is determined according to the classification levels of the items on the user's platform;

the item brand refers to items under the same brand, such as the milk products under the Sanyuan milk brand;

the item name refers to the name of the category to which the item belongs, such as milk.

There are a plurality of items in the target item pool for finding items similar to the items in the source item pool.
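The filtering described in step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Item` fields, the `build_target_pool` helper, and the reading of "same third-level catalog / brand / name" as "same as at least one source item" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical item record; field names are assumptions for this sketch.
    sku: str
    catalog3: str   # third-level catalog
    brand: str      # item brand
    name: str       # item (category) name, e.g. "milk"
    on_sale: bool
    in_stock: bool


def build_target_pool(platform_pool, source_pool, scene_filter, range_keys):
    """Keep platform items that pass the scene restriction and share the
    configured range keys with at least one source item."""
    allowed = {tuple(getattr(s, k) for k in range_keys) for s in source_pool}
    return [
        item for item in platform_pool
        if scene_filter(item)
        and tuple(getattr(item, k) for k in range_keys) in allowed
    ]


platform = [
    Item("A1", "dairy", "BrandX", "milk", True, True),
    Item("A2", "dairy", "BrandY", "milk", True, True),
    Item("A3", "dairy", "BrandX", "yogurt", True, False),
    Item("A4", "snacks", "BrandX", "chips", True, True),
]
source = [Item("S1", "dairy", "BrandX", "milk", True, True)]

# Scene restriction: on sale and in stock.
# Range restriction: same third-level catalog and same brand.
target = build_target_pool(
    platform,
    source,
    scene_filter=lambda it: it.on_sale and it.in_stock,
    range_keys=("catalog3", "brand"),
)
print([it.sku for it in target])  # ['A1']
```

Passing the scene restriction as a predicate and the range restriction as a tuple of field names keeps the same function usable across scenarios, which is the point of configuring these two kinds of information separately.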

Step 103, determining the items similar to the items in the source item pool in the target item pool, and generating a similar item pool.

In the embodiment of the application, whether two items are similar may be determined by calculating the attribute similarity of the two items, by calculating the title similarity of the two items, or by calculating a fused similarity that combines the attribute similarity and the title similarity.

Step 104, pushing the articles in the similar article pool.

In this application, pushing the items in the similar item pool may mean displaying them locally or sending them to a requesting device, which is not limited in this application.

When the items are pushed, related information of the items may also be pushed, including one or any combination of the following:

the address information of the items, whether the items directly reuse historical results, and the items sorted according to a preset rule.

Wherein,

the address information of the items is the address information of the similar item pool, such as a URL, an HDFS address, or a HIVE table;

whether the items in the similar item pool directly reuse historical results may also be indicated;

the items in the similar item pool may also be sorted according to one or more rules before being pushed, for example:

in descending order of similarity;

by sales volume in the last N days;

by order volume in the last N days, etc.

The above are only examples of the related information that can be pushed together with the items. The user can specify the item information to be pushed according to actual needs, and the related information to be pushed is not limited in the embodiment of the application.
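Sorting the similar item pool by one or several rules before pushing can be sketched as below. The record fields (`similarity`, `sales_n_days`, `orders_n_days`) and the `rank_similar_items` helper are hypothetical names chosen for illustration.

```python
def rank_similar_items(records, sort_keys):
    """Sort records in descending order of each key; earlier keys take priority."""
    return sorted(records, key=lambda r: tuple(-r[k] for k in sort_keys))


similar_pool = [
    {"sku": "A1", "similarity": 0.91, "sales_n_days": 120, "orders_n_days": 40},
    {"sku": "A2", "similarity": 0.95, "sales_n_days": 80, "orders_n_days": 35},
    {"sku": "A3", "similarity": 0.91, "sales_n_days": 300, "orders_n_days": 90},
]

# Rule 1: similarity from high to low; rule 2: sales volume in the last N days.
ranked = rank_similar_items(similar_pool, ("similarity", "sales_n_days"))
print([r["sku"] for r in ranked])  # ['A2', 'A3', 'A1']
```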

In this embodiment, the source item pool and the target item pool are determined by configuring the scene restriction information and the similar item range restriction information of the application scenario, and the items similar to each item in the source item pool are matched in the target item pool to form the similar item pool, which is then pushed. By configuring scene parameters, a single general model can recommend similar items in multiple scenarios, which reduces the cost of information pushing and avoids wasting resources.

Example two

In the first embodiment, the implementation of determining, in the target item pool, items similar to the items in the source item pool and generating a similar item pool specifically includes:

Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a process of determining items similar to the items in the source item pool in the target item pool according to the second embodiment of the present application. The method comprises the following specific steps:

Step 201, a first item is selected from the source item pool.

The first item here is any item in the pool of source items.

In step 202, a second item is selected from the pool of target items.

The second item is any item in the target item pool.

Step 203, determining the attribute similarity of the first article and the second article.

The attribute similarity is calculated from the Jaccard similarity, the total number of items in the platform item pool, and the number of platform items having the attributes in the intersection of the attribute sets of the first item and the second item; the Jaccard similarity is calculated from the intersection and the union of the attributes of the first item and the second item;

the specific calculation process is as follows:

Referring to fig. 3, fig. 3 is a schematic flowchart of determining similarity of attributes of two articles according to the second embodiment of the present application. The method comprises the following specific steps:

Step 301, an item attribute set of a first item is obtained.

The set of item attributes ATTR1 of the first item1 in the embodiment of the present application is (ATTR11, ATTR12, …, ATTR1m), where m is the number of item attributes of the first item.

Step 302, an item attribute set of a second item is obtained.

The item attribute set ATTR2 of the second item2 in the embodiment of the present application is (ATTR21, ATTR22, …, ATTR2x), where x is the number of item attributes of the second item.

m and x may be the same or different.

Step 303, acquiring an intersection and a union of the item attribute sets of the first item and the second item.

ATTR = ATTR1 ∩ ATTR2, where ATTR is the intersection of ATTR1 and ATTR2; ATTRS = ATTR1 ∪ ATTR2, where ATTRS is the union of ATTR1 and ATTR2.

Step 304, calculating the Jaccard similarity according to the intersection and the union, and calculating the attribute similarity of the first item and the second item according to the Jaccard similarity, the total number of platform items, and the number of platform items having the attributes in the intersection.

The attribute similarity of the first item and the second item is calculated by a formula defined over the following quantities.

In the calculation, the attribute similarity of items is reformulated: all the attributes of each item are treated as a document and each attribute as a word, and the Jaccard similarity of the item pair (the first item and the second item) is calculated as $\mathrm{Jaccard}_{(item1,\,item2)} = |ATTR| / |ATTRS|$.

Here, l refers to the attribute types of the items, such as extension attributes, specification attributes, special attributes, and marketing attributes, where each attribute type corresponds to a plurality of item attributes; k is the number of attributes in the intersection ATTR; the per-attribute count is the number of items having the i-th attribute, which belongs to the j-th attribute type, in the intersection ATTR; and $N_j$ is the number of items under the j-th attribute type in the platform item pool. The platform item pool is the item pool generated from all items of the platform on which the information is pushed; if the method is applied on the Jingdong platform, the platform item pool is the item pool generated from all items of the Jingdong platform.
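The idea of combining the Jaccard similarity with IDF values of the shared attributes can be illustrated with a short sketch. The exact formula of this embodiment is not reproduced here; the combination used below (Jaccard similarity multiplied by the mean IDF of the attributes in the intersection, with add-one smoothing) is an assumption chosen only to show the effect, and the attribute strings and counts are made up.

```python
import math


def jaccard(attrs1, attrs2):
    """|intersection| / |union| of two attribute sets."""
    union = attrs1 | attrs2
    return len(attrs1 & attrs2) / len(union) if union else 0.0


def idf_weighted_similarity(attrs1, attrs2, doc_freq, total_items):
    """Jaccard similarity scaled by the mean IDF of the shared attributes.

    How the two factors are combined is an assumption of this sketch.
    doc_freq maps an attribute to the number of platform items having it.
    """
    shared = attrs1 & attrs2
    if not shared:
        return 0.0
    mean_idf = sum(
        math.log(total_items / (1 + doc_freq.get(a, 0))) for a in shared
    ) / len(shared)
    return jaccard(attrs1, attrs2) * mean_idf


total_items = 1000
doc_freq = {
    "style:other": 900,          # very common attribute
    "shelf_life:120d": 800,      # very common attribute
    "flavor:matcha": 10,         # rare attribute
    "flavor:vanilla": 50,
    "origin:Uji": 5,
}
item1 = {"style:other", "shelf_life:120d", "flavor:matcha"}
item2 = {"style:other", "shelf_life:120d", "flavor:vanilla"}  # shares common attrs
item3 = {"style:other", "flavor:matcha", "origin:Uji"}        # shares a rare attr

common_pair = idf_weighted_similarity(item1, item2, doc_freq, total_items)
rare_pair = idf_weighted_similarity(item1, item3, doc_freq, total_items)
print(jaccard(item1, item2), jaccard(item1, item3))
print(common_pair < rare_pair)
```

Both pairs have the same plain Jaccard similarity, but the pair that shares a rare attribute scores higher once the IDF factor is applied, which is the behaviour the embodiment is after.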

Step 204, if the attribute similarity is determined to be greater than a first preset threshold, determining the second article as an article similar to the first article.

And if the attribute similarity is not larger than a first preset threshold value, determining the second article as a dissimilar article.

The setting of the first preset threshold is set according to the requirements of the actual application scene, and the specific value set in the embodiment of the application is not limited.

When the second article is determined to be a similar article, adding the second article into the similar article pool; otherwise, the similar item pool is not added.
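Steps 201 to 204 amount to a pairwise comparison loop with a threshold, which might look like the sketch below. The `build_similar_pool` helper, the item representation, the plain Jaccard similarity used as a stand-in, and the threshold value 0.7 are all illustrative assumptions.

```python
def build_similar_pool(source_pool, target_pool, similarity, threshold):
    """Add each target item whose similarity to some source item exceeds
    the threshold; every target item is added at most once."""
    similar = []
    for first in source_pool:
        for second in target_pool:
            if similarity(first, second) > threshold and second not in similar:
                similar.append(second)
    return similar


def attr_jaccard(item_a, item_b):
    # Items are (sku, attribute_set) pairs in this sketch.
    a, b = item_a[1], item_b[1]
    return len(a & b) / len(a | b) if a | b else 0.0


source_pool = [("S1", {"a", "b", "c"})]
target_pool = [
    ("T1", {"a", "b", "c", "d"}),  # similarity 0.75
    ("T2", {"x", "y"}),            # similarity 0.0
    ("T3", {"a", "b"}),            # similarity about 0.67
]

pool = build_similar_pool(source_pool, target_pool, attr_jaccard, threshold=0.7)
print([sku for sku, _ in pool])  # ['T1']
```

Passing the similarity function as a parameter lets the same loop serve attribute similarity, title similarity, or a fused score.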

In this embodiment, the Jaccard similarity of the two items is calculated first and is then combined with the IDF values of the attributes in the intersection of the two items, which improves the accuracy of the item similarity calculation.

If the similarity of items were measured by the Jaccard similarity alone, the overall distribution of attributes would be ignored and the similarity results would be biased toward popular attributes.

For example, attributes and attribute values such as time to market: summer 2019, style: other, and shelf life: 120 days occur very frequently, so two items easily share them. Because these are general attributes with wide coverage, many items obtain a high Jaccard similarity from unimportant attributes, which leads to misjudging whether two items are similar.

Example three

Referring to fig. 4, fig. 4 is a schematic flow chart illustrating a process of determining items similar to the items in the source item pool in the target item pool according to the third embodiment of the present application. The method comprises the following specific steps:

Step 401, selecting a third article in the source article pool, and obtaining a title vector of the third article.

The third article is any article in the source article pool.

Step 402, selecting a fourth item in the target item pool, and obtaining a title vector of the fourth item.

The fourth item is any item in the target item pool.

In the embodiment of the application, the title vector of the title is obtained through the preset word vector model.

When a plurality of word vector models to be selected exist, one word vector model is selected as a preset word vector model by evaluating the effect of the title vector in the embodiment of the application.

The word vector models to be selected may be established at the time of selection or in advance. The embodiment of the present application gives the following word vector model establishment process as an example, without being limited to it:

2.2 billion item titles from an active item table are segmented with the LTP word segmentation tool of Harbin Institute of Technology, where the lexicon may use the brand words and attribute words of Jingdong items. A word vector model is established through the skip-gram algorithm of word2vec, from which title vectors can be obtained. The total number of distinct segmented words is 9.8 million, of which 3.2 million words have a frequency of more than 50, and the occurrences of these words cover 99.39% of all title segments; MinCount is therefore set to 50. The sliding window is set within [3, 10], the number of iterations is 5 to 20, partition is set to 1000, and the vector dimension may be set within [32, 200]. The skip-gram implemented on Spark directly reproduces the original C-language version of word2vec.

The process of specifically selecting the word vector model is as follows:

Referring to fig. 5, fig. 5 is a schematic flow chart of selecting a word vector model in the embodiment of the present application. The method comprises the following specific steps:

Step 501, based on a clustering algorithm, calculating an effect evaluation index value of the word vector model to be selected according to the category purity of each cluster.

In the embodiment of the application, the K-means++ algorithm may be adopted as the clustering algorithm, the cosine distance is adopted as the clustering distance, and the optimal number of clusters N is selected according to two clustering evaluation methods, the silhouette coefficient method and the elbow method.

N may be set empirically or may be statistically derived from sampled samples.

Taking N = 500 as an example, the clustering result has 500 clusters (cluster1, cluster2, cluster3, …, cluster500). The categories to which the items in each cluster belong are counted, and the effect evaluation index value of the word vector model is calculated by a formula defined over the following quantities:

N is the number of clusters; R is the number of category levels, which may be set to 3 (first-level, second-level, and third-level categories); $H(X_{nr})$ is the impurity of the r-th level category in the n-th cluster; $\alpha_n$ is the ratio of the number of items in the n-th cluster to the number of items in all clusters; and $p(x_{nr})$ is the ratio of the number of items belonging to the d-th category of the r-th level in the n-th cluster to the number of items belonging to the d-th category of the r-th level in all N clusters. Each category level contains a plurality of categories, that is, each category level corresponds to a plurality of category names.

Step 502, selecting the word vector model to be selected with the minimum effect evaluation index value as the preset word vector model.

Quantifying the evaluation of a word vector model in this way allows the quality of different word vector models to be compared.
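The model selection in steps 501 and 502 can be sketched as follows. The formula of the embodiment is not reproduced in this text, so the sketch assumes an entropy-style impurity, H = -Σ p·log p over the categories present in a cluster, weighted by the cluster's share of items and summed over category levels; p follows the ratio described above. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def evaluation_index(clusters):
    """Weighted category impurity of a clustering (lower is better).

    clusters: list of non-empty clusters; each cluster is a list of items,
    and each item is a tuple of category names, one per category level.
    The entropy form of the impurity is an assumption of this sketch.
    """
    n_levels = len(clusters[0][0])
    total_items = sum(len(c) for c in clusters)
    # For each level, count the items of each category across all clusters.
    global_counts = [
        Counter(item[r] for c in clusters for item in c) for r in range(n_levels)
    ]
    score = 0.0
    for cluster in clusters:
        alpha = len(cluster) / total_items  # cluster's share of all items
        impurity = 0.0
        for r in range(n_levels):
            local = Counter(item[r] for item in cluster)
            for category, count in local.items():
                # Share of this category's items that fall in this cluster.
                p = count / global_counts[r][category]
                impurity -= p * math.log(p)
        score += alpha * impurity
    return score


milk = ("food", "dairy", "milk")
pan = ("home", "kitchen", "frying pan")

# Clusterings produced from two hypothetical candidate word vector models.
model_a_clusters = [[milk, milk], [pan, pan]]   # categories kept together
model_b_clusters = [[milk, pan], [milk, pan]]   # categories scattered

scores = {
    "model_a": evaluation_index(model_a_clusters),
    "model_b": evaluation_index(model_b_clusters),
}
best_model = min(scores, key=scores.get)
print(best_model)  # model_a
```

A model whose title vectors keep each category inside a single cluster gets an index of zero under this assumed impurity, so choosing the minimum favours the model with the purest clusters.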

Step 403, determining the title similarity of the third article and the fourth article according to the title vector of the third article and the title vector of the fourth article.

After the title vectors of the two items are obtained, the method used to calculate the similarity of the two title vectors is not limited.
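Since the similarity measure is left open, the following sketch shows one common option, cosine similarity, on title vectors built by averaging word vectors. Averaging as the way to pool word vectors into a title vector is an assumption of this example, and the two-dimensional toy vectors are made up.

```python
import math


def title_vector(tokens, word_vectors):
    """Average the vectors of the known tokens (an assumed pooling choice)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy word vectors; a real model would come from word2vec training.
word_vectors = {
    "milk": [1.0, 0.0],
    "whole": [0.8, 0.2],
    "pan": [0.0, 1.0],
    "frying": [0.1, 0.9],
}

third_item = title_vector(["whole", "milk"], word_vectors)
fourth_item = title_vector(["milk"], word_vectors)
other_item = title_vector(["frying", "pan"], word_vectors)

sim_close = cosine_similarity(third_item, fourth_item)
sim_far = cosine_similarity(third_item, other_item)
print(sim_close > sim_far)  # True
```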

Step 404, when it is determined that the title similarity is greater than a second preset threshold, determining the fourth article as an article similar to the third article.

And if the title similarity is not larger than a second preset threshold value, determining the fourth article as a dissimilar article.

The setting of the second preset threshold is set according to the requirements of the actual application scene, and the specific value set in the embodiment of the application is not limited.

When the fourth article is determined to be a similar article, adding the fourth article into the similar article pool; otherwise, the similar item pool is not added.

Whether two items are similar is determined through the title similarity. Because the best title vector extraction model is selected, the accuracy of the title similarity calculation is greatly improved, and with it the accuracy of similar item matching.

Example four

Referring to fig. 6, fig. 6 is a schematic view of an information pushing flow in the fourth embodiment of the present application. The method comprises the following specific steps:

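Step 601, obtaining a source item pool.

The obtaining of the source item pool in this step includes:

obtaining a source item pool which is directly uploaded;

or selecting the items matched with the configured parameters in the platform item pool and generating the source item pool.

Step 602, a target item pool matched with the configured scene limitation information and the similar item range limitation information is obtained.

The similar item range limitation information includes one or any combination of the following:

third-level catalog, item brand, item name.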

Step 603, selecting a third article in the source article pool, and obtaining a title vector of the third article.

Step 604, selecting a fourth item in the target item pool, and obtaining a title vector of the fourth item.

The process of specifically selecting the word vector model is as follows:

the method comprises the following steps of firstly, calculating an effect evaluation index value of a word vector model to be selected according to the category purity of each cluster based on a clustering algorithm.

N may be set empirically or may be statistically derived from sampled samples.

wherein the content of the first and second substances,

n is the number of clusters, R is the class number, and can be set to be 3 (primary class, secondary class, tertiary class) if the number of clusters is N; h (X)_nr(ii) an impure grade for the nth cluster r class; alpha is alpha_nIs the ratio of the number of articles in the nth cluster to the number of articles in all clusters, p (x)_nr) For the ratio of the quantity of the articles corresponding to the d-th category of the r-th category of the nth cluster to the quantity of the articles corresponding to the d-th category of the r-th category in the N clusters, each level of categories is provided with a plurality of categories, namely each level of categoriesThe view corresponds to a plurality of category names.

Second, the candidate word vector model with the minimum effect evaluation index value is selected as the preset word vector model.

Step 605, determining the title similarity of the third article and the fourth article according to the title vector of the third article and the title vector of the fourth article.

Step 606, when it is determined that the title similarity is greater than a second preset threshold, determining the fourth item as a similar item, and adding the similar item to a similar item pool.

If the title similarity is not greater than the second preset threshold, the fourth item is determined to be a dissimilar item and is not added to the similar item pool.
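Steps 605 and 606 can be illustrated with a minimal sketch. It assumes that the title similarity is the cosine (COS) similarity of the two title vectors, which this embodiment mentions when discussing the threshold; the function names and the use of plain Python lists as vectors are hypothetical.

```python
import math

SECOND_THRESHOLD = 0.8474  # example value quoted in this embodiment


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length title vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat the pair as dissimilar
    return dot / (norm_u * norm_v)


def is_similar_by_title(vec_source, vec_target, threshold=SECOND_THRESHOLD):
    """Step 606: similar only if the similarity is strictly greater than the threshold."""
    return cosine_similarity(vec_source, vec_target) > threshold
```

A target item for which `is_similar_by_title` returns `True` would be added to the similar item pool; all others are left out.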

Step 607, filtering similar items of the same item in the similar item pool using the set TOPN value paired with a second preset threshold.

In a specific implementation, the application not only uses the similarity of the title vectors to determine whether items are similar, but also filters the similar items recalled by the title vectors by setting, as a pair, the number of similar items found for each item (the TOPN value) and the second preset threshold.

The optimal pairing obtained through training, experience and the like is: a TOPN value of 3, 5, 10, 50 or 100, with the second preset threshold being 0.8474.
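The paired filtering of step 607 can be sketched as follows. This is an illustration only: the function name and the representation of candidates as `(item_id, title_similarity)` tuples are assumptions, and the defaults simply reuse the values quoted above.

```python
def filter_similar_items(candidates, top_n=10, threshold=0.8474):
    """Filter the similar items recalled for ONE source item.

    candidates: list of (item_id, title_similarity) tuples.
    Keeps only candidates above the threshold, then at most top_n of them,
    ordered from most to least similar.
    """
    kept = [(item_id, sim) for item_id, sim in candidates if sim > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]
```

Applying the threshold before the TOPN cut means that a low-similarity candidate is never kept merely because fewer than N candidates were recalled.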

Examples are given below to illustrate the benefits of setting the above pairing results:

Taking the TOP 10000 similar items of 1000 items sampled over multiple rounds as an example, the distribution of similar and dissimilar results in each TOPN range is counted in order to select a suitable TOPN and similarity value.

Selecting a similarity threshold: the mean distributions of the edit distance similarity and the COS similarity in each TOPN segment are counted.

Referring to fig. 7, fig. 7 is a schematic diagram illustrating editing distance and similarity. In fig. 7, the lines corresponding to the filled circles indicate the similarity average values, and the lines corresponding to the open circles indicate the edit distance average values.

In fig. 7, as TOPN increases, the COS similarity mean decreases as the edit distance similarity mean decreases. This shows that the COS similarity of the title vectors reflects, to a certain extent, the trend of the edit distance similarity. The similarity values mostly fall around 1, 0.9 and 0.8, and results with a title similarity greater than 0.8 account for 92.23%. This does not affect the comparison of the most similar results for the TOPN selection. The title similarity and the edit distance are correlated to a certain degree. Manual inspection against sampled edit distance data shows that 95.87% of the results with similarity greater than 0.9 are similar, while 20.68% of the results with similarity less than 0.8 are similar and the number of such commodities is very small. Selecting a value from [0.8, 0.9] is therefore most likely to retain the similar results. After manual screening, similar commodity results whose title similarity is greater than 0.8474 are highly reliable, so 0.8474 is used as the similarity cut-off value.

Selecting a TOPN value: after 0.8474 is determined, the optimal TOPN value needs to be selected. The data distribution of the different TOPN segments with a similarity cut-off value of 0.8474 is given in Table 1 and fig. 8.

Table 1 shows the data distribution for different TOPN segments with a similarity cutoff of 0.8474.

TABLE 1

Data analysis is performed on 1038 SKUs sampled from each TOPN segment, and the dissimilar SKUs in each segment are identified by referring to the edit distance. In Table 1 there are, for example, 17 dissimilar SKUs in TOP 1-10 and 35 dissimilar SKUs in TOP 70-100. The analysis examines whether these dissimilar SKUs can be filtered out by keeping only SKUs with similarity > 0.8474. In TOP 1-10 and TOP 70-100, 14 and 27 SKUs respectively have a similarity less than 0.8474, of which 13 and 24 respectively are dissimilar; 4 and 11 dissimilar SKUs respectively have a similarity > 0.8474 and are therefore not captured by the 0.8474 index.

Referring to fig. 8, fig. 8 is a diagram illustrating the comparison of similar numbers. In fig. 8, the line segment corresponding to the triangular line represents "< 0.8474 in which there is dissimilarity", the line segment corresponding to the diamond represents "the number of dissimilarity", the line segment corresponding to the rectangular line represents "the number of dissimilarity", and the line segment corresponding to the cross represents "within range, there is also dissimilarity".

Most dissimilar items can be filtered out using the line segments corresponding to the triangles and the line segments corresponding to the diamonds. However, as the TOPN segment grows, many similar items are misjudged as dissimilar, so the TOPN value cannot be too large, and a suitably small value needs to be selected.

In TOP 70-100, 3 items are misjudged. After the similarity value is limited to 0.8474, the dissimilarity rate in TOP 70-100 falls from the previous 3.37% to 1.06%, that is, about one in 100 is dissimilar. According to previous requirements, the TOPN of similar items is generally selected to be less than 100, so a maximum TOPN of 100 is reasonable. Table 2 shows the distribution of dissimilar items among the recalled commodities in each TOPN segment before and after the similarity is limited to 0.8474:

TABLE 2

After the similarity is limited in TOP 70-100, the dissimilarity rate is 1.06% and the coverage rate is 98.73%, so taken together a TOP 1-100 similarity range is reasonable. Based on experience, TOPN values of 3, 5, 10, 50 and 100 with a similarity of 0.8474 are the final choice.

Step 608, pushing the items in the filtered similar item pool.

When the article is pushed, the related information of the article can also be pushed.

for example, the ranking from high to low similarity;

sales volume in the last N days;

order volume in the last N days, etc.

In this embodiment, whether an item is taken as an item to be recommended is determined by combining the similarity of the title vectors with the TOPN value in the process of determining item similarity.

EXAMPLE five

Referring to fig. 9, fig. 9 is a schematic flowchart of a process of determining an item similar to an item in the source item pool in the target item pool in the fifth embodiment of the present application. The method comprises the following specific steps:

step 901, select a fifth item in the source item pool.

The fifth item is any item in the source item pool.

Step 902, select a sixth item in the pool of target items.

The sixth item is any item in the target item pool.

Step 903, determining the attribute similarity of the fifth article and the sixth article.

The process of determining the similarity of the attributes of the fifth item and the sixth item is as follows:

first, an item attribute set ATTR1 for a fifth item is obtained.

The article attribute set ATTR1 of the fifth article item5 in the embodiment of the present application is (ATTR11, ATTR12, …, ATTR1b), where b is the number of article attributes of the fifth article.

Second, an item attribute set ATTR2 for the sixth item is obtained.

The article attribute set ATTR2 of the sixth article item6 in the embodiment of the present application is (ATTR21, ATTR22, …, ATTR2p), where p is the number of article attributes of the sixth article.

b and p may be the same or different.

And thirdly, acquiring the intersection ATTR and the union ATTRS of the article attribute sets of the fifth article and the sixth article.

Fourth, the Jaccard similarity is calculated from the intersection and the union, and the attribute similarity of the fifth item and the sixth item is calculated from the Jaccard similarity, the total number N of platform items, and the number of platform items having the attributes in the intersection ATTR.

The attribute similarity of the fifth item and the sixth item is calculated by the following formula:

wherein,

in the calculation process, the attribute similarity of the items is reconstructed: all the attributes of each item are taken as a document, each attribute is equivalent to a word, and the Jaccard similarity Jaccard_(item1, item2) of the item pair (the fifth item and the sixth item) is calculated;

the number of items of the i-th attribute belonging to the j-th attribute class in the intersection ATTR; N_j is the number of items of the j-th attribute class in the platform item pool; the platform item pool is the item pool generated from all items of the platform on which the information is pushed; for example, if the method is applied to the Jingdong platform, the platform item pool refers to the item pool generated from all items of the Jingdong platform;

the letters used in the above calculation formula in the embodiment of the present application have the same meanings as in the second embodiment, but their values may be different or the same.
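The Jaccard part of this calculation can be sketched as follows. The sketch covers only the intersection-over-union step; the weighting by the platform item counts (N and N_j) is not reproduced because the formula is not legible in this text. The function name and the example attribute values are hypothetical.

```python
def jaccard_similarity(attrs_a, attrs_b):
    """Jaccard similarity of two item attribute sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(attrs_a), set(attrs_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two items without attributes: treated as dissimilar here
    return len(set_a & set_b) / len(union)


# Hypothetical attribute sets ATTR1 and ATTR2 for item5 and item6.
attr1 = {"color:red", "material:cotton", "size:XL"}
attr2 = {"color:red", "material:cotton", "size:M"}
similarity = jaccard_similarity(attr1, attr2)  # 2 shared out of 4 distinct attributes
```

Treating each attribute as a word and each item as a document, as the text describes, is what makes a set-based measure such as Jaccard applicable.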

Step 904, determining a title similarity of the fifth item and the sixth item.

The process of specifically selecting the word vector model is as follows:

N may be set empirically or may be statistically derived from sampled samples.


Step 903 and step 904 need not be executed in any particular order; they may be executed in parallel or one after the other.

Step 905, determining the overall similarity of the fifth article and the sixth article according to the attribute similarity and the title similarity.

In this step, determining the overall similarity between the fifth item and the sixth item according to the attribute similarity and the title similarity specifically includes:

when the value of the title similarity is larger than a second preset threshold, calculating the overall similarity by the following formula:

Simi_all＝Simtitle+βSimiattr_all；

wherein Simi_all is the overall similarity, Simtitle is the title similarity, Simiattr_all is the attribute similarity, β is the weight reduction coefficient, and 0 < β < 1;

in a specific implementation, the values of β and the second preset threshold are not limited, for example, β may be set to 0.5, and the second preset threshold may be set to 0.8474.

When the value of the title similarity is not greater than a second preset threshold, calculating the overall similarity by the following formula:

Simi_all＝βSimtitle+Simiattr_all；

In the embodiment of the application, the title similarity is considered valid when its value is greater than the second preset threshold; in that case the title similarity dominates and the attribute similarity is down-weighted. If the value of the title similarity is not greater than the second preset threshold, the title similarity is considered invalid; in that case the attribute similarity dominates and the title similarity is down-weighted.
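The two fusion formulas of step 905 can be written as one small function. This is a minimal sketch: the function and parameter names are hypothetical, and the defaults β = 0.5 and 0.8474 are only the example values mentioned above.

```python
def overall_similarity(sim_title, sim_attr, beta=0.5, second_threshold=0.8474):
    """Fuse title and attribute similarity as in step 905.

    If the title similarity is valid (above the threshold) it dominates and the
    attribute similarity is down-weighted by beta; otherwise the roles swap.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0 < beta < 1")
    if sim_title > second_threshold:
        return sim_title + beta * sim_attr
    return beta * sim_title + sim_attr
```

For example, with a title similarity of 0.9 and an attribute similarity of 0.4 the title branch applies (0.9 + 0.5 × 0.4), while with a title similarity of 0.5 and an attribute similarity of 0.6 the attribute branch applies (0.5 × 0.5 + 0.6). The result is then compared with the third preset threshold in step 906.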

Step 906, when it is determined that the overall similarity is greater than a third preset threshold, determining that the fifth item and the sixth item are similar items.

When it is determined that the overall similarity is not greater than a third preset threshold, determining that the fifth article and the sixth article are dissimilar articles.

The third preset threshold is set according to the requirements of the actual application scene, and its specific value is not limited in the embodiment of the application.

When the sixth item is determined to be a similar item, the sixth item is added to the similar item pool; otherwise, it is not added to the similar item pool.

In the embodiment, whether the two articles are similar or not is determined through the overall similarity obtained after the title similarity and the attribute similarity are fused, and the accuracy of the overall similarity calculation of the two articles is greatly improved due to the fusion of the two similarities, namely the accuracy of the matching of the similar articles is improved.

EXAMPLE six

In the embodiment of the application, when information pushing is realized, two realization modes can be provided, one is a developer mode, and the other is a non-development mode, which can also be called a common mode.

The development mode can be used by developers with research and development capabilities, and the non-development mode can be used by users with weak research and development capabilities or without research and development capabilities.

In particular, the developer mode or the non-developer mode can be selected for use.

For the developer mode: when the system is in the developer mode, the configuration parameters of the similarity determination method and of the acquisition of the source item pool and the target item pool are opened for adjustment, the cluster running-resource configuration parameters are opened so that the running time can be tuned, and input of the source item pool and the target item pool is received.

For the source article pool, the contents required to be set are:

address information of the source article pool: HDFS addresses, HIVE tables, CSVs, etc. of the source item pool;

multi-level categories: generally three levels; if none is input, items within the three category levels are treated as similar by default;

similar item range limitation information: third-level catalog, item brand, item name; in a specific implementation, each of these may be assigned an identifier, such as 0, 1, 2, etc.

Aiming at the target object pool, the contents required to be set are as follows:

sorting rules: sales volume over 2-7 days, etc.;

item traffic and order volume over 7 days, order volume and traffic over 30 days, etc.

Item validity flag: whether the item is available for sale;

item status: whether the item is on the shelf;

address information of the target object pool: URL of target item pool, HDFS address, HIVE table, etc.

Whether the item reuses the historical result.

Parameters that the similarity determination process can set:

similarity determination method: title similarity, attribute similarity, or overall similarity;

item recall quantity.

Resource configuration of the information pushing device:

[DeployMode] selection of the run mode of the resource manager (YARN);

[NumExecutor] setting of the number of running executors of the calculation engine (SPARK);

[ExecutorMemory] setting of the running memory of a SPARK executor;

[DriverMemory] setting of the running memory of the SPARK driver;

[ExecutorCores] setting of the number of running cores of a SPARK executor;

[ShufflePartitions] setting of the number of SPARK shuffle partitions;

[DefaultParallelism] setting of the default parallelism of SPARK.
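To make the resource options concrete, the following sketch maps normalized spellings of the bracketed option names onto standard Spark configuration properties and turns them into `spark-submit` arguments. The mapping is an assumption made for illustration; the patent text does not state which Spark properties its options correspond to, and all values shown are hypothetical.

```python
# Hypothetical developer-mode resource settings; the values are illustrative only.
developer_config = {
    "DeployMode": "cluster",
    "NumExecutor": 50,
    "ExecutorMemory": "8g",
    "DriverMemory": "4g",
    "ExecutorCores": 4,
    "ShufflePartitions": 400,
    "DefaultParallelism": 400,
}

# Assumed correspondence between the option names and standard Spark properties.
SPARK_PROPERTY = {
    "DeployMode": "spark.submit.deployMode",
    "NumExecutor": "spark.executor.instances",
    "ExecutorMemory": "spark.executor.memory",
    "DriverMemory": "spark.driver.memory",
    "ExecutorCores": "spark.executor.cores",
    "ShufflePartitions": "spark.sql.shuffle.partitions",
    "DefaultParallelism": "spark.default.parallelism",
}


def to_spark_submit_args(config):
    """Build a spark-submit argument list for a YARN cluster from the settings."""
    args = ["--master", "yarn"]
    for name, value in config.items():
        args += ["--conf", f"{SPARK_PROPERTY[name]}={value}"]
    return args


submit_args = to_spark_submit_args(developer_config)
```

Keeping the settings in one dictionary is one way to expose them to developer-mode users while hiding them entirely in the non-developer mode.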

Relevant configuration can be carried out in the developer mode. Through this configuration, the information pushing method of the embodiment of the application can be applied to a plurality of application scenes; that is, a general model is established, and there is no need to set up a separate model for each scene to push information;

the developer mode can also configure resources, which can greatly improve information pushing efficiency and avoid wasted resources.

For the non-developer mode: when the non-developer mode is selected, scene setting, TOPN setting, generation of the target item pool according to the scene limitations, and input of the source item pool are opened.

The concrete implementation is as follows:

for the source article pool, the contents required to be set are:

sorting rules: sales volume over 2-7 days, etc.;

Item validity flag: whether the item is available for sale;

item status: whether the item is on the shelf;

parameters that the similarity determination process can set:

item recall quantity.

For a non-developer mode, namely a common mode, the information related to the similar articles can be set and determined, resource configuration options are not provided, so that common users who do not know research and development can also perform related setting as required, and pushing of the similar articles is further realized.

EXAMPLE seven

In the embodiment of the application, when similar items are determined, the order of magnitude of the number of items in the target item pool can be determined;

when the items in the similar item pool are determined to be on the order of millions, or less than millions, the items in the source item pool are assigned to a plurality of similar item determination nodes, and the target item pool is broadcast to the plurality of similar item determination nodes.

And after the similar articles are obtained by the plurality of similar article determining nodes, returning all the similar articles to generate a similar article pool.

When the order of magnitude of the items in the similar item pool is determined to be tens of millions or more, the target item pool is filtered by a Locality-Sensitive Hashing (LSH) algorithm, and the items in the target item pool that are similar to the items in the source item pool are determined.

In the embodiment, when the articles in the similar article pool are determined to be in the order of millions or less than millions, the precise similarity calculation is realized, and in order to increase the speed, the articles are distributed to a plurality of nodes to be executed respectively, namely executed in parallel.

At present, the LSH algorithm is suitable for searching similar results for 4000 million articles.

When the order of the items in the similar item pools is determined to be tens of millions or more, the target item pools are filtered through an approximation algorithm so as to reduce the calculation amount.
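The approximate filtering can be illustrated with a small sketch. The text does not say which LSH variant is used; the sketch below assumes random-hyperplane LSH, a common choice for cosine similarity, and all function names, the number of hash bits and the vector representation are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def build_buckets(target_vectors, hyperplanes):
    """Group target items by signature; target_vectors maps item_id -> vector."""
    buckets = defaultdict(list)
    for item_id, vector in target_vectors.items():
        buckets[signature(vector, hyperplanes)].append(item_id)
    return buckets


def candidates_for(source_vector, buckets, hyperplanes):
    """Only target items in the same bucket are passed to exact similarity."""
    return buckets.get(signature(source_vector, hyperplanes), [])
```

Vectors pointing in similar directions tend to share a signature, so the exact similarity calculation only needs to run on the items returned by `candidates_for` rather than on the whole target item pool. Practical deployments usually use several hash tables to reduce missed matches; that refinement is omitted here.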

Example eight

Referring to fig. 10, fig. 10 is a schematic view of an information pushing flow in an eighth embodiment of the present application. The method comprises the following specific steps:

step 1001, a source article pool is obtained.

Obtaining the source item pool in this step includes:

obtaining a directly uploaded source item pool; or selecting items in the platform item pool that match the configured parameters and generating the source item pool.

Step 1002, a target object pool matched with the configured scene limitation information and the similar object range limitation information is obtained.

The similar item range limitation information includes one or any combination of the following: third-level catalog, item brand, item name.

Step 1003, determining the items similar to the items in the source item pool in the target item pool, and generating a similar item pool.

Step 1004, filtering the similar items in the similar item pool that are similar to the same item in the source item pool by using the set TOPN value.

Specifically, if item A in the source item pool has more than N similar items in the target item pool, only N similar items are retained, and the other items are deleted from the similar item pool.

When similar items need to be filtered, a min-heap can be maintained for the similar items corresponding to each item for fast sorting.

When the articles are deleted, the articles can be randomly selected and deleted, or all similar articles of the same article can be sorted according to the sequence of similarity from large to small, the first N articles are reserved, and all the subsequent articles are deleted.
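The per-item heap described above can be sketched with Python's `heapq` module. The function name and the `(similarity, item_id)` tuple layout are assumptions made for illustration; the sketch implements the sorted variant (keep the N most similar), not the random deletion.

```python
import heapq


def keep_top_n(similar_items, n):
    """Return the n most similar items for one source item, best first.

    similar_items: iterable of (similarity, item_id) tuples.
    """
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest item currently retained
    for similarity, item_id in similar_items:
        if len(heap) < n:
            heapq.heappush(heap, (similarity, item_id))
        elif similarity > heap[0][0]:
            # The new item beats the weakest retained item, so replace it.
            heapq.heapreplace(heap, (similarity, item_id))
    return sorted(heap, reverse=True)
```

Because the heap never holds more than N entries, the memory used per source item stays bounded even when a very large number of candidates is streamed through it.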

Step 1005, pushing the items in the similar item pool.

At this time, the filtered articles in the similar article pool are pushed.

The value for TOPN may be set to 10 by default or may be set according to the actual application.

If the business sets TOPN to 30, the model automatically adjusts the recall number to 50, following the TOPN options of 3, 5, 10, 50 and 100 in the non-developer mode; similarly, if TOPN is set to 60, the model automatically sets it to 100. The benefits of this are:

ensuring more recalls as much as possible for filtering subsequent conditions;

the probability of historical data reuse is increased, the historical data can be reused for many times in the same scene, the number of recalls is fixed, the historical data can be reused when the same parameter is set next time, and repeated calculation is reduced.
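The adjustment of the recall number to the next preset option can be sketched as follows. The function name is hypothetical, and capping requests above 100 at the largest option is an assumption, since the text does not describe that case.

```python
import bisect

TOPN_OPTIONS = [3, 5, 10, 50, 100]  # preset options named in this embodiment


def recall_size(requested_top_n, options=TOPN_OPTIONS):
    """Round the requested TOPN up to the nearest preset option."""
    index = bisect.bisect_left(options, requested_top_n)
    if index == len(options):
        return options[-1]  # assumption: requests above the largest option are capped
    return options[index]
```

With these options, a request for 30 recalls 50 items and a request for 60 recalls 100, matching the two cases described above; a request that already equals an option is left unchanged.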

When the items are pushed, related information of the items can also be pushed, for example, the ranking from high to low similarity;

sales volume in the last N days;

order volume in the last N days, etc.

In this embodiment, the source item pool and the target item pool are determined by setting the scene limitation information and the similar item range limitation information of the application scene; similar items for all items in the source item pool are matched in the target item pool to form the similar item pool, which is pushed after filtering according to the set TOPN value. Because one general model can be configured through scene parameters to recommend similar items in a plurality of scenes, this technical scheme can reduce the information pushing cost and avoid the waste of resources.

Based on the same inventive concept, the embodiment of the application also provides an information pushing device. Referring to fig. 11, fig. 11 is a schematic structural diagram of an apparatus applied to the above technology in the embodiment of the present application. The device comprises: the system comprises a configuration unit, a source article pool acquisition unit, a target article pool acquisition unit, a similar article pool generation unit and a pushing unit;

Preferably,

the source article pool acquisition unit is specifically used for acquiring a directly uploaded source article pool; or selecting the items matched with the configured parameters in the platform item pool and generating the source item pool.

Preferably,

the target object pool acquiring unit is specifically configured to acquire a directly uploaded target object pool matched with the configured scene restriction information and the similar object range restriction information when acquiring the target object pool matched with the configured scene restriction information and the similar object range restriction information; or selecting an article matched with the configured scene limitation information and the similar article range limitation information from the platform article pool, and generating a target article pool.

Preferably,

the similar item pool generating unit is specifically configured to, when determining an item similar to an item in the source item pool in the target item pool: select a first item in the source item pool; select a second item in the target item pool; determine the attribute similarity of the first item and the second item, wherein the attribute similarity is calculated from the Jaccard similarity, the total number of items in the platform item pool, and the number of platform items having the attributes in the attribute intersection of the first item and the second item, and the Jaccard similarity is calculated according to the intersection and union of the attributes of the first item and the second item; and, if the attribute similarity is determined to be greater than a first preset threshold, determine the second item as an item similar to the first item.

Preferably,

the similar item pool generating unit is specifically configured to, when determining items in the target item pool that are similar to items in the source item pool: select a third item in the source item pool, and acquire a title vector of the third item; select a fourth item in the target item pool, and acquire a title vector of the fourth item; determine the title similarity of the third item and the fourth item according to the title vector of the third item and the title vector of the fourth item; and, when the title similarity is determined to be greater than a second preset threshold, determine the fourth item as an item similar to the third item.

Preferably, the target pool obtaining unit is further configured to obtain a title vector of the article through a preset word vector model; when a plurality of word vector models to be selected exist, calculating an effect evaluation index value of the word vector models to be selected according to the class purity of each cluster based on a clustering algorithm; and selecting the word vector model to be selected with the minimum effect evaluation index value as a preset word vector model.

Preferably,

the configuration unit is further used for configuring a TOPN value paired with a second preset threshold value;

the pushing unit is further configured to filter similar items of the same item in the similar item pool by using the TOPN value configured by the configuration unit before pushing the item in the similar item pool.

Preferably,

the similar item pool generating unit is specifically configured to, when determining an item similar to an item in the source item pool in the target item pool, include: selecting a fifth item in the source item pool; selecting a sixth item in the target item pool; determining attribute similarity of the fifth item and the sixth item; determining title similarity of the fifth item and the sixth item; determining the overall similarity of the fifth article and the sixth article according to the attribute similarity and the title similarity; and when the overall similarity is determined to be larger than a third preset threshold value, determining that the sixth article is an article similar to the fifth article.

Preferably,

the similar item pool generating unit is specifically configured to, when determining the overall similarity between the fifth item and the sixth item according to the attribute similarity and the title similarity: when the value of the title similarity is greater than a second preset threshold, calculate the overall similarity as Simi_all = Simtitle + βSimiattr_all; when the value of the title similarity is not greater than the second preset threshold, calculate the overall similarity as Simi_all = βSimtitle + Simiattr_all; wherein Simi_all is the overall similarity, Simtitle is the title similarity, Simiattr_all is the attribute similarity, β is the weight reduction coefficient, and 0 < β < 1.

Preferably,

the pushing unit is further used for filtering the items in the similar item pool that are similar to the same item in the source item pool by using the set TOPN value before pushing the items in the similar item pool.

Preferably,

the configuration unit is further used for configuring a development mode and a non-development mode;

wherein, when the non-development mode is selected, the scene setting is opened, the TOPN setting is carried out, and the generation of the target object pool and the input of the source object pool are limited according to the scene;

and when the system is in the development mode, the configuration parameters of the similarity determination method and of the acquisition of the source item pool and the target item pool are opened for adjustment, the cluster running-resource configuration parameters are opened so that the running time can be tuned, and input of the source item pool and the target item pool is received.

Preferably,

the similar item pool generating unit is further used for distributing the items in the source item pool to a plurality of similar item determining nodes and broadcasting the target item pool to the plurality of similar item determining nodes when the items in the target item pool are determined to be on the order of millions or less than millions; after the plurality of similar item determining nodes obtain the similar items, returning all the similar items to generate a similar item pool; and, when the order of magnitude of the items in the similar item pool is determined to be tens of millions or more, filtering the target item pool by a Locality-Sensitive Hashing (LSH) algorithm, and determining the items in the target item pool that are similar to the items in the source item pool.

Preferably,

the pushing unit is further configured to, when pushing the articles in the similar article pool, push one or any combination of the following information:

Preferably, the similar article range restriction information includes one or any combination of the following:

third-level catalog, item brand, item name.

The units of the above embodiments may be integrated into one body, or may be separately deployed; may be combined into one unit or further divided into a plurality of sub-units.

In another embodiment, an electronic device is further provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the information pushing method when executing the program.

In another embodiment, a computer readable storage medium is further provided, on which computer instructions are stored, and when executed by a processor, the instructions can implement the steps in the information pushing method.

Fig. 12 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 12, the electronic device may include: a Processor (Processor)1210, a communication Interface (Communications Interface)1220, a Memory (Memory)1230, and a communication bus 1240, wherein the Processor 1210, the communication Interface 1220, and the Memory 1230 communicate with each other via the communication bus 1240. Processor 1210 may call logic instructions in memory 1230 to perform the following method:

obtaining a source article pool;

pushing the articles in the pool of similar articles.

In addition, the logic instructions in the memory 1230 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in some parts of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An information pushing method, characterized in that the method comprises:

obtaining a source article pool;

pushing the articles in the pool of similar articles.

2. The method of claim 1, wherein the obtaining a source item pool comprises:

obtaining a source article pool which is directly uploaded;

3. The method according to claim 1, wherein the obtaining of the target item pool matching the configured scene restriction information and the similar item range restriction information comprises:

4. The method of claim 1, wherein said determining items in the target pool that are similar to items in the source pool comprises:

selecting a first item in a source item pool;

selecting a second item in the target item pool;

determining the attribute similarity of the first item and the second item; wherein the attribute similarity is calculated from the Jaccard similarity, the total number of items in the platform item pool, and the number of platform items having the attributes in the attribute sets of the first item and the second item; and the Jaccard similarity is calculated from the intersection and the union of the attributes of the first item and the second item;

and if the attribute similarity is determined to be larger than a first preset threshold value, determining the second article as an article similar to the first article.
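For illustration, the Jaccard term of the attribute comparison above can be sketched in Python as follows. This is a minimal sketch of the plain Jaccard ratio only: the claim also refers to platform-wide item counts, and because the way those counts enter the calculation is not given here, the sketch leaves them out. The function names, the sample attributes and the 0.5 threshold are hypothetical.

```python
def jaccard_similarity(attrs_a: set, attrs_b: set) -> float:
    """Size of the attribute intersection divided by size of the union."""
    union = attrs_a | attrs_b
    if not union:
        # Two items without any attributes share nothing measurable.
        return 0.0
    return len(attrs_a & attrs_b) / len(union)


def is_similar_by_attributes(attrs_a, attrs_b, threshold: float = 0.5) -> bool:
    """Hypothetical helper: strict 'greater than threshold' test from the claim.

    The default threshold stands in for the first preset threshold and is
    an assumption, not a value taken from the source.
    """
    return jaccard_similarity(set(attrs_a), set(attrs_b)) > threshold


first_item = {"red", "cotton", "xl"}   # hypothetical attributes
second_item = {"red", "cotton", "m"}
score = jaccard_similarity(first_item, second_item)
```

Here the two items share two of four distinct attributes, so a pair scoring exactly at the threshold is not treated as similar because the comparison is strict.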

5. The method of claim 1, wherein said determining items in the target pool that are similar to items in the source pool comprises:

selecting a third article in a source article pool, and acquiring a title vector of the third article;

selecting a fourth article in the target article pool, and acquiring a title vector of the fourth article;

determining the title similarity of the third article and the fourth article according to the title vector of the third article and the title vector of the fourth article;

and when the title similarity is determined to be larger than a second preset threshold value, determining the fourth article as an article similar to the third article.
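As a hedged illustration of comparing two title vectors, the sketch below uses cosine similarity. The claim does not name the similarity measure, so cosine similarity is an assumption chosen because it is a common choice for word-vector comparison; the function name and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length title vectors.

    Assumption: the source does not state which measure is used.
    """
    if len(u) != len(v):
        raise ValueError("title vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector carries no direction, so report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


third_title = [1.0, 1.0]    # hypothetical title vectors
fourth_title = [1.0, 0.0]
title_similarity = cosine_similarity(third_title, fourth_title)
```

The fourth item would then be kept as similar only if `title_similarity` exceeds the second preset threshold.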

6. The method of claim 5, further comprising:

acquiring a title vector of the article through a preset word vector model;

when a plurality of candidate word vector models exist, calculating, based on a clustering algorithm, an effect evaluation index value for each candidate word vector model according to the class purity of each cluster;

and selecting the candidate word vector model with the minimum effect evaluation index value as the preset word vector model.
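The selection step above can be sketched as follows. The exact form of the effect evaluation index is not given here, so this sketch assumes one possible index in which lower is better: one minus the size-weighted mean purity of the clusters. The function names, the model names and the category labels are hypothetical, and every cluster is assumed to be non-empty.

```python
from collections import Counter


def cluster_purity(categories) -> float:
    """Share of the most common category label within one non-empty cluster."""
    counts = Counter(categories)
    return counts.most_common(1)[0][1] / len(categories)


def impurity_index(clusters) -> float:
    """Assumed index: 1 minus the size-weighted mean purity (lower is better)."""
    total = sum(len(cluster) for cluster in clusters)
    weighted = sum(len(cluster) * cluster_purity(cluster) for cluster in clusters)
    return 1.0 - weighted / total


def pick_model(clusterings: dict) -> str:
    """Return the candidate model name whose clustering has the minimum index."""
    return min(clusterings, key=lambda name: impurity_index(clusterings[name]))


# Hypothetical clustering results: each inner list holds the category labels
# of the items that one candidate model's title vectors placed in a cluster.
clusterings = {
    "model_a": [["shoes", "shoes", "bags"], ["bags", "bags"]],
    "model_b": [["shoes", "bags"], ["shoes", "bags", "bags"]],
}
best = pick_model(clusterings)
```

Under this assumed index, a model whose clusters mix fewer categories scores lower and is preferred.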

7. The method of claim 5, wherein after the generating the pool of similar items and before the pushing the items in the pool of similar items, the method further comprises:

and filtering the similar items of the same item in the similar item pool by using the set TOPN value paired with the second preset threshold value.

8. The method of claim 1, wherein said determining items in the target pool that are similar to items in the source pool comprises:

selecting a fifth item in the source item pool;

selecting a sixth item in the target item pool;

determining attribute similarity of the fifth item and the sixth item;

determining title similarity of the fifth item and the sixth item;

determining the overall similarity of the fifth article and the sixth article according to the attribute similarity and the title similarity;

and when the overall similarity is determined to be larger than a third preset threshold value, determining that the sixth article is an article similar to the fifth article.
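One way to combine the two scores can be sketched as a weighted average. The claim does not specify how the attribute similarity and the title similarity are combined, so the linear weighting, the function name and the default weight of 0.5 are assumptions made purely for illustration.

```python
def overall_similarity(attr_sim: float, title_sim: float,
                       attr_weight: float = 0.5) -> float:
    """Assumed combination: weighted average of the two similarity scores."""
    if not 0.0 <= attr_weight <= 1.0:
        raise ValueError("attr_weight must lie in [0, 1]")
    return attr_weight * attr_sim + (1.0 - attr_weight) * title_sim


# Hypothetical scores for a fifth and a sixth item.
combined = overall_similarity(attr_sim=0.8, title_sim=0.4)
```

The sixth item would then be kept only if `combined` exceeds the third preset threshold.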

9. The method of claim 1, wherein after the generating the pool of similar items and before the pushing the items in the pool of similar items, the method further comprises:

filtering, using the set TOPN value, the items in the similar item pool that are similar to the same item in the source item pool.
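The TOPN filtering step can be illustrated with a minimal Python sketch that keeps, for each source item, only the N highest-scoring similar items. The tuple layout, the function name and the sample identifiers are hypothetical; ties are broken by the target identifier simply because tuples are compared element by element.

```python
import heapq
from collections import defaultdict


def filter_top_n(pairs, n: int) -> dict:
    """Keep the n best-scoring similar items for each source item.

    pairs: iterable of (source_id, target_id, score) tuples (assumed layout).
    """
    by_source = defaultdict(list)
    for source_id, target_id, score in pairs:
        by_source[source_id].append((score, target_id))
    return {
        source_id: [target for _, target in heapq.nlargest(n, candidates)]
        for source_id, candidates in by_source.items()
    }


# Hypothetical similar item pool before filtering.
pairs = [
    ("s1", "t1", 0.9),
    ("s1", "t2", 0.7),
    ("s1", "t3", 0.8),
    ("s2", "t4", 0.6),
]
filtered = filter_top_n(pairs, n=2)
```

With N set to 2, source item `s1` keeps `t1` and `t3`, while `s2`, which has a single similar item, keeps `t4`.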

10. The method of claim 1, further comprising: configuring a development mode and a non-development mode;

when the non-development mode is selected, opening scene setting, TOPN setting, generation of a target object pool and input of a source object pool according to scene limitation;

11. The method of claim 1, further comprising:

when the number of items in the similar item pool is determined to be on the order of millions or fewer, distributing the items in the source item pool to a plurality of similar item determination nodes, and broadcasting the target item pool to the plurality of similar item determination nodes;

after the similar articles are obtained by the multiple similar article determining nodes, returning all the similar articles to generate a similar article pool;

when the number of items in the similar item pool is determined to be on the order of tens of millions or more, filtering the target item pool by a Locality-Sensitive Hashing (LSH) algorithm, and determining the items in the target item pool that are similar to the items in the source item pool.
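The LSH filtering idea can be sketched in Python as follows. The claim names LSH but not a specific hash family, so this sketch assumes random-hyperplane signatures over item vectors: target items are grouped into buckets by signature, and only targets in the same bucket as a source item are passed on to the exact similarity calculation. All function names, the number of bits, the seed and the sample vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim: int, num_bits: int, seed: int = 0):
    """Draw num_bits random hyperplanes; the seed keeps the sketch repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes) -> tuple:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(target_vectors: dict, hyperplanes) -> dict:
    """Group target item ids by signature so lookups avoid a full scan."""
    buckets = defaultdict(list)
    for item_id, vector in target_vectors.items():
        buckets[signature(vector, hyperplanes)].append(item_id)
    return buckets


def candidates_for(source_vector, buckets, hyperplanes) -> list:
    """Target ids sharing the source item's bucket; others are filtered out."""
    return buckets.get(signature(source_vector, hyperplanes), [])


planes = make_hyperplanes(dim=3, num_bits=8)
targets = {"t1": [1.0, 2.0, 3.0], "t2": [-1.0, -2.0, -3.0]}  # hypothetical
buckets = build_buckets(targets, planes)
shortlist = candidates_for([2.0, 4.0, 6.0], buckets, planes)
```

A source vector pointing in the same direction as `t1` lands in the bucket of `t1`, while the opposite vector `t2` falls on the other side of every hyperplane and is filtered out. A practical deployment would typically use several hash tables to reduce missed neighbours, which this sketch does not attempt.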

12. The method of claim 1, wherein when pushing items in the pool of similar items, the method further comprises:

pushing one or any combination of the following information:

13. The method according to any one of claims 1 to 12, wherein the similar item range restriction information comprises one or any combination of the following:

third-level catalog, item brand, item name.

14. An information pushing apparatus, characterized in that the apparatus comprises: a configuration unit, a source article pool acquisition unit, a target article pool acquisition unit, a similar article pool generation unit, and a pushing unit;

15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-13 when executing the program.

16. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 13.