CN110287977A - Content clustering method and device - Google Patents


Info

Publication number
CN110287977A
Authority
CN
China
Prior art keywords
node
content
label
category
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810226492.4A
Other languages
Chinese (zh)
Other versions
CN110287977B (en
Inventor
刘荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Youku Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youku Network Technology Beijing Co Ltd
Priority to CN201810226492.4A
Publication of CN110287977A
Application granted
Publication of CN110287977B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques

Abstract

The present disclosure relates to a content clustering method and device. The method comprises: obtaining multiple groups of user behavior data; for each group of user behavior data, determining the content sequence corresponding to that user behavior data; determining the correlation between content items according to the positional relationship between content items in the content sequence; and determining the category to which each content item belongs using a label propagation algorithm according to the correlation between content items. The disclosure clusters content automatically, without manual clustering, which saves labor and makes it easy to cluster a large amount of content; it can also better mine the correlation between content items and improve the accuracy of content clustering.

Description

Content clustering method and device
Technical field
The present disclosure relates to the field of information technology, and in particular to a content clustering method and device.
Background
In the related art, content such as videos is clustered manually to obtain the content of each category. Manual content clustering consumes a great deal of labor, and the similarity between content items within each resulting category is difficult to guarantee.
Summary
In view of this, the present disclosure proposes a content clustering method and device.
According to one aspect of the present disclosure, a content clustering method is provided, comprising:
obtaining multiple groups of user behavior data;
for each group of user behavior data, determining the content sequence corresponding to that user behavior data;
determining the correlation between content items according to the positional relationship between content items in the content sequence;
determining the category to which each content item belongs using a label propagation algorithm according to the correlation between content items.
In one possible implementation, determining the correlation between content items according to the positional relationship between content items in the content sequence comprises:
determining that content items adjacent in the content sequence are related.
In one possible implementation, determining the category to which each content item belongs using a label propagation algorithm according to the correlation between content items comprises:
establishing an undirected graph, with each content item as a node in the undirected graph;
if two content items are related, establishing an edge between the nodes corresponding to the two content items;
assigning a label to each node;
for any node, updating the label of the node according to the labels of its neighbor nodes, where the neighbor nodes of a node are the nodes connected to it;
when the label of each node is stable, determining the category to which the content item corresponding to each node belongs according to the label of each node.
In one possible implementation, for any node, updating the label of the node according to the labels of its neighbor nodes comprises:
for any node, updating the label of the node according to the label that occurs most frequently among the labels of its neighbor nodes.
In one possible implementation, before the label of the node is updated, the method further comprises:
for two related content items, determining the similarity between the two content items according to the number of times the two content items appear adjacent to each other in the content sequences and the number of times each of the two content items appears in the content sequences.
In one possible implementation, for any node, updating the label of the node according to the labels of its neighbor nodes comprises:
determining the weight of the edge between the nodes corresponding to two related content items according to the similarity between the two content items;
for any node, updating the label of the node according to the labels of its neighbor nodes and the weights of the edges between the node and its neighbor nodes.
In one possible implementation, for any node, updating the label of the node according to the labels of its neighbor nodes and the weights of the edges between the node and its neighbor nodes comprises:
for any node, taking each label of its neighbor nodes as a candidate label;
determining, among the neighbor nodes of the node, the neighbor nodes corresponding to each candidate label;
determining the weight of each candidate label according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label;
updating the label of the node according to the candidate label with the highest weight.
In one possible implementation, after the category to which each content item belongs is determined, the method further comprises:
if the similarity between the content in a first category and a second category satisfies a first condition, merging the first category and the second category.
In one possible implementation, after the category to which each content item belongs is determined, the method further comprises:
determining a first content number, which is the number of content items in the intersection of the first category and the second category;
determining a second content number, which is the number of content items in the union of the first category and the second category;
if the ratio of the first content number to the second content number is greater than a first threshold, determining that the similarity between the content in the first category and the second category satisfies the first condition;
if the ratio of the first content number to the second content number is less than or equal to the first threshold, determining that the similarity between the content in the first category and the second category does not satisfy the first condition.
In one possible implementation, after the category to which each content item belongs is determined, the method further comprises:
deleting, from each category, the content that does not satisfy a second condition.
In one possible implementation, the second condition includes: the click-through rate of the content is less than a second threshold.
According to another aspect of the present disclosure, a content clustering device is provided, comprising:
an obtaining module, configured to obtain multiple groups of user behavior data;
a first determining module, configured to determine, for each group of user behavior data, the content sequence corresponding to that user behavior data;
a second determining module, configured to determine the correlation between content items according to the positional relationship between content items in the content sequence;
a third determining module, configured to determine the category to which each content item belongs using a label propagation algorithm according to the correlation between content items.
In one possible implementation, the second determining module is configured to:
determine that content items adjacent in the content sequence are related.
In one possible implementation, the third determining module comprises:
a first establishing submodule, configured to establish an undirected graph, with each content item as a node in the undirected graph;
a second establishing submodule, configured to establish an edge between the nodes corresponding to two content items if the two content items are related;
an assigning submodule, configured to assign a label to each node;
an updating submodule, configured to update, for any node, the label of the node according to the labels of its neighbor nodes, where the neighbor nodes of a node are the nodes connected to it;
a determining submodule, configured to determine, when the label of each node is stable, the category to which the content item corresponding to each node belongs according to the label of each node.
In one possible implementation, the updating submodule is configured to:
update, for any node, the label of the node according to the label that occurs most frequently among the labels of its neighbor nodes.
In one possible implementation, the device further comprises:
a fourth determining module, configured to determine, for two related content items, the similarity between the two content items according to the number of times the two content items appear adjacent to each other in the content sequences and the number of times each of the two content items appears in the content sequences.
In one possible implementation, the updating submodule comprises:
a determining unit, configured to determine the weight of the edge between the nodes corresponding to two related content items according to the similarity between the two content items;
an updating unit, configured to update, for any node, the label of the node according to the labels of its neighbor nodes and the weights of the edges between the node and its neighbor nodes.
In one possible implementation, the updating unit comprises:
a first determining subunit, configured to take, for any node, each label of its neighbor nodes as a candidate label;
a second determining subunit, configured to determine, among the neighbor nodes of the node, the neighbor nodes corresponding to each candidate label;
a third determining subunit, configured to determine the weight of each candidate label according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label;
an updating subunit, configured to update the label of the node according to the candidate label with the highest weight.
In one possible implementation, the device further comprises:
a merging module, configured to merge a first category and a second category if the similarity between the content in the first category and the second category satisfies a first condition.
In one possible implementation, the device further comprises:
a fifth determining module, configured to determine a first content number, which is the number of content items in the intersection of the first category and the second category;
a sixth determining module, configured to determine a second content number, which is the number of content items in the union of the first category and the second category;
a seventh determining module, configured to determine that the similarity between the content in the first category and the second category satisfies the first condition if the ratio of the first content number to the second content number is greater than a first threshold;
an eighth determining module, configured to determine that the similarity between the content in the first category and the second category does not satisfy the first condition if the ratio of the first content number to the second content number is less than or equal to the first threshold.
In one possible implementation, the device further comprises:
a deleting module, configured to delete, from each category, the content that does not satisfy a second condition.
In one possible implementation, the second condition includes: the click-through rate of the content is less than a second threshold.
According to another aspect of the present disclosure, a content clustering device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the above method.
According to another aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions implement the above method when executed by a processor.
The content clustering method and device of the various aspects of the present disclosure obtain multiple groups of user behavior data, determine the content sequence corresponding to each group of user behavior data, determine the correlation between content items according to the positional relationship between content items in the content sequence, and determine the category to which each content item belongs using a label propagation algorithm according to the correlation between content items. Content can therefore be clustered automatically, without manual clustering, which saves labor, makes it easy to cluster a large amount of content, better mines the correlation between content items, and improves the accuracy of content clustering.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure together with the specification, and serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of the content clustering method according to an embodiment of the present disclosure.
Fig. 2 shows an exemplary flowchart of step S14 of the content clustering method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of the undirected graph in the content clustering method according to an embodiment of the present disclosure.
Fig. 4 shows an exemplary flowchart of step S144 of the content clustering method according to an embodiment of the present disclosure.
Fig. 5 shows an exemplary flowchart of step S1442 of the content clustering method according to an embodiment of the present disclosure.
Fig. 6 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure.
Fig. 7 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure.
Fig. 8 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of the content clustering device according to an embodiment of the present disclosure.
Fig. 10 shows an exemplary block diagram of the content clustering device according to an embodiment of the present disclosure.
Fig. 11 is a block diagram of a device 1900 for content clustering according to an exemplary embodiment.
Detailed description
Various exemplary embodiments, features and aspects of the present disclosure are described in detail below with reference to the drawings. Identical reference signs in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
The word "exemplary" here means "serving as an example, embodiment or illustration". Any embodiment described here as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the detailed description below in order to better explain the present disclosure. Those skilled in the art will understand that the present disclosure can be practiced without certain specific details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Fig. 1 shows a flowchart of the content clustering method according to an embodiment of the present disclosure. The method can be applied in a server. As shown in Fig. 1, the method includes steps S11 to S14.
In step S11, multiple groups of user behavior data are obtained.
In this embodiment, user behavior data denotes the data generated when a user operates on content. For example, if the content is video, user behavior data may include data on users watching videos, commenting on videos, posting bullet comments on videos, adding videos to favorites, sharing videos, liking videos, and so on.
In step S12, for each group of user behavior data, the content sequence corresponding to that user behavior data is determined.
For example, if a certain group of user behavior data shows that a user watched videos V1, V2, ..., VN in turn, the content sequence corresponding to that group of user behavior data can be determined to be {V1, V2, ..., VN}.
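Step S12 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the record layout (group_id, timestamp, content_id) and the function name build_content_sequences are assumptions made for the example, as is the choice to order each group's events by timestamp.

```python
from collections import defaultdict

def build_content_sequences(records):
    """Turn (group_id, timestamp, content_id) records into one ordered
    content sequence per group. The record layout is hypothetical."""
    by_group = defaultdict(list)
    for group_id, timestamp, content_id in records:
        by_group[group_id].append((timestamp, content_id))
    # Sort each group's events by time and keep only the content ids.
    return {g: [c for _, c in sorted(events)] for g, events in by_group.items()}

# Hypothetical viewing records for two groups of user behavior data.
records = [
    ("u1", 3, "V3"), ("u1", 1, "V1"), ("u1", 2, "V2"),
    ("u2", 1, "V2"), ("u2", 2, "V4"),
]
sequences = build_content_sequences(records)
```

Each value in sequences then plays the role of one content sequence {V1, V2, ..., VN}.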
In step S13, the correlation between content items is determined according to the positional relationship between content items in the content sequence.
In one possible implementation, determining the correlation between content items according to the positional relationship between content items in the content sequence comprises: determining that content items adjacent in the content sequence are related. For example, if the content sequence is {V1, V2, ..., VN}, it can be determined that V1 is related to V2, V2 is related to V3, ..., and V(N-1) is related to VN.
It should be noted that although treating adjacent content items as related is described above as one way of determining correlation from positional relationship, those skilled in the art will understand that the present disclosure is not limited to this. Those skilled in the art can flexibly set the specific way of determining the correlation between content items according to the needs of the actual application scenario and/or personal preference, as long as the correlation is determined according to the positional relationship between content items in the content sequence. For example, content items whose distance in the content sequence is less than or equal to D may also be determined to be related, where D is a natural number. The distance between two content items in a content sequence can be the number of content items between them in that sequence. For example, the distance between V1 and V2 is 0, and the distance between V1 and V3 is 1. If D equals 1, V1 is related not only to V2 but also to V3.
In one possible implementation, content pairs can be generated from related content items. For example, the content pairs {V1, V2}, {V2, V3}, ..., {V(N-1), VN} can be generated.
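The adjacency rule and its distance-D generalization can be sketched as below. The function name related_pairs and the parameter max_distance (standing for D) are hypothetical; skipping pairs where an item follows itself is an assumption of this sketch and is not stated in the text.

```python
def related_pairs(sequence, max_distance=0):
    """Return unordered content pairs whose distance in the sequence is at
    most max_distance, where distance is the number of items in between."""
    pairs = set()
    for i, a in enumerate(sequence):
        # A distance of d corresponds to an index gap of d + 1.
        for j in range(i + 1, min(i + max_distance + 2, len(sequence))):
            b = sequence[j]
            if a != b:  # assumption: an item is not paired with itself
                pairs.add(frozenset((a, b)))
    return pairs

adjacent_only = related_pairs(["V1", "V2", "V3"])               # D = 0
within_one = related_pairs(["V1", "V2", "V3"], max_distance=1)  # D = 1
```

With D = 0 only adjacent items are paired; with D = 1 the pair {V1, V3} is added, matching the example in the text.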
In step S14, the category to which each content item belongs is determined using a label propagation algorithm according to the correlation between content items.
In this embodiment, using the correlation between content items and a label propagation algorithm, similar content items can be grouped into the same category by unsupervised clustering.
The label propagation algorithm may be LPA (Label Propagation Algorithm) or an improved label propagation algorithm, for example COPRA (Community Overlap Propagation Algorithm) or SLPA (Speaker-listener Label Propagation Algorithm).
This embodiment obtains multiple groups of user behavior data, determines the content sequence corresponding to each group of user behavior data, determines the correlation between content items according to the positional relationship between content items in the content sequence, and determines the category to which each content item belongs using a label propagation algorithm according to the correlation between content items. Content can therefore be clustered automatically, without manual clustering, which saves labor, makes it easy to cluster a large amount of content, better mines the correlation between content items, and improves the accuracy of content clustering.
Fig. 2 shows an exemplary flowchart of step S14 of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 2, step S14 may include steps S141 to S145.
In step S141, an undirected graph is established, with each content item as a node in the undirected graph.
Fig. 3 shows a schematic diagram of the undirected graph in the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 3, if the content includes content V1, content V2, content V3, content V4 and content V5, a node can be established in the undirected graph for each content item: node V1 for content V1, node V2 for content V2, node V3 for content V3, node V4 for content V4, and node V5 for content V5.
In step S142, if two content items are related, an edge is established between the nodes corresponding to the two content items.
As shown in Fig. 3, if content V2 is related to content V1, content V3 and content V4, and content V3 is related to content V5, edges can be established between node V2 and node V1, between node V2 and node V3, between node V2 and node V4, and between node V3 and node V5.
In step S143, a label is assigned to each node.
For example, label T1 can be assigned to node V1, label T2 to node V2, and so on.
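Steps S141 to S143 can be sketched on the Fig. 3 example as follows. The adjacency-dictionary representation and the helper name build_graph are assumptions of this sketch; the text does not prescribe a data structure.

```python
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected graph as {node: set of neighbor nodes}."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)

# Related pairs from the Fig. 3 example.
pairs = [("V2", "V1"), ("V2", "V3"), ("V2", "V4"), ("V3", "V5")]
graph = build_graph(pairs)

# One distinct initial label per node: T1 for V1, T2 for V2, and so on.
labels = {node: "T" + node[1:] for node in graph}
```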
In step S144, for any node, the label of the node is updated according to the labels of its neighbor nodes, where the neighbor nodes of a node are the nodes connected to it.
In one possible implementation, for any node, updating the label of the node according to the labels of its neighbor nodes comprises: for any node, updating the label of the node according to the label that occurs most frequently among the labels of its neighbor nodes.
In one possible implementation, updating the label of the node according to the label that occurs most frequently among the labels of its neighbor nodes comprises: if exactly one label occurs most frequently among the labels of the neighbor nodes, taking that label as the label of the node; if more than one label is tied for most frequent among the labels of the neighbor nodes, randomly selecting one of the tied labels as the label of the node.
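The majority rule with a random tie-break can be sketched as below. The function name most_frequent_label, the rng parameter, and the behavior for a node with no neighbors (keep its current label) are assumptions of this sketch.

```python
import random
from collections import Counter

def most_frequent_label(node, graph, labels, rng=random):
    """Pick the label that occurs most often among the node's neighbors,
    choosing at random among tied labels."""
    counts = Counter(labels[n] for n in graph[node])
    if not counts:
        return labels[node]  # assumption: isolated nodes keep their label
    top = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)

graph = {"V2": {"V1", "V3", "V4"}}
labels = {"V1": "T1", "V2": "T2", "V3": "T1", "V4": "T4"}
clear_winner = most_frequent_label("V2", graph, labels)

tie_graph = {"V2": {"V1", "V4"}}
tie_result = most_frequent_label("V2", tie_graph, labels, random.Random(0))
```

In the first call T1 occurs twice and wins outright; in the second call T1 and T4 are tied, so either may be returned.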
In step S145, when the label of each node is stable, the category to which the content item corresponding to each node belongs is determined according to the label of each node.
In this embodiment, the label of each node can be updated repeatedly according to the labels of its neighbor nodes until the label of no node changes any more.
In this embodiment, when the label of each node is stable, nodes with the same label can be assigned to the same category.
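Putting steps S143 to S145 together, a basic label propagation loop might look like the sketch below. Several choices here are assumptions rather than anything stated in the text: nodes are updated one at a time in shuffled order, a node keeps its current label when that label is already among the most frequent ones (a common convergence aid), and max_rounds caps the number of passes.

```python
import random
from collections import Counter, defaultdict

def label_propagation(graph, max_rounds=100, seed=0):
    """Cluster the nodes of {node: set of neighbors} by label propagation."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # one distinct label per node
    nodes = sorted(graph)
    for _ in range(max_rounds):
        changed = False
        rng.shuffle(nodes)
        for node in nodes:
            counts = Counter(labels[n] for n in graph[node])
            if not counts:
                continue
            top = max(counts.values())
            best = sorted(label for label, c in counts.items() if c == top)
            if labels[node] in best:
                continue  # current label is already a majority label
            labels[node] = rng.choice(best)
            changed = True
        if not changed:  # labels are stable
            break
    categories = defaultdict(set)
    for node, label in labels.items():
        categories[label].add(node)
    return list(categories.values())

# Two disconnected triangles should end up as two categories.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
    "X": {"Y", "Z"}, "Y": {"X", "Z"}, "Z": {"X", "Y"},
}
categories = label_propagation(graph)
```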
In one possible implementation, before the label of the node is updated, the method further includes: for two related content items, determining the similarity between the two content items according to the number of times the two content items appear adjacent to each other in the content sequences and the number of times each of the two content items appears in the content sequences.
For example, first content V1 is related to second content V2, the number of times first content V1 and second content V2 appear adjacent to each other in the content sequences is C(V1, V2), the number of times first content V1 appears in the content sequences is C(V1), and the number of times second content V2 appears in the content sequences is C(V2); the similarity between first content V1 and second content V2 can then be determined using Formula 1, Formula 2, Formula 3, or the like.
In this implementation, for two content items that never appear adjacent to each other in any content sequence, the similarity between the two content items can be 0.
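Formulas 1 to 3 are not reproduced in this text, so the sketch below only shows how the counts C(V1, V2), C(V1) and C(V2) might be collected, and then plugs them into a stand-in similarity. The cosine-style ratio C(V1, V2) / sqrt(C(V1) * C(V2)) is an assumption chosen for illustration and should not be read as any of the patent's formulas; the function names are likewise hypothetical.

```python
import math
from collections import Counter

def cooccurrence_counts(sequences):
    """Count single occurrences C(V) and adjacent co-occurrences C(Va, Vb)."""
    single = Counter()
    adjacent = Counter()
    for seq in sequences:
        single.update(seq)
        for a, b in zip(seq, seq[1:]):
            if a != b:
                adjacent[frozenset((a, b))] += 1
    return single, adjacent

def similarity(a, b, single, adjacent):
    """Stand-in similarity (assumption, not the patent's Formula 1, 2 or 3)."""
    together = adjacent.get(frozenset((a, b)), 0)
    if together == 0:
        return 0.0  # never adjacent: similarity 0, as in the text
    return together / math.sqrt(single[a] * single[b])

sequences = [["V1", "V2", "V3"], ["V2", "V1"], ["V1", "V4"]]
single, adjacent = cooccurrence_counts(sequences)
sim_v1_v2 = similarity("V1", "V2", single, adjacent)
sim_v1_v3 = similarity("V1", "V3", single, adjacent)
```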
Fig. 4 shows an exemplary flowchart of step S144 of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 4, step S144 may include steps S1441 and S1442.
In step S1441, the weight of the edge between the nodes corresponding to two related content items is determined according to the similarity between the two content items.
For example, the similarity between two content items can be used as the weight of the edge between the nodes corresponding to the two content items.
In step S1442, for any node, the label of the node is updated according to the labels of its neighbor nodes and the weights of the edges between the node and its neighbor nodes.
In one possible implementation, for any node, updating the label of the node according to the labels of its neighbor nodes and the weights of the edges between the node and its neighbor nodes comprises: if exactly one label occurs most frequently among the labels of the neighbor nodes, taking that label as the label of the node; if more than one label is tied for most frequent among the labels of the neighbor nodes, selecting the tied label with the largest weight as the label of the node. For example, the neighbor nodes of node V2 include node V1, node V3, node V4 and node V6, the label of node V1 and node V3 is T1, and the label of node V4 and node V6 is T2; label T1 and label T2 then each occur twice, that is, more than one label is tied for most frequent among the labels of the neighbor nodes of node V2. If the weight of the edge between node V1 and node V2 is W1, the weight of the edge between node V3 and node V2 is W2, the weight of the edge between node V4 and node V2 is W3, and the weight of the edge between node V6 and node V2 is W4, the weight of label T1 is W1+W2 and the weight of label T2 is W3+W4. If the weight of label T1 is greater than the weight of label T2, label T1 can be taken as the label of node V2.
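The frequency-first, weight-as-tie-break rule can be sketched on the V2 example as follows. The numeric weights standing in for W1 to W4, the function name, and the use of frozenset pairs as edge keys are all hypothetical.

```python
from collections import Counter, defaultdict

def update_label_weighted_tiebreak(node, graph, labels, weights):
    """Prefer the most frequent neighbor label; break ties by the summed
    weight of the edges to the neighbors carrying each tied label."""
    counts = Counter()
    weight_sum = defaultdict(float)
    for n in graph[node]:
        counts[labels[n]] += 1
        weight_sum[labels[n]] += weights[frozenset((node, n))]
    if not counts:
        return labels[node]
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    return max(tied, key=lambda label: weight_sum[label])

graph = {"V2": {"V1", "V3", "V4", "V6"}}
labels = {"V1": "T1", "V3": "T1", "V4": "T2", "V6": "T2", "V2": "T0"}
weights = {  # hypothetical values for W1, W2, W3, W4
    frozenset(("V2", "V1")): 0.5, frozenset(("V2", "V3")): 0.4,
    frozenset(("V2", "V4")): 0.3, frozenset(("V2", "V6")): 0.2,
}
new_label = update_label_weighted_tiebreak("V2", graph, labels, weights)
```

Here T1 and T2 both occur twice, and T1 wins because W1+W2 exceeds W3+W4 for the chosen values.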
Fig. 5 shows an exemplary flowchart of step S1442 of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 5, step S1442 may include steps S14421 to S14424.
In step S14421, for any node, each label of its neighbor nodes is taken as a candidate label.
For example, the neighbor nodes of node V2 include node V1, node V3, node V4, node V6 and node V7, the label of node V1 and node V3 is T1, the label of node V4 and node V6 is T2, and the label of node V7 is T3; label T1, label T2 and label T3 can then each be taken as a candidate label.
In step S14422, the neighbor nodes corresponding to each candidate label are determined among the neighbor nodes of the node.
For example, it can be determined that the neighbor nodes corresponding to candidate label T1 include node V1 and node V3, the neighbor nodes corresponding to candidate label T2 include node V4 and node V6, and the neighbor nodes corresponding to candidate label T3 include node V7.
In step S14423, the weight of each candidate label is determined according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label.
For example, if the weight of the edge between node V1 and node V2 is W1 and the weight of the edge between node V3 and node V2 is W2, the weight of candidate label T1 is W1+W2; if the weight of the edge between node V4 and node V2 is W3 and the weight of the edge between node V6 and node V2 is W4, the weight of candidate label T2 is W3+W4; if the weight of the edge between node V7 and node V2 is W5, the weight of candidate label T3 is W5.
In step S14424, the label of the node is updated according to the candidate label with the highest weight.
For example, if candidate label T2 has the highest weight among candidate label T1, candidate label T2 and candidate label T3, the label of node V2 can be updated to candidate label T2.
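Steps S14421 to S14424 can be sketched on the same V2 example. The numeric weights standing in for W1 to W5 and the function names are hypothetical; the values are chosen so that T2 has the highest total, as in the text's example.

```python
from collections import defaultdict

def candidate_label_weights(node, graph, labels, weights):
    """Sum, per candidate label, the weights of the edges between the node
    and the neighbors that carry that label (steps S14421 to S14423)."""
    totals = defaultdict(float)
    for n in graph[node]:
        totals[labels[n]] += weights[frozenset((node, n))]
    return dict(totals)

def update_label_by_weight(node, graph, labels, weights):
    """Return the candidate label with the highest total weight (S14424)."""
    totals = candidate_label_weights(node, graph, labels, weights)
    if not totals:
        return labels[node]
    return max(totals, key=totals.get)

graph = {"V2": {"V1", "V3", "V4", "V6", "V7"}}
labels = {"V1": "T1", "V3": "T1", "V4": "T2", "V6": "T2", "V7": "T3", "V2": "T0"}
weights = {  # hypothetical values for W1 .. W5
    frozenset(("V2", "V1")): 0.2, frozenset(("V2", "V3")): 0.1,
    frozenset(("V2", "V4")): 0.3, frozenset(("V2", "V6")): 0.4,
    frozenset(("V2", "V7")): 0.5,
}
totals = candidate_label_weights("V2", graph, labels, weights)
new_label = update_label_by_weight("V2", graph, labels, weights)
```

Unlike the tie-break variant, this rule ignores how often a label occurs and ranks candidates purely by summed edge weight.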
Fig. 6 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 6, the method may include steps S11 to S15.
In step S11, multiple groups of user behavior data are obtained.
In step S12, for each group of user behavior data, the content sequence corresponding to that user behavior data is determined.
In step S13, the correlation between content items is determined according to the positional relationship between content items in the content sequence.
In step S14, the category to which each content item belongs is determined using a label propagation algorithm according to the correlation between content items.
In step S15, if the similarity between the content in a first category and a second category satisfies a first condition, the first category and the second category are merged.
In one possible implementation, content clustering can be performed at a specified frequency. For example, new user behavior data can be obtained every day, and content clustering performed on the new user behavior data. If the similarity between the content in a newly determined category (for example, the first category) and an existing category (for example, the second category) satisfies the first condition, the newly determined category and the existing category can be merged.
In this embodiment, merging the first category and the second category when the similarity between their content satisfies the first condition keeps the categories stable and allows categories to expand automatically.
Fig. 7 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 7, the method may include steps S21 to S29.
In step S21, multiple groups of user behavior data are obtained.
For step S21, see the description of step S11 above.
In step S22, for each group of user behavior data, the content array corresponding to that user behavior data is determined.
For step S22, see the description of step S12 above.
In step S23, the correlation between contents is determined according to the positional relationship between contents in the content array.
For step S23, see the description of step S13 above.
In step S24, the category to which each content belongs is determined by using a label propagation algorithm according to the correlation between contents.
For step S24, see the description of step S14 above.
In step S25, the first content number in the intersection of the first category and the second category is determined.
Here, the first content number is the number of contents in the intersection of the first category and the second category.
In step S26, the second content number in the union of the first category and the second category is determined.
Here, the second content number is the number of contents in the union of the first category and the second category.
In step S27, if the ratio of the first content number to the second content number is greater than a first threshold, it is determined that the similarity of the contents in the first category and the second category satisfies the first condition.
In step S28, if the similarity of the contents in the first category and the second category satisfies the first condition, the first category and the second category are merged.
For step S28, see the description of step S15 above.
In step S29, if the ratio of the first content number to the second content number is less than or equal to the first threshold, it is determined that the similarity of the contents in the first category and the second category does not satisfy the first condition.
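Steps S25 to S29 amount to comparing the intersection-over-union ratio of two categories with the first threshold. The sketch below illustrates this under stated assumptions: categories are modeled as Python sets of content identifiers, the function names are hypothetical, and the threshold values are arbitrary.

```python
def satisfies_first_condition(first_category, second_category, first_threshold):
    """Steps S25 to S27 and S29: compare |intersection| / |union| with the threshold."""
    first_content_number = len(first_category & second_category)   # S25
    second_content_number = len(first_category | second_category)  # S26
    if second_content_number == 0:
        return False  # assumption: two empty categories are never merged
    return first_content_number / second_content_number > first_threshold


def merge_if_similar(first_category, second_category, first_threshold):
    """Step S28: return the merged category, or None when the condition is not met."""
    if satisfies_first_condition(first_category, second_category, first_threshold):
        return first_category | second_category
    return None


# Hypothetical categories sharing two of four distinct contents (ratio 0.5).
new_category = {"a", "b", "c"}
old_category = {"b", "c", "d"}
merged = merge_if_similar(new_category, old_category, first_threshold=0.4)
```

Because the comparison in step S27 is strict, a ratio exactly equal to the first threshold falls under step S29 and the categories stay separate.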
Fig. 8 shows an exemplary flowchart of the content clustering method according to an embodiment of the present disclosure. As shown in Fig. 8, the method may include steps S31 to S35.
In step S31, multiple groups of user behavior data are obtained.
For step S31, see the description of step S11 above.
In step S32, for each group of user behavior data, the content array corresponding to that user behavior data is determined.
For step S32, see the description of step S12 above.
In step S33, the correlation between contents is determined according to the positional relationship between contents in the content array.
For step S33, see the description of step S13 above.
In step S34, the category to which each content belongs is determined by using a label propagation algorithm according to the correlation between contents.
For step S34, see the description of step S14 above.
In step S35, the contents that do not satisfy a second condition are deleted from each category.
By deleting from each category the contents that do not satisfy the second condition, this embodiment can guarantee the quality of the contents in each category.
In one possible implementation, the second condition includes: the click-through rate of the content is less than a second threshold. Here, the click-through rate of a content is equal to the ratio of the number of times the content is clicked to the number of times it is displayed.
In another possible implementation, the second condition may include: the click volume of the content within a specified time range is less than a third threshold.
It should be noted that although the second condition has been described above using two implementations, those skilled in the art will understand that the present disclosure is not limited thereto. Those skilled in the art can flexibly set the second condition according to the requirements of the actual application scenario and/or personal preference.
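Step S35 can be sketched as a filter that keeps only the contents satisfying a caller-supplied second condition. Everything named here is an assumption for illustration: the helper names, the `(clicks, impressions)` statistics, and the threshold value. Since the text leaves the second condition open, the predicate is passed in as a parameter; the one used in the example simply mirrors the click-through-rate wording above.

```python
def click_through_rate(clicks, impressions):
    """Click-through rate: times clicked divided by times displayed."""
    return clicks / impressions if impressions else 0.0


def prune_categories(categories, satisfies_second_condition):
    """Step S35: delete every content that does not satisfy the second condition."""
    return {
        name: {c for c in contents if satisfies_second_condition(c)}
        for name, contents in categories.items()
    }


# Hypothetical data: content id -> (clicks, impressions).
stats = {"a": (1, 100), "b": (50, 100), "c": (0, 0)}
categories = {"k1": {"a", "b"}, "k2": {"c"}}
SECOND_THRESHOLD = 0.1  # illustrative value only

# The predicate mirrors the wording of the text; any other rule can be swapped in.
pruned = prune_categories(
    categories,
    lambda c: click_through_rate(*stats[c]) < SECOND_THRESHOLD,
)
```

Keeping the condition as a function argument reflects the statement that the second condition can be set flexibly for the application scenario.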
Fig. 9 shows a block diagram of the content clustering device according to an embodiment of the present disclosure. As shown in Fig. 9, the device includes: an obtaining module 901, configured to obtain multiple groups of user behavior data; a first determining module 902, configured to determine, for each group of user behavior data, the content array corresponding to that user behavior data; a second determining module 903, configured to determine the correlation between contents according to the positional relationship between contents in the content array; and a third determining module 904, configured to determine the category to which each content belongs by using a label propagation algorithm according to the correlation between contents.
In one possible implementation, the second determining module 903 is configured to: determine the correlation between contents that are adjacent in the content array to be correlated.
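The adjacency rule used by the second determining module can be illustrated with a short sketch. It assumes that each content array is an ordered Python list of content identifiers and that a correlated pair is stored as an unordered `frozenset`; the function name `adjacent_pairs` is hypothetical.

```python
def adjacent_pairs(content_arrays):
    """Return the set of unordered content pairs that appear next to each other."""
    correlated = set()
    for array in content_arrays:
        # zip pairs every content with its immediate successor in the array.
        for left, right in zip(array, array[1:]):
            if left != right:  # assumption: a content is not paired with itself
                correlated.add(frozenset((left, right)))
    return correlated


# Two hypothetical content arrays, one per group of user behavior data.
pairs = adjacent_pairs([["a", "b", "c"], ["c", "b"]])
```

Here "a" and "c" are never adjacent, so they are not treated as correlated even though they occur in the same array.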
Fig. 10 shows an exemplary block diagram of the content clustering device according to an embodiment of the present disclosure. As shown in Fig. 10:
In one possible implementation, the third determining module 904 includes: a first establishing submodule 9041, configured to establish an undirected graph with each content as a node in the undirected graph; a second establishing submodule 9042, configured to establish an edge between the nodes corresponding to two contents if the correlation between the two contents is correlated; an assigning submodule 9043, configured to assign a label to each node; an updating submodule 9044, configured to update, for any node, the label of the node according to the labels of the neighbor nodes of the node, where the neighbor nodes of a node are the nodes connected to that node; and a determining submodule 9045, configured to determine, when the labels of the nodes are stable, the category to which the content corresponding to each node belongs according to the label of each node.
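The five submodules correspond to the usual stages of label propagation, which the following sketch walks through in its unweighted form. It is an illustration under assumptions, not the disclosed implementation: the function name, the random visiting order, the rule of keeping the current label on ties, and the `max_rounds` safeguard are all choices made for the example.

```python
import random
from collections import Counter, defaultdict


def label_propagation(edges, max_rounds=100, seed=0):
    """Cluster the nodes of an undirected graph by unweighted label propagation."""
    neighbors = defaultdict(set)
    for a, b in edges:  # an edge joins the nodes of two correlated contents
        neighbors[a].add(b)
        neighbors[b].add(a)

    labels = {node: node for node in neighbors}  # one distinct label per node
    order = sorted(neighbors)
    rng = random.Random(seed)

    for _ in range(max_rounds):
        rng.shuffle(order)
        changed = False
        for node in order:
            counts = Counter(labels[nb] for nb in neighbors[node])
            top = max(counts.values())
            candidates = sorted(lab for lab, n in counts.items() if n == top)
            # Assumption: keep the current label on ties so stable labels stay stable.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # the labels of all nodes are stable
            break

    categories = defaultdict(set)
    for node, label in labels.items():
        categories[label].add(node)
    return list(categories.values())


# Two hypothetical groups of mutually correlated contents.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("x", "y"), ("y", "z"), ("x", "z")]
result = label_propagation(edges)
```

Nodes that end up sharing a label form one category, so the two disconnected triangles above come out as two categories.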
In one possible implementation, the updating submodule 9044 is configured to: for any node, update the label of the node according to the label that occurs most frequently among the labels of the neighbor nodes of the node.
In one possible implementation, the device further includes: a fourth determining module 905, configured to determine, for two contents whose correlation is correlated, the similarity between the two contents according to the number of times the two contents appear adjacent to each other in the content arrays and the number of times each of the two contents appears in the content arrays.
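The passage above names the inputs of the similarity (adjacent co-occurrence counts and individual occurrence counts) without restating a formula at this point. The sketch below therefore assumes one plausible normalization, the adjacent count divided by the square root of the product of the two occurrence counts; this choice, like the function name `pair_similarities`, is an assumption made purely for illustration.

```python
from collections import Counter
from math import sqrt


def pair_similarities(content_arrays):
    """Map each adjacent content pair to an assumed count-based similarity."""
    occurrences = Counter()  # how often each content appears
    adjacent = Counter()     # how often each pair appears side by side
    for array in content_arrays:
        occurrences.update(array)
        for left, right in zip(array, array[1:]):
            if left != right:
                adjacent[frozenset((left, right))] += 1

    similarities = {}
    for pair, together in adjacent.items():
        a, b = tuple(pair)
        # Assumed normalization; the disclosure may define the similarity differently.
        similarities[pair] = together / sqrt(occurrences[a] * occurrences[b])
    return similarities


# Hypothetical content arrays: "a" and "b" are adjacent in both of them.
sims = pair_similarities([["a", "b"], ["a", "b", "c"]])
```

The resulting values can serve as edge weights when the labels are updated by weight rather than by frequency.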
In one possible implementation, the updating submodule 9044 includes: a determining unit, configured to determine the weight of the edge between the nodes corresponding to two contents according to the similarity between the two contents whose correlation is correlated; and an updating unit, configured to update, for any node, the label of the node according to the labels of the neighbor nodes of the node and the weights of the edges between the node and its neighbor nodes.
In one possible implementation, the updating unit includes: a first determining subunit, configured to determine, for any node, each of the labels of the neighbor nodes of the node as a candidate label; a second determining subunit, configured to determine the neighbor nodes corresponding to each candidate label among the neighbor nodes of the node; a third determining subunit, configured to determine the weight of each candidate label according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label; and an updating subunit, configured to update the label of the node according to the candidate label with the highest weight.
In one possible implementation, the device further includes: a merging module 906, configured to merge the first category and the second category if the similarity of the contents in the first category and the second category satisfies the first condition.
In one possible implementation, the device further includes: a fifth determining module 907, configured to determine the first content number in the intersection of the first category and the second category; a sixth determining module 908, configured to determine the second content number in the union of the first category and the second category; a seventh determining module 909, configured to determine that the similarity of the contents in the first category and the second category satisfies the first condition if the ratio of the first content number to the second content number is greater than the first threshold; and an eighth determining module 910, configured to determine that the similarity of the contents in the first category and the second category does not satisfy the first condition if the ratio of the first content number to the second content number is less than or equal to the first threshold.
In one possible implementation, the device further includes: a deleting module 911, configured to delete from each category the contents that do not satisfy the second condition.
In one possible implementation, the second condition includes: the click-through rate of the content is less than the second threshold.
In this embodiment, multiple groups of user behavior data are obtained; for each group of user behavior data, the corresponding content array is determined; the correlation between contents is determined according to the positional relationship between contents in the content array; and the category to which each content belongs is determined by using a label propagation algorithm according to the correlation between contents. Content clustering can thus be performed automatically, without manual work, which saves manpower and makes it easy to cluster a large amount of content. The correlation between contents can also be mined more effectively, which improves the accuracy of content clustering.
Fig. 11 is a block diagram of a device 1900 for content clustering according to an exemplary embodiment. For example, the device 1900 may be provided as a server. Referring to Fig. 11, the device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions so as to perform the above method.
The device 1900 may also include a power supply component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the device 1900 to complete the above method.
The present disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA) or programmable logic arrays (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions so as to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus or other devices, to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

1. A content clustering method, characterized by comprising:
obtaining multiple groups of user behavior data;
for each group of user behavior data, determining the content array corresponding to the user behavior data;
determining the correlation between contents according to the positional relationship between contents in the content array;
determining the category to which each content belongs by using a label propagation algorithm according to the correlation between contents.
2. The method according to claim 1, characterized in that determining the correlation between contents according to the positional relationship between contents in the content array comprises:
determining the correlation between contents that are adjacent in the content array to be correlated.
3. The method according to claim 1 or 2, characterized in that determining the category to which each content belongs by using a label propagation algorithm according to the correlation between contents comprises:
establishing an undirected graph, with each content as a node in the undirected graph;
if the correlation between two contents is correlated, establishing an edge between the nodes corresponding to the two contents;
assigning a label to each node;
for any node, updating the label of the node according to the labels of the neighbor nodes of the node, wherein the neighbor nodes of the node are the nodes connected to the node;
when the labels of the nodes are stable, determining the category to which the content corresponding to each node belongs according to the label of each node.
4. The method according to claim 3, characterized in that, for any node, updating the label of the node according to the labels of the neighbor nodes of the node comprises:
for any node, updating the label of the node according to the label that occurs most frequently among the labels of the neighbor nodes of the node.
5. The method according to claim 3, characterized in that, before the label of the node is updated, the method further comprises:
for two contents whose correlation is correlated, determining the similarity between the two contents according to the number of times the two contents appear adjacent to each other in the content arrays and the number of times each of the two contents appears in the content arrays.
6. The method according to claim 5, characterized in that, for any node, updating the label of the node according to the labels of the neighbor nodes of the node comprises:
determining the weight of the edge between the nodes corresponding to two contents according to the similarity between the two contents whose correlation is correlated;
for any node, updating the label of the node according to the labels of the neighbor nodes of the node and the weights of the edges between the node and its neighbor nodes.
7. The method according to claim 6, characterized in that, for any node, updating the label of the node according to the labels of the neighbor nodes of the node and the weights of the edges between the node and its neighbor nodes comprises:
for any node, determining each of the labels of the neighbor nodes of the node as a candidate label;
determining the neighbor nodes corresponding to each candidate label among the neighbor nodes of the node;
determining the weight of each candidate label according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label;
updating the label of the node according to the candidate label with the highest weight.
8. The method according to claim 1, characterized in that, after the category to which each content belongs is determined, the method further comprises:
if the similarity of the contents in a first category and a second category satisfies a first condition, merging the first category and the second category.
9. The method according to claim 8, characterized in that, after the category to which each content belongs is determined, the method further comprises:
determining the first content number in the intersection of the first category and the second category;
determining the second content number in the union of the first category and the second category;
if the ratio of the first content number to the second content number is greater than a first threshold, determining that the similarity of the contents in the first category and the second category satisfies the first condition;
if the ratio of the first content number to the second content number is less than or equal to the first threshold, determining that the similarity of the contents in the first category and the second category does not satisfy the first condition.
10. The method according to claim 1, characterized in that, after the category to which each content belongs is determined, the method further comprises:
deleting from each category the contents that do not satisfy a second condition.
11. The method according to claim 10, characterized in that the second condition comprises: the click-through rate of the content is less than a second threshold.
12. A content clustering device, characterized by comprising:
an obtaining module, configured to obtain multiple groups of user behavior data;
a first determining module, configured to determine, for each group of user behavior data, the content array corresponding to the user behavior data;
a second determining module, configured to determine the correlation between contents according to the positional relationship between contents in the content array;
a third determining module, configured to determine the category to which each content belongs by using a label propagation algorithm according to the correlation between contents.
13. The device according to claim 12, characterized in that the second determining module is configured to:
determine the correlation between contents that are adjacent in the content array to be correlated.
14. The device according to claim 12 or 13, characterized in that the third determining module comprises:
a first establishing submodule, configured to establish an undirected graph with each content as a node in the undirected graph;
a second establishing submodule, configured to establish an edge between the nodes corresponding to two contents if the correlation between the two contents is correlated;
an assigning submodule, configured to assign a label to each node;
an updating submodule, configured to update, for any node, the label of the node according to the labels of the neighbor nodes of the node, wherein the neighbor nodes of the node are the nodes connected to the node;
a determining submodule, configured to determine, when the labels of the nodes are stable, the category to which the content corresponding to each node belongs according to the label of each node.
15. The device according to claim 14, characterized in that the updating submodule is configured to:
for any node, update the label of the node according to the label that occurs most frequently among the labels of the neighbor nodes of the node.
16. The device according to claim 14, characterized in that the device further comprises:
a fourth determining module, configured to determine, for two contents whose correlation is correlated, the similarity between the two contents according to the number of times the two contents appear adjacent to each other in the content arrays and the number of times each of the two contents appears in the content arrays.
17. The device according to claim 16, characterized in that the updating submodule comprises:
a determining unit, configured to determine the weight of the edge between the nodes corresponding to two contents according to the similarity between the two contents whose correlation is correlated;
an updating unit, configured to update, for any node, the label of the node according to the labels of the neighbor nodes of the node and the weights of the edges between the node and its neighbor nodes.
18. The device according to claim 17, characterized in that the updating unit comprises:
a first determining subunit, configured to determine, for any node, each of the labels of the neighbor nodes of the node as a candidate label;
a second determining subunit, configured to determine the neighbor nodes corresponding to each candidate label among the neighbor nodes of the node;
a third determining subunit, configured to determine the weight of each candidate label according to the sum of the weights of the edges between the node and the neighbor nodes corresponding to that candidate label;
an updating subunit, configured to update the label of the node according to the candidate label with the highest weight.
19. The device according to claim 12, characterized in that the device further comprises:
a merging module, configured to merge a first category and a second category if the similarity of the contents in the first category and the second category satisfies a first condition.
20. The device according to claim 19, characterized in that the device further comprises:
a fifth determining module, configured to determine the first content number in the intersection of the first category and the second category;
a sixth determining module, configured to determine the second content number in the union of the first category and the second category;
a seventh determining module, configured to determine that the similarity of the contents in the first category and the second category satisfies the first condition if the ratio of the first content number to the second content number is greater than a first threshold;
an eighth determining module, configured to determine that the similarity of the contents in the first category and the second category does not satisfy the first condition if the ratio of the first content number to the second content number is less than or equal to the first threshold.
21. The device according to claim 12, characterized in that the device further comprises:
a deleting module, configured to delete from each category the contents that do not satisfy a second condition.
22. The device according to claim 21, characterized in that the second condition comprises: the click-through rate of the content is less than a second threshold.
23. A content clustering device, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method according to any one of claims 1 to 11.
24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 11.
CN201810226492.4A 2018-03-19 2018-03-19 Content clustering method and device Active CN110287977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810226492.4A CN110287977B (en) 2018-03-19 2018-03-19 Content clustering method and device

Publications (2)

Publication Number Publication Date
CN110287977A true CN110287977A (en) 2019-09-27
CN110287977B CN110287977B (en) 2021-09-21

Family

ID=68001077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810226492.4A Active CN110287977B (en) 2018-03-19 2018-03-19 Content clustering method and device

Country Status (1)

Country Link
CN (1) CN110287977B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444387A (en) * 2020-03-27 2020-07-24 腾讯科技(深圳)有限公司 Video classification method and device, computer equipment and storage medium
CN111538837A (en) * 2020-04-27 2020-08-14 北京同邦卓益科技有限公司 Method and device for analyzing enterprise operation range information

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254038B (en) * 2011-08-11 2013-01-23 武汉安问科技发展有限责任公司 System and method for analyzing network comment relevance
CN102768670B (en) * 2012-05-31 2014-08-20 哈尔滨工程大学 Webpage clustering method based on node property label propagation
CN103226470B (en) * 2013-03-21 2016-09-14 北京神州绿盟信息安全科技股份有限公司 A kind of method and device determining check item weighted value based on BVS
CN103577562B (en) * 2013-10-24 2016-08-31 河海大学 A kind of many measuring periods sequence similarity analyzes method
EP2980754A1 (en) * 2014-07-28 2016-02-03 Thomson Licensing Method and apparatus for generating temporally consistent superpixels
CN104951505A (en) * 2015-05-20 2015-09-30 中国科学院信息工程研究所 Large-scale data clustering method based on graphic calculation technology

Also Published As

Publication number Publication date
CN110287977B (en) 2021-09-21

Similar Documents

Publication Publication Date Title
US10366095B2 (en) Processing time series
Nagata et al. A new genetic algorithm for the asymmetric traveling salesman problem
US11087861B2 (en) Creation of new chemical compounds having desired properties using accumulated chemical data to construct a new chemical structure for synthesis
CN110807515A (en) Model generation method and device
US10055435B2 (en) Interactive presentation of large scale graphs
US20200012733A1 (en) Multi-dimensional knowledge index and application thereof
US10599979B2 (en) Candidate visualization techniques for use with genetic algorithms
CN111797327B (en) Social network modeling method and device
US11595269B1 (en) Identifying upgrades to an edge network by artificial intelligence
CN110287977A (en) Content clustering method and device
Deutsch et al. An open-source program for efficiently computing ultimate pit limits: Mineflow
US20230069079A1 (en) Statistical K-means Clustering
US20190230023A1 (en) Determining connections between nodes in a network
JP7398474B2 (en) Developing and training deep forest models
US10430739B2 (en) Automatic solution to a scheduling problem
CN110309188A (en) Content clustering method and device
JP5555238B2 (en) Information processing apparatus and program for Bayesian network structure learning
US20220188691A1 (en) Machine Learning Pipeline Generation
CN115204931A (en) User service policy determination method and device and electronic equipment
US20170032018A1 (en) Large taxonomy categorization
US11321424B2 (en) Predicting variables where a portion are input by a user and a portion are predicted by a system
CN110309404A (en) Content recommendation method and device
CN109286823A (en) Method and device for acquiring multimedia content
CN110309298A (en) Topic prediction method and device
CN110309294A (en) Method and device for determining labels of a content collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200429

Address after: 310052 Room 508, Floor 5, Building 4, No. 699 Wangshang Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 100080 Areas A and C, Floor 5, Block A, Sinosteel International Plaza, No. 8 Haidian Street, Haidian District, Beijing

Applicant before: Youku Network Technology (Beijing) Co., Ltd.

GR01 Patent grant