CN108446731B

CN108446731B - Content duplication removing method and device

Info

Publication number: CN108446731B
Application number: CN201810220157.3A
Authority: CN
Inventors: 王洁; 徐钊; 史小龙
Original assignee: Qingdao Hisense Media Network Technology Co Ltd
Current assignee: Qingdao Hisense Media Network Technology Co Ltd
Priority date: 2018-03-16
Filing date: 2018-03-16
Publication date: 2021-01-08
Anticipated expiration: 2038-03-16
Also published as: CN108446731A

Abstract

The application relates to a content deduplication method and device, and belongs to the field of communications. The method comprises: calculating, according to content description information of each content included in a target micro-type, a first probability that each content belongs to each of M content topics, where M is an integer greater than 1; obtaining a topic vector of the target micro-type according to the first probability that each content belongs to each content topic, where the target micro-type is one of N micro-types and N is an integer greater than 1; clustering the N micro-types according to the topic vector of each of the N micro-types to obtain K clusters, where K is an integer greater than 1; and selecting, from each of the K clusters, the micro-type corresponding to that cluster, and forming the selected micro-types into a micro-type set. The method and device can avoid a large amount of repeated content among micro-types.

Description

Content duplication removing method and device

Technical Field

The present application relates to the field of communications, and in particular, to a method and an apparatus for content deduplication.

Background

At present, to improve user experience, major content websites present more and more high-quality content to users and increasingly adopt a waterfall-flow presentation mode, in which a web page can load content indefinitely. The content may be video, music, or the like; for example, current video websites display videos in a waterfall-flow presentation mode.

To suit the waterfall-flow presentation mode, thousands of micro-types are often defined for the content. Each micro-type includes a plurality of contents, and each micro-type and its corresponding contents are displayed in the waterfall flow. Because the defined micro-types accumulate over time, some of them are similar to one another and differ only slightly, and two such micro-types can share a large amount of repeated content.

Thus, when personalized recommendations are made, there may be a large amount of duplicate content between two or more of the recommended micro-types.

Disclosure of Invention

In order to avoid a large amount of repeated content among micro types, the embodiment of the application provides a content deduplication method and a content deduplication device. The technical scheme is as follows:

In a first aspect, the present application provides a method for content deduplication, the method comprising:

calculating a first probability that each content belongs to each content topic in M content topics according to content description information of each content included in the target micro type, wherein M is an integer greater than 1;

obtaining a topic vector of the target micro type according to the first probability that each content belongs to each content topic, wherein the target micro type is one of N micro types, and N is an integer greater than 1;

clustering the N micro-types according to the theme vector of each micro-type in the N micro-types to obtain K clusters, wherein K is an integer greater than 1;

and respectively selecting the micro type corresponding to each cluster from each cluster in the K clusters, and forming the selected micro types into a micro type set to be recommended.

Optionally, the calculating, according to the content description information of each content included in the target micro-type, a first probability that each content belongs to each content topic in the M content topics includes:

segmenting content description information of target content to obtain a plurality of words, and forming the words into a corpus of the target content, wherein the target content is any one of the target micro-types;

and inputting the corpus into a preset topic model to perform topic operation to obtain a first probability that the target content belongs to each of the M content topics.

Optionally, the obtaining the theme vector of the target micro-type according to the first probability that each content belongs to each content theme includes:

acquiring first probabilities of contents belonging to the same content theme, and acquiring a second probability that the target micro type belongs to the content theme according to the first probabilities;

and forming a topic vector of the target micro type according to the second probability that the target micro type belongs to each content topic in the M content topics.

Optionally, the selecting the micro type corresponding to each cluster from each cluster of the K clusters respectively includes:

determining a centroid vector of a centroid of a target cluster according to a topic vector of each micro type included in the target cluster, wherein the target cluster is any one of the K clusters;

respectively calculating the distance between each micro type and the centroid according to the theme vector of each micro type and the centroid vector of the centroid;

and selecting the micro type corresponding to the target cluster from the target clusters according to the distance between each micro type and the centroid.

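Optionally, the selecting the micro type corresponding to each cluster from each cluster of the K clusters respectively includes:

counting the number of contents included in each micro type in a target cluster, wherein the target cluster is any one of the K clusters;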

and selecting the micro type corresponding to the target cluster from the target cluster according to the content number of each micro type.

Optionally, after the selecting the micro-type corresponding to each cluster from each cluster of the K clusters, the method further includes:

and respectively calculating the recommendation index of each micro type according to the viewing time of each content included in each micro type in the micro type set, and selecting and recommending the micro type from the micro type set according to the recommendation index of each micro type.

In a second aspect, the present application provides an apparatus for content deduplication, the apparatus comprising:

the calculating module is used for calculating a first probability that each content belongs to each content topic in M content topics according to content description information of each content included in the target micro type, wherein M is an integer greater than 1;

an obtaining module, configured to obtain a topic vector of the target micro-type according to a first probability that each content belongs to each content topic, where the target micro-type is one of N micro-types, and N is an integer greater than 1;

the clustering module is used for clustering the N micro-types according to the theme vector of each micro-type in the N micro-types to obtain K clusters, wherein K is an integer greater than 1;

and the selection module is used for respectively selecting the micro-type corresponding to each cluster from each cluster in the K clusters and forming the selected micro-types into a micro-type set to be recommended.

Optionally, the calculation module includes:

the composition unit is used for performing word segmentation on content description information of target content to obtain a plurality of words, and forming the words into a corpus of the target content, wherein the target content is any one of the target micro-types;

and the input unit is used for inputting the corpus into a preset topic model to perform topic operation so as to obtain a first probability that the target content belongs to each of the M content topics.

Optionally, the obtaining module includes:

the acquisition unit is used for acquiring each first probability of each content belonging to the same content theme and acquiring a second probability of the target micro type belonging to the content theme according to each first probability;

a composing unit, configured to compose the second probability that the target micro-type belongs to each of the M content topics into a topic vector of the target micro-type.

Optionally, the selecting module includes:

a determining unit, configured to determine a centroid vector of a centroid of a target cluster according to a topic vector of each micro type included in the target cluster, where the target cluster is any one of the K clusters;

a second calculating unit, configured to calculate a distance between each micro type and the centroid according to the theme vector of each micro type and the centroid vector of the centroid, respectively;

a first selecting unit, configured to select, from the target clusters, a micro-type corresponding to the target cluster according to a distance between each micro-type and the centroid.

Optionally, the selecting module includes:

a counting unit, configured to count the number of contents included in each micro type in a target cluster, where the target cluster is any one of the K clusters;

and the second selection unit is used for selecting the micro type corresponding to the target cluster from the target cluster according to the content number of each micro type.

Optionally, the apparatus further comprises:

and the recommending module is used for respectively calculating the recommending index of each micro type according to the watching time of each content included in each micro type in the micro type set, and selecting and recommending the micro type from the micro type set according to the recommending index of each micro type.

In a third aspect, this application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of the first aspect or any possible implementation manner of the first aspect.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

the method comprises the steps of obtaining a theme vector of each micro type in N micro types, clustering the N micro types according to the theme vector of each micro type to obtain K clusters, wherein each cluster comprises micro types with similar contents, and then selecting the micro type corresponding to each cluster from each cluster in the K clusters, so that a large number of micro types with similar contents can be excluded, and the selected micro types do not have a large number of repeated contents.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 2 is a flowchart of a method for content deduplication provided in an embodiment of the present application;

FIG. 3-1 is a flow chart of another method for content deduplication provided by an embodiment of the present application;

FIG. 3-2 is a schematic diagram of a cluster provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a content deduplication apparatus provided in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

Referring to fig. 1, an embodiment of the present application provides a system architecture for content deduplication, including:

a content library including a plurality of contents, each content having content description information. The content description information of the content may include a content introduction, a content title, and/or a content type of the content, etc. For example, the content may be a video, the content description information may be a video introduction, a video title, a video type, and the like, and the video type may be a video tag, a video secondary type, and the like.

The micro-type library comprises N micro-types, and each micro-type corresponds to one micro-type rule. The contents belonging to each micro-type can be determined from the content library according to the micro-type rule of that micro-type, and each micro-type and the contents belonging to it are stored in a correspondence between micro-types and contents. For example, taking video content, assume that a certain micro-type is "review the black abdomen element in nostalgic movies" and that its rule is {"category": "movie", "child_category_name": "nostalgic", "tag": "black abdomen"}. The rule is matched against each video in the content library, and the matching movies are assigned to the micro-type.
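The rule matching described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout, the `tags` field, and the function names are assumptions, and only the rule keys come from the example in the text.

```python
from typing import Dict, List

# Hypothetical content records; the field names mirror the rule keys.
CONTENT_LIBRARY: List[Dict] = [
    {"id": 1, "category": "movie", "child_category_name": "nostalgic",
     "tags": ["black abdomen", "drama"]},
    {"id": 2, "category": "movie", "child_category_name": "modern",
     "tags": ["comedy"]},
    {"id": 3, "category": "tv", "child_category_name": "nostalgic",
     "tags": ["black abdomen"]},
]


def matches(rule: Dict[str, str], content: Dict) -> bool:
    """Return True when the content satisfies every field of the rule."""
    for key, wanted in rule.items():
        if key == "tag":
            # Assumption: a content may carry several tags.
            if wanted not in content.get("tags", []):
                return False
        elif content.get(key) != wanted:
            return False
    return True


def contents_for(rule: Dict[str, str], library: List[Dict]) -> List[int]:
    """Collect the identifiers of all contents that match the rule."""
    return [content["id"] for content in library if matches(rule, content)]


rule = {"category": "movie", "child_category_name": "nostalgic",
        "tag": "black abdomen"}
matched_ids = contents_for(rule, CONTENT_LIBRARY)
```

With the sample records above, only content 1 satisfies all three rule fields.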

For convenience of explanation, any one micro-type is referred to as the target micro-type. The content description information of each content belonging to the target micro-type is segmented to obtain a corpus of that content, where the corpus includes the words obtained by segmenting the content description information. Optionally, only the content introduction and the content title are segmented, and the content type is used directly as a word without being segmented.

The topic model: the corpus of each content included in each micro-type can be input into the topic model, which performs a topic operation on the corpora to obtain a topic vector of each micro-type. Optionally, the topic model may be a Latent Dirichlet Allocation (LDA) document topic generation model.

The clustering model: the topic vector of each micro-type can be input into the clustering model, which clusters the N micro-types and outputs K clusters. A corresponding micro-type is then selected from each of the K clusters, and the selected micro-types form a micro-type set.

Optionally, M, N, and K are each integers greater than 1.

Referring to fig. 2, an embodiment of the present application provides a content deduplication method, which may be applied to the system architecture shown in fig. 1, and includes:

Step 201: According to the content description information of each content included in the target micro-type, calculate a first probability that each content belongs to each of M content topics, where M is an integer greater than 1.

Step 202: Obtain a topic vector of the target micro-type according to the first probability that each content belongs to each content topic.

The topic vector comprises a second probability that the target micro-type belongs to each of the M content topics; the target micro-type is one of the N micro-types, and M and N are both integers greater than 1.

Step 203: Cluster the N micro-types according to the topic vector of each of the N micro-types to obtain K clusters, where K is an integer greater than 1.

In implementation, the topic vector of each of the N micro-types may be input into a preset clustering model, which clusters the N micro-types according to their topic vectors and outputs K clusters.

Because the clustering model has low complexity and requires little computation, the K clusters can be obtained quickly, which improves the efficiency of content deduplication and reduces its computational cost.

Step 204: Select, from each of the K clusters, the micro-type corresponding to that cluster, and form the selected micro-types into a micro-type set.

In the embodiment of the application, the topic vector of each micro type in the N micro types is obtained, the N micro types are clustered according to the topic vector of each micro type to obtain K clusters, each cluster comprises micro types with similar contents, and then the micro type corresponding to each cluster is selected from each cluster in the K clusters, so that a large number of micro types with similar contents can be excluded, and the selected micro types do not have a large number of repeated contents.

Referring to fig. 3-1, an embodiment of the present application provides a content deduplication method, which may be described in detail with respect to the method of the embodiment shown in fig. 2, and includes:

Step 301: According to the content description information of each content included in the target micro-type, calculate a first probability that each content belongs to each of the M content topics, where the target micro-type is one of the N micro-types.

Each of the N micro-types corresponds to a micro-type rule. Before this step is executed, the micro-type rule of each micro-type may be matched against the contents in the content library to determine the contents belonging to each micro-type.

This step can be implemented by the following steps 3011 and 3012:

3011: Segment the content description information of the target content to obtain a plurality of words, and form the words into a corpus of the target content, where the target content is any one of the contents in the target micro-type.

The content description information of the target content may include information such as a content introduction, a content title, and/or a content type of the target content. For example, when the target content is a video, the content description information may include information such as a video introduction, a video title, and/or a video type, and the video type may be a video tag and/or a video secondary type.

When the content description information of the target content is segmented, the content introduction and the content title may be segmented to obtain a plurality of words, while the content type is left unsegmented; the words obtained by segmentation and the content type then form the corpus of the target content.

Optionally, the corpus of target content may further include a content identification of the target content.

For example, there is a movie ". star. hollywood" whose content identification is 1526302. The words obtained by segmenting the video introduction and the video title of the movie are "northeast", "chinese image", "killer", "movie factory", "adventure picture", "theater line", "black side", "optimistic", "feeling", "traveling", "fun", "hollywood", "friendship dock", "friendship", "actor", "flee", "grand photo", "modern", "brave", "off", and "off-site love"; the types of the movie are "love film", "comedy film", "action film", and "drama film".

The obtained words and the types of the movie are combined into a corpus of the movie, which may be: {1526302: ["love film", "northeast", "chinese image", "killer", "movie factory", "adventure picture", "theater line", "comedy film", "black side", "action film", "optimistic", "feeling", "traveling", "fun", "hollywood", "drama film", "friendship", "actor", "flee", "grand photo", "modern", "brave", "off-site love"]}.
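Building such a corpus can be sketched as follows. The segmenter here is only a stand-in that extracts lowercase alphabetic tokens; a real system would use a proper word segmenter for the language of the content. The function names and sample strings are hypothetical.

```python
import re
from typing import Dict, List


def segment(text: str) -> List[str]:
    """Stand-in segmenter (assumption): lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_corpus(content_id: int, introduction: str, title: str,
                 content_types: List[str]) -> Dict[int, List[str]]:
    """Segment the title and introduction; keep content types unsegmented."""
    words = segment(title) + segment(introduction)
    # Each content type is appended whole, as a single corpus entry.
    return {content_id: words + list(content_types)}


corpus = build_corpus(
    1526302,
    introduction="A brave actor travels",
    title="Hollywood fun",
    content_types=["comedy film", "action film"],
)
```

The content identification is used as the key, so the corpus stays linked to its content when it is later passed to the topic model.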

3012: Input the corpus into a preset topic model to perform a topic operation, and obtain a first probability that the target content belongs to each of the M content topics.

Before this step is executed, the number of content topics of the preset topic model may be set to M. In this step, the preset topic model then performs the topic operation on the input corpus and outputs M first probabilities, one for each of the M content topics to which the target content may belong.

Optionally, the preset topic model may be LDA or the like.

Optionally, when the topic model outputs the first probability that the target content belongs to each of the M content topics, the topic model may further output words belonging to each content topic in the target content.

The first probability that the target content belongs to the i-th content topic is P_i = n_i / n, where n_i is the number of words in the target content that the topic model outputs as belonging to the i-th content topic, and n is the number of words included in the target content. Alternatively, the first probability may be obtained as follows.

The topic model outputs a first word set consisting of the words in the target content that belong to the i-th topic. The first word set comprises word1, word2, ……, wordN, and the probabilities that these words belong to the i-th topic are p(word1), p(word2), ……, p(wordN). The topic model also outputs all words in the target content, that is, a second word set comprising word1, word2, ……, wordM. In this passage, N and M are the numbers of words in the first and second word sets respectively; both are integers greater than or equal to 1, and M is greater than or equal to N.

A first vector is constructed, composed either of the probabilities p(word1), p(word2), ……, p(wordN) of the words in the first word set or of the preset values corresponding to the words in the first word set. Optionally, the preset value corresponding to each word may be a value such as 1, 2, or 3, in which case the first vector includes N preset values. For example, if the preset value is 1, the first vector is x_ti = [1, ……, 1]; when probabilities are used, x_ti = [p(word1), ……, p(wordN)].

A second vector is constructed, comprising a value for each word in the second word set: if the word is also in the first word set, its value is the preset value; otherwise its value is 0. For example, assuming that the preset value is 1, the second vector is y = [1, 0, 1, 1, 0, ……, 1].

Then, the first probability that the target content belongs to the i-th topic is calculated by a formula over the first vector and the second vector, in which x_i is an element of the first vector x_ti and y_i is an element of the second vector y.
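The count-based variant P_i = n_i / n and the construction of the two vectors can be sketched as follows. The word-to-topic assignments, function names, and sample data are hypothetical. The sketch deliberately stops short of combining the two vectors, because the combining formula is not given in this text.

```python
from typing import List, Tuple

# Hypothetical topic-model output: (word, index of the topic it belongs to).
WordTopic = Tuple[str, int]


def first_probability_by_count(assignments: List[WordTopic], topic: int) -> float:
    """P_i = n_i / n: share of the content's words assigned to the topic."""
    n = len(assignments)
    n_i = sum(1 for _, assigned in assignments if assigned == topic)
    return n_i / n if n else 0.0


def first_vector(first_word_set: List[str], preset: float = 1.0) -> List[float]:
    """First vector built from preset values, one per word of the first set."""
    return [preset] * len(first_word_set)


def second_vector(all_words: List[str], first_word_set: List[str],
                  preset: float = 1.0) -> List[float]:
    """Second vector: preset value where the word is in the first set, else 0."""
    members = set(first_word_set)
    return [preset if word in members else 0.0 for word in all_words]


assignments = [("killer", 0), ("brave", 0), ("fun", 1), ("actor", 1), ("flee", 0)]
all_words = [word for word, _ in assignments]
topic0_words = [word for word, assigned in assignments if assigned == 0]

p0 = first_probability_by_count(assignments, 0)
x_t0 = first_vector(topic0_words)
y = second_vector(all_words, topic0_words)
```

Here three of the five words belong to topic 0, so the count-based first probability is 3/5.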

Assuming that the target micro-type includes X contents, for the remaining X-1 contents, the steps 3011 and 3012 are respectively performed to obtain a first probability that each content in the remaining X-1 contents belongs to each content topic in the M content topics. The results obtained are assumed to be shown in table 1 below.

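TABLE 1

Content      Topic 1   Topic 2   ……   Topic M
Content 1    P11       P12       ……   P1M
Content 2    P21       P22       ……   P2M
……           ……        ……        ……   ……
Content X    PX1       PX2       ……   PXM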

Step 302: Obtain a topic vector of the target micro-type according to the first probability that each content in the target micro-type belongs to each content topic.

This step can be realized by the following two steps 3021 and 3022:

3021: Acquire the first probabilities of the contents for the same content topic, and determine a second probability that the target micro-type belongs to that content topic according to the acquired first probabilities.

Optionally, the obtained first probabilities may be sorted. If the number of contents included in the target micro-type is odd, the first probability at the middle position of the sorted first probabilities is determined as the second probability that the target micro-type belongs to the content topic. If the number of contents is even, the two first probabilities at the middle positions are obtained, one of them is selected at random, and the selected first probability is determined as the second probability that the target micro-type belongs to the content topic.

For example, assume that the target micro-type includes X contents and that X is an odd number.

For content topic 1, the first probability that each of the X contents belongs to content topic 1 is obtained, giving X first probabilities P11, P21, ……, PX1. These first probabilities are sorted, and the one at the middle position is obtained. Assuming that it is P21, the first probability P21 is determined as the second probability that the target micro-type belongs to content topic 1.

For content topic 2, the first probability that each of the X contents belongs to content topic 2 is obtained, giving X first probabilities P12, P22, ……, PX2. These first probabilities are sorted, and the one at the middle position is obtained. Assuming that it is P22, the first probability P22 is determined as the second probability that the target micro-type belongs to content topic 2.

The same is done for each subsequent content topic. For content topic M, the first probability that each of the X contents belongs to content topic M is obtained, giving X first probabilities P1M, P2M, ……, PXM. These first probabilities are sorted, and the one at the middle position is obtained. Assuming that it is P2M, the first probability P2M is determined as the second probability that the target micro-type belongs to content topic M.

The results obtained are shown in Table 2 below.

TABLE 2

Content topic        Topic 1   Topic 2   ……   Topic M
Second probability   Q1        Q2        ……   QM

3022: Form a topic vector of the target micro-type from the second probabilities that the target micro-type belongs to each of the M content topics.

For example, referring to Table 2, the second probabilities that the target micro-type belongs to the M content topics are Q1, Q2, ……, QM respectively, so the topic vector of the target micro-type is [Q1, Q2, ……, QM].
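Steps 3021 and 3022 can be sketched as follows: for each content topic, the first probabilities of all contents are sorted and the middle one is taken, with a random choice between the two middle values when the number of contents is even. The function names, the table layout (one row per content, one column per topic), and the sample values are assumptions for illustration.

```python
import random
from typing import List, Optional


def second_probability(first_probs: List[float],
                       rng: Optional[random.Random] = None) -> float:
    """Middle value of the sorted first probabilities for one content topic."""
    ordered = sorted(first_probs)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even count: pick one of the two middle values at random.
    rng = rng or random.Random()
    return rng.choice([ordered[mid - 1], ordered[mid]])


def topic_vector(first_prob_table: List[List[float]],
                 rng: Optional[random.Random] = None) -> List[float]:
    """first_prob_table[x][m] is the first probability of content x for topic m."""
    num_topics = len(first_prob_table[0])
    return [
        second_probability([row[m] for row in first_prob_table], rng)
        for m in range(num_topics)
    ]


# Three contents (rows) and three content topics (columns).
table = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
]
vector = topic_vector(table)
```

Because each entry is chosen independently per topic, the resulting vector need not sum to exactly 1.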

There are N micro-types in total. For each of the remaining N-1 micro-types, the operations of steps 301 and 302 above are performed, resulting in a topic vector for each of the N-1 micro-types.

Step 303: Cluster the N micro-types according to the topic vector of each of the N micro-types to obtain K clusters, where K is an integer greater than 1.

Before this step is executed, the number of clusters of the preset clustering model may be set to K. When this step is executed, the preset clustering model then clusters the N micro-types into K clusters according to the input topic vector of each micro-type and outputs the K clusters.

Each of the K clusters comprises at least one micro-type, and the micro-types within a cluster have similar content.

Optionally, the clustering model may be a K-means clustering algorithm or the like.
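A minimal K-means sketch in plain Python is given below to illustrate how topic vectors could be grouped into K clusters. It is not the patent's implementation: the Euclidean distance, the seeded random initialization, the iteration limit, and all names are assumptions.

```python
import math
import random
from typing import List, Optional

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int, max_iter: int = 100,
           seed: int = 0) -> List[int]:
    """Return a cluster label in 0..k-1 for each input vector."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels if labels is not None else []


# Four hypothetical micro-type topic vectors forming two obvious groups.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(topic_vectors, k=2)
```

The label numbers themselves are arbitrary; only the grouping matters, so the first two vectors share one label and the last two share the other.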

Step 304: Select, from each of the K clusters, the micro-type corresponding to that cluster, and form the selected micro-types into a micro-type set to be recommended.

There are various ways to select the micro-type corresponding to each cluster. The embodiment of the present application lists one optional way, which may be realized by the following operations 3041 through 3043:

3041: Determine a centroid vector of the centroid of the target cluster according to the topic vector of each micro-type included in the target cluster, where the target cluster is any one of the K clusters.

In implementation, an average vector may be calculated from the topic vectors of the micro-types included in the target cluster, and the average vector may be determined as the centroid vector of the centroid of the target cluster.

For example, the target cluster shown in FIG. 3-2 includes five micro-types A, B, C, D, and E, whose positions in the target cluster are shown in FIG. 3-2. An average vector is calculated from the topic vectors of the five micro-types and taken as the centroid vector, which gives the position of the centroid O.
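The average-vector computation of operation 3041 can be sketched as follows. The function name and the sample topic vectors are hypothetical.

```python
from typing import Dict, List


def centroid_vector(topic_vectors: List[List[float]]) -> List[float]:
    """Element-wise average of the topic vectors in one cluster."""
    count = len(topic_vectors)
    return [sum(column) / count for column in zip(*topic_vectors)]


# Hypothetical cluster of three micro-types with two content topics each.
cluster: Dict[str, List[float]] = {
    "A": [0.2, 0.8],
    "B": [0.4, 0.6],
    "C": [0.6, 0.4],
}
center = centroid_vector(list(cluster.values()))
```

For these sample vectors the centroid lies at approximately [0.4, 0.6], which coincides with micro-type B up to floating-point rounding.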

3042: Calculate the distance between each micro-type and the centroid according to the topic vector of the micro-type and the centroid vector of the centroid.

For example, the distance between each micro-type and the centroid can be calculated by a distance formula d(x_i, x_center), where d(x_i, x_center) is the distance between a micro-type and the centroid, x_i1, x_i2, ……, x_iM are the elements of the topic vector of the micro-type, and x_center1, x_center2, ……, x_centerM are the elements of the centroid vector of the centroid.
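The distance formula itself does not appear in this text. One reading that is consistent with the symbols defined here, and with the K-means option mentioned for the clustering model, is the Euclidean distance. This is an assumption, not the confirmed formula of the application:

```latex
d(x_i, x_{center}) = \sqrt{\sum_{j=1}^{M} \left( x_{ij} - x_{center\,j} \right)^{2}}
```

Under this reading, j runs over the M content topics, so each term compares the second probability of the micro-type for topic j with the corresponding element of the centroid vector.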

For example, referring to fig. 3-2, from the subject vector of the micro type a and the centroid vector of the centroid O, the distance between the micro type a and the centroid O is calculated to be L1; calculating the distance between the micro type B and the centroid O to be L2 according to the theme vector of the micro type B and the centroid vector of the centroid O; calculating the distance between the micro type C and the centroid O to be L3 according to the theme vector of the micro type C and the centroid vector of the centroid O; calculating the distance between the micro type D and the centroid O to be L4 according to the theme vector of the micro type D and the centroid vector of the centroid O; and calculating the distance L5 between the micro type E and the centroid O according to the theme vector of the micro type E and the centroid vector of the centroid O. Among them, referring to fig. 3-2, the distance L3 is the smallest among the distances L1, L2, L3, L4 and L5.

3043: and selecting the micro type corresponding to the target cluster from the target clusters according to the distance between each micro type and the centroid.

Optionally, the micro-type with the smallest distance to the centroid may be selected from the target cluster as the micro-type corresponding to the target cluster.

For example, referring to FIG. 3-2, the micro-type C, which corresponds to the minimum distance L3, may be selected from the micro-types A, B, C, D and E as the micro-type corresponding to the target cluster.

Selecting the micro-type with the smallest distance to the centroid is preferred because the content of the micro-type closest to the centroid is distributed relatively evenly across the content topics, so the content under that micro-type covers a richer range of types.
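Operations 3042 and 3043 can be sketched together as follows. The sketch assumes Euclidean distance and uses hypothetical names (`euclidean_distance`, `select_closest_to_centroid`) and illustrative theme vectors; it is not a definitive implementation of the embodiment.

```python
import math
from typing import Dict, List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two vectors of equal length (assumed Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_closest_to_centroid(cluster: Dict[str, Sequence[float]]) -> str:
    """Return the name of the micro-type whose theme vector is nearest the centroid."""
    vectors: List[Sequence[float]] = list(cluster.values())
    n = len(vectors)
    m = len(vectors[0])
    # Centroid vector: element-wise mean of all theme vectors in the cluster.
    centroid = [sum(v[j] for v in vectors) / n for j in range(m)]
    # On a tie, min() keeps the first micro-type encountered.
    return min(cluster, key=lambda name: euclidean_distance(cluster[name], centroid))


# Hypothetical theme vectors for one target cluster (illustrative values).
target_cluster = {
    "A": [0.9, 0.1],
    "B": [0.1, 0.9],
    "C": [0.5, 0.5],
    "D": [0.75, 0.25],
    "E": [0.25, 0.75],
}
representative = select_closest_to_centroid(target_cluster)
```

With these assumed values, the selected micro-type is C, mirroring the example in which L3 is the smallest distance.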

In the embodiments of the present application, this step may also be implemented in ways other than the one listed above. For example, another selection manner may be implemented through the following operations 3044 to 3045:

3044: Count the number of contents included in each micro-type in the target cluster, where the target cluster is any one of the K clusters.

3045: Select the micro-type corresponding to the target cluster from the target cluster according to the number of contents of each micro-type.

Optionally, the micro-type with the largest number of contents may be selected from the target cluster as the micro-type corresponding to the target cluster. Selecting the micro-type with the largest number of contents allows more content to be recommended to the user.
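Operations 3044 and 3045 amount to a count followed by an arg-max. A minimal sketch, in which the function name `select_largest` and the content identifiers are hypothetical:

```python
from typing import Dict, List


def select_largest(cluster_contents: Dict[str, List[str]]) -> str:
    """Return the micro-type in the cluster that includes the most contents."""
    if not cluster_contents:
        raise ValueError("a cluster must contain at least one micro-type")
    # On a tie, max() keeps the first micro-type encountered.
    return max(cluster_contents, key=lambda name: len(cluster_contents[name]))


# Hypothetical contents of three micro-types in one target cluster.
target_cluster_contents = {
    "A": ["v1", "v2"],
    "B": ["v3", "v4", "v5", "v6"],
    "C": ["v7", "v8", "v9"],
}
largest = select_largest(target_cluster_contents)
```

How ties are broken is not specified in the description; keeping the first micro-type encountered is an assumption of this sketch.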

Step 305: Calculate the recommendation index of each micro-type in the micro-type set according to the viewing time of each content included in that micro-type.

When recommending content to a user, a viewing history of the user may be obtained, where the viewing history includes information such as a content identifier and a viewing time of each content viewed by the user.

In this step, the contents watched by the user may be sorted according to their viewing times, and the interest coefficient of each content is set according to its position in the sorted order.

Optionally, if the contents viewed by the user are sorted from the most recent viewing time to the earliest, a content ranked nearer the front may be given a higher interest coefficient, and a content ranked nearer the back may be given a lower interest coefficient.

For example, the interest coefficient of the content ranked first may be set to 1, that of the content ranked second to 0.98, that of the content ranked third to 0.96, and so on.

After the interest coefficient of each content is set, for each micro type, the interest coefficients of the contents included in the micro type are accumulated to obtain the recommendation index of the micro type.
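The calculation in step 305 can be sketched as below. Several details are assumptions made for illustration: a fixed decrement of 0.02 per rank (matching the 1, 0.98, 0.96 example), a floor of 0 for the coefficient, integer timestamps as viewing times, and a coefficient of 0 for contents the user has not viewed. The function names and identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def interest_coefficients(
    history: List[Tuple[str, int]], step: float = 0.02
) -> Dict[str, float]:
    """Map each viewed content to a coefficient that decreases with rank.

    `history` holds (content_id, viewing_time) records; a larger viewing_time
    means a more recent view.
    """
    ordered = sorted(history, key=lambda record: record[1], reverse=True)
    coefficients: Dict[str, float] = {}
    for rank, (content_id, _viewing_time) in enumerate(ordered):
        # If a content appears more than once, keep its most recent rank.
        coefficients.setdefault(content_id, max(0.0, 1.0 - step * rank))
    return coefficients


def recommendation_indexes(
    micro_types: Dict[str, Set[str]], coefficients: Dict[str, float]
) -> Dict[str, float]:
    """Accumulate the interest coefficients of the contents in each micro-type."""
    return {
        name: sum(coefficients.get(content_id, 0.0) for content_id in contents)
        for name, contents in micro_types.items()
    }


# Hypothetical viewing history and micro-type set.
history = [("v1", 100), ("v2", 300), ("v3", 200)]
micro_type_set = {"C": {"v1", "v2"}, "F": {"v3", "v9"}}

coefficients = interest_coefficients(history)
indexes = recommendation_indexes(micro_type_set, coefficients)
```

Here the most recently viewed content v2 receives 1, v3 receives 0.98 and v1 receives 0.96, so micro-type C accumulates about 1.96 and micro-type F about 0.98.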

Step 306: Select micro-types from the micro-type set according to the recommendation index of each micro-type, and recommend the selected micro-types.

Optionally, the micro-types in the micro-type set may be sorted in descending order of recommendation index, and the top y micro-types may be selected and recommended to the user, where y is an integer greater than 1.
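The selection in step 306 is a descending sort followed by a cut at y. A short sketch with a hypothetical function name `top_y` and illustrative recommendation indexes:

```python
from typing import Dict, List


def top_y(indexes: Dict[str, float], y: int) -> List[str]:
    """Return the y micro-types with the highest recommendation indexes."""
    if y < 1:
        raise ValueError("y must be a positive integer")
    # sorted() is stable, so micro-types with equal indexes keep their input order.
    ranked = sorted(indexes, key=lambda name: indexes[name], reverse=True)
    return ranked[:y]


# Hypothetical recommendation indexes for a micro-type set.
recommended = top_y({"C": 1.96, "F": 0.98, "G": 1.5}, y=2)
```

If the set holds fewer than y micro-types, this sketch simply returns all of them; the description does not say how that case is handled.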

In the embodiments of the present application, a first probability that each content belongs to each content theme is obtained through a preset theme model according to the content description information of each content in a micro-type; a second probability that the micro-type belongs to each content topic is then obtained from the first probabilities of the contents in the micro-type, which in turn gives the topic vector of the micro-type. The N micro-types can therefore be clustered through a preset clustering model according to the theme vector of each of the N micro-types to obtain K clusters. Each cluster contains micro-types with similar content, and the content similarity between any two micro-types belonging to two different clusters is low. One micro-type is then selected from each of the K clusters, so that a large number of micro-types with similar content are eliminated and the selected micro-types do not contain a large amount of repeated content. In addition, because micro-types with similar content are gathered into one cluster by means of their theme vectors, selecting one micro-type from the cluster and removing the others is enough to remove the micro-types with similar content, so the amount of calculation required for deduplication is low.
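The reduction from N micro-types to K representatives can be sketched end to end. The description only refers to a preset clustering model; this sketch swaps in a plain k-means with Euclidean distance and a deterministic initialization (the first k vectors) purely as an assumption. All names and theme vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(vectors: List[Sequence[float]]) -> List[float]:
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def kmeans(vectors: Dict[str, Sequence[float]], k: int, iterations: int = 20) -> List[List[str]]:
    """Cluster micro-type names by theme vector (assumed k-means, fixed iterations)."""
    names = list(vectors)
    # Assumed initialization: the first k theme vectors serve as starting centroids.
    centroids = [list(vectors[name]) for name in names[:k]]
    clusters: List[List[str]] = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for name in names:
            nearest = min(range(k), key=lambda c: _distance(vectors[name], centroids[c]))
            clusters[nearest].append(name)
        # Recompute each centroid; an empty cluster keeps its previous centroid.
        centroids = [
            _mean([vectors[n] for n in members]) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return [members for members in clusters if members]


def deduplicate(vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Keep one micro-type per cluster: the one closest to the cluster centroid."""
    selected = []
    for members in kmeans(vectors, k):
        centroid = _mean([vectors[n] for n in members])
        selected.append(min(members, key=lambda n: _distance(vectors[n], centroid)))
    return selected


# Hypothetical theme vectors for N = 6 micro-types over M = 2 content topics.
theme_vectors = {
    "A": [0.9, 0.1], "X": [0.1, 0.9], "B": [0.8, 0.2],
    "C": [0.85, 0.15], "Y": [0.3, 0.7], "Z": [0.15, 0.85],
}
clusters = kmeans(theme_vectors, k=2)
micro_type_set = deduplicate(theme_vectors, k=2)
```

With these assumed vectors, six micro-types fall into two clusters and only one representative of each is kept, so the distance computations are limited to micro-type-to-centroid pairs rather than all pairs of contents.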

The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

Referring to fig. 4, the present application provides an apparatus 400 for content deduplication, the apparatus 400 comprising:

a calculating module 401, configured to calculate, according to content description information of each content included in the target micro-type, a first probability that each content belongs to each content topic in M content topics, where M is an integer greater than 1;

an obtaining module 402, configured to obtain a topic vector of the target micro-type according to a first probability that each content belongs to each content topic, where the target micro-type is one of N micro-types, and N is an integer greater than 1;

a clustering module 403, configured to cluster the N micro-types according to a theme vector of each micro-type of the N micro-types to obtain K clusters, where K is an integer greater than 1;

a selecting module 404, configured to select a micro-type corresponding to each cluster from each cluster of the K clusters, and form a micro-type set from the selected micro-types.

Optionally, the calculating module 401 includes:

Optionally, the obtaining module 402 includes:

Optionally, the selecting module 404 includes:

a calculating unit, configured to calculate a distance between each micro type and the centroid according to the theme vector of each micro type and the centroid vector of the centroid;

Optionally, the selecting module 404 includes:

Optionally, the apparatus 400 further includes:

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

FIG. 5 shows a block diagram of a terminal 500 according to an exemplary embodiment of the present invention. The terminal 500 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.

In general, the terminal 500 includes: a processor 501 and a memory 502.

The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.

Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement a method of content deduplication as provided by method embodiments herein.

In some embodiments, the terminal 500 may further optionally include a peripheral interface 503 and at least one peripheral. The processor 501, the memory 502 and the peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, a signal line, or a circuit board. Specifically, the peripheral includes at least one of a radio frequency circuit 504, a display screen 505, a camera assembly 506, an audio circuit 507, a positioning component 508, and a power supply 509.

The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 504 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 501 as a control signal for processing. In this case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, disposed on the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 500. Furthermore, the display screen 505 may be arranged in a non-rectangular irregular shape, that is, as a shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

Audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 507 may also include a headphone jack.

The positioning component 508 is used to determine the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, or the Galileo system of the European Union.

The power supply 509 is used to power the various components in the terminal 500. The power supply 509 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast charging technology.

In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.

The acceleration sensor 511 may detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with respect to the terminal 500. For example, the acceleration sensor 511 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 501 may control the touch display screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used to collect motion data of a game or of the user.

The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 513 may be disposed on a side bezel of the terminal 500 and/or an underlying layer of the touch display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor Logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor Logo.

The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 505 is turned down. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.

The proximity sensor 516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 500. The proximity sensor 516 is used to collect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually decreases, the processor 501 controls the touch display screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually increases, the processor 501 controls the touch display screen 505 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of terminal 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for content deduplication, the method comprising:

respectively selecting a micro type corresponding to each cluster from each cluster in the K clusters, and forming the selected micro types into a micro type set to be recommended;

wherein the calculating a first probability that each content belongs to each of the M content topics according to the content description information of each content included in the target micro-type includes:

2. The method of claim 1, wherein said obtaining a topic vector for said target micro-type based on said first probability that said respective content belongs to said each content topic comprises:

3. The method of claim 1, wherein said selecting, from each of said K clusters, a micro-type to which said each cluster corresponds, respectively, comprises:

4. The method of claim 1, wherein said selecting, from each of said K clusters, a micro-type to which said each cluster corresponds, respectively, comprises:

5. The method of claim 1, wherein after said selecting the micro-type corresponding to each of the K clusters from each of the K clusters, respectively, further comprising:

6. An apparatus for content deduplication, the apparatus comprising:

the selecting module is used for respectively selecting the micro-type corresponding to each cluster from each cluster in the K clusters and forming the selected micro-types into a micro-type set to be recommended;

wherein the calculation module comprises:

7. The apparatus of claim 6, wherein the acquisition module comprises:

8. The apparatus of claim 6, wherein the selection module comprises:

9. The apparatus of claim 6, wherein the selection module comprises:

10. The apparatus of claim 6, wherein the apparatus further comprises:

11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 5.