CN113378020A

CN113378020A - Acquisition method, device and computer readable storage medium for similar film watching users

Info

Publication number: CN113378020A
Application number: CN202110638357.2A
Authority: CN
Inventors: 张潇
Original assignee: Shenzhen TCL New Technology Co Ltd
Current assignee: Shenzhen TCL New Technology Co Ltd
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2021-09-10

Abstract

A method for acquiring similar viewing users comprises the following steps: acquiring viewing statistical data of each of a plurality of viewing users, wherein the viewing statistical data comprise label proportion data, namely the proportion of each film and television type watched and the proportion of each channel watched by each viewing user within a preset time period; clustering the plurality of viewing users into n user groups according to the label proportion data, wherein n is a natural number greater than 1; setting, for each of the n user groups, a corresponding similarity threshold according to the number of viewing users contained in that user group; and determining viewing users with similar characteristics within each user group based on the similarity threshold set for that user group. This technical scheme reduces the amount of calculation: when similarities are sorted after the similarity of viewing users has been calculated, low-ranking similarities can be filtered out beforehand, which reduces the sorting workload and prevents memory overflow.

Description

Acquisition method, device and computer readable storage medium for similar film watching users

Technical Field

The invention relates to the field of artificial intelligence, in particular to a method and equipment for acquiring similar film watching users and a computer readable storage medium.

Background

With the advancement of big data technology, it has become common in the video industry to recommend video content to users based on user profiles. In essence, a user profile attaches labels such as gender, age, region, and personality to each user, so that experience-improvement measures can be targeted at the user's characteristic attributes. The similarity between users is calculated from the user profiles, and corresponding video content is then recommended to the users. At present, user similarity calculation is optimized mainly at the algorithm level, that is, by defining which labels to attach to each user and continually searching for the most effective and efficient algorithm. However, as the user population grows, the data volume grows multiplicatively and the demands on the algorithm rise accordingly. Once the data volume exceeds tens of millions of records, conventional similarity calculation methods are no longer practical; the fundamental reason is that the sharp increase in data volume causes a sharp increase in calculation time, so that the algorithm cannot converge normally.

Disclosure of Invention

The application provides a method and equipment for obtaining similar viewing users and a computer readable storage medium, so that the calculation time is reduced, and the calculation accuracy of the similarity is improved.

In one aspect, the present application provides a method for acquiring similar viewing users, including:

acquiring viewing statistical data of each of a plurality of viewing users, wherein the viewing statistical data comprise the channels watched by each viewing user within a preset time period, the film types watched, the number of times watched, and label proportion data, and the label proportion data comprise the proportion of each film type watched and the proportion of each channel watched by each viewing user within the preset time period;

clustering the plurality of film watching users into n user groups according to the label proportion data, wherein n is a natural number greater than 1;

setting similarity threshold values corresponding to the user groups for the n user groups according to the number of film watching users contained in the user groups;

and determining film watching users with similar characteristics in each user group based on the similarity threshold set by each user group.

In another aspect, the present application provides an apparatus for acquiring similar viewing users, comprising:

an acquisition module, configured to acquire viewing statistical data of each of a plurality of viewing users, wherein the viewing statistical data comprise the channels watched by each viewing user within a preset time period, the film types watched, the number of times watched, and label proportion data, and the label proportion data comprise the proportion of each film type watched and the proportion of each channel watched by each viewing user within the preset time period;

the clustering module is used for clustering the plurality of film watching users into n user groups according to the label proportion data, wherein n is a natural number greater than 1;

the setting module is used for setting a similarity threshold value corresponding to each user group for the n user groups according to the number of film watching users contained in each user group of the n user groups;

and the determining module is used for determining the film watching users with similar characteristics in each user group based on the similarity threshold set by each user group.

In a third aspect, the present application provides an apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the technical solution of the acquisition method for the similar viewing user as described above when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the technical solution of the acquisition method for the similar viewing user as described above.

According to the technical scheme provided by the application: first, because the plurality of viewing users are clustered into n user groups in advance and the viewing users with similar characteristics are then determined within each user group, the amount of calculation is reduced compared with the prior art, in which all viewing users are treated as a single group; second, because a corresponding similarity threshold is set for each user group according to the number of viewing users it contains, low-ranking similarities can be filtered out before the similarities are sorted, which reduces the sorting workload and prevents memory overflow; third, because the similarity is calculated within the clustered user groups, the comparison range is narrowed, so that the similarity calculation can be more accurate.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for obtaining similar viewing users according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of an acquisition apparatus for similar viewing users according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In this specification, adjectives such as first and second may only be used to distinguish one element or action from another, without necessarily requiring or implying any actual such relationship or order. References to an element or component or step (etc.) should not be construed as limited to only one of the element, component, or step, but rather to one or more of the element, component, or step, etc., where the context permits.

In the present specification, the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

The application provides a method for acquiring similar viewing users which, as shown in FIG. 1, mainly includes steps S101 to S104, detailed below:

Step S101: acquire viewing statistical data of each of a plurality of viewing users, wherein the viewing statistical data comprise the channels watched by each viewing user within a preset time period, the types of videos watched, the number of times watched, and label proportion data, and the label proportion data comprise the proportion of each video type watched and the proportion of each channel watched by each viewing user within the preset time period.

In the embodiments of the present application, a viewing user refers to a user who watches video content such as TV series, movies, variety shows, and documentaries, and does not refer narrowly to a user who watches movies. Film and television information and the channel to which it belongs, such as a drama channel, a movie channel, or a kids channel, can be extracted from the media asset library associated with the viewing records; the proportion of each film and television type watched and the proportion of each channel watched by each viewing user within the preset time period are then counted, which yields the viewing statistical data of each of the plurality of viewing users. The volume of viewing records is large: for example, three million viewing records may be generated each day, and one viewing record may correspond to several film and television types, so that after the data are joined and an explode operation is applied, the data volume can exceed ten million rows. A user-defined function (UDF) in Spark can therefore be used to complete the calculation: the data are read from a Hive table, transformed by the UDF, and stored back into a Hive table, forming a closed-loop data processing flow.

It should be noted that, in the embodiments of the present application, the proportion of each film type watched by each viewing user within the preset time period (and/or the proportion of each channel watched), rather than the absolute number of films of each type watched, is used as the label data, because the proportion reflects the preference of the viewing user. For example, suppose viewing user A watched 100 romance, comedy, and horror films in total during the past month (60 romance, 30 comedy, and 10 horror), while viewing user B watched 300 in total (100 romance, 20 comedy, and 180 horror). Although viewing user B watched more romance films than viewing user A in absolute terms (100 versus 60), and viewing user A watched more comedies than viewing user B (30 versus 20), romance, comedy, and horror account for 60%, 30%, and 10% of viewing user A's viewing, but only 33.33%, 6.67%, and 60% of viewing user B's. It is therefore reasonable to conclude that viewing user A prefers romance films and viewing user B prefers horror films.
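The proportion calculation in the example above can be sketched as follows. This is a minimal illustration only: the function name `label_proportions` and the dictionary layout are assumptions introduced here, not part of the application.

```python
def label_proportions(watch_counts):
    """Convert absolute watch counts per label into proportions that sum to 1.

    `watch_counts` maps a label (film type or channel) to a view count.
    The name and data layout are hypothetical, chosen for illustration.
    """
    total = sum(watch_counts.values())
    if total == 0:
        # A user with no views has no preference signal
        return {label: 0.0 for label in watch_counts}
    return {label: count / total for label, count in watch_counts.items()}


# Counts taken from the example of viewing users A and B
user_a = {"romance": 60, "comedy": 30, "horror": 10}
user_b = {"romance": 100, "comedy": 20, "horror": 180}

prop_a = label_proportions(user_a)
prop_b = label_proportions(user_b)

# The preferred label is the one with the largest proportion
favorite_a = max(prop_a, key=prop_a.get)
favorite_b = max(prop_b, key=prop_b.get)
```

Although viewing user B has the larger romance count, the proportions identify romance as viewing user A's preference and horror as viewing user B's, which is why proportions rather than counts serve as the label data.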

Step S102: cluster the plurality of viewing users into n user groups according to the label proportion data, wherein n is a natural number greater than 1.

In this embodiment of the present application, a MapReduce framework may be adopted to cluster the plurality of viewing users into n user groups according to the label proportion data. The specific clustering algorithm may be any clustering algorithm in the prior art, for example hierarchical clustering, the k-means algorithm, the EM algorithm, the DBSCAN algorithm, the OPTICS algorithm, the Mean Shift algorithm, or spectral clustering, or a combination thereof, which is not limited in this application.
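As one possible illustration of this step, the sketch below runs plain k-means on label proportion vectors in a single process. It is not the MapReduce implementation the embodiment refers to; the function names, the choice of the first k points as initial centroids, and the sample vectors (romance, comedy, horror proportions) are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two proportion vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means over proportion vectors.

    Assumption for determinism: the first k points are the initial centroids.
    Returns (assignment, centroids), where assignment[i] is the group of points[i].
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each user joins the group with the nearest centroid
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical (romance, comedy, horror) proportions for four viewing users
users = [
    (0.60, 0.30, 0.10),
    (0.33, 0.07, 0.60),
    (0.70, 0.20, 0.10),
    (0.20, 0.10, 0.70),
]
assignment, centroids = kmeans(users, k=2)
```

With these vectors the two romance-leaning users fall into one group and the two horror-leaning users into the other. Any of the other listed algorithms could be substituted; k-means is used here only because it is short to write out.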

It should be noted that, because the plurality of viewing users are clustered into n user groups, the similarity of viewing users is subsequently calculated within each user group rather than over the entire set of viewing users as in the prior art, so the amount of calculation is greatly reduced. For example, suppose there are 30,000 viewing users. If, as in the prior art, the 30,000 viewing users are treated as a single user group, the relationship matrix used to calculate their similarity is of order 30000 × 30000, which means 900 million entries; traversing 900 million entries to calculate the similarity is extremely complex and computationally expensive. If, however, the 30,000 viewing users are first clustered, for instance into 5 user groups of 6,000 viewing users each, the relationship matrix of each user group is of order 6000 × 6000, so only 36 million entries need to be traversed when calculating the similarity within each user group, and the amount of calculation and the complexity are greatly reduced compared with the prior art. As for the clustering itself, because it is performed in the MapReduce framework, which has strong computing power, it can cope even when the number of viewing users reaches the tens of millions.
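The arithmetic in this example can be checked with a few lines of code. The helper name `pairwise_entries` is hypothetical, and the equal group sizes follow the assumption made in the text; real clusters would generally be uneven.

```python
def pairwise_entries(group_sizes):
    """Total number of entries across one size x size relationship matrix per group."""
    return sum(size * size for size in group_sizes)


all_users = 30_000

unclustered = pairwise_entries([all_users])       # one 30000 x 30000 matrix
per_group = pairwise_entries([6_000])             # one 6000 x 6000 matrix
clustered_total = pairwise_entries([6_000] * 5)   # five such matrices

reduction_factor = unclustered / clustered_total
```

Each group requires 36 million entries, and even summed over all five groups the total is 180 million, one fifth of the 900 million entries needed without clustering.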

When the plurality of viewing users are clustered, the trained clustering model is stored in the file system of the Hadoop framework, and the clustering result is stored in a Hive table.

Step S103: set, for each of the n user groups, a corresponding similarity threshold according to the number of viewing users contained in that user group.

Generally, provided the clustering algorithm applied earlier is sound, if user group 1 contains more viewing users than user group 2 and the same similarity threshold is set for both groups as in the prior art, user group 1 will yield more viewing users with similar characteristics than user group 2. This means that when the similarity of viewing users in each user group is calculated and only the top-ranked viewing users with similar characteristics are needed, more work has to be done for user group 1. Therefore, in the embodiments of the present application, a similarity threshold corresponding to each user group may be set for the n user groups according to the number of viewing users contained in each user group. Specifically, a reference similarity threshold Th_b may be set; the number of viewing users contained in a first user group and the number contained in a second user group of the n user groups are counted; and if the first user group contains more viewing users than the second user group, a first similarity threshold Th1 is set for the first user group and a second similarity threshold Th2 is set for the second user group, wherein the reference similarity threshold Th_b, the first similarity threshold Th1, and the second similarity threshold Th2 satisfy Th2 < Th_b < Th1.

For example, assume that, as in the prior art, a similarity threshold of 0.8 is set for both user group 1 and user group 2, and that under this threshold user group 1 has 7 viewing users with similar characteristics and user group 2 has 3. Assume further that a top-3 similarity ranking is required for each user group. Then the 7 viewing users with similar characteristics in user group 1 must all be traversed and sorted, even though only the 3 with the highest similarity are ultimately kept. In the embodiments of the present application, assume instead that a reference similarity threshold Th_b is set first. When user group 1 is found to contain more viewing users than user group 2, the similarity threshold of user group 1 is raised relative to Th_b, for example to 0.88, and the similarity threshold of user group 2 is lowered relative to Th_b, for example to 0.75. Under these settings, the number of viewing users with similar characteristics in user group 1 falls to 4, while the number in user group 2 may rise to 4. Compared with traversing and sorting 7 viewing users under the threshold of 0.8, the traversal and sorting workload for user group 1 is significantly reduced; and although the workload for user group 2 increases, the total workload for user group 1 and user group 2 is reduced.

The numbers assumed above for user group 1 and user group 2 are small. In practice, when user group 1 or user group 2 contains thousands or tens of thousands of viewing users, setting a reference similarity threshold Th_b and reasonably raising or lowering the similarity threshold of each user group relative to it significantly reduces the amount of calculation, and unnecessary similarity results need not be stored or cached, thereby preventing memory overflow.
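One way the threshold setting of step S103 might look in code is sketched below. The application only requires that a larger group receive a higher threshold than a smaller one (Th2 < Th_b < Th1); the specific rule used here (compare each group with the mean group size), the function name `group_thresholds`, the default Th_b of 0.8, the adjustment amounts, and the group sizes are all assumptions chosen to reproduce the 0.88 and 0.75 figures from the example.

```python
def group_thresholds(group_sizes, th_b=0.8, raise_by=0.08, lower_by=0.05):
    """Return one similarity threshold per user group.

    Hypothetical rule: groups larger than the mean size get th_b + raise_by,
    smaller groups get th_b - lower_by, and groups of exactly mean size keep th_b.
    """
    mean_size = sum(group_sizes.values()) / len(group_sizes)
    thresholds = {}
    for group, size in group_sizes.items():
        if size > mean_size:
            th = th_b + raise_by
        elif size < mean_size:
            th = th_b - lower_by
        else:
            th = th_b
        # Clamp to [0, 1] and round away floating-point noise
        thresholds[group] = round(min(max(th, 0.0), 1.0), 4)
    return thresholds


# Hypothetical group sizes: user group 1 is larger than user group 2
sizes = {"group_1": 6000, "group_2": 2000}
thresholds = group_thresholds(sizes)
```

A graded rule, for instance one that scales the adjustment with the ratio of group size to mean size, would satisfy the same ordering constraint; the two-level rule is used only to keep the sketch short.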

Step S104: determine viewing users with similar characteristics within each user group based on the similarity threshold set for that user group.

Specifically, as an embodiment of the present application, determining the viewing users with similar characteristics in each user group based on the similarity threshold set for each user group may be carried out as follows: calculate the similarity of the viewing users within each user group according to the label proportion data; keep the similarities that are greater than the similarity threshold set for that user group and sort them in descending order; and determine the viewing users ranked in the top m of the sorted result as the viewing users with similar characteristics, where m is a preset value. It should be noted that the similarity of viewing users is calculated within the user group to which they belong, not across user groups. As for the specific similarity measure, cosine similarity may be used, and other similarity measures may also be used, which is not limited in this application.
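The filter-then-sort procedure of step S104 can be sketched with cosine similarity as follows. The sketch assumes that the ranking is produced per target user within one group; the function names, the user identifiers, and the proportion vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two label proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_users(target_id, group, threshold, m):
    """Return up to m users of the same group most similar to target_id.

    `group` maps user id -> proportion vector for a single user group.
    Similarities at or below `threshold` are dropped before sorting,
    so the sort only handles the candidates worth ranking.
    """
    target = group[target_id]
    scored = []
    for other_id, vec in group.items():
        if other_id == target_id:
            continue
        sim = cosine_similarity(target, vec)
        if sim > threshold:
            scored.append((sim, other_id))
    scored.sort(reverse=True)  # descending by similarity
    return [other_id for _, other_id in scored[:m]]


# Hypothetical (romance, comedy, horror) proportions within one user group
group = {
    "A": (0.60, 0.30, 0.10),
    "C": (0.70, 0.20, 0.10),
    "D": (0.55, 0.35, 0.10),
    "E": (0.10, 0.10, 0.80),
}
top_for_a = similar_users("A", group, threshold=0.88, m=2)
```

User E never reaches the sorting stage because its similarity to A falls below the group threshold, which is the filtering effect the embodiment relies on to limit sorting work and memory use.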

For a newly added viewing user, a cold-start strategy may be adopted in the embodiments of the present application: the trained model is used to re-cluster the new user together with the plurality of viewing users, where the trained model is the model used when the plurality of viewing users were clustered according to the label proportion data in the foregoing embodiment.
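A minimal sketch of how a trained model might place a new viewing user is given below. It assumes that the trained model is centroid-based (for example k-means) and that placing the new user means assigning them to the nearest stored centroid; the application does not specify this, and the function name and centroid values are hypothetical.

```python
def assign_to_group(proportions, centroids):
    """Return the index of the stored centroid closest to the new user's vector."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return min(
        range(len(centroids)),
        key=lambda c: squared_distance(proportions, centroids[c]),
    )


# Hypothetical centroids loaded from a previously trained clustering model
centroids = [(0.65, 0.25, 0.10), (0.265, 0.085, 0.65)]

# Label proportions (romance, comedy, horror) of a newly added viewing user
new_user = (0.15, 0.05, 0.80)
group_index = assign_to_group(new_user, centroids)
```

Reusing the stored model in this way avoids retraining for every new user; a full re-clustering over all users, as the text describes, could still be run periodically.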

As can be seen from the method for acquiring similar viewing users illustrated in FIG. 1: first, because the plurality of viewing users are clustered into n user groups in advance and the viewing users with similar characteristics are then determined within each user group, the amount of calculation is reduced compared with the prior art, in which all viewing users are treated as a single group; second, because a corresponding similarity threshold is set for each user group according to the number of viewing users it contains, low-ranking similarities can be filtered out before the similarities are sorted, which reduces the sorting workload and prevents memory overflow; third, because the similarity is calculated within the clustered user groups, the comparison range is narrowed, so that the similarity calculation can be more accurate.

Referring to FIG. 2, an apparatus for acquiring similar viewing users provided in this embodiment of the present application may include an acquisition module 201, a clustering module 202, a setting module 203, and a determining module 204, detailed as follows:

the acquisition module 201 is configured to acquire viewing statistical data of each viewing user of a plurality of viewing users, where the viewing statistical data includes a channel watched by each viewing user in a preset time period, a type and a number of watched movies, and tag proportion data, and the tag proportion data includes a proportion of a type of movies watched by each viewing user in the preset time period and a proportion of watched channels;

the clustering module 202 is configured to cluster the plurality of film watching users into n user groups according to the tag proportion data, where n is a natural number greater than 1;

the setting module 203 is used for setting a similarity threshold corresponding to each user group for the n user groups according to the number of film watching users contained in each user group of the n user groups;

and the determining module 204 is configured to determine viewing users with similar features in each user group based on the similarity threshold set by each user group.

Optionally, the setting module 203 illustrated in fig. 2 may include a reference value setting unit, a statistic unit, and a threshold setting unit, where:

a reference value setting unit, configured to set a reference similarity threshold Th_b;

The statistical unit is used for counting the number of film watching users contained in a first user group and the number of film watching users contained in a second user group of the n user groups;

a threshold setting unit, configured to set a first similarity threshold Th1 for the first user group and a second similarity threshold Th2 for the second user group if the number of viewing users contained in the first user group is greater than the number of viewing users contained in the second user group, wherein the reference similarity threshold Th_b, the first similarity threshold Th1, and the second similarity threshold Th2 satisfy Th2 < Th_b < Th1.

Optionally, the determining module 204 illustrated in fig. 2 may include a similarity calculating unit, a sorting unit, and a similarity user determining unit, where:

the similarity calculation unit is used for calculating the similarity of the viewing users in each user group according to the label proportion data;

the sorting unit is used for sorting the similarity according to a descending order if the similarity is greater than a similarity threshold set by each user group;

and the similarity user determining unit is used for determining the film watching users with the top m of the sequencing result as film watching users with similar characteristics, wherein m is a preset value.

Optionally, the apparatus illustrated in fig. 2 may further include a re-clustering module, configured to re-cluster, if a new film viewing user is added, the new user and the multiple film viewing users by using a trained model, where the trained model is a model used when the multiple film viewing users are clustered according to the tag proportion data.

As can be seen from the apparatus for acquiring similar viewing users illustrated in FIG. 2: first, because the plurality of viewing users are clustered into n user groups in advance and the viewing users with similar characteristics are then determined within each user group, the amount of calculation is reduced compared with the prior art, in which all viewing users are treated as a single group; second, because a corresponding similarity threshold is set for each user group according to the number of viewing users it contains, low-ranking similarities can be filtered out before the similarities are sorted, which reduces the sorting workload and prevents memory overflow; third, because the similarity is calculated within the clustered user groups, the comparison range is narrowed, so that the similarity calculation can be more accurate.

FIG. 3 is a schematic structural diagram of an apparatus provided in an embodiment of the present application. As shown in FIG. 3, the apparatus 3 of this embodiment mainly includes: a processor 30, a memory 31, and a computer program 32 stored in the memory 31 and executable on the processor 30, such as a program implementing the acquisition method for similar viewing users. The processor 30, when executing the computer program 32, implements the steps in the above embodiment of the acquisition method for similar viewing users, such as steps S101 to S104 shown in FIG. 1. Alternatively, the processor 30, when executing the computer program 32, implements the functions of the modules/units in the above apparatus embodiments, such as the functions of the acquiring module 201, the clustering module 202, the setting module 203, and the determining module 204 shown in FIG. 2.

Illustratively, the computer program 32 of the acquisition method for similar viewing users mainly includes: acquiring the film watching statistical data of each film watching user in a plurality of film watching users, wherein the film watching statistical data comprise channels watched by each film watching user in a preset time period, watched film types, watched times and label proportion data, and the label proportion data comprise proportion of the film types watched by each film watching user in the preset time period and proportion of the watched channels; clustering a plurality of film watching users into n user groups according to the label proportion data, wherein n is a natural number greater than 1; setting a similarity threshold corresponding to each user group for the n user groups according to the number of viewing users contained in each user group of the n user groups; and determining film watching users with similar characteristics in each user group based on the similarity threshold set by each user group.

The computer program 32 may be partitioned into one or more modules/units, which are stored in the memory 31 and executed by the processor 30 to accomplish the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the device 3. For example, the computer program 32 may be divided into functions of the acquisition module 201, the clustering module 202, the setting module 203, and the determination module 204 (modules in the virtual device), and the specific functions of each module are as follows: the acquisition module 201 is configured to acquire viewing statistical data of each viewing user of a plurality of viewing users, where the viewing statistical data includes a channel watched by each viewing user in a preset time period, a type and a number of watched movies, and tag proportion data, and the tag proportion data includes a proportion of a type of movies watched by each viewing user in the preset time period and a proportion of watched channels; the clustering module 202 is configured to cluster the plurality of film watching users into n user groups according to the tag proportion data, where n is a natural number greater than 1; the setting module 203 is used for setting a similarity threshold corresponding to each user group for the n user groups according to the number of film watching users contained in each user group of the n user groups; and the determining module 204 is configured to determine viewing users with similar features in each user group based on the similarity threshold set by each user group.

The device 3 may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of a device 3 and does not constitute a limitation of device 3 and may include more or fewer components than shown, or some components in combination, or different components, e.g., a computing device may also include input-output devices, network access devices, buses, etc.

The processor 30 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 31 may be an internal storage unit of the device 3, such as a hard disk or a memory of the device 3. The memory 31 may also be an external storage device of the device 3, such as a plug-in hard disk provided on the device 3, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 31 may also include both an internal storage unit of the device 3 and an external storage device. The memory 31 is used for storing computer programs and other programs and data required by the device. The memory 31 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as required to different functional units and modules, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the above-mentioned apparatus may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other ways. For example, the above-described apparatus/device embodiments are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-transitory computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above may also be implemented by a computer program instructing related hardware. The computer program of the method for acquiring similar viewing users may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of the method embodiments described above, namely: acquiring viewing statistical data of each of a plurality of viewing users, wherein the viewing statistical data includes the channels watched by each viewing user in a preset time period, the type and number of videos watched, and label proportion data, and the label proportion data includes the proportion of video types watched by each viewing user in the preset time period and the proportion of watched channels; clustering the plurality of viewing users into n user groups according to the label proportion data, wherein n is a natural number greater than 1; setting, for the n user groups, a similarity threshold corresponding to each user group according to the number of viewing users contained in each of the n user groups; and determining viewing users with similar characteristics in each user group based on the similarity threshold set for that user group. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The non-transitory computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the non-transitory computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-transitory computer readable media do not include electrical carrier signals and telecommunications signals.

The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

The above-mentioned embodiments describe the objects, technical solutions and advantages of the present application in further detail. It should be understood that they are merely exemplary embodiments of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application should be included in the scope of the present application.
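As a minimal sketch of the threshold-setting and filtering steps recited above, the following Python fragment assumes that clustering has already produced the user groups. The function names, the cosine similarity measure, the rule of comparing each group's size with the mean group size, the values th_base=0.8 and delta=0.1, and the per-user reading of "top m" are all illustrative assumptions and are not details taken from the present application.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length label proportion vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_thresholds(groups, th_base=0.8, delta=0.1):
    """Assign one similarity threshold per user group.

    Assumed rule: a group larger than the mean group size gets
    th_base + delta, a smaller one gets th_base - delta, so that a larger
    group filters out more candidate pairs before sorting.
    """
    mean_size = sum(len(members) for members in groups.values()) / len(groups)
    thresholds = {}
    for gid, members in groups.items():
        if len(members) > mean_size:
            thresholds[gid] = th_base + delta
        elif len(members) < mean_size:
            thresholds[gid] = th_base - delta
        else:
            thresholds[gid] = th_base
    return thresholds


def similar_users(groups, vectors, thresholds, m):
    """Return, for each user, up to m most similar users from the same group.

    Only pairs whose similarity exceeds the group's threshold are kept,
    so the later sort runs over a reduced candidate list.
    """
    result = {user: [] for members in groups.values() for user in members}
    for gid, members in groups.items():
        for a, b in combinations(members, 2):
            sim = cosine(vectors[a], vectors[b])
            if sim > thresholds[gid]:
                result[a].append((b, sim))
                result[b].append((a, sim))
    for user, neighbours in result.items():
        # Sort surviving similarities from large to small, keep the top m.
        neighbours.sort(key=lambda pair: pair[1], reverse=True)
        result[user] = neighbours[:m]
    return result
```

In this sketch, `groups` maps a group identifier to a list of user identifiers and `vectors` maps each user identifier to its label proportion vector. With two groups of unequal size, `group_thresholds` gives the larger group a threshold above `th_base` and the smaller group a threshold below it.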

Claims

1. A method for acquiring similar viewing users, the method comprising:

2. The method for acquiring similar film viewing users as claimed in claim 1, wherein said setting a similarity threshold corresponding to each user group for said n user groups according to the number of film viewing users included in each user group of said n user groups comprises:

setting a reference similarity threshold Th_b;

counting the number of film watching users contained in a first user group and the number of film watching users contained in a second user group of the n user groups;

if the number of film watching users contained in the first user group is larger than that contained in the second user group, setting a first similarity threshold Th₁ for the first user group and setting a second similarity threshold Th₂ for the second user group, wherein the reference similarity threshold Th_b, the first similarity threshold Th₁ and the second similarity threshold Th₂ satisfy Th₂ < Th_b < Th₁.

3. The method for acquiring similar viewing users as claimed in claim 1, wherein said determining viewing users with similar characteristics in each user group based on the similarity threshold set by each user group comprises:

calculating the similarity of the viewing users in each user group according to the label proportion data;

if the similarity is larger than the similarity threshold set by each user group, sorting the similarities from large to small;

and determining the film watching users with the top m of the sequencing result as film watching users with similar characteristics, wherein m is a preset value.

4. The method for acquiring similar viewing users as claimed in any one of claims 1 to 3, wherein the method further comprises:

and if the film watching users are newly added, re-clustering the newly added users and the plurality of film watching users by adopting a trained model, wherein the trained model is a model adopted when the plurality of film watching users are clustered according to the label proportion data.

5. An apparatus for obtaining a similar viewing user, the apparatus comprising:

6. The apparatus for obtaining a similar viewing user as in claim 5, wherein said setting module comprises:

a statistical unit, configured to count the number of film watching users contained in a first user group and the number of film watching users contained in a second user group of the n user groups;

a threshold setting unit, configured to set a first similarity threshold Th₁ for the first user group and a second similarity threshold Th₂ for the second user group if the number of film viewing users included in the first user group is greater than the number of film viewing users included in the second user group, wherein the reference similarity threshold Th_b, the first similarity threshold Th₁ and the second similarity threshold Th₂ satisfy Th₂ < Th_b < Th₁.

7. The apparatus for acquiring similar viewing users as claimed in claim 5, wherein said determination module comprises:

8. The apparatus for acquiring similar viewing users as in any one of claims 5 to 7, wherein said apparatus further comprises:

and the re-clustering module is used for re-clustering the newly added users and the plurality of film watching users by adopting a trained model if the film watching users are newly added, wherein the trained model is a model adopted when the plurality of film watching users are clustered according to the label proportion data.

9. An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.