WO2022156720A1

WO2022156720A1 - Method and apparatus for group control account excavation, device, and storage medium

Info

Publication number: WO2022156720A1
Application number: PCT/CN2022/072806
Authority: WO
Inventors: 曹轲; 钟清华
Original assignee: 百果园技术(新加坡)有限公司; 曹轲
Priority date: 2021-01-25
Filing date: 2022-01-19
Publication date: 2022-07-28
Also published as: CN112819056A

Abstract

Embodiments of the present application provide a method and apparatus for group control account excavation, a device, and a storage medium, and relate to the technical field of livestreaming. The technical solution provided in the embodiments of the present application comprises: acquiring first viewing data of a user group during a set period, each user in the user group corresponding to one piece of first viewing data, and each piece of the first viewing data comprising identity data of a live streamer viewed by the corresponding user during the set period; searching for similar viewing users in the user group on the basis of the first viewing data; excavating a similar viewing user group in the user group on the basis of the similar viewing users, and determining, on the basis of the similar viewing user group, a target user group belonging to group control accounts. The employment of the method solves the technical problem of a group control account excavation process having poor security and being easily cracked.

Description

Group control account mining method, device, equipment and storage medium

This application claims the priority of the Chinese patent application with application number 202110098987.5 filed with the China Patent Office on January 25, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The embodiments of the present application relate to the technical field of live webcasting, and in particular to a group control account mining method, apparatus, device, and storage medium.

Background

"Popularity" is a term specific to the live broadcast industry that comprehensively reflects how popular an anchor is and how good the live content is. It can be calculated from dimensions such as the number of viewers, viewing duration, broadcast duration, follows, interactions, bullet comments, and gift rewards. Among these, the number of viewers is an important dimension: it can determine an anchor's ranking when anchors are recommended, and many live broadcast platforms also settle an anchor's pay based on it.

Generally speaking, operating a large number of zombie accounts (i.e., group control accounts) in batches through group control software can inflate the popularity of an anchor's room. In some related technologies, the following methods are used to detect group control accounts: 1. Device environment aggregation detection, which determines whether group control accounts exist from the mobile phone number used at registration and the IP address used when watching an anchor; the most prominent sign is that the mobile phone numbers of group control accounts share an IP address. 2. Room feature anomaly detection: when group control accounts are used to inflate popularity, the distribution of data features of the anchor's room, such as gift rewards, the number of viewers, and the number of bullet comments, becomes abnormal. For example, under normal circumstances, when the number of viewers in an anchor's room reaches a threshold, gift rewards fall within a certain distribution range, but when the number of viewers reaches that threshold because of group control accounts, gift rewards are clearly below the normal range. Anchors with abnormal feature distributions can then be found through room feature anomaly detection. Although these methods can detect group control accounts, their security is low and they are easy to crack. For example, a dynamic IP pool prevents mobile phone numbers from sharing the same IP address, and distributed cloud group control access or switching gift-giving accounts avoids an abnormal feature distribution.

In general, how to safely and accurately dig out the group control accounts in the live broadcast has become a technical problem that needs to be solved urgently.

Summary of the invention

The embodiments of the present application provide a group control account mining method, apparatus, device, and storage medium to solve the technical problems of low security and easy cracking in the group control account mining process.

In the first aspect, an embodiment of the present application provides a method for mining a group control account, including:

Obtain the first viewing data of the user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data watched by the corresponding user within the set time period;

Find out similar viewing users in the user group according to the first viewing data;

A similar viewing user group is mined from the user group according to the similar viewing users, and a target user group belonging to the group control account is determined according to the similar viewing user group.

In the second aspect, the embodiment of the present application provides a group control account mining device, including:

The data acquisition module is configured to acquire the first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data watched by the corresponding user within the set time period;

a user search module, configured to search for similar viewing users in the user group according to the first viewing data;

The group control determination module is configured to dig out a similar viewing user group from the user group according to the similar viewing users, and determine a target user group belonging to the group control account according to the similar viewing user group.

In a third aspect, an embodiment of the present application provides a group control account mining device, including: a memory and one or more processors;

the memory configured to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the group control account mining method according to the first aspect.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the group control account mining method described in the first aspect.

With the above group control account mining method, apparatus, device, and storage medium, the first viewing data of the user group within a set time period is obtained, similar viewing users are found in the user group according to the first viewing data, a similar viewing user group is mined according to the similar viewing users, and the group control accounts are determined according to the similar viewing user group. These technical means solve the technical problems of low security and easy cracking in the group control account mining process. Even if the group control accounts use a dynamic IP pool or distributed cloud group control access, similar viewing users can still be effectively filtered out based on which anchors each user watches, and the group control accounts in the user group can then be mined accurately. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of an anchor's popularity.

Description of drawings

FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a hash bucket provided by an embodiment of the present application;

FIG. 3 is a flowchart of another group control account mining method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a neural network provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application.

Detailed description

The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are used to explain the present application, but not to limit the present application. In addition, it should be noted that, for the convenience of description, the drawings only show some but not all the structures related to the present application.

A group control account refers to using multiple real devices (such as multiple mobile phones), or simulating multiple real devices, and installing script software (i.e., group control software) on the devices to control application software on them (such as live broadcast application software), so that, by modifying the software and hardware information of the devices, manual use of the application software is simulated. Through automated means, group control accounts can imitate the operation requests of real users to the greatest extent. In the field of live broadcasting, group control accounts can be used for cheating purposes such as attracting fans to a live broadcast room, diverting traffic, and spamming advertisements. In particular, increasing an anchor's popularity by having group control accounts imitate normal accounts and enter the anchor's room is called room brushing.

To avoid the influence of group control accounts on normal live broadcasting, the embodiment of the present application provides a group control account mining method that mines group control accounts safely and accurately. In one example, the method may be executed by a group control account mining device, which may be implemented in software and/or hardware and may consist of a single physical entity or of two or more physical entities. For example, the group control account mining device may be a computer, a tablet computer, or another smart device with data computing and analysis capabilities.

FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application. Referring to FIG. 1 , the group control account mining method may include:

Step 110: Acquire the first viewing data of the user group within the set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data watched by the corresponding user within the set time period.

In one embodiment, the user group refers to the set of users who use the live broadcast application software to watch live broadcasts. Viewing data refers to data that reflects what a user watched while viewing live broadcasts. Exemplarily, the viewing data includes at least the anchor identity data watched by the user. The anchor identity data indicates the identity of an anchor, and different anchors have different anchor identity data. An anchor is a user who has registered on the live broadcast application platform and can broadcast live. It can be understood that the viewing data may also include content such as how long the user watched each anchor. In one embodiment, each user in the user group also has corresponding user identity data, and different users have different user identity data. When a user enters an anchor's room, the group control account mining device records the user identity data, the anchor identity data, and the viewing duration, and generates one piece of viewing data. When the user enters another live broadcast room, the device records the user identity data, anchor identity data, and viewing duration again and generates another piece of viewing data. In one embodiment, the first viewing data refers to the collection of a user's viewing data within the set time period, which may optionally include the anchor identity data watched by that user within the set time period. Each user corresponds to one piece of first viewing data. The set time period can be chosen according to the actual situation, for example 24 hours, 12 hours, or 48 hours.
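As a rough illustration of how per-room viewing records could be collapsed into one piece of first viewing data per user, the sketch below assumes each record is a (user_id, anchor_id, start_time, duration_seconds) tuple. The record layout, the function name build_first_viewing_data, and the 24-hour window are assumptions made for this example and are not fixed by the embodiment.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_first_viewing_data(records, period_start, period_length):
    """Group viewing records by user, keeping only records inside the set time period.

    records: iterable of (user_id, anchor_id, start_time, duration_seconds) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {user_id: {anchor_id: total_seconds_watched}}.
    """
    period_end = period_start + period_length
    first_viewing = defaultdict(lambda: defaultdict(int))
    for user_id, anchor_id, start_time, duration in records:
        if period_start <= start_time < period_end:
            first_viewing[user_id][anchor_id] += duration
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(anchors) for user, anchors in first_viewing.items()}

day = datetime(2021, 1, 25)
records = [
    ("u1", "anchor_a", day + timedelta(hours=1), 600),
    ("u1", "anchor_b", day + timedelta(hours=2), 300),
    ("u2", "anchor_a", day + timedelta(hours=1), 610),
    ("u2", "anchor_a", day + timedelta(hours=30), 100),  # outside the 24-hour period
]
first_viewing = build_first_viewing_data(records, day, timedelta(hours=24))
```

Here each user ends up with exactly one entry, which plays the role of that user's first viewing data; the record that starts 30 hours in is ignored because it falls outside the set time period.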

Optionally, to facilitate subsequent calculation, the anchor identity data is encoded in the embodiment so that it is represented as a vector. The encoding rule is not limited in the embodiment. For example, one-hot encoding is used to vectorize the identity data of each anchor. The first viewing data is then composed of the vectors corresponding to the identity data of each anchor.

Step 120: Find out similar viewing users in the user group according to the first viewing data.

Group control accounts operate in batches: they can enter and exit the rooms of one or several anchors at the same time. The users in the same batch of group control accounts therefore watch the same anchor or the same set of anchors, and the time each of them spends watching the same live broadcast is also nearly identical. In contrast, the viewing durations of normal users (users of non-group-control accounts) vary widely and conform to a normal distribution. Moreover, normal users show both preference (there are one or more fixed anchors they like to watch) and randomness (they also pick anchors to watch at random). Therefore, the probability that any two normal users have the same first viewing data is relatively small, while the probability that group control accounts have the same first viewing data is relatively large. For this reason, in the embodiment, similar viewing users are found in the user group through the first viewing data. Similar viewing users are two users who watched the same or highly similar anchors. It can be understood that one user can form similar viewing users with several different users; for example, user A and user B can form similar viewing users, and user A and user C can also form similar viewing users.

In one embodiment, similar viewing users are selected by calculating the similarity between pieces of first viewing data. The method for calculating the similarity is not limited in the embodiment; for example, cosine similarity or Euclidean distance may be used. Exemplarily, users with high similarity are determined to be similar viewing users. For example, when cosine similarity is used, a threshold is set that represents the maximum cosine distance between similar viewing users. It can be understood that the smaller the cosine distance, the higher the similarity. After the cosine distance between two pieces of first viewing data is calculated, the two corresponding users are determined to be similar viewing users if the cosine distance is less than the threshold. After the similarity has been calculated pairwise for all first viewing data in this manner, all similar viewing users are obtained.
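The pairwise thresholding described above can be sketched as follows. This is a minimal example, assuming each user's first viewing data has been flattened into a single vector with one position per anchor; that flattening, the function names, and the threshold value 0.1 are assumptions for illustration only.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty viewing vector as dissimilar to everything
    return 1.0 - dot / (norm_u * norm_v)

def find_similar_pairs(viewing_vectors, max_distance):
    """Compare all users pairwise; keep pairs whose cosine distance is below the threshold."""
    pairs = []
    for (user_a, vec_a), (user_b, vec_b) in combinations(sorted(viewing_vectors.items()), 2):
        if cosine_distance(vec_a, vec_b) < max_distance:
            pairs.append((user_a, user_b))
    return pairs

# Each vector marks which of four anchors the user watched.
viewing_vectors = {
    "u1": [1, 1, 0, 0],
    "u2": [1, 1, 0, 0],
    "u3": [0, 0, 1, 1],
}
similar_pairs = find_similar_pairs(viewing_vectors, max_distance=0.1)
```

In this toy data only u1 and u2 watched the same anchors, so they are the only pair kept. The loop compares every pair of users, which is what makes the cost grow quadratically with the size of the user group.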

Step 130: Dig out a similar viewing user group from the user group according to the similar viewing users, and determine a target user group belonging to the group control account according to the similar viewing user group.

A similar viewing user group is a group in which the first viewing data of the users is highly similar, that is, the anchors watched by the users in the group are the same or highly similar. The similar viewing user group can be determined from the similar viewing users. In one embodiment, a node graph is drawn in which each node represents a user and two nodes that are similar viewing users are connected by an edge. It can be understood that, because group control accounts perform the same operations in batches, they form a dense community containing a large number of users in the node graph, whereas normal users are more scattered or form communities with fewer users. Therefore, communities with dense user connections can be found through the node graph (for example, nodes that are connected to one another are formed into a community), and each community found is regarded as a similar viewing user group.
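One simple reading of "nodes that are connected to one another are formed into a community" is to take the connected components of the node graph. The sketch below does that with a depth-first traversal; the function name find_communities and the sample pairs are hypothetical, and a denser-community criterion could be substituted.

```python
from collections import defaultdict

def find_communities(similar_pairs):
    """Return the connected components of the graph whose edges are similar-viewing pairs."""
    adjacency = defaultdict(set)
    for a, b in similar_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set()
    communities = []
    for start in sorted(adjacency):
        if start in visited:
            continue
        # Depth-first traversal collects every node reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component.add(node)
            stack.extend(adjacency[node] - visited)
        communities.append(component)
    return communities

similar_pairs = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u7", "u8")]
communities = find_communities(similar_pairs)
```

With these pairs the graph has two components: a triangle of u1, u2, and u3, and a separate edge between u7 and u8. Each component is a candidate similar viewing user group.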

Exemplarily, the target user group is determined according to the similar viewing user group, where the target user group refers to the user group corresponding to the group control account. In one embodiment, the found similar viewing user group is determined as the target user group, or, the similar viewing user group whose number of users is higher than the number threshold (this value can be set in combination with the actual situation) is determined as the target user group. In another embodiment, the target user group is determined in combination with the network addresses (eg IP addresses) of the users in the similar viewing user group, the information of the equipment used and/or the viewing duration of each anchor. Among them, the device information is used to distinguish the device used by the user. The network address and device information can be obtained when the user uses the live broadcast application software. For example, count the viewing time of each user in the similar viewing user group for the same anchor. If the viewing time of the same anchor is the same or similar (for example, the difference in viewing time is less than the set duration), then the similar viewing user group is determined as the target user group . For another example, for group control accounts, they share network addresses and/or device information, that is, different users in the target user group have the same network addresses and/or device information. Therefore, whether the similar viewing user group is the target user group may be determined by considering whether the users in the similar viewing user group share network addresses and/or device information.
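A minimal sketch of two of the criteria mentioned above, the number threshold and the viewing-duration check, might look like this. The data layout, the parameter names number_threshold and max_duration_gap, and the sample values are assumptions for illustration; checks on shared network addresses or device information could be added in the same way.

```python
def select_target_groups(groups, watch_seconds, number_threshold, max_duration_gap):
    """Keep groups with more users than the number threshold whose members watched
    each commonly watched anchor for nearly the same length of time.

    groups: list of sets of user ids (similar viewing user groups).
    watch_seconds: {user_id: {anchor_id: seconds}} (hypothetical layout).
    """
    targets = []
    for group in groups:
        if len(group) <= number_threshold:
            continue
        # Anchors watched by every member of the group.
        shared = set.intersection(*(set(watch_seconds[u]) for u in group))
        if not shared:
            continue
        durations_close = all(
            max(watch_seconds[u][a] for u in group) - min(watch_seconds[u][a] for u in group)
            < max_duration_gap
            for a in shared
        )
        if durations_close:
            targets.append(group)
    return targets

watch_seconds = {
    "u1": {"a": 600, "b": 300},
    "u2": {"a": 605, "b": 298},
    "u3": {"a": 598, "b": 301},
    "u7": {"c": 1200},
    "u8": {"c": 45},
}
groups = [{"u1", "u2", "u3"}, {"u7", "u8"}]
targets = select_target_groups(groups, watch_seconds, number_threshold=2, max_duration_gap=30)
```

Only the three-user group passes: it exceeds the number threshold and its members' viewing durations for anchors a and b differ by only a few seconds.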

In summary, the first viewing data of the user group within the set time period is acquired, similar viewing users are found in the user group according to the first viewing data, a similar viewing user group is mined according to the similar viewing users, and the group control accounts are determined according to the similar viewing user group. These technical means solve the technical problems of low security and easy cracking in the group control account mining process. Even if the group control accounts use a dynamic IP pool or distributed cloud group control access, similar viewing users can still be effectively filtered out based on which anchors each user watches, and the group control accounts in the user group can then be mined accurately. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of an anchor's popularity.

On the basis of the above embodiment, similar viewing users are determined by calculating the similarity. At this time, step 120 includes steps 121-122:

Step 121: Calculate the viewing similarity among the users in the user group according to the first viewing data.

The viewing similarity reflects how similar two pieces of first viewing data are. It can be calculated by cosine similarity, Euclidean distance, or the like. There is one viewing similarity between every two users, so when the number of users is large (e.g., greater than 10^6), calculating the viewing similarity requires a large amount of computation. Therefore, in the embodiment, the users are first roughly divided into buckets so that potentially similar users fall into the same bucket with high probability, and the viewing similarity is then calculated only between users within each bucket, which reduces the amount of calculation. In this case, step 121 includes steps 1211-1212:

Step 1211: Use the locality-sensitive hash to bucket each first viewing data.

Locality-sensitive hashing (LSH) is used to divide the first viewing data into buckets so that pieces of first viewing data that may be similar fall into the same bucket. The users corresponding to the first viewing data in each bucket can then be considered candidate similar viewing users.

In one embodiment, when using LSH for bucketing, step 1211 includes steps 12111-12113:

Step 12111: Perform a minimum hash calculation on each first viewing data to obtain a corresponding signature vector.

Minimum hashing (MinHash) is a technique commonly used in LSH to compute a signature vector (or matrix). In the embodiment, MinHash is applied to the first viewing data to obtain the signature vector (or matrix). Each piece of first viewing data corresponds to one signature vector, and the signature vector occupies less space than the first viewing data.
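For illustration, a MinHash signature can be computed by applying several hash functions to the set of anchors a user watched and keeping the minimum value under each. The sketch below assumes the first viewing data is reduced to a set of anchor ids and uses seeded SHA-256 digests as a stand-in hash family; both choices, and the names _hash and minhash_signature, are assumptions rather than details from the embodiment.

```python
import hashlib

def _hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); stands in for a family of hash functions."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(anchor_ids, num_hashes=8):
    """Return the MinHash signature of a non-empty set of anchor ids."""
    return [min(_hash(seed, a) for a in anchor_ids) for seed in range(num_hashes)]

sig_u1 = minhash_signature({"anchor_a", "anchor_b", "anchor_c"})
sig_u2 = minhash_signature({"anchor_c", "anchor_b", "anchor_a"})  # same set, different order
sig_u3 = minhash_signature({"anchor_x"})
```

The signature has a fixed length (8 here) regardless of how many anchors exist, which is why it takes less space than a one-hot representation. Two users with identical anchor sets get identical signatures, and for overlapping sets the fraction of matching positions approximates their Jaccard similarity.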

Step 12112: Divide each signature vector into multiple bands, and map each band into a corresponding hash bucket by using a hash function, where there is at least one hash function.

Each signature vector is divided into multiple segments, and the content of each segment is regarded as a band. The number of bands (i.e., the number of segments) can be set according to the actual situation, and every signature vector has the same number of bands. Each band is then mapped into a corresponding hash bucket by a hash function. The hash function can be selected according to the actual situation, and one or more hash functions can be used. When multiple hash functions are used, each hash function maps the band once.

Step 12113: Put the first viewing data corresponding to the bands mapped to the same hash bucket into the same bucket.

It can be understood that if one or more bands of two signature vectors are the same, the two signature vectors are relatively similar, and the more bands they share, the higher their similarity. Here, two bands being the same means that they are mapped to the same hash bucket. Accordingly, in the embodiment, the bands mapped into the same hash bucket are obtained, the first viewing data corresponding to each of these bands is looked up, and the pieces of first viewing data found are treated as data in the same bucket. The users corresponding to the first viewing data in the same bucket can then be considered candidate similar viewing users.
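The banding step can be sketched as below. This example assumes the signature length is divisible by the number of bands, uses Python's built-in hash on a tuple of integers as the hash function, and keys each bucket by band position plus hash value; all three are illustrative choices, and the names lsh_buckets and candidate_groups are hypothetical.

```python
from collections import defaultdict

def lsh_buckets(signatures, num_bands):
    """Split each signature into equal bands and group users whose bands collide.

    signatures: {user_id: list of ints}, all of the same length, divisible by num_bands.
    Returns {(band_index, band_hash): set of user ids}.
    """
    buckets = defaultdict(set)
    for user, signature in signatures.items():
        rows_per_band = len(signature) // num_bands
        for band_index in range(num_bands):
            band = tuple(signature[band_index * rows_per_band:(band_index + 1) * rows_per_band])
            # The band index is part of the key so that bands at different
            # positions never share a bucket.
            buckets[(band_index, hash(band))].add(user)
    return buckets

def candidate_groups(buckets):
    """Keep only buckets holding at least two users; these are candidate similar viewers."""
    return [users for users in buckets.values() if len(users) > 1]

signatures = {
    "u1": [3, 7, 1, 9, 4, 4],
    "u2": [3, 7, 1, 9, 5, 6],   # shares the first two bands with u1
    "u3": [8, 2, 6, 0, 2, 3],   # shares no band with the others
}
candidates = candidate_groups(lsh_buckets(signatures, num_bands=3))
```

Users u1 and u2 collide in two of their three bands and so appear together as candidates, while u3 never shares a bucket with anyone and is not compared further.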

For example, FIG. 2 is a schematic diagram of a hash bucket provided by an embodiment of the present application. FIG. 2 contains three hash buckets, denoted band1, band2, and band3. Note that FIG. 2 shows only some of the bands mapped to band1 (represented as 10002, 32122, and 01311). The first viewing data corresponding to the bands in band1 is taken as the data of one bucket, the first viewing data corresponding to the bands in band2 as the data of another bucket, and the first viewing data corresponding to the bands in band3 as the data of a third bucket, which completes the bucketing of the first viewing data.

Step 1212: Calculate the viewing similarity between the first viewing data in each bucket.

Taking the bucket as the unit, the viewing similarity is calculated between the pieces of first viewing data within each bucket; it does not need to be calculated between first viewing data in different buckets. The method for calculating the viewing similarity is not limited in the embodiment.

Step 122: Find similar viewing users in the user group according to the viewing similarity.

In one embodiment, similar viewing users are found by comparing thresholds. For example, when the Euclidean distance is used to calculate the viewing similarity, the smaller the distance between the two first viewing data, the higher the viewing similarity. Therefore, a distance threshold may be set according to the actual situation, and when the distance is less than the distance threshold, it is determined that the two users are similar viewing users.
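Steps 1212 and 122 together amount to comparing users only inside each bucket and keeping the pairs below a distance threshold. The following sketch uses Euclidean distance; the vectors, bucket contents, function names, and the threshold 0.5 are made-up values for illustration.

```python
import math
from itertools import combinations

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def similar_pairs_in_buckets(buckets, vectors, distance_threshold):
    """Compare users only inside each bucket and keep pairs closer than the threshold."""
    pairs = set()
    for users in buckets:
        for a, b in combinations(sorted(users), 2):
            if euclidean_distance(vectors[a], vectors[b]) < distance_threshold:
                pairs.add((a, b))
    return pairs

vectors = {
    "u1": [1.0, 1.0, 0.0],
    "u2": [1.0, 1.0, 0.0],
    "u3": [1.0, 0.0, 1.0],
    "u4": [0.0, 0.0, 1.0],
}
buckets = [{"u1", "u2", "u3"}, {"u3", "u4"}]
pairs = similar_pairs_in_buckets(buckets, vectors, distance_threshold=0.5)
```

Note that u1 and u4 are never compared because they share no bucket, which is where the saving in computation comes from. A pair that lands in several buckets is still reported once because the result is a set.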

As described above, similar viewing users can be found accurately by calculating the viewing similarity, and the locality-sensitive hashing algorithm avoids the large amount of viewing similarity computation that would otherwise arise when the number of users is large, reducing the computational complexity of finding similar viewing users.

FIG. 3 is a flowchart of another group control account mining method according to an embodiment of the present application. The group control account mining method is detailed on the basis of the above embodiment.

In this embodiment, each piece of anchor identity data corresponds to one vocabulary vector, which is obtained by performing one-hot encoding on the anchor identity data. Exemplarily, the dimension of the vocabulary vector is given by its length, and the length equals the current total number of anchors, where the current total number of anchors may be the total number of anchors currently registered in the live broadcast application software or the total number of anchors watched by the users in the user group. For example, if the current total number of anchors is 4, each anchor's identity data is represented by a 4-dimensional vocabulary vector, and the 4 vocabulary vectors are [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1].

In one embodiment, referring to FIG. 3 , the group control account mining method may include:

Step 210: Acquire the first viewing data of the user group within the set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data watched by the corresponding user within the set time period.

In the embodiment, the first viewing data is represented by vocabulary vectors. For example, if the vocabulary vectors of the anchor identity data included in the first viewing data are [1 0 0 0], [0 1 0 0], and [0 0 1 0], respectively, then the first viewing data is a 3×4 matrix consisting of these vocabulary vectors.
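The four-anchor example can be reproduced with a few lines of code. The anchor names and the helper functions one_hot_vocabulary and first_viewing_matrix below are hypothetical and serve only to show how the vocabulary vectors stack into a matrix.

```python
def one_hot_vocabulary(anchor_ids):
    """Map each anchor id to a one-hot vocabulary vector whose length is the anchor count."""
    total = len(anchor_ids)
    return {
        anchor: [1 if position == index else 0 for position in range(total)]
        for index, anchor in enumerate(anchor_ids)
    }

def first_viewing_matrix(watched_anchors, vocabulary):
    """Stack the vocabulary vectors of the anchors a user watched, one row per anchor."""
    return [vocabulary[anchor] for anchor in watched_anchors]

vocabulary = one_hot_vocabulary(["anchor_a", "anchor_b", "anchor_c", "anchor_d"])
matrix = first_viewing_matrix(["anchor_a", "anchor_b", "anchor_c"], vocabulary)
```

A user who watched three of the four anchors gets a 3×4 matrix. With hundreds of thousands of anchors each row would be correspondingly long, which motivates the dimensionality reduction in the next step.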

Step 220: Use the vocabulary vector corresponding to each first viewing data as training data to obtain the embedded word vector corresponding to each vocabulary vector by training, and the length of the embedded word vector is less than the length of the vocabulary vector.

Exemplarily, when the number of anchors registered in the live broadcast application software is very large (for example, hundreds of thousands or millions), the corresponding vocabulary vectors are very long. The dimension of the first viewing data is correspondingly very large, which hinders subsequent calculation on the first viewing data. Therefore, in the embodiment, dimensionality reduction is performed on the vocabulary vectors according to each piece of first viewing data, and the vector obtained after dimensionality reduction is recorded as the embedded word vector. Each vocabulary vector corresponds to one embedded word vector, and different vocabulary vectors may correspond to the same embedded word vector. The length (i.e., dimension) of the embedded word vector can be set according to the actual situation, for example 50; in any case, the length of the embedded word vector is less than the length of the vocabulary vector. In one embodiment, Word2Vec is used to obtain the embedded word vectors. Word2Vec is a natural language processing (NLP) tool, a group of related models used to generate word vectors. The models are shallow, two-layer neural networks. After the neural network is trained, the Word2Vec model maps each word to a vector that can represent the relationships between words; this vector is located in the hidden layer of the neural network. In the embodiment, the vector that represents the relationships between vocabulary vectors is recorded as the embedded word vector. That is, Word2Vec converts each word (here, a vocabulary vector) into an embedded word vector, so that the relationships between vocabulary vectors can be measured quantitatively through the embedded word vectors.

In one embodiment, FIG. 4 is a schematic diagram of a neural network provided by an embodiment of the present application. The neural network is the one used by Word2Vec and is a Skip-gram model. In NLP, the Skip-gram model takes a word as input and predicts its context words as output. Referring to FIG. 4, the input layer receives a V-dimensional vocabulary vector (i.e., $[x_1\ x_2\ \dots\ x_V]$), and the output layer outputs another V-dimensional vocabulary vector (i.e., $[y_1\ y_2\ \dots\ y_V]$). After training is completed, the weights from the input layer to the hidden layer give the embedded word vector corresponding to the vocabulary vector, which can represent the relationship between the vocabulary vector at the input layer and the vocabulary vector at the output layer. The transpose of the k-th row of the matrix $W_{V \times N} = \{w_{ki}\}$ shown in FIG. 4 serves as the embedded word vector of the vocabulary vector whose effective (non-zero) code is at the k-th position. The embedded word vector is N-dimensional, with $N \ll V$. It can be understood that when one input word corresponds to multiple output words, there are multiple matrices $W'_{N \times V} = \{w'_{ik}\}$, each of which outputs one set of $[y_1\ y_2\ \dots\ y_V]$.
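Because the input is one-hot, multiplying it by the input-to-hidden weight matrix simply selects one row of that matrix: a vocabulary vector with its 1 at position k yields row k as the hidden-layer value. The small check below illustrates this with made-up weights (V = 4, N = 2); the function name hidden_layer and the numbers in W are hypothetical.

```python
def hidden_layer(one_hot, weights):
    """Compute h = W^T x for a V-dimensional input x and a V x N weight matrix W."""
    n = len(weights[0])
    return [sum(one_hot[k] * weights[k][i] for k in range(len(one_hot))) for i in range(n)]

# Hypothetical 4 x 2 input-to-hidden weights (V = 4 anchors, N = 2 dimensions).
W = [
    [0.1, 0.9],
    [0.4, 0.3],
    [0.7, 0.2],
    [0.5, 0.5],
]
x = [0, 0, 1, 0]  # vocabulary vector with its 1 at position k = 2 (zero-based)
embedding = hidden_layer(x, W)
```

The result equals W[2], so looking up an anchor's embedded word vector after training is just a row lookup and needs no matrix multiplication in practice.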

In one embodiment, the training process of the neural network is as follows: an input word is selected in a sentence, and a skip_window parameter and a num_skips parameter are defined. The skip_window parameter indicates the number of words selected on each side (left or right) of the current input word in the sentence when training the neural network, which determines the word window in which the output words of the neural network are located. The num_skips parameter indicates the number of different words selected from the word window as output words. For example, if the sentence is "there is an apple on the table" and skip_window and num_skips are both 2, then when the input word is apple, the corresponding word window is [is an apple on the]. After associating the context, the neural network obtains two correspondences, one between apple and an and one between apple and on. Here an and on are the different output words, and (apple, an) and (apple, on) can be used as two sets of training data for the sentence, that is, apple is input and an or on is output. After the setting is completed, the vocabulary vector corresponding to the input word is selected from the training data and input to the neural network, and a probability distribution for each input word is obtained according to the output words; the distribution represents the probability that each input word yields the same output word. For example, when the neural network is trained with "the capital of China is Beijing" and "the capital of the United Kingdom is London" as training data, if the input word is China or the United Kingdom, the output words obtained from the context will both contain words such as "the capital is". Therefore, the probabilities of related words such as "China" and "UK" should be higher than those of other words, and the embedded word vectors corresponding to "China" and "UK" are the same or similar.
According to the above probability distribution, the matrices W_{V×N} and W'_{N×V} in FIG. 4 are updated by means of gradient descent and backpropagation to realize training. After the training is completed, the embedded word vector of each input word is obtained through the matrix W_{V×N}.
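The selection of training pairs described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a seeded random choice of output words; the function name `skipgram_pairs` and the `rng` argument are hypothetical.

```python
import random


def skipgram_pairs(sentence, skip_window, num_skips, rng):
    """Generate (input word, output word) training pairs for a Skip-gram model."""
    words = sentence.split()
    pairs = []
    for pos, center in enumerate(words):
        left = max(0, pos - skip_window)
        right = min(len(words), pos + skip_window + 1)
        # Context candidates: every word in the window except the input word.
        context = [words[i] for i in range(left, right) if i != pos]
        # Pick at most num_skips output words from different window positions.
        for target in rng.sample(context, min(num_skips, len(context))):
            pairs.append((center, target))
    return pairs


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
pairs = skipgram_pairs("there is an apple on the table", 2, 2, rng)
apple_pairs = [p for p in pairs if p[0] == "apple"]
```

For the input word apple, the two output words are drawn from the window words is, an, on and the, so a pair such as (apple, an) or (apple, on) is one possible outcome.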

When the above training method is applied to the first viewing data, each piece of first viewing data is treated as a sentence, in which the vocabulary vector of each anchor identity data serves as a word, and input words and output words are then selected to train the neural network. After the training is completed, the embedded word vector of each input word is obtained through the matrix W_{V×N}. Understandably, input words with the same output words have the same or similar embedded word vectors. For example, if some first viewing data contain the anchor identity data of anchor A and anchor B, and other first viewing data contain the anchor identity data of anchor C and anchor B, then when the vocabulary vector corresponding to anchor A or anchor C is input, the probability of outputting the vocabulary vector corresponding to anchor B is relatively high. Therefore, the embedded word vectors corresponding to anchor A and anchor C are similar or the same. It should be noted that the process of using Word2Vec to obtain the embedded word vectors can be regarded as an embedding process.
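A minimal sketch of turning first viewing data into "sentences" of vocabulary vectors is given below. The user and anchor identifiers and the helper names `build_vocabulary` and `vocabulary_vector` are assumptions for illustration; the only property taken from the description is that each anchor corresponds to one one-hot vocabulary vector whose length equals the total number of anchors.

```python
def build_vocabulary(first_viewing_data):
    """Map each anchor identity to an index; V is the total number of anchors."""
    anchors = sorted({a for watched in first_viewing_data.values() for a in watched})
    return {anchor: idx for idx, anchor in enumerate(anchors)}


def vocabulary_vector(anchor, vocab):
    """One-hot vocabulary vector whose length equals the total number of anchors."""
    vec = [0] * len(vocab)
    vec[vocab[anchor]] = 1
    return vec


# Hypothetical first viewing data: user ID -> anchors watched in the set period.
first_viewing_data = {
    "user_1": ["anchor_A", "anchor_B"],
    "user_2": ["anchor_C", "anchor_B"],
}

vocab = build_vocabulary(first_viewing_data)
# Each user's viewing record plays the role of one sentence of anchor "words".
sentences = [[vocabulary_vector(a, vocab) for a in watched]
             for watched in first_viewing_data.values()]
```

In this toy data, anchor_A and anchor_C both co-occur with anchor_B, which is the situation in which their embedded word vectors are expected to end up similar after training.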

Step 230: Obtain corresponding second viewing data according to the embedded word vector corresponding to the first viewing data.

Exemplarily, after the embedded word vector corresponding to each vocabulary vector is obtained, the embedded word vectors are processed to obtain the second viewing data. In the embodiment, the second viewing data refers to a vector obtained from the embedded word vectors, and the dimension of the second viewing data is smaller than the dimension of the first viewing data. In an embodiment, when the second viewing data is obtained from the embedded word vectors of the anchor identity data, an average value, a maximum value, or a minimum value can be taken. Taking the average value as an example, the values at the same position in the embedded word vectors are averaged, and the vector composed of the average value at each position is taken as the second viewing data. For example, the first viewing data includes the anchor identity data of anchor A, anchor B, and anchor C, and the second viewing data obtained by averaging the three corresponding embedded word vectors is [0.4234, 0.762, 0.4234], where the first 0.4234 is the result of averaging the first values of the three embedded word vectors, and so on.
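The position-wise aggregation can be sketched as follows. The embedded word vectors below are made-up values, and the function name `second_viewing_data` and its `mode` argument are hypothetical.

```python
def second_viewing_data(anchors, embeddings, mode="mean"):
    """Combine the embedded word vectors of a user's anchors position by position."""
    vectors = [embeddings[a] for a in anchors]
    columns = list(zip(*vectors))  # values at the same position across vectors
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unsupported mode: {mode}")


# Hypothetical 3-dimensional embedded word vectors for three anchors.
embeddings = {
    "anchor_A": [0.25, 0.5, 1.0],
    "anchor_B": [0.5, 1.0, 0.25],
    "anchor_C": [0.75, 0.0, 0.25],
}

watched = ["anchor_A", "anchor_B", "anchor_C"]
vector = second_viewing_data(watched, embeddings)
```

Whatever the number of anchors a user watched, the result has the embedding dimension N, which is how the second viewing data stays lower-dimensional than the first viewing data.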

Step 240: Find out similar viewing users in the user group according to the second viewing data.

This step uses the same processing as finding similar viewing users in the user group according to the first viewing data, for example, bucketing with locality-sensitive hashing and then finding similar viewing users within each bucket, which is not repeated in the embodiment.

Step 250: Take each user in the user group as a user node, and connect user nodes corresponding to similar viewing users through edges to obtain a node relationship graph.

A node relationship graph refers to a node graph in which the relationships between nodes are expressed by connecting edges. In this step, the node relationship graph is constructed according to the user group and the similar viewing users therein. Each user is displayed as a node in the node relationship graph, and in the embodiment a node representing a user is recorded as a user node. Edges are drawn between the user nodes corresponding to similar viewing users. It is understandable that the positions of the user nodes in the node relationship graph may be selected according to actual conditions, which is not limited in the embodiment. Optionally, the higher the similarity between similar viewing users, the greater the weight of the corresponding edge.
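One possible in-memory representation of the node relationship graph is a weighted adjacency map, sketched below. The function name `build_node_graph`, the user identifiers and the similarity values are assumptions for illustration.

```python
def build_node_graph(users, similar_pairs):
    """Build an undirected weighted graph: user node -> {neighbor: edge weight}."""
    graph = {user: {} for user in users}  # users without similar peers stay isolated
    for (u, v), similarity in similar_pairs.items():
        # Higher viewing similarity gives a larger edge weight.
        graph[u][v] = similarity
        graph[v][u] = similarity
    return graph


# Hypothetical similar viewing users with their viewing similarity.
users = ["A", "B", "C", "D"]
similar_pairs = {("A", "B"): 0.9, ("B", "D"): 0.8}

graph = build_node_graph(users, similar_pairs)
```

Storing each edge in both directions keeps the neighbor lookup needed by the later label propagation step a single dictionary access.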

Step 260: Process the node relationship graph using a label propagation algorithm to determine a group of similar viewing users.

The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is to use the label information of labeled nodes to predict the label information of unlabeled nodes, which enables local community division. In the embodiment, in the initial stage of LPA, a label is assigned to each user node in the node relationship graph. In each iteration, each user node changes its label according to the labels of the user nodes connected to it, until the iteration ends, and the similar viewing user groups are then obtained based on the labels. The rule for changing labels is to take the label that appears most often among the connected user nodes as the node's own label. When the similar viewing user group is determined in the above manner, step 260 may include steps 261-266:

Step 261: Assign a corresponding label to each user node in the node relationship graph.

The label generation rule is not limited in the embodiment. The label currently assigned to each user node can be regarded as an initial label, and the initial labels of the user nodes are all different. In the embodiment, it is assumed that the node relationship graph includes M user nodes. In this case, user node 1 corresponds to label 1, user node i corresponds to label i, where 1≤i≤M, and so on.

Step 262: Find a user node in the node relationship graph, and find out all neighboring user nodes of the user node according to the edge connection relationship of the user node.

Exemplarily, every user node is processed in the same way, so one user node is used as an example for description. In an embodiment, a user node is searched for in the node relationship graph, where the search rule is not limited in the embodiment; for example, the user nodes are searched in order. After the user node is found, its neighbor user nodes are searched for, where a neighbor user node refers to a user node connected to the user node through an edge, or a user node whose connecting edge to the user node has a weight greater than a set threshold. In general, a neighbor user node and the user node belong to similar viewing users. Understandably, each user node may have one or more neighbor user nodes, or may have none. If there is no neighbor user node, another user node is selected and this step is repeated. If there is a neighbor user node, the next step is performed.

Step 263: Count the labels of all neighboring user nodes, and update the label with the most occurrences as the label of the user node.

The label of each neighbor user node is obtained, and the label with the most occurrences among these labels is determined. If several labels share the most occurrences (for example, every node still has its initial label, so each label appears once), one label is randomly selected from them. After that, the label of the current user node is updated to the label with the most occurrences.

Understandably, nodes with the same label belong to the same community. After the update is complete, the community to which the current node belongs can be determined.

Step 264: Search for another user node in the node relationship graph, and return to perform the operation of finding all neighboring user nodes of the user node according to the edge connection relationship of the user node, until all user nodes in the node relationship graph are traversed.

After the label is replaced, another user node can be searched for in the node relationship graph, and the process returns to the operation of searching for neighbor user nodes in step 262. When all user nodes in the node relationship graph have been traversed, it is determined that this round of traversal ends. That is, after the M user nodes have been traversed (i.e., for i = 1:M), it is determined that this round of traversal ends.

Step 265: Determine whether the current traversal end condition is satisfied. If the traversal end condition is not satisfied, return to step 262. If the traversal end condition is satisfied, step 266 is executed.

The traversal end condition is a restriction condition for stopping the traversal, and its content can be set according to the actual situation. In the embodiment, the traversal end condition is that the threshold of the number of traversals is reached or that the label of every user node in the node relationship graph no longer changes. In one embodiment, the threshold of the number of traversals can be set according to the actual situation. After each round of traversal ends, the recorded number of traversals is incremented by 1, and it is then determined whether the number of traversals has reached the threshold. If the threshold is reached, it is determined that the traversal end condition is satisfied; otherwise, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. In another embodiment, after the current round of traversal is completed, it is determined whether the label of each user node has changed. If the label of at least one user node has changed, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. If no user node's label has changed, it is determined that the traversal end condition is satisfied. In yet another embodiment, after the current round of traversal is completed, it is determined whether the label of each user node has changed. If the label of at least one user node has changed, it is further determined whether the number of traversals has reached the threshold of the number of traversals: if it has, it is determined that the traversal end condition is satisfied; otherwise, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. If no user node's label has changed, it is determined that the traversal end condition is satisfied.

Step 266: Classify the users corresponding to the user nodes with the same label into the same similar viewing user group.

User nodes with the same label are found in the node relationship graph and grouped together to obtain the similar viewing user groups. The labels of the user nodes in each similar viewing user group are the same, and the label can be used as the ID of the similar viewing user group. For example, user A, user B, user C, and user D are user nodes in the graph. After the LPA algorithm ends, the user nodes are recorded as [1, A], [1, B], [2, C], [1, D], where the first field is the ID of the similar viewing user group. In this case, user A, user B, and user D belong to the same similar viewing user group.
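Steps 261-266 can be sketched as below. This is a minimal, unweighted illustration rather than the embodiment's implementation: edge weights are ignored, the traversal end condition combines a round limit with the no-change check, and the names `label_propagation`, `max_rounds` and `seed` are hypothetical.

```python
import random
from collections import Counter


def label_propagation(graph, max_rounds=20, seed=0):
    """Label propagation over a graph given as node -> iterable of neighbor nodes."""
    rng = random.Random(seed)
    nodes = list(graph)
    # Step 261: every user node starts with its own distinct label.
    labels = {node: i + 1 for i, node in enumerate(nodes)}
    for _ in range(max_rounds):  # threshold of the number of traversals
        changed = False
        for node in nodes:  # steps 262 and 264: visit every user node in turn
            neighbors = list(graph[node])
            if not neighbors:
                continue  # no neighbor user node: move on to the next node
            # Step 263: adopt the most frequent neighbor label, ties broken at random.
            counts = Counter(labels[n] for n in neighbors)
            top = max(counts.values())
            candidates = sorted(lbl for lbl, c in counts.items() if c == top)
            new_label = rng.choice(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # step 265: stop once a full round changes no label
            break
    # Step 266: users sharing a label form one similar viewing user group.
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, []).append(node)
    return groups


# Hypothetical graph: user B is similar to both A and D, user C is isolated.
graph = {"A": ["B"], "D": ["B"], "C": [], "B": ["A", "D"]}
groups = label_propagation(graph)
```

In this small graph users A, B and D end up sharing one label while user C keeps its own, mirroring the grouping in the example above. On larger graphs the random tie-breaking means that different seeds can yield different, equally valid partitions.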

Step 270: Determine a target user group belonging to the group control account according to the similar viewing user group.

In an embodiment, this step includes at least one of the following schemes:

Scheme 1: If the number of users in the similar viewing user group is greater than or equal to the number threshold, the similar viewing user group is determined as the target user group belonging to the group control account.

Exemplarily, the currently used quantity threshold refers to the minimum number of users included in the group control account, and its value can be set according to the actual situation, for example, the quantity threshold is 50. If the number of users included in the similar viewing user group is greater than or equal to the number threshold, it is confirmed as the target user group. After processing each similar viewing user group according to the above, the target user group can be mined.

Scheme 2: If multiple users in the similar viewing user group have the same device information and/or network address information, the similar viewing user group is determined as the target user group belonging to the group control account.

Exemplarily, the device information refers to information about the device used by the user when watching the live broadcast, which may be a device identifier or the like, and different devices have different device information. The network address information refers to the network address used by the user when watching the live broadcast, which may be an IP address. In the embodiment, the description takes the case where the device information and the network address information are both acquired as an example. In practical applications, only one type of information may be acquired for processing, and the processing is the same. Generally speaking, the probability that device information and network address information are reused among non-group-control accounts is low, while the probability that they are reused among group control accounts is high, for example, when different accounts are logged in on one device to artificially boost a live streaming room. In one embodiment, if multiple users in the similar viewing user group have the same device information and/or network address information, it is determined that repeated use exists, and the similar viewing user group is determined as the target user group. Optionally, a first same-user-number threshold may be set, which indicates the minimum number of users with the same device information and/or network address information in a group control account. If the number of users with the same device information and/or network address information reaches the first same-user-number threshold, it is determined that repeated use exists, that is, the network addresses and/or devices are aggregated. Therefore, the similar viewing user group is determined as the target user group.

In one embodiment, the similar viewing user group may also be a big anchor user group. A big anchor has a very high number of viewers and followers, and the criterion for classifying an anchor as a big anchor is not limited in the embodiment. A big anchor user group is a group whose users watch several big anchors at fixed times. It can be understood that a big anchor user group is characterized by a low probability of device information and network address information being reused among its users. In this case, if the users in the similar viewing user group have different device information and network address information, the similar viewing user group is determined as a big anchor user group. Alternatively, if the same device information or network address information exists among the users in the similar viewing user group but the number of users sharing it is small (for example, lower than a second same-user-number threshold), the similar viewing user group is determined as a big anchor user group, where the second same-user-number threshold is lower than the first same-user-number threshold.

It should be noted that the target user group and the big anchor user group can also be determined by combining the number of users with the repeated usage. For example, when the number of users in a similar viewing user group is greater than or equal to the number threshold, the group is determined as a suspected target user group. If repeated use exists in the suspected target user group, it is determined as the target user group. If there is no repeated use, it is determined as a big anchor user group.
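The combined judgment can be sketched as follows. The quantity threshold of 50 comes from the example above; the values of the two same-user-number thresholds, the returned strings, and the function names `max_shared_count` and `classify_group` are assumptions chosen only for illustration.

```python
from collections import Counter


def max_shared_count(group, attribute):
    """Largest number of users in the group sharing one value of the attribute."""
    counts = Counter(attribute[user] for user in group if user in attribute)
    return max(counts.values(), default=0)


def classify_group(group, device_of, ip_of, size_threshold=50,
                   first_threshold=3, second_threshold=2):
    """Label a similar viewing user group using group size and repeated usage."""
    if len(group) < size_threshold:
        return "not a suspect"
    shared = max(max_shared_count(group, device_of), max_shared_count(group, ip_of))
    if shared >= first_threshold:
        return "target user group"  # devices and/or network addresses aggregate
    if shared < second_threshold:
        return "big anchor user group"  # little or no repeated usage
    return "undetermined"


# Hypothetical group in which three of four users share one device.
group = ["u1", "u2", "u3", "u4"]
device_of = {"u1": "d1", "u2": "d1", "u3": "d1", "u4": "d2"}
ip_of = {"u1": "ip1", "u2": "ip2", "u3": "ip3", "u4": "ip4"}
fans_device_of = {"u1": "d1", "u2": "d2", "u3": "d3", "u4": "d4"}

result = classify_group(group, device_of, ip_of, size_threshold=3)
fans = classify_group(group, fans_device_of, ip_of, size_threshold=3)
```

The "undetermined" branch covers groups whose repeated usage falls between the two thresholds, a case the description leaves open.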

As mentioned above, training the embedded word vectors avoids the problem that an excessively large number of anchors hinders subsequent calculation, and reduces the dimension of the viewing data used in subsequent calculation. In addition, by constructing the node relationship graph and applying the LPA algorithm, similar viewing user groups can be accurately found, and group control accounts can then be identified among the similar viewing user groups in combination with the number of users and the aggregation of device information and/or network address information. This ensures the accuracy of group control account identification and, being unsupervised, reduces the reliance on labeled data.

On the basis of the above-mentioned embodiment, device information and/or network address information can also be added to the node relationship graph, and LPA is then directly used to mine group control devices in the node relationship graph. In this case, step 250 further includes: acquiring the device information and/or network address information of each user in the user group; and adding the device information and/or network address information to the node relationship graph as information nodes, and connecting the user nodes and the corresponding information nodes through edges.

In one embodiment, nodes representing device information and/or nodes representing network address information are added to the node relationship graph; each piece of device information corresponds to one node, and each piece of network address information corresponds to one node. The nodes of device information and network address information are collectively referred to as information nodes, and the description takes the case of adding both types of information nodes as an example. Exemplarily, if a user uses a certain piece of device information, the user node of the user and the information node of the device information are connected through an edge; in the same way, a connection relationship is established between the user node and the information node representing the network address information. At this time, the node relationship graph also reflects the devices and network addresses used by each user. It can be understood that when LPA is subsequently used to process the node relationship graph, not only the connected user nodes but also the connected information nodes are considered when determining the neighbor user nodes of a user node. For example, a higher weight is set for the edges corresponding to information nodes while the weight of the edges between user nodes is reduced, and when looking for neighbor user nodes, the user nodes corresponding to similar viewing users who share device information or network address information are taken as the found neighbor user nodes. The similar viewing user groups mined in this way exclude big anchor user groups. Therefore, when step 270 is executed, whether a similar viewing user group is the target user group can be determined directly from the number of users in it, without considering repeated usage.
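One way to realize the information nodes is sketched below, assuming the graph is stored as a weighted adjacency map with string user IDs and that information nodes are represented as (kind, value) tuples. The weight value and the names `add_information_nodes` and `neighbor_users_sharing_info` are hypothetical; the neighbor rule shown is only one reading of the description.

```python
def add_information_nodes(graph, device_of, ip_of, info_weight=2.0):
    """Return a copy of the graph with device and network address nodes attached."""
    extended = {node: dict(neighbors) for node, neighbors in graph.items()}
    for kind, mapping in (("device", device_of), ("ip", ip_of)):
        for user, value in mapping.items():
            info_node = (kind, value)  # one information node per device or address
            extended.setdefault(user, {})
            extended.setdefault(info_node, {})
            # Edges to information nodes get a higher weight than user-user edges.
            extended[user][info_node] = info_weight
            extended[info_node][user] = info_weight
    return extended


def neighbor_users_sharing_info(extended, user):
    """Similar viewing users that also share a device or network address with user."""
    shared = set()
    for node in extended[user]:
        if isinstance(node, tuple):  # information node
            shared.update(u for u in extended[node] if u != user)
    # Keep only users that are also directly connected as similar viewing users.
    return sorted(u for u in shared if u in extended[user])


# Hypothetical data: A is similar to B and C, but only A and B share a device and IP.
graph = {"A": {"B": 0.9, "C": 0.8}, "B": {"A": 0.9}, "C": {"A": 0.8}}
device_of = {"A": "d1", "B": "d1", "C": "d2"}
ip_of = {"A": "10.0.0.1", "B": "10.0.0.1", "C": "10.0.0.2"}

extended = add_information_nodes(graph, device_of, ip_of)
neighbors_of_a = neighbor_users_sharing_info(extended, "A")
```

Here user C, who merely has similar viewing behavior, is no longer treated as a neighbor of user A, which is how groups of ordinary fans of the same big anchors drop out of the mined communities.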

As mentioned above, adding device information and/or network address information to the node relationship graph increases the probability that a similar viewing user group mined by the LPA algorithm is a group control account, avoids mining big anchor user groups, and reduces the computational complexity of subsequent operations.

FIG. 5 is a schematic structural diagram of a group control account mining apparatus provided by an embodiment of the present application. Referring to FIG. 5, the group control account mining apparatus includes a data acquisition module 301, a user search module 302 and a group control determination module 303.

The data acquisition module 301 is configured to acquire the first viewing data of the user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the anchors watched by the corresponding user within the set time period. The user search module 302 is configured to find similar viewing users in the user group according to the first viewing data. The group control determination module 303 is configured to mine a similar viewing user group from the user group according to the similar viewing users, and to determine the target user group belonging to the group control account according to the similar viewing user group.

On the basis of the above embodiment, the apparatus further includes: a training module, configured to, before similar viewing users are found in the user group according to the first viewing data, use the vocabulary vectors corresponding to each piece of first viewing data as training data, so as to obtain through training the embedded word vector corresponding to each vocabulary vector, where each anchor identity data corresponds to one vocabulary vector, the length of the vocabulary vector is equal to the current total number of anchors, and the length of the embedded word vector is less than the length of the vocabulary vector; and a viewing data determination module, configured to obtain the corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data. Correspondingly, the user search module 302 is specifically configured to find similar viewing users in the user group according to the second viewing data.

On the basis of the above embodiment, the user search module 302 includes: a similarity calculation sub-module, configured to calculate the viewing similarity between users in the user group according to the first viewing data; and a similarity determination sub-module, configured to find similar viewing users in the user group according to the viewing similarity.

On the basis of the above-mentioned embodiment, the similarity calculation sub-module includes: a bucketing unit, configured to bucket each piece of first viewing data by using locality-sensitive hashing; and an in-bucket calculation unit, configured to calculate the viewing similarity between the pieces of first viewing data in each bucket.

On the basis of the above embodiment, the bucketing unit includes: a signature calculation subunit, configured to perform a min-hash calculation on each piece of first viewing data to obtain a corresponding signature vector; a mapping subunit, configured to divide each signature vector into multiple row bands and map each row band to a corresponding hash bucket by using a hash function, where there is at least one hash function; and a bucketing subunit, configured to classify the first viewing data corresponding to the row bands mapped to the same hash bucket into the same bucket.
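The three subunits can be illustrated with a minimal min-hash and banding sketch. It assumes each piece of first viewing data is a non-empty set of anchor IDs, uses seeded MD5 digests as a stand-in hash family, and uses the band contents directly as the bucket key instead of a separate hash function; all function names and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def _hash(value, seed):
    """Deterministic integer hash of a string under a given seed."""
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(anchors, num_hashes):
    """Signature vector: the minimum hash of the anchor set under each hash function."""
    return [min(_hash(a, seed) for a in anchors) for seed in range(num_hashes)]


def lsh_buckets(first_viewing_data, num_hashes=8, rows_per_band=2):
    """Bucket users whose signatures agree on at least one row band."""
    buckets = defaultdict(set)
    for user, anchors in first_viewing_data.items():
        signature = minhash_signature(anchors, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            # The band index is part of the key so different bands never collide.
            buckets[(start // rows_per_band, band)].add(user)
    return buckets


def candidate_pairs(buckets):
    """Pairs of users that share at least one bucket."""
    pairs = set()
    for users in buckets.values():
        ordered = sorted(users)
        pairs.update((u, v) for i, u in enumerate(ordered) for v in ordered[i + 1:])
    return pairs


# Hypothetical first viewing data: u1 and u2 watched exactly the same anchors.
first_viewing_data = {
    "u1": {"anchor_A", "anchor_B", "anchor_C"},
    "u2": {"anchor_A", "anchor_B", "anchor_C"},
    "u3": {"anchor_X", "anchor_Y"},
}

pairs = candidate_pairs(lsh_buckets(first_viewing_data))
```

Only users that land in a common bucket need their viewing similarity computed afterwards, which is what keeps the pairwise comparison tractable for a large user group.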

On the basis of the above-mentioned embodiment, the group control determination module 303 includes: a relationship graph construction sub-module, configured to take each user in the user group as a user node and connect the user nodes corresponding to the similar viewing users through edges, so as to obtain a node relationship graph; a label propagation sub-module, configured to process the node relationship graph by using the label propagation algorithm to determine the similar viewing user group; and a first determining sub-module, configured to determine the target user group belonging to the group control account according to the similar viewing user group.

On the basis of the above embodiment, the label propagation sub-module includes: a label assignment unit, configured to assign a corresponding label to each user node in the node relationship graph; a neighbor search unit, configured to search for a user node in the node relationship graph and find all neighbor user nodes of the user node according to the edge connection relationship of the user node; a label updating unit, configured to count the labels of all neighbor user nodes and update the label with the most occurrences as the label of the user node; a first traversal unit, configured to search for another user node in the node relationship graph and return to the operation of finding all neighbor user nodes of the user node according to the edge connection relationship of the user node, until all user nodes in the node relationship graph have been traversed; an end judgment unit, configured to judge whether the traversal end condition is currently satisfied, where the traversal end condition is that the threshold of the number of traversals is reached or that the label of every user node in the node relationship graph has not changed; a second traversal unit, configured to, if the traversal end condition is not satisfied, return to the operation of searching for a user node in the node relationship graph until the traversal end condition is satisfied; and a node dividing unit, configured to classify the users corresponding to the user nodes with the same label into the same similar viewing user group.

On the basis of the above-mentioned embodiment, the relationship graph construction sub-module is further configured to: acquire the device information and/or network address information of each user in the user group; and add the device information and/or network address information to the node relationship graph as information nodes, and connect the user nodes and the corresponding information nodes through edges.

On the basis of the above-mentioned embodiment, the group control determination module 303 includes: a first mining sub-module, configured to mine a similar viewing user group from the user group according to the similar viewing users; and a second determining sub-module, configured to determine, if the number of users in the similar viewing user group is greater than or equal to the number threshold, the similar viewing user group as the target user group belonging to the group control account.

On the basis of the above embodiment, the group control determination module 303 includes: a second mining sub-module, configured to mine a similar viewing user group from the user group according to the similar viewing users; and a third determining sub-module, configured to determine, if multiple users in the similar viewing user group have the same device information and/or network address information, the similar viewing user group as the target user group belonging to the group control account.

The group control account mining apparatus provided above can be used to execute the group control account mining method provided by any of the above embodiments, and has the corresponding functions and beneficial effects.

It is worth noting that, in the above embodiment of the group control account mining apparatus, the units and modules included are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from each other and are not used to limit the protection scope of the present application.

FIG. 6 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application. As shown in FIG. 6, the group control account mining device includes a processor 40, a memory 41, an input device 42 and an output device 43. The number of processors 40 in the group control account mining device can be one or more, and one processor 40 is taken as an example in FIG. 6. The processor 40, the memory 41, the input device 42 and the output device 43 in the group control account mining device may be connected through a bus or by other means, and connection through a bus is taken as an example in FIG. 6.

As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the group control account mining method in the embodiments of the present application (for example, the data acquisition module, the user search module and the group control determination module in the group control account mining apparatus). The processor 40 executes the various functional applications and data processing of the group control account mining device by running the software programs, instructions and modules stored in the memory 41, that is, it implements the above group control account mining method.

The memory 41 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required for at least one function, and the data storage area can store data created according to the use of the group control account mining device, and the like. In addition, the memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some instances, the memory 41 may further include memory located remotely from the processor 40, and such remote memory may be connected to the group control account mining device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The input device 42 may be configured to receive input numerical or character information, and to generate key signal input related to user settings and function control of the group control account mining device. The output device 43 may include a display device such as a display screen.

The above group control account mining device includes the group control account mining apparatus, can be used to execute any of the group control account mining methods, and has the corresponding functions and beneficial effects.

In addition, an embodiment of the present application also provides a storage medium containing computer-executable instructions. When executed by a computer processor, the computer-executable instructions are used to perform the related operations of the group control account mining method provided by any embodiment of the present application, and have the corresponding functions and beneficial effects.

The above are only the preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described in detail through the above embodiments, it is not limited to the above embodiments and may also include other equivalent embodiments without departing from the concept of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims

1. A group control account mining method, comprising:

acquiring first viewing data of a user group within a set time period, wherein each user in the user group corresponds to one piece of first viewing data, and each piece of the first viewing data includes anchor identity data of the anchors watched by the corresponding user within the set time period;

finding similar viewing users in the user group according to the first viewing data; and

mining a similar viewing user group from the user group according to the similar viewing users, and determining a target user group belonging to the group control account according to the similar viewing user group.
The group control account mining method according to claim 1, wherein before the finding similar viewing users in the user group according to the first viewing data, the method further comprises:

using the vocabulary vectors corresponding to each piece of first viewing data as training data, so as to obtain through training an embedded word vector corresponding to each vocabulary vector, wherein each piece of streamer identity data corresponds to one vocabulary vector, the length of the vocabulary vector is equal to the current total number of streamers, and the length of the embedded word vector is less than the length of the vocabulary vector; and

obtaining corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data;

and wherein the finding similar viewing users in the user group according to the first viewing data comprises:

finding similar viewing users in the user group according to the second viewing data.
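As an illustration only, the relationship between the vocabulary vector and the shorter embedded word vector can be sketched as follows. The claim does not specify the training procedure, so the randomly initialised `matrix` below is a hypothetical placeholder for trained weights, and the streamer IDs, `embedding_dim`, and the choice to represent second viewing data as a list of embedded vectors are all assumptions.

```python
import random

def one_hot(index, size):
    """Vocabulary vector: length equals the total number of streamers."""
    vec = [0] * size
    vec[index] = 1
    return vec

def embed(one_hot_vec, matrix):
    """Multiply a one-hot row vector by the embedding matrix (a row lookup)."""
    dim = len(matrix[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vec, matrix))
            for j in range(dim)]

# Hypothetical streamer vocabulary.
streamers = ["s1", "s2", "s3", "s4", "s5"]
index_of = {s: i for i, s in enumerate(streamers)}

# Placeholder for trained embedding weights (assumption: training is done elsewhere).
rng = random.Random(0)
embedding_dim = 2  # shorter than the vocabulary vector length of 5
matrix = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in streamers]

# First viewing data of one user: the streamers watched in the set period.
first_viewing = ["s2", "s5"]

# Second viewing data: the same record expressed with embedded word vectors.
second_viewing = [embed(one_hot(index_of[s], len(streamers)), matrix)
                  for s in first_viewing]
print(second_viewing)
```

Each embedded vector has length 2 while each vocabulary vector has length 5, which mirrors the length constraint stated in the claim.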
The group control account mining method according to claim 1, wherein the finding similar viewing users in the user group according to the first viewing data comprises:

calculating the viewing similarity between the users in the user group according to the first viewing data; and

finding similar viewing users in the user group according to the viewing similarity.
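A minimal sketch of this step, assuming each piece of first viewing data is a set of streamer IDs and taking Jaccard similarity as the similarity measure. The claim does not name a measure, so Jaccard, the `threshold` value, and the function names are illustrative assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets of streamer IDs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_similar_pairs(viewing, threshold=0.5):
    """Return user pairs whose viewing similarity meets the threshold."""
    pairs = []
    for u, v in combinations(sorted(viewing), 2):
        if jaccard(viewing[u], viewing[v]) >= threshold:
            pairs.append((u, v))
    return pairs

# Hypothetical first viewing data: user -> set of streamers watched.
viewing = {
    "u1": {"s1", "s2", "s3", "s4"},
    "u2": {"s1", "s2", "s3", "s4"},
    "u3": {"s1", "s2", "s3", "s5"},
    "u4": {"s9"},
}
print(find_similar_pairs(viewing))
```

Comparing every pair is quadratic in the number of users, which is why bucketing the data before comparison is attractive for large user groups.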
The group control account mining method according to claim 3, wherein the calculating the viewing similarity between the users in the user group according to the first viewing data comprises:

bucketing each piece of first viewing data by using locality-sensitive hashing; and

calculating the viewing similarity between the pieces of first viewing data within each bucket.
The group control account mining method according to claim 4, wherein the bucketing of each piece of first viewing data by using locality-sensitive hashing comprises:

performing a MinHash calculation on each piece of first viewing data to obtain a corresponding signature vector;

dividing each signature vector into multiple bands, and using at least one hash function to map each band into a corresponding hash bucket; and

classifying the pieces of first viewing data whose bands are mapped to the same hash bucket into the same bucket.
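The bucketing steps above can be sketched roughly as follows. This is an illustrative sketch, not the applicant's implementation: the seeded BLAKE2b hash, `num_hashes`, `rows_per_band`, and the use of a Python dictionary key in place of an explicit band-to-bucket hash function are all assumptions. Each signature is split into fixed-size slices (bands), and users who share at least one identical band end up in the same bucket.

```python
import hashlib
from collections import defaultdict

def _h(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, num_hashes=20):
    """Signature vector: the minimum hash of the set under each seeded hash."""
    return [min(_h(seed, x) for x in items) for seed in range(num_hashes)]

def lsh_buckets(viewing, num_hashes=20, rows_per_band=4):
    """Group users whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for user, items in viewing.items():
        if not items:
            continue  # an empty record has no signature
        sig = minhash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(sig[start:start + rows_per_band])
            # The (band index, band contents) key plays the role of the hash bucket.
            buckets[(start // rows_per_band, band)].add(user)
    # Only buckets holding more than one user yield candidate pairs.
    return [users for users in buckets.values() if len(users) > 1]

# Hypothetical first viewing data: user -> set of streamers watched.
viewing = {
    "u1": {"s1", "s2", "s3", "s4"},
    "u2": {"s1", "s2", "s3", "s4"},
    "u4": {"s9"},
}
print(lsh_buckets(viewing))
```

Fewer rows per band makes it more likely that moderately similar users collide, at the cost of more candidate pairs to verify inside each bucket.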
The group control account mining method according to any one of claims 1-5, wherein the mining of a similar viewing user group from the user group according to the similar viewing users comprises:

taking each user in the user group as a user node, and connecting the user nodes corresponding to the similar viewing users through edges to obtain a node relationship graph; and

processing the node relationship graph by using a label propagation algorithm to determine similar viewing user groups.
The group control account mining method according to claim 6, wherein the processing of the node relationship graph by using a label propagation algorithm to determine similar viewing user groups comprises:

assigning a corresponding label to each user node in the node relationship graph;

selecting a user node in the node relationship graph, and finding all neighbor user nodes of the user node according to the edge connection relationship of the user node;

counting the labels of all the neighbor user nodes, and updating the label of the user node to the label with the most occurrences;

selecting another user node in the node relationship graph, and returning to the operation of finding all neighbor user nodes of the user node according to the edge connection relationship of the user node, until all user nodes in the node relationship graph have been traversed;

judging whether a traversal end condition is currently satisfied, the traversal end condition being that a threshold on the number of traversals is reached or that the label of no user node in the node relationship graph has changed;

if the traversal end condition is not satisfied, returning to the operation of selecting a user node in the node relationship graph, until the traversal end condition is satisfied; and

classifying users corresponding to user nodes with the same label into the same similar viewing user group.
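The traversal described in this claim can be sketched as below. The sketch assumes the graph is given as a list of edges between similar viewing users, so isolated users are omitted, and it makes two choices the claim leaves open: nodes are visited in sorted order, and a tie between labels is resolved by keeping the current label if it is among the most frequent, otherwise taking the smallest. `max_rounds` stands in for the traversal-count threshold.

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_rounds=20):
    """Group nodes by repeatedly adopting the most common neighbor label."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Step 1: every user node starts with its own unique label.
    labels = {node: node for node in neighbors}

    for _ in range(max_rounds):          # stop at the traversal threshold
        changed = False
        for node in sorted(neighbors):   # one full traversal of all nodes
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            tied = [label for label, c in counts.items() if c == best]
            # Tie-break (assumption): keep the current label if possible.
            new_label = labels[node] if labels[node] in tied else min(tied)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:                  # stop when no label changed
            break

    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return list(groups.values())

# Two hypothetical clusters of similar viewing users.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z")]
print(label_propagation(edges))
```

Labels are updated in place during a traversal, so a node sees the labels its neighbors have already adopted earlier in the same pass.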
The group control account mining method according to claim 6, wherein the taking each user in the user group as a user node, and connecting the user nodes corresponding to the similar viewing users through edges to obtain a node relationship graph, further comprises:

acquiring device information and/or network address information of each user in the user group; and

adding the device information and/or the network address information to the node relationship graph as information nodes, and connecting each user node to its corresponding information nodes through edges.
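One way to picture the resulting graph is an adjacency map whose nodes are tagged by kind, so that user nodes and information nodes cannot collide. The `("user", ...)`, `("device", ...)` and `("ip", ...)` tags, the `build_graph` name, and its parameters are hypothetical choices made for this sketch.

```python
from collections import defaultdict

def build_graph(users, similar_pairs, device_of=None, ip_of=None):
    """Build an undirected graph of user nodes and information nodes."""
    graph = defaultdict(set)

    def connect(a, b):
        graph[a].add(b)
        graph[b].add(a)

    # Every user in the user group is a node, even without edges.
    for user in users:
        graph[("user", user)]

    # Edges between similar viewing users.
    for u, v in similar_pairs:
        connect(("user", u), ("user", v))

    # Information nodes: one per distinct device or network address.
    for kind, mapping in (("device", device_of or {}), ("ip", ip_of or {})):
        for user, value in mapping.items():
            connect(("user", user), (kind, value))

    return graph

graph = build_graph(
    users=["u1", "u2", "u3"],
    similar_pairs=[("u1", "u2")],
    device_of={"u1": "d1", "u3": "d1"},
    ip_of={"u2": "10.0.0.1"},
)
print(sorted(graph[("device", "d1")]))
```

Here `u3` has no similar-viewing edge, but it becomes reachable from `u1` through the shared device node `d1`.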
The group control account mining method according to claim 1, 6 or 8, wherein the determining the target user group belonging to the group control account according to the similar viewing user group comprises:

if the number of users in the similar viewing user group is greater than or equal to a number threshold, determining the similar viewing user group as a target user group belonging to group control accounts.
The group control account mining method according to claim 1 or 6, wherein the determining the target user group belonging to the group control account according to the similar viewing user group comprises:

if multiple users in the similar viewing user group have the same device information and/or network address information, determining the similar viewing user group as a target user group belonging to group control accounts.
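The two determination criteria in the preceding two claims, group size and shared device or network address information, can be sketched as simple predicates. The function names, the `min_size` default, and the reading of "multiple users" as "at least two" are assumptions made for illustration.

```python
from collections import Counter

def is_large_group(group, min_size=3):
    """Size criterion: the group has at least min_size users."""
    return len(group) >= min_size

def shares_info(group, info_of):
    """Shared-information criterion: at least two users in the group
    report the same device or network address value."""
    counts = Counter(info_of[u] for u in group if u in info_of)
    return any(c >= 2 for c in counts.values())

# Hypothetical similar viewing user groups and device information.
groups = [{"u1", "u2", "u3"}, {"u4", "u5"}, {"u6", "u7"}]
device_of = {"u4": "d9", "u5": "d9", "u6": "d1", "u7": "d2"}

targets = [g for g in groups
           if is_large_group(g) or shares_info(g, device_of)]
print(targets)
```

In this example the first group qualifies by size, the second by a shared device, and the third meets neither criterion.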
A group control account mining apparatus, comprising:

a data acquisition module, configured to acquire first viewing data of a user group within a set time period, wherein each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes identity data of the streamers viewed by the corresponding user within the set time period;

a user search module, configured to find similar viewing users in the user group according to the first viewing data; and

a group control determination module, configured to mine a similar viewing user group from the user group according to the similar viewing users, and to determine, according to the similar viewing user group, a target user group belonging to group control accounts.
A group control account mining device, comprising: a memory and one or more processors;

the memory being configured to store one or more programs;

wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the group control account mining method according to any one of claims 1-10.
A computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the group control account mining method according to any one of claims 1-10 is implemented.