WO2022156720A1 - Method and apparatus for group control account excavation, device, and storage medium - Google Patents

Method and apparatus for group control account excavation, device, and storage medium

Info

Publication number
WO2022156720A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
viewing
group
user group
node
Prior art date
Application number
PCT/CN2022/072806
Other languages
French (fr)
Chinese (zh)
Inventor
曹轲
钟清华
Original Assignee
百果园技术(新加坡)有限公司
曹轲
Priority date
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 曹轲
Publication of WO2022156720A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/254 Management at additional data server, e.g. shopping server, rights management server
    • H04N21/2541 Rights Management

Definitions

  • The embodiments of the present application relate to the technical field of live webcasting, and in particular to a group control account mining method, apparatus, device, and storage medium.
  • “Popularity” is a specific term in the live broadcast industry, which can comprehensively reflect the popularity of the anchor and the quality of the live content. Popularity can be calculated by the number of viewers, viewing length, broadcast length, followings, interaction, barrage, gift rewards and other dimensions. Among them, the number of viewers is an important dimension to measure popularity, and the ranking of the anchors when recommending anchors can be determined by the number of viewers. In addition, many live broadcast platforms settle the anchor's salary based on the number of viewers.
  • group control accounts ie group control accounts
  • The following methods can be used to detect group control accounts: 1. Device environment aggregation detection, which determines whether group control accounts exist from the mobile phone number used when a user registers and the IP address used when watching a host; the case in which the mobile phone numbers of group control accounts share an IP address is especially prominent. 2. Room feature anomaly detection: when group control accounts are used to increase popularity, the host's room shows abnormalities in the distribution of data features such as gift rewards, number of viewers, and number of barrage messages. For example, under normal circumstances, when the number of viewers in the host's room reaches a threshold, gift rewards fall within a certain distribution range, but when the number of viewers reaches the threshold through group control accounts, gift rewards are clearly below the normal distribution range. Hosts with abnormal feature distributions can therefore be found through room feature anomaly detection.
  • Although the above methods can detect group control accounts, their security is low and they are easy to crack. For example, using a dynamic IP pool prevents mobile phone numbers from sharing the same IP address, and using distributed cloud group control access or switching gift-giving accounts can avoid abnormal feature distributions.
  • the embodiments of the present application provide a group control account mining method, apparatus, device, and storage medium to solve the technical problems of low security and easy cracking in the group control account mining process.
  • In a first aspect, an embodiment of the present application provides a method for mining a group control account, including:
  • acquiring first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the streamers watched by the corresponding user within the set time period;
  • finding similar viewing users in the user group according to the first viewing data;
  • mining a similar viewing user group from the user group according to the similar viewing users, and determining a target user group belonging to the group control account according to the similar viewing user group.
  • the embodiment of the present application provides a group control account mining device, including:
  • a data acquisition module configured to acquire the first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the streamers watched by the corresponding user within the set time period;
  • a user search module configured to search for similar viewing users in the user group according to the first viewing data;
  • the group control determination module is configured to dig out a similar viewing user group from the user group according to the similar viewing users, and determine a target user group belonging to the group control account according to the similar viewing user group.
  • an embodiment of the present application provides a group control account mining device, including: a memory and one or more processors;
  • the memory is configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the group control account mining method according to the first aspect.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the group control account mining method described in the first aspect.
  • The above group control account mining method, apparatus, device, and storage medium acquire the first viewing data of the user group within a set time period, find similar viewing users in the user group according to the first viewing data, mine similar viewing user groups according to the similar viewing users, and determine group control accounts according to the similar viewing user groups. These technical means solve the technical problems of low security and easy cracking in the group control account mining process. Even if the group control accounts use a dynamic IP pool or distributed cloud group control access, similar viewing users can still be filtered out effectively based on the anchors each user watches, so that the group control accounts in the user group are mined accurately. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of the anchor's popularity.
  • FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a hash bucket provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of another group control account mining method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a neural network provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application.
  • A group control account refers to using multiple real devices (such as multiple mobile phones), or simulating multiple real devices, and installing script software (i.e., group control software) on the devices to control the application software on them (such as live broadcast application software), modifying the software and hardware information of the devices so as to simulate manual use of the application software.
  • Through automated means, group control accounts can simulate the operation requests of real users to the greatest extent, in order to achieve cheating goals such as attracting fans, diverting traffic, and spamming advertisements in live broadcast rooms.
  • Increasing an anchor's popularity by having group control accounts imitate normal accounts entering the host's room is called room brushing.
  • the embodiment of the present application provides a group control account mining method, so as to mine the group control account safely and accurately.
  • The group control account mining method may be executed by a group control account mining device, which may be implemented in software and/or hardware. The group control account mining device may consist of two or more physical entities, or of a single physical entity.
  • the group control account mining device may be a computer, tablet computer or other smart device configured with data computing and analysis capabilities.
  • FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application.
  • the group control account mining method may include:
  • Step 110 Acquire the first viewing data of the user group within the set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the hosts watched by the corresponding user within the set time period.
  • the user group refers to a set of users who use the live broadcast application software to watch the live broadcast.
  • Viewing data refers to the data that reflects the viewing situation during the user's viewing of the live broadcast.
  • the viewing data includes at least host identity data viewed by the user.
  • the host identity data is used to indicate the host identity.
  • Different hosts have different host identity data.
  • the host refers to the user who has registered on the live broadcast application platform and can perform live broadcast. It is understandable that the viewing data may also include content such as the viewing duration of each anchor watched by the user.
  • each user in the user group also has corresponding user identity data, and different users have different user identity data.
  • When a user enters a host's room, the group control account mining device records the user identity data, the host identity data, and the viewing duration, and generates a piece of viewing data. When the user enters another live broadcast room, the group control account mining device again records the user identity data, host identity data, and viewing duration, and generates another piece of viewing data.
  • the first viewing data refers to a collection of viewing data of a corresponding user within a set time period, which may optionally include anchor identity data viewed by the user within the set time period. Each user corresponds to one piece of first viewing data.
  • the set time period can be set according to the actual situation, for example, the set time period is 24 hours, 12 hours or 48 hours.
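  • A minimal sketch of how per-visit viewing records could be collected into one piece of first viewing data per user for a set time period. The `ViewRecord` fields, the function name `first_viewing_data`, and the sample values are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from typing import NamedTuple


class ViewRecord(NamedTuple):
    user_id: str    # hypothetical user identity data
    anchor_id: str  # hypothetical anchor (host) identity data
    start_ts: int   # time the user entered the room, in seconds
    seconds: int    # viewing duration in seconds


def first_viewing_data(records, window_start, window_end):
    """Collect, per user, the anchors watched in [window_start, window_end)."""
    per_user = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.start_ts):
        if window_start <= rec.start_ts < window_end:
            per_user[rec.user_id].append(rec.anchor_id)
    return dict(per_user)


records = [
    ViewRecord("u1", "anchorA", 100, 60),
    ViewRecord("u1", "anchorB", 200, 30),
    ViewRecord("u2", "anchorA", 150, 61),
    ViewRecord("u2", "anchorC", 90_000, 10),  # falls outside a 24-hour window
]
data = first_viewing_data(records, 0, 24 * 3600)
print(data)
```

  • Here the set time period is assumed to be 24 hours counted from timestamp 0, so the last record is excluded.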
  • In the embodiment, the host identity data is encoded so that it is represented in the form of a vector.
  • The encoding rule is not limited in the embodiment.
  • For example, one-hot (One-Hot) encoding is used to vectorize the identity data of each anchor, after which the first viewing data is composed of the vectors corresponding to the identity data of each anchor.
  • Step 120 Find out similar viewing users in the user group according to the first viewing data.
  • Group control accounts operate in batches: they can enter and leave the rooms of one or several hosts at the same time. As a result, the users in the same set of group control accounts watch the same anchor or the same batch of anchors, and their viewing durations for the same live broadcast are also nearly identical.
  • The viewing durations of normal users vary greatly and conform to a normal distribution.
  • Normal users show both preference (that is, there are one or more fixed anchors they like to watch) and randomness (that is, they pick streamers to watch at random). Therefore, the probability that any two normal users have the same first viewing data is relatively small, while the probability that group control accounts have the same first viewing data is relatively large.
  • similar viewing users are found in the user group through the first viewing data.
  • Similar viewing users are two users who watch the same or highly similar anchors. It can be understood that each user can belong to several pairs of similar viewing users; for example, user A and user B can form one pair of similar viewing users, and user A and user C can form another.
  • similar viewing users are selected by calculating the similarity between the first viewing data.
  • the method for calculating the similarity is not limited in the embodiment, for example, a method such as cosine similarity, Euclidean distance, etc. is used.
  • users with high similarity are determined as similar viewing users.
  • A threshold is set that represents the maximum cosine distance between similar viewing users. It can be understood that the smaller the cosine distance, the higher the similarity.
  • After the cosine distance between two pieces of first viewing data is calculated, if the cosine distance is less than the threshold, the two corresponding users are determined to be similar viewing users. After the similarity has been calculated pairwise for all first viewing data in this way, all similar viewing users are obtained.
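  • The pairwise cosine-distance check described above can be sketched as follows. This is an illustration under assumptions: each user's first viewing data is flattened into a single multi-hot vector (1 where the anchor was watched), and the names `cosine_distance`, `similar_pairs`, and the threshold value 0.1 are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty viewing vector matches nobody
    return 1.0 - dot / (norm_u * norm_v)


def similar_pairs(vectors, max_distance):
    """Compare all users pairwise and keep pairs below the distance threshold."""
    pairs = []
    for (user_a, vec_a), (user_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if cosine_distance(vec_a, vec_b) < max_distance:
            pairs.append((user_a, user_b))
    return pairs


vectors = {
    "A": [1, 1, 0, 0],  # watched anchors 1 and 2
    "B": [1, 1, 0, 0],  # watched the same anchors as A
    "C": [0, 0, 1, 1],  # watched anchors 3 and 4
}
pairs = similar_pairs(vectors, max_distance=0.1)
print(pairs)
```

  • The pairwise loop is quadratic in the number of users, which is the cost that the bucketing described later is meant to reduce.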
  • Step 130 Dig out a similar viewing user group from the user group according to the similar viewing users, and determine a target user group belonging to the group control account according to the similar viewing user group.
  • A similar viewing user group is a group in which the first viewing data of the users is highly similar, that is, the anchors watched by the users in the group are the same or highly similar.
  • the group of similar viewing users may be determined by similar viewing users.
  • A node graph is drawn in which each node represents a user and the two nodes of a pair of similar viewing users are connected by an edge. Because group control accounts perform the same operations in batches, they form a dense community containing a large number of users in the node graph, whereas normal users are scattered or form communities with only a small number of users. Therefore, communities with dense user connections can be found through the node graph (for example, nodes that have a connection relationship are formed into a community), and each community found can be regarded as a similar viewing user group.
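  • As a rough sketch of the example in parentheses above (nodes that have a connection relationship form a community), the following groups users by connected component. Treating a community as a connected component is the simplest reading of that example; the function name `communities` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def communities(similar_pairs):
    """Group users into connected components of the similar-viewing graph."""
    adjacency = defaultdict(set)
    for a, b in similar_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


pairs = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u8", "u9")]
groups = communities(pairs)
print(groups)
```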
  • the target user group is determined according to the similar viewing user group, where the target user group refers to the user group corresponding to the group control account.
  • the found similar viewing user group is determined as the target user group, or, the similar viewing user group whose number of users is higher than the number threshold (this value can be set in combination with the actual situation) is determined as the target user group.
  • the target user group is determined in combination with the network addresses (eg IP addresses) of the users in the similar viewing user group, the information of the equipment used and/or the viewing duration of each anchor.
  • the device information is used to distinguish the device used by the user. The network address and device information can be obtained when the user uses the live broadcast application software.
  • For example, the viewing duration of each user in the similar viewing user group for the same anchor is counted. If the viewing durations for the same anchor are the same or close (for example, the difference in viewing duration is less than a set duration), the similar viewing user group is determined to be the target user group. As another example, group control accounts share network addresses and/or device information, that is, different users in the target user group have the same network address and/or device information. Therefore, whether a similar viewing user group is the target user group may be determined by checking whether the users in it share network addresses and/or device information.
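  • One way the checks above might be combined is sketched below. How the criteria are combined (size threshold first, then either close viewing durations or a shared network address) is an assumption for illustration, as are the function name `is_target_group`, the parameters `min_users` and `max_spread`, and all sample data.

```python
def is_target_group(group, durations, ip_of, min_users=3, max_spread=5):
    """Decide whether a similar viewing user group looks like group control.

    durations: {user: {anchor: seconds watched}}
    ip_of:     {user: network address}
    """
    if len(group) < min_users:
        return False
    # Anchors watched by every user in the group.
    shared = set.intersection(*(set(durations[u]) for u in group))
    # Viewing durations count as close when they differ by less than max_spread.
    close = bool(shared) and all(
        max(durations[u][a] for u in group) - min(durations[u][a] for u in group)
        < max_spread
        for a in shared
    )
    # At least two users in the group use the same network address.
    shares_ip = len({ip_of[u] for u in group}) < len(group)
    return close or shares_ip


durations = {
    "u1": {"A": 600, "B": 300}, "u2": {"A": 602, "B": 301}, "u3": {"A": 601, "B": 299},
    "u4": {"A": 40, "C": 900}, "u5": {"A": 1500, "C": 20}, "u6": {"A": 700, "C": 300},
}
ip_of = {u: f"10.0.0.{i}" for i, u in enumerate(sorted(durations), start=1)}

print(is_target_group({"u1", "u2", "u3"}, durations, ip_of))
print(is_target_group({"u4", "u5", "u6"}, durations, ip_of))
```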
  • In summary, by acquiring the first viewing data of the user group within a set time period, finding similar viewing users in the user group according to the first viewing data, mining similar viewing user groups according to the similar viewing users, and determining group control accounts according to the similar viewing user groups, the technical problems of low security and easy cracking in the group control account mining process are solved. Even if the group control accounts use a dynamic IP pool or distributed cloud group control access, similar viewing users can still be filtered out effectively based on the anchors each user watches, so that the group control accounts in the user group are mined accurately. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of the anchor's popularity.
  • step 120 includes steps 121-122:
  • Step 121 Calculate the viewing similarity among the users in the user group according to the first viewing data.
  • the viewing similarity is used to reflect the similarity between the first viewing data.
  • the viewing similarity can be calculated by cosine similarity, Euclidean distance, or the like.
  • When the number of users is large (e.g., greater than 10^6), calculating the viewing similarity requires a large amount of computation. Therefore, in the embodiment, the users are first roughly divided into buckets so that potentially similar users fall into the same bucket with high probability, and the viewing similarity is then calculated only between users within each bucket, which reduces the amount of computation.
  • step 121 includes steps 1211-1212:
  • Step 1211 Use the locality-sensitive hash to bucket each first viewing data.
  • The first viewing data is bucketed using Locality Sensitive Hashing (LSH), so that pieces of first viewing data that may be similar are placed in the same bucket.
  • When LSH is used for bucketing, step 1211 includes steps 12111-12113:
  • Step 12111 Perform a minimum hash calculation on each first viewing data to obtain a corresponding signature vector.
  • the minimum hash is a commonly used technical means in the LSH calculation process, which is used to calculate and obtain the signature vector (or matrix).
  • the minimum hash is used to calculate the first viewing data to obtain the signature vector (or matrix).
  • each first viewing data corresponds to a signature vector, and the space occupied by the signature vector is smaller than the space occupied by the first viewing data.
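  • A minimal sketch of a minimum-hash (MinHash) signature, assuming each piece of first viewing data is treated as the set of anchor identifiers a user watched. The hash family `(a * x + b) % PRIME`, the use of CRC32 to turn identifiers into integers, and all names and sample data are assumptions for illustration, not details from the application.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(anchor_ids, params):
    """Return one minimum per hash function over the set of anchor ids."""
    if not anchor_ids:
        raise ValueError("a user must have watched at least one anchor")
    # Map each anchor id to a stable integer below PRIME.
    values = [zlib.crc32(anchor.encode("utf-8")) % PRIME for anchor in set(anchor_ids)]
    return [min((a * x + b) % PRIME for x in values) for a, b in params]


def estimated_jaccard(sig_1, sig_2):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_1, sig_2)) / len(sig_1)


params = make_hash_params(16)
sig_a = minhash_signature(["anchor1", "anchor2", "anchor3"], params)
sig_b = minhash_signature(["anchor3", "anchor1", "anchor2"], params)  # same set
sig_c = minhash_signature(["anchor7", "anchor8", "anchor9"], params)
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

  • The signature length (16 here) is fixed regardless of how many anchors exist, which is why the signature takes less space than a full one-hot representation.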
  • Step 12112 Divide each signature vector into multiple bands, and map each band into a corresponding hash bucket by using at least one hash function.
  • Each signature vector is divided into multiple segments, and the content of each segment is regarded as a band. The number of bands (i.e., the number of segments) can be set according to the actual situation, and every signature vector has the same number of bands.
  • Each band is mapped into a corresponding hash bucket by a hash function. The hash function can be selected according to the actual situation, and one or more hash functions can be used.
  • Each hash function maps the band once.
  • Step 12113 Put the first viewing data corresponding to the bands mapped to the same hash bucket into the same bucket.
  • Bands being the same means that they are mapped to the same hash bucket. Accordingly, in the embodiment, the bands mapped to the same hash bucket are obtained, the first viewing data corresponding to each of these bands is looked up, and all the first viewing data found is treated as data in the same bucket. The users corresponding to the first viewing data in the same bucket can then be regarded as candidate similar viewing users.
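  • The banding step can be sketched as follows, assuming signature vectors are already available. Here Python's dictionary hashing of the `(band index, band content)` key stands in for the hash function that maps a band to a hash bucket; the function name `lsh_buckets` and the sample signatures are hypothetical.

```python
from collections import defaultdict


def lsh_buckets(signatures, num_bands):
    """Split each signature into bands and bucket users whose bands collide."""
    buckets = defaultdict(set)
    for user, signature in signatures.items():
        if len(signature) % num_bands != 0:
            raise ValueError("signature length must be divisible by num_bands")
        rows = len(signature) // num_bands
        for band in range(num_bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            # The dict hashes the key, so equal bands land in the same bucket.
            buckets[(band, chunk)].add(user)
    # Only buckets holding more than one user yield candidate similar users.
    return {key: users for key, users in buckets.items() if len(users) > 1}


signatures = {
    "u1": [3, 7, 1, 9],
    "u2": [3, 7, 4, 4],  # shares its first band with u1
    "u3": [5, 5, 2, 2],
}
candidates = lsh_buckets(signatures, num_bands=2)
print(candidates)
```

  • Only users that share at least one bucket need to be compared afterwards, so u3 is never compared with u1 or u2 in this example.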
  • FIG. 2 is a schematic diagram of a hash bucket provided by an embodiment of the present application.
  • Figure 2 contains three hash buckets, denoted band1, band2, and band3. Note that only some of the bands mapped to band1 are shown in Figure 2 (represented as 10002, 32122, and 01311).
  • The first viewing data corresponding to the bands in band1 is taken as the data in one bucket, the first viewing data corresponding to the bands in band2 is taken as the data in another bucket, and the first viewing data corresponding to the bands in band3 is taken as the data in a third bucket, which completes the bucketing of the first viewing data.
  • Step 1212 Calculate the viewing similarity between the first viewing data in each bucket.
  • the viewing similarity between the first viewing data in each bucket is calculated, and the viewing similarity does not need to be calculated for the first viewing data between the buckets.
  • The method of calculating the viewing similarity is not limited in the embodiment.
  • Step 122 Find similar viewing users in the user group according to the viewing similarity.
  • similar viewing users are found by comparing thresholds. For example, when the Euclidean distance is used to calculate the viewing similarity, the smaller the distance between the two first viewing data, the higher the viewing similarity. Therefore, a distance threshold may be set according to the actual situation, and when the distance is less than the distance threshold, it is determined that the two users are similar viewing users.
  • Similar viewing users can be found accurately by calculating the viewing similarity, and the locality-sensitive hashing algorithm avoids the heavy computation of viewing similarity when the number of users is large, reducing the computational complexity of finding similar viewing users.
  • FIG. 3 is a flowchart of another group control account mining method according to an embodiment of the present application.
  • the group control account mining method is detailed on the basis of the above embodiment.
  • In this embodiment, each piece of anchor identity data corresponds to one vocabulary vector, and the length of the vocabulary vector is equal to the current total number of anchors.
  • The vocabulary vector is the vector obtained by One-Hot encoding the anchor identity data.
  • The dimension of the vocabulary vector may be represented by its length, and the length of the vocabulary vector is equal to the current total number of anchors, where the current total number of anchors may be the total number of anchors currently registered in the live broadcast application software, or the total number of anchors watched by the users in the user group.
  • For example, if the current total number of anchors is 4, each anchor's identity data is represented by a 4-dimensional vocabulary vector, and the 4 vocabulary vectors are [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1].
  • the group control account mining method may include:
  • Step 210 Acquire the first viewing data of the user group within the set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the hosts watched by the corresponding user within the set time period.
  • The first viewing data is represented by vocabulary vectors. For example, if the vocabulary vectors of the anchor identity data included in the first viewing data are [1 0 0 0], [0 1 0 0], and [0 0 1 0], then the first viewing data is a 3×4 matrix consisting of these vocabulary vectors.
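  • A small sketch of the One-Hot encoding and the resulting 3×4 first viewing data matrix from the example above. The anchor names and the helper `one_hot` are hypothetical.

```python
def one_hot(index, size):
    """Return a vector of length `size` with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Assumption: four anchors in total, so every vocabulary vector has length 4.
anchors = ["anchor1", "anchor2", "anchor3", "anchor4"]
index_of = {anchor: i for i, anchor in enumerate(anchors)}

# A user who watched three of the four anchors within the set time period.
watched = ["anchor1", "anchor2", "anchor3"]
first_viewing = [one_hot(index_of[a], len(anchors)) for a in watched]

for row in first_viewing:
    print(row)
```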
  • Step 220 Use the vocabulary vector corresponding to each first viewing data as training data to obtain the embedded word vector corresponding to each vocabulary vector by training, and the length of the embedded word vector is less than the length of the vocabulary vector.
  • When the total number of anchors is large, the corresponding vocabulary vector is very long, and the dimension of the first viewing data is also very large, which is not conducive to subsequent computation on the first viewing data. Therefore, in the embodiment, dimensionality reduction is performed on the vocabulary vectors according to the first viewing data, and the vector obtained after dimensionality reduction is recorded as the embedded word vector. Each vocabulary vector corresponds to one embedded word vector, and different vocabulary vectors may correspond to the same embedded word vector.
  • the length (ie dimension) of the embedded word vector can be set according to the actual situation, such as setting the length to 50.
  • Word2Vec is used to obtain the embedded word vector.
  • Word2Vec is a natural language processing (NLP) tool, a group of related models used to generate word vectors.
  • The model is a shallow, two-layer neural network. After the neural network is trained, the Word2Vec model can map each word to a vector that represents the relationships between words; this vector is located in the hidden layer of the neural network.
  • In the embodiment, the vector representing the relationships between vocabulary vectors is recorded as the embedded word vector. That is, Word2Vec converts each word (that is, each vocabulary vector) into an embedded word vector, so that the relationships between vocabulary vectors can be measured quantitatively through the embedded word vectors.
  • FIG. 4 is a schematic diagram of a neural network provided in this embodiment of the application.
  • The neural network is the one used by Word2Vec, and it is a Skip-gram model.
  • The Skip-gram model takes a word as input and predicts its context words as output.
  • The input layer takes a V-dimensional vocabulary vector (i.e., $[x_1\ x_2\ \ldots\ x_V]$), the output layer outputs another V-dimensional vocabulary vector (i.e., $[y_1\ y_2\ \ldots\ y_V]$), and the weights from the input layer to the hidden layer form the embedded word vector corresponding to the input vocabulary vector, which can represent the input.
  • The training process of the neural network is as follows: an input word is selected in a sentence, and a skip_window parameter and a num_skips parameter are defined.
  • The skip_window parameter is the number of words selected from each side (left or right) of the current input word in the sentence when training the neural network; it determines the word window from which the output words of the neural network are taken. The num_skips parameter is the number of different output words selected from the word window.
  • For example, if the sentence is "there is an apple on the table", skip_window and num_skips are both 2, and the input word is apple, then the corresponding word window is [is an apple on the].
  • After associating the input word with its context, the neural network obtains two correspondences: apple with an, and apple with on. Here an and on are the two different output words, and (apple, an) and (apple, on) can be used as two sets of training data for the sentence; that is, apple is the input and an or on is the output.
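  • The generation of (input word, output word) training pairs from skip_window and num_skips can be sketched as follows. As an assumption made to keep the example deterministic, the sketch picks the num_skips context words nearest to the input word (left before right at equal distance); implementations commonly sample them at random from the window instead. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(sentence, skip_window, num_skips):
    """Build (input word, output word) pairs for every word in the sentence."""
    pairs = []
    for i, word in enumerate(sentence):
        lo = max(0, i - skip_window)
        hi = min(len(sentence), i + skip_window + 1)
        context = [j for j in range(lo, hi) if j != i]
        # Assumption: prefer the nearest context words, left before right.
        context.sort(key=lambda j: (abs(j - i), j))
        for j in context[:num_skips]:
            pairs.append((word, sentence[j]))
    return pairs


sentence = "there is an apple on the table".split()
pairs = skipgram_pairs(sentence, skip_window=2, num_skips=2)
apple_pairs = [pair for pair in pairs if pair[0] == "apple"]
print(apple_pairs)  # [('apple', 'an'), ('apple', 'on')]
```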
  • The vocabulary vector corresponding to the input word is selected from the training data and input to the neural network, and a probability distribution is obtained for each input word according to the output words; the distribution represents the probability that each input word yields the same output word.
  • For example, for input words such as "China" and "UK", the output words obtained from their contexts will contain words such as "the capital is"; therefore, the probability of related words such as "China" and "UK" should be higher than that of other words, and the embedded word vectors corresponding to "China" and "UK" are the same or similar.
  • The matrices $W_{V \times N}$ and $W'_{N \times V}$ in Fig. 4 are updated by means of gradient descent and backpropagation to realize training. After the training is completed, the embedded word vector of each input word is obtained through the matrix $W_{V \times N}$.
  • Applied to the first viewing data, the above training method can be as follows: each piece of first viewing data is treated as a sentence in which the vocabulary vector of each piece of anchor identity data is a word; input words and output words are then selected to train the neural network, and after the training is completed, the embedded word vector of each input word is obtained through the matrix $W_{V \times N}$. Understandably, input words with the same output words have the same or similar embedded word vectors. For example, if some first viewing data contains the host identity data of host A and host B, and other first viewing data contains the host identity data of host C and host B, then when the vocabulary vector corresponding to host A or host C is input, the probability of outputting the vocabulary vector corresponding to host B is relatively large. Therefore, the embedded word vectors corresponding to host A and host C are similar or the same. It should be noted that the process of using Word2Vec to obtain the embedded word vectors can be regarded as an Embedding process.
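  • To illustrate why host A and host C end up with similar embedded word vectors, the sketch below counts, for each input anchor, how often each other anchor appears as an output word when every piece of first viewing data is treated as a sentence. This is only an empirical co-occurrence count, not a trained Word2Vec network; the function name `output_distribution` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def output_distribution(sentences, skip_window=1):
    """Estimate P(output anchor | input anchor) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - skip_window)
            hi = min(len(sentence), i + skip_window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    return {
        word: {out: n / sum(counter.values()) for out, n in counter.items()}
        for word, counter in counts.items()
    }


# Two pieces of first viewing data, each treated as a sentence of anchors.
sentences = [["A", "B"], ["C", "B"]]
dist = output_distribution(sentences)
print(dist["A"], dist["C"], dist["B"])
```

  • Anchors A and C have identical output distributions (both always co-occur with B), which is the situation in which a Skip-gram model would be expected to learn similar embedded word vectors for them.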
  • Step 230 Obtain corresponding second viewing data according to the embedded word vector corresponding to the first viewing data.
  • each embedded word vector is processed to obtain the second viewing data.
  • the second viewing data refers to the vector obtained from the embedded word vectors.
  • the dimension of the second viewing data is smaller than the dimension of the first viewing data.
  • an average value, a maximum value, or a minimum value can be obtained. Taking the average value as an example, the values at the same position in the embedded word vectors are averaged, and the vector composed of the averages of all positions is taken as the second viewing data.
  • the first viewing data includes the anchor identity data of anchor A, anchor B, and anchor C
  • the second viewing data obtained by averaging the three corresponding embedded word vectors is [0.4234, 0.762, 0.4234], where the first 0.4234 is the result of averaging the first values of the three embedded word vectors, and so on.
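The position-wise pooling described above can be sketched as follows. The function name `pool_embeddings` and the sample vectors are assumptions made for illustration; the text only states that an average, maximum, or minimum may be taken per position.

```python
def pool_embeddings(vectors, mode="mean"):
    """Combine several embedded word vectors into one second-viewing-data vector.

    Each output position is the mean, max, or min of that position
    across all input vectors.
    """
    if not vectors:
        raise ValueError("at least one embedded word vector is required")
    columns = list(zip(*vectors))  # group the values found at the same position
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical embedded word vectors for two anchors watched by one user.
second_viewing_data = pool_embeddings([[1.0, 2.0], [3.0, 4.0]])
```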
  • Step 240 Find out similar viewing users in the user group according to the second viewing data.
  • this step is the same as the processing method of finding similar viewing users in the user group according to the first viewing data, for example, using locality-sensitive hashing to perform bucketing and finding similar viewing users after bucketing, which is not repeated in this embodiment.
  • Step 250 Take each user in the user group as a user node, and connect user nodes corresponding to similar viewing users through edges to obtain a node relationship graph.
  • a node relationship graph refers to a node graph obtained by expressing the relationship between nodes by connecting edges.
  • the node relationship graph refers to a node graph constructed according to the user group and the similar viewing users therein. Each user is displayed as a corresponding node in the node relationship graph.
  • the node representing the user is recorded as a user node.
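A small sketch of Step 250, assuming the similar viewing users are already available as pairs of user identifiers. The adjacency-set representation and the name `build_node_graph` are illustrative choices, not something the application prescribes.

```python
def build_node_graph(users, similar_pairs):
    """Return an undirected graph as {user: set of connected users}.

    Every user becomes a user node; an edge is added between two
    user nodes when the users were found to be similar viewing users.
    """
    graph = {user: set() for user in users}
    for a, b in similar_pairs:
        if a != b and a in graph and b in graph:
            graph[a].add(b)
            graph[b].add(a)
    return graph


graph = build_node_graph(["A", "B", "C"], [("A", "B")])
```

A user with no similar viewing users still appears as a node, just without edges.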
  • Step 260 Process the node relationship graph using a label propagation algorithm to determine a group of similar viewing users.
  • Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is to use the label information of labeled nodes to predict the label information of unlabeled nodes, which enables local community division.
  • a label is assigned to each user node in the node relationship graph. In each iteration, each user node will change its label according to the label of the user node connected to itself until the iteration ends.
  • the rule for changing labels is to use the label that appears most in the connected user nodes as its own label.
  • step 260 may include steps 261-266:
  • Step 261 Assign a corresponding label to each user node in the node relationship graph.
  • the label generation rule is not limited in the embodiment.
  • the label currently assigned to each user node can be considered as an initial label, and the initial labels corresponding to each user node are different.
  • the node relationship graph includes M user nodes.
  • user node 1 corresponds to label 1
  • user node i corresponds to label i, 1 ≤ i ≤ M, and so on.
  • Step 262 Find a user node in the node relationship graph, and find out all neighboring user nodes of the user node according to the edge connection relationship of the user node.
  • a user node is searched for in the node relationship graph, where the search rule is not limited in the embodiment; for example, the user nodes are searched in turn according to their order.
  • the neighbor user nodes of the user node are then searched for, where a neighbor user node refers to a user node connected to the user node through an edge, or a user node for which the weight of the connecting edge is greater than a set threshold.
  • neighbor user nodes and user nodes belong to similar viewing users. Understandably, each user node may correspond to one or more neighbor user nodes, or there may be no neighbor user nodes. If there is no neighbor user node, re-select another user node and repeat this step. If there is a neighbor user node, go to the next step.
  • Step 263 Count the labels of all neighboring user nodes, and update the label with the most occurrences as the label of the user node.
  • the labels of all neighbor user nodes are counted to determine the label with the most occurrences. If there are multiple labels with the most occurrences (for example, the label of each node is still its initial label, so each label appears once), one label is randomly selected from among them. After that, the label of the current user node is updated to the label with the most occurrences.
  • nodes with the same label belong to the same community. After the update is complete, the community to which the current node belongs can be determined.
  • Step 264 Search for another user node in the node relationship graph, and return to perform the operation of finding all neighboring user nodes of the user node according to the edge connection relationship of the user node, until all user nodes in the node relationship graph are traversed.
  • Step 265 Determine whether the current traversal end condition is satisfied. If the traversal end condition is not satisfied, return to step 262. If the traversal end condition is satisfied, step 266 is executed.
  • the traversal end condition is a restriction condition for stopping traversal, and its content can be set according to the actual situation.
  • the traversal end condition is that the threshold of the number of traversals is reached or the labels of each user node in the node relationship graph do not change.
  • the threshold of the number of traversals can be set according to the actual situation. After each round of traversal ends, the recorded number of traversals is incremented by 1, and it is then determined whether the number of traversals has reached the threshold. If it has, it is determined that the traversal end condition is satisfied; otherwise, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. In another embodiment, after the current round of traversal is completed, it is determined whether the label of each user node has changed.
  • if the label of at least one user node has changed, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. If no user node's label has changed, it is determined that the traversal end condition is satisfied. In another embodiment, after the current round of traversal is completed, it is determined whether the label of each user node has changed. If the label of at least one user node has changed, it is determined whether the number of traversals has reached the threshold of the number of traversals: if so, it is determined that the traversal end condition is satisfied; otherwise, it is determined that the traversal end condition is not satisfied and a new round of traversal is started. If no user node's label has changed, it is determined that the traversal end condition is satisfied.
  • Step 266 classify the users corresponding to the user nodes with the same label into the same similar viewing user group.
  • each user node is recorded as [1,A], [1,B], [2,C], [1,D]
  • the first field is the ID of the similar viewing user group; in this case, user A, user B, and user D belong to the same similar viewing user group.
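Steps 261 to 266 can be sketched roughly as below. This is one possible reading under stated assumptions: the graph is an adjacency dictionary, nodes are visited in dictionary order, ties are broken with a seeded random generator, and the end condition combines a round limit with "no label changed". The names `label_propagation`, `group_by_label`, `max_rounds`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def label_propagation(graph, max_rounds=20, seed=0):
    """Asynchronous label propagation over {node: set of neighbor nodes}."""
    rng = random.Random(seed)
    # Step 261: every user node starts with its own distinct label.
    labels = {node: index for index, node in enumerate(graph)}
    for _ in range(max_rounds):  # Step 265: traversal-count threshold
        changed = False
        for node in graph:  # Steps 262 and 264: visit each user node in turn
            neighbors = graph[node]
            if not neighbors:
                continue  # no neighbor user nodes, move on to another node
            # Step 263: adopt the most frequent neighbor label, random on ties.
            counts = Counter(labels[n] for n in neighbors)
            top = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == top)
            new_label = rng.choice(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # Step 265: no label changed in this round
            break
    return labels


def group_by_label(labels):
    """Step 266: users whose nodes share a label form one similar viewing user group."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, []).append(node)
    return groups


example_graph = {"A": {"B"}, "B": {"A"}, "C": set()}
groups = group_by_label(label_propagation(example_graph))
```

Because labels only travel along edges, users in disconnected parts of the graph can never end up in the same group; the round limit guards against label oscillation caused by random tie-breaking.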
  • Step 270 Determine a target user group belonging to the group control account according to the similar viewing user group.
  • this step includes at least one of the following schemes:
  • Scheme 1 If the number of users in the similar viewing user group is greater than or equal to the number threshold, the similar viewing user group is determined as the target user group belonging to the group control account.
  • the currently used quantity threshold refers to the minimum number of users included in the group control account, and its value can be set according to the actual situation, for example, the quantity threshold is 50. If the number of users included in the similar viewing user group is greater than or equal to the number threshold, it is confirmed as the target user group. After processing each similar viewing user group according to the above, the target user group can be mined.
  • Scheme 2 If multiple users in the similar viewing user group have the same device information and/or network address information, the similar viewing user group is determined as the target user group belonging to the group control account.
  • the device information refers to the related information of the device used by the user when watching the live broadcast, which may be a device identification, etc., and the device information of different devices is different.
  • the network address information refers to the network address used by the user when watching the live broadcast, which may be an IP address.
  • the description takes the case in which the device information and the network address information are acquired simultaneously as an example. In practical applications, only one type of information may be acquired for processing, and the processing method is the same. Generally speaking, the probability that device information and network address information are reused among non-group-control accounts is low, while the probability that they are reused among group control accounts is high, for example when different accounts are logged in through one device to inflate a room's audience (room brushing).
  • the similar viewing user group is determined as the target user group.
  • a first same-user-number threshold may be set, which indicates the minimum number of users having the same device information and/or network address information in a group control account. If the number of users with the same device information and/or network address information reaches the first same-user-number threshold, it is determined that the information is repeatedly used, that is, network addresses and/or devices are aggregated. Therefore, the similar viewing user group is determined as the target user group.
  • the similar viewing user group may also be a big anchor user group.
  • a big anchor has a very high number of viewers and followers, and how big anchors are identified is not limited in the embodiment.
  • a big anchor user group means that the users included in it watch several big anchors at fixed times. It can be understood that a big anchor user group is characterized by a low probability of device information and network address information being reused among its users. In this case, if the users in the similar viewing user group have different device information and network address information, the similar viewing user group is determined as a big anchor user group.
  • the similar viewing user group is determined as a big anchor user group, wherein the second same-user-number threshold is lower than the first same-user-number threshold.
  • the target user group and the big anchor user group can also be determined by combining the number of users with the repeated-use situation. For example, when the number of users in a similar viewing user group is greater than or equal to the number threshold, the group is determined as a suspected target user group. If repeated use exists in the suspected target user group, it is determined as the target user group; if there is no repeated use, it is determined as a big anchor user group.
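The combined decision (group size first, then repeated use of devices or network addresses) might look like the sketch below. The default size threshold of 50 follows the example value in the text; the same-user threshold value, the returned strings, and the names `max_shared_count` and `classify_group` are assumptions for illustration.

```python
from collections import Counter


def max_shared_count(values):
    """Largest number of users sharing one identical value (0 if none known)."""
    counts = Counter(v for v in values if v is not None)
    return max(counts.values(), default=0)


def classify_group(group, device_of, ip_of, size_threshold=50, same_user_threshold=3):
    """Classify one similar viewing user group.

    Returns "target" for a suspected group with aggregated devices or
    addresses, "big_anchor" for a large group without such reuse,
    and "other" for a group below the size threshold.
    """
    if len(group) < size_threshold:
        return "other"
    shared = max(
        max_shared_count(device_of.get(user) for user in group),
        max_shared_count(ip_of.get(user) for user in group),
    )
    return "target" if shared >= same_user_threshold else "big_anchor"


users = ["u1", "u2", "u3"]
shared_devices = {"u1": "d1", "u2": "d1", "u3": "d2"}
distinct_devices = {"u1": "d1", "u2": "d2", "u3": "d3"}
verdict = classify_group(users, shared_devices, {}, size_threshold=3, same_user_threshold=2)
```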
  • step 250 also includes: acquiring the device information and/or network address information of each user in the user group; using the device information and/or network address information as information nodes, adding them to the node relationship graph, and connecting each user node to its corresponding information nodes by edges.
  • a node representing device information and/or a node representing network address information is added, each device information corresponds to a node, and each network address information corresponds to a node.
  • the nodes of device information and network address information are collectively referred to as information nodes, and the description is given by adding two types of information nodes at the same time as an example.
  • the node relationship graph also includes the situation of each user using the device and network address.
  • the probability that the similar viewing user group mined by the LPA algorithm is a group control account can be increased, the mining of big anchor user groups can be avoided, and the computational complexity of subsequent operations can be reduced.
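A sketch of adding information nodes to an existing node relationship graph. Representing an information node as a `(kind, value)` tuple, and the name `add_information_nodes`, are assumptions made here so that device nodes and address nodes cannot collide with user identifiers.

```python
def add_information_nodes(graph, device_of, ip_of):
    """Add one node per device and per network address, linked to its users.

    `graph` maps each node to the set of nodes it is connected to and
    is modified in place; information nodes are ("device", id) or ("ip", address).
    """
    for user in list(graph):  # snapshot, because the graph grows while iterating
        for kind, table in (("device", device_of), ("ip", ip_of)):
            value = table.get(user)
            if value is None:
                continue  # nothing known for this user
            info_node = (kind, value)
            graph.setdefault(info_node, set()).add(user)
            graph[user].add(info_node)
    return graph


g = add_information_nodes(
    {"A": set(), "B": set()},
    device_of={"A": "d1", "B": "d1"},
    ip_of={"A": "1.1.1.1"},
)
```

Users who share a device or address become indirectly connected through the information node, so label propagation is more likely to place them in the same group.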
  • FIG. 5 is a schematic structural diagram of a group control account mining device provided by an embodiment of the present application.
  • the group control account mining device includes a data acquisition module 301, a user search module 302, and a group control determination module 303.
  • the data acquisition module 301 is configured to acquire the first viewing data of the user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the anchors viewed by the corresponding user within the set time period.
  • the user search module 302 is configured to find similar viewing users in the user group according to the first viewing data;
  • the group control determination module 303 is configured to mine a similar viewing user group from the user group according to the similar viewing users, and determine the target user group belonging to the group control account according to the similar viewing user group.
  • the device further includes: a training module, configured to, before similar viewing users are found in the user group according to the first viewing data, use the vocabulary vectors corresponding to each piece of first viewing data as training data and obtain, by training, the embedded word vector corresponding to each vocabulary vector, where each piece of anchor identity data corresponds to one vocabulary vector, the length of the vocabulary vector is equal to the current total number of anchors, and the length of the embedded word vector is less than the length of the vocabulary vector; and a viewing data determination module, configured to obtain the corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data.
  • the user search module 302 is specifically configured to search for similar viewing users in the user group according to the second viewing data.
  • the user search module 302 includes: a similarity calculation sub-module, configured to calculate the viewing similarity between users in the user group according to the first viewing data; and a similarity determination sub-module, configured to find similar viewing users in the user group according to the viewing similarity.
  • the similarity calculation sub-module includes: a bucketing unit, configured to bucket the first viewing data by using locality-sensitive hashing; and an in-bucket calculation unit, configured to calculate the viewing similarity between the pieces of first viewing data in each bucket.
  • the bucketing unit includes: a signature calculation subunit, configured to perform a minimum hash (MinHash) calculation on each piece of first viewing data to obtain a corresponding signature vector; a mapping subunit, configured to divide each signature vector into multiple bands (row strips) and map each band to a corresponding hash bucket by using a hash function, where there is at least one hash function; and a bucket subunit, configured to classify the pieces of first viewing data whose bands are mapped to the same hash bucket into the same bucket.
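The bucketing unit can be illustrated with a compact MinHash-plus-banding sketch. Several choices here are assumptions rather than details from the application: seeded MD5 digests stand in for the family of hash functions, a dictionary keyed by `(band index, band values)` plays the role of the hash buckets, and Jaccard similarity is used as the in-bucket viewing similarity. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def _hash(seed, value):
    """Deterministic integer hash of a value under a given seed."""
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(anchor_ids, num_hashes=8):
    """Signature vector: for each hash function, the minimum hash over the set."""
    return [min(_hash(seed, a) for a in anchor_ids) for seed in range(num_hashes)]


def lsh_buckets(viewing_data, num_hashes=8, rows_per_band=2):
    """Split each signature into bands and place users with equal bands together."""
    buckets = defaultdict(set)
    for user, anchors in viewing_data.items():
        if not anchors:
            continue  # a user who watched nothing has no signature
        signature = minhash_signature(anchors, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            buckets[(start // rows_per_band, band)].add(user)
    return buckets


def candidate_pairs(buckets):
    """Users that share at least one bucket become candidate similar users."""
    pairs = set()
    for users in buckets.values():
        ordered = sorted(users)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Exact similarity of two anchor sets, computed only for candidates."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


data = {"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"X", "Y"}}
pairs = candidate_pairs(lsh_buckets(data))
```

Only pairs that land in a common bucket need an exact similarity computation, which avoids comparing every user against every other user.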
  • the group control determination module 303 includes: a relationship graph construction sub-module, configured to take each user in the user group as a user node and connect the user nodes corresponding to similar viewing users through edges to obtain a node relationship graph; a label propagation sub-module, configured to process the node relationship graph by using the label propagation algorithm to determine the similar viewing user group; and a first determining sub-module, configured to determine the target user group belonging to the group control account according to the similar viewing user group.
  • the label propagation sub-module includes: a label assignment unit, configured to assign a corresponding label to each user node in the node relationship graph; a neighbor search unit, configured to search for a user node in the node relationship graph and find all neighboring user nodes of the user node according to the edge connection relationship of the user node; a label updating unit, configured to count the labels of all neighboring user nodes and update the label of the user node to the label with the most occurrences; a first traversal unit, configured to search for another user node in the node relationship graph and return to the operation of finding all neighboring user nodes of the user node according to the edge connection relationship of the user node, until all user nodes in the node relationship graph are traversed; an end judgment unit, configured to judge whether the current traversal end condition is met, where the traversal end condition is that the threshold of the number of traversals is reached or that the label of each user node in the node relationship graph has not changed; a second traversal unit, configured to, if the traversal end condition is not met, return to the operation of searching for a user node in the node relationship graph until the traversal end condition is satisfied; and a node dividing unit, configured to classify the users corresponding to the user nodes with the same label into the same similar viewing user group.
  • the relationship graph construction sub-module is further configured to: obtain the device information and/or network address information of each user in the user group; use the device information and/or network address information as information nodes, add them to the node relationship graph, and connect each user node to its corresponding information nodes through edges.
  • the group control determination module 303 includes: a first mining sub-module, configured to mine a similar viewing user group from the user group according to the similar viewing users; and a second determining sub-module, configured to determine the similar viewing user group as the target user group belonging to the group control account if the number of users in the similar viewing user group is greater than or equal to the number threshold.
  • the group control determination module 303 includes: a second mining sub-module, configured to mine a similar viewing user group from the user group according to the similar viewing users; and a third determining sub-module, configured to determine the similar viewing user group as the target user group belonging to the group control account if multiple users in the similar viewing user group have the same device information and/or network address information.
  • the group control account mining device provided above can be used to execute the group control account mining method provided by any of the above embodiments, and has corresponding functions and beneficial effects.
  • the units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present invention.
  • FIG. 6 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application.
  • the group control account mining device includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the group control account mining device can be one or more, and one processor 40 is taken as an example.
  • the processor 40 , the memory 41 , the input device 42 and the output device 43 in the group control account mining device may be connected through a bus or other means, and the connection through a bus is taken as an example in FIG. 6 .
  • the memory 41 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the group control account mining method in the embodiments of the present application (for example, the data acquisition module, the user search module, and the group control determination module in the group control account mining device). The processor 40 executes the various functional applications and data processing of the group control account mining device by running the software programs, instructions, and modules stored in the memory 41, that is, it implements the above group control account mining method.
  • the memory 41 can mainly include a stored program area and a stored data area, wherein the stored program area can store the operating system and the application program required for at least one function; the stored data area can store data created according to the use of the group control account mining device, etc. .
  • the memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 41 may further include memory located remotely relative to processor 40, and these remote memories may be connected to the group control account mining device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 42 may be configured to receive input numerical or character information, and to generate key signal input related to user settings and function control of the group control account mining device.
  • the output device 43 may include a display device such as a display screen.
  • the group control account mining device of FIG. 6 includes the group control account mining device of FIG. 5, can be used to execute any of the group control account mining methods, and has the corresponding functions and beneficial effects.
  • an embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the relevant operations of the group control account mining method provided by any embodiment of the present application, with the corresponding functions and beneficial effects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present application provide a method and apparatus for group control account excavation, a device, and a storage medium, and relate to the technical field of livestreaming. The technical solution provided in the embodiments of the present application comprises: acquiring first viewing data of a user group during a set period, each user in the user group corresponding to one piece of first viewing data, and each piece of the first viewing data comprising identity data of a live streamer viewed by the corresponding user during the set period; searching for similar viewing users in the user group on the basis of the first viewing data; excavating a similar viewing user group in the user group on the basis of the similar viewing users, and determining, on the basis of the similar viewing user group, a target user group belonging to group control accounts. The employment of the method solves the technical problem of a group control account excavation process having poor security and being easily cracked.

Description

Group control account mining method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202110098987.5, filed with the China Patent Office on January 25, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of the present application relate to the technical field of live webcasting, and in particular to a group control account mining method, a group control account mining apparatus, a group control account mining device, and a storage medium.
Background
"Popularity" is a specific term in the live broadcast industry that comprehensively reflects how well liked an anchor is and the quality of the live content. Popularity can be calculated from dimensions such as the number of viewers, viewing duration, broadcast duration, number of followers, interaction, number of bullet comments, and gift rewards. Among these, the number of viewers is an important dimension for measuring popularity, and the ranking of anchors in recommendations can be determined by the number of viewers. In addition, many live broadcast platforms settle an anchor's pay based on the number of viewers.
Generally speaking, operating a large number of zombie accounts (that is, group control accounts) in batches through group control software can boost the popularity of an anchor's room. In some related technologies, group control accounts are detected by the following methods: 1. the device environment aggregation detection method, which determines whether group control accounts exist from the mobile phone numbers used at registration and the IP addresses used when watching anchors, since the mobile phone numbers of group control accounts conspicuously tend to share IP addresses; 2. the room feature anomaly detection method: when group control accounts are used to increase popularity, the distributions of data features in the anchor's room, such as gift rewards, number of viewers, and number of bullet comments, are abnormal. For example, under normal circumstances, when the number of viewers in an anchor's room reaches a threshold, its gift rewards fall within a certain distribution interval, whereas when the viewer count of a room inflated by group control accounts reaches the threshold, its gift rewards are clearly below the normal distribution interval; anchors with abnormal feature distributions can thus be found by the room feature anomaly detection method. Although the above methods can detect group control accounts, their security is low and they are easily cracked. For example, a dynamic IP pool can be used to prevent mobile phone numbers from sharing the same IP address, and distributed cloud group control account access or switching of gift-giving accounts can be used to avoid an abnormal feature distribution.
In summary, how to safely and accurately mine the group control accounts in live broadcasts has become a technical problem that urgently needs to be solved.
Summary of the invention
The embodiments of the present application provide a group control account mining method, apparatus, device, and storage medium to solve the technical problem that the group control account mining process has low security and is easily cracked.
In a first aspect, an embodiment of the present application provides a group control account mining method, including:
acquiring first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the anchors viewed by the corresponding user within the set time period;
finding similar viewing users in the user group according to the first viewing data;
mining a similar viewing user group from the user group according to the similar viewing users, and determining a target user group belonging to the group control account according to the similar viewing user group.
In a second aspect, an embodiment of the present application provides a group control account mining apparatus, including:
a data acquisition module, configured to acquire first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the identity data of the anchors viewed by the corresponding user within the set time period;
a user search module, configured to find similar viewing users in the user group according to the first viewing data;
a group control determination module, configured to mine a similar viewing user group from the user group according to the similar viewing users, and determine a target user group belonging to the group control account according to the similar viewing user group.
第三方面,本申请实施例提供了一种群控账号挖掘设备,包括:存储器以及一个或多个处理器;In a third aspect, an embodiment of the present application provides a group control account mining device, including: a memory and one or more processors;
所述存储器,配置为存储一个或多个程序;the memory configured to store one or more programs;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如第一方面所述的群控账号挖掘方法。When the one or more programs are executed by the one or more processors, the one or more processors implement the group control account mining method according to the first aspect.
第四方面,本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如第一方面所述的群控账号挖掘方法。In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the group control account mining method described in the first aspect.
With the above group control account mining method, apparatus, device, and storage medium, first viewing data of a user group within a set time period is acquired, similar viewing users are found in the user group according to the first viewing data, a similar viewing user group is then mined according to the similar viewing users, and group control accounts are determined according to the similar viewing user group. This solves the technical problem that the group control account mining process has low security and is easily circumvented. Even if the group control accounts use a dynamic IP pool or access the service through distributed cloud-based group control, similar viewing users can still be effectively screened out based on which anchors each user watches, and the group control accounts in the user group can then be accurately mined. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of anchor popularity.
Description of drawings
FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of hash buckets provided by an embodiment of the present application;
FIG. 3 is a flowchart of another group control account mining method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a neural network provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a group control account mining apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a group control account mining device provided by an embodiment of the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used to explain the present application and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the present application rather than the entire structure.
A group control account refers to the use of multiple real devices (such as multiple mobile phones), or the simulation of multiple real devices, in which script software (here, group control software) is installed to control application software on the devices (such as a live streaming application), and in which the software and hardware information of the devices is modified so as to simulate manual use of the application software. Group control accounts can use automation to imitate the operation requests of real users as closely as possible. In the live streaming field, group control accounts can be used to achieve cheating goals for a live room such as gaining followers, driving traffic, and inflating advertisement counts. The practice of using group control accounts to imitate normal accounts entering an anchor's room in order to increase the anchor's popularity is called room brushing.
To avoid the impact of group control accounts on normal live streaming, an embodiment of the present application provides a group control account mining method for mining group control accounts safely and accurately. In one example, the method may be executed by a group control account mining device, which may be implemented in software and/or hardware and may consist of one physical entity or of two or more physical entities. For example, the group control account mining device may be a computer, a tablet computer, or another smart device with data computation and analysis capabilities.
FIG. 1 is a flowchart of a group control account mining method provided by an embodiment of the present application. Referring to FIG. 1, the group control account mining method may include:
Step 110: Acquire first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data of the anchors watched by the corresponding user within the set time period.
In one embodiment, the user group refers to the set of users who watch live streams using the live streaming application. Viewing data refers to data that reflects a user's viewing behavior while watching live streams. Exemplarily, the viewing data includes at least the anchor identity data of the anchors watched by the user. The anchor identity data indicates the identity of an anchor, and different anchors have different anchor identity data; an anchor is a user who has registered on the live streaming platform and is able to stream. It can be understood that the viewing data may also include content such as how long the user watched each anchor. In one embodiment, each user in the user group also has corresponding user identity data, and different users have different user identity data. When a user enters an anchor's room, the group control account mining device records the user identity data, the anchor identity data, the viewing duration, and similar content, and generates one viewing data record. When the user enters another live room, the device again records the user identity data, the anchor identity data, the viewing duration, and similar content, and generates another viewing data record. In one embodiment, the first viewing data refers to the collection of a user's viewing data within the set time period, and optionally includes the anchor identity data of the anchors watched by the user within the set time period. Each user corresponds to one piece of first viewing data. The set time period may be chosen according to the actual situation, for example 24 hours, 12 hours, or 48 hours.
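As a non-limiting illustration, the aggregation of individual viewing records into one piece of first viewing data per user may be sketched as follows. The record layout (user, anchor, entry time in seconds, duration in seconds) and the function name build_first_viewing_data are assumptions made for this sketch and are not specified by the embodiment.

```python
from collections import defaultdict

def build_first_viewing_data(records, start, end):
    """Collect, per user, the set of anchors watched within [start, end)."""
    per_user = defaultdict(set)
    for user_id, anchor_id, entered_at, _duration in records:
        # Keep only records that fall inside the set time period.
        if start <= entered_at < end:
            per_user[user_id].add(anchor_id)
    return dict(per_user)

# Hypothetical records: (user id, anchor id, entry time, viewing duration).
records = [
    ("u1", "a1", 10, 300),
    ("u1", "a2", 400, 120),
    ("u2", "a1", 20, 290),
    ("u3", "a9", 90000, 60),  # entered after the 24-hour window used below
]
first_viewing = build_first_viewing_data(records, start=0, end=24 * 3600)
```

Here a set is used so that watching the same anchor twice in the period does not change the result; an implementation that also needs viewing durations would keep them alongside the anchor identity data.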
Optionally, to facilitate subsequent calculation, the anchor identity data is encoded in this embodiment and represented in the form of vectors. The embodiment does not limit the encoding rule; for example, One-Hot encoding may be used to vectorize each piece of anchor identity data. The first viewing data is then composed of the vectors corresponding to the pieces of anchor identity data.
Step 120: Find similar viewing users in the user group according to the first viewing data.
Group control accounts operate in batches: they can enter and leave the rooms of one or more anchors at the same time. As a result, the users in the same batch of group control accounts watch the same set of anchors, either as a whole or batch by batch, and the durations for which these users watch the same live stream are nearly identical. Normal users (users that are not group control accounts), by contrast, differ widely in viewing duration, which follows a normal distribution, and their viewing shows both preference (one or more favorite anchors that they watch regularly) and randomness (anchors chosen at random). The probability that any two normal users have the same first viewing data is therefore small, whereas the probability that group control accounts share the same first viewing data is large. Accordingly, in this embodiment, similar viewing users are found in the user group through the first viewing data. Similar viewing users are two users whose watched anchors are the same or highly similar. It can be understood that a user may belong to several pairs of similar viewing users; for example, user A and user B may form one pair, and user A and user C may form another.
In one embodiment, similar viewing users are selected by calculating the similarity between pieces of first viewing data. The embodiment does not limit how the similarity is calculated; for example, cosine similarity or Euclidean distance may be used. Exemplarily, users with high similarity are determined to be similar viewing users. For instance, when cosine similarity is used, a threshold is set that represents the maximum cosine distance between similar viewing users; the smaller the cosine distance, the higher the similarity. After the cosine distance between two pieces of first viewing data is calculated, the two corresponding users are determined to be similar viewing users if that distance is less than the threshold. Once the similarity has been calculated pairwise over all first viewing data in this way, all similar viewing users are obtained.
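The pairwise comparison described above may be sketched as follows, purely as an illustration. It assumes each piece of first viewing data has already been reduced to a fixed-length numeric vector; the function names and the threshold value 0.1 are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as dissimilar to everything
    return 1.0 - dot / (norm_u * norm_v)

def find_similar_pairs(vectors, max_distance):
    """Compare every pair of users and keep pairs below the distance threshold."""
    pairs = []
    for (user_a, vec_a), (user_b, vec_b) in combinations(sorted(vectors.items()), 2):
        if cosine_distance(vec_a, vec_b) < max_distance:
            pairs.append((user_a, user_b))
    return pairs

# Hypothetical vectors: u1 and u2 watched the same anchors, u3 watched others.
vectors = {
    "u1": [1, 1, 0, 0],
    "u2": [1, 1, 0, 0],
    "u3": [0, 0, 1, 1],
}
similar_pairs = find_similar_pairs(vectors, max_distance=0.1)
```

Because every pair is compared, the cost grows quadratically with the number of users, which is the motivation for the bucketing described later in the specification.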
Step 130: Mine a similar viewing user group from the user group according to the similar viewing users, and determine, according to the similar viewing user group, a target user group belonging to group control accounts.
A similar viewing user group is a group in which the first viewing data of the users is highly similar, that is, the anchors watched by the users in the group are the same or highly similar. A similar viewing user group can be determined from the similar viewing users. In one embodiment, a node graph is drawn in which each node represents one user, and the two nodes of each pair of similar viewing users are connected by an edge. It can be understood that, because group control accounts perform the same operations in batches, they form a dense community in the node graph that contains a large number of users, whereas normal users are scattered in the node graph or form communities with few users. Communities with dense user connections can therefore be found through the node graph (for example, by grouping nodes that are connected to one another into one community), and each community found is taken as a similar viewing user group.
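One simple way to realize the parenthetical example above, in which connected nodes are grouped into one community, is to compute connected components with a union-find structure. The sketch below is only an illustration under that assumption; the embodiment does not prescribe a particular community detection algorithm, and the function name connected_groups is hypothetical.

```python
from collections import defaultdict

def connected_groups(pairs):
    """Group users into connected components of the similar-viewing graph."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Each pair of similar viewing users is one edge of the node graph.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest groups first; ties broken by member names for a stable order.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))

pairs = [("u1", "u2"), ("u2", "u3"), ("u7", "u8")]
groups = connected_groups(pairs)
```

Connected components treat a single edge as sufficient to join two users; a stricter notion of a dense community would require a different algorithm.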
Exemplarily, the target user group is determined according to the similar viewing user group, where the target user group is the user group corresponding to group control accounts. In one embodiment, each similar viewing user group found is determined to be a target user group, or only similar viewing user groups whose number of users exceeds a number threshold (which may be set according to the actual situation) are determined to be target user groups. In another embodiment, the target user group is determined in combination with the network addresses (such as IP addresses) of the users in the similar viewing user group, the information of the devices they use, and/or their viewing duration for each anchor. The device information is used to distinguish the devices used by users. The network address and device information can be obtained when a user uses the live streaming application. For example, the viewing duration of each user in the similar viewing user group for the same anchor is counted; if the viewing durations for the same anchor are the same or similar (for example, the difference in viewing duration is less than a set range), the similar viewing user group is determined to be a target user group. As another example, group control accounts share network addresses and/or device information, that is, different users in a target user group have the same network address and/or device information. Whether a similar viewing user group is a target user group can therefore be determined by checking whether the users in it share network addresses and/or device information.
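The following sketch illustrates one possible combination of the two checks described above, the number threshold and shared network addresses. The decision rule (the fraction of users whose address is also used by another user in the group), the function name is_target_group, and all threshold values are assumptions for illustration only.

```python
from collections import Counter

def is_target_group(group, ip_of, size_threshold, shared_ratio_threshold):
    """Flag a similar viewing user group as a group control candidate."""
    # The group must contain more users than the number threshold.
    if len(group) <= size_threshold:
        return False
    # Count how many users share their network address with another member.
    ip_counts = Counter(ip_of[user] for user in group)
    sharing = sum(count for count in ip_counts.values() if count > 1)
    return sharing / len(group) >= shared_ratio_threshold

# Hypothetical address table.
ip_of = {
    "u1": "10.0.0.1", "u2": "10.0.0.1", "u3": "10.0.0.2",
    "u7": "10.0.0.7", "u8": "10.0.0.8",
}
large_is_target = is_target_group({"u1", "u2", "u3"}, ip_of, 2, 0.5)
small_is_target = is_target_group({"u7", "u8"}, ip_of, 2, 0.5)
```

The same structure could be applied to device information or to the spread of viewing durations by replacing the address lookup with the corresponding attribute.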
In the above, first viewing data of a user group within a set time period is acquired, similar viewing users are found in the user group according to the first viewing data, a similar viewing user group is then mined according to the similar viewing users, and group control accounts are determined according to the similar viewing user group. This solves the technical problem that the group control account mining process has low security and is easily circumvented. Even if the group control accounts use a dynamic IP pool or access the service through distributed cloud-based group control, similar viewing users can still be effectively screened out based on which anchors each user watches, and the group control accounts in the user group can then be accurately mined. This raises the cost of group control cheating, prevents room brushing, and ensures the authenticity of anchor popularity.
On the basis of the above embodiment, similar viewing users are determined by calculating similarity. In this case, step 120 includes steps 121 to 122:
Step 121: Calculate the viewing similarity between the users in the user group according to the first viewing data.
The viewing similarity reflects the degree of similarity between pieces of first viewing data. It can be calculated using cosine similarity, Euclidean distance, or the like, and each pair of users corresponds to one viewing similarity. When the number of users is large (for example, greater than 10^6), calculating the viewing similarity requires a very large amount of computation. Therefore, in this embodiment, the users are first roughly divided into buckets so that users who may be similar fall into the same bucket with high probability, and the viewing similarity is then calculated only between users within each bucket, which reduces the amount of computation. In this case, step 121 includes steps 1211 to 1212:
Step 1211: Divide the pieces of first viewing data into buckets using locality-sensitive hashing.
Locality-sensitive hashing (LSH) is used to divide the first viewing data into buckets so that pieces of first viewing data that may be similar fall into the same bucket. The users corresponding to the first viewing data in each bucket can then be regarded as candidate similar viewing users.
In one embodiment, when LSH is used for bucketing, step 1211 includes steps 12111 to 12113:
Step 12111: Perform a min-hash calculation on each piece of first viewing data to obtain a corresponding signature vector.
Min-hash (minhash) is a technique commonly used in LSH to compute a signature vector (or matrix). In this embodiment, min-hash is applied to the first viewing data to obtain the signature vector (or matrix). Each piece of first viewing data corresponds to one signature vector, and the signature vector occupies less space than the first viewing data.
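A min-hash signature may be sketched as follows. The sketch assumes each piece of first viewing data is a non-empty set of anchor identifiers and simulates a family of hash functions by salting SHA-1 with a seed; the function names and the signature length of 8 are hypothetical choices, not part of the embodiment.

```python
import hashlib

def seeded_hash(seed, item):
    """One member of a hash family, obtained by salting SHA-1 with a seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(anchor_ids, num_hashes=8):
    """For each hash function, keep the minimum hash over the anchor set.

    anchor_ids is assumed to be non-empty.
    """
    return [
        min(seeded_hash(seed, anchor) for anchor in anchor_ids)
        for seed in range(num_hashes)
    ]

sig_a = minhash_signature({"a1", "a2", "a3"})
sig_b = minhash_signature({"a3", "a2", "a1"})  # same set, different order
sig_c = minhash_signature({"a7", "a8", "a9"})  # disjoint set
```

Two users with identical anchor sets always receive identical signatures, and for two different sets the fraction of matching signature positions estimates their Jaccard similarity.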
Step 12112: Divide each signature vector into multiple bands, and map each band to a corresponding hash bucket using a hash function, where there is at least one hash function.
Each signature vector is divided into multiple segments, and the content of each segment is treated as one band. The number of bands (that is, the number of segments) can be set according to the actual situation, and all signature vectors have the same number of bands. Each band is then mapped to a corresponding hash bucket using a hash function. The hash function can be selected according to the actual situation, and one or more hash functions may be used. When multiple hash functions are used, each hash function maps the bands once.
Step 12113: Place the first viewing data corresponding to bands mapped to the same hash bucket into the same bucket.
It can be understood that if one or more bands of two signature vectors are identical, the two signature vectors have a relatively high similarity, and the more bands they share, the higher the similarity. Here, two bands being identical means that they are mapped to the same hash bucket. Accordingly, in this embodiment, the bands mapped to the same hash bucket are obtained, the first viewing data corresponding to each of these bands is looked up, and the pieces of first viewing data found are treated as data in the same bucket. The users corresponding to the first viewing data in the same bucket can then be regarded as candidate similar viewing users.
For example, FIG. 2 is a schematic diagram of hash buckets provided by an embodiment of the present application. FIG. 2 contains three hash buckets, denoted band1, band2, and band3. Note that FIG. 2 shows only some of the bands mapped to band1 (shown in FIG. 2 as 10002, 32122, and 01311). The first viewing data corresponding to the bands in band1 is treated as data in one bucket, the first viewing data corresponding to the bands in band2 as data in another bucket, and the first viewing data corresponding to the bands in band3 as data in a third bucket, which completes the bucketing of the first viewing data.
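Steps 12112 and 12113 may be illustrated with the sketch below. It is a minimal sketch under stated assumptions: signatures are short integer lists, the bucket key is the band position together with the band contents (so Python's dictionary hashing stands in for the hash function), and the function name lsh_buckets is hypothetical.

```python
from collections import defaultdict

def lsh_buckets(signatures, rows_per_band):
    """Split each signature into bands and bucket users by band contents."""
    buckets = defaultdict(set)
    for user, signature in signatures.items():
        for start in range(0, len(signature), rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            # Include the band position so that equal contents at different
            # positions do not land in the same bucket.
            buckets[(start // rows_per_band, band)].add(user)
    # Only buckets holding more than one user yield candidate pairs.
    return {key: users for key, users in buckets.items() if len(users) > 1}

# Hypothetical signatures: u1 and u2 agree on their first band only.
signatures = {
    "u1": [1, 0, 3, 2],
    "u2": [1, 0, 9, 9],
    "u3": [5, 5, 6, 6],
}
candidate_buckets = lsh_buckets(signatures, rows_per_band=2)
```

With more rows per band, two users must agree on more signature positions to share a bucket, which lowers the number of candidates; with more bands, they have more chances to collide.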
Step 1212: Calculate the viewing similarity between the pieces of first viewing data within each bucket.
Taking the bucket as the unit, the viewing similarity is calculated between the pieces of first viewing data within each bucket; no viewing similarity needs to be calculated between first viewing data in different buckets. The embodiment does not limit how the viewing similarity is calculated.
Step 122: Find similar viewing users in the user group according to the viewing similarity.
In one embodiment, similar viewing users are found by comparison with a threshold. For example, when the Euclidean distance is used to calculate the viewing similarity, the smaller the distance between two pieces of first viewing data, the higher the viewing similarity. A distance threshold can therefore be set according to the actual situation, and two users are determined to be similar viewing users when the distance between them is less than the distance threshold.
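The within-bucket comparison with a Euclidean distance threshold may be sketched as follows. The bucket contents, the vectors, the function name, and the threshold value 0.5 are hypothetical and chosen only to make the example self-contained.

```python
import math
from itertools import combinations

def similar_pairs_in_buckets(buckets, vectors, distance_threshold):
    """Compare users only within each bucket, using Euclidean distance."""
    found = set()
    for members in buckets:
        for a, b in combinations(sorted(members), 2):
            if math.dist(vectors[a], vectors[b]) < distance_threshold:
                found.add((a, b))  # a set avoids duplicates across buckets
    return sorted(found)

# Hypothetical buckets produced by a prior LSH step.
buckets = [{"u1", "u2"}, {"u2", "u3"}]
vectors = {
    "u1": [1.0, 1.0, 0.0],
    "u2": [1.0, 1.0, 0.0],
    "u3": [0.0, 0.0, 1.0],
}
bucket_pairs = similar_pairs_in_buckets(buckets, vectors, distance_threshold=0.5)
```

Users u1 and u3 never share a bucket in this example, so their distance is never computed, which is where the reduction in computation comes from.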
In the above, similar viewing users can be accurately found by calculating the viewing similarity, and the locality-sensitive hashing algorithm avoids the large amount of computation otherwise required when the number of users is large, reducing the computational complexity of finding similar viewing users.
FIG. 3 is a flowchart of another group control account mining method provided by an embodiment of the present application. This method is a more detailed version of the method of the above embodiment.
In the current embodiment, each piece of anchor identity data corresponds to one vocabulary vector, and the length of the vocabulary vector equals the current total number of anchors. The vocabulary vector is the vector obtained by One-Hot encoding the anchor identity data. Exemplarily, the dimension of the vocabulary vector is given by its length, which equals the current total number of anchors; this number may be the total number of anchors currently registered in the live streaming application, or the total number of anchors watched by the users in the user group. For example, if the current total number of anchors is 4, each piece of anchor identity data is represented by a 4-dimensional vocabulary vector, and the 4 vocabulary vectors are [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1].
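The four-anchor example above may be reproduced with the following sketch. The anchor names A to D and the function name one_hot_vocabulary are hypothetical, and the positions are assigned in sorted order only to make the result deterministic.

```python
def one_hot_vocabulary(anchor_ids):
    """Assign each anchor a One-Hot vector whose length is the anchor count."""
    ordered = sorted(set(anchor_ids))
    size = len(ordered)
    return {
        anchor: [1 if position == index else 0 for position in range(size)]
        for index, anchor in enumerate(ordered)
    }

vocabulary = one_hot_vocabulary(["A", "B", "C", "D"])

# A piece of first viewing data covering anchors A, B, and C becomes a
# 3 x 4 matrix of vocabulary vectors.
first_viewing_matrix = [vocabulary[anchor] for anchor in ["A", "B", "C"]]
```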
In one embodiment, referring to FIG. 3, the group control account mining method may include:
Step 210: Acquire first viewing data of a user group within a set time period, where each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes the anchor identity data of the anchors watched by the corresponding user within the set time period.
In this embodiment, the first viewing data is represented by vocabulary vectors. For example, if the vocabulary vectors of the anchor identity data contained in a piece of first viewing data are [1 0 0 0], [0 1 0 0], and [0 0 1 0], then that first viewing data is a 3×4 matrix composed of these vocabulary vectors.
Step 220: Use the vocabulary vectors corresponding to each piece of first viewing data as training data, and train to obtain the embedded word vector corresponding to each vocabulary vector, where the length of the embedded word vector is less than the length of the vocabulary vector.
Exemplarily, when the number of anchors registered in the live streaming application is very large (for example, hundreds of thousands or millions of registered anchors), the corresponding vocabulary vectors are very long and the dimension of the first viewing data is correspondingly large, which is unfavorable for subsequent calculations on the first viewing data. Therefore, in this embodiment, the vocabulary vectors are reduced in dimension according to the first viewing data, and the vectors obtained after dimension reduction are called embedded word vectors. Each vocabulary vector corresponds to one embedded word vector, and different vocabulary vectors may correspond to the same embedded word vector. The length (that is, the dimension) of the embedded word vector can be set according to the actual situation, for example to 50. Here, the length of the embedded word vector is less than the length of the vocabulary vector. In one embodiment, Word2Vec is used to obtain the embedded word vectors. Word2Vec is a natural language processing (NLP) tool, a family of related models used to produce word vectors. The model uses a shallow, two-layer neural network. After the network has been trained, the Word2Vec model can map each word to a vector that represents the relationships between words, and this vector resides in the hidden layer of the network. In this embodiment, the vector that represents the relationships between vocabulary vectors is called the embedded word vector; that is, Word2Vec converts words (here, vocabulary vectors) into embedded word vectors, so that the relationships between vocabulary vectors can be measured quantitatively through the embedded word vectors.
In one embodiment, FIG. 4 is a schematic diagram of a neural network provided by an embodiment of the present application. This is the neural network used by Word2Vec, and it is a Skip-gram model; in NLP, a Skip-gram model takes a word as input and predicts its context words as output. Referring to FIG. 4, the input layer receives a V-dimensional vocabulary vector, $[x_1, x_2, \ldots, x_V]$, and the output layer outputs another V-dimensional vocabulary vector, $[y_1, y_2, \ldots, y_V]$. After the neural network has been trained, the weights from the input layer to the hidden layer are the embedded word vectors corresponding to the vocabulary vectors, and they express the relationship between the vocabulary vector at the input layer and the vocabulary vector at the output layer. The transpose of the $k$-th row of the matrix $W_{V \times N} = \{w_{ki}\}$ shown in FIG. 4 is the embedded word vector of the vocabulary vector whose nonzero entry is at the $k$-th position. The embedded word vector is N-dimensional, with $N \ll V$. It can be understood that when one input word corresponds to multiple output words, there are multiple matrices $W'_{N \times V} = \{w'_{ik}\}$, and each matrix outputs one set $[y_1, y_2, \ldots, y_V]$.
In one embodiment, the neural network is trained as follows. An input word is selected in a sentence, and the skip_window and num_skips parameters are defined. The skip_window parameter is the number of words selected on one side (left or right) of the current input word in the sentence when training the neural network; it determines the word window in which the network's output words lie. The num_skips parameter is the number of different output words, which are chosen from the word window. For example, take the sentence "there is an apple on the table" with skip_window and num_skips both set to 2. When the input word is apple, the corresponding word window is [is an apple on the]. After the context is associated, the neural network obtains two correspondences, apple with an and apple with on; an and on are the different output words, and (apple, an) and (apple, on) serve as two training samples for this sentence, meaning that an or on is output when apple is input. Once this is set up, the vocabulary vector corresponding to the input word is selected from the training data and fed into the neural network, and a probability distribution over the input words is obtained from the output words; this distribution represents the probability that different input words yield the same output word. For example, when the network is trained with data built from "the capital of China is Beijing" and "the capital of the United Kingdom is London", the output words for the input words China and United Kingdom both include words such as "the capital is" after the context is associated, so the probability for related words such as "China" and "United Kingdom" should be higher than for other words, and the embedded word vectors of "China" and "United Kingdom" are the same or similar. According to this probability distribution, the matrices $W_{V \times N}$ and $W'_{N \times V}$ in FIG. 4 are updated by gradient descent and backpropagation to carry out the training. After training is complete, the embedded word vector of each input word is obtained from the matrix $W_{V \times N}$.
When this training method is applied to the first viewing data, each first viewing data can be treated as a sentence in which the vocabulary vector of each anchor identity data is one word. Input words and output words are then selected to train the neural network, and after training is complete the embedded word vector of each input word is obtained from the matrix W_{V×N}. Understandably, input words that share the same output word have the same or similar embedded word vectors. For example, if some first viewing data contain anchor identity data for anchor A and anchor B, and other first viewing data contain anchor identity data for anchor C and anchor B, then when the input word is the vocabulary vector of anchor A or anchor C, the output word is likely to be the vocabulary vector of anchor B. The embedded word vectors of anchor A and anchor C are therefore similar or identical. Note that obtaining embedded word vectors with Word2Vec can be regarded as an embedding process.
Step 230: Obtain the corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data.
Exemplarily, after the embedded word vector corresponding to each vocabulary vector is obtained, the embedded word vectors are processed to obtain the second viewing data. In the embodiment, the second viewing data is a vector derived from the embedded word vectors, and its dimension is smaller than that of the first viewing data. In one embodiment, the second viewing data can be obtained from the embedded word vectors of the anchor identity data by taking the average, maximum, or minimum. Taking the average as an example, the values at the same position in each embedded word vector are averaged, and the vector formed by the per-position averages is used as the second viewing data. For example, if the first viewing data contains anchor identity data for anchor A, anchor B, and anchor C, averaging the three corresponding embedded word vectors gives the second viewing data [0.4234, 0.762, 0.4234], where the first 0.4234 is the average of the first values of the three embedded word vectors, and so on.
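The position-wise pooling described above can be sketched as follows. The function name `pool` and the embedding values are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def pool(embeddings, how="mean"):
    """Combine several embedded word vectors position by position."""
    columns = list(zip(*embeddings))  # values at the same position
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    if how == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {how}")


# Hypothetical 3-dimensional embedded word vectors for three anchors.
anchor_embeddings = {
    "A": [0.2, 0.9, 0.1],
    "B": [0.4, 0.6, 0.5],
    "C": [0.6, 0.3, 0.9],
}
first_viewing = ["A", "B", "C"]  # anchors watched by one user
second_viewing = pool([anchor_embeddings[a] for a in first_viewing])
print(second_viewing)
```

The result has the embedding dimension (3 here) regardless of how many anchors exist in total, which is why the second viewing data is lower-dimensional than the first.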
Step 240: Find similar viewing users in the user group according to the second viewing data.
This step is handled in the same way as finding similar viewing users in the user group according to the first viewing data, for example by bucketing with locality-sensitive hashing and then searching for similar viewing users within the buckets; the embodiment does not repeat the details.
Step 250: Take each user in the user group as a user node, and connect the user nodes corresponding to similar viewing users through edges to obtain a node relationship graph.
A node relationship graph is a graph in which the relationships between nodes are represented by connecting edges. In this step, the node relationship graph is built from the user group and the similar viewing users within it. Each user appears as one node in the node relationship graph; in the embodiment, a node representing a user is called a user node. An edge is drawn between the user nodes of similar viewing users. Understandably, the layout of the user nodes in the node relationship graph can be chosen according to the actual situation and is not limited by the embodiment. Optionally, the higher the similarity between two similar viewing users, the greater the weight of the corresponding edge.
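A node relationship graph of this kind can be held in a plain adjacency dictionary. The sketch below is one possible representation under stated assumptions: the name `build_node_graph`, the input format (a mapping from user pairs to similarity), and the use of the similarity itself as the edge weight are all illustrative choices, not details from the patent.

```python
def build_node_graph(users, similar_pairs):
    """Return {user: {neighbor: weight}} with one node per user."""
    graph = {user: {} for user in users}  # users without edges stay isolated
    for (u, v), similarity in similar_pairs.items():
        # Undirected edge; assumption: edge weight equals viewing similarity.
        graph[u][v] = similarity
        graph[v][u] = similarity
    return graph


users = ["A", "B", "C", "D"]
similar_pairs = {("A", "B"): 0.9, ("B", "D"): 0.8}  # hypothetical similarities
graph = build_node_graph(users, similar_pairs)
print(graph["B"])  # {'A': 0.9, 'D': 0.8}
```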
Step 260: Process the node relationship graph using a label propagation algorithm to determine similar viewing user groups.
The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is to use the label information of labeled nodes to predict the label information of unlabeled nodes, which enables local community division. In the embodiment, in the initial stage of LPA, a label is assigned to each user node in the node relationship graph. In each iteration, every user node updates its label according to the labels of the user nodes connected to it, until the iterations end, and the similar viewing user groups are then obtained from the labels. The update rule is that a user node adopts the label that occurs most often among its connected user nodes. When the similar viewing user groups are determined in this way, step 260 may include steps 261 to 266:
Step 261: Assign a corresponding label to each user node in the node relationship graph.
The embodiment does not limit the label generation rule. The labels assigned to the user nodes at this point can be regarded as initial labels, and each user node has a different initial label. In the embodiment, the node relationship graph is assumed to contain M user nodes; user node 1 corresponds to label 1, user node i corresponds to label i, with 1≤i≤M, and so on.
Step 262: Find a user node in the node relationship graph, and find all neighbor user nodes of the user node according to its edge connections.
Exemplarily, every user node is processed in the same way, so one user node is described as an example. In one embodiment, a user node is found in the node relationship graph; the embodiment does not limit the search rule, and the user nodes may, for example, be visited in their order of arrangement. After the user node is found, its neighbor user nodes are found. A neighbor user node is a user node connected to the user node by an edge, or a user node whose connecting edge to the user node has a weight greater than a set threshold. In general, a user node and its neighbor user nodes are similar viewing users. Understandably, a user node may have one or more neighbor user nodes, or none. If there is no neighbor user node, another user node is selected and this step is repeated. If there is a neighbor user node, the subsequent steps are executed.
Step 263: Count the labels of all neighbor user nodes, and update the label of the user node to the label with the most occurrences.
The label of each neighbor user node is obtained, and the label that occurs most often among them is determined. If several labels are tied for the most occurrences (for example, when every node still has its initial label and each label occurs once), one of the tied labels is selected at random. The label of the current user node is then updated to the label with the most occurrences.
Understandably, nodes with the same label belong to the same community. After the update is complete, the community to which the current node belongs can be determined.
Step 264: Find another user node in the node relationship graph, and return to the operation of finding all neighbor user nodes of the user node according to its edge connections, until all user nodes in the node relationship graph have been traversed.
After the label has been updated, another user node is found in the node relationship graph, and execution returns to the neighbor search of step 262. When all user nodes in the node relationship graph have been traversed, the current round of traversal is determined to have ended. That is, the round ends after the M user nodes have been traversed (i.e., for i = 1:M).
Step 265: Determine whether the traversal end condition is currently satisfied. If it is not satisfied, return to step 262. If it is satisfied, execute step 266.
The traversal end condition is the condition under which traversal stops, and its content can be set according to the actual situation. In the embodiment, the traversal end condition is that a traversal count threshold has been reached or that no user node label in the node relationship graph has changed. In one embodiment, the traversal count threshold is set according to the actual situation. After each round of traversal ends, the recorded traversal count is incremented by 1, and it is then determined whether the traversal count has reached the threshold; if so, the traversal end condition is satisfied; otherwise, it is not satisfied and a new round of traversal begins. In another embodiment, after the current round of traversal is complete, it is determined whether any user node label has changed. If the label of at least one user node has changed, the traversal end condition is not satisfied and a new round of traversal begins; if no user node label has changed, the traversal end condition is satisfied. In yet another embodiment, after the current round of traversal is complete, it is determined whether any user node label has changed. If the label of at least one user node has changed, it is then determined whether the traversal count has reached the traversal count threshold: if it has, the traversal end condition is satisfied; otherwise, it is not satisfied and a new round of traversal begins. If no user node label has changed, the traversal end condition is satisfied.
Step 266: Classify the users corresponding to user nodes with the same label into the same similar viewing user group.
User nodes with the same label are found in the node relationship graph and grouped to obtain the similar viewing user groups. The user nodes in each similar viewing user group share the same label, which can serve as the ID of that group. For example, with user A, user B, user C, and user D as user nodes in the graph, suppose that after the LPA algorithm ends the user nodes are recorded as [1,A], [1,B], [2,C], [1,D], where the first field is the ID of the similar viewing user group. User A, user B, and user D then belong to the same similar viewing user group.
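Steps 261 to 266 can be sketched as a short routine. This is a minimal illustration under stated assumptions: the name `label_propagation`, the adjacency-dictionary input, the default of 20 rounds, and the seeded random generator are hypothetical; edge weights are ignored and labels are simply counted, as in step 263. Because ties are broken at random, the exact grouping of connected users can vary with the seed.

```python
import random
from collections import Counter


def label_propagation(graph, max_rounds=20, seed=0):
    """Group nodes of {node: {neighbor: weight}} by propagated label."""
    rng = random.Random(seed)
    # Step 261: node i starts with its own distinct label i.
    labels = {node: i for i, node in enumerate(graph, start=1)}
    for _ in range(max_rounds):  # traversal count threshold
        changed = False
        for node in graph:  # steps 262-264: visit every user node once
            neighbors = graph[node]
            if not neighbors:
                continue  # no neighbor user node: move on to the next node
            counts = Counter(labels[n] for n in neighbors)
            top = max(counts.values())
            tied = sorted(lab for lab, c in counts.items() if c == top)
            new_label = rng.choice(tied)  # random pick among tied labels
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # step 265: no label changed in this round
            break
    groups = {}
    for node, label in labels.items():  # step 266: same label, same group
        groups.setdefault(label, set()).add(node)
    return groups


graph = {
    "A": {"B": 1.0, "D": 1.0},
    "B": {"A": 1.0, "D": 1.0},
    "C": {},
    "D": {"A": 1.0, "B": 1.0},
}
groups = label_propagation(graph)
print(groups)
```

The dictionary keys play the role of the group IDs. User C has no edges, so it always keeps its initial label and forms a group of its own.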
Step 270: Determine the target user groups belonging to group control accounts according to the similar viewing user groups.
In the embodiment, this step includes at least one of the following schemes:
Scheme 1: If the number of users in a similar viewing user group is greater than or equal to a quantity threshold, the similar viewing user group is determined to be a target user group belonging to group control accounts.
Exemplarily, the quantity threshold used here is the minimum number of users that group control accounts contain, and its value can be set according to the actual situation, for example 50. If the number of users in a similar viewing user group is greater than or equal to the quantity threshold, the group is confirmed as a target user group. Once every similar viewing user group has been processed in this way, the target user groups have been mined.
Scheme 2: If multiple users in a similar viewing user group have the same device information and/or network address information, the similar viewing user group is determined to be a target user group belonging to group control accounts.
Exemplarily, device information is information about the device a user uses to watch live streams, such as a device identifier; different devices have different device information. Network address information is the network address a user uses to watch live streams, which may be an IP address. The embodiment is described with both device information and network address information being acquired; in practice, only one type of information may be acquired, and it is processed in the same way. In general, device information and network address information are unlikely to be reused across non-group-control accounts and are likely to be reused across group control accounts, for example when different accounts are logged in on a single device to inflate a live room's audience. In one embodiment, if multiple users in a similar viewing user group share the same device information and/or network address information, reuse is determined to exist and the similar viewing user group is determined to be a target user group. Optionally, a first same-user-number threshold may be set, representing the minimum number of users in group control accounts that have the same device information and/or network address information. If the number of users with the same device information and/or network address information reaches the first same-user-number threshold, reuse is determined to exist, i.e., network addresses and/or devices are aggregated, and the similar viewing user group is therefore determined to be a target user group.
In one embodiment, a similar viewing user group may instead be a big-anchor user group. A big anchor has a very large audience and a very large number of followers; the embodiment does not limit the criteria for classifying an anchor as a big anchor. A big-anchor user group is one whose users regularly watch the same few big anchors. Understandably, a big-anchor user group is characterized by a low probability of device information and network address information being reused among its users. Accordingly, if the users in a similar viewing user group all have different device information and network address information, the similar viewing user group is determined to be a big-anchor user group. Alternatively, if some users in the similar viewing user group share the same device information or network address information but the number of such users is small (for example, below a second same-user-number threshold), the similar viewing user group is determined to be a big-anchor user group, where the second same-user-number threshold is lower than the first same-user-number threshold.
Note that the number of users and the reuse situation can also be combined to obtain target user groups and big-anchor user groups. For example, when the number of users in a similar viewing user group is greater than or equal to the quantity threshold, it is determined to be a suspected target user group; if reuse exists in the suspected target user group, it is determined to be a target user group, and if no reuse exists, it is determined to be a big-anchor user group.
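The combined decision (user count first, then reuse of device or network address information) can be sketched as below. All names here are hypothetical: `classify_group`, the return strings, the lookup tables `device_of` and `ip_of`, and the default of 2 for the first same-user-number threshold are assumptions made for illustration; only the default quantity threshold of 50 comes from the text.

```python
from collections import Counter


def classify_group(members, device_of, ip_of,
                   count_threshold=50, same_user_threshold=2):
    """Classify one similar viewing user group."""
    if not members or len(members) < count_threshold:
        return "other"  # too small to be a suspected target user group
    # Largest number of users sharing one device or one network address.
    device_counts = Counter(device_of[u] for u in members)
    ip_counts = Counter(ip_of[u] for u in members)
    max_shared = max(max(device_counts.values()), max(ip_counts.values()))
    if max_shared >= same_user_threshold:
        return "group_control"  # reuse exists: target user group
    return "big_anchor"  # no reuse: big-anchor user group


members = ["u1", "u2", "u3"]
ip_of = {"u1": "10.0.0.1", "u2": "10.0.0.2", "u3": "10.0.0.3"}
shared = {"u1": "dev1", "u2": "dev1", "u3": "dev2"}
distinct = {"u1": "dev1", "u2": "dev2", "u3": "dev3"}

# A small count_threshold is used so the toy group qualifies.
print(classify_group(members, shared, ip_of, count_threshold=3))
print(classify_group(members, distinct, ip_of, count_threshold=3))
```

With the shared-device table the group is classified as group control; with all-distinct devices and addresses it falls into the big-anchor case.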
In the above, training embedded word vectors avoids the problem that an excessive number of anchors hampers subsequent computation, and reduces the dimension of the viewing data used in subsequent computation. In addition, constructing the node relationship graph and applying the LPA algorithm allows similar viewing user groups to be found accurately, and group control accounts are then identified among them by combining the number of users with the aggregation of device information and/or network address information. This ensures the accuracy of group control account identification and, being unsupervised, reduces the reliance on labels.
On the basis of the above embodiment, device information and/or network address information can also be added to the node relationship graph, after which LPA is used directly to mine group control devices from the node relationship graph. In this case, step 250 further includes: acquiring the device information and/or network address information of each user in the user group; adding the device information and/or network address information to the node relationship graph as information nodes; and connecting each user node to its corresponding information nodes through edges.
In one embodiment, nodes representing device information and/or nodes representing network address information are added to the node relationship graph, with one node per device information and one node per network address information. In the embodiment, nodes representing device information and network address information are collectively called information nodes, and the description assumes that both types of information nodes are added. Exemplarily, if a user uses certain device information, the user's user node and the information node of that device information are connected by an edge; connections between user nodes and information nodes representing network address information are established in the same way. The node relationship graph then also records which devices and network addresses each user uses. Understandably, when LPA subsequently processes the node relationship graph, determining the neighbor user nodes of a user node takes into account not only the connected user nodes but also the connected information nodes. For example, edges to information nodes are given a higher weight while the weight of edges between user nodes is reduced, and during the neighbor search, the user nodes of similar viewing users that share device information or network address information are taken as the neighbor user nodes found. Similar viewing user groups mined in this way exclude big-anchor user groups. Therefore, step 270 can determine whether a similar viewing user group is a target user group directly from its number of users, without considering reuse.
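Adding information nodes to an existing node relationship graph can be sketched as follows. The names `add_info_nodes`, `device_of`, `ip_of`, and the weight value 5.0 are hypothetical; the text only says that edges to information nodes receive a higher weight than edges between user nodes, without giving numbers.

```python
def add_info_nodes(graph, device_of, ip_of, info_weight=5.0):
    """Add device and network address nodes to {node: {neighbor: weight}}."""
    for user in list(graph):  # snapshot: new nodes are added while looping
        for kind, table in (("device", device_of), ("ip", ip_of)):
            value = table.get(user)
            if value is None:
                continue  # no information of this kind for the user
            info_node = (kind, value)  # one node per distinct value
            graph.setdefault(info_node, {})
            # Assumption: information edges get a fixed, higher weight.
            graph[user][info_node] = info_weight
            graph[info_node][user] = info_weight
    return graph


graph = {"A": {"B": 0.9}, "B": {"A": 0.9}, "C": {}}
device_of = {"A": "dev1", "B": "dev1", "C": "dev2"}
ip_of = {"A": "10.0.0.1"}
add_info_nodes(graph, device_of, ip_of)
print(graph[("device", "dev1")])  # {'A': 5.0, 'B': 5.0}
```

Users A and B now meet at the shared device node, so a weight-aware neighbor search can treat them as neighbors through that node.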
In the above, adding device information and/or network address information to the node relationship graph increases the probability that the similar viewing user groups mined by the LPA algorithm are group control accounts, avoids mining big-anchor user groups, and reduces the computational complexity of subsequent operations.
Fig. 5 is a schematic structural diagram of a group control account mining apparatus provided by an embodiment of the present application. Referring to Fig. 5, the group control account mining apparatus includes a data acquisition module 301, a user search module 302, and a group control determination module 303.
The data acquisition module 301 is configured to acquire first viewing data of a user group within a set time period, where each user in the user group corresponds to one first viewing data and each first viewing data contains the identity data of the anchors watched by the corresponding user within the set time period. The user search module 302 is configured to find similar viewing users in the user group according to the first viewing data. The group control determination module 303 is configured to mine similar viewing user groups from the user group according to the similar viewing users, and to determine the target user groups belonging to group control accounts according to the similar viewing user groups.
On the basis of the above embodiment, the apparatus further includes: a training module, configured to, before similar viewing users are found in the user group according to the first viewing data, use the vocabulary vectors corresponding to the first viewing data as training data to train and obtain the embedded word vector corresponding to each vocabulary vector, where each anchor identity data corresponds to one vocabulary vector, the length of a vocabulary vector equals the current total number of anchors, and the length of an embedded word vector is smaller than the length of a vocabulary vector; and a viewing data determination module, configured to obtain the corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data. Correspondingly, the user search module 302 is specifically configured to find similar viewing users in the user group according to the second viewing data.
On the basis of the above embodiment, the user search module 302 includes: a similarity calculation sub-module, configured to calculate the viewing similarity between users in the user group according to the first viewing data; and a similarity determination sub-module, configured to find similar viewing users in the user group according to the viewing similarity.
On the basis of the above embodiment, the similarity calculation sub-module includes: a bucketing unit, configured to bucket the first viewing data using locality-sensitive hashing; and an in-bucket calculation unit, configured to calculate the viewing similarity between the first viewing data within each bucket.
On the basis of the above embodiment, the bucketing unit includes: a signature calculation subunit, configured to perform a min-hash calculation on each first viewing data to obtain the corresponding signature vector; a mapping subunit, configured to divide each signature vector into multiple bands and map each band to a corresponding hash bucket using a hash function, where there is at least one hash function; and a bucket division subunit, configured to place the first viewing data whose bands are mapped to the same hash bucket into the same bucket.
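The bucketing unit described above (min-hash signature, bands, hash buckets) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function names, the use of seeded MD5 digests as the hash family, and the signature and band sizes are assumptions. Python's dictionary hashing of the band tuple stands in for the band hash function, and every user is assumed to have watched at least one anchor.

```python
import hashlib
from collections import defaultdict


def _hash(seed, item):
    """One member of a seeded hash family (assumption: MD5-based)."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(anchor_ids, num_hashes=8):
    """Signature vector: the minimum hash of the set under each function."""
    return [min(_hash(seed, a) for a in anchor_ids)
            for seed in range(num_hashes)]


def lsh_buckets(viewing, num_hashes=8, bands=4):
    """Map users to buckets keyed by (band index, band contents)."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for user, anchors in viewing.items():
        signature = minhash_signature(anchors, num_hashes)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(user)  # same band value, same bucket
    return buckets


def candidate_pairs(buckets):
    """User pairs that share at least one bucket."""
    pairs = set()
    for bucket_users in buckets.values():
        ordered = sorted(bucket_users)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                pairs.add((u, v))
    return pairs


viewing = {
    "u1": {"anchorA", "anchorB", "anchorC"},
    "u2": {"anchorA", "anchorB", "anchorC"},
    "u3": {"anchorX", "anchorY", "anchorZ"},
}
pairs = candidate_pairs(lsh_buckets(viewing))
print(("u1", "u2") in pairs)  # True: identical sets share every band
```

Only pairs that land in a common bucket need an exact similarity computation, which is what the in-bucket calculation unit would then perform.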
On the basis of the above embodiment, the group control determination module 303 includes: a relationship graph construction sub-module, configured to take each user in the user group as a user node and connect the user nodes corresponding to similar viewing users through edges to obtain a node relationship graph; a label propagation sub-module, configured to process the node relationship graph using a label propagation algorithm to determine similar viewing user groups; and a first determination sub-module, configured to determine the target user groups belonging to group control accounts according to the similar viewing user groups.
On the basis of the above embodiment, the label propagation sub-module includes: a label assignment unit, configured to assign a corresponding label to each user node in the node relationship graph; a neighbor search unit, configured to find a user node in the node relationship graph and find all neighbor user nodes of the user node according to its edge connections; a label update unit, configured to count the labels of all neighbor user nodes and update the label of the user node to the label with the most occurrences; a first traversal unit, configured to find another user node in the node relationship graph and return to the operation of finding all neighbor user nodes of the user node according to its edge connections, until all user nodes in the node relationship graph have been traversed; an end determination unit, configured to determine whether the traversal end condition is currently satisfied, the traversal end condition being that a traversal count threshold has been reached or that no user node label in the node relationship graph has changed; a second traversal unit, configured to, if the traversal end condition is not satisfied, return to the operation of finding a user node in the node relationship graph until the traversal end condition is satisfied; and a node division unit, configured to classify the users corresponding to nodes with the same label into the same similar viewing user group.
On the basis of the above embodiment, the relationship graph construction sub-module is further configured to: acquire the device information and/or network address information of each user in the user group; add the device information and/or network address information to the node relationship graph as information nodes; and connect each user node to its corresponding information nodes through edges.
On the basis of the above embodiment, the group control determination module 303 includes: a first mining sub-module, configured to mine similar viewing user groups from the user group according to the similar viewing users; and a second determination sub-module, configured to determine a similar viewing user group to be a target user group belonging to group control accounts if its number of users is greater than or equal to a quantity threshold.
On the basis of the above embodiment, the group control determination module 303 includes: a second mining sub-module, configured to mine similar viewing user groups from the user group according to the similar viewing users; and a third determination sub-module, configured to determine a similar viewing user group to be a target user group belonging to group control accounts if multiple users in it have the same device information and/or network address information.
The group control account mining apparatus provided above can be used to execute the group control account mining method provided by any of the above embodiments, and has the corresponding functions and beneficial effects.
It is worth noting that, in the above embodiment of the group control account mining apparatus, the units and modules are divided only according to functional logic; the apparatus is not limited to this division, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present invention.
FIG. 6 is a schematic structural diagram of a group control account mining device according to an embodiment of the present application. As shown in FIG. 6, the group control account mining device includes a processor 40, a memory 41, an input device 42, and an output device 43. The device may contain one or more processors 40; FIG. 6 takes one processor 40 as an example. The processor 40, the memory 41, the input device 42, and the output device 43 may be connected by a bus or by other means; FIG. 6 takes a bus connection as an example.
As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the group control account mining method in the embodiments of the present application (for example, the data acquisition module, the user search module, and the group control determination module in the group control account mining apparatus). The processor 40 executes the various functional applications and data processing of the group control account mining device by running the software programs, instructions, and modules stored in the memory 41, that is, it implements the above group control account mining method.
The memory 41 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the group control account mining device. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, and such remote memory may be connected to the group control account mining device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 42 may be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the group control account mining device. The output device 43 may include a display device such as a display screen.
The above group control account mining device includes the group control account mining apparatus, can be used to execute any of the group control account mining methods, and has the corresponding functions and beneficial effects.
In addition, an embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the relevant operations of the group control account mining method provided by any embodiment of the present application, with the corresponding functions and beneficial effects.
The above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the concept of the present application; the scope of the present application is determined by the scope of the appended claims.

Claims (13)

  1. A group control account mining method, comprising:
    acquiring first viewing data of a user group within a set time period, wherein each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes identity data of the anchors watched by the corresponding user within the set time period;
    finding similar viewing users in the user group according to the first viewing data; and
    mining a similar viewing user group from the user group according to the similar viewing users, and determining, according to the similar viewing user group, a target user group belonging to group control accounts.
  2. The group control account mining method according to claim 1, wherein, before finding similar viewing users in the user group according to the first viewing data, the method comprises:
    using the vocabulary vectors corresponding to each piece of first viewing data as training data, so as to obtain through training an embedded word vector corresponding to each vocabulary vector, wherein each piece of anchor identity data corresponds to one vocabulary vector, the length of the vocabulary vector is equal to the current total number of anchors, and the length of the embedded word vector is less than the length of the vocabulary vector; and
    obtaining corresponding second viewing data according to the embedded word vectors corresponding to the first viewing data;
    and wherein finding similar viewing users in the user group according to the first viewing data comprises:
    finding similar viewing users in the user group according to the second viewing data.
  3. The group control account mining method according to claim 1, wherein finding similar viewing users in the user group according to the first viewing data comprises:
    calculating the viewing similarity between the users in the user group according to the first viewing data; and
    finding similar viewing users in the user group according to the viewing similarity.
  4. The group control account mining method according to claim 3, wherein calculating the viewing similarity between the users in the user group according to the first viewing data comprises:
    bucketing each piece of first viewing data by using locality-sensitive hashing; and
    calculating the viewing similarity between the pieces of first viewing data within each bucket.
  5. The group control account mining method according to claim 4, wherein bucketing each piece of first viewing data by using locality-sensitive hashing comprises:
    performing a min-hash calculation on each piece of first viewing data to obtain a corresponding signature vector;
    dividing each signature vector into multiple bands, and mapping each band into a corresponding hash bucket by using at least one hash function; and
    placing the pieces of first viewing data whose bands are mapped into the same hash bucket into the same bucket.
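The min-hash and banding steps of claim 5 can be sketched as follows. This is an illustrative sketch under stated assumptions: each piece of first viewing data is treated as a set of anchor IDs, MD5 with a seed prefix stands in for the family of hash functions, a Python dictionary keyed by band index and band contents plays the role of the hash buckets, and the names `minhash_signature` and `lsh_buckets` and the parameter defaults are hypothetical.

```python
import hashlib
from collections import defaultdict


def _seeded_hash(seed, value):
    """Deterministic hash of value under the given seed (MD5 is an assumption)."""
    digest = hashlib.md5(f"{seed}:{value}".encode()).hexdigest()
    return int(digest, 16)


def minhash_signature(anchor_ids, num_hashes=8):
    """Signature vector: the minimum hash of the set under each seeded hash."""
    return [min(_seeded_hash(seed, a) for a in anchor_ids) for seed in range(num_hashes)]


def lsh_buckets(viewing_data, num_hashes=8, rows_per_band=2):
    """Return candidate buckets: sets of users sharing at least one band.

    viewing_data: dict user -> set of anchor IDs watched in the time window.
    """
    buckets = defaultdict(set)
    for user, anchors in viewing_data.items():
        if not anchors:
            continue  # nothing to hash for a user with no viewing record
        signature = minhash_signature(anchors, num_hashes)
        # Split the signature into bands; each band is mapped to a bucket.
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            buckets[(start // rows_per_band, band)].add(user)
    # Only buckets holding more than one user yield pairs to compare.
    return [users for users in buckets.values() if len(users) > 1]
```

Viewing similarity then only needs to be computed between users that land in the same bucket, which avoids comparing every pair of users in the user group.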
  6. The group control account mining method according to any one of claims 1-5, wherein mining a similar viewing user group from the user group according to the similar viewing users comprises:
    taking each user in the user group as a user node, and connecting the user nodes corresponding to the similar viewing users by edges, so as to obtain a node relationship graph; and
    processing the node relationship graph by using a label propagation algorithm to determine the similar viewing user group.
  7. The group control account mining method according to claim 6, wherein processing the node relationship graph by using a label propagation algorithm to determine the similar viewing user group comprises:
    assigning a corresponding label to each user node in the node relationship graph;
    selecting a user node in the node relationship graph, and finding all neighbor user nodes of the user node according to the edge connections of the user node;
    counting the labels of all the neighbor user nodes, and updating the label of the user node to the label that occurs most often;
    selecting another user node in the node relationship graph, and returning to the operation of finding all neighbor user nodes of the user node according to the edge connections of the user node, until all user nodes in the node relationship graph have been traversed;
    judging whether a traversal end condition is currently met, the traversal end condition being that a traversal count threshold is reached or that no user node label in the node relationship graph has changed;
    if the traversal end condition is not met, returning to the operation of selecting a user node in the node relationship graph, until the traversal end condition is met; and
    classifying the users corresponding to user nodes with the same label into the same similar viewing user group.
  8. The group control account mining method according to claim 6, wherein, when taking each user in the user group as a user node and connecting the user nodes corresponding to the similar viewing users by edges to obtain the node relationship graph, the method further comprises:
    obtaining device information and/or network address information of each user in the user group; and
    adding the device information and/or the network address information to the node relationship graph as information nodes, and connecting the user nodes to the corresponding information nodes by edges.
  9. The group control account mining method according to claim 1, 6 or 8, wherein determining, according to the similar viewing user group, the target user group belonging to group control accounts comprises:
    if the number of users in the similar viewing user group is greater than or equal to a number threshold, determining the similar viewing user group as the target user group belonging to group control accounts.
  10. The group control account mining method according to claim 1 or 6, wherein determining, according to the similar viewing user group, the target user group belonging to group control accounts comprises:
    if multiple users in the similar viewing user group have the same device information and/or network address information, determining the similar viewing user group as the target user group belonging to group control accounts.
  11. A group control account mining apparatus, comprising:
    a data acquisition module, configured to acquire first viewing data of a user group within a set time period, wherein each user in the user group corresponds to one piece of first viewing data, and each piece of first viewing data includes identity data of the anchors watched by the corresponding user within the set time period;
    a user search module, configured to find similar viewing users in the user group according to the first viewing data; and
    a group control determination module, configured to mine a similar viewing user group from the user group according to the similar viewing users, and to determine, according to the similar viewing user group, a target user group belonging to group control accounts.
  12. A group control account mining device, comprising: a memory and one or more processors;
    the memory being configured to store one or more programs;
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the group control account mining method according to any one of claims 1-10.
  13. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the group control account mining method according to any one of claims 1-10.
PCT/CN2022/072806 2021-01-25 2022-01-19 Method and apparatus for group control account excavation, device, and storage medium WO2022156720A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110098987.5 2021-01-25
CN202110098987.5A CN112819056A (en) 2021-01-25 2021-01-25 Group control account mining method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022156720A1 true WO2022156720A1 (en) 2022-07-28

Family

ID=75859172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/072806 WO2022156720A1 (en) 2021-01-25 2022-01-19 Method and apparatus for group control account excavation, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112819056A (en)
WO (1) WO2022156720A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819056A (en) * 2021-01-25 2021-05-18 百果园技术(新加坡)有限公司 Group control account mining method, device, equipment and storage medium
CN113449309B (en) * 2021-06-28 2023-10-27 平安银行股份有限公司 Terminal security state identification method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210883A (en) * 2018-05-09 2019-09-06 腾讯科技(深圳)有限公司 The recognition methods of team control account, device, server and storage medium
CN110413707A (en) * 2019-07-22 2019-11-05 百融云创科技股份有限公司 The excavation of clique's relationship is cheated in internet and checks method and its system
CN111401775A (en) * 2020-03-27 2020-07-10 深圳壹账通智能科技有限公司 Information analysis method, device, equipment and storage medium of complex relation network
US20200311159A1 (en) * 2019-03-31 2020-10-01 Td Ameritrade Ip Company, Inc. Recommendation System for Providing Personalized and Mixed Content on a User Interface based on Content and User Similarity
CN112819056A (en) * 2021-01-25 2021-05-18 百果园技术(新加坡)有限公司 Group control account mining method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105741175B (en) * 2016-01-27 2019-08-20 电子科技大学 A method of account in online social networks is associated

Also Published As

Publication number Publication date
CN112819056A (en) 2021-05-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22742199

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22742199

Country of ref document: EP

Kind code of ref document: A1