CN110362831B - Target user identification method, device, electronic equipment and storage medium - Google Patents
- Publication number: CN110362831B
- Application number: CN201910649162.0A
- Authority: CN (China)
- Prior art keywords: user, characteristic value, target, barrage, day
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/442—Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
- H04N21/44213—Monitoring of end-user related data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/442—Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
- H04N21/44213—Monitoring of end-user related data
- H04N21/44222—Analytics of user selections, e.g. selection of programs or purchase activity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A target user identification method, applied in the technical field of communications, comprises: acquiring text features of all barrages (bullet-screen comments) posted by a user on the current day; constructing a logistic regression model from the text features to obtain the characteristic value of the user on the current day; acquiring historical barrage data of the user; and judging whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data. The invention also discloses a target user identification device, electronic equipment and a storage medium. By combining the historical barrage data posted by the user, the user is identified comprehensively, which improves the accuracy of target user identification and avoids the problem that normal users cannot use the barrage function to enter barrage messages.
Description
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a target user identification method, apparatus, electronic device, and storage medium.
Background
The barrage is one of the most direct means of user interaction on a live-streaming platform. A high-quality barrage culture makes it easier for a platform to leave a good impression on users and thereby grow its user base. However, during live broadcasts some users often post barrages whose content does not conform to the barrage rules, resulting in an excessive number of barrages occupying the entire screen. Such users are referred to here as target users.
The main characteristic of a target user is uncivil language containing keywords such as profanity. The prior art filters barrages for keywords using regular expressions and then bans any user who posted a barrage containing a keyword from posting again. This approach is one-sided: normal users are easily misjudged as target users, accuracy is low, and normal users are consequently unable to use the barrage function to enter barrage messages.
Disclosure of Invention
The invention mainly aims to provide a target user identification method, a target user identification device, electronic equipment and a storage medium, which can improve the accuracy of target user identification and prevent the problem that a normal user cannot normally use a barrage function to key barrage information.
To achieve the above object, a first aspect of an embodiment of the present invention provides a target user identification method, including:
acquiring text characteristics of all barrages released by a user in the same day;
constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
acquiring historical barrage data of the user;
and judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
Further, the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data includes:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
and if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user.
Further, the calculating the comprehensive feature value of the user based on the feature value of the user on the current day and the historical barrage data comprises:
acquiring the number of active days $N$ of the user, the set $T_{total}$ of all historical barrages of the user, the set $T_{spam}$ of historical target barrages of the user, the set $C_{total}$ of all barrages of the user on the current day, and the set $C_{spam}$ of target barrages of the user on the current day;
letting the comprehensive characteristic value of the user be $Y$ and the characteristic value of the user on the current day be $f(x)$, then: [equation not preserved in the source text]
wherein $\sigma$ is a preset parameter, and $1 \le \sigma \le 1.5$.
Further, the calculating the comprehensive feature value of the user based on the feature value of the user on the current day and the historical barrage data comprises:
acquiring the active days N of the user;
letting the comprehensive characteristic value of the user be $Y$, the characteristic value of the user on the current day be $f(x)$, and the characteristic value of the user on the previous day be denoted by its own symbol [symbol not preserved in the source text], then: [equation not preserved in the source text]
Further, the text features include punctuation features, expression features, negative word features, and the TF-IDF value of each term in all of the barrages.
Further, the constructing a logistic regression model according to the text features, and obtaining the feature value of the user on the same day includes:
acquiring preset weight coefficients corresponding to all the characteristics in the text characteristics;
letting the $i$-th feature in the text features be $x_i$, the preset weight coefficient corresponding to $x_i$ be $\theta_i$, and the characteristic value of the user on the current day be $f(x)$, then:
$$f(x) = \frac{1}{1 + e^{-\left(\sum_{i} \theta_i x_i + b\right)}}$$
where $e$ is the natural constant and $b$ is a natural number.
Further, when the text feature includes TF-IDF values of the respective terms in the all the barrages, the obtaining the text feature of all the barrages issued by the user on the same day includes:
letting the TF-IDF value of the $i$-th term in the $j$-th barrage be $\mathrm{TF\text{-}IDF}_{i,j}$, then:
$$\mathrm{TF\text{-}IDF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
wherein $n_{i,j}$ represents the number of occurrences of the $i$-th term in the $j$-th barrage, $\sum_{k} n_{k,j}$ represents the number of all terms in the $j$-th barrage, $|D|$ represents the total number of barrages, and $|\{j : t_i \in d_j\}|$ represents the number of barrages, among all barrages, that contain the $i$-th term.
A second aspect of an embodiment of the present invention provides a target user identifying apparatus, including:
the first acquisition module is used for acquiring text characteristics of all the barrages released by the user in the same day;
the construction module is used for constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
the second acquisition module is used for acquiring the historical barrage data of the user;
and the judging module is used for judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
A third aspect of an embodiment of the present invention provides an electronic device, including:
the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, and is characterized in that the processor realizes the target user identification method provided by the first aspect of the embodiment of the invention when executing the program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the target user identification method provided by the first aspect of the embodiments of the present invention.
According to the target user identification method, the target user identification device, the electronic equipment and the storage medium, text characteristics of all the barrages issued by the user on the same day are obtained, a logistic regression model is constructed according to the text characteristics, the characteristic value of the user on the same day is obtained, historical barrage data of the user are obtained, whether the user is the target user is judged based on the characteristic value of the user on the same day and the historical barrage data, the user is comprehensively identified by combining the historical barrage data issued by the user, the identification accuracy of the target user is improved, and the problem that normal users cannot normally use barrage functions to key barrage information is prevented.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings may be obtained from them without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a target user identification method according to an embodiment of the invention;
FIG. 2 is a flowchart of a target user identification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a target user identification apparatus according to an embodiment of the present invention;
Fig. 4 shows a hardware configuration diagram of an electronic device.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention will be clearly described in conjunction with the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the target user identification method of the present disclosure, some steps may be performed on a terminal and the remaining steps on a server, or all steps may be performed on a terminal (for example, for offline target user identification). The descriptions below in which steps are performed by the server are therefore exemplary rather than the only possible manner of execution.
Referring to fig. 1, fig. 1 is a flowchart of a target user identification method according to an embodiment of the present invention, where the method can be applied to an electronic device with a barrage release function, and the electronic device includes: a cell phone, tablet (Portable Android Device, PAD), notebook, personal digital assistant (Personal Digital Assistant, PDA), etc., the method comprising the steps of:
S101, acquiring text features of all barrages posted by a user on the current day;
On a live-streaming website, a user communicates with the host or with other users by posting barrages. The content characteristics of the barrages a user posts can indicate whether the user is a normal user or a potential target user. In the embodiment of the invention, the content characteristics of a barrage mainly comprise punctuation features, expression features, negative word features, and the TF-IDF values of the terms in all barrages.
The bullet screen released by the user can be a bullet screen obtained by a server from a terminal, and the terminal comprises, but is not limited to, electronic equipment with a bullet screen releasing function, such as a smart phone, a tablet computer, a smart home appliance and the like.
For punctuation features and expression features: language on a live-streaming platform leans toward the colloquial, so barrages differ markedly from written language, are generally short, and contain many expressions, emoticons (kaomoji), and the like. Punctuation marks and expressions in the barrages are therefore extracted as a basis for judging whether the current user is a potential target user. Examples include negative expressions and punctuation runs such as "???" and ".......". In addition, for emoji, an emoji vocabulary is constructed and each emoji is mapped to a single feature for processing.
For negative word features: target users often post barrages using words of a negative nature, for example sarcastic words such as "haha", "sham", and "living in a dream". Based on such negative words, it can be preliminarily determined whether the user is a potential target user.
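The punctuation, emoji and negative-word features described above can be sketched with regular expressions and simple counting. This is only an illustrative sketch, not the patent's implementation: the function name `extract_surface_features`, the two-entry emoji vocabulary and the punctuation pattern are assumptions made for the example, and the negative-word list reuses only the sample words quoted in the text.

```python
import re

# Hypothetical vocabularies for illustration; a real system would curate these.
NEGATIVE_WORDS = ["haha", "sham", "living in dream"]
EMOJI_VOCAB = {"😂": "emoji_laugh", "🙄": "emoji_eyeroll"}

# Runs of two or more question marks or full stops, e.g. "???" or "......."
PUNCT_RUN = re.compile(r"\?{2,}|\.{2,}")


def extract_surface_features(barrage):
    """Count punctuation runs, known emoji and negative words in one barrage."""
    features = {"punct_runs": len(PUNCT_RUN.findall(barrage))}
    # Each emoji in the vocabulary is mapped to its own single feature.
    for symbol, name in EMOJI_VOCAB.items():
        features[name] = barrage.count(symbol)
    lowered = barrage.lower()
    features["negative_words"] = sum(lowered.count(w) for w in NEGATIVE_WORDS)
    return features
```

Counting rather than flagging keeps the features usable as numeric inputs to the logistic regression model in the next step.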
The TF-IDF values of the entries in all the barrages are common technical means for determining text features for those skilled in the art, and are not described herein.
S102, constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
Logistic regression is a classification model in machine learning that is very widely used in practice because the algorithm is simple and efficient. A logistic regression model can evaluate multiple features together, which yields a more accurate evaluation result and can effectively improve the accuracy of target user discrimination. A logistic regression model is also simpler than other artificial intelligence models, which can effectively reduce the time required for discrimination. Using a logistic regression model, the server can conveniently generate a discrimination score from the features of the barrages posted by a user, for example 30 points or 90 points.
S103, acquiring historical barrage data of the user;
In order to avoid misjudging normal users, this embodiment takes the user's historical barrage data into account and comprehensively judges whether the user is a genuine target user from both the user's barrage data on the current day and the historical barrage data.
The historical barrage data of the user can be the historical barrage data obtained by the server from the terminal, or the terminal historical barrage data stored by the server.
And S104, judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
Based on the characteristic value of the user on the same day and the historical barrage data, calculating the comprehensive characteristic value of the user, judging whether the comprehensive characteristic value of the user is larger than a preset threshold value, and if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user.
The server uses the characteristic value of the user on the current day and the historical barrage data to obtain a comprehensive score for the recent period, which makes the evaluation of target users more personalized and objective. After a user is judged to be a target user, measures such as muting the user or banning the account can be taken according to the platform rules.
In the embodiment of the invention, the text characteristics of all the barrages released by the user on the same day are obtained, the logistic regression model is constructed according to the text characteristics, the characteristic value of the user on the same day is obtained, the historical barrage data of the user is obtained, whether the user is a target user or not is judged based on the characteristic value of the user on the same day and the historical barrage data, the user is comprehensively identified by combining the historical barrage data released by the user, the identification accuracy of the target user is improved, and the problem that the normal user cannot normally use the barrage function to key barrage information is prevented.
Referring to fig. 2, fig. 2 is a flowchart of a target user identification method according to an embodiment of the invention, where the method can be applied to an electronic device, and the electronic device includes: a mobile phone, a tablet personal computer (PAD), a notebook computer, a personal digital assistant (Personal Digital Assistant, PDA), etc., the method mainly comprises the following steps:
S201, acquiring text features of all barrages posted by a user on the current day;
The punctuation features, expression features and negative word features can be calculated by regular-expression matching. Feature calculation by regular-expression matching is well known to those skilled in the art and is not described here.
For the TF-IDF values of the terms in all barrages: after the punctuation features and expression features are extracted, the remaining text is segmented into words and the TF-IDF value of each term in all barrages is calculated. Specifically, a preset number of the highest-frequency words in the barrages, for example 10000 or 20000, may be kept as the words to be processed, and the other words are discarded.
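A minimal sketch of this step is shown below. It assumes the barrages have already been segmented into word lists and uses the usual TF-IDF definition: the term's frequency within a barrage multiplied by the logarithm of the total number of barrages divided by the number of barrages containing the term. The function name `tf_idf`, the natural logarithm and the default vocabulary size are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(barrages, vocab_size=10000):
    """Return one {term: tf-idf} dict per tokenised barrage (list of words)."""
    # Keep only the vocab_size most frequent terms; all other terms are discarded.
    freq = Counter(term for words in barrages for term in words)
    vocab = {term for term, _ in freq.most_common(vocab_size)}
    # Document frequency: number of barrages that contain each kept term.
    doc_freq = Counter(
        term for words in barrages for term in set(words) if term in vocab
    )
    total = len(barrages)
    scores = []
    for words in barrages:
        counts = Counter(words)
        n_terms = len(words)  # number of all words in this barrage
        scores.append({
            term: (count / n_terms) * math.log(total / doc_freq[term])
            for term, count in counts.items()
            if term in vocab
        })
    return scores
```

A term that appears in every barrage gets a score of zero, so very common filler words contribute nothing to the feature vector.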
S202, constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
The obtained text features are represented as $\{x_1, x_2, x_3, \dots, x_n\}$. Let the $i$-th feature in the text features be $x_i$, the preset weight coefficient corresponding to $x_i$ be $\theta_i$, and the characteristic value of the user on the current day be $f(x)$. The logistic regression model may then be:
$$f(x) = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} \theta_i x_i + b\right)}}$$
where $e$ is the natural constant and $b$ is a natural number.
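As a hedged illustration of this step, the sketch below evaluates a standard logistic (sigmoid) function on the weighted sum of the features plus the constant b, which is how the text describes the model. The function name `daily_feature_value` is hypothetical, and the two-branch form is just a common way to avoid floating-point overflow for large negative inputs.

```python
import math


def daily_feature_value(x, theta, b):
    """Sigmoid of the weighted feature sum; the result lies between 0 and 1."""
    z = sum(t * xi for t, xi in zip(theta, x)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form that cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

The output is a value between 0 and 1; any scaling to a point score such as the ones quoted in the text would be a separate step that the source does not spell out.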
In addition, the model applies a sigmoid function to the output of a linear regression, so that the final output value of the model is a continuous value between 0 and 1. Using maximum likelihood estimation, the loss function for training the model is defined as the likelihood of the model, and the logarithm is taken to make differentiation easier:
$$L(\theta) = \sum_{i} \left[ y_i \log f(x_i) + (1 - y_i) \log\left(1 - f(x_i)\right) \right]$$
wherein $f(x_i)$ represents the model output for user $x_i$, and $y_i$ represents the true label of user $x_i$. The model is usually solved by gradient ascent, which can be summarized as follows: substitute $f(x)$ into the loss function, differentiate the loss function, take a step in the direction of the ascending gradient to update the parameters, and repeat until convergence. The gradient ascent algorithm is well known to those skilled in the art and is not described in detail here.
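The gradient ascent procedure summarized above can be sketched as follows. This is a minimal batch version under stated assumptions: the log-likelihood is the standard one for logistic regression, whose gradient with respect to each weight is the sum of (label minus prediction) times the feature, and the names `train_logistic_regression` and `predict`, the learning rate and the epoch count are all illustrative choices rather than values from the source.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(x, theta, b):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)) + b)


def train_logistic_regression(samples, labels, lr=0.1, epochs=1000):
    """Maximise the log-likelihood by batch gradient ascent."""
    n_features = len(samples[0])
    theta = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_theta = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = y - predict(x, theta, b)  # d(log-likelihood)/dz for this sample
            for i, xi in enumerate(x):
                grad_theta[i] += err * xi
            grad_b += err
        # Step in the direction of the ascending gradient (averaged over samples).
        theta = [t + lr * g / len(samples) for t, g in zip(theta, grad_theta)]
        b += lr * grad_b / len(samples)
    return theta, b
```

Because the objective is a log-likelihood to be maximised, the update adds the gradient; minimising the negative log-likelihood by gradient descent would produce the same iterates.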
S203, acquiring historical barrage data of the user;
In an embodiment of the present invention, the user's historical barrage data includes the number of active days of the user, the set of all historical barrages of the user, the set of historical target barrages of the user, the set of all barrages of the user on the current day, the set of target barrages of the user on the current day, and the characteristic value of the user on the previous day.
In one embodiment of the present invention, the user's historical barrage data includes the number of active days of the user and the characteristic value of the user on the previous day.
In addition, the characteristic value of the user on the previous day is calculated in the same way as the characteristic value of the user on the current day described in steps S201 to S202 above, so the description is not repeated here.
S204, judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
In one embodiment of the present invention, the number of active days $N$ of the user, the set $T_{total}$ of all historical barrages of the user, the set $T_{spam}$ of historical target barrages of the user, the set $C_{total}$ of all barrages of the user on the current day, and the set $C_{spam}$ of target barrages of the user on the current day are obtained.
Let the comprehensive characteristic value of the user be $Y$, the characteristic value of the user on the current day be $f(x)$, and the characteristic value of the user on the previous day be denoted by its own symbol [symbol not preserved in the source text]. Then: [equation not preserved in the source text]
wherein $\sigma$ is a preset parameter, and $1 \le \sigma \le 1.5$. This formula fuses the user's historical comprehensive characteristic value with the user's characteristic value for the current day. The historical comprehensive characteristic value is scaled by a certain proportion [expression not preserved], and the characteristic value of the current day is then added into the comprehensive characteristic value with its own weight [expression not preserved]. The advantage is that the influence of the historical characteristic value is retained while the characteristic value of the current day is guaranteed to be reflected in the comprehensive characteristic value.
For the weights themselves, the specific content of the user's current-day barrages and historical barrages, $C_{total}$, $C_{spam}$, $T_{total}$ and $T_{spam}$, is characterized by the share of target barrages in the barrage content. One ratio-based term [expression not preserved] fuses the characteristic value of the current day and another [expression not preserved] fuses the historical characteristic value, wherein $|C_{spam}| / |C_{total}|$ represents the ratio of current-day target barrages to current-day barrages, $|T_{spam}| / |T_{total}|$ represents the ratio of historical target barrages to historical barrages, and $|T_{spam} + C_{spam}| / |T_{total} + C_{total}|$ represents the ratio of all target barrages (current-day plus historical) to all barrages (current-day plus historical). In addition, to allow more flexibility in actual use, the preset parameter $\sigma$ is added to the weights so that the model can be adjusted to the actual scenario.
Further, to enlarge the influence of the current-day characteristic value on the comprehensive characteristic value during fusion, the weight of the current-day contribution is set to [expression not preserved]. Correspondingly, to reduce the influence of the historical characteristic value on the comprehensive characteristic value during fusion, the weight of the historical contribution is set to [expression not preserved]. As a result, at least a one-day contribution of the current-day characteristic value always enters the comprehensive characteristic value. This effectively prevents the comprehensive characteristic value from solidifying as historical characteristic values accumulate, and the final comprehensive characteristic value, under the combined influence of the current-day characteristic value, more readily reflects the user's overall history.
The target barrages can be obtained by methods common to those skilled in the art. For example, keyword filtering can be performed with regular expressions, and the barrages that contain the keywords are taken as target barrages.
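One way to sketch this regular-expression keyword filter is shown below. The helper names `build_keyword_pattern` and `select_target_barrages` and the case-insensitive matching are assumptions for illustration; the source only states that keywords are filtered with regular expressions.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one alternation pattern that matches any of the keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # re.escape keeps characters such as '?' or '.' in keywords literal.
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def select_target_barrages(barrages, pattern):
    """Return the barrages that contain at least one keyword."""
    return [text for text in barrages if pattern.search(text)]
```

Escaping each keyword matters here because barrage keywords often contain punctuation that would otherwise be read as regex syntax.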
In addition, to prevent an overly large $N$ from shrinking the influence of the current-day characteristic value, $N$ is truncated and its maximum value is set to 30. At the same time, for some special users, such as accounts registered in batches, the current-day score should take effect more quickly, so the parameter $\sigma$ is added to strengthen the influence of the current-day score on the historical score. The parameter is generally restricted to $1 \le \sigma \le 1.5$; the larger the parameter, the greater the influence of the current-day score. The specific value of $\sigma$ is generally determined by data analysis in actual use.
For example, $\sigma$ is generally set smaller for newly registered users, because they have little historical barrage data and that data would easily be ignored if the current-day barrage data were weighted too heavily. For long-standing users, $\sigma$ is generally set larger, because their historical barrage data is rich and the current-day barrage data has little effect on it, so a larger $\sigma$ is used to strengthen the influence of the current-day barrage data. The specific value of $\sigma$ is set according to how long the user has been registered. For example, the registration duration is divided into six levels: level one is within one month, level two is one to three months, level three is three months to one year, level four is one to three years, level five is three to five years, and level six is more than five years. $\sigma$ is 1 for level-one users, 1.1 for level-two users, 1.2 for level-three users, 1.3 for level-four users, 1.4 for level-five users, and 1.5 for level-six users.
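The six-level mapping from registration duration to σ, together with the cap of 30 on N, can be sketched as a small lookup. The function names are hypothetical, durations are expressed in months for convenience, and the handling of exact boundary values (for instance, exactly one month falling into level two) is an assumption, since the source does not say which side each boundary belongs to.

```python
from bisect import bisect_right

# Upper bounds of levels one to five, in months: 1 month, 3 months, 1, 3 and 5 years.
_LEVEL_BOUNDS_MONTHS = [1, 3, 12, 36, 60]
_LEVEL_SIGMAS = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]


def sigma_for_registration(months_registered):
    """Return sigma for a user registered for the given number of months."""
    level_index = bisect_right(_LEVEL_BOUNDS_MONTHS, months_registered)
    return _LEVEL_SIGMAS[level_index]


def clip_active_days(active_days, max_days=30):
    """Cap the number of active days N at max_days."""
    return min(active_days, max_days)
```

Keeping the bounds and σ values in two parallel lists makes it easy to retune the levels after data analysis without touching the lookup logic.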
The following illustrates a specific calculation procedure:
Assume that a user's comprehensive feature value on the previous day is 60 points, the user's feature value on the current day is 90 points, and the number of active days is greater than 30. On the current day, the user sent 20 target barrages after deduplication, out of 30 barrages in total after deduplication. The user's historical target barrages total 2000 after deduplication, and the user's historical barrages total 5000 after deduplication. After the current-day barrages are added to the historical barrages, the deduplicated target barrages total 2010 and the deduplicated barrages overall total 5020. With σ = 1 and a preset threshold of 70 points, the user's comprehensive feature value on the current day is:
It can be seen that although the user's feature value on the current day reaches 90 points, the comprehensive feature value is lower because the user has historically been a normal user. Since the comprehensive feature value is below 70 points, the user is not considered a target user on the current day.
The algorithm uses the difference between the user's current-day target-barrage ratio and historical target-barrage ratio to strengthen the user's current-day feature value. If target barrages make up a higher proportion of the user's barrages on the current day, the current-day feature value receives a higher weight during fusion. On the other hand, catchphrase-like target barrages are easily sent repeatedly, so such barrages appear in large numbers and inflate the target-barrage ratio. Therefore, the target barrages and the historical barrages are counted after deduplication to obtain |C_spam|, |T_spam| and |T_spam + C_spam|.
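The deduplicated counting described above can be sketched with Python sets. This assumes that |T_spam + C_spam| denotes the size of the merged, deduplicated set, which matches the worked example (2000 historical and 20 current-day target barrages merging to 2010); the function name is hypothetical.

```python
def dedup_counts(history_target, today_target):
    """Return (|C_spam|, |T_spam|, |T_spam + C_spam|) after deduplication.

    Both arguments are iterables of barrage texts. Identical texts are
    counted once, so a repeatedly sent catchphrase does not inflate the counts.
    """
    t_spam = set(history_target)   # deduplicated historical target barrages
    c_spam = set(today_target)     # deduplicated current-day target barrages
    merged = t_spam | c_spam       # barrages present in either set, counted once
    return len(c_spam), len(t_spam), len(merged)


if __name__ == "__main__":
    history = ["buy now", "buy now", "add me"]
    today = ["add me", "free coins", "free coins"]
    print(dedup_counts(history, today))
```

Exact string equality is the simplest notion of a duplicate; a production system might normalize whitespace or punctuation first, which the text does not address.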
In an embodiment of the present invention, the number of active days N of the user is obtained. Let the user's comprehensive feature value be Y and the user's feature value on the current day be f(x), and let the user's feature value on the previous day be given. Then:
In this formula, the user's historical active days number N; adding the current day for the weighted fusion gives N + 1, following the idea of an exponential average. Each of the N historical days contributes equally to the comprehensive feature value, namely with weight 1/(N+1), and summing the contributions of the N historical days gives a weight of N/(N+1). However, to make the influence of the current-day feature value on the final result prominent, the weight of the current-day feature value is doubled to 2/(N+1), so that the current-day feature value carries twice the weight of a single historical day. Correspondingly, the weight of the summed historical contributions is reduced to (N−1)/(N+1). This processing enlarges the influence of the current-day feature value in the fusion process, while the exponential-average idea lets the final comprehensive feature value reflect the user's overall historical information. Without departing from the inventive concept of the formula, the weight of the current-day feature value may instead be expanded to 3, 4, 5 or another number, with the weight of the summed historical contributions correspondingly reduced to (N−2)/(N+1), (N−3)/(N+1) and so on; this is not specifically limited.
Further, in the above formula, the coefficients of f(x) and of the previous-day feature value sum to the denominator, which ensures that the calculated comprehensive feature value always lies in [0, 1]. This makes it convenient to filter users against a given threshold in practice, without having to account for threshold drift caused by changes in the distribution after many iterations.
Furthermore, to keep the influence of the current-day feature value from becoming too small when N is large, N is truncated and its maximum value is set to 30. Combining the historical barrage data published by the user to identify the user comprehensively improves the accuracy of target-user identification and prevents normal users from being unable to post barrages through the barrage function.
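The fusion of the current-day and previous-day feature values can be sketched in Python. This is one reading of the description, not the patent's literal formula: it assumes a current-day weight of 2/(N+1) and a history weight of (N−1)/(N+1), so that the two coefficients sum to the denominator, and it applies the cap N ≤ 30. The function names, the lower clamp on N, and the example threshold of 0.7 are illustrative assumptions.

```python
MAX_ACTIVE_DAYS = 30  # cap on N stated in the text


def fuse_feature_value(today_value, previous_value, active_days):
    """Blend the current-day and previous-day feature values.

    Assumed form: Y = (2 * f(x) + (N - 1) * Y_prev) / (N + 1),
    with N capped at 30. N is also clamped to at least 1 (an assumption)
    so that the history weight never becomes negative.
    """
    n = max(1, min(active_days, MAX_ACTIVE_DAYS))
    return (2 * today_value + (n - 1) * previous_value) / (n + 1)


def is_target_user(fused_value, threshold):
    """A user is a target user when the fused value exceeds the threshold."""
    return fused_value > threshold


if __name__ == "__main__":
    # Illustrative values on a [0, 1] scale.
    y = fuse_feature_value(today_value=0.9, previous_value=0.6, active_days=45)
    print(round(y, 4), is_target_user(y, threshold=0.7))
```

Because both inputs lie in [0, 1] and the weights are non-negative and sum to 1, the fused value also stays in [0, 1], which is what allows a fixed threshold to be reused from day to day.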
For details, refer to the related description of the embodiment shown in fig. 1, which is not repeated here.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a target user identifying apparatus according to an embodiment of the present invention, the apparatus may be built into an electronic device, and the apparatus mainly includes:
a first acquisition module 301, a construction module 302, a second acquisition module 303 and a judgment module 304;
the first obtaining module 301 is configured to obtain text features of all the barrages released by the user on the same day;
On a live-streaming website, a user communicates with the host or with other users by publishing barrages. The content features of the barrages a user publishes can indicate whether the user is a normal user or a potential target user. In this embodiment of the present invention, the content features of the barrages mainly include punctuation features, expression features, negative-word features, and the TF-IDF values of the terms in all barrages.
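The TF-IDF values mentioned above can be computed as in the following sketch. It assumes the standard definition (term frequency within a barrage multiplied by the logarithm of the total number of barrages over the number of barrages containing the term), a natural logarithm, and barrages that have already been split into terms; word segmentation and the function name are outside what the text specifies.

```python
import math
from collections import Counter


def tf_idf(barrages):
    """Compute TF-IDF per term for each barrage.

    `barrages` is a list of token lists. Returns one dict per barrage
    mapping term -> TF-IDF value, where
      tf  = count of the term in the barrage / number of terms in the barrage
      idf = ln(total barrages / barrages containing the term)
    """
    total = len(barrages)
    doc_freq = Counter()
    for tokens in barrages:
        doc_freq.update(set(tokens))  # count each term once per barrage

    results = []
    for tokens in barrages:
        counts = Counter(tokens)
        length = len(tokens)
        results.append({
            term: (count / length) * math.log(total / doc_freq[term])
            for term, count in counts.items()
        })
    return results


if __name__ == "__main__":
    docs = [["nice", "play"], ["nice", "coins", "coins"]]
    for row in tf_idf(docs):
        print(row)
```

A term that appears in every barrage gets an IDF of zero under this definition, so very common words contribute nothing to the feature vector.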
The construction module 302 is configured to construct a logistic regression model according to the text feature, so as to obtain a feature value of the user on the same day;
Denote the obtained text features as {x_1, x_2, x_3, …, x_n}. Let the i-th text feature be x_i, let the preset weight coefficient corresponding to x_i be θ_i, and let the user's feature value on the current day be f(x). The logistic regression model may be:
where e is the natural constant and b is a natural number.
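The logistic regression model can be sketched in Python as follows. The sketch assumes the standard sigmoid form, f(x) = 1 / (1 + e^(−(Σ θ_i·x_i + b))), which is consistent with the symbols defined above but is not reproduced from the source; the function name and the sample weights are hypothetical.

```python
import math


def daily_feature_value(features, weights, bias):
    """Return the logistic (sigmoid) score of a weighted feature sum.

    Assumed form: f(x) = 1 / (1 + exp(-(sum(theta_i * x_i) + b))).
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(theta * x for theta, x in zip(weights, features)) + bias
    # Numerically stable evaluation that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


if __name__ == "__main__":
    # Hypothetical feature vector and preset weights.
    print(daily_feature_value([1.0, 2.0], [0.5, -0.25], 0.1))
```

The output always lies between 0 and 1, so it can be compared with, or fused into, other values on the same scale.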
A second obtaining module 303, configured to obtain historical barrage data of the user;
In an embodiment of the present invention, the user's historical barrage data includes the user's number of active days, the set of all of the user's historical barrages, the set of the user's historical target barrages, the set of all of the user's barrages on the current day, the set of the user's target barrages on the current day, and the user's feature value on the previous day.
In another embodiment of the present invention, the user's historical barrage data includes the user's number of active days and the user's feature value on the previous day.
And the judging module 304 is configured to judge whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data.
In an embodiment of the present invention, the user's historical barrage data includes the user's number of active days, the set of all of the user's historical barrages, the set of the user's historical target barrages, the set of all of the user's barrages on the current day, the set of the user's target barrages on the current day, and the user's feature value on the previous day.
In another embodiment of the present invention, the user's historical barrage data includes the user's number of active days and the user's feature value on the previous day.
Furthermore, the user's feature value on the previous day is calculated in the same way as the user's comprehensive feature value on the current day, as described in steps S201 to S202 above, so the description is not repeated here.
In one embodiment of the present invention, the number of active days N of the user, the set T_total of all of the user's historical barrages, the set T_spam of the user's historical target barrages, the set C_total of all of the user's barrages on the current day, and the set C_spam of the user's target barrages on the current day are obtained.
Let the user's comprehensive feature value be Y and the user's feature value on the current day be f(x), and let the user's feature value on the previous day be given. Then:
where σ is a preset parameter and 1 ≤ σ ≤ 1.5.
Furthermore, to keep the influence of the current-day feature value from becoming too small when N is large, N is truncated and its maximum value is set to 30. At the same time, for some special users, such as batch-registered users, the current-day score should take effect more quickly, so the parameter σ is added to strengthen the influence of the current-day score on the historical score. The parameter σ is generally limited to 1 ≤ σ ≤ 1.5; the larger σ is, the greater the influence of the current-day score. In practice, the specific value of σ is generally determined through data analysis.
The algorithm uses the difference between the user's current-day target-barrage ratio and historical target-barrage ratio to strengthen the user's current-day feature value. If target barrages make up a higher proportion of the user's barrages on the current day, the current-day feature value receives a higher weight during fusion. On the other hand, catchphrase-like target barrages are easily sent repeatedly, so such barrages appear in large numbers and inflate the target-barrage ratio. Therefore, the target barrages and the historical barrages are counted after deduplication to obtain |C_spam|, |T_spam| and |T_spam + C_spam|.
In an embodiment of the present invention, the number of active days N of the user is obtained. Let the user's comprehensive feature value be Y and the user's feature value on the current day be f(x), and let the user's feature value on the previous day be given. Then:
Furthermore, to keep the influence of the current-day feature value from becoming too small when N is large, N is truncated and its maximum value is set to 30. The algorithm is simple, and it combines the historical barrage data published by the user to identify the user comprehensively, which improves the accuracy of target-user identification and prevents normal users from being unable to post barrages through the barrage function.
For details, refer to the related descriptions of the embodiments shown in fig. 1 to 2, which are not repeated here.
Referring to fig. 4, fig. 4 shows a hardware configuration diagram of an electronic device.
The electronic device described in the present embodiment includes:
a memory 41, a processor 42, and a computer program stored in the memory 41 and executable on the processor 42; when the processor 42 executes the program, it implements the target user identification method described in the embodiments shown in fig. 1 to 2.
Further, the electronic device further includes:
at least one input device 43; at least one output device 44.
The memory 41, the processor 42, the input device 43 and the output device 44 are connected by a bus 45.
The input device 43 may be a camera, a touch panel, a physical button, a mouse, or the like. The output device 44 may be in particular a display screen.
The memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 41 is used to store a set of executable program code, and the processor 42 is coupled to the memory 41.
Further, an embodiment of the present invention also provides a computer-readable storage medium, which may be provided in the terminal of each of the above embodiments and may be the memory in the embodiment shown in fig. 4. A computer program is stored on the computer-readable storage medium, and when executed by a processor, the program implements the target user identification method described in the embodiments shown in fig. 1 to 2. Further, the computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the various embodiments provided herein, it should be understood that the disclosed apparatus and methods may be implemented in other ways. The embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present invention.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The foregoing describes the target user identification method, apparatus, electronic device and storage medium provided by the present invention. Those skilled in the art may, following the concepts of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A method for identifying a target user, comprising:
acquiring text characteristics of all barrages released by a user in the same day;
constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
acquiring historical barrage data of the user;
judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data; the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data comprises:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user;
the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises the following steps:
acquiring the number N of active days of the user, the set T_total of all barrages in the user's history, the set T_spam of target barrages in the user's history, the set C_total of all barrages of the user on the current day, and the set C_spam of target barrages of the user on the current day;
Let the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x); then:
wherein σ is a preset parameter, and 1 ≤ σ ≤ 1.5.
2. The method of claim 1, wherein calculating the integrated feature value of the user based on the feature value of the user on the current day and the historical barrage data comprises:
acquiring the active days N of the user;
let the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x), and let the characteristic value of the user on the previous day be given; then:
3. The method of any one of claims 1 to 2, wherein the text features include punctuation features, expression features, negative word features, and TF-IDF values of the respective terms in all of the barrages.
4. The method for identifying a target user according to claim 3, wherein constructing a logistic regression model according to the text feature, and obtaining the feature value of the user on the current day comprises:
acquiring preset weight coefficients corresponding to all the characteristics in the text characteristics;
let the i-th feature in the text features be x_i, the preset weight coefficient corresponding to x_i be θ_i, and the characteristic value of the user on the current day be f(x); then:
where e is a natural constant and b is a natural number.
5. The method of claim 1, wherein when the text feature includes TF-IDF values of respective terms in all of the barrages, the obtaining text features of all of the barrages issued by the user on the same day comprises:
let the TF-IDF value of the i-th term in the j-th barrage be TF-IDF_{i,j}; then:
wherein n_{i,j} represents the number of times the i-th term appears in the j-th barrage, Σ_k n_{k,j} represents the total number of terms in the j-th barrage, |D| represents the total number of barrages, and |{j : t_i ∈ d_j}| represents the number of barrages, among all barrages, that contain the i-th term.
6. A target user identification apparatus, comprising:
the first acquisition module is used for acquiring text characteristics of all the barrages released by the user in the same day;
the construction module is used for constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
the second acquisition module is used for acquiring the historical barrage data of the user;
the judging module is used for judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data; the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data comprises:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user;
the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises the following steps:
acquiring the number N of active days of the user, the set T_total of all barrages in the user's history, the set T_spam of target barrages in the user's history, the set C_total of all barrages of the user on the current day, and the set C_spam of target barrages of the user on the current day;
Let the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x); then:
wherein σ is a preset parameter, and 1 ≤ σ ≤ 1.5.
7. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the target user identification method according to any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the target user identification method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910649162.0A CN110362831B (en) | 2019-07-17 | 2019-07-17 | Target user identification method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910649162.0A CN110362831B (en) | 2019-07-17 | 2019-07-17 | Target user identification method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110362831A (en) | 2019-10-22 |
CN110362831B (en) | 2024-02-23 |
Family
ID=68220752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910649162.0A Active CN110362831B (en) | 2019-07-17 | 2019-07-17 | Target user identification method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110362831B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107656918B (en) * | 2017-05-10 | 2019-07-05 | 平安科技(深圳)有限公司 | Obtain the method and device of target user |
CN108108912A (en) * | 2018-01-10 | 2018-06-01 | 百度在线网络技术(北京)有限公司 | Method of discrimination, device, server and the storage medium of interactive low quality user |
CN109766435A (en) * | 2018-11-06 | 2019-05-17 | 武汉斗鱼网络科技有限公司 | The recognition methods of barrage classification, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110362831A (en) | 2019-10-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20231227. Address after: Room 205, Room 206, Room J1447, No. 1045 Tianyuan Road, Tianhe District, Guangzhou City, Guangdong Province, 510000. Applicant after: Guangzhou Caimeng Technology Co.,Ltd. Address before: 430000 room 007, A301, third floor, building B1, software industry phase 4.1, No. 1, Software Park East Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province (Wuhan area of free trade zone). Applicant before: WUHAN DOUYU YULE NETWORK TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | ||